An Introduction to microhaplot data preparation

Thomas Ng and Eric C. Anderson


This vignette concerns itself with an overview of microhaplot and on preparation of data for input into microhaplot Once your data has been transferred to the appropriate Haplot folder, see the microhaplot walkthrough vignette to learn more about microhaplot’s graphical features.

How does microhaplot work?

microhaplot is a tool that makes it easy to extract, visualize, and interpret short-read haplotypes—or “microhaplotypes” in the parlance of some—from aligned sequences. Before we go further, it will be good to step back and confirm of what we are calling as a short-read haplotype, or a “microhaplotype” within the context of microhaplot.

What is a microhaplotype?

To define a microhaplotype you need to first define a reference point or position that a read might align to. Then, microhaplot can analyze the reads aligning at those positions, and extracts the variant sequence of the reads at those positions. For example, let’s say that we know SNPs occur at positions 19, 27, 68, and 101 of a particular reference sequence that we have aligned our reads to. If a read aligns to the ref and shows an A at position 19, a G at position 27, a C at position 68, and another C at position 101, then we say the microhaplotype on that read is AGCC.

microhaplot is a tool that

  1. automates the extraction of these haplotypes from aligned data (typically amplicon sequencing data)
  2. provides useful filtering and visualizing capabilities for understanding your microhaplotype data and for calling genotypes at microhaplotypes.

To understand what is involved in getting your data into microhaplot, it will be useful to have a brief introduction to what it does and how it operates. We will start by listing what it is not.

What kind of data can I use with microhaplot?

‘microhaplot’ is designed to work with short read sequencing data where there is a relatively small number sequencing start sites. It is emphatically not geared to whole-genome shotgun sequencing data in which you may have overlapping reads with different starting place. We use ‘microhaplot’ primarily with amplicon sequencing data in which a few hundred fragments (of about 100 - 150 bp each) are amplified by PCR in a few hundred individuals. The DNA from different individuals is bar-coded so that it may be demultiplexed later, and then these amplified fragments are sequenced (usually to fairly high read depth) on an Illumina platform.

Though we use ‘microhaplot’ for amplicon sequencing, it might also be useful for analyzing haplotype data from experiments that sequence captured or baited fragments, (e.g., baited DD-Rad, RAPTURE, myBaits, ExonCapture, etc.), and might even be applicable to standard ddRAD or RAD-PE experiments, however, as the number of loci increases and average read depths per individual-locus combination decreases, the unevenness of coverage can become a bigger problem, (and we have more reservations about allelic dropout/null alleles when using digestion-based protocols (ddRAD, Rapture, etc.) than we do with amplicon sequencing).

Why did we write ‘microhaplot’?

We tried to use existing tools to give us what we needed for our amplicon-based microhaplotypes.

After a few months of this, we realized that there is no software tailored for the relatively straightforward task of extracting and analyzing microhaplotypes. The other software programs have all been designed for different purposes, and, as a consequence, while they work great for identifying variants or performing de-novo assemblies, one might not expect them to work for the specialized task of extracting and analyzing microhaplotypes.

How, specifically, should I prepare my data?

The starting point for microhaplot requires:

  1. An indexed, SAM file of reads aligned to the reference sequences for each amplicon.
  2. A VCF file whose contents define the positions in each reference sequence from which you want to extract variants into microhaplotypes. Note that the individuals that appear in the VCF file need not be the same ones whose reads appear in the assemblies. The VCF is merely used as a way of defining positions to extract. Still, since the content of microhaplotype is based on the variant positions listed in the VCF file, it is crucial for the VCF file to only contain highly reliable variant sites.

We describe here the workflow that we use to get these two necessary files from the short-read amplicon sequencing data that come off our Illumina Mi-seq. This workflow can be tweaked and tailored for your own data. Each step is described in one of the following subsections.

Starting state

Out description starts at the point where we have copied all of our fastq.gz files from the sequencer into a directory called rawdata. Next to that directory we have made two more directories, flash and map, in which we will be creating files and doing work. We have named all the Illumina files so that a directory listing of rawdata looks like this:

In this case, there are 384 such sets of files. The number after the capital S in the filename gives the ID number of each individual fish.

Flashing the reads together

Our data are paired end reads of fragments that are short enough that the primary reads and the paired-end reads overlap. Rather than treat them as two separate reads, we combine them together into a single read using flash. Here is the command we use to cycle over all the files and flash them, run from the flash directory which is at the same level as the rawdata directory. (Note that the /flash/--% in the following code fragment is just the command prompt—it tells you which directory we are in…)

The output that flash spits out to stdout looks like this:

This has created a series of files named like S${i}.extendedFrags.fastq.gz in which the ${i} is replaced by a number between 1 and 384. These files are gzipped fastq files that hold the flashed reads.

Mapping/Aligning the Reads

The reference sequences we will use to align to are in a fasta file called gtseq15_loci.fasta. We put that in the map directory and then commence mapping the flashed reads as follows:

# index the reference. Output will be prefixed with "gtseq15_loci"
/map/--% bwa index -p gtseq15_loci -a is gtseq15_loci.fasta

# here is what comes by on stdout 
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.5a-r405
[main] CMD: bwa index -p gtseq15_loci -a is gtseq15_loci.fasta
[main] Real time: 0.061 sec; CPU: 0.007 sec

# map reads. for each input file, the output is named satro_flashed_s${i}_aln.sam
# Note the -R to insert appropriate read group header lines
/map/--% for i in {1..384}; do bwa mem -aM -v 3 -t 10 -R "@RG\tID:s${i}\tLB:amplicon\tPL:ILLUMINA\tSM:satro${i}" ./gtseq15_loci ../flash/S${i}.extendedFrags.fastq.gz  > ./satro_flashed_s${i}_aln.sam; done

# convert the SAMs to BAMs 
/map/--% for i in {1..384}; do samtools view -bhS satro_flashed_s${i}_aln.sam > satro_flashed_s${i}.bam; done

# Sort the BAMs
/map/--% for i in {1..384}; do samtools sort satro_flashed_s${i}.bam -o satro_flashed_s${i}_sorted.bam; done

# Index the BAMs
/map/--% for i in {1..384}; do samtools index satro_flashed_s${i}_sorted.bam; done

# Make a text-file list of the BAMs
/map/--% ls -1 *sorted.bam > bamlist.txt

# generate idx stats and put them in a directory called gtseq17_idx
/map/--% mkdir gtseq17_idx
/map/--% for i in {1..384}; do samtools idxstats satro_flashed_s${i}_sorted.bam > ./gtseq17_idx/s${i}_idxstats.txt; done

Variant detection

We are now set up to detect variants using freebayes. We show how we do that here, but emphasize that if you have already chosen a set of variants that form your microhaplotypes, you could just specify those with a pre-existing VCF file. You don’t need to do new variant detection if you don’t want to. (Of course, if it doesn’t take too long, you may as well do variant detection with every individual you have sequenced at the amplicons, so as to get as many variants as possible.)


After the above is done, the file satro384_noMNP_noComplex_noPriors.vcf can be filtered, if desired, to make sure that everything in it has a solid SNP. Then that file is used with the function prepHaplotFiles to extract haplotypes from the aligned reads in the map directory.
An example of an vcf file and SAM files extracted from an actual GT-seq rockfish data is available in the inst/extdata.