This is a summary document for processing and working with COI data from tree swallow nestlings and adults using fecal samples collected in the field. The wetlab procedures are not described here. A lot of this is data cleaning and checking with a few basic plots and analyses at the end. The figures produced here are pasted in from the separate R script, to find the actual code to produce the analyses check that main script in this same repository.

AMPtk Pipeline

All of the raw sequences for these samples were processed using the AMPtk pipeline in the command line (Palmer et al. 2018). We followed very closely the workflow described on the project website amptk.readthedocs.io/en/latest. Before running AMPtk you need to install the software in a conda environment, install USEARCH and download the COI database. Those steps are kind of a pain, but they are described in detail on the website and not repeated here.

A few notes to avoid headaches repeating this later:

  • It seems that currently there is no way to run AMPtk on Mac with OS Catalina because it can’t execute the 32bit USEARCH program. Use an older computer, virtual machine, or cloud computing.
  • The pipeline expects your sequences to be named: sampleID_whatever_filler_R1/F1.fastq.gz. The BRC sequences come back with the sample name in the middle. I haven’t yet figured out how to make AMPtk recognize that, so I’ve just been batch-renaming the files to get the sample name first.

Once AMPtk is installed. The entire pipeline that I’ve run to this point involves just three commands (will take several hours on a home laptop to run with samples in hundreds). Output files will be saved in the folder where you have set up your conda environment.

  1. Pre-processing sequence data. Takes a set prefix to add to all files and the forward and reverse primer sequences. Specify input folder where sequences are stored.

amptk illumina -i /directory/with/sequences -o trescoi -f GCHCCHGAYATRGCHTTYCC -r TCDGGRTGNCCRAARAAYCA

  1. Denoising with DADA2 (Callahan et al. 2016). This takes in the processed sequences from step one and applies the denoising algorithm that identifies ASVs and models sequencing error to arrive at a final list of ASVs for F/R reads. Forward and reverse reads are then merged together.

amptk dada2 -i tres.coi.demux.fg.gz --platform illumina -o trescoi

  1. Assigning taxonomy. This uses the default suggestions from AMPtk documentation, but there are many other options. It applies the ‘hybrid taxonomy algorithm’, which looks at matches to i) Global Alignment to the downloaded COI arthropod database, ii) UTAX classification, and iii) SINTAX classification. More info is on the website, but basically it retains the best hit that also has the most levels of taxonomic ranks. The arthropod COI database needs to be downloaded and set up for this to run.

amptk taxonomy -i trescoi.cluster.otu_table.txt -f trescoi.cluster.otus.fa -d COI

Those three commands produce a bunch of output files, but only three are needed to pull into R for subsequent analyses.

  • trescoi.cluster.otu_table.txt has samples in columns and OTUs in rows with reads in each cell
  • trescoi.cluster.taxonomy.txt has full taxonomic information for each OTU
  • trescoi.mapping_file.txt has one row for each sample to add metadata

Those three files are read into the R script in this repository as the basis of everything here with the addition of sample metadata. The additional AMPtk files are saved in a folder with the repository but are not used further.

NOTE: There are many other optional steps in AMPtk. For example, one set of commands works to try to remove barcode bleed over where sequences from one barcode/sample are accidentally attributed to other samples. There are different types of classifiers and options, etc. What I’ve done here is (I think) kind of the basic or bare-bones pipeline.

Moving to R

The first part of the main script deals with importing those three AMPtk files and merging them into a single phyloseq object (McMurdie and Holmes 2013). There is a fair amount of wrangling that happens in the script, but it’s really all just reformatting and merging things together into the format that phyloseq expects. It isn’t necessary to use phyloseq for a lot of these things, but once the data is merged into that object it is very easy to work with all of their functions for subsetting, filtering, merging, plotting etc. The one thing that is not in here at the moment is a phylogenetic tree for the relationships between the food items. phyloseq will accept a tree also and that would allow for things like calculating unifrac & Faith’s phylogenetic diversity or plotting diets on trees etc. We could add that later if desired. I have code in our microbiome processing scripts that builds a tree from the gene sequences but we would have to adapt it from there. There may also be something in AMPtk that would do this for us.

Sequencing Depth

One of the first things worth checking is how many sequences we have per sample. This is showing the number of sequences retained for each sample after merging, denoising, whatever other filtering dada2 did, and removing everything except for Arthropods. So many of these samples did have quite a few more reads in the initial return. There are two reasons to look at the read counts. First, to see if samples ‘worked’ for getting data. Most of what we’ve heard from other people using these techniques is that they expect 5-15% of their samples to just fail to return reads for some reason (no dna, bad extraction, bad amplification, not enough sample added to sequencer, etc). We have been really loading our MiSeq runs with the max samples and they are trying to pool all 384 at even DNA concentrations, so I think it’s not surprising that a fair number would fall out. The second reason to look at this is to decide if we might want to rarefy samples to an even depth of reads for some analyses. This could be especially important for comparing alpha diversity metrics where more samples might = more diversity (I say might because it isn’t clear to me that it will keep increasing the way it would with microbiome where you may not get complete sampling of the community). Anyway, here is the log total number of reads per sample for all samples combine (log scale makes it easier to see the variation on the low end, which is what we care about).


I added in a reference line showing where 150 reads would fall. That is a pretty low number to rarefy to, but even there we lose a bunch of samples that are very low. It’s probably worth some extra investigation into what the effect of rarefying to 150 is. It might be that in many of those samples all 150 reads are from the same single food item and in some of the higher read samples you still recover the full (or mostly full) community even with only 150 samples. One thing with this plot is that it does include negative controls and nestlings. Here is the same plot with them broken out.