14 Feb 2022

Q2-DADA2 + PacBio CCS Reads

The big push for today is to get @sixvable’s PacBio CCS read pull request for Q2-DADA2 ready. Steps to go include: finishing the paper for the addition of PacBio CCS read denoising to DADA2, making a decision of refactoring to improve DRYness vs leaving as is to avoid monkeying with existing code/improving maintainability (I am leaning towards a bit of refactoring that produces the best of both), creating a subset of the Zymo dataset(as recommended by Ben Callahan, the author of DADA2 and PacBio CCS functionality + the accompanying paper), running the dataset through R and QIIME2 to ensure the same output, and writing some smoke tests that check that QIIME2 is getting the same results as R does (DADA2 is already tested extensively in R).

Paper

This is a really accurate method, ~ 50% of 1.5 kilo-base 16S rRNA sequencing reads were completely error free! So accurate that the author encountered too little resolution in the reference datasets. You do have to be careful of chemistry RSII, pre-P6-C4, and SMRT Portal generated reads do not have the same accuracy. Also, almost no homopolymeric repeats in 16S.

The authors were able to detect differences to the strain level, as opposed to the genus or higher level, consistently.

Oxford Nanopore generates longer but less accurate reads.

Advantages and Disadvantages of Sequencing Techniques:

Short-read: Cheapest, per-sample depth allows for detection of rare community members.
Shotgun Metagenomics: Species and some variant resolution(reliant on suitable reference datasets), can provide information about functional genetic potential. With deep shotgun sequencing: some de novo assembly of community genomes?
Full Length: Combines the targeting of amplicon sequencing with the resolution achieved by shotgun approaches.

Could also be applied in other domains, such as generating complete oncogene sequences, instead of just detecting partial ones, identification of unknown patheogens, and possibly entire ~5kb 16S-ITS-23S region, which could encompass entire viral genomes, as well as a variety of other uses.

Producing a Subset of Data

Fornutely Benjamin Callahan produced this awesome Github repo of the analysis for the paper!

Retrive mock Zymo sequencing run from NCBI Accension PRJNA521754. The direct download for the raw reads from the sequencer is here.
copy data strip the .1 suffix.
generate manifest file
import:

qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path zymo_manifest.tsv \
--output-path zymo_raw.qza \
--input-format SingleEndFastqManifestPhred33V2

QIIME2 2022.2 Release Stuff

Working on pre-release activities w/Liz and Evan.

As a side note, if any tag-dev on busywork is red, there is a good chance that it is because it has already been run and it does not want to do it again.

Remember there are pre-built docs on ghost.

Ancillary Tidbits

Getting a local copy of a PR:

git fetch $REPO_THE_PR_IS_ON pull/$PR_NUMBER/head:$DESIRED_BRANCH_NAME
git checkout $DESIRED_BRANCH_NAME