Ok, the subsampling script that I had been plannin on working on yesterday did not really go anywhere then, though I did get the q2-dada2 expected data output setup. So here we go on the subsampling stuff. Here are some design points:
Inputs: a manifest of the fastq files and the desired number of reads from each.
create directory to store subsampled files.
read files from manifest and subsample them one at a time.
If any of the fastqs are unzipped, unzip them(maybe un-necessary if sk-bio
can read compressed files.
Open each file using skbio.io.read
reservoir sample each file to the target number of reads. This function should be a generator that takes the read file generator as input.
write back out to file in the subsampled directory.
compress new file(maybe handled by skbio?)
write entry in a subsampled manifest.