Insert size estimation and report interpretation #536

lmartinezgili · 2023-12-11T14:40:52Z

After running the following fastp command on my paired-end metagenomics data:

 fastp -i $f -I ${f/_1./_2.} \
    -o ./fastp_filtered/${bn} -O ./fastp_filtered/${bn/_1./_2.} \
    --detect_adapter_for_pe \
    --trim_poly_g \
    --qualified_quality_phred 20 \
    --length_required 100 \
    --overrepresentation_analysis \
    -h ./fastp_reports/$(basename ${bn} .fastq.gz).html \
    -j ./fastp_reports/$(basename ${bn} .fastq.gz).json \
    -w 8

the insert size distribution in the report appears to be cut off at its peak (271), and the report says that >98% of reads did not overlap.

This happens with all my samples. The rest of the parameters seem fine (except in a few samples where the run failed). Is it normal for a sample to have 98% non-overlapping reads, or is this a cause for concern? I have seen other people with similar questions, but I have not found clear answers or guidelines.

Why is there a fixed threshold around 270? Is there a way to plot the full distribution?

Would it be possible to include documentation for help with interpreting the insert size estimation?

Also, before fastp filtering, all my reads were exactly 151 bp; now a few sequences range from 100 to 151 bp, which triggers a warning in FastQC. Is this because some reads get trimmed (e.g. polyG, or for other reasons) and still meet the criteria for being kept?
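One way to check this is to tabulate the read lengths in the filtered output. The sketch below assumes a standard 4-line-per-record FASTQ file (no wrapped sequence lines); `read_length_hist` and the file name in the usage comment are illustrative names, not part of fastp. With `--length_required 100`, no length below 100 should appear in the table.

```shell
# Print "length<TAB>count" for every read length seen in a FASTQ stream on stdin.
# Assumes exactly 4 lines per record, so the sequence is line 2 of each group.
read_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len "\t" n[len] }' | sort -n
}

# Hypothetical usage on one gzipped fastp output file:
# gzip -dc ./fastp_filtered/sample_1.fastq.gz | read_length_hist
```

Running the same function on the raw input should show a single length (151), which makes it easy to see how many reads were shortened by trimming.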

Thanks in advance for your help!

bobojin46 · 2024-01-06T14:36:36Z

Hi, I was wondering whether you have solved this? I see the same high proportion in the insert size estimation in all my samples, too.

lmartinezgili · 2024-03-07T17:05:40Z

Hi @bobojin46

Sort of... I followed suggestions I found online to check the estimated insert size in the SAM output after aligning the reads.
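A minimal sketch of that check, assuming the aligner fills in the template length (TLEN, column 9 of each SAM record) for paired reads. `tlen_summary` is an illustrative helper name and `aligned.bam` is a hypothetical file. It keeps only records with a positive TLEN so that each pair is counted once, and it reports the lower median.

```shell
# Summarise template lengths (SAM column 9, TLEN) from SAM text on stdin.
# Header lines (starting with "@") are skipped; only TLEN > 0 is kept,
# which counts each pair once (the mate carries the negative value).
tlen_summary() {
    awk -F '\t' '!/^@/ && $9 > 0 { print $9 }' | sort -n |
    awk '{ v[NR] = $1; sum += $1 }
         END {
             if (NR == 0) { print "no pairs with positive TLEN"; exit }
             printf "n=%d min=%d median=%d max=%d mean=%.1f\n",
                    NR, v[1], v[int((NR + 1) / 2)], v[NR], sum / NR
         }'
}

# Hypothetical usage with samtools, which prints SAM records from a BAM file:
# samtools view aligned.bam | tlen_summary
```

Very large TLEN values from discordant or split alignments can inflate the mean, so the median is usually the more useful number here.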

My data seem to have larger insert sizes, ranging from 300 to 600 bp, so it makes sense that the mode of the distribution falls outside the 30-272 range that the fastp output imposes.
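The upper end of that range is consistent with simple arithmetic, assuming fastp estimates insert size from the overlap between the two mates and requires at least 30 overlapping bases (the documented default of `--overlap_len_require`). Under that assumption, the largest insert that can still be detected is the sum of the two read lengths minus the minimum overlap. `max_overlap_insert` is an illustrative helper, not a fastp function.

```shell
# Largest insert size detectable by mate overlap:
# read 1 length + read 2 length - minimum required overlap.
max_overlap_insert() {
    echo $(( $1 + $2 - $3 ))
}

# 151 bp reads from the original post, assumed default minimum overlap of 30 bp.
max_overlap_insert 151 151 30   # prints 272
```

Pairs from fragments longer than this cannot overlap enough to be measured, which would explain both the cut-off near 271 and the high fraction of non-overlapping reads for a library with 300 to 600 bp inserts.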

It would be nice to get some more information about this though.
