Insert size estimation and report interpretation #536

Open
lmartinezgili opened this issue Dec 11, 2023 · 2 comments

Comments

@lmartinezgili

After running the following fastp command on my paired-end metagenomics data:

    fastp -i "$f" -I "${f/_1./_2.}" \
        -o "./fastp_filtered/${bn}" -O "./fastp_filtered/${bn/_1./_2.}" \
        --detect_adapter_for_pe \
        --trim_poly_g \
        --qualified_quality_phred 20 \
        --length_required 100 \
        --overrepresentation_analysis \
        -h "./fastp_reports/$(basename "${bn}" .fastq.gz).html" \
        -j "./fastp_reports/$(basename "${bn}" .fastq.gz).json" \
        -w 8

the insert size distribution in the report appears to be cut off at a peak of 271, and the report says that >98% of reads did not overlap.

[image: insert size estimation plot from the fastp report]

This happens with all my samples. The rest of the parameters seem fine (except in a few samples where the run failed). Is it normal for a sample to have 98% non-overlapping reads, or is this a cause for concern? I have seen other people with similar questions, but I have not found clear answers or guidelines.

Why is there a fixed threshold around 270? Is there a way to plot the full distribution?

Would it be possible to include documentation for help with interpreting the insert size estimation?

Also, before fastp filtering all my reads were exactly 151 bp; now a few sequences range from 100 to 151 bp, which triggers a warning in FastQC. Is this because some reads get trimmed (e.g. polyG or for other reasons) and still pass the filters for being kept?
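A quick way to see how many reads were shortened is to tabulate read lengths in the filtered output. The function below is a minimal sketch, assuming standard 4-line FASTQ records; the function name and the file name in the usage comment are hypothetical.

```shell
# Print "length<TAB>count" for every read length seen in a FASTQ stream on stdin.
# Assumes 4-line FASTQ records, with the sequence on line 2 of each record.
read_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' | sort -n
}

# Hypothetical usage on one gzipped fastp output file:
#   gzip -dc ./fastp_filtered/sample_1.fastq.gz | read_length_hist
```

With `--length_required 100`, every length in the output should be at least 100, and anything below 151 corresponds to a read that was trimmed but kept.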

Thanks in advance for your help!

@bobojin46

Hi, have you solved this? I have run into the same high proportion in the insert size estimation in all my samples, too.

@lmartinezgili
Author

Hi @bobojin46

Sort of... I followed suggestions I found online and checked the estimated insert size in the SAM output after aligning the sequences.
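For anyone wanting to repeat that check, here is a minimal sketch of pulling insert sizes out of SAM text. It relies only on the SAM format itself: column 2 is FLAG (bit 0x2 = properly paired, bit 0x40 = first read in pair) and column 9 is TLEN, the observed template length. The function name and `aln.bam` are hypothetical, and this is not necessarily the exact procedure used above.

```shell
# Print the absolute template length (SAM column 9, TLEN) once per properly
# paired fragment. Reads SAM text on stdin; "@" header lines are skipped.
template_lengths() {
    awk -F'\t' '
        /^@/ { next }
        {
            flag      = $2
            proper    = int(flag / 2) % 2      # bit 0x2: properly paired
            first     = int(flag / 64) % 2     # bit 0x40: first read in pair
            secondary = int(flag / 256) % 2    # bit 0x100: secondary alignment
            supp      = int(flag / 2048) % 2   # bit 0x800: supplementary alignment
            if (proper && first && !secondary && !supp && $9 != 0) {
                print ($9 < 0 ? -$9 : $9)
            }
        }'
}

# Hypothetical usage, assuming samtools is installed and aln.bam exists:
#   samtools view aln.bam | template_lengths | sort -n | uniq -c
```

Counting only the first read of each pair avoids listing every fragment twice, since both mates carry the same TLEN with opposite signs.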

My data seem to have larger insert sizes, ranging from 300 to 600 bp, so it makes sense that the mode of the distribution falls outside the 30-272 range imposed by the fastp output.
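The upper bound of 272 is consistent with an overlap-based estimate. Two reads of length L that overlap by k bases span 2L - k bases, so with 151 bp reads and a minimum overlap of 30 bp (the documented default of fastp's `--overlap_len_require`), the largest insert size that can be inferred is 2 x 151 - 30 = 272. That fastp derives its cutoff exactly this way is an assumption here, not something confirmed in this thread; the function name below is hypothetical.

```shell
# Largest insert size that overlap analysis can detect:
# 2 * read length - minimum required overlap.
max_overlap_insert() {
    local read_len=$1 min_overlap=$2
    echo $(( 2 * read_len - min_overlap ))
}

max_overlap_insert 151 30   # prints 272
```

Under that assumption, fragments of 300-600 bp would never produce overlapping mates, which would explain both the cut-off plot and the >98% non-overlapping reads.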

It would be nice to get some more information about this though.
