This is a Perl script that trims undesired “bad sequences”, such as poly-N or poly-A runs, from paired FASTQ files.

  • You can run it before tools such as Trimmomatic.
  • You define what pattern(s) of sequences are “bad”.
    • Use simple Perl regular expressions.
  • For each read, the longest contiguous “good sequence” is returned if it is not “too short”.
    • You define the length of the shortest accepted read.
  • Reads with no accepted contiguous “good sequences” are filtered out completely.
  • If only one read in a pair has an accepted “good sequence”, that one is included in the “unpaired reads” output file.
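
The rules above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names build_bad_regex, longest_good_segment, and route_pair are hypothetical. It also assumes that ties between equally long segments go to the earliest one, and that the quality string would be trimmed with the same offset and length as the sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: combine user-supplied pattern strings into one regex.
sub build_bad_regex {
    my @patterns = @_;
    my $joined = join '|', map { "(?:$_)" } @patterns;
    return qr/$joined/;
}

# Hypothetical helper: return (offset, length) of the longest stretch of $seq
# that lies between "bad" matches, or an empty list if that stretch is
# shorter than $min_len. Assumption: ties go to the earliest stretch.
sub longest_good_segment {
    my ($seq, $bad_re, $min_len) = @_;
    my ($best_start, $best_len) = (0, 0);
    my $pos = 0;    # start of the current candidate "good" stretch
    while ($seq =~ /$bad_re/g) {
        my ($bad_start, $bad_end) = ($-[0], $+[0]);
        next if $bad_end == $bad_start;    # ignore zero-length matches
        my $len = $bad_start - $pos;
        ($best_start, $best_len) = ($pos, $len) if $len > $best_len;
        $pos = $bad_end;
    }
    # The stretch after the last bad match is also a candidate.
    my $tail = length($seq) - $pos;
    ($best_start, $best_len) = ($pos, $tail) if $tail > $best_len;
    return $best_len >= $min_len ? ($best_start, $best_len) : ();
}

# Hypothetical helper: decide which output a read pair belongs to.
sub route_pair {
    my ($seq1, $seq2, $bad_re, $min_len) = @_;
    my @g1 = longest_good_segment($seq1, $bad_re, $min_len);
    my @g2 = longest_good_segment($seq2, $bad_re, $min_len);
    return 'paired'   if @g1 && @g2;    # both reads survive
    return 'unpaired' if @g1 || @g2;    # only one read survives
    return 'dropped';                   # neither read survives
}

# Example: a 25-base good stretch, a poly-N run, then a 6-base stretch.
my $bad_re = build_bad_regex('A{10,}', 'N{3,}');
my $read   = 'ACGTACGTACGTACGTACGTACGTA' . 'NNNN' . 'ACGTAC';
my ($start, $len) = longest_good_segment($read, $bad_re, 20);
print substr($read, $start, $len), "\n";    # prints the first 25 bases
```

In a full implementation, the same offset and length would presumably be applied to the quality line with substr so that sequence and quality stay aligned.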

Figure 1 shows an example where the length of the shortest accepted sequence is 25 bases.


Figure 1. Logic of the trimbadseq2 script.


Files are available on this GitHub page.

Download the Perl script:

Sample data

Download the sample FASTQ files (i_1.fq and i_2.fq) from this GitHub page.

Try running the script on these sample FASTQ files, then check the output files:

perl i_1.fq i_2.fq o_1.fq o_2.fq o_u.fq 20 "A{10,}" "N{3,}"
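
One way to check the output files is to count the reads in each and report the shortest read length. The sketch below is only an illustration: the helper name fastq_summary is hypothetical, and it assumes plain four-line FASTQ records (header, sequence, "+", quality).

```perl
use strict;
use warnings;

# Hypothetical helper: count reads and find the shortest read length
# in a FASTQ stream, assuming four lines per record.
sub fastq_summary {
    my ($fh) = @_;
    my ($count, $min_len) = (0, undef);
    while (defined(my $header = <$fh>)) {
        my $seq  = <$fh>;
        my $plus = <$fh>;
        my $qual = <$fh>;
        last unless defined $qual;    # stop at a truncated record
        chomp $seq;
        $count++;
        $min_len = length $seq
            if !defined $min_len || length($seq) < $min_len;
    }
    return ($count, $min_len);
}

# Summarize every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my ($count, $min_len) = fastq_summary($fh);
    close $fh;
    printf "%s: %d reads, shortest %s\n", $file, $count, $min_len // 'n/a';
}
```

Saved under a name of your choice and run with o_1.fq, o_2.fq, and o_u.fq as arguments, it prints one summary line per file. Assuming the 20 in the command above is the minimum accepted length, no file should report a shortest read below 20, and o_1.fq and o_2.fq should contain the same number of reads.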

Contact me for feedback!