Two strategies for trimming and filtering sequencing reads by base quality

Question posted by a user on 2022-12-29 02:41

1 answer

Answered by a user on 2023-11-07 02:41

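Besides adapter removal, sequencing reads are usually also trimmed by base quality. The goal is to remove the low-quality regions of a read (which tend to lie near the ends) so that the overall read quality stays high.

There are two ways to do this trimming. The first works on a running sum of the quality values after a set threshold has been subtracted from each one. The second is the sliding window method.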

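Taking 3' end trimming as the example, the first method subtracts the threshold from every quality value, accumulates these differences starting from the last base, and cuts the read at the position where the running sum is smallest. Assume the threshold is 10. The following example comes from Cutadapt:
Original quality values: 42, 40, 26, 27, 8, 7, 11, 4, 2, 3
After subtracting the threshold: 32, 30, 16, 17, -2, -3, 1, -6, -8, -7
Running sum from the 3' end: (70), (38), 8, -8, -25, -23, -20, -21, -15, -7
The running sum reaches its minimum at -25, so the read is cut there and only the bases with quality values 42, 40, 26, 27 are kept.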
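
The running-sum method above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cutadapt's actual code: the function name `quality_trim_3prime` and its parameters are made up for this answer, and the early stop once the running sum turns positive is my reading of why the first two sums in the example are shown in parentheses.

```python
def quality_trim_3prime(qualities, cutoff):
    """Return how many bases to keep, counted from the 5' end.

    Hypothetical helper for illustration only.
    qualities: list of integer Phred scores, one per base.
    cutoff: the quality threshold subtracted from each score.
    """
    running_sum = 0
    min_sum = 0
    keep = len(qualities)  # default: keep the whole read
    # Walk from the last base towards the first one.
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:
            # Assumption: once the sum is positive, no earlier
            # position can become the minimum that matters, so stop.
            break
        if running_sum < min_sum:
            min_sum = running_sum
            keep = i  # cut just before base i
    return keep


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
n = quality_trim_3prime(quals, 10)
print(quals[:n])
```

With the example values and a threshold of 10, the minimum of -25 is reached at the fifth base, so four bases are kept.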

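The sliding window method computes the mean base quality of each window. As soon as one window's mean quality falls below the threshold, the read is cut at the start of that window: the bases to the left of the window are kept and everything else is discarded, including the window itself. For example, fastp describes its -r/--cut_right option as follows:
"move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop."
So if the very first window of a read happens to be low quality, the entire read is discarded.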
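
The sliding window behaviour can be sketched the same way. This follows the quoted description literally and is not taken from fastp's source, which may handle details differently. The function name `cut_right`, its parameters, and the choice to leave reads shorter than one window untouched are all assumptions made for illustration.

```python
def cut_right(qualities, window_size, mean_quality):
    """Return how many bases to keep, counted from the 5' end.

    Hypothetical helper for illustration only.
    qualities: list of integer Phred scores, one per base.
    window_size: number of bases per window.
    mean_quality: a window whose mean is below this value fails.
    """
    n = len(qualities)
    if n < window_size:
        # Assumption: a read shorter than one window is left as is.
        return n
    # Slide the window one base at a time from front to tail.
    for start in range(n - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < mean_quality:
            # Drop this window and everything to its right.
            return start
    return n  # no failing window: keep the whole read


quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(quals[:cut_right(quals, 4, 20)])

# A read whose first window already fails keeps no bases at all.
print(cut_right([5, 6, 7, 8, 40, 40], 4, 20))
```

With a window of 4 and a mean-quality threshold of 20, the third window (26, 27, 8, 7; mean 17) is the first to fail, so only the first two bases are kept. In the second call the first window fails and 0 bases are kept, which is the whole-read case described above.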

[References]
Martin, Marcel. "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet.journal 17.1 (2011): 10-12.
Chen, Shifu, et al. "fastp: an ultra-fast all-in-one FASTQ preprocessor." Bioinformatics 34.17 (2018): i884-i890.