@(Literature reading notes)[MBV, mislabeling] MBV: a method to solve sample mislabeling and detect technical bias in large combined genotype and sequencing assay datasets

[toc]

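Motivation

The goal is to make sure that samples are matched correctly across data types: I have a whole-genome VCF file for one sample and a pile of BAM files, and I need to find which BAM belongs to the sample in that VCF. The method is called MBV, which stands for "match BAM to VCF".
If the VCF is whole-genome and the BAM comes from another assay (exome capture, RNA), it can also give a rough check of PCR amplification bias.
It may also reveal sample contamination.

My impression is that this paper got into Bioinformatics through connections; the method does not look difficult at all, yet it was published.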

Materials and methods

Input: a VCF containing the samples, and a BAM to check.
Output:

sampleID	HeGT	HoGT	bamHeGT	bamHoHT	matchHe	matchHo	percentageHe
HG00096	23764	61721	175	499	91	333	29
HG00097	26639	58846	193	481	93	317	23
HG00099	27672	57813	216	458	93	294	26
HG00100	28267	57218	243	431	107	281	24
HG00381	27339	58146	213	461	204	408	28
HG00106	26046	59439	190	484	90	317	30
HG00108	25408	60077	205	469	85	297	31

1. The sample ID in the VCF against which the sequence data has been matched
2. The number of missing genotypes for this sample
3. The total number of heterozygous genotypes examined
4. The total number of homozygous genotypes examined
5. The number of heterozygous genotypes considered for the matching, i.e. those that are covered by more than --filter-minimal-coverage 10 reads
6. The number of homozygous genotypes considered for the matching, i.e. those that are covered by more than --filter-minimal-coverage 10 reads
7. The number of heterozygous genotypes considered for the matching with fully matching sequence data
8. The number of homozygous genotypes considered for the matching with fully matching sequence data
9. The percentage of heterozygous genotypes considered for the matching with fully matching sequence data
10. The percentage of homozygous genotypes considered for the matching with fully matching sequence data
11. Dummy field

I do not quite understand what columns 9, 10, and 11 mean; the others are fairly easy to understand.
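The `--filter-minimal-coverage` value in the column descriptions is a command-line setting. As a rough sketch only: assuming MBV is run through the `mbv` mode of QTLtools, and that it takes `--bam`, `--vcf`, and `--out` options (recalled from the QTLtools documentation, so check them against the installed version), an invocation could be assembled as below. The helper `mbv_command` and all file names are hypothetical placeholders.

```python
import shlex


def mbv_command(bam, vcf, out, min_coverage=10):
    """Build an argument list for one MBV run (one BAM against one VCF).

    The flag names are assumptions to verify against the installed QTLtools.
    """
    return [
        "QTLtools", "mbv",
        "--bam", bam,
        "--vcf", vcf,
        "--filter-minimal-coverage", str(min_coverage),
        "--out", out,
    ]


# Hypothetical file names, for illustration only.
cmd = mbv_command("sample.bam", "genotypes.vcf.gz", "sample.mbv.txt")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string keeps file names with spaces safe if it is later passed to `subprocess.run`.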

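Dividing column 7 by column 5, and column 8 by column 6, gives an approximate match rate for heterozygous and homozygous genotypes. Plotting these two rates for every sample in the VCF gives a figure like this: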
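The ratio calculation can be sketched in a few lines of Python. This is a minimal illustration using the example rows from this note, not the tool's own code. It assumes that, in the 8-column example table, `bamHeGT` and `bamHoHT` are the covered genotype counts (columns 5 and 6 of the description list) and `matchHe` and `matchHo` are the matching counts (columns 7 and 8). The function names and the "highest sum wins" rule are my own choices.

```python
import csv
import io

# Rows copied from the example output in this note (tab-separated).
MBV_OUTPUT = (
    "sampleID\tHeGT\tHoGT\tbamHeGT\tbamHoHT\tmatchHe\tmatchHo\tpercentageHe\n"
    "HG00096\t23764\t61721\t175\t499\t91\t333\t29\n"
    "HG00097\t26639\t58846\t193\t481\t93\t317\t23\n"
    "HG00099\t27672\t57813\t216\t458\t93\t294\t26\n"
    "HG00100\t28267\t57218\t243\t431\t107\t281\t24\n"
    "HG00381\t27339\t58146\t213\t461\t204\t408\t28\n"
    "HG00106\t26046\t59439\t190\t484\t90\t317\t30\n"
    "HG00108\t25408\t60077\t205\t469\t85\t297\t31\n"
)


def match_rates(text):
    """Return {sampleID: (het_rate, hom_rate)} from MBV-style output."""
    rates = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        het = int(row["matchHe"]) / int(row["bamHeGT"])
        hom = int(row["matchHo"]) / int(row["bamHoHT"])
        rates[row["sampleID"]] = (het, hom)
    return rates


def best_match(rates):
    """Pick the sample whose het and hom rates are jointly highest."""
    return max(rates, key=lambda sample: sum(rates[sample]))


rates = match_rates(MBV_OUTPUT)
best = best_match(rates)
for sample, (het, hom) in sorted(rates.items()):
    flag = "  <- best match" if sample == best else ""
    print(f"{sample}\thet={het:.2f}\thom={hom:.2f}{flag}")
```

With these rows, HG00381 is the only sample where both rates are high, so a BAM carrying any other ID in this table would be a mislabeling suspect.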

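Green points in the figure are matches; red points are non-matches.

Mislabeling result: B shows the case where the sampleID is correct. C shows a case where a sampleID was probably swapped, which is why the green point (the result assumed to come from the BAM with that ID) ends up on the left.
Contamination result: the authors tested with known contaminants and obtained the result above.
PCR bias result:

increasing amplification bias leads to decreased heterozygous concordance with no change in homozygous concordance, as shown in the figure below: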
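One way to see why only heterozygous concordance drops is a toy model. It is my own simplification, not taken from the paper: assume a heterozygous site counts as matched only when both alleles appear in the reads, and that each read independently shows the reference allele with probability `ref_fraction` (0.5 means no bias). The chance of seeing both alleles in `depth` reads is then 1 - ref_fraction^depth - (1 - ref_fraction)^depth. A homozygous site has only one allele to amplify, so allelic bias cannot change what the reads show.

```python
def het_match_probability(ref_fraction, depth):
    """P(both alleles appear among `depth` reads) when each read shows the
    reference allele with probability `ref_fraction` (assumed independent)."""
    return 1.0 - ref_fraction ** depth - (1.0 - ref_fraction) ** depth


def hom_match_probability(depth):
    """With a single allele present and no sequencing errors, every read
    agrees with the genotype, whatever the amplification bias."""
    return 1.0


DEPTH = 10  # illustrative depth, same order as the minimal-coverage filter
for ref_fraction in (0.5, 0.7, 0.9, 0.99):
    het = het_match_probability(ref_fraction, DEPTH)
    hom = hom_match_probability(DEPTH)
    print(f"ref fraction {ref_fraction:.2f}: het {het:.3f}, hom {hom:.3f}")
```

As `ref_fraction` moves away from 0.5, the heterozygous value falls while the homozygous value stays at 1, which is the pattern the quoted sentence describes.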

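Summary

I think this could be used in commercial projects, especially health-screening projects: build a VCF from all previous samples, then run this check whenever a new batch is processed to see whether any new sample matches an earlier one. The check can also be run once more within the same batch. The time cost needs to be evaluated to decide whether this is reasonable.
In fact this applies to all commercial projects. When plotting, use one color for all previous batches, one color for the current batch, and one color for the sample's own ID, then see whether contamination and duplicate samples become visible.
I still do not fully understand the PCR bias part and need to look at it again.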
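The three-color plotting idea can be sketched as a small grouping step that runs before any plotting. Everything here is hypothetical: the sample IDs, the match rates, the color names, and the 0.9 threshold used to flag a possible duplicate are placeholders that would need tuning on real data.

```python
COLORS = {"own": "green", "same_batch": "orange", "previous_batches": "gray"}


def point_group(sample_id, own_id, current_batch):
    """Decide which color group a VCF sample belongs to for one BAM."""
    if sample_id == own_id:
        return "own"
    if sample_id in current_batch:
        return "same_batch"
    return "previous_batches"


def possible_duplicates(rates, own_id, threshold=0.9):
    """List samples other than the BAM's own ID whose het and hom match
    rates both reach the (hypothetical) threshold."""
    return sorted(
        sample
        for sample, (het, hom) in rates.items()
        if sample != own_id and het >= threshold and hom >= threshold
    )


# Hypothetical IDs and (het, hom) match rates, for illustration only.
rates = {
    "S001": (0.95, 0.97),
    "S002": (0.45, 0.66),
    "P017": (0.93, 0.96),
    "P020": (0.41, 0.63),
}
current_batch = {"S001", "S002"}

for sample in sorted(rates):
    group = point_group(sample, "S001", current_batch)
    print(sample, group, COLORS[group])
print("possible duplicates:", possible_duplicates(rates, "S001"))
```

In this made-up example the BAM labelled S001 also matches P017 from an earlier batch, which is the kind of repeat sample the check is meant to surface.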
