WO2022105629A1 - Method for screening snp sites for detecting contamination level of sample and method for detecting contamination level of sample - Google Patents

Publication number
WO2022105629A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
contamination
sites
marker
detecting
Prior art date
Application number
PCT/CN2021/129081
Other languages
French (fr)
Chinese (zh)
Inventor
柳焱
白健
王寅
屈紫薇
吴�琳
Original Assignee
福建和瑞基因科技有限公司
北京和瑞精准医疗器械科技有限公司
Priority date
Filing date
Publication date
Application filed by 福建和瑞基因科技有限公司, 北京和瑞精准医疗器械科技有限公司
Publication of WO2022105629A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for screening SNP sites for detecting the contamination level of a sample, and a method for detecting the contamination level of a sample, both relating to the technical field of biological sequencing. The screening method comprises: obtaining SNP sites that have a population mutation frequency of 30% to 70% in a target region as candidate marker sites; dividing the region between the starting site and the terminating site among the candidate marker sites present on a single chromosome into a plurality of selection regions, each selection region having a length of 0.7 to 1.3 Mb; and, if a selection region contains two or more candidate marker sites, selecting as the marker site the site that has an allele frequency of 40% to 60% in that selection region and whose genotypes best conform to Hardy-Weinberg equilibrium, and removing the other candidate marker sites in that region. The marker sites obtained by this screening method can be used to detect the contamination level of a sample, with relatively low detection cost and relatively high detection accuracy.

Description

A screening method for SNP sites for detecting the contamination level of a sample, and a method for detecting the contamination level of a sample
CROSS-REFERENCE TO RELATED APPLICATIONS
This disclosure claims priority to Chinese patent application No. 202011321699.3, titled "A screening method for SNP sites for detecting the contamination level of a sample, and a method for detecting the contamination level of a sample", filed with the China Patent Office on November 23, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the technical field of biological sequencing, and in particular to a screening method for SNP sites for detecting the contamination level of a sample and a method for detecting the contamination level of a sample.
BACKGROUND
Next-generation sequencing (NGS) technology has developed rapidly owing to its high throughput and low cost, and has been applied to genome sequencing of clinical tumor samples. However, constructing and sequencing libraries for multiple samples in parallel leads to data contamination between samples. Clinical tumor samples are usually analyzed with a tumor-control paired design, in which the control sample is used to filter out the germline mutations of the same individual that are present in the tumor sample. Contamination of a tumor sample typically causes a large number of germline mutations from the foreign contaminant to be misidentified as somatic mutations, so that the detected somatic mutations contain many false positives. In addition, foreign contamination lowers the purity of the tumor sample, which in turn reduces the sensitivity of somatic mutation detection. Accurately detecting the contamination level of clinical tumor samples is therefore an indispensable quality-control step.
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) cover a substantial number of population-polymorphic single nucleotide polymorphism (SNP) sites. Estimation of sample contamination levels based on prior population frequencies and on the wild-type and mutant information at SNP sites in the tumor and control samples of a pair has been applied successfully in WGS and WES. Representative existing methods are ConEst, Conpair, and VerifyBamID; all of them use Bayesian methods and rely on the marker sites covered by WGS/WES, of which there are many. VerifyBamID uses all covered SNP sites, whereas ConEst and Conpair use only the homozygous sites among them. Because copy number variation (CNV) is common in tumor samples, the variant allele fraction (VAF) at heterozygous sites in a tumor sample is shifted by CNV away from the VAF in the paired sample, so VerifyBamID may overestimate the contamination level of the sample.
Although WGS and WES help detect and understand the full landscape of tumor gene mutations, their high sequencing cost limits sequencing depth, and no effective panel-based method for predicting contamination from marker sites has been reported. As a necessary quality-control module for tumor genetic testing, contamination level prediction is an important guarantee of the reliability of genetic test results for tumor samples.
It is therefore necessary and urgent to develop a marker site screening and contamination level prediction method applicable to different panels. If all markers within the WES range were simply added to a large panel, then, taking the Conpair algorithm as an example, 7387 sites within the WES range would have to be covered; with a 120 bp probe designed for each marker, at least 0.886 Mb of additional interval would need to be covered, which would greatly increase panel size and cost. Moreover, the Conpair algorithm performs poorly when there are too few marker sites. The contamination prediction algorithm of the present disclosure makes it possible to predict the contamination level of a sample accurately using only the marker sites already covered by the original design of a large panel.
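The panel-size figure quoted above follows from simple arithmetic. The sketch below reproduces it under the assumption that each marker receives exactly one non-overlapping 120 bp probe; the function and constant names are illustrative and do not come from the source.

```python
# Worked arithmetic for the panel-size estimate quoted in the text.
CONPAIR_WES_MARKERS = 7387   # marker count stated in the text
PROBE_LENGTH_BP = 120        # probe length per marker stated in the text


def extra_panel_size_mb(n_markers: int, probe_bp: int) -> float:
    """Additional covered interval in Mb, assuming one non-overlapping probe per marker."""
    return n_markers * probe_bp / 1_000_000


extra_mb = extra_panel_size_mb(CONPAIR_WES_MARKERS, PROBE_LENGTH_BP)
print(f"{extra_mb:.3f} Mb")  # 7387 * 120 bp = 886,440 bp
```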
SUMMARY OF THE INVENTION
The present disclosure provides a screening method for SNP sites for detecting the contamination level of a sample, comprising: obtaining SNP sites with a population mutation frequency of 30% to 70% in a target region as candidate marker sites;
dividing the region between the starting site and the terminating site among the candidate marker sites present on a single chromosome into a plurality of selection regions, each selection region having a length of 0.7 to 1.3 Mb; and
if a selection region contains two or more candidate marker sites, selecting as the marker site the site that has an allele frequency of 40% to 60% in that selection region and whose genotypes best conform to Hardy-Weinberg equilibrium, and removing the other candidate marker sites in that selection region.
In some embodiments, the site whose genotypes best conform to Hardy-Weinberg equilibrium is the site with the smallest site disequilibrium coefficient U in the selection region, where U is calculated as follows:
$U = (S_0 - 0.25)^2 + (S_1 - 0.5)^2 + (S_2 - 0.25)^2$, where $S_0$, $S_1$, and $S_2$ are the population frequencies of the wild-type, heterozygous mutant, and homozygous mutant genotypes, respectively, in the target region.
In some embodiments, when more than one site in a selection region best conforms to Hardy-Weinberg equilibrium, any one of them is selected.
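The disequilibrium coefficient U and the tie rule can be illustrated with a minimal sketch. The tuple layout and the function names below are assumptions made for illustration; the source specifies only the formula and that any one of several tied sites may be chosen.

```python
def disequilibrium_u(s0: float, s1: float, s2: float) -> float:
    """U = (S0 - 0.25)^2 + (S1 - 0.5)^2 + (S2 - 0.25)^2.

    s0, s1, s2 are the population frequencies of the wild-type,
    heterozygous mutant, and homozygous mutant genotypes.
    """
    return (s0 - 0.25) ** 2 + (s1 - 0.5) ** 2 + (s2 - 0.25) ** 2


def pick_marker(sites):
    """Return the ID of the site with the smallest U.

    `sites` is a list of (site_id, s0, s1, s2) tuples (hypothetical layout).
    min() keeps the first of several tied sites, which is one acceptable
    reading of "any one of them is selected".
    """
    return min(sites, key=lambda s: disequilibrium_u(s[1], s[2], s[3]))[0]


example_sites = [
    ("site_a", 0.40, 0.40, 0.20),  # U = 0.0225 + 0.01 + 0.0025
    ("site_b", 0.25, 0.50, 0.25),  # U = 0, exactly at the 1:2:1 expectation
]
print(pick_marker(example_sites))
```

A site with genotype frequencies of exactly 0.25, 0.5, and 0.25 has U = 0, so it is always preferred within its selection region.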
In some embodiments, the candidate SNP sites are SNP sites with a population mutation frequency of 30% to 70% determined from the genomes of N uncontaminated negative samples.
In some embodiments, N ≥ 50.
In some embodiments, N ≥ 100.
In some embodiments, the candidate SNP sites are sites that occur at a frequency of 40% to 60% in a gene database.
In some embodiments, the source of the sample to be tested is selected from tumors and genetic diseases.
In some embodiments, the type of the sample to be tested is selected from plasma, leukocyte, tissue, oral mucosa, saliva, cerebrospinal fluid, pleural effusion, ascites, and buccal swab samples.
The present disclosure also provides a method for detecting the contamination level of a sample, comprising: using contamination marker sites obtained by the screening method for SNP sites for detecting the contamination level of a sample described above.
In some embodiments, the method for detecting the contamination level of a sample comprises: taking, as homozygous marker sites, the contamination marker sites whose genotype is homozygous in the target region of the sample to be tested and/or the uncontaminated control sample; and performing a rank-sum test between the VAF differences of the sample to be tested and the VAF difference distribution of contaminated samples at each of several contamination levels, to obtain a P value for each contamination level;
wherein the VAF differences of the sample to be tested are the absolute values of the VAF differences between the sample to be tested and the uncontaminated negative control sample at the homozygous marker sites; and
the VAF difference distribution of the contaminated samples at each contamination level is the distribution of the absolute values of the VAF differences, at the homozygous marker sites, between contaminated samples spiked with a contamination source at that level and the uncontaminated negative control sample.
In some embodiments, the method for detecting the contamination level of a sample comprises: determining the maximum of the P values across the contamination levels, and taking the contamination level corresponding to the maximum P value as the contamination level of the sample to be tested.
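The detection steps above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the rank-sum P value is computed with a tie-corrected normal approximation (an assumption; the source does not say how the test is evaluated), and all function names and the dictionary layout are hypothetical.

```python
import math
from statistics import NormalDist


def abs_vaf_diffs(sample_vaf, control_vaf):
    """Absolute VAF differences at homozygous marker sites shared by both dicts."""
    return [abs(sample_vaf[s] - control_vaf[s]) for s in sample_vaf if s in control_vaf]


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average rank for a run of ties
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0                          # all values identical
    z = (u1 - mean_u) / math.sqrt(var_u)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def estimate_contamination(test_diffs, reference_diffs_by_level):
    """Return (level with the largest P value, all P values by level)."""
    p_values = {level: rank_sum_p(test_diffs, diffs)
                for level, diffs in reference_diffs_by_level.items()}
    return max(p_values, key=p_values.get), p_values


# Illustrative numbers only; they are not data from the source.
reference = {
    0.01: [0.0, 0.001, 0.002, 0.001, 0.0, 0.003],
    0.10: [0.09, 0.10, 0.11, 0.12, 0.08, 0.10],
}
test_diffs = [0.10, 0.09, 0.11, 0.10, 0.12, 0.08]
best_level, p_by_level = estimate_contamination(test_diffs, reference)
print(best_level)
```

The level with the largest P value is the one whose reference distribution is least distinguishable from the test sample's VAF differences, which is why the maximum rather than the minimum is taken.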
In some embodiments, the contamination proportions of the different contamination levels are selected from 0.01% to 99.9%.
In some embodiments, the contamination marker sites include at least one of (1) to (60) in the following table:
No.  rsID        Chromosome  Position   Reference base  Mutated base
1    rs2294714   chr1        6257826    T               C
2    rs7532151   chr1        89388944   A               C
3    rs1800880   chr1        156846120  C               T
4    rs11543979  chr1        201981218  C               G
5    rs1136410   chr1        226555302  A               G
6    rs1128919   chr2        148657117  G               A
7    rs1048108   chr2        215674224  G               A
8    rs7652776   chr3        2741024    G               C
9    rs6443222   chr3        9163267    C               T
10   rs2251219   chr3        52584787   T               C
11   rs3218651   chr3        121208176  T               C
12   rs3749234   chr3        176765269  T               C
13   rs17497475  chr4        19242239   C               A
14   rs1870377   chr4        55972974   T               A
15   rs3756122   chr4        143235865  C               G
16   rs2565007   chr5        53820267   C               A
17   rs13184586  chr5        161119125  C               G
18   rs56188706  chr5        180057356  C               G
19   rs2230653   chr6        26056604   G               A
20   rs4607417   chr6        41978274   C               T
21   rs2243384   chr6        117678083  A               G
22   rs2077647   chr6        152129077  T               C
23   rs62456182  chr7        6038722    T               C
24   rs2813838   chr7        24149292   C               G
25   rs2227983   chr7        55229255   G               A
26   rs41736     chr7        116435768  C               T
27   rs2267708   chr7        124392512  C               T
28   rs1293288   chr8        11718528   T               C
29   rs1805794   chr8        90990479   C               G
30   rs7839934   chr8        144941181  G               C
31   rs7026388   chr9        8518143    T               C
32   rs7023954   chr9        21816758   G               A
33   rs357564    chr9        98209594   G               A
34   rs17114803  chr10       104386934  T               C
35   rs9344      chr11       69462910   G               A
36   rs641936    chr11       94197260   A               G
37   rs543840    chr11       115764486  G               A
38   rs734075    chr12       4497047    C               A
39   rs7955902   chr12       40645257   C               A
40   rs2290103   chr12       78531156   A               G
41   rs2259820   chr12       121435342  C               T
42   rs9534262   chr13       32936646   T               C
43   rs4883918   chr13       73350079   T               C
44   rs17655     chr13       103528002  G               C
45   rs7328030   chr13       112269444  C               A
46   rs2231301   chr14       23777099   G               A
47   rs2241119   chr14       81558965   A               G
48   rs1130233   chr14       105239894  C               T
49   rs2305030   chr15       41805237   C               T
50   rs2229765   chr15       99478225   G               A
51   rs2230930   chr17       1782957    C               T
52   rs1799966   chr17       41223094   T               C
53   rs3744093   chr17       56492800   T               C
54   rs4647887   chr17       74558806   A               G
55   rs663651    chr18       42456653   G               A
56   rs2075607   chr19       1222012    G               C
57   rs2302603   chr19       17941294   T               C
58   rs8113496   chr19       29820529   A               G
59   rs2242522   chr19       36228954   G               T
60   rs9617050   chr22       50708731   A               G
In some embodiments, the contamination marker sites include at least 10 of (1) to (60) in the table above.
In some embodiments, the contamination marker sites include at least 20 of (1) to (60) in the table above.
In some embodiments, the contamination marker sites include at least one of 1) to 30) in the table below, or a combination of at least one of 1) to 30) in the table below with at least one of the contamination marker sites (1) to (60) above:
No.  rsID        Chromosome  Position   Reference base  Mutated base
1)   rs3219489   chr1        45797505   C               G
2)   rs2298258   chr1        162737116  C               G
3)   rs3747636   chr1        204403659  A               G
4)   rs2227982   chr2        242793433  G               A
5)   rs11466512  chr3        30713126   T               A
6)   rs2227931   chr3        142222284  A               G
7)   rs1350191   chr4        154807324  C               T
8)   rs2043112   chr5        38955796   G               A
9)   rs3733875   chr5        176637240  G               T
10)  rs915894    chr6        32190390   T               G
11)  rs1801270   chr6        36651971   C               A
12)  rs12055782  chr6        128312033  A               G
13)  rs9639168   chr7        13978809   T               C
14)  rs2074566   chr7        26224668   C               T
15)  rs1129293   chr7        106513011  C               T
16)  rs3808565   chr8        26227640   A               G
17)  rs4244612   chr8        145741702  C               G
18)  rs290223    chr9        93639846   C               G
19)  rs2071313   chr11       64572602   G               A
20)  rs974144    chr11       85968623   C               T
21)  rs11062385  chr12       427575     A               G
22)  rs3741622   chr12       49425978   T               C
23)  rs2075784   chr12       133263825  G               A
24)  rs1805097   chr13       110435231  C               T
25)  rs9549365   chr13       113907391  A               G
26)  rs2240308   chr17       63554591   G               A
27)  rs1567962   chr17       78919558   C               T
28)  rs1799817   chr19       7125297    G               A
29)  rs12461253  chr19       31769763   G               A
30)  rs2076578   chr22       41569609   C               T
The present disclosure also provides a kit for detecting the contamination level of a sample, comprising reagents for detecting target SNP sites, the target SNP sites being the marker sites obtained by the screening method for SNP sites for detecting the contamination level of a sample described above.
In some embodiments, the contamination marker sites include at least one of (1) to (60) in the table above.
In some embodiments, the contamination marker sites include at least 10 of (1) to (60) in the table above.
In some embodiments, the contamination marker sites include at least 20 of (1) to (60) in the table above.
The present disclosure provides a composition for detecting the contamination level of a sample, comprising reagents for detecting target SNP sites, the target SNP sites being the contamination marker sites obtained by any of the screening methods for SNP sites for detecting the contamination level of a sample described above.
The present disclosure provides a detection system comprising:
a storage device and a processing device;
wherein, when the processing device runs a program in the storage device, it executes any of the screening methods for SNP sites for detecting the contamination level of a sample described above, or any of the methods for detecting the contamination level of a sample described above.
The present disclosure also provides an electronic device comprising a memory and a processor, wherein, when the processor runs a computer program in the memory, it executes the screening method for SNP sites for detecting the contamination level of a sample, or the method for detecting the contamination level of a sample, as described in the preceding embodiments.
The present disclosure also provides use of the kit, the composition, the detection system, or the electronic device in tumor genetic testing.
In some embodiments, according to the screening method described above, an acquisition module is used to obtain SNP sites with a population mutation frequency of 30% to 70% in the target region as candidate marker sites;
a region division module is used to divide the region between the starting site and the terminating site among the candidate marker sites present on a single chromosome into a plurality of selection regions, each selection region having a length of 0.7 to 1.3 Mb; and
a selection module is used so that, if a selection region contains two or more candidate marker sites, the site that has an allele frequency of 40% to 60% in that selection region and whose genotypes best conform to Hardy-Weinberg equilibrium is selected as the marker site, and the other candidate marker sites in that selection region are removed.
DESCRIPTION OF DRAWINGS
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present disclosure and should therefore not be regarded as limiting its scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 is a flowchart of the method for detecting the contamination level of a sample in Example 3;
Fig. 2 shows the distribution of VAF differences at different contamination levels in Test Example 1;
Fig. 3 shows the distribution of the number of homozygous marker sites in Test Example 1;
Fig. 4 shows the detection performance of the method of Example 3 of the present disclosure in Test Example 1;
Fig. 5 shows the performance of the detection methods of Example 4 and Example 5 in Test Example 2.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below. The following embodiments are only examples of the present disclosure and do not represent or limit its scope of protection, which is defined by the claims. Where specific conditions are not indicated in the examples, conventional conditions or the conditions recommended by the manufacturer are used. Reagents or instruments whose manufacturer is not indicated are conventional, commercially available products.
The features and performance of the present disclosure are described in further detail below with reference to the embodiments.
Definition of Terms
As used herein, "SNP" refers to a single nucleotide polymorphism, that is, a DNA sequence polymorphism caused by variation of a single nucleotide at the genome level; it is the most common type of heritable variation in humans.
As used herein, "Hardy-Weinberg equilibrium" may refer to the ideal state in which the frequency of each allele remains stable from generation to generation, that is, genetic equilibrium is maintained.
As used herein, "wild type" may refer to the unmutated genotype; "heterozygous mutant" refers to a pair of alleles in which one allele is mutant and the other is wild type; and "homozygous mutant" may refer to a pair of alleles that both carry the mutation.
As used herein, "mutation abundance" (VAF, variant allele fraction, also called variant allele frequency) may refer to the proportion of reads carrying the mutation among all reads at a site during sequencing, and may be calculated as:
VAF = Allele Depth / Total Depth, where Allele Depth is the read depth supporting the mutant genotype at a given site in the genome, and Total Depth is the total read depth at that site.
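A minimal sketch of the VAF calculation follows. The function name and the error handling for zero depth are assumptions added for illustration; the source gives only the ratio itself.

```python
def vaf(allele_depth: int, total_depth: int) -> float:
    """Variant allele fraction: reads supporting the mutant allele / all reads at the site."""
    if total_depth <= 0:
        raise ValueError("total_depth must be positive")
    if not 0 <= allele_depth <= total_depth:
        raise ValueError("allele_depth must lie between 0 and total_depth")
    return allele_depth / total_depth


# 25 mutant-supporting reads out of 100 reads covering the site
print(vaf(25, 100))
```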
As used herein, the "rank-sum test", also called the Wilcoxon rank-sum test, is a nonparametric test, that is, a statistical method that does not depend on the form of the population distribution and does not make inferences about population parameters.
As used herein, "allele frequency" is defined as follows. If (a) a particular locus is present on a chromosome, (b) there is a gene at that locus, (c) the somatic cells of every individual in a population carry n copies of that locus (for example, two copies in the cells of a diploid organism), and (d) the gene has alleles or variants, then the allele frequency is the percentage of all copies of that locus in the population that carry the given allele.
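For the diploid case (n = 2), the allele frequency can be computed directly from genotype counts. The sketch below assumes a biallelic site and counts of wild-type, heterozygous, and homozygous mutant individuals; the function name is illustrative.

```python
def mutant_allele_frequency(n_wild: int, n_het: int, n_hom: int) -> float:
    """Mutant allele frequency at a biallelic site in a diploid population.

    Each heterozygote carries one mutant copy and each homozygous mutant
    carries two, out of 2 copies of the locus per individual.
    """
    n_individuals = n_wild + n_het + n_hom
    if n_individuals == 0:
        raise ValueError("at least one genotyped individual is required")
    return (n_het + 2 * n_hom) / (2 * n_individuals)


# 50 wild-type, 100 heterozygous, 50 homozygous mutant individuals
print(mutant_allele_frequency(50, 100, 50))
```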
As used herein, the "FASTQ format" refers to a text format that stores biological sequences (such as nucleic acid sequences) together with their corresponding quality scores.
Technical Solutions
First, an embodiment of the present disclosure provides a screening method for SNP sites for detecting the contamination level of a sample, which can be applied to an electronic device configured to perform the following steps: obtaining SNP sites with a population mutation frequency of 30% to 70% in a target region as candidate marker sites;
dividing the region between the starting site and the terminating site among the candidate marker sites present on a single chromosome into a plurality of selection regions, each selection region having a length of 0.7 to 1.3 Mb; and, if a selection region contains two or more candidate marker sites, selecting as the marker site the site that has an allele frequency of 40% to 60% in that selection region and whose genotypes best conform to Hardy-Weinberg equilibrium, and removing the other candidate marker sites in that selection region.
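The three screening steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the disclosed implementation: the `Site` record and its field names are hypothetical; regions are fixed 1 Mb windows (one value inside the stated 0.7 to 1.3 Mb range); a region with a single candidate keeps that candidate; and a region with several candidates but none in the 40% to 60% allele frequency range contributes no marker. The source does not specify the last two cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:                      # hypothetical record layout
    rsid: str
    chrom: str
    pos: int
    pop_freq: float              # population mutation frequency, 0..1
    allele_freq: float           # allele frequency in the region, 0..1
    s0: float                    # wild-type genotype frequency
    s1: float                    # heterozygous mutant genotype frequency
    s2: float                    # homozygous mutant genotype frequency


def disequilibrium_u(site: Site) -> float:
    return (site.s0 - 0.25) ** 2 + (site.s1 - 0.5) ** 2 + (site.s2 - 0.25) ** 2


def screen_markers(sites, region_len: int = 1_000_000):
    # Step 1: keep candidates with population mutation frequency of 30% to 70%.
    candidates = [s for s in sites if 0.30 <= s.pop_freq <= 0.70]
    by_chrom = {}
    for s in candidates:
        by_chrom.setdefault(s.chrom, []).append(s)

    markers = []
    for group in by_chrom.values():
        # Step 2: split the span from the first to the last candidate into windows.
        group.sort(key=lambda s: s.pos)
        start = group[0].pos
        regions = {}
        for s in group:
            regions.setdefault((s.pos - start) // region_len, []).append(s)
        # Step 3: within each window, keep one marker.
        for _, members in sorted(regions.items()):
            if len(members) == 1:
                markers.append(members[0])       # assumption: lone candidate is kept
                continue
            eligible = [m for m in members if 0.40 <= m.allele_freq <= 0.60]
            if eligible:
                markers.append(min(eligible, key=disequilibrium_u))
    return markers


demo = [
    Site("A", "chr1", 100, 0.50, 0.50, 0.25, 0.50, 0.25),
    Site("B", "chr1", 500_000, 0.45, 0.45, 0.40, 0.40, 0.20),
    Site("C", "chr1", 2_500_000, 0.60, 0.60, 0.20, 0.40, 0.40),
    Site("D", "chr1", 3_000_000, 0.20, 0.20, 0.64, 0.32, 0.04),
]
selected = [s.rsid for s in screen_markers(demo)]
print(selected)
```

Keeping at most one marker per window is what removes neighboring sites that may be in linkage, so that the retained markers are closer to independent.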
A series of studies found that, starting from the SNP candidate sites covered in the genomes of N uncontaminated negative samples, removing the sites whose genotypes deviate significantly from Hardy-Weinberg equilibrium, together with neighboring sites that may be linked, yields contamination marker sites that can be used to detect the contamination level of a sample.
It should be noted that "extracting SNP sites in the target region whose population mutation frequency is 30% to 70%" means extracting all or some of the SNP sites whose mutation frequency equals any point value, or falls within any sub-range, of the 30% to 70% range. For example, all or some of the sites with a mutation frequency of 40% to 60%, of 45% to 60%, or of 45% to 55% may be extracted; or all or some of the sites with a mutation frequency of 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65% or 70% may be extracted. In some embodiments, the allele frequency may be 40% to 60%, 45% to 60%, 40% to 55% or 45% to 55%, such as any one of 40%, 45%, 50%, 55% and 60%, optionally 50%.
"Target region" means any region to be characterized by the contamination level detection method of the present disclosure, including but not limited to a sequencing target region and the region of the genome covered by a panel.
Information on "SNP sites" and their mutation frequencies in the population can be obtained by sequencing or from public or commercial databases. The database is not specifically limited, and an existing genetic database may be used. In some embodiments, the database is at least one selected from gnomAD, 1000G, ExAC and popfreq_max_20150413.
Selecting candidate sites with a selection region length of 0.7 to 1.3 Mb has two benefits. On the one hand, it effectively avoids linked neighboring sites among the selected sites; on the other hand, it reduces the number of marker sites without affecting the detection result, so that detection is cheaper and more effective. In some embodiments, the length of the selection region may be 0.8 to 1.3 Mb, 0.7 to 1.2 Mb, 1.0 to 1.3 Mb or 0.8 to 1.0 Mb; for example, the region may be divided using any one of 0.7 Mb, 0.8 Mb, 0.9 Mb, 1.0 Mb, 1.1 Mb, 1.2 Mb and 1.3 Mb as the selection region length.
In some embodiments, sites at which the proportion of the heterozygous mutant genotype is 30% to 70% are selected as candidate marker sites. The heterozygous proportion may be 40% to 70%, 30% to 60% or 50% to 60%, such as any one of 31%, 35%, 40%, 45%, 50%, 55%, 60%, 65% and 70%. Sites with such heterozygous proportions have a more stable mutation frequency, and using them as candidate marker sites improves the validity of the subsequent contamination level prediction.
In some embodiments, the sample to be tested originates from a tumor or a genetic disease.
In some embodiments, the genetic diseases include, but are not limited to, chromosomal diseases, dominant genetic diseases, recessive genetic diseases, monogenic genetic diseases, polygenic genetic diseases and X-linked genetic diseases.
In some embodiments, the type of the sample to be tested is selected from plasma samples, leukocyte samples, tissue samples, oral mucosa samples, saliva samples, cerebrospinal fluid samples, pleural effusion samples, ascites samples and buccal swab samples.
In some typical embodiments, the SNP candidate sites are SNP sites whose population mutation frequency is 30% to 70% in the genomes of N uncontaminated negative samples. A negative sample here can be understood as a wild-type sample, that is, an uncontaminated negative control sample corresponding to the sample to be tested. For example, when the sample to be tested is a tumor sample, the negative sample is a corresponding uncontaminated sample from a healthy individual.
In some typical embodiments, N≥50; optionally, N≥100; optionally, N≥500. For example, N is 50 to 100, 50 to 500, 100 to 500, 200 to 500 or 300 to 500.
In some typical embodiments, when a selection region contains several sites that equally best conform to Hardy-Weinberg equilibrium, any one of them is selected, so as to reduce or avoid linked sites among the contamination marker sites.
In some typical embodiments, the site whose genotypes best conform to Hardy-Weinberg equilibrium is the site with the smallest site disequilibrium coefficient U within the selection region. U is calculated as follows:
$U=(S_0-0.25)^2+(S_1-0.5)^2+(S_2-0.25)^2$, where $S_0$, $S_1$ and $S_2$ are the population frequencies, in the target region, of the wild-type, heterozygous mutant and homozygous mutant genotypes, respectively. It should be noted that "wild type" refers to the genotype in which no mutation has occurred.
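The coefficient U can be computed directly from genotype counts, as in the minimal sketch below. The reference values 0.25, 0.5 and 0.25 are the Hardy-Weinberg genotype proportions for an allele frequency of 0.5, so U is zero for a site that is exactly balanced at that frequency. The function names are hypothetical.

```python
from typing import Tuple


def genotype_fractions(n_wild: int, n_het: int, n_hom: int) -> Tuple[float, float, float]:
    """Convert genotype counts into the fractions S0, S1, S2."""
    total = n_wild + n_het + n_hom
    if total == 0:
        raise ValueError("no genotyped samples")
    return n_wild / total, n_het / total, n_hom / total


def disequilibrium_u(s0: float, s1: float, s2: float) -> float:
    """U = (S0 - 0.25)^2 + (S1 - 0.5)^2 + (S2 - 0.25)^2."""
    return (s0 - 0.25) ** 2 + (s1 - 0.5) ** 2 + (s2 - 0.25) ** 2
```

For example, genotype fractions of 0.30, 0.40 and 0.30 give U = 0.0025 + 0.01 + 0.0025 = 0.015.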
Embodiments of the present disclosure also provide an electronic device comprising a memory and a processor, wherein the processor, when running a computer program stored in the memory, executes the method for screening SNP sites for detecting the contamination level of a sample according to any of the foregoing embodiments.
The electronic device may include a memory, a processor, a bus and a communication interface, which are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these components may be electrically connected through one or more buses or signal lines. The processor may process information and/or data related to target identification to perform one or more of the functions described in the present disclosure.
In some embodiments, the memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.
In some embodiments, the processor may be an integrated circuit chip with signal processing capability. In some embodiments, the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The components of the electronic device may be implemented in hardware, software or a combination thereof. In practice, the electronic device may be a server, a cloud platform, a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), a wearable electronic device, a virtual reality device or a similar device; the embodiments of the present disclosure therefore do not limit the type of electronic device.
Embodiments of the present disclosure also provide a method for detecting the contamination level of a sample, which comprises using the contamination marker sites obtained by the method for screening SNP sites for detecting the contamination level of a sample according to any of the foregoing embodiments.
By confirming and detecting the contamination marker sites obtained with the above screening method, the contamination level of a sample can be predicted. Compared with the prior art, fewer marker sites are detected, the cost is lower, and the predicted contamination level is more accurate.
In some embodiments, the method further comprises detecting the contamination marker sites. The manner of site detection is not limited in any way; for example, it may be carried out with any of primers, probes and chips. Any detection of the sample contamination level that is performed by detecting the above contamination marker sites falls within the scope of protection of the present disclosure.
In some typical embodiments, the method further comprises: taking, as homozygous marker sites, the contamination marker sites whose genotype is homozygous in the target region of the sample to be tested and/or of the uncontaminated control sample; and performing a rank-sum test between the mutation abundance (VAF) differences obtained for the sample to be tested and the VAF difference distribution of contaminated samples at each of several contamination levels, to obtain a P value for each contamination level.
Here, the VAF difference of the sample to be tested is the absolute value of the difference in VAF between the sample to be tested and the uncontaminated negative control sample at the homozygous marker sites. In some embodiments, the sample to be tested is a sample that may be contaminated or into which a contamination source (heterologous data) has been mixed; it may be a tumor sample, in which case its negative control sample is an uncontaminated sample from a healthy individual.
The VAF difference distribution of contaminated samples at different contamination levels is the distribution of the absolute VAF differences, at the homozygous marker sites, between contaminated samples spiked with contamination sources at different levels and the uncontaminated negative control sample.
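The VAF differences defined above can be computed as in the following minimal sketch. It assumes that the VAF of each site is already available as a dictionary keyed by site identifier; the function name and the data layout are hypothetical.

```python
from typing import Dict, Iterable, List


def vaf_abs_differences(
    sample_vaf: Dict[str, float],
    control_vaf: Dict[str, float],
    homozygous_sites: Iterable[str],
) -> List[float]:
    """Absolute VAF difference between a sample and its uncontaminated control.

    Only homozygous marker sites present in both dictionaries are used.
    """
    return [
        abs(sample_vaf[site] - control_vaf[site])
        for site in homozygous_sites
        if site in sample_vaf and site in control_vaf
    ]
```

Applying the same function to contaminated samples prepared at a known contamination level gives the reference distribution for that level.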
Specifically, "taking, as homozygous marker sites, the contamination marker sites whose genotype is homozygous in the target region of the sample to be tested and/or of the uncontaminated control sample" means: based on the genotype information of every contamination marker site present in the target region of the sample to be tested and/or of the uncontaminated control sample, the contamination marker sites whose genotype is wild type or homozygous mutant are picked out and used as homozygous marker sites. This genotype information can be obtained with existing genetic testing methods.
In some typical embodiments, the contamination marker sites whose genotype is homozygous in the target region of the uncontaminated control sample are used as the homozygous marker sites.
In some typical embodiments, the contamination marker sites whose genotype is homozygous in the target regions of the sample to be tested and of the uncontaminated control sample are used as the homozygous marker sites. That is, the contamination marker sites that are homozygous in the target region of the sample to be tested are merged with those that are homozygous in the target region of the uncontaminated control sample, and the merged set is used as the homozygous marker sites. The uncontaminated control sample may include, but is not limited to, a leukocyte sample.
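The two variants described above (control only, or the merged set from control and test sample) can be sketched as follows. Genotypes are assumed to be encoded VCF-style as "0/0" (wild type), "0/1" (heterozygous) and "1/1" (homozygous mutant); this encoding and the function names are assumptions made for illustration.

```python
from typing import Dict, Optional, Set

HOMOZYGOUS_CALLS = {"0/0", "1/1"}  # wild type and homozygous mutant


def homozygous_sites(genotypes: Dict[str, str]) -> Set[str]:
    """Sites whose genotype call is wild type or homozygous mutant."""
    return {site for site, call in genotypes.items() if call in HOMOZYGOUS_CALLS}


def select_homozygous_markers(
    control: Dict[str, str], test: Optional[Dict[str, str]] = None
) -> Set[str]:
    """Homozygous marker sites from the control, optionally merged with the test sample."""
    sites = homozygous_sites(control)
    if test is not None:
        sites |= homozygous_sites(test)  # merged set, as in the second variant
    return sites
```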
In some embodiments, the method further comprises constructing contaminated samples with different contamination levels: the samples are divided into samples to be contaminated and uncontaminated control samples, and contamination sources are added to the samples to be contaminated in different proportions to obtain contaminated samples with different contamination ratios (mass ratios). The samples to be contaminated and the control samples may be selected from leukocyte samples or existing cell line samples. The contamination source in a contaminated sample may be selected from leukocyte or cell line samples other than the sample to be contaminated.
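To see why the VAF at a homozygous site shifts with the contamination ratio, the sketch below computes the expected VAF of a mixture. It rests on a simplifying assumption that is not stated in the disclosure: the contamination source contributes reads in proportion to its mass fraction, with equal copy number and capture efficiency in both samples. The function name is hypothetical.

```python
def expected_mixed_vaf(host_vaf: float, source_vaf: float, fraction: float) -> float:
    """Expected VAF when a fraction of the DNA comes from the contamination source.

    Assumes reads are contributed in proportion to mass (an illustrative
    simplification, not a statement from the disclosure).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return (1.0 - fraction) * host_vaf + fraction * source_vaf
```

Under this assumption, a host that is wild type at a site (VAF 0) contaminated at 10% by a source that is heterozygous there (VAF 0.5) is expected to show a VAF of about 0.05.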
In some typical embodiments, the contamination ratios of the different contamination levels are selected from 0.01% to 99.9%. In some embodiments, the contamination ratios of the different contamination levels may be 0.01% to 95%, 1% to 99.9%, 10% to 90%, 30% to 70% or 40% to 60%; for example, a contamination level may be any one of 0.01%, 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% and 99.9%.
During detection, the contamination levels may span 0 to 1%, with one group of contaminated samples every 0.1% (written as "0 to 1%, every 0.1%"); they may span 1% to 10%, with one group of contaminated samples every 0.5%; and they may also span 10% to 100%, with one group of contaminated samples every 1%, and so on.
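The grid of contamination levels described above can be generated as in this minimal sketch. Values are in percent. Whether the end points 0% and 100% are included, and how the shared boundaries 1% and 10% are assigned, are assumptions made here for illustration.

```python
from typing import List


def contamination_levels() -> List[float]:
    """Contamination levels in percent: finer steps at low levels, coarser at high levels."""
    levels = [i / 10 for i in range(0, 10)]        # 0.0 to 0.9, every 0.1
    levels += [1 + i * 0.5 for i in range(0, 18)]  # 1.0 to 9.5, every 0.5
    levels += [float(p) for p in range(10, 101)]   # 10 to 100, every 1
    return levels
```

With these choices the grid contains 119 levels, each appearing once.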
In some embodiments, the statistical analysis is performed with the SPSS 10.0 statistical analysis software. All statistical tests are two-tailed.
In some typical embodiments, the method comprises: determining the maximum of the P values obtained at the different contamination levels, and taking the contamination level corresponding to that maximum as the contamination level of the sample to be tested. The P value reflects how well the VAF differences of the sample agree with the reference distribution; the larger the P value, the better the agreement.
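The decision rule above can be sketched in plain Python. The disclosure performs the rank-sum test in SPSS; the stand-in below uses a two-sided Wilcoxon rank-sum test with a normal approximation and tie correction, which may differ from the SPSS result for small samples. The function names and the layout of the reference distributions (a dictionary from contamination level to a list of VAF differences) are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def rank_sum_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided rank-sum P value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j for tied values
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t ** 3 - t
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (rank_sum_x - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def estimate_contamination(
    sample_diffs: Sequence[float], reference: Dict[float, Sequence[float]]
) -> Tuple[float, float]:
    """Return the contamination level with the largest P value, and that P value."""
    p_values = {level: rank_sum_p(sample_diffs, ref) for level, ref in reference.items()}
    best = max(p_values, key=p_values.get)
    return best, p_values[best]
```

A large P value means the test found no evidence that the sample's VAF differences and the reference distribution differ, which is why the level with the largest P value is reported.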
In some typical embodiments, the contamination marker sites include at least one of (1) to (60) in Table 1.
Table 1 Contamination marker sites
No.   rsID         Chromosome   Position     Reference base   Mutated base
1     rs2294714    chr1         6257826      T                C
2     rs7532151    chr1         89388944     A                C
3     rs1800880    chr1         156846120    C                T
4     rs11543979   chr1         201981218    C                G
5     rs1136410    chr1         226555302    A                G
6     rs1128919    chr2         148657117    G                A
7     rs1048108    chr2         215674224    G                A
8     rs7652776    chr3         2741024      G                C
9     rs6443222    chr3         9163267      C                T
10    rs2251219    chr3         52584787     T                C
11    rs3218651    chr3         121208176    T                C
12    rs3749234    chr3         176765269    T                C
13    rs17497475   chr4         19242239     C                A
14    rs1870377    chr4         55972974     T                A
15    rs3756122    chr4         143235865    C                G
16    rs2565007    chr5         53820267     C                A
17    rs13184586   chr5         161119125    C                G
18    rs56188706   chr5         180057356    C                G
19    rs2230653    chr6         26056604     G                A
20    rs4607417    chr6         41978274     C                T
21    rs2243384    chr6         117678083    A                G
22    rs2077647    chr6         152129077    T                C
23    rs62456182   chr7         6038722      T                C
24    rs2813838    chr7         24149292     C                G
25    rs2227983    chr7         55229255     G                A
26    rs41736      chr7         116435768    C                T
27    rs2267708    chr7         124392512    C                T
28    rs1293288    chr8         11718528     T                C
29    rs1805794    chr8         90990479     C                G
30    rs7839934    chr8         144941181    G                C
31    rs7026388    chr9         8518143      T                C
32    rs7023954    chr9         21816758     G                A
33    rs357564     chr9         98209594     G                A
34    rs17114803   chr10        104386934    T                C
35    rs9344       chr11        69462910     G                A
36    rs641936     chr11        94197260     A                G
37    rs543840     chr11        115764486    G                A
38    rs734075     chr12        4497047      C                A
39    rs7955902    chr12        40645257     C                A
40    rs2290103    chr12        78531156     A                G
41    rs2259820    chr12        121435342    C                T
42    rs9534262    chr13        32936646     T                C
43    rs4883918    chr13        73350079     T                C
44    rs17655      chr13        103528002    G                C
45    rs7328030    chr13        112269444    C                A
46    rs2231301    chr14        23777099     G                A
47    rs2241119    chr14        81558965     A                G
48    rs1130233    chr14        105239894    C                T
49    rs2305030    chr15        41805237     C                T
50    rs2229765    chr15        99478225     G                A
51    rs2230930    chr17        1782957      C                T
52    rs1799966    chr17        41223094     T                C
53    rs3744093    chr17        56492800     T                C
54    rs4647887    chr17        74558806     A                G
55    rs663651     chr18        42456653     G                A
56    rs2075607    chr19        1222012      G                C
57    rs2302603    chr19        17941294     T                C
58    rs8113496    chr19        29820529     A                G
59    rs2242522    chr19        36228954     G                T
60    rs9617050    chr22        50708731     A                G
In some typical embodiments, the contamination marker sites include at least 10 of (1) to (60) in Table 1; optionally, the contamination marker sites include at least 20 of (1) to (60) in Table 1.
In some embodiments, before the marker sites are screened, the method further comprises a library construction step and/or a sequencing step for the sample. These steps can be carried out with reference to existing library construction and sequencing procedures, respectively.
Embodiments of the present disclosure also provide an electronic device comprising a memory and a processor, wherein the processor, when running a computer program stored in the memory, executes the method for detecting the contamination level of a sample according to any of the foregoing embodiments.
In addition, embodiments of the present disclosure provide a kit for detecting the contamination level of a sample, comprising reagents for detecting target SNP sites, the target SNP sites being the marker sites obtained by the screening method according to any of the foregoing embodiments.
In some typical embodiments, the contamination marker sites include at least one of (1) to (60) in Table 1.
In some typical embodiments, the contamination marker sites include at least 10 of (1) to (60) in Table 1.
In some typical embodiments, the contamination marker sites include at least 20 of (1) to (60) in Table 1.
In some typical embodiments, the contamination marker sites include at least 30 of (1) to (60) in Table 1.
In some typical embodiments, the contamination marker sites include 20, 30, 40, 50, 60, 70, 80, 90 or 100 of (1) to (60) in Table 1.
In some typical embodiments, the type of the reagents is at least one selected from probes, primers and chips.
Embodiments of the present disclosure also provide a composition for detecting the contamination level of a sample, comprising reagents for detecting target SNP sites, the target SNP sites being the contamination marker sites obtained by the method for screening SNP sites for detecting the contamination level of a sample.
In some typical embodiments, the contamination marker sites include at least one of (1) to (60) in the table above; optionally, at least 10 of (1) to (60) in the table above; optionally, at least 20 of (1) to (60) in the table above.
Embodiments of the present disclosure also provide a detection system, which comprises:
a storage device and a processing device;
wherein the processing device, when running a program stored in the storage device, executes the method for screening SNP sites for detecting the contamination level of a sample or the method for detecting the contamination level of a sample.
Embodiments of the present disclosure also provide the use of the kit, the composition, the detection system or the electronic device in tumor gene detection.
In some embodiments, an acquisition module is used to obtain SNP sites in the target region whose population mutation frequency is 30% to 70% as candidate marker sites;
a region division module is used to divide the region between the first and the last candidate marker site on a single chromosome into multiple selection regions, each 0.7 to 1.3 Mb long;
and a selection module is used so that, if a selection region contains two or more candidate marker sites, the site in that selection region whose allele frequency is 40% to 60% and whose genotypes best conform to Hardy-Weinberg equilibrium is selected as the marker site, and the other candidate marker sites in that selection region are removed.
Embodiments of the present disclosure provide a method for screening SNP sites for detecting the contamination level of a sample and a method for detecting the contamination level of a sample. The screening method comprises: obtaining SNP sites in a target region whose population mutation frequency is 30% to 70% as candidate marker sites; dividing the region between the first and the last candidate marker site on a single chromosome into multiple selection regions, each 0.7 to 1.3 Mb long; and, if a selection region contains two or more candidate marker sites, selecting as the marker site the site in that selection region whose allele frequency is 40% to 60% and whose genotypes best conform to Hardy-Weinberg equilibrium, and removing the other candidate marker sites in that selection region. The marker sites obtained with this screening method can be used to detect the contamination level of a sample; compared with the prior art, the detection cost is lower and the detection accuracy is higher.
Examples
Example 1
A method for screening SNP sites for detecting the contamination level of a sample, comprising the following steps:
(1) Using the popfreq_max_20150413 database, sites with a population frequency of 40% to 60% were extracted as SNP candidate sites.
(2) The genotype information of the SNP candidate sites was extracted from N (500) leukocyte samples (uncontaminated negative samples). Sites with a population mutation frequency of 40% to 60% were selected as candidate marker sites.
At the same time, on each chromosome, the region between the first and the last candidate site was divided into multiple selection regions, each 1 Mb long. If a selection region contained two or more candidate marker sites, the site with an allele frequency of 50% and the smallest site disequilibrium coefficient U was chosen as the contamination marker site (if several sites shared the smallest U, any one of them was chosen), and the other candidate marker sites in that selection region were removed. The contamination marker sites chosen on all chromosomes were then merged to give the final set of contamination marker sites.
The site disequilibrium coefficient U is calculated as follows:
$U=(S_0-0.25)^2+(S_1-0.5)^2+(S_2-0.25)^2$, where $S_0$, $S_1$ and $S_2$ are the proportions of the wild-type, heterozygous mutant and homozygous mutant genotypes, respectively, in the genomes of the 500 leukocyte samples.
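The window-based selection of Example 1 can be sketched as follows. Each chromosome is cut into fixed-length windows starting at its first candidate site, and one site is kept per window. The `Candidate` record, the function name and the handling of windows that contain a single candidate (kept as is) are assumptions made for illustration; the allele frequency bounds default to the 40% to 60% range of the general method, whereas Example 1 itself uses sites at 50%.

```python
from typing import Dict, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    allele_freq: float  # population allele frequency, 0 to 1
    u: float            # site disequilibrium coefficient U


def select_markers(
    candidates: Iterable[Candidate],
    window: int = 1_000_000,
    af_min: float = 0.4,
    af_max: float = 0.6,
) -> List[Candidate]:
    """Keep at most one candidate per window on each chromosome."""
    by_chrom: Dict[str, List[Candidate]] = {}
    for cand in candidates:
        by_chrom.setdefault(cand.chrom, []).append(cand)

    selected: List[Candidate] = []
    for sites in by_chrom.values():
        start = min(s.pos for s in sites)  # first candidate site on this chromosome
        windows: Dict[int, List[Candidate]] = {}
        for s in sites:
            windows.setdefault((s.pos - start) // window, []).append(s)
        for index in sorted(windows):
            in_window = windows[index]
            if len(in_window) == 1:
                selected.append(in_window[0])  # assumption: a lone candidate is kept
                continue
            eligible = [s for s in in_window if af_min <= s.allele_freq <= af_max]
            if eligible:
                # min() returns the first of several equal minima, i.e. "any one of them"
                selected.append(min(eligible, key=lambda s: s.u))
    return selected
```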
Example 2
A set of SNP sites (panel) for detecting the contamination level of a sample, obtained with the screening method of Example 1 and consisting of the 60 SNP sites listed in Table 1 above.
Example 3
A method for detecting the contamination level of a sample; the flow chart is shown in Figure 1. The method comprises the following steps.
(1) Construction of the VAF difference distributions of contaminated samples at different contamination levels:
1.1 Sample library construction: one of 500 tumor tissue or plasma samples was randomly drawn as the sample to be contaminated, and another was chosen as the contamination source. Contamination sources (heterologous data) at different contamination levels were mixed into the tumor sample by simulation, with 500 repetitions per contamination level, to obtain contaminated samples at different contamination levels. 50 ng each of the contaminated samples prepared at the different contamination levels and of the (uncontaminated) control sample was taken for the subsequent library construction, which mainly comprises the following steps:
a. fragmenting the sample and repairing the ends;
b. ligating adapters to the repaired DNA fragments;
c. amplifying the adapter-ligated products by PCR to obtain a sufficient amount of adapter-bearing DNA fragments, which constitute the pre-library;
d. purifying the pre-library with magnetic beads, then measuring its concentration and checking fragment quality;
e. hybridizing probes to the pre-library;
f. capturing the probe-bound sample with streptavidin magnetic beads;
g. amplifying the bead-captured DNA fragments by PCR to obtain a sufficient amount of tagged DNA fragments, which constitute the final library;
h. purifying the final library with magnetic beads, measuring its concentration and checking fragment quality, and quantifying it by qPCR.
1.2 Collection of next-generation sequencing data: the libraries obtained above were captured with multi-exon probes for the target region and sequenced on a gene sequencer in 150 bp paired-end mode according to the instrument's standard operating procedures (Read1: 151; Read2: 151; Index1: 8; Index2: 8), yielding next-generation sequencing data in FASTQ format as the raw data.
1.3 Demultiplexing and quality control: bcl2fastq was used to demultiplex the data and the quality-control software fastp was used for quality control, yielding high-quality data (clean data) for the paired samples.
1.4 Alignment: bwa was used to align the clean data to the reference genome hg19 to obtain the alignment information of each sequenced fragment (read); gencore was then used to deduplicate the alignment results and correct bases.
1.5 Marker site selection: the contamination marker sites provided in Example 2 were used.
1.6 Calculation of VAF differences at homozygous marker sites:
Homozygous marker sites were obtained as follows: based on the detected genotype of each contamination marker site in the genome of the sample to be tested and in the uncontaminated control sample, the contamination marker sites whose genotype is wild-type or homozygous mutant in the sample to be tested were merged with the contamination marker sites whose genotype is wild-type or homozygous mutant in the uncontaminated control sample, and the merged set was used as the homozygous marker sites.
Using the pysam package, the VAF of the sample to be tested and of its control sample was obtained at each contamination marker site, and the absolute value of the VAF difference between the tumor sample (the sample to be tested) and the control sample was calculated at each homozygous marker site. In the same way, a reference distribution of the absolute VAF differences between contaminated samples and their control samples at the homozygous marker sites was obtained for each contamination level.
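The site selection and the difference calculation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the genotype encoding ("0/0", "0/1", "1/1"), the function names and the numbers in the usage example are assumptions, and the per-site read counts are assumed to have been extracted beforehand (for instance with pysam).

```python
from typing import Dict, List, Set

# Assumed genotype encoding: "0/0" = wild-type, "1/1" = homozygous mutant.
HOMOZYGOUS = {"0/0", "1/1"}

def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele fraction: reads carrying the mutated base / all reads at the site."""
    return alt_reads / total_reads if total_reads else 0.0

def homozygous_sites(test_gt: Dict[str, str], ctrl_gt: Dict[str, str]) -> Set[str]:
    """Merge (union) the sites that are homozygous in the test sample or in the control."""
    in_test = {site for site, gt in test_gt.items() if gt in HOMOZYGOUS}
    in_ctrl = {site for site, gt in ctrl_gt.items() if gt in HOMOZYGOUS}
    return in_test | in_ctrl

def abs_vaf_differences(test_vaf: Dict[str, float],
                        ctrl_vaf: Dict[str, float],
                        sites: Set[str]) -> List[float]:
    """Absolute VAF difference per homozygous marker site, in sorted site order."""
    return [abs(test_vaf[s] - ctrl_vaf[s])
            for s in sorted(sites) if s in test_vaf and s in ctrl_vaf]

# Usage with made-up values for three marker sites.
test_gt = {"rs9344": "0/0", "rs17655": "0/1", "rs641936": "1/1"}
ctrl_gt = {"rs9344": "0/0", "rs17655": "0/1", "rs641936": "1/1"}
test_vaf = {"rs9344": vaf(5, 1000), "rs17655": 0.49, "rs641936": vaf(990, 1000)}
ctrl_vaf = {"rs9344": 0.0, "rs17655": 0.50, "rs641936": 1.0}
sites = homozygous_sites(test_gt, ctrl_gt)
diffs = abs_vaf_differences(test_vaf, ctrl_vaf, sites)
```

Restricting the comparison to homozygous sites keeps the expected VAF of a clean sample near 0 or 1, so a shift away from those values is attributable to foreign DNA rather than to heterozygosity.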
(2) Rank-sum test: the VAF differences between the sample to be tested and its control sample at the homozygous marker sites were compared, by a rank-sum test, with the reference distribution of the absolute VAF differences between contaminated samples and their control samples at the homozygous sites for each contamination level, and a P value was calculated for each level (two-tailed test).
(3) Prediction of the contamination level:
The contamination level whose reference distribution gives the largest P value is taken as the contamination level of the sample to be tested.
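Steps (2) and (3) can be sketched as below. The sketch assumes that the rank-sum test is the two-sided Wilcoxon rank-sum (Mann-Whitney) test evaluated with a tie-corrected normal approximation; the source does not state which implementation was used. The function names and the reference values in the usage example are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

def rank_sum_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided rank-sum P value (normal approximation with tie correction)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2                 # Mann-Whitney U of the first sample
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u1 - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

def predict_level(sample_diffs: Sequence[float],
                  reference: Dict[float, Sequence[float]]) -> Tuple[float, Dict[float, float]]:
    """Return the contamination level whose reference distribution gives the largest P value."""
    p_values = {level: rank_sum_p(sample_diffs, dist) for level, dist in reference.items()}
    return max(p_values, key=p_values.get), p_values

# Usage with made-up reference distributions for two contamination levels.
reference = {
    0.01: [0.004, 0.005, 0.006, 0.005, 0.0045, 0.0055],
    0.10: [0.045, 0.050, 0.055, 0.048, 0.052, 0.051],
}
level, p_values = predict_level([0.049, 0.050, 0.053, 0.047], reference)
```

A large P value means the sample's differences are statistically indistinguishable from those of the reference level, which is why the maximum, rather than the minimum, is selected.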
It should be noted that the source of the sample to be tested chosen in this example is only exemplary. The embodiments of the present disclosure can be applied to almost all human diseases, including but not limited to tumors and genetic diseases (such as chromosomal diseases, dominant genetic diseases, recessive genetic diseases, monogenic diseases, polygenic diseases and X-linked genetic diseases), and the contamination level of the sample can be detected in each case.
In addition, the sample type chosen in this example is also only exemplary. Other sample types that can be used in the embodiments of the present disclosure include leukocyte samples, oral mucosa samples, saliva samples, cerebrospinal fluid samples, pleural fluid samples, ascites samples and oral swab samples.
Example 4
A method for detecting the contamination level of a sample, substantially the same as the method provided in Example 3, except that 30 contamination marker sites are used, all selected from the contamination marker sites in Table 1. The sites are listed in Table 2.
Table 2 Contamination marker sites
rsid        Chromosome  Position   Reference base  Mutated base
rs2302603   chr19       17941294   T               C
rs2077647   chr6        152129077  T               C
rs1800880   chr1        156846120  C               T
rs4647887   chr17       74558806   A               G
rs6443222   chr3        9163267    C               T
rs11543979  chr1        201981218  C               G
rs56188706  chr5        180057356  C               G
rs2290103   chr12       78531156   A               G
rs1048108   chr2        215674224  G               A
rs2813838   chr7        24149292   C               G
rs7328030   chr13       112269444  C               A
rs2241119   chr14       81558965   A               G
rs1799966   chr17       41223094   T               C
rs1130233   chr14       105239894  C               T
rs3218651   chr3        121208176  T               C
rs17114803  chr10       104386934  T               C
rs1870377   chr4        55972974   T               A
rs641936    chr11       94197260   A               G
rs2242522   chr19       36228954   G               T
rs9344      chr11       69462910   G               A
rs3744093   chr17       56492800   T               C
rs4607417   chr6        41978274   C               T
rs7026388   chr9        8518143    T               C
rs9617050   chr22       50708731   A               G
rs17655     chr13       103528002  G               C
rs13184586  chr5        161119125  C               G
rs3756122   chr4        143235865  C               G
rs62456182  chr7        6038722    T               C
rs7532151   chr1        89388944   A               C
rs2305030   chr15       41805237   C               T
Example 5
A method for detecting the contamination level of a sample, substantially the same as the method provided in Example 3, except that 60 contamination marker sites are used, of which 30 are selected from Table 1 and the other 30 are sites not included in Table 1 (the sites shown in bold are the marker sites that differ from Table 1; they are other sites at which the proportion of the heterozygous mutant genotype is 40% to 60%). The contamination marker sites are listed in Table 3.
Table 3 Contamination marker sites
rsid        Chromosome  Position   Reference base  Mutated base
rs2294714   chr1        6257826    T               C
rs3219489   chr1        45797505   C               G
rs7532151   chr1        89388944   A               C
rs2298258   chr1        162737116  C               G
rs3747636   chr1        204403659  A               G
rs1136410   chr1        226555302  A               G
rs1128919   chr2        148657117  G               A
rs1048108   chr2        215674224  G               A
rs2227982   chr2        242793433  G               A
rs7652776   chr3        2741024    G               C
rs11466512  chr3        30713126   T               A
rs3218651   chr3        121208176  T               C
rs2227931   chr3        142222284  A               G
rs3749234   chr3        176765269  T               C
rs1870377   chr4        55972974   T               A
rs1350191   chr4        154807324  C               T
rs2043112   chr5        38955796   G               A
rs2565007   chr5        53820267   C               A
rs3733875   chr5        176637240  G               T
rs56188706  chr5        180057356  C               G
rs2230653   chr6        26056604   G               A
rs915894    chr6        32190390   T               G
rs1801270   chr6        36651971   C               A
rs12055782  chr6        128312033  A               G
rs2077647   chr6        152129077  T               C
rs9639168   chr7        13978809   T               C
rs2074566   chr7        26224668   C               T
rs2227983   chr7        55229255   G               A
rs1129293   chr7        106513011  C               T
rs2267708   chr7        124392512  C               T
rs3808565   chr8        26227640   A               G
rs1805794   chr8        90990479   C               G
rs4244612   chr8        145741702  C               G
rs7026388   chr9        8518143    T               C
rs290223    chr9        93639846   C               G
rs17114803  chr10       104386934  T               C
rs2071313   chr11       64572602   G               A
rs974144    chr11       85968623   C               T
rs11062385  chr12       427575     A               G
rs3741622   chr12       49425978   T               C
rs2290103   chr12       78531156   A               G
rs2075784   chr12       133263825  G               A
rs4883918   chr13       73350079   T               C
rs1805097   chr13       110435231  C               T
rs9549365   chr13       113907391  A               G
rs2241119   chr14       81558965   A               G
rs1130233   chr14       105239894  C               T
rs2305030   chr15       41805237   C               T
rs2229765   chr15       99478225   G               A
rs2230930   chr17       1782957    C               T
rs1799966   chr17       41223094   T               C
rs2240308   chr17       63554591   G               A
rs1567962   chr17       78919558   C               T
rs663651    chr18       42456653   G               A
rs2075607   chr19       1222012    G               C
rs1799817   chr19       7125297    G               A
rs2302603   chr19       17941294   T               C
rs12461253  chr19       31769763   G               A
rs2076578   chr22       41569609   C               T
rs9617050   chr22       50708731   A               G
Test Example 1
The detection performance of the method of Example 3 was compared with that of the prior-art tool Conpair.
First, simulated contaminated samples were constructed, and their contamination levels were detected with the method of Example 3. The reference distribution was constructed as follows: one of 500 cell line samples (a sample known to be uncontaminated) was randomly selected as the sample to be contaminated, other leukocyte or cell line samples were randomly selected as contamination sources, and the contamination sources were spiked in at different contamination ratios by mass to obtain contaminated samples at different contamination levels. At each contamination level, 500 contaminated samples and their control samples (samples without spiked-in contamination source) were randomly drawn to obtain the distribution of VAF differences; see FIG. 2.
Next, real samples at different contamination levels were constructed; the real samples are clinical samples that may have been contaminated. Other cell line samples were selected as contamination sources and spiked into the real samples to obtain real samples at different contamination levels. The contamination levels of these real samples were detected with the method of Example 3 and with the existing Conpair method (source: Bergmann, Ewa A., et al. "Conpair: concordance and contamination estimator for matched tumor–normal pairs." Bioinformatics 20:3196-3198.). The marker site information of the two methods is shown in Table 4.
Table 4 Detection sites
Method                            Number of markers
Conpair                           7387
Method of the present disclosure  60
The distribution of the number of homozygous marker sites covered by the detection method of Example 3 is shown in FIG. 2. The real data are the distribution of homozygous marker sites obtained when testing real samples; the simulated data are randomly generated binomial random numbers drawn from the binomial distribution B(60, 0.5), that is, with parameters n = 60 and p = 0.5. The results show that the distribution of the number of homozygous marker sites obtained by screening agrees with the theoretical distribution, indicating that the screening conditions for contamination marker sites are effective and close to the ideal case.
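One way to read the choice of B(60, 0.5) is the following (this reasoning is an interpretation, not stated in the source): a site in Hardy-Weinberg equilibrium with an allele frequency of 0.5 has genotype frequencies 0.25, 0.5 and 0.25, so it is homozygous with probability 0.25 + 0.25 = 0.5, and with 60 independent sites the number of homozygous sites per sample follows B(60, 0.5). A short sketch of that reference distribution, with hypothetical function names and an arbitrary seed:

```python
import math
import random
from typing import List

def binomial_pmf(k: int, n: int = 60, p: float = 0.5) -> float:
    """Probability that exactly k of n independent sites are homozygous."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def simulate_counts(samples: int, n: int = 60, p: float = 0.5, seed: int = 0) -> List[int]:
    """Draw the number of homozygous sites for a number of simulated samples."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(samples)]

n, p = 60, 0.5
expected_sites = n * p                        # 30 homozygous sites on average
spread = math.sqrt(n * p * (1 - p))           # standard deviation, about 3.87 sites
simulated = simulate_counts(samples=500)      # comparable to the "simulated data" curve
```

Comparing a histogram of `simulated` with the observed counts from real samples is the kind of check the paragraph above describes.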
Part of the contamination level results obtained with the detection method of Example 3 and with Conpair is shown in Table 5.
Table 5 Comparison of detection results
Sample name       Reference  Present disclosure (Example 3)  Conpair prediction
PRE-0_005N93B66   0.005      0.003                           0.56492
PRE-0_005N94B68   0.005      0.003                           0.54717
PRE-0_005N95B78   0.005      0.004                           0.54601
PRE-0_01N93B86    0.01       0.008                           0.5604
PRE-0_01N94B94    0.01       0.009                           0.55138
PRE-0_01N95B51    0.01       0.008                           0.55441
PRE-0_02N93B59    0.02       0.015                           0.99297
PRE-0_02N94B60    0.02       0.02                            0.55951
PRE-0_02N95B61    0.02       0.015                           0.56807
PRE-0_03N93B66    0.03       0.025                           0.56421
PRE-0_03N94B68    0.03       0.03                            0.55602
PRE-0_03N95B78    0.03       0.025                           0.56495
PRE-0_05N93B86    0.05       0.04                            0.55515
PRE-0_05N94B94    0.05       0.05                            0.52593
PRE-0_05N95B51    0.05       0.05                            0.57246
PRE-0_1N93B59     0.1        0.1                             0.52641
PRE-0_1N94B60     0.1        0.1                             0.51767
PRE-0_1N95B61     0.1        0.1                             0.55821
As can be seen from FIG. 3 and Table 5, the method of Example 3 of the present disclosure needs to cover only a small number of marker sites. Compared with existing tools, which require more than 1000 marker sites, the marker sites can be flexibly selected or embedded into different panel products, so the method is flexible and widely applicable.
The performance of the detection method of Example 3 is shown in FIG. 4. Specifically, panel A of FIG. 4 shows the performance of the method of Example 3 on simulated contaminated samples, and panel B of FIG. 4 shows its performance on real samples at different contamination levels.
The results show that the coefficient of determination between the predicted values and the theoretical values is 0.9984 for the simulated contaminated samples and 0.9987 for the real samples. Conpair performs poorly on the real samples; compared with Conpair, the values predicted by the method of Example 3 are closer to the true values. It should be noted that Example 3 is only an exemplary illustration of the embodiments of the present disclosure and neither represents nor limits the scope of protection of the present disclosure; methods according to other embodiments of the present disclosure can also obtain prediction results close to those of Example 3.
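The coefficient of determination quoted above can be computed as in the following sketch, which assumes that it is the squared Pearson correlation between predicted and theoretical contamination levels; the source does not give the exact formula. The function name is hypothetical, and the usage example takes only one row per contamination level from Table 5, so it is not expected to reproduce the reported 0.9984 or 0.9987.

```python
from typing import Sequence

def r_squared(predicted: Sequence[float], theoretical: Sequence[float]) -> float:
    """Squared Pearson correlation between predicted and theoretical values."""
    n = len(predicted)
    if n != len(theoretical) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_p = sum(predicted) / n
    mean_t = sum(theoretical) / n
    s_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, theoretical))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_tt = sum((t - mean_t) ** 2 for t in theoretical)
    if s_pp == 0 or s_tt == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return s_pt * s_pt / (s_pp * s_tt)

# Usage: first sample of each contamination level in Table 5 (Example 3 predictions).
theoretical = [0.005, 0.01, 0.02, 0.03, 0.05, 0.1]
predicted = [0.003, 0.008, 0.015, 0.025, 0.04, 0.1]
r2 = r_squared(predicted, theoretical)
```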
Test Example 2
The detection performance of the methods provided in Examples 3, 4 and 5 was compared.
Detection method
The methods provided in Examples 3, 4 and 5 were used to test 61 samples with known contamination levels. The results are shown in Table 6.
Table 6 Detection results
Sample no.  Example 5 prediction  Example 3 prediction  Example 4 prediction  Theoretical contamination level
1           0.002                 0.004                 0.004                 0.005
2           0.004                 0.002                 0.004                 0.005
3           0.004                 0.005                 0.003                 0.005
4           0.006                 0.009                 0.015                 0.01
5           0.007                 0.007                 0.009                 0.01
6           0.009                 0.007                 0.003                 0.01
7           0.015                 0.02                  0.03                  0.02
8           0.015                 0.015                 0.02                  0.02
9           0.015                 0.02                  0.015                 0.02
10          0.02                  0.025                 0.035                 0.03
11          0.02                  0.025                 0.025                 0.03
12          0.03                  0.03                  0.02                  0.03
13          0.035                 0.05                  0.05                  0.05
14          0.05                  0.04                  0.04                  0.05
15          0.05                  0.05                  0.035                 0.05
16          0.08                  0.1                   0.12                  0.1
17          0.09                  0.09                  0.1                   0.1
18          0.11                  0.11                  0.08                  0.1
19          0.87                  0.91                  0.92                  0.9
20          0.89                  0.88                  0.91                  0.9
21          0.89                  0.91                  0.88                  0.9
22          0.9                   0.97                  0.99                  0.95
23          0.94                  0.93                  0.95                  0.95
24          0.93                  0.95                  0.94                  0.95
25          0.94                  0.98                  1                     0.97
26          0.97                  0.96                  0.97                  0.97
27          0.96                  0.98                  0.94                  0.97
28          0.97                  1                     1                     0.98
29          0.97                  0.96                  0.98                  0.98
30          0.97                  0.98                  0.96                  0.98
31          0.99                  1                     1                     0.99
32          0.99                  1                     1                     0.99
33          0.99                  1                     0.98                  0.99
34          0.19                  0.19                  0.18                  0.2
35          0.25                  0.27                  0.25                  0.25
36          0.29                  0.3                   0.3                   0.3
37          0.25                  0.27                  0.27                  0.25
38          0.1                   0.11                  0.1                   0.1
39          0.1                   0.11                  0.11                  0.1
40          0.15                  0.16                  0.16                  0.15
41          0.21                  0.22                  0.22                  0.2
42          0.29                  0.32                  0.31                  0.3
43          0.38                  0.4                   0.36                  0.4
44          0.1                   0.1                   0.09                  0.1
45          0.14                  0.16                  0.15                  0.15
46          0.2                   0.21                  0.21                  0.2
47          0.3                   0.31                  0.3                   0.3
48          0.39                  0.41                  0.38                  0.4
49          0.81                  0.84                  0.89                  0.8
50          0.74                  0.77                  0.82                  0.75
51          0.69                  0.72                  0.76                  0.7
52          0.75                  0.78                  0.81                  0.75
53          0.88                  0.94                  0.97                  0.9
54          0.85                  0.88                  0.91                  0.85
55          0.78                  0.81                  0.84                  0.8
56          0.7                   0.73                  0.78                  0.7
57          0.62                  0.64                  0.7                   0.6
58          0.85                  0.89                  0.94                  0.85
59          0.8                   0.83                  0.89                  0.8
60          0.69                  0.73                  0.77                  0.7
61          0.6                   0.64                  0.7                   0.6
The performance of the detection method of Example 4 is shown in panel B of FIG. 5, and the performance of the detection method of Example 5 is shown in panel A of FIG. 5.
The results show that the coefficient of determination between the predicted values and the theoretical values is 0.9937 for the detection method of Example 4 and 0.9993 for the detection method of Example 5. Randomly replacing 30 of the marker sites (Example 5) changes the performance only slightly (0.9987 vs 0.9993), indicating that, as long as the screening conditions for marker sites are met, any set of 60 rs marker sites performs stably.
When the number of markers is reduced to 30 (Example 4), the prediction performance decreases somewhat and the coefficient of determination falls to 0.9937. This indicates that reducing the number of markers lowers the prediction performance, but that predicting contamination with 30 markers is still feasible.
The above are only typical embodiments of the present disclosure and are not intended to limit the present disclosure. For those skilled in the art, the present disclosure may be modified and varied in many ways. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.
Industrial Applicability
The embodiments of the present disclosure provide a method for screening SNP sites for detecting the contamination level of a sample and a method for detecting the contamination level of a sample. The marker sites selected by the screening method can be used to detect the contamination level of a sample; compared with the prior art, the detection cost is lower and the detection accuracy is higher, so the methods have broad application value. In addition, the marker sites selected by the screening method perform stably when used to detect the contamination level of a sample and therefore have considerable practical value.

Claims (16)

  1. A method for screening SNP sites for detecting the contamination level of a sample, characterized in that SNP sites with a population mutation frequency of 30% to 70% in a target region are obtained as candidate marker sites;
    the region between the first and the last of the candidate marker sites present on a single chromosome is divided into a plurality of selection regions, each selection region having a length of 0.7 to 1.3 Mb;
    if two or more candidate marker sites are present in a selection region, the site in that selection region whose allele frequency is 40% to 60% and whose genotypes best conform to Hardy-Weinberg equilibrium is selected as the marker site, and the other candidate marker sites in that selection region are removed.
  2. The method for screening SNP sites for detecting the contamination level of a sample according to claim 1, characterized in that the site whose genotypes best conform to Hardy-Weinberg equilibrium is the site with the smallest site imbalance coefficient U in the selection region, the site imbalance coefficient U being calculated as follows:
    $U = (S_0 - 0.25)^2 + (S_1 - 0.5)^2 + (S_2 - 0.25)^2$; where $S_0$, $S_1$ and $S_2$ are the population frequencies, in the target region, of the wild-type, heterozygous mutant and homozygous mutant genotypes, respectively.
  3. The method for screening SNP sites for detecting the contamination level of a sample according to claim 2, characterized in that, when a selection region contains more than one site that best conforms to Hardy-Weinberg equilibrium, any one of them is selected;
    preferably, the SNP candidate sites are SNP sites with a population mutation frequency of 30% to 70% in the genomes of N uncontaminated negative samples;
    preferably, N ≥ 50;
    preferably, N ≥ 100.
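The screening of claims 1 to 3 can be illustrated with the sketch below. It is a simplified reading under stated assumptions, not the claimed method itself: selection regions are fixed 1 Mb windows (within the claimed 0.7 to 1.3 Mb) counted from the first candidate on the chromosome, the "population mutation frequency" and the "allele frequency" are taken from the same field, a region holding a single candidate keeps it, and a region with several candidates but none in the 40% to 60% band contributes no marker. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Site:
    rsid: str
    position: int       # coordinate on a single chromosome
    allele_freq: float  # population frequency of the mutated allele
    s0: float           # population frequency of the wild-type genotype
    s1: float           # population frequency of the heterozygous mutant genotype
    s2: float           # population frequency of the homozygous mutant genotype

def imbalance(site: Site) -> float:
    """Site imbalance coefficient U of claim 2 (0 at exact 0.25 / 0.5 / 0.25)."""
    return (site.s0 - 0.25) ** 2 + (site.s1 - 0.5) ** 2 + (site.s2 - 0.25) ** 2

def select_markers(sites: List[Site], window: int = 1_000_000) -> List[Site]:
    """Keep at most one marker site per selection region on one chromosome."""
    candidates = [s for s in sites if 0.30 <= s.allele_freq <= 0.70]
    if not candidates:
        return []
    start = min(s.position for s in candidates)
    regions: Dict[int, List[Site]] = {}
    for s in candidates:
        regions.setdefault((s.position - start) // window, []).append(s)
    chosen: List[Site] = []
    for index in sorted(regions):
        group = regions[index]
        if len(group) == 1:
            chosen.append(group[0])
            continue
        eligible = [s for s in group if 0.40 <= s.allele_freq <= 0.60]
        if eligible:
            # min() returns the first of several equally balanced sites,
            # which matches "any one of them" in claim 3.
            chosen.append(min(eligible, key=imbalance))
    return chosen

# Usage with made-up sites on one chromosome.
sites = [
    Site("a", 100, 0.50, 0.25, 0.50, 0.25),
    Site("b", 500_000, 0.45, 0.30, 0.50, 0.20),
    Site("c", 2_500_000, 0.35, 0.42, 0.46, 0.12),
    Site("d", 3_000_000, 0.90, 0.01, 0.18, 0.81),
]
markers = select_markers(sites)
```

Spacing the markers roughly 1 Mb apart limits linkage between neighbouring sites, so each retained site contributes largely independent evidence.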
  4. The method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 3, characterized in that the SNP candidate sites are sites with a frequency of 40% to 60% in a gene database.
  5. The method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 4, characterized in that the source of the sample to be tested is selected from tumors and genetic diseases;
    preferably, the type of the sample to be tested is selected from plasma samples, leukocyte samples, tissue samples, oral mucosa samples, saliva samples, cerebrospinal fluid samples, pleural fluid samples, ascites samples and oral swab samples.
  6. A method for detecting the contamination level of a sample, characterized in that it comprises using contamination marker sites obtained by the method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 5.
  7. The method for detecting the contamination level of a sample according to claim 6, characterized in that the method comprises: taking, as homozygous marker sites, the contamination marker sites whose genotype is homozygous in the target region of the sample to be tested and/or of the uncontaminated control sample; and performing a rank-sum test between the obtained VAF (mutation abundance) differences of the sample to be tested and the VAF difference distribution of contaminated samples at each of the different contamination levels, to obtain a P value for each contamination level;
    wherein the VAF differences of the sample to be tested are the absolute values of the VAF differences between the sample to be tested and the uncontaminated negative control sample at the homozygous marker sites;
    and the VAF difference distribution of the contaminated samples at the different contamination levels is the distribution of the absolute values of the VAF differences, at the homozygous marker sites, between contaminated samples spiked with contamination source at the different levels and the uncontaminated negative control sample.
  8. The method for detecting the contamination level of a sample according to claim 7, characterized in that the method comprises: determining the maximum of the P values obtained at the different contamination levels, and taking the contamination level corresponding to the maximum P value as the contamination level of the sample to be tested;
    preferably, the contamination ratios of the different contamination levels are selected from 0.01% to 99.9%.
  9. The method for detecting the contamination level of a sample according to claim 6, characterized in that the contamination marker sites comprise at least one of (1) to (60) in the following table;
    No.  rsid        Chromosome  Position   Reference base  Mutated base
    1    rs2294714   chr1        6257826    T               C
    2    rs7532151   chr1        89388944   A               C
    3    rs1800880   chr1        156846120  C               T
    4    rs11543979  chr1        201981218  C               G
    5    rs1136410   chr1        226555302  A               G
    6    rs1128919   chr2        148657117  G               A
    7    rs1048108   chr2        215674224  G               A
    8    rs7652776   chr3        2741024    G               C
    9    rs6443222   chr3        9163267    C               T
    10   rs2251219   chr3        52584787   T               C
    11   rs3218651   chr3        121208176  T               C
    12   rs3749234   chr3        176765269  T               C
    13   rs17497475  chr4        19242239   C               A
    14   rs1870377   chr4        55972974   T               A
    15   rs3756122   chr4        143235865  C               G
    16   rs2565007   chr5        53820267   C               A
    17   rs13184586  chr5        161119125  C               G
    18   rs56188706  chr5        180057356  C               G
    19   rs2230653   chr6        26056604   G               A
    20   rs4607417   chr6        41978274   C               T
    21   rs2243384   chr6        117678083  A               G
    22   rs2077647   chr6        152129077  T               C
    23   rs62456182  chr7        6038722    T               C
    24   rs2813838   chr7        24149292   C               G
    25   rs2227983   chr7        55229255   G               A
    26   rs41736     chr7        116435768  C               T
    27   rs2267708   chr7        124392512  C               T
    28   rs1293288   chr8        11718528   T               C
    29   rs1805794   chr8        90990479   C               G
    30   rs7839934   chr8        144941181  G               C
    31   rs7026388   chr9        8518143    T               C
    32   rs7023954   chr9        21816758   G               A
    33   rs357564    chr9        98209594   G               A
    34   rs17114803  chr10       104386934  T               C
    35   rs9344      chr11       69462910   G               A
    36   rs641936    chr11       94197260   A               G
    37   rs543840    chr11       115764486  G               A
    38   rs734075    chr12       4497047    C               A
    39   rs7955902   chr12       40645257   C               A
    40   rs2290103   chr12       78531156   A               G
    41   rs2259820   chr12       121435342  C               T
    42   rs9534262   chr13       32936646   T               C
    43   rs4883918   chr13       73350079   T               C
    44   rs17655     chr13       103528002  G               C
    45   rs7328030   chr13       112269444  C               A
    46   rs2231301   chr14       23777099   G               A
    47   rs2241119   chr14       81558965   A               G
    48   rs1130233   chr14       105239894  C               T
    49   rs2305030   chr15       41805237   C               T
    50   rs2229765   chr15       99478225   G               A
    51   rs2230930   chr17       1782957    C               T
    52   rs1799966   chr17       41223094   T               C
    53   rs3744093   chr17       56492800   T               C
    54   rs4647887   chr17       74558806   A               G
    55   rs663651    chr18       42456653   G               A
    56   rs2075607   chr19       1222012    G               C
    57   rs2302603   chr19       17941294   T               C
    58   rs8113496   chr19       29820529   A               G
    59   rs2242522   chr19       36228954   G               T
    60   rs9617050   chr22       50708731   A               G
    ;
    preferably, the contamination marker sites comprise at least 10 of (1) to (60) in the above table;
    preferably, the contamination marker sites comprise at least 20 of (1) to (60) in the above table.
  10. The method for detecting the contamination level of a sample according to claim 6, characterized in that the contamination marker sites comprise at least one of 1) to 30) in the following table, or comprise a combination of at least one of 1) to 30) in the following table with at least one of the contamination marker sites (1) to (60) of claim 9:
    No.  rsID        Chromosome  Position   Reference base  Mutated base
    1)   rs3219489   chr1        45797505   C               G
    2)   rs2298258   chr1        162737116  C               G
    3)   rs3747636   chr1        204403659  A               G
    4)   rs2227982   chr2        242793433  G               A
    5)   rs11466512  chr3        30713126   T               A
    6)   rs2227931   chr3        142222284  A               G
    7)   rs1350191   chr4        154807324  C               T
    8)   rs2043112   chr5        38955796   G               A
    9)   rs3733875   chr5        176637240  G               T
    10)  rs915894    chr6        32190390   T               G
    11)  rs1801270   chr6        36651971   C               A
    12)  rs12055782  chr6        128312033  A               G
    13)  rs9639168   chr7        13978809   T               C
    14)  rs2074566   chr7        26224668   C               T
    15)  rs1129293   chr7        106513011  C               T
    16)  rs3808565   chr8        26227640   A               G
    17)  rs4244612   chr8        145741702  C               G
    18)  rs290223    chr9        93639846   C               G
    19)  rs2071313   chr11       64572602   G               A
    20)  rs974144    chr11       85968623   C               T
    21)  rs11062385  chr12       427575     A               G
    22)  rs3741622   chr12       49425978   T               C
    23)  rs2075784   chr12       133263825  G               A
    24)  rs1805097   chr13       110435231  C               T
    25)  rs9549365   chr13       113907391  A               G
    26)  rs2240308   chr17       63554591   G               A
    27)  rs1567962   chr17       78919558   C               T
    28)  rs1799817   chr19       7125297    G               A
    29)  rs12461253  chr19       31769763   G               A
    30)  rs2076578   chr22       41569609   C               T
    .
  11. A kit for detecting the contamination level of a sample, characterized in that it comprises reagents for detecting target SNP sites, wherein the target SNP sites are contamination marker sites obtained by the method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 5;
    Preferably, the contamination marker sites include at least one of (1) to (60) in the above table;
    Preferably, the contamination marker sites include at least 10 of (1) to (60) in the above table;
    Preferably, the contamination marker sites include at least 20 of (1) to (60) in the above table.
  12. A composition for detecting the contamination level of a sample, characterized in that it comprises reagents for detecting target SNP sites, wherein the target SNP sites are contamination marker sites obtained by the method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 5;
    Preferably, the contamination marker sites include at least one of (1) to (60) in the above table;
    Preferably, the contamination marker sites include at least 10 of (1) to (60) in the above table;
    Preferably, the contamination marker sites include at least 20 of (1) to (60) in the above table.
  13. A detection system, characterized in that it comprises:
    a storage device and a processing device;
    wherein, when the processing device runs a program stored in the storage device, it executes the method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 5, or the method for detecting the contamination level of a sample according to any one of claims 6 to 11.
  14. An electronic device, characterized in that it comprises a memory and a processor, wherein, when the processor runs a computer program stored in the memory, it executes the method for screening SNP sites for detecting the contamination level of a sample according to any one of claims 1 to 5, or the method for detecting the contamination level of a sample according to any one of claims 6 to 11.
  15. Use of the kit according to claim 11, the composition according to claim 12, the detection system according to claim 13, or the electronic device according to claim 14 in tumor gene detection.
  16. The screening method according to claim 1, characterized in that an acquisition module is used to acquire SNP sites in the target region whose population mutation frequency is 30% to 70% as candidate marker sites;
    a region division module is used to divide the region between the start site and the end site of the candidate marker sites on a single chromosome into a plurality of selection regions, each selection region being 0.7 to 1.3 Mb in length;
    a selection module is used such that, if two or more candidate marker sites exist in a selection region, the site in that selection region whose allele frequency is 40% to 60% and whose genotypes best conform to Hardy-Weinberg equilibrium is selected as the marker site, and the other candidate marker sites in that selection region are removed.
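
The three modules recited in claim 16 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The `Snp` record, the `select_markers` function, and the genotype-count inputs are hypothetical names introduced here for illustration. The sketch also assumes a fixed 1 Mb window (one value inside the claimed 0.7 to 1.3 Mb range), uses a chi-square statistic as the measure of Hardy-Weinberg fit, keeps a candidate that is alone in its region, and keeps nothing from a crowded region in which no candidate falls in the 40% to 60% band. The claim text does not fix any of these choices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    """Hypothetical record: one SNP with genotype counts from a reference population."""
    rsid: str
    chrom: str
    pos: int
    n_ref_hom: int  # individuals homozygous for the reference base
    n_het: int      # heterozygous individuals
    n_alt_hom: int  # individuals homozygous for the mutated base

    @property
    def alt_freq(self) -> float:
        """Population frequency of the mutated allele."""
        n = self.n_ref_hom + self.n_het + self.n_alt_hom
        return (2 * self.n_alt_hom + self.n_het) / (2 * n)

    @property
    def hwe_chi2(self) -> float:
        """Assumed fit measure: chi-square distance between observed and
        Hardy-Weinberg expected genotype counts (smaller is a better fit)."""
        n = self.n_ref_hom + self.n_het + self.n_alt_hom
        q = self.alt_freq
        p = 1.0 - q
        expected = (n * p * p, 2 * n * p * q, n * q * q)
        observed = (self.n_ref_hom, self.n_het, self.n_alt_hom)
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def select_markers(snps, window=1_000_000):
    """Return at most one marker site per selection region."""
    # Acquisition step: keep sites with a population frequency of 30% to 70%.
    candidates = [s for s in snps if 0.30 <= s.alt_freq <= 0.70]

    by_chrom = defaultdict(list)
    for s in candidates:
        by_chrom[s.chrom].append(s)

    markers = []
    for sites in by_chrom.values():
        # Region division step: cut the span from the first candidate on the
        # chromosome into consecutive windows of fixed length.
        sites.sort(key=lambda s: s.pos)
        start = sites[0].pos
        regions = defaultdict(list)
        for s in sites:
            regions[(s.pos - start) // window].append(s)

        # Selection step: in a crowded region keep only the best-fitting site
        # whose allele frequency lies between 40% and 60%.
        for idx in sorted(regions):
            group = regions[idx]
            if len(group) == 1:
                markers.append(group[0])  # assumption: lone candidates are kept
                continue
            eligible = [s for s in group if 0.40 <= s.alt_freq <= 0.60]
            if eligible:
                markers.append(min(eligible, key=lambda s: s.hwe_chi2))
    return markers
```

Calling `select_markers` on a list of `Snp` records returns the retained marker sites ordered by position within each chromosome. A production pipeline would more likely read allele frequencies and genotype counts from a population database and might use an exact Hardy-Weinberg test instead of the chi-square statistic assumed here.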
PCT/CN2021/129081 2020-11-23 2021-11-05 Method for screening snp sites for detecting contamination level of sample and method for detecting contamination level of sample WO2022105629A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011321699.3 2020-11-23
CN202011321699.3A CN114530198A (en) 2020-11-23 2020-11-23 Screening method of SNP (single nucleotide polymorphism) sites for detecting sample pollution level and detection method of sample pollution level

Publications (1)

Publication Number Publication Date
WO2022105629A1 (en) 2022-05-27

Family

ID=81619023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/129081 WO2022105629A1 (en) 2020-11-23 2021-11-05 Method for screening snp sites for detecting contamination level of sample and method for detecting contamination level of sample

Country Status (2)

Country Link
CN (1) CN114530198A (en)
WO (1) WO2022105629A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805510A (en) * 2022-09-01 2023-09-26 杭州链康医学检验实验室有限公司 Site combination for judging sample pairing or pollution and application thereof
CN116935966B (en) * 2023-09-13 2024-01-23 北京诺禾致源科技股份有限公司 Method and device for judging pollution of high-throughput sequencing paired data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050106575A1 (en) * 2003-11-13 2005-05-19 Panomics, Inc. Method and kit for detecting mutation or nucleotide variation of organism
CN106156538A (en) * 2016-06-29 2016-11-23 天津诺禾医学检验所有限公司 The annotation method of a kind of full-length genome variation data and annotation system
CN108220403A (en) * 2017-12-26 2018-06-29 北京科迅生物技术有限公司 Detection method, detection device, storage medium and the processor in specific mutation site
CN108304694A (en) * 2018-01-30 2018-07-20 元码基因科技(北京)股份有限公司 Method based on two generation sequencing data analyzing gene mutations

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083529A (en) * 2022-07-11 2022-09-20 北京吉因加医学检验实验室有限公司 Method and device for detecting sample pollution rate
CN117253539A (en) * 2023-11-20 2023-12-19 北京求臻医学检验实验室有限公司 Method and system for detecting sample pollution in high-throughput sequencing based on germ line mutation
CN117253539B (en) * 2023-11-20 2024-02-06 北京求臻医学检验实验室有限公司 Method and system for detecting sample pollution in high-throughput sequencing based on germ line mutation

Also Published As

Publication number Publication date
CN114530198A (en) 2022-05-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893776

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893776

Country of ref document: EP

Kind code of ref document: A1