CN115631789B - A Pan-Genome-Based Population Joint Variation Detection Method - Google Patents
- Publication number
- CN115631789B (application CN202211313819.4A)
- Authority
- CN
- China
- Prior art keywords
- genome
- candidate
- sequence
- haplotype
- window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
A pan-genome-based method for joint variant detection in populations, relating to the technical field of gene biology. The invention addresses the low alignment efficiency, weak variant detection sensitivity, and insufficiently accurate results of current population joint variant detection methods. The method comprises: obtaining a reference genome and a tested sample genome, using them to identify candidate variant sites on the tested sample genome, and classifying the candidate variant sites; constructing local haplotype sequences from the candidate variant sites according to their types, and integrating the constructed local haplotype sequences across samples to obtain an integrated haplotype sequence set; using the integrated haplotype sequences to determine the genotypes of the variant sites in the tested sample genome; and traversing all tested sample genomes in the population, repeating steps 1 to 3, to obtain all variant genotypes in the population. The invention is used for joint variant detection on population genomes.
Description
Technical Field
The present invention relates to the technical field of gene biology, and in particular to a pan-genome-based method for joint variant detection in populations.
Background Art
Variant detection means detecting the nucleotide differences between an individual sample and a reference genome. It is the basis of many genomic studies and can reveal associations between genomic variation and important phenotypes and diseases. The genome contains a large number of variants that change only a few bases, such as SNVs and indels. These are the most common type of heritable human variation and account for more than 90% of known variants. The genome sequences of different individuals share 99.99% homology, and SNVs and indels play a decisive role in the individual characteristics and phenotypes of an organism.
With advances in high-throughput sequencing, more and more bioinformatics tools have been developed to detect genomic variants. The family of variant detection methods represented by GATK is widely used and performs well on single-sample variant detection. Different populations differ in inheritance and variation and therefore show different genetic characteristics. To characterize the differences between populations and the frequency of variants within a population, joint variant detection must be performed on large cohorts. The essence of large-scale genome projects and personal genome sequencing is to use highly reliable, high-precision data analysis algorithms to decode the genetic and variation information contained in individual and population genomes, to characterize the differences between populations, to discover the causal or susceptibility genes of diseases, to reveal the molecular mechanisms of disease occurrence and development, and to achieve precise diagnosis and treatment.
Large-scale genomic data analysis currently relies mainly on joint variant detection. The mainstream approach is the BWA+GATK workflow developed by research institutions in the United States, the United Kingdom, and other countries, which has formed a closed algorithm ecosystem. However, this approach still has problems with alignment efficiency, variant detection sensitivity, and accuracy, and it is increasingly difficult to adapt to genome projects of ever-growing scale.
Summary of the Invention
The purpose of the present invention is to solve the low alignment efficiency, weak variant detection sensitivity, and insufficiently accurate results of current population joint variant detection methods, and to that end it proposes a pan-genome-based method for joint variant detection in populations.
The specific process of the pan-genome-based population joint variant detection method is as follows:
Step 1: Obtain a reference genome and a tested sample genome, use them to identify candidate variant sites on the tested sample genome, and classify the candidate variant sites:
First, the reference genome is divided into interval blocks, and the tested sample genome is aligned against the partitioned reference genome to obtain an alignment information set. The alignment records are then stacked (piled up) in the order of their coordinates on the reference genome to obtain a pileup result. Finally, statistics are computed on the pileup result and used to identify the categories of the candidate variant sites on the tested sample genome;
The categories of candidate variant sites on the tested sample genome are: high-confidence candidate variant sites, low-confidence candidate variant sites, and low-complexity-region candidate variant sites;
Step 2: Construct local haplotype sequences from the candidate variant sites according to their types, and integrate the constructed local haplotype sequences across samples to obtain an integrated haplotype sequence set;
Step 3: Use the integrated haplotype sequences to determine the genotypes of the variant sites in the tested sample genome:
First, the sequencing reads in the alignment information set are aligned to the haplotype sequence set without gaps (non-gap alignment), which gives, for each read, the sum S of the base qualities of its mismatched bases. S is then used to compute the genotype of each variant site;
Step 4: Traverse all tested sample genomes in the population and repeat steps 1 to 3 to obtain all variant genotypes in the population.
Further, obtaining the reference genome and the tested sample genome in step 1, identifying candidate variant sites on the tested sample genome, and classifying the candidate variant sites comprises the following steps:
Step 1.1: Obtain the reference genome and divide it into interval blocks, giving the partitioned reference genome B = B_1, B_2, …, B_n;
where B_1, B_2, …, B_n denote the individual interval blocks and n is the total number of interval blocks;
Each interval block is written in the format "chromosome:start-end";
The interval blocks do not overlap;
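The block partition of Step 1.1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `partition_reference`, the `block_size` parameter, and the 1-based inclusive coordinates are assumptions, since the text only fixes the "chromosome:start-end" format and the absence of overlap.

```python
def partition_reference(chrom_lengths, block_size):
    """Split each chromosome into non-overlapping 'chrom:start-end' blocks.

    Coordinates are 1-based and inclusive (an assumption; the source only
    specifies the 'chromosome:start-end' string format).
    """
    blocks = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + block_size - 1, length)
            blocks.append(f"{chrom}:{start}-{end}")
            start = end + 1  # the next block starts right after this one, so no overlap
    return blocks


# Hypothetical chromosome lengths, chosen only to show the last, shorter block.
blocks = partition_reference({"chr1": 2500, "chr2": 1000}, block_size=1000)
for block in blocks:
    print(block)
```

Because each block starts one base after the previous block ends, every reference position belongs to exactly one block, which lets the blocks be processed independently.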
Step 1.2: Obtain the tested sample genome and align it against the partitioned reference genome to obtain the set of alignment records of the tested sample genome that fall within block B_i;
Step 1.3: Stack (pile up) all alignment records within B_i obtained in Step 1.2 in the order of their coordinates on the reference genome to obtain the pileup result;
Step 1.4: Within each interval block of the partitioned reference genome, compute statistics on the pileup result obtained in Step 1.3 and record the sequence information of the pileup result in a dictionary variable. The candidate variant sites of the tested sample are identified from the recorded information and then classified.
Further, stacking all alignment records within B_i obtained in Step 1.2 in the order of their coordinates on the reference genome, as in Step 1.3, comprises the following steps:
First, alignment records that are unmapped, secondary (suboptimal), or supplementary (split) are filtered out according to the value of the 'flag' field; records whose 'mapq' field is less than 10 are then filtered out, giving the filtered alignment information set;
Here the 'flag' value is 4 for unmapped records, 256 for secondary alignments, and 2048 for supplementary (split) alignments;
Then, an alignment analysis tool is used to stack the filtered alignment records in the order of their coordinates on the reference genome, giving the pileup result;
The pileup result contains: the reference sequence name, the genomic coordinate, the reference base, the number of sequencing reads covering the current position, the bases observed at the current position, and the ASCII-encoded base qualities at the current position.
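The filtering rule above can be sketched in a few lines. This sketch assumes that 'flag' is the SAM bitwise FLAG field, so the values 4, 256, and 2048 are tested as bits rather than compared for equality. The record tuples and the name `keep_alignment` are illustrative only.

```python
# SAM FLAG bits named in the text: unmapped, secondary, supplementary.
UNMAPPED, SECONDARY, SUPPLEMENTARY = 4, 256, 2048
EXCLUDE_MASK = UNMAPPED | SECONDARY | SUPPLEMENTARY
MIN_MAPQ = 10  # records with mapq below this value are discarded


def keep_alignment(flag, mapq):
    """Return True if the record survives both the flag filter and the mapq filter."""
    return (flag & EXCLUDE_MASK) == 0 and mapq >= MIN_MAPQ


# Hypothetical (name, flag, mapq) records.
records = [
    ("r1", 0, 60),     # primary, well mapped: kept
    ("r2", 4, 0),      # unmapped: dropped
    ("r3", 256, 30),   # secondary: dropped
    ("r4", 2048, 60),  # supplementary: dropped
    ("r5", 16, 9),     # mapq below 10: dropped
    ("r6", 99, 10),    # paired primary alignment, mapq exactly 10: kept
]
kept = [name for name, flag, mapq in records if keep_alignment(flag, mapq)]
print(kept)
```

Testing the bits matters because a real FLAG value usually combines several properties; for example 99 marks a properly paired first-in-pair read and sets none of the three excluded bits.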
Further, computing statistics on the pileup result within each interval block, recording the pileup sequence information in a dictionary variable, identifying the candidate variant sites of the tested sample from the recorded information, and classifying them, as in Step 1.4, comprises the following steps:
Step 1.4.1: Within each interval block of the partitioned reference genome, compute statistics on the pileup result obtained in Step 1.3, and use a dictionary variable to record the bases, inserted sequences, and deleted sequences observed at each genomic position in the block, together with the frequency of each of them:
First, the bases and the inserted and deleted base sequences observed at each genomic position in the interval block are recorded in a dictionary variable;
The dictionary variable is pileup_dict;
pileup_dict holds the number of occurrences of each base and of each inserted or deleted base sequence;
Then, the frequency of each base is computed from the occurrence counts recorded in pileup_dict:
$P_{j1} = C_{j1} / T, \quad j1 \in \text{keys}$
where T is the total number of occurrences of all bases at the position, keys = {A, C, G, T, I, D} is the set of base keys (with I and D denoting insertions and deletions), j1 is one element of keys, and C_{j1} is the number of occurrences of j1;
Step 1.4.2: Use the frequencies obtained in Step 1.4.1 to find the candidate variant sites:
If at least one non-reference base at a genomic position occurs with frequency P_{j1} > T_alt, that position is a candidate variant site;
where T_alt is the frequency threshold;
Each candidate variant site is represented by a 4-tuple: genomic position, reference base, non-reference base, and frequency of the non-reference base sequence;
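Steps 1.4.1 and 1.4.2 can be sketched as below. The layout of `pileup_dict` (position mapped to a `Counter` over the keys A, C, G, T, I, D) is an assumption; the source names the variable but not its structure, and it records full inserted and deleted sequences, which this sketch collapses into the single keys I and D.

```python
from collections import Counter


def candidate_sites(pileup_dict, ref_bases, t_alt):
    """Return 4-tuples (position, ref base, non-ref key, frequency) with frequency > t_alt."""
    candidates = []
    for pos in sorted(pileup_dict):
        counts = pileup_dict[pos]
        total = sum(counts.values())  # T in the formula P = C / T
        if total == 0:
            continue
        for key, count in counts.items():
            if key == ref_bases[pos]:
                continue  # only non-reference observations can trigger a candidate
            freq = count / total
            if freq > t_alt:
                candidates.append((pos, ref_bases[pos], key, freq))
    return candidates


# Hypothetical pileup counts for three positions.
pileup_dict = {
    100: Counter({"A": 8, "G": 2}),  # G at frequency 0.2
    101: Counter({"C": 10}),         # reference only
    102: Counter({"T": 5, "I": 5}),  # insertion at frequency 0.5
}
ref_bases = {100: "A", 101: "C", 102: "T"}
sites = candidate_sites(pileup_dict, ref_bases, t_alt=0.15)
print(sites)
```

A position with several non-reference keys above the threshold yields several tuples, which is one way a candidate multi-allelic site could be recognized later.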
Step 1.4.3: Classify the candidate variant sites obtained in Step 1.4.2.
Further, the classification of the candidate variant sites in Step 1.4.3 is specifically:
The candidate variant sites are divided into three categories: high-confidence candidate variant sites (HCs), low-confidence candidate variant sites (LCs), and low-complexity-region candidate variant sites (LCCs);
A candidate variant site is an HC when it satisfies all three of the following conditions:
(1) The frequency P_HCs of some non-reference base is greater than the second frequency threshold T_HCs, i.e. P_HCs > T_HCs;
(2) The window centered on the position of the current candidate variant site on the tested sample genome and extending w bp upstream and downstream contains no other candidate variant site;
(3) The position of the current candidate variant site on the tested sample genome is not in a low-complexity region, where low-complexity regions include: gap regions remaining after genome assembly; constitutive heterochromatin domains; segmental duplications; pseudo-autosomal regions of the sex chromosomes; and short tandem repeats;
A candidate variant site is an LC when it satisfies any one of the following four conditions:
(1) The frequency P_LCs of some non-reference base lies between the user-set third frequency threshold T_LCs and the second frequency threshold T_HCs, i.e. T_LCs < P_LCs < T_HCs;
(2) The current candidate variant site is a candidate multi-allelic variant site;
(3) The window centered on the position of the current candidate variant site on the tested sample genome and extending w bp upstream and downstream contains other candidate variant sites;
(4) The current position is not in a low-complexity region;
A candidate variant site that satisfies neither the HCs criteria nor the LCs criteria is classified as an LCC.
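The three-way classification can be expressed as a small decision function. This is a sketch under stated assumptions: the boolean inputs are taken as precomputed, the function and parameter names are hypothetical, and LC condition (4) is read as "the position is not in a low-complexity region", which is the reading under which the LCC class is non-empty.

```python
def classify_site(freq, in_low_complexity, has_neighbor_in_window,
                  is_multiallelic, t_hcs, t_lcs):
    """Classify one candidate site as 'HC', 'LC', or 'LCC'."""
    # HC: all three conditions must hold.
    if freq > t_hcs and not has_neighbor_in_window and not in_low_complexity:
        return "HC"
    # LC: any one of the four conditions is enough.
    if (t_lcs < freq < t_hcs
            or is_multiallelic
            or has_neighbor_in_window
            or not in_low_complexity):  # assumed reading of condition (4)
        return "LC"
    # Everything else falls into the low-complexity-region class.
    return "LCC"


# Hypothetical thresholds T_HCs = 0.8 and T_LCs = 0.2.
print(classify_site(0.9, False, False, False, t_hcs=0.8, t_lcs=0.2))  # isolated, frequent
print(classify_site(0.5, False, False, False, t_hcs=0.8, t_lcs=0.2))  # intermediate frequency
print(classify_site(0.9, True, False, False, t_hcs=0.8, t_lcs=0.2))   # in a low-complexity region
```

Checking the HC rule first matters: an HC site also satisfies LC condition (4) under this reading, so the order of the tests decides its class.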
Further, constructing local haplotype sequences from the candidate variant sites according to their types and integrating them across samples to obtain the integrated haplotype sequence set, as in step 2, comprises the following steps:
Step 2.1: Divide each interval block of the partitioned reference genome into fixed-length windows, traverse all candidate variant sites in the block, and assign every candidate variant site in the block to its optimal window:
First, each interval block of the partitioned reference genome is divided into windows of length W, and the overlap length between adjacent windows is set;
Then, all candidate variant sites in each interval block are traversed and assigned to their optimal windows:
When the offset of a candidate variant site relative to the window start position lies within the range [bounds missing in the source text], the current window is the optimal window for that candidate variant site;
Step 2.2: In each window, construct the haplotype sequences of the tested sample according to the types of the candidate variant sites assigned to that window;
Step 2.3: Integrate the tested-sample haplotype sequences within each window to obtain the integrated haplotype sequence set:
First, within one window, all tested-sample haplotype sequences are traversed and de-duplicated according to the variants they contain, producing the set of non-redundant sample haplotype sequences in the window;
Then, all windows are traversed and the haplotype sequence set h = {h_1, h_2, …, h_n} of each window is generated in turn.
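The cross-sample integration in Step 2.3 amounts to de-duplicating haplotypes by the variants they carry. A minimal sketch, assuming each haplotype is represented as a collection of hypothetical `(position, ref, alt)` tuples and that the reference haplotype is the empty collection:

```python
def merge_window_haplotypes(sample_haplotypes):
    """Merge per-sample haplotypes of one window into a non-redundant list.

    sample_haplotypes maps a sample name to a list of haplotypes; each
    haplotype is an iterable of (position, ref, alt) variants.
    """
    seen = set()
    merged = []
    for sample in sorted(sample_haplotypes):
        for haplotype in sample_haplotypes[sample]:
            key = tuple(sorted(haplotype))  # order-independent identity of the haplotype
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical window with two samples that share one haplotype.
window = {
    "sampleA": [[(105, "A", "G")], []],
    "sampleB": [[(105, "A", "G")], [(105, "A", "G"), (130, "C", "T")]],
}
for haplotype in merge_window_haplotypes(window):
    print(haplotype)
```

Keying on the sorted variant tuple rather than on the full sequence string keeps the comparison cheap, at the cost of assuming all haplotypes in a window share the same reference background.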
Further, constructing the tested-sample haplotype sequences in each window according to the types of the candidate variant sites assigned to that window, as in Step 2.2, comprises the following steps:
Step 2.2.1: For each window, determine whether it contains candidate variant sites. If it contains none, the reference genome sequence is the haplotype sequence of the window. If it contains candidate variant sites, determine whether any of them is a low-confidence candidate variant site: if so, execute Step 2.2.2 to construct the tested-sample haplotype sequences; if the window contains only high-confidence candidate variant sites, execute Step 2.2.3 to construct them;
Step 2.2.2: Use a de Bruijn graph-based assembly method within the window to obtain the tested-sample haplotype sequences:
Extract all sequencing reads in the window and filter out reads that overlap the window by less than 10%;
Set the starting k-mer size to 41 bp and build a local de Bruijn graph from the remaining reads. A depth-first search identifies all possible consensus sequences in the local de Bruijn graph; these are ranked by how strongly the sequencing reads support them, and the contigs with the highest support are selected;
Add all contigs generated at k = 41 to the sequencing reads, set k = 46, rebuild the local de Bruijn graph, and regenerate the contigs;
With a step of 5 bp and a final k-mer size of 75 bp, repeat the above operation; the contigs generated in the last iteration are the tested-sample haplotype sequences;
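The iterative assembly of Step 2.2.2 can be illustrated with a toy de Bruijn assembler. This is a simplified sketch, not the patent's assembler: nodes are k-mers, every source-to-sink path found by depth-first search is returned, and the ranking of contigs by read support is omitted. All function names are hypothetical.

```python
from collections import defaultdict


def build_graph(seqs, k):
    """Build a de Bruijn graph whose nodes are k-mers linked to their successor k-mers."""
    graph = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k):
            graph[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return graph


def enumerate_contigs(graph):
    """Depth-first search from every source node; each maximal path spells one contig."""
    targets = {t for successors in graph.values() for t in successors}
    sources = sorted(node for node in graph if node not in targets)
    contigs = []

    def dfs(node, path, seen):
        successors = [t for t in sorted(graph.get(node, ())) if t not in seen]
        if not successors:  # sink reached, or only already-visited nodes remain
            contigs.append(path[0] + "".join(kmer[-1] for kmer in path[1:]))
            return
        for t in successors:
            dfs(t, path + [t], seen | {t})

    for source in sources:
        dfs(source, [source], {source})
    return contigs


def iterative_assembly(reads, k_start=41, k_end=75, step=5):
    """Re-assemble with growing k, feeding the contigs of each round into the next."""
    seqs = list(reads)
    contigs = []
    for k in range(k_start, k_end + 1, step):
        contigs = enumerate_contigs(build_graph(seqs, k))
        seqs = list(reads) + contigs  # contigs join the reads for the next k
    return contigs


# Toy reads and small k values, used only so the example stays readable.
print(iterative_assembly(["ACGTTGC", "GTTGCATC"], k_start=4, k_end=6, step=1))
```

With the default arguments the loop visits k = 41, 46, …, 71, the largest value in the 5 bp schedule that does not exceed 75. Small k values tolerate low coverage while large k values resolve repeats, which is the usual motivation for carrying contigs from one round into the next.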
Step 2.2.3: Use an alignment analysis tool to extract from the alignment information set all sequencing reads covering the current window, and obtain the IDs of the reads that support each candidate variant site. Take the intersection of the read IDs supporting any two candidate variant sites. If the intersection contains at least the threshold number of reads, the variants at the two sites are considered to lie on the same haplotype, and the haplotype sequence is constructed by replacing the bases at the corresponding positions of the reference sequence with the variant bases. If two candidate variant sites are not supported by any common read, they are considered to come from different haplotypes, and a separate haplotype sequence is constructed for each from the candidate variants it contains and the reference genome sequence.
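The read-sharing rule of Step 2.2.3 can be sketched as follows. The sketch makes several assumptions: supporting read IDs are supplied as sets keyed by site position, sites are linked transitively into groups, and only single-base substitutions are applied to the reference. The names `phase_sites` and `build_haplotype` are hypothetical.

```python
def phase_sites(site_reads, min_shared):
    """Group sites whose supporting read-ID sets share at least min_shared reads."""
    groups = []
    for site in sorted(site_reads):
        merged = [site]
        untouched = []
        for group in groups:
            linked = any(
                len(site_reads[site] & site_reads[other]) >= min_shared
                for other in group
            )
            if linked:
                merged = group + merged  # the site joins (and may bridge) this group
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return [sorted(group) for group in groups]


def build_haplotype(ref_seq, window_start, snvs):
    """Substitute variant bases into the window's reference sequence (SNVs only)."""
    seq = list(ref_seq)
    for pos, alt in snvs:
        seq[pos - window_start] = alt
    return "".join(seq)


# Hypothetical supporting read IDs per candidate site.
site_reads = {
    100: {"r1", "r2", "r3"},
    120: {"r2", "r3", "r4"},
    150: {"r7", "r8"},
}
print(phase_sites(site_reads, min_shared=2))
print(build_haplotype("ACGTACGT", 100, [(102, "T")]))
```

Sites 100 and 120 share two reads and end up on one haplotype, while site 150 shares none and forms its own. The source leaves open how to treat pairs that share some reads but fewer than the threshold; this sketch keeps them apart.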
Further, using the integrated haplotype sequences to determine the genotypes of the variant sites in the tested sample genome, as in step 3, comprises the following steps:
Step 3.1: Align the sequencing reads in the alignment information set to the haplotype sequence set obtained in step 2 without gaps (non-gap alignment), and obtain for each read the sum S of the base qualities of its mismatched bases:
Step 3.1.1: Group the candidate variant sites and the haplotype sequences of each window:
First, select from the haplotype sequence set all haplotype sequences that contain candidate variant sites, denoted H; the remaining haplotype sequences contain no candidate variants and are denoted R;
Then, group the candidate variant sites of the tested sample: if the distance between two candidate variant sites is less than the distance threshold, the two sites are placed in the same group;
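The grouping of candidate sites by distance can be sketched as a single pass over sorted positions. This assumes that the rule is applied to neighboring sites and chained, so a group may span more than the threshold overall; the source states the pairwise rule only.

```python
def group_sites(positions, max_dist):
    """Chain sorted positions into groups while neighbors are closer than max_dist."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < max_dist:
            groups[-1].append(pos)  # close enough to the previous site: same group
        else:
            groups.append([pos])    # otherwise start a new group
    return groups


# Hypothetical site positions and a 10 bp distance threshold.
print(group_sites([100, 105, 130, 133, 200], max_dist=10))
```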
Step 3.1.2: Use the grouped candidate variant sites and haplotype sequences of each window to obtain, for each sequencing read, the sum S of the base qualities of its mismatched bases under non-gap alignment:
First, for each group of candidate variant sites obtained in Step 3.1.1, select from H all haplotype sequences that contain the variants of the group, denoted h, and extract all sequencing reads in the window from the alignment information set;
Then, the exact matches of all sequencing reads and haplotype sequences against the reference sequence are mapped, the longest exact-match block is taken, and the alignment start position is set to p;
Then, starting from the alignment start position, the alignment is extended one base at a time to the left and to the right. Whenever a base mismatch is encountered during extension, the base quality of the mismatched base is accumulated. Extension stops when it reaches either end of the sequence or when the accumulated mismatch base quality exceeds the base quality threshold;
Finally, the non-redundant haplotype sequences selected from R are denoted r, and the same non-gap base extension is performed against each of them; the resulting sum of mismatched base qualities of each sequencing read under non-gap alignment is denoted S;
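The non-gap extension can be sketched as below. The anchor (the start of the longest exact-match block) is taken as an input, because the seed search itself is not reproduced here. Base qualities are assumed to be integer Phred scores, and extending right before left is a simplification of the two-sided extension in the text.

```python
def nongap_mismatch_quality(read, quals, hap, read_anchor, hap_anchor, max_quality_sum):
    """Extend an ungapped alignment from an anchor and sum mismatched base qualities.

    Extension stops early once the accumulated quality exceeds max_quality_sum.
    """
    total = 0

    # Extend to the right, starting at the anchor itself.
    i, j = read_anchor, hap_anchor
    while i < len(read) and j < len(hap):
        if read[i] != hap[j]:
            total += quals[i]
            if total > max_quality_sum:
                return total
        i += 1
        j += 1

    # Extend to the left, starting just before the anchor.
    i, j = read_anchor - 1, hap_anchor - 1
    while i >= 0 and j >= 0:
        if read[i] != hap[j]:
            total += quals[i]
            if total > max_quality_sum:
                return total
        i -= 1
        j -= 1

    return total


# Hypothetical read placed at offset 2 of a haplotype; one mismatch at read index 3.
read, quals = "ACGTAC", [30] * 6
hap = "TTACGAACTT"
print(nongap_mismatch_quality(read, quals, hap, read_anchor=0, hap_anchor=2, max_quality_sum=100))
```

Weighting mismatches by base quality means a mismatch at a low-quality base, which is more likely a sequencing error, penalizes the read-to-haplotype fit less than a mismatch at a confident base.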
Step 3.2: Use the sum S of mismatched base qualities of each sequencing read under non-gap alignment, obtained in Step 3.1, to determine the genotype of each variant site.
Further, determining the genotype of each variant site from S, as in Step 3.2, comprises the following steps:
Step 3.2.1: Use the sum S of mismatched base qualities of each sequencing read, obtained in Step 3.1, to obtain the optimal alignment probability set of each read after all reads have been aligned to the haplotype sequences in h and r [symbol missing in the source text];
The set is obtained as follows: let the list of variant-containing haplotype sequences be h' = {h_1, h_2, ..., h_n'}, and let the set of probabilities of a sequencing read aligning to these haplotypes be p = {p_1, p_2, ..., p_n'}. The maximum probability p_max is chosen as the probability that the current read aligns to a variant-containing haplotype, and the corresponding haplotype is h_max;
where k ∈ [1, m] indexes the probabilities in the optimal alignment probability set of each sequencing read, m is the total number of probabilities obtained, and n' is the number of haplotypes in the window;
p_j is obtained by the following formula: [formula missing in the source text]
where j ∈ [1, n'] indexes the probabilities of the sequencing read aligning to the n' haplotypes;
Step 3.2.2: Use the P and P_r obtained in Step 3.2.1 to compute the likelihoods of the different genotypes, and take the genotype with the largest likelihood as the genotype of the variant in the current window, according to the following formula: [formula missing in the source text]
Further, traversing the genomes of all tested samples in the population in Step 4 and repeating Steps 1 to 3 to obtain all variant genotypes in the population is specifically:
The candidate variant sites in all windows are traversed. For a given variant site, if no sample shows a variant signal at the site, or samples show a signal but the detected genotype is 0/0, no individual in the current population is considered to carry the variant and nothing is output. If any sample has genotype 0/1 or 1/1 at the site, all samples are traversed in turn and the genotype of every tested sample at the site is output. If a site is multi-allelic, it is split into multiple biallelic records that are output in turn.
The beneficial effects of the present invention are:
The present invention uses the large amount of local haplotype information in population genomes to realign sequencing reads and uses population information for statistical feature analysis, which effectively improves the sensitivity and accuracy of variant detection; in addition, it exploits the fast alignment of local haplotypes to significantly improve the efficiency of variant detection.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of the present invention.
DETAILED DESCRIPTION
Embodiment 1: as shown in FIG. 1, the pan-genome-based population joint variation detection method of this embodiment proceeds as follows:
Step 1: a reference genome and the genome of a tested sample are obtained, candidate variant sites on the tested sample genome are identified from the two, and the candidate variant sites are classified, comprising the following steps:
Step 11: the reference genome is obtained and partitioned into interval blocks, giving the partitioned reference genome $B = \{B_1, B_2, \ldots, B_n\}$:
Each interval block is 10 Mbp long;
Interval blocks are specified in the format "chromosome:start-end";
Interval blocks do not overlap;
where $B_1, B_2, \ldots, B_n$ denote the interval blocks, n is the total number of interval blocks, and $B_i$ with $i = 1, \ldots, n$ denotes the i-th block;
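The block partition of Step 11 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `partition_blocks`, the input dictionary of contig lengths and the 1-based inclusive coordinates are assumptions made for the example.

```python
def partition_blocks(contig_lengths, block_size=10_000_000):
    """Split each contig into consecutive, non-overlapping blocks.

    Returns region strings of the form 'chromosome:start-end'.
    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    blocks = []
    for name, length in contig_lengths.items():
        for start in range(1, length + 1, block_size):
            end = min(start + block_size - 1, length)  # last block may be shorter
            blocks.append(f"{name}:{start}-{end}")
    return blocks


# Hypothetical contig lengths, used only to show the output format.
blocks = partition_blocks({"chr1": 25_000_000, "chrM": 16_569})
```

The final block of each contig is simply truncated at the contig end, so block lengths other than the last are exactly 10 Mbp.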
Step 12: the tested sample genome is obtained and aligned to the partitioned reference genome, the alignment positions of the tested sample genome that fall within $B_i$ are obtained, and from them the set of alignment records of the tested sample genome within $B_i$;
The alignment records are stored in one of three file formats: SAM, BAM or CRAM;
Step 13: all alignment records within $B_i$ obtained in Step 12 are stacked in order of their coordinates on the reference genome to obtain the stacking (pileup) result:
First, before stacking (projection), alignment records are filtered by the value of the 'flag' field: unmapped records (flag = 4), secondary alignments (flag = 256) and supplementary, i.e. split, alignments (flag = 2048) are removed; records whose 'mapq' value is less than 10 are also removed, giving the filtered set of alignment records;
Then, the filtered set of alignment records is passed to the mpileup module of SAMtools, an existing tool for analysing alignment results, which stacks the records in order of their coordinates on the reference genome (pileup), i.e. projects them onto reference positions, giving the stacking result;
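The filter applied before stacking can be sketched as below. The SAM flag is a bit field, so this sketch assumes that "flag = 4", "flag = 256" and "flag = 2048" mean that the corresponding bit is set, not that the whole field equals that number; the function name `keep_alignment` is illustrative.

```python
UNMAPPED = 0x4          # flag bit 4: read is unmapped
SECONDARY = 0x100       # flag bit 256: secondary alignment
SUPPLEMENTARY = 0x800   # flag bit 2048: supplementary (split) alignment


def keep_alignment(flag, mapq, min_mapq=10):
    """Return True if an alignment record passes the pre-pileup filter."""
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # Records with mapq below the threshold are removed; mapq == 10 is kept.
    return mapq >= min_mapq
```

Under the same assumption, the filter should correspond to `samtools view -F 2308 -q 10`, since 4 + 256 + 2048 = 2308.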
The stacking result describes, for every position of the reference genome, the base coverage and the corresponding base qualities. It has 6 columns: reference sequence name, genomic coordinate, reference base, number of sequencing reads covering the position, base string at the position, and ASCII-encoded base-quality string at the position.
Candidate variants of a sample genome show up mainly in the base string at the position, which encodes matches, mismatches, insertions, deletions, the aligned strand and the mapping quality, and is therefore relatively complex. It is read as follows:
1) '.' is a match to the reference on the forward strand;
2) ',' is a match to the reference on the reverse strand;
3) one of 'ATCGN' is a mismatch on the forward strand;
4) one of 'atcgn' is a mismatch on the reverse strand;
5) '*' is a deleted base, i.e. a placeholder at a position covered by a deletion reported at a preceding position;
6) '^' marks the start of a read; the ASCII code of the character immediately after '^' minus 33 is the mapping quality; these two characters modify the base that follows them, which is the first base of the read;
7) '$' marks the end of a read and modifies the base before it;
8) the regular expression '\+[0-9]+[ACGTNacgtn]+' is a sequence inserted after this position, e.g. '+3agg' is an insertion of 3 bases;
9) the regular expression '-[0-9]+[ACGTNacgtn]+' is a sequence deleted after this position, e.g. '-4CTGA' is a deletion of 4 bases.
Note that after removing '^' together with the character that follows it, '$', '-[0-9]+[ACGTNacgtn]+' and '\+[0-9]+[ACGTNacgtn]+' from the fifth column, the number of remaining bases equals the number of base-quality ASCII characters in the sixth column.
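A small parser makes the encoding of the fifth column concrete. The sketch below is an illustration under stated assumptions, not the patent's code: the function name `tally_pileup_bases` and the example base string are invented for the example. It reads the declared indel length explicitly, because a greedy regular expression such as `\+[0-9]+[ACGTNacgtn]+` would also swallow mismatch letters that follow the inserted sequence.

```python
import re
from collections import Counter


def tally_pileup_bases(ref_base, bases):
    """Count observations in the base column of one pileup line.

    Returns (counts, n_bases): counts uses the keys A/C/G/T/N/I/D, and
    n_bases is the number of per-read symbols left after stripping markers,
    which should equal the length of the base-quality column.
    """
    counts = Counter()
    n_bases = 0
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":        # read start: skip '^' and the mapping-quality character
            i += 2
        elif c == "$":      # read end marker
            i += 1
        elif c in "+-":     # indel: sign, length digits, then exactly that many bases
            digits = re.match(r"\d+", bases[i + 1:]).group()
            counts["I" if c == "+" else "D"] += 1
            i += 1 + len(digits) + int(digits)
        else:
            if c in ".,":                  # match on either strand
                counts[ref_base.upper()] += 1
            elif c.upper() in "ACGTN":     # mismatch on either strand
                counts[c.upper()] += 1
            # other per-read symbols (for example '*') still occupy one quality slot
            n_bases += 1
            i += 1
    return counts, n_bases


# Made-up base string: 4 matches, 1 mismatch (t), one 2-bp insertion, one 1-bp deletion.
counts, n_bases = tally_pileup_bases("A", ".,^I.$+2AGt-1c,")
```

The counts returned here have the same shape as the per-position value of `pileup_dict` described in Step 141.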
Step 14: within each interval block of the partitioned reference genome, the stacking result obtained in Step 13 is tallied, the sequence information of the stacking result in each block is recorded in dictionary variables, candidate variant sites of the tested sample are identified from the recorded information, and the candidate variant sites are classified;
Step 141: within each interval block of the partitioned reference genome, the stacking result obtained in Step 13 is tallied, and two dictionary variables record, for every genomic position in the block, the observed bases, the inserted and deleted sequences, and the frequency of each;
First, a dictionary variable records the bases and the inserted and deleted base sequences observed at every genomic position in the block:
The dictionary variable is pileup_dict, which holds the number of occurrences of each base and of inserted and deleted sequences. The key is the genomic position and the value covers all possible observations at that position: matches and mismatches ("ACGTN"), insertions ("I") and deletions ("D"), together with the count of each. For example (showing the value stored for one position): pileup_dict = {"A": Ca, "C": Cc, "G": Cg, "T": Ct, "N": Cn, "I": Ci, "D": Cd}, where A, C, G and T are the four bases, I and D mark insertions and deletions, and N is an unknown base; Ca, Cc, Cg, Ct, Cn, Ci and Cd are the numbers of times each observation occurs at the position.
Then, the frequency of each base is obtained from the counts recorded in pileup_dict: $P_{j1} = C_{j1} / T$
where $T = \sum_{j1 \in keys} C_{j1}$ is the total number of occurrences of all bases, $C_{j1}$ is the number of occurrences of base j1, keys = {A, C, G, T, I, D}, and j1 is one element of keys;
Step 142: candidate variant sites are obtained from the frequencies computed in Step 141:
If at least one non-reference base at a genomic position has frequency $P_{j1} > T_{alt}$ ($j1 \in keys$), where $T_{alt}$ is a user-preset frequency threshold (default: 0.2), the position is called a candidate variant site and is represented by a 4-tuple recording the genomic position (ctgpos), the reference base (refB), the non-reference base (altB) and the frequency of the non-reference base sequence ($P_{alt}$).
Bases comprise reference bases and non-reference bases; a non-reference base indicates a possible variant;
Note that if several non-reference bases have frequencies above the threshold $T_{alt}$, the position is called a candidate multi-allelic site, and all of its possible non-reference bases are recorded.
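Steps 141 and 142 amount to turning per-position counts into frequencies and thresholding the non-reference ones. The following sketch assumes a count dictionary with the keys described for pileup_dict; the function name `call_candidates` and the example counts are hypothetical.

```python
KEYS = ("A", "C", "G", "T", "I", "D")


def call_candidates(ctgpos, ref_base, counts, t_alt=0.2):
    """Return one (ctgpos, refB, altB, P_alt) tuple per non-reference
    observation whose frequency exceeds t_alt."""
    total = sum(counts.get(k, 0) for k in KEYS)   # T in the text
    if total == 0:
        return []
    candidates = []
    for k in KEYS:
        if k == ref_base:
            continue
        freq = counts.get(k, 0) / total           # P_j1 = C_j1 / T
        if freq > t_alt:
            candidates.append((ctgpos, ref_base, k, freq))
    return candidates


single = call_candidates(100, "A", {"A": 5, "T": 3, "G": 2})
multi = call_candidates(100, "A", {"A": 4, "T": 3, "G": 3})
is_multi_allelic = len(multi) > 1
```

A position that yields more than one tuple is a candidate multi-allelic site in the sense of the paragraph above. The comparison is strict, so an allele at exactly the threshold frequency is not reported.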
Step 143: the candidate variant sites obtained in Step 142 are classified:
Candidate variant sites are divided into three classes: high-confidence candidates (HCs), low-confidence candidates (LCs) and low-complexity candidates (LCCs);
A candidate variant site is classified as an HC when it has all three of the following features:
(1) the frequency $P_{HCs}$ of the non-reference base sequence is greater than a user-preset second frequency threshold $T_{HCs}$, i.e. $P_{HCs} > T_{HCs}$ (default: 0.5);
(2) the window centred on the position of the current candidate variant site on the tested sample genome and extending w bp upstream and downstream (default w = 100 bp) contains no other candidate variant site. When at least two candidate variant sites fall in the same window, systematic bias of the alignment backbone or of the Smith-Waterman scoring scheme can shift the positions and alter the variant sequences of nearby candidate sites.
(3) the current genomic position does not lie in a low-complexity region. Low-complexity regions include: gap regions left after genome assembly; constitutive heterochromatin domains; segmental duplications; the pseudo-autosomal regions of the sex chromosomes; and short tandem repeats;
The reference genome is assembled from sequencing reads and is not one complete sequence; it contains some unresolved regions, which are called gaps;
The gap regions left after genome assembly include centromeres and telomeres.
A candidate variant site is classified as an LC when it has one of the following four features:
(1) the non-reference base frequency $P_{LCs}$ lies between the user-preset third frequency threshold $T_{LCs}$ and the second frequency threshold $T_{HCs}$, i.e. $T_{LCs} < P_{LCs} < T_{HCs}$; $T_{LCs}$ defaults to 0.2.
(2) the current candidate variant site is a candidate multi-allelic site.
(3) the window centred on the position of the current candidate variant site on the tested sample genome and extending w bp upstream and downstream (default w = 100 bp) contains other candidate variant sites.
(4) the current position does not lie in a low-complexity region.
Otherwise, when a candidate variant site satisfies neither the HC nor the LC criteria, it is classified as an LCC.
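The three-way classification can be written as a short decision function. The text lists "not in a low-complexity region" as one of four LC features, which leaves the exact logic open; the sketch below assumes one reading, namely that HC is tested first, that an LC must lie outside low-complexity regions and additionally show at least one of features (1) to (3), and that everything else is an LCC. The function and argument names are illustrative.

```python
def classify_site(freq, has_neighbor, is_multi_allelic, in_low_complexity,
                  t_hcs=0.5, t_lcs=0.2):
    """Classify a candidate variant site as 'HC', 'LC' or 'LCC'.

    freq              -- frequency of the non-reference allele
    has_neighbor      -- another candidate lies within +/- w bp (default w = 100)
    is_multi_allelic  -- more than one non-reference allele passed T_alt
    in_low_complexity -- site lies in an annotated low-complexity region
    """
    # HC: all three conditions must hold.
    if freq > t_hcs and not has_neighbor and not in_low_complexity:
        return "HC"
    # LC (assumed reading): outside low-complexity regions and any of (1)-(3).
    if not in_low_complexity and (
        t_lcs < freq < t_hcs or is_multi_allelic or has_neighbor
    ):
        return "LC"
    return "LCC"
```

Under this reading, every candidate inside a low-complexity region ends up as an LCC regardless of its allele frequency.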
Step 2: local haplotype sequences are constructed according to the class of the candidate variant sites, and the constructed local haplotype sequences are integrated across samples to obtain the integrated haplotype sequences, comprising the following steps:
Step 21: each interval block of the partitioned reference genome is divided into fixed-length windows, all candidate variant sites in the block are traversed, and each is assigned to its optimal window:
Because the blocks of the reference genome sequence are long and contain structural variation, variation in some regions is complex and haplotype sequences are difficult to construct over a whole block. The present invention therefore divides each interval block into windows of fixed length W = 300 bp, with adjacent windows overlapping by 150 bp, and constructs local haplotype sequences within each window for local variant detection.
All candidate variant sites in the interval block are traversed and each is assigned to an optimal window, i.e. one in which the candidate site lies as close to the centre as possible. Specifically, a window is considered optimal for a candidate variant site when the offset of the site from the window start lies within a given range [range not reproduced in the source text].
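Window assignment can be sketched as follows. Because the admissible offset range is not reproduced in the source text, this sketch substitutes a different but related rule as an explicit assumption: among the windows that contain the site, it picks the one in which the site is closest to the window centre. The function name `optimal_window` and the 1-based inclusive coordinates are also assumptions.

```python
def optimal_window(pos, block_start, window=300, overlap=150):
    """Return (start, end) of the window in which `pos` is most central.

    Windows start every (window - overlap) bp from block_start.
    Truncation at the end of the block is ignored in this sketch.
    """
    step = window - overlap
    last = (pos - block_start) // step   # index of the last window starting at or before pos
    starts = [block_start + i * step for i in (last - 1, last) if i >= 0]
    starts = [s for s in starts if pos < s + window]   # keep windows that contain pos
    best = min(starts, key=lambda s: abs((pos - s) - window / 2))
    return best, best + window - 1
```

With W = 300 bp and a 150 bp overlap, every site away from the block edge lies in exactly two windows, so the choice is always between two candidates.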
Step 22: the haplotype sequences of the tested sample are constructed within each window according to the classes of the candidate variant sites assigned to it;
Step 221: each window is checked for candidate variant sites. If it contains none, the reference genome sequence within the window is the haplotype sequence of the window. If it contains candidate variant sites, the window is checked for low-confidence candidates: if it contains any, Step 222 is performed to construct the haplotype sequences of the tested sample; if it contains only high-confidence candidates, Step 223 is performed to construct them;
Step 222: when the window contains low-confidence candidate variant sites, a de Bruijn graph-based assembly method is applied within the window to generate high-quality consensus sequences (contigs) of the reads around the candidate variant sites; these are the possible haplotype sequences:
First, all sequencing reads in the window are extracted, and reads that overlap the window by less than 10% are filtered out.
The starting k-mer size is set to 41 bp and a local de Bruijn graph is built; a depth-first search identifies all possible consensus sequences in the graph, the consensus sequences are ranked by the support they receive from sequencing reads, and the best-supported contigs are selected (2 by default);
All contigs generated at k = 41 are added to the sequencing reads, the k-mer size is set to 46, the local de Bruijn graph is rebuilt, and contigs are regenerated as above;
With a step of 5 bp and a final k-mer size of 75 bp, the above operation is repeated until contigs are generated in the last step.
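The iterative local assembly of Step 222 can be illustrated with a toy de Bruijn graph. This is a simplified sketch under several assumptions: nodes are (k-1)-mers, contigs are spelled by a depth-first walk from nodes with no incoming edge, and the ranking of contigs by read support and the selection of the top 2 are omitted. All function names are illustrative.

```python
from collections import defaultdict


def kmer_schedule(start=41, end=75, step=5):
    """k values used in successive rounds (values above `end` are not used)."""
    return list(range(start, end + 1, step))


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_contigs(graph):
    """Depth-first walk from every source node until no unvisited successor remains."""
    targets = {t for succ in graph.values() for t in succ}
    sources = [n for n in graph if n not in targets]
    contigs = []
    stack = [(s, s, {s}) for s in sources]
    while stack:
        node, seq, seen = stack.pop()
        nxt = [t for t in graph.get(node, ()) if t not in seen]
        if not nxt:
            contigs.append(seq)
        for t in nxt:
            stack.append((t, seq + t[-1], seen | {t}))
    return sorted(contigs)


def iterative_assembly(reads, schedule):
    """Rebuild the graph for each k, feeding the previous contigs back in as reads."""
    contigs = []
    for k in schedule:
        contigs = enumerate_contigs(build_de_bruijn(reads + contigs, k))
    return contigs
```

Starting at 41 with a step of 5, the largest k that does not exceed 75 is 71, so `kmer_schedule()` yields seven rounds. On the toy reads `["ACGTAC", "ACGGAC"]` with the schedule `[4, 5]`, the two reads differ at one base and the walk returns both as separate contigs, which is the behaviour wanted for two haplotypes.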
Step 223: when the window contains only high-confidence candidate sites, the view module of SAMtools is used to extract from the set of alignment records all sequencing reads covering the current window, and the IDs of the reads supporting each candidate variant site are collected (a read is considered to support a candidate variant site when its alignment record contains the same base sequence as the candidate site). The read IDs supporting any two candidate variant sites are intersected. If the intersection contains at least the user-defined number of reads (default: 2), the variants at the two candidate sites are considered to lie on the same haplotype sequence, and the haplotype sequence is constructed by replacing the bases at the corresponding positions of the reference sequence with the variant bases. Otherwise the variants are considered to come from different haplotype sequences, and a haplotype sequence is constructed for each from the candidate variants it carries and the reference genome sequence.
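The read-ID intersection rule of Step 223 can be sketched for the simplest case. The example below handles single-base substitutions only and links sites transitively (if A shares reads with B and B with C, all three go on one haplotype); both choices, and all names, are assumptions of this sketch.

```python
def apply_snvs(ref_window, window_start, snvs):
    """Substitute each (pos, alt) single-base variant into the reference window."""
    seq = list(ref_window)
    for pos, alt in snvs:
        seq[pos - window_start] = alt
    return "".join(seq)


def build_haplotypes(ref_window, window_start, sites, support, min_shared=2):
    """Group sites that share at least `min_shared` supporting reads and
    build one haplotype sequence per group.

    sites   -- list of (pos, alt) tuples
    support -- dict mapping each site to the set of read IDs supporting it
    """
    groups = []
    for site in sites:
        linked = [g for g in groups
                  if any(len(support[site] & support[other]) >= min_shared
                         for other in g)]
        merged = [s for g in linked for s in g] + [site]
        groups = [g for g in groups if g not in linked] + [merged]
    return [apply_snvs(ref_window, window_start, g) for g in groups]


sites = [(102, "C"), (105, "G")]
same = build_haplotypes(
    "AAAAAAAAAA", 100, sites,
    {(102, "C"): {"r1", "r2", "r3"}, (105, "G"): {"r2", "r3"}},
)
different = build_haplotypes(
    "AAAAAAAAAA", 100, sites,
    {(102, "C"): {"r1", "r2", "r3"}, (105, "G"): {"r4", "r5"}},
)
```

In the first call the two sites share reads r2 and r3, so both substitutions land on one haplotype; in the second they share none, so two haplotypes are produced, each carrying one variant.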
Step 23: the haplotype sequences of the tested samples within each window are integrated to obtain the integrated set of haplotype sequences:
First, within a window, the haplotype sequences of all tested samples are traversed and de-duplicated by the candidate variants they contain, giving the non-redundant set of sample haplotype sequences in the window. All windows are traversed, and the haplotype sequence set $h = \{h_1, h_2, \ldots, h_n\}$ of each window is generated in turn.
De-duplication is specifically: if the haplotype sequences of two samples contain the same variants, only one of them is retained. Within a window, different sites may carry different types of variants.
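De-duplication by variant content is a keyed merge. In this sketch each haplotype is represented as a pair of its variant tuples and its sequence; that representation and the function name `merge_haplotypes` are assumptions made for illustration.

```python
def merge_haplotypes(per_sample):
    """Merge per-sample haplotypes of one window, keeping one sequence per
    distinct set of variants.

    per_sample -- dict: sample name -> list of (variants, sequence), where
                  variants is an iterable of hashable variant descriptions
    """
    merged = {}
    for sample in sorted(per_sample):            # fixed order for reproducibility
        for variants, sequence in per_sample[sample]:
            merged.setdefault(frozenset(variants), sequence)
    return list(merged.values())


window_haplotypes = merge_haplotypes({
    "s1": [((), "AAAA"), (((2, "C"),), "ACAA")],
    "s2": [(((2, "C"),), "ACAA"), (((3, "G"),), "AAGA")],
})
```

Keying on the variant set, not on the sequence string, follows the text, which de-duplicates haplotypes "by the candidate variants they contain".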
Step 3: the genotypes of the variant sites in the genomes of the population samples are detected from the integrated haplotype sequences, comprising the following steps:
Step 31: the haplotype sequence set obtained in Step 2 is aligned with the sequencing reads in the set of alignment records by non-gap (ungapped) alignment, and the accumulated mismatch base quality of the non-gap alignment of each sequencing read is obtained as S:
Step 311: the candidate variant sites and haplotype sequences of each window are grouped:
First, within the window, the candidate variant site information of each tested sample is traversed in turn; according to it, all haplotype sequences containing candidate variant sites are selected from the haplotype sequence set and denoted H, and the remaining haplotypes, which contain no candidate variant, are denoted R.
Then, the candidate variant sites of the tested sample are grouped: if the distance between two candidate variant sites is less than a user-set distance threshold (default: 5 bp), the two sites are placed in the same group.
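Grouping by distance can be sketched as a single pass over sorted positions. The sketch assumes chained grouping, i.e. a site joins the current group when it is closer than the threshold to the previous site, so a group may span more than 5 bp overall; the function name is illustrative.

```python
def group_sites(positions, max_dist=5):
    """Group candidate positions whose consecutive distance is below max_dist."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < max_dist:
            groups[-1].append(pos)      # close to the previous site: same group
        else:
            groups.append([pos])        # otherwise start a new group
    return groups


site_groups = group_sites([120, 100, 107, 103])
```

The comparison is strict, matching "less than" in the text, so two sites exactly 5 bp apart fall into separate groups.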
Step 312: the accumulated mismatch base quality S of the non-gap alignment of each sequencing read is obtained from the grouped candidate variant sites and grouped haplotype sequences of each window:
First, for each group of candidate variant sites obtained in Step 311, all haplotype sequences in H that contain the variants of the group are selected and denoted h, and all sequencing reads in the window are extracted from the set of alignment records.
Then, exact matches between each sequencing read in the local region, each haplotype sequence and the reference sequence are mapped, and the maximal exact match, i.e. the longest exact-match block, is taken, with its starting alignment position recorded as p. Starting from the alignment start position, single-base extension is performed to the left and to the right; whenever a base mismatch is encountered during extension, the base quality of the mismatched base (ASCII value minus 33) is accumulated. Extension stops when either end of the sequence is reached or the accumulated mismatch base quality exceeds a user-set base-quality threshold (default: 300). Similarly, the non-redundant haplotype sequences are selected from R and denoted r, and non-gap base extension is performed against each of them. The accumulated mismatch base quality of the non-gap alignment of each sequencing read is recorded as S.
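The seed-and-extend procedure can be sketched for one read against one haplotype. The sketch makes two assumptions: the seed is found as the longest common substring by plain dynamic programming (the text does not say how exact matches are located), and the read is compared directly with the haplotype sequence. Function names are illustrative.

```python
def longest_exact_match(read, hap):
    """Return (read_start, hap_start, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(hap) + 1)
    for i in range(1, len(read) + 1):
        cur = [0] * (len(hap) + 1)
        for j in range(1, len(hap) + 1):
            if read[i - 1] == hap[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def ungapped_mismatch_quality(read, quals, hap, max_sum=300):
    """Extend the seed base by base without gaps and return S, the sum of
    Phred qualities (ASCII value minus 33) at mismatched read bases."""
    r0, h0, length = longest_exact_match(read, hap)
    total = 0
    # Extend to the left of the seed.
    i, j = r0 - 1, h0 - 1
    while i >= 0 and j >= 0 and total <= max_sum:
        if read[i] != hap[j]:
            total += ord(quals[i]) - 33
        i -= 1
        j -= 1
    # Extend to the right of the seed.
    i, j = r0 + length, h0 + length
    while i < len(read) and j < len(hap) and total <= max_sum:
        if read[i] != hap[j]:
            total += ord(quals[i]) - 33
        i += 1
        j += 1
    return total


# One mismatch at read position 4 with quality 'I' (Phred 40).
S = ungapped_mismatch_quality("ACGTTGCA", "IIIIIIII", "ACGTAGCA")
```

A read that matches a haplotype perfectly gets S = 0, so smaller S means better support for that haplotype. The quadratic seed search is acceptable here only because windows and reads are short; an indexed exact-match search would be the natural replacement at scale.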
Step 32: the genotype of each variant site is obtained from the accumulated mismatch base quality S of the non-gap alignment of each sequencing read obtained in Step 31:
Step 321: using S, after all sequencing reads have been aligned to the haplotype sequences in h and in r, the set of optimal alignment probabilities of the sequencing reads is obtained (the notation for this set is not reproduced in the source text; Step 322 refers to the sets as P and Pr).
The set is obtained as follows: let $h' = \{h_1, h_2, \ldots, h_{n'}\}$ be the list of variant-containing haplotype sequences and let the probabilities of a sequencing read aligning to the n' haplotypes be $p = \{p_1, p_2, \ldots, p_{n'}\}$; the largest probability $p_{max}$ is taken as the probability of the current read aligning to a variant-containing haplotype, and the corresponding haplotype is $h_{max}$.
where $k \in [1, m]$ indexes the probabilities in the set of optimal alignment probabilities, m is the total number of probabilities obtained, and n' is the number of haplotypes in the window;
$p_j$ is obtained by the following formula: [formula not reproduced in the source text]
where $j \in [1, n']$ indexes the probabilities of the sequencing read aligning to the n' haplotypes;
Step 322: the likelihoods of the three possible genotypes (0/0, 0/1, 1/1) are calculated from P and Pr, and the genotype with the largest likelihood is taken as the genotype of the variant in the current window, as follows: [formula not reproduced in the source text]
If the genotype is 0/1 or 1/1, the optimal haplotype in h is inferred from the probabilities in P, and the genotype of every variant contained in the optimal haplotype is set to 0/1 or 1/1 accordingly.
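Since the formulas for $p_j$ and for the genotype likelihoods are not reproduced in the source text, the sketch below does not claim to implement them. It substitutes two textbook choices, both explicit assumptions: a Phred-style conversion $p = 10^{-S/10}$ from the accumulated mismatch quality to a probability, and the standard diploid likelihood in which a heterozygous genotype draws each read from either haplotype with probability 1/2.

```python
import math


def phred_sum_to_prob(s):
    """Assumed conversion from accumulated mismatch quality S to a probability."""
    return 10 ** (-s / 10)


def call_genotype(p_alt, p_ref):
    """Pick the most likely of 0/0, 0/1, 1/1.

    p_alt[k] -- best probability of read k on a variant-containing haplotype
    p_ref[k] -- best probability of read k on a haplotype without the variant
    """
    log_l = {"0/0": 0.0, "0/1": 0.0, "1/1": 0.0}
    for pa, pr in zip(p_alt, p_ref):
        log_l["0/0"] += math.log(pr)
        log_l["0/1"] += math.log(0.5 * pa + 0.5 * pr)
        log_l["1/1"] += math.log(pa)
    return max(log_l, key=log_l.get), log_l


# Five reads fit the variant haplotype (S = 0 there, S = 40 on the reference)
# and five reads fit the reference haplotype: a heterozygous pattern.
p_alt = [phred_sum_to_prob(0)] * 5 + [phred_sum_to_prob(40)] * 5
p_ref = [phred_sum_to_prob(40)] * 5 + [phred_sum_to_prob(0)] * 5
genotype, _ = call_genotype(p_alt, p_ref)
```

Working in log space avoids underflow when many reads cover the window; the default cap of 300 on S keeps every probability strictly positive, so the logarithms are defined.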
Step 4: the genomes of all tested samples in the population are traversed and Steps 1 to 3 are repeated to obtain all variant genotypes in the population:
The candidate variant sites in all windows are traversed. For a given variant site, if no sample shows a variant signal at the site, or samples show a signal but the detected genotype is 0/0, no individual in the current population is considered to carry the variant and nothing is output. If any sample has genotype 0/1 or 1/1 at the site, all samples are traversed in turn and the genotype of every tested sample at the site is output. If a site is multi-allelic, it is split into multiple biallelic records that are output in turn.
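The output rule of Step 4 can be sketched as below. The data layout (a per-sample dictionary from alternative allele to genotype string, with a missing entry meaning no variant signal) and the function name `joint_records` are assumptions of this example.

```python
def joint_records(site, alts, genotypes):
    """Emit one biallelic record per alternative allele carried by any sample.

    site      -- genomic position of the candidate variant site
    alts      -- alternative alleles observed at the site across the population
    genotypes -- dict: sample -> dict: alt allele -> '0/0', '0/1' or '1/1'
    """
    records = []
    for alt in alts:                      # a multi-allelic site is split per allele
        per_sample = {s: g.get(alt, "0/0") for s, g in genotypes.items()}
        if all(gt == "0/0" for gt in per_sample.values()):
            continue                      # no sample carries this allele: no output
        records.append((site, alt, per_sample))
    return records


records = joint_records(
    100, ["T", "G"],
    {"s1": {"T": "0/1"}, "s2": {"T": "0/0", "G": "0/0"}, "s3": {}},
)
```

Samples without a signal at the site are reported as 0/0 in every emitted record, so each record lists a genotype for every tested sample, as the text requires.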
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211313819.4A CN115631789B (en) | 2022-10-25 | 2022-10-25 | A Pan-Genome-Based Population Joint Variation Detection Method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211313819.4A CN115631789B (en) | 2022-10-25 | 2022-10-25 | A Pan-Genome-Based Population Joint Variation Detection Method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115631789A (en) | 2023-01-20 |
CN115631789B (en) | 2023-08-15 |
Family
ID=84907572
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211313819.4A Active CN115631789B (en) | 2022-10-25 | 2022-10-25 | A Pan-Genome-Based Population Joint Variation Detection Method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115631789B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116343923B (en) * | 2023-03-21 | 2023-12-08 | 哈尔滨工业大学 | A method for identifying homology of genome structural variation |
CN116705155A (en) * | 2023-08-03 | 2023-09-05 | 海南大学三亚南繁研究院 | Definition method of whole-gene DNA data |
CN117153248B (en) * | 2023-09-05 | 2024-05-07 | 天津极智基因科技有限公司 | Gene region variation detection and visualization method and system based on pan genome |
CN117711487B (en) * | 2024-02-05 | 2024-05-17 | 广州嘉检医学检测有限公司 | Identification method and system for embryo SNV and InDel variation and readable storage medium |
CN118969073A (en) * | 2024-10-21 | 2024-11-15 | 烟台大学 | Method and system for detecting insertion or deletion variation based on allele perception |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101539967A (en) * | 2008-12-12 | 2009-09-23 | BGI Shenzhen | Method for detecting mononucleotide polymorphism |
WO2016000267A1 (en) * | 2014-07-04 | 2016-01-07 | BGI Genomics Co., Ltd. | Method for determining the sequence of a probe and method for detecting genomic structural variation |
CN108121897A (en) * | 2016-11-29 | 2018-06-05 | Huawei Technologies Co., Ltd. | A kind of genome mutation detection method and detection device |
CN112669902A (en) * | 2021-03-16 | 2021-04-16 | Beijing Berry Genomics Biotechnology Co., Ltd. | Method, computing device and storage medium for detecting genomic structural variation |
CN112802548A (en) * | 2021-01-07 | 2021-05-14 | Shenzhen GenePlus Clinical Laboratory | Method for predicting allele-specific copy number variation of single-sample whole genome |
CN113409890A (en) * | 2021-05-21 | 2021-09-17 | Yinfeng Gene Technology Co., Ltd. | HLA typing method based on next generation sequencing data |
WO2021232388A1 (en) * | 2020-05-22 | 2021-11-25 | MGI Tech Co., Ltd. | Method for determining base type of predetermined site in embryonic cell chromosome, and application thereof |
CN114496077A (en) * | 2022-04-15 | 2022-05-13 | Beijing Berry Genomics Biotechnology Co., Ltd. | Methods, devices, and media for detecting single nucleotide variations and indels |
CN114999573A (en) * | 2022-04-14 | 2022-09-02 | Harbin Yinji Technology Co., Ltd. | Genome variation detection method and detection system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013055955A1 (en) * | 2011-10-12 | 2013-04-18 | Complete Genomics, Inc. | Identification of dna fragments and structural variations |
US11913065B2 (en) * | 2012-09-04 | 2024-02-27 | Guardant Health, Inc. | Systems and methods to detect rare mutations and copy number variation |
US20190080045A1 (en) * | 2017-09-13 | 2019-03-14 | The Jackson Laboratory | Detection of high-resolution structural variants using long-read genome sequence analysis |
WO2021016441A1 (en) * | 2019-07-23 | 2021-01-28 | Grail, Inc. | Systems and methods for determining tumor fraction |
2022
- 2022-10-25: CN application CN202211313819.4A, patent CN115631789B (en), status Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101539967A (en) * | 2008-12-12 | 2009-09-23 | BGI Shenzhen | Method for detecting mononucleotide polymorphism |
WO2016000267A1 (en) * | 2014-07-04 | 2016-01-07 | BGI Genomics Co., Ltd. | Method for determining the sequence of a probe and method for detecting genomic structural variation |
CN106715711A (en) * | 2014-07-04 | 2017-05-24 | BGI Genomics Co., Ltd. | Method for determining the sequence of a probe and method for detecting genomic structural variation |
CN108121897A (en) * | 2016-11-29 | 2018-06-05 | Huawei Technologies Co., Ltd. | A kind of genome mutation detection method and detection device |
WO2021232388A1 (en) * | 2020-05-22 | 2021-11-25 | MGI Tech Co., Ltd. | Method for determining base type of predetermined site in embryonic cell chromosome, and application thereof |
CN112802548A (en) * | 2021-01-07 | 2021-05-14 | Shenzhen GenePlus Clinical Laboratory | Method for predicting allele-specific copy number variation of single-sample whole genome |
CN112669902A (en) * | 2021-03-16 | 2021-04-16 | Beijing Berry Genomics Biotechnology Co., Ltd. | Method, computing device and storage medium for detecting genomic structural variation |
CN113409890A (en) * | 2021-05-21 | 2021-09-17 | Yinfeng Gene Technology Co., Ltd. | HLA typing method based on next generation sequencing data |
CN114999573A (en) * | 2022-04-14 | 2022-09-02 | Harbin Yinji Technology Co., Ltd. | Genome variation detection method and detection system |
CN114496077A (en) * | 2022-04-15 | 2022-05-13 | Beijing Berry Genomics Biotechnology Co., Ltd. | Methods, devices, and media for detecting single nucleotide variations and indels |
Non-Patent Citations (1)
Title |
---|
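Guan Dengfeng. Research on Haploid Genome Sequence Assembly Methods. China Doctoral Dissertations Full-text Database, Basic Sciences. 2021, (No. 01), full text. * |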
Also Published As
Publication number | Publication date |
---|---|
CN115631789A (en) | 2023-01-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN115631789B (en) | A Pan-Genome-Based Population Joint Variation Detection Method | |
US10354747B1 (en) | Deep learning analysis pipeline for next generation sequencing | |
CN111462823B (en) | Homologous recombination defect judgment method based on DNA sequencing data | |
WO2018218788A1 (en) | Third-generation sequencing sequence alignment method based on global seed scoring optimization | |
CN107133493B (en) | Method for assembling genome sequence, method for detecting structural variation and corresponding system | |
CN111326212B (en) | Structural variation detection method | |
CN111584006A (en) | Circular RNA identification method based on machine learning strategy | |
Smart et al. | A novel phylogenetic approach for de novo discovery of putative nuclear mitochondrial (pNumt) haplotypes | |
CN116469462B (en) | Ultra-low frequency DNA mutation identification method and device based on double sequencing | |
CN116064755A (en) | Device for detecting MRD marker based on linkage gene mutation | |
CN115083521A (en) | Method and system for identifying tumor cell group in single cell transcriptome sequencing data | |
CN108595912B (en) | Method, device and system for detecting chromosome aneuploidy | |
CN111180013B (en) | Device for detecting blood disease fusion gene | |
CN113823356B (en) | Methylation site identification method and device | |
CN117789817A (en) | Analysis system and retrieval method for enrichment and expression profile of cancer cross-tissue immune cell type | |
CN117275577A (en) | Algorithm for detecting human mitochondrial genetic mutation sites based on second-generation sequencing technology | |
CN113628680B (en) | A Benchmark Set-Based Genome Structural Variation Performance Detection Method | |
CN113539369B (en) | Optimized kraken2 algorithm and application thereof in second-generation sequencing | |
EP3663890A1 (en) | Alignment method, device and system | |
CN116312786A (en) | Single cell expression pattern difference evaluation method based on multi-group comparison | |
WO2023184330A1 (en) | Method and apparatus for processing genome methylation sequencing data, device, and medium | |
Wei et al. | RaPID-Query for fast identity by descent search and genealogical analysis | |
Zhang et al. | PocaCNV: a tool to detect copy number variants from population-scale genome sequencing data | |
CN115391284B (en) | Method, system and computer readable storage medium for quickly identifying gene data file | |
CN116343923B (en) | A method for identifying homology of genome structural variation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||