WO2021120529A1 - Method for homologous pseudogene mutation detection - Google Patents
Method for homologous pseudogene mutation detection
- Publication number
- WO2021120529A1 (application PCT/CN2020/092903)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- mutation
- control set
- site
- sample
- comparison
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Definitions
- The invention relates to the field of biological and precision-medicine gene detection, in particular to a method for homologous pseudogene mutation detection.
- WGS whole genome sequencing
- WES whole exome sequencing
- TRS target region sequencing
- The related analysis process is as follows: 1) after high-throughput sequencing is completed, the short-fragment sequence information of the genome is obtained; 2) the short sequences are aligned to the reference genome to locate the genome coordinates of each one; 3) the alignment results undergo genome coordinate sorting, de-duplication, local realignment, and base quality correction; 4) mutation detection and genotype evaluation are performed on each base of the genome; 5) finally, the mutation detection results for the individual genome are obtained.
- Next-generation gene sequencing technology, that is, high-throughput sequencing technology, is used to detect genetic mutations in personal samples.
- NGS technology next-generation gene sequencing technology
- The sequence labeled NCBI_chr22_NM033517.1 is the SHANK3 target gene region extracted from the GRCh38 genome; the sequence labeled NM_033517.1 is the SHANK3 target gene region recorded in the National Center for Biotechnology Information (NCBI) database.
- NCBI National Center for Biotechnology Information
- According to the comparison with the latest coding sequence, the SHANK3 gene derived from the GRCh38 genome and the SHANK3 gene derived from the NCBI database differ significantly at key positions.
- Homologous sequences cause false positives and false negatives in mutation detection. The human reference genome contains a large number of homologous regions, such as homologous genes and pseudogenes, and because of the limitations of current NGS technology the sequenced reads are usually short. When reads are aligned genome-wide, the homologous regions produce non-unique alignments, which lead to many false positive variants.
- The two genes related to spinal muscular atrophy (SMA), survival motor neuron gene 1 (SMN1) and survival motor neuron gene 2 (SMN2), are homologous genes that differ in only 5 bases.
- SMA spinal muscular atrophy
- SMN1 survival motor neuron gene 1
- SMN2 survival motor neuron gene 2
- As shown in Figure 3, when reads from these two genes are aligned to the human reference genome GRCh38, the sequences are filtered out because they align to homologous regions, so the source of a true mutation cannot be confirmed.
- With the latest updated gene sequence of the NCBI database, an insertion mutation can be detected in the Exon1 homology region of SMN1.
- The purpose of the present invention is to provide a method for detecting homologous pseudogene mutations. It addresses three problems: the commonly used reference genome sequence is not synchronized with updated gene sequences; abnormal alignment in homologous regions makes mutation detection inaccurate; and the current detection cycle is long.
- A method for homologous pseudogene mutation detection includes the following steps: 1) according to the gene sequences of the NCBI database, true genes are selected to construct a reference gene set; 2) raw data of normal samples are randomly obtained to create a control set, and the raw data of the control set's normal samples are aligned to the reference gene set to obtain the control set alignment result; 3) according to the control set alignment result, mutation detection is performed on each sample in the control set to construct the control set mutation site frequency data; 4) the raw data of the actual test sample are obtained and aligned to the reference gene set to obtain the actual test sample alignment result, and mutation site detection is performed on that result to obtain the actual test sample mutation site detection result; 5) the actual test sample mutation site detection result is screened site by site against the control set mutation site frequency data to remove false positive sites and obtain the mutation sites of the actual test sample.
- the reference gene set is constructed independently based on the latest updated full-length gene sequence of the National Center for Biotechnology Information (NCBI) database.
- The currently common GRCh38 reference genome covers all gene sequence information, including true genes, pseudogenes, and homologous genes.
- When the true genes in a sample are aligned, the presence of pseudogenes and homologous genes interferes, producing false positive calls at variant sites or reads whose source cannot be identified, which leads to missed variants.
- In this application only true genes, taken from the latest updated full-length gene sequences of the NCBI database, are included in the reference gene set. Because alignment favors the best match, reads from the true genes in a sample generally align to the true genes in the reference gene set. This improves alignment accuracy and effectively avoids the influence of homologous genes or pseudogenes on the detection of true gene mutations.
- The commonly used GRCh38 reference genome contains intergenic and other unused sequences and is about 3 Gb (gigabase pairs) in size, whereas the reference gene set constructed in this application contains only true gene sequences and is only 1 Gb in size, which greatly improves alignment efficiency and shortens the detection cycle when samples are aligned.
- Variation at each reference base of each sample is detected to obtain the mutation site frequencies in the control set.
- Because of errors in experiments, sequencing, and algorithms, the results of control set alignment and mutation detection inevitably contain some false positives.
- Constructing the reference gene set in this application includes the following steps: 1) download and collect the latest updated full-length gene sequences from the NCBI database and create a text file; 2) create a gene alignment index file; 3) create a gene sequence information file.
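Step 1 amounts to merging downloaded sequences into one FASTA text file with a uniform line width. The sketch below shows one way that formatting could be done; the function names and the default width of 60 are illustrative assumptions, not taken from the application.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA text lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def wrap_fasta(lines: Iterable[str], width: int = 60) -> List[str]:
    """Rewrite FASTA records so every sequence line has the same width
    (the last line of each record may be shorter)."""
    out: List[str] = []
    for header, seq in read_fasta(lines):
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out
```

A uniform line width matters because index files such as `.fai` record a single line length per sequence.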
- Quality control of the sample data in this application includes the following steps: 1) first remove adapter (linker) sequences and/or bases with a quality value below 30 at both ends of a sequence and/or sequences containing more than 5 N bases; 2) then remove sequences shorter than 35 bp.
- In step 1, the three removals (adapter sequences, end bases with a quality value below 30, and sequences with more than 5 N bases) may be applied in full or in part; when all of them are applied, the resulting data are of high quality.
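The filtering rules above can be sketched for a single read as follows. This is a minimal illustration, assuming Phred+33 quality encoding; adapter removal is left out, and the function names are hypothetical.

```python
from typing import Optional, Tuple


def trim_low_quality_ends(seq: str, qual: str, min_q: int = 30,
                          offset: int = 33) -> Tuple[str, str]:
    """Trim bases with quality below min_q from both ends of a read.
    Assumes Phred+33 encoded quality strings (an assumption of this sketch)."""
    start, end = 0, len(seq)
    while start < end and ord(qual[start]) - offset < min_q:
        start += 1
    while end > start and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def quality_control(seq: str, qual: str, min_q: int = 30, max_n: int = 5,
                    min_len: int = 35) -> Optional[Tuple[str, str]]:
    """Return the cleaned read, or None if the read should be discarded."""
    seq, qual = trim_low_quality_ends(seq, qual, min_q)
    if seq.upper().count("N") > max_n:   # more than 5 N bases: discard
        return None
    if len(seq) < min_len:               # shorter than 35 bp: discard
        return None
    return seq, qual
```

In Phred+33, `I` encodes quality 40 and `#` encodes quality 2, so a read whose ends are `#` loses those end bases before the length check is applied.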
- Aligning the control set samples to the reference gene set in this application includes the following steps: 1) align the quality-controlled control samples to the original reference gene set to obtain the original alignment result file; 2) sort the original alignment result file to generate a sorted result file; 3) remove duplicate sequences from the sorted result file to generate a de-duplicated sorted result file; 4) perform local realignment and base quality correction on the de-duplicated sorted result file to obtain the alignment result.
- Constructing the control set mutation site frequency data in this application includes the following steps: 1) perform mutation detection on each reference base of each control sample in the control set to obtain the mutation detection result files of all samples in the control set; 2) merge the mutation sites across the mutation detection result files of all samples in the control set to obtain the mutation result file of the control set population; 3) based on the mutation result file of the control set population, compute frequency statistics for each mutation site to obtain the population mutation frequency statistics.
- The mutation frequency of every mutation site is thereby obtained.
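The merge-and-count step can be illustrated with a small sketch. It assumes that each control sample's calls have already been reduced to a set of (chromosome, position, ref, alt) tuples and that "frequency" means the fraction of control samples carrying the variant; the source does not define the statistic more precisely, so both points are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def control_set_frequencies(samples: Iterable[Set[Variant]]) -> Dict[Variant, float]:
    """Fraction of control samples in which each variant was called."""
    sample_list = [set(s) for s in samples]
    if not sample_list:
        raise ValueError("control set must contain at least one sample")
    counts: Counter = Counter()
    for calls in sample_list:
        counts.update(calls)  # each sample contributes at most 1 per variant
    total = len(sample_list)
    return {variant: n / total for variant, n in counts.items()}
```

For example, a variant called in 2 of 4 control samples gets a frequency of 0.5.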
- The raw data of the actual test sample first undergo quality control; the quality-controlled raw data are then aligned to the reference gene set, and site mutation detection is performed.
- The quality control method and purpose for the actual test sample are the same as those for the control set samples.
- Each site is judged from the actual test sample's mutation sites and the control set's mutation frequency statistics: when the control set mutation frequency at a site is ≥ 0.5, the corresponding mutation site in the actual test sample is classified as a false positive site; when the control set mutation frequency at a site is ≥ 0.1 and less than 0.5, the corresponding mutation site in the actual test sample is classified as a population polymorphic site; when the control set mutation frequency at a site is less than 0.5, the corresponding mutation site in the actual test sample is classified as a unique mutation site.
- The mutation sites of the actual test sample are thereby obtained.
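The screening rule can be sketched as a small classifier. As written, the source gives "less than 0.5" as the bound for unique mutation sites, which overlaps the polymorphic range; the sketch below assumes the non-overlapping reading (unique when the frequency is below 0.1). Treating variants absent from the control set as frequency 0 is also an assumption, and the function and label names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Variant = Tuple[str, int, str, str]


def classify_site(control_frequency: float) -> str:
    """Label a site by its control set mutation frequency."""
    if not 0.0 <= control_frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if control_frequency >= 0.5:
        return "false_positive"
    if control_frequency >= 0.1:
        return "population_polymorphism"
    return "unique_variant"  # assumed bound: frequency < 0.1


def screen_sample(sample_variants: Iterable[Variant],
                  control_freqs: Dict[Variant, float]) -> Dict[Variant, str]:
    """Drop false positive sites and label the remaining variants."""
    kept: Dict[Variant, str] = {}
    for variant in sample_variants:
        # Assumption: a variant never seen in the control set has frequency 0.
        label = classify_site(control_freqs.get(variant, 0.0))
        if label != "false_positive":
            kept[variant] = label
    return kept
```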
- The technical solution of the present invention provides a method for homologous pseudogene mutation detection and achieves the following beneficial effects:
- A new reference genome can be constructed, which avoids the mismatch between the currently published human reference genome sequence and continuously updated gene sequences and improves the accuracy of variation detection.
- For example, for a SHANK3 gene variation, the GRCh38 reference genome variation description chr22:50721359-50721359G>T corresponds to the transcript variation NM_033517.1:exon21:c.3484G>T (p.Glu1162*); both the base position c.3484G>T and the amino acid position p.Glu1162* are wrong descriptions.
- With this method the description of the transcript variant is correct: NM_033515.1:exon21:c.3526G>T (p.Glu1176*).
- The new reference gene set constructed by this method collects the complete sequences of all updated genes, which avoids the influence of homologous regions or pseudogenes in the human genome used in the prior art.
- Consider the SMN1/SMN2 gene mutation chr5:70925124-70925124C>CA.
- This mutation occurs in a homologous gene, and routine analysis pipelines miss the mutation site.
- With this method the site variation is reported, annotated as SMN1:NM_000344.3:c.22dupA:p.(Ser8Lysfs*23); it is included in the clinical database HGMD, where it is described as DM, a harmful mutation site.
- This method constructs a control set of normal samples, obtains the mutation site frequency data of the control set samples, and uses them to compare and evaluate the mutation sites of actual samples. This avoids the effects of homologous sequence alignment (including homologous regions, pseudogenes, etc.) and improves the accuracy of gene mutation site calls.
- Figure 1 shows the key differences between the SHANK3 gene sequence and GRCh38 alignment
- Figure 2 shows the difference between SMN1 and SMN2
- Figure 3 is a comparison diagram of the variation of SMN1 and SMN2 in the Exon1 region in Figure 2;
- FIG. 4 is a flowchart of the homologous pseudogene mutation detection method in the present invention.
- Figure 5 is a flow chart of gene set construction in the present invention.
- Figure 6 is a flow chart of the quality control of the sample data of the control set in the present invention.
- Figure 7 is a flow chart of comparison of sample data in the control set in the present invention.
- Fig. 8 is a flow chart of constructing the frequency data of the variation site of the control set in the present invention.
- Fig. 9 is a flow chart of mutation detection and site screening of actual samples in the present invention.
- A method for homologous pseudogene mutation detection includes the following steps: 1) construct a reference gene set (CG-RefGenome) from the latest updated gene sequences of the NCBI database; 2) randomly obtain raw data of normal samples (Fastq format files) to create a control set, and align the raw data of the control set's normal samples to the reference gene set to obtain the control set alignment result (BAM file); 3) according to the control set alignment result, perform variant detection on each sample in the control set and construct the control set mutation site frequency data (VCF file); 4) obtain the raw data of the actual test sample (Fastq format file) and align them to the reference gene set to obtain the actual test sample alignment result (BAM file); perform mutation site detection on the comparison
- The reference gene set is constructed from the latest updated full-length gene sequences of the NCBI database. First, this avoids the mismatch between the currently published human reference genome sequence and continuously updated gene sequences and improves the accuracy of mutation detection.
- Second, the currently common GRCh38 reference genome covers all gene sequence information, including true genes, pseudogenes, and homologous genes.
- When the true genes in a sample are aligned, the presence of pseudogenes and homologous genes interferes, producing false positive calls at variant sites or reads whose source cannot be identified, which leads to missed variants.
- In this application only true genes, taken from the latest updated full-length gene sequences of the NCBI database, are included in the reference gene set. Because alignment favors the best match, reads from the true genes in a sample generally align to the true genes in the reference gene set. This improves alignment accuracy and effectively avoids the influence of homologous genes or pseudogenes on the detection of true gene mutations.
- Constructing the reference gene set in this application includes the following steps: 1) Collect the latest updated full-length gene sequences from the NCBI database: download the gene sequence source files, decompress and merge them, and then format the result into a reference gene sequence file in fasta format in which every line has the same sequence length. 2) Create the gene alignment index file: sample sequences are aligned to the reference gene sequences with the mem module of the bwa software tool, which uses an alignment algorithm based on block-sorting compression (the Burrows-Wheeler transform, BWT), so the fasta file of the reference gene sequences must be indexed.
- BWT block sort compression
- The index module of the bwa tool is used in this application to process the reference gene sequence file and create the gene alignment index file.
- 3) Create a dictionary to obtain the gene sequence information files: the fai file and the dict file are the files the GATK tools rely on for base mutation detection, so in this application samtools and picard are used to create, for the reference gene sequence file, gene sequence information files comprising a fai file and a dict file.
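The indexing steps 2 and 3 map onto three command-line invocations. The sketch below only assembles the commands as argument lists; it assumes `bwa`, `samtools`, and a `picard` wrapper script are on the PATH, uses Picard's legacy `R=`/`O=` argument style, and the file name in the usage note is hypothetical.

```python
from typing import List


def reference_index_commands(fasta: str) -> List[List[str]]:
    """Build the commands that index a reference gene set FASTA file."""
    stem = fasta.rsplit(".", 1)[0]  # e.g. "ref.fa" -> "ref"
    return [
        # BWT-based alignment index used by `bwa mem`
        ["bwa", "index", fasta],
        # writes <fasta>.fai next to the FASTA file
        ["samtools", "faidx", fasta],
        # writes the sequence dictionary (.dict) that GATK expects
        ["picard", "CreateSequenceDictionary", f"R={fasta}", f"O={stem}.dict"],
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)`; for instance `reference_index_commands("CG-RefGenome.fa")` yields a dictionary command that writes `CG-RefGenome.dict`.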
- This application randomly obtains raw data (FASTQ format) from no fewer than 30 normal samples to create a control set, uses the cutadapt software to perform quality control on the raw data of the normal samples in the control set, and then aligns the quality-controlled raw data of the control set's normal samples to the reference gene set.
- Because of deviations introduced by experimental operation, on-machine sequencing, and other steps, raw sequencing data contain invalid sequences such as primer sequences, erroneous sequences, noise sequences, and low-quality sequences. These sequences contribute nothing to subsequent analysis and also reduce the accuracy of the analysis results.
- Quality control of the raw data removes residual primer sequences and filters out low-quality and erroneous sequences to obtain clean, effective sequence data, which improves the accuracy of the analysis results, saves computing resources to some extent, and reduces analysis time.
- Quality control of the sample data in the control set in this application includes the following steps: 1) first remove adapter (linker) sequences and/or bases with a quality value below 30 at both ends of a sequence and/or sequences containing more than 5 N bases; 2) then eliminate sequences shorter than 35 bp (base pairs).
- In step 1, the three removals (adapter sequences, end bases with a quality value below 30, and sequences with more than 5 N bases) may be applied in full or in part; when all of them are applied, the resulting data are of high quality.
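With cutadapt, these thresholds correspond to a handful of options. The following sketch builds a single-end command line; the adapter sequence is a placeholder assumption (the source does not name one), and paired-end data would need additional options.

```python
from typing import List


def cutadapt_command(fastq_in: str, fastq_out: str,
                     adapter: str = "AGATCGGAAGAGC") -> List[str]:
    """Build a cutadapt call mirroring the quality control thresholds.
    The default adapter is a placeholder, not a value from the source."""
    return [
        "cutadapt",
        "-a", adapter,     # remove the 3' adapter sequence
        "-q", "30,30",     # trim bases below quality 30 from the 5' and 3' ends
        "--max-n", "5",    # discard reads containing more than 5 N bases
        "-m", "35",        # discard reads shorter than 35 bases after trimming
        "-o", fastq_out,
        fastq_in,
    ]
```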
- Aligning the control set samples to the reference gene set in this application includes the following steps: 1) align the quality-controlled control samples (clean Fastq format) to the original reference gene set with the bwa software to obtain the original alignment result file (raw.bam); 2) sort the original alignment result file to generate a sorted result file (sort.bam); 3) remove duplicate sequences from the sorted result file to generate a de-duplicated sorted result file; 4) perform local realignment and base quality correction on the de-duplicated sorted result file to obtain the alignment result.
- The sequenced short reads of the samples are aligned, and alignment locates the exact coordinates of each short sequence in the reference genome.
- The coordinate positions of the sequences recorded in the generated alignment result file are in random order, so the short sequences need to be sorted by base position along each chromosome of the reference genome.
- Subsequent mutation detection examines each chromosome base in order to determine whether a mutation is present, so sorting the original alignment result file into a sorted file is a very important step. Because high-throughput sequencing of samples includes an experimental sequence amplification step, sequences are replicated in this step and duplicate sequences are generated.
- The base quality values in the corresponding sequences differ, so the base quality values need to be corrected once to improve the accuracy of subsequent mutation detection.
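The alignment, sorting, de-duplication, and base quality correction steps could be chained roughly as below. Only bwa is named by the source for this stage; using samtools for conversion and sorting and GATK4 for duplicate marking and recalibration is an assumption of this sketch, as are all file names and the known-sites VCF. GATK4 has no separate local realignment tool, so that step is omitted here.

```python
from typing import List


def alignment_commands(sample: str, fastq: str, reference: str,
                       known_sites: str) -> List[List[str]]:
    """Build the command sequence from clean FASTQ to a recalibrated BAM."""
    # bwa expands the literal "\t" sequences in the read group string itself
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    raw_sam = f"{sample}.raw.sam"
    raw_bam = f"{sample}.raw.bam"
    sort_bam = f"{sample}.sort.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    return [
        ["bwa", "mem", "-R", read_group, "-o", raw_sam, reference, fastq],
        ["samtools", "view", "-b", "-o", raw_bam, raw_sam],
        ["samtools", "sort", "-o", sort_bam, raw_bam],
        ["gatk", "MarkDuplicates", "-I", sort_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        ["gatk", "BaseRecalibrator", "-R", reference, "-I", dedup_bam,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", reference, "-I", dedup_bam,
         "--bqsr-recal-file", recal_table, "-O", f"{sample}.final.bam"],
    ]
```

The commands are returned as argument lists so each can be run in order with `subprocess.run(cmd, check=True)`.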
- Constructing the control set mutation site frequency data in this application includes the following steps: 1) perform mutation detection on each reference base of each control sample in the control set to obtain the mutation detection result files of all samples in the control set; 2) merge the mutation sites across the mutation detection result files of all samples in the control set to obtain the mutation result file of the control set population; 3) based on the mutation result file of the control set population, count the frequency of each mutation site to obtain the population mutation frequency statistics.
- The mutation frequency of every mutation site is thereby obtained.
- The raw data of the actual test sample first undergo quality control; the quality-controlled raw data are then aligned to the reference gene set, and site mutation detection is performed.
- The quality control methods and purposes for the actual test samples are the same as those for the control set samples: residual primer sequences are removed and low-quality and erroneous sequences are filtered out to obtain clean, effective sequence data, which improves the accuracy of the analysis results, saves computing resources, and reduces analysis time.
- Each site is judged from the actual test sample's mutation sites and the control set's mutation frequency statistics: when the control set mutation frequency at a site is ≥ 0.5, the corresponding mutation site in the actual test sample is classified as a false positive site; when the control set mutation frequency at a site is ≥ 0.1 and less than 0.5, the corresponding mutation site in the actual test sample is classified as a population polymorphic site; when the control set mutation frequency at a site is less than 0.5, the corresponding mutation site in the actual test sample is classified as a unique mutation site. High-quality mutation sites can be obtained by the above method.
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
Claims (9)
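- A method for homologous pseudogene mutation detection, characterized by comprising the following steps: selecting true genes according to the gene sequences of the NCBI database to construct a reference gene set; randomly obtaining raw data of normal samples to create a control set, and aligning the raw data of the control set's normal samples to the reference gene set to obtain a control set alignment result; according to the control set alignment result, performing mutation detection on each sample in the control set to construct control set mutation site frequency data; obtaining raw data of an actual test sample, and aligning the raw data of the actual test sample to the reference gene set to obtain an actual test sample alignment result; performing mutation site detection on the actual test sample alignment result to obtain an actual test sample mutation site detection result; and screening the actual test sample mutation site detection result site by site against the control set mutation site frequency data to remove false positive sites and obtain the gene mutation sites of the actual test sample.
- The method for homologous pseudogene mutation detection according to claim 1, characterized in that constructing the reference gene set comprises the following steps: collecting the full-length gene sequences of the NCBI database and creating a text file; creating a gene alignment index file; and creating a gene sequence information file.
- The method for homologous pseudogene mutation detection according to claim 2, characterized in that after the raw data of normal samples are randomly obtained to create the control set, the raw data of the control set's normal samples first undergo quality control.
- The method for homologous pseudogene mutation detection according to claim 3, characterized in that the quality control comprises the following steps: removing adapter sequences and/or bases with a quality value below 30 at both ends of a sequence and/or sequences containing more than 5 N bases; and eliminating sequences shorter than 35 base pairs.
- The method for homologous pseudogene mutation detection according to claim 4, characterized in that aligning data to the reference gene set comprises the following steps: aligning the quality-controlled control set samples to the original reference gene set to obtain an original alignment result file; sorting the original alignment result file to generate a sorted result file; removing duplicate sequences from the sorted result file to generate a de-duplicated sorted result file; and performing local realignment and base quality correction on the de-duplicated sorted result file to obtain the alignment result.
- The method for homologous pseudogene mutation detection according to claim 5, characterized in that constructing the control set mutation site frequency data comprises the following steps: performing mutation detection on each reference base of each control sample in the control set to obtain the mutation detection result files of all samples in the control set; merging the mutation sites based on the mutation detection result files of all samples in the control set to obtain the mutation result file of the control set population; and, based on the mutation result file of the control set population, computing frequency statistics for each mutation site to obtain the control set mutation frequency statistics.
- The method for homologous pseudogene mutation detection according to any one of claims 1-6, characterized in that after the raw data of the actual test sample are obtained, the raw data of the actual test sample first undergo quality control.
- The method for homologous pseudogene mutation detection according to claim 7, characterized in that the mutation results of the actual test sample are compared with the control set mutation frequency statistics for site screening, and false positives are removed to obtain the mutation sites of the actual test sample.
- The method for homologous pseudogene mutation detection according to claim 8, characterized in that the site screening judges each site according to the control set mutation frequency statistics: when the control set mutation frequency at a site is ≥ 0.5, the corresponding mutation site in the actual test sample is classified as a false positive site; when the control set mutation frequency at a site is ≥ 0.1 and < 0.5, the corresponding mutation site in the actual test sample is classified as a population polymorphic site; when the control set mutation frequency at a site is < 0.5, the corresponding mutation site in the actual test sample is classified as a unique mutation site.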
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911328534.6 | 2019-12-20 | ||
CN201911328534.6A CN111081315B (zh) | 2019-12-20 | 2019-12-20 | Method for homologous pseudogene mutation detection |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021120529A1 true WO2021120529A1 (zh) | 2021-06-24 |
Family
ID=70316422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/092903 WO2021120529A1 (zh) | Method for homologous pseudogene mutation detection | 2019-12-20 | 2020-05-28 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111081315B (zh) |
WO (1) | WO2021120529A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116312776A (zh) * | 2022-12-08 | 2023-06-23 | 上海生物制品研究所有限责任公司 | Method for detecting differential RNA editing sites |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
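CN111081315B (zh) * | 2019-12-20 | 2023-06-06 | 苏州赛美科基因科技有限公司 | Method for homologous pseudogene mutation detection |
CN112365930B (zh) * | 2020-10-19 | 2022-06-10 | 北京大学 | Method for determining the optimal sequence alignment threshold for a gene database |
CN112466395B (zh) * | 2020-10-30 | 2021-08-17 | 苏州赛美科基因科技有限公司 | Sample identification tag screening method and sample identification detection method based on SNP polymorphic sites |
CN115810393B (zh) * | 2022-12-22 | 2023-08-25 | 南京普恩瑞生物科技有限公司 | Method and system for sequencing-sample homology detection based on a constructed population SNP library |
CN115881225B (zh) * | 2022-12-28 | 2024-01-26 | 云舟生物科技(广州)股份有限公司 | Analysis method for biological information sequences, computer storage medium, and electronic device |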
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110091895A1 (en) * | 1999-01-14 | 2011-04-21 | Boman Bruce M | Immunoassays to Detect Diseases or Disease Susceptibility Traits |
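CN106372459A (zh) * | 2016-08-30 | 2017-02-01 | 天津诺禾致源生物信息科技有限公司 | Method and device for copy number variation detection based on amplicon next-generation sequencing |
CN107491666A (zh) * | 2017-09-01 | 2017-12-19 | 深圳裕策生物科技有限公司 | Method, device, and storage medium for detecting single-sample somatic mutation sites in abnormal tissue |
CN107974490A (zh) * | 2017-12-08 | 2018-05-01 | 东莞博奥木华基因科技有限公司 | Method and device for detecting PKU pathogenic gene mutations based on semiconductor sequencing |
CN108875302A (zh) * | 2018-06-22 | 2018-11-23 | 广州漫瑞生物信息技术有限公司 | System and method for detecting copy number variation of cell-free tumor genes |
CN111081315A (zh) * | 2019-12-20 | 2020-04-28 | 苏州赛美科基因科技有限公司 | Method for homologous pseudogene mutation detection |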
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105404793B (zh) * | 2015-12-07 | 2018-05-11 | 浙江大学 | Method for rapidly discovering phenotype-related genes based on a probabilistic framework and resequencing technology |
US11993811B2 (en) * | 2017-01-31 | 2024-05-28 | Myriad Women's Health, Inc. | Systems and methods for identifying and quantifying gene copy number variations |
CN110033829B (zh) * | 2019-04-11 | 2021-07-23 | 北京诺禾心康基因科技有限公司 | Fusion detection method for homologous genes based on differential SNP markers |
-
2019
- 2019-12-20 CN CN201911328534.6A patent/CN111081315B/zh active Active
-
2020
- 2020-05-28 WO PCT/CN2020/092903 patent/WO2021120529A1/zh active Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116312776A (zh) * | 2022-12-08 | 2023-06-23 | 上海生物制品研究所有限责任公司 | Method for detecting differential RNA editing sites |
CN116312776B (zh) * | 2022-12-08 | 2024-01-19 | 上海生物制品研究所有限责任公司 | Method for detecting differential RNA editing sites |
Also Published As
Publication number | Publication date |
---|---|
CN111081315A (zh) | 2020-04-28 |
CN111081315B (zh) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
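WO2021120529A1 (zh) | Method for homologous pseudogene mutation detection | |
US6625545B1 (en) | Method and apparatus for mRNA assembly | |
CN104762402B (zh) | Ultra-fast method for detecting single-base mutations and small insertions/deletions in the human genome | |
US11339426B2 (en) | Method capable of differentiating fetal sex and fetal sex chromosome abnormality on various platforms | |
CN109801678B (zh) | Whole-transcriptome-based tumor antigen prediction method and its application | |
Cox et al. | Integrating gene and protein expression data: pattern analysis and profile mining | |
CN108197434B (zh) | Method for removing human gene sequences from metagenomic sequencing data | |
US20130166221A1 (en) | Method and system for sequence correlation | |
KR101313087B1 (ko) | Sequence recombination method and apparatus for NGS | |
WO2019213811A1 (zh) | Method, device, and system for detecting chromosome aneuploidy | |
CN111326212A (zh) | Method for detecting structural variation | |
CN115631789A (zh) | Pan-genome-based population joint variant detection method | |
CN113096737B (zh) | Method and system for automatic analysis of pathogen types | |
CN111863132A (zh) | Method and system for screening pathogenic variants | |
CN113223619A (zh) | Method for comparing the sequencing coverage of different whole-genome sequencing methods | |
CN112837748A (zh) | System and method for distinguishing tumors of different anatomical origins | |
EP3795692A1 (en) | Method, apparatus, and system for detecting chromosome aneuploidy | |
Halpin et al. | Multimapping confounds ribosome profiling analysis: A case‐study of the Hsp90 molecular chaperone | |
WO2018033733A1 (en) | Methods and apparatus for identifying genetic variants | |
CN114067909B (zh) | Method, device, and storage medium for correcting homologous recombination deficiency scores | |
CN111653312B (zh) | Method for exploring the affinity of disease subtypes using genomic data | |
WO2013097149A1 (zh) | Method and device for estimating genome repeat sequence content | |
WO2013097143A1 (zh) | Method and device for estimating genome heterozygosity rate | |
EP2000935A2 (en) | Method of processing protein peptide data and system | |
CN117238365A (zh) | Method and device for early screening of neonatal genetic diseases based on high-throughput sequencing technology |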
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20903849 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20903849 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20-01-2023) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20903849 Country of ref document: EP Kind code of ref document: A1 |