WO2021120529A1 - Method for homologous pseudogene variant detection - Google Patents

Method for homologous pseudogene variant detection

Info

Publication number
WO2021120529A1
Authority
WO
WIPO (PCT)
Prior art keywords
mutation
control set
site
sample
comparison
Prior art date
Application number
PCT/CN2020/092903
Other languages
English (en)
French (fr)
Inventor
梁萌萌
余伟师
栗海波
李珉
Original Assignee
苏州赛美科基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州赛美科基因科技有限公司
Publication of WO2021120529A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the invention relates to the field of biology and precision medicine gene detection, in particular to a method for homologous pseudogene mutation detection.
  • WGS whole genome sequencing
  • WES whole exome sequencing
  • TRS target region sequencing
  • the related analysis process is as follows: 1) after high-throughput sequencing is completed, short-fragment sequence information of the genome is obtained; 2) the short sequences are aligned to the reference genome to locate the genome coordinates of each one; 3) the alignment results are sorted by genome coordinate, de-duplicated, realigned, and base-quality corrected; 4) variant detection and genotype evaluation are performed on each base of the genome; 5) finally, the individual's genome variant detection results are obtained.
  • this workflow has become the recommended process for detecting genetic variants in individual samples with next-generation gene sequencing technology, that is, high-throughput sequencing technology.
  • NGS technology next-generation gene sequencing technology
  • the sequence labeled NCBI_chr22_NM033517.1 is the SHANK3 target gene region extracted from the GRCh38 genome; the sequence labeled NM_033517.1 is the SHANK3 target gene region held in the National Center for Biotechnology Information (NCBI) database.
  • NCBI National Center for Biotechnology Information
  • the latest coding sequence: according to the alignment, the SHANK3 gene derived from the GRCh38 genome and the SHANK3 gene from the NCBI database differ significantly at key positions.
  • homologous sequences cause false positives and false negatives in variant detection. The human reference genome contains many homologous regions, such as homologous genes and pseudogenes, and reads produced by current NGS technology are usually short. In genome-wide alignment, homologous regions therefore give rise to non-unique alignments, which lead to many false-positive variants.
  • the two genes associated with spinal muscular atrophy (SMA), survival motor neuron gene 1 (SMN1) and survival motor neuron gene 2 (SMN2), are homologous genes that differ at only 5 base positions.
  • SMA spinal muscular atrophy
  • SMN1 survival motor neuron gene 1
  • SMN2 survival motor neuron gene 2
  • as shown in Figure 3, when reads from these two genes are aligned to the human reference genome GRCh38, they align to the homologous region, so the source of a true variant cannot be confirmed and the variant is filtered out.
  • when aligned to the latest updated gene sequences of the NCBI database, an insertion variant is detected in the Exon1 homologous region of SMN1.
  • the purpose of the present invention is to provide a method for homologous pseudogene variant detection that addresses the lack of synchronization between the commonly used reference genome sequence and updated gene sequences, the inaccurate variant detection caused by abnormal alignment in homologous regions, and the long detection cycle of current methods.
  • a method for homologous pseudogene variant detection, including the following steps: 1) true genes are selected from the gene sequences of the NCBI database to construct a reference gene set; 2) raw data of normal samples are obtained at random to create a control set, and the raw data are aligned to the reference gene set to obtain the control-set alignment results; 3) based on those results, variant detection is performed on each sample in the control set to construct the control-set variant site frequency data; 4) raw data of the test sample are obtained and aligned to the reference gene set to obtain the test-sample alignment results, on which variant site detection is performed to obtain the test-sample variant site detection results; 5) the test-sample variant site detection results are screened site by site against the control-set variant site frequency data to remove false-positive sites and obtain the variant sites of the test sample.
  • the reference gene set is constructed independently based on the latest updated full-length gene sequence of the National Center for Biotechnology Information (NCBI) database.
  • NCBI National Center for Biotechnology Information
  • the commonly used GRCh38 reference genome covers all gene sequence information, including true genes, pseudogenes, and homologous genes.
  • when samples are aligned, the pseudogenes and homologous genes present cause false-positive calls at variant sites, or make the source of a variant unidentifiable so that the variant is missed.
  • only true genes are extracted into the reference gene set built from the latest updated full-length gene sequences of the NCBI database. Because alignment favors the best match, sample reads from true genes generally align to the true genes in the reference gene set, which improves alignment accuracy and effectively avoids the influence of homologous genes or pseudogenes on variant detection in true genes.
  • the commonly used GRCh38 reference genome contains intergenic and non-functional sequences and is about 3 Gb (giga base pairs) in size, whereas the reference gene set built in this application contains only true gene sequences and is only 1 Gb in size, which greatly improves alignment efficiency and shortens the detection cycle when samples are aligned.
  • after the control-set raw data are aligned to the reference gene set, variant detection at each reference base of each sample yields the variant site frequencies of the control set.
  • after the test sample is aligned to the reference gene set and variants are detected, errors in experiments, sequencing, and algorithms mean that the variant detection results inevitably contain some false positives.
  • this application includes the following steps when constructing the reference gene set: 1) download and collect the latest updated full-length gene sequences from the NCBI database and create a text file; 2) create a gene alignment index file; 3) create a gene sequence information file.
  • the present application includes the following steps for quality control of the sample data: 1) first remove adapter sequences and/or bases with a quality value below 30 at both ends of a read and/or reads containing more than 5 N bases; 2) then remove reads shorter than 35 bp.
  • in step 1, the adapter sequences, the bases with a quality value below 30 at both ends, and the reads with more than 5 N bases may be removed in any order, wholly or in part; once all of them have been removed, high-quality data are obtained.
  • the alignment of control-set samples to the reference gene set in this application includes the following steps: 1) align the quality-controlled control samples to the reference gene set to obtain the raw alignment result file; 2) sort the raw alignment result file to generate a sorted result file; 3) remove duplicate sequences from the sorted result file to generate a deduplicated sorted result file; 4) perform local realignment and base quality correction on the deduplicated sorted result file to obtain the alignment result.
  • the construction of the control-set variant site frequency data in this application includes the following steps: 1) perform variant detection on each reference base of each control sample to obtain the variant detection result files of all samples in the control set; 2) merge the variant sites across those files to obtain the variant result file of the control-set population; 3) based on that file, compute the frequency of each variant site to obtain the population variant frequency statistics.
  • the mutation frequency of all the mutation sites can be obtained.
  • after the raw data of the test sample are obtained, they first undergo quality control; the quality-controlled data are then aligned to the reference gene set and site variant detection is performed.
  • the quality control method and purpose of the actual test sample are consistent with the quality control method and purpose of the control set sample.
  • each variant site of the test sample is judged against the control-set variant frequency statistics: when the control-set variant frequency at a site is ≥0.5, the corresponding variant site of the test sample is classified as a false-positive site; when the control-set variant frequency is ≥0.1 and <0.5, it is classified as a population polymorphic site; when the control-set variant frequency is <0.5, it is classified as a unique variant site.
  • the mutation site of the actual sample can be obtained.
  • the technical solution of the present invention provides a method for homologous pseudogene mutation detection, and can obtain the following beneficial effects:
  • by selecting the complete sequences of all genes most recently updated in the NCBI database, a new reference genome can be constructed, which avoids the lack of synchronization between the currently published human reference genome sequence and the continuously updated gene sequences and improves the accuracy of variant detection.
  • for example, for a SHANK3 gene variant, the GRCh38 reference genome description chr22:50721359-50721359G>T corresponds to the transcript variant NM_033517.1:exon21:c.3484G>T (p.Glu1162*), in which both the base position c.3484G>T and the amino acid position p.Glu1162* are wrong.
  • after analysis with this workflow, the transcript variant is described correctly as NM_033515.1:exon21:c.3526G>T (p.Glu1176*).
  • the new reference gene set constructed by this method collects the complete sequences of all updated genes, which avoids the influence on alignment of homologous regions or pseudogenes in the human genome used in the prior art.
  • for example, the SMN1/SMN2 gene variant chr5:70925124-70925124C>CA occurs within a homologous gene, and routine analysis workflows miss this variant site.
  • with the analysis workflow of this application, the variant at this site is reported and annotated as SMN1:NM_000344.3:c.22dupA:p.(Ser8Lysfs*23); it is recorded in the clinical database HGMD, where it is described as DM, that is, a harmful variant site.
  • this method builds a control set of normal samples, obtains the variant site frequency data of the control-set samples, and evaluates the variant sites of the test sample against them, which avoids the influence of homologous sequences (including homologous regions, pseudogenes, etc.) on alignment and improves the accuracy of variant site calls.
  • Figure 1 shows the key differences between the SHANK3 gene sequence and GRCh38 alignment
  • Figure 2 shows the difference between SMN1 and SMN2
  • Figure 3 is a comparison diagram of the variation of SMN1 and SMN2 in the Exon1 region in Figure 2;
  • FIG. 4 is a flowchart of the homologous pseudogene mutation detection method in the present invention.
  • Figure 5 is a flow chart of gene set construction in the present invention.
  • Figure 6 is a flow chart of the quality control of the sample data of the control set in the present invention.
  • Figure 7 is a flow chart of comparison of sample data in the control set in the present invention.
  • Fig. 8 is a flow chart of constructing the frequency data of the variation site of the control set in the present invention.
  • Fig. 9 is a flow chart of mutation detection and site screening of actual samples in the present invention.
  • a method for homologous pseudogene variant detection, including the following steps: 1) construct a reference gene set (CG-RefGenome) from the latest updated gene sequences of the NCBI database; 2) randomly obtain raw data of normal samples to create a control set (Fastq format files), and align the raw data to the reference gene set to obtain the control-set alignment results (BAM files); 3) based on those results, perform variant detection on each sample in the control set to construct the control-set variant site frequency data (VCF file); 4) obtain the raw data of the test sample (Fastq format file) and align them to the reference gene set to obtain the test-sample alignment result (BAM file); perform variant site detection on that alignment result to obtain the test-sample variant site detection result (VCF file); 5) screen the test-sample variant site detection result site by site against the control-set variant site frequency data to remove false-positive sites and obtain the gene variant sites of the test sample.
  • the reference gene set is constructed from the latest updated full-length gene sequences of the NCBI database. First, this avoids the lack of synchronization between the currently published human reference genome sequence and the continuously updated gene sequences, and improves the accuracy of variant detection.
  • the commonly used GRCh38 reference genome covers all gene sequence information, including true genes, pseudogenes, and homologous genes.
  • when samples are aligned, the pseudogenes and homologous genes present cause false-positive calls at variant sites, or make the source of a variant unidentifiable so that the variant is missed.
  • only true genes are extracted into the reference gene set built from the latest updated full-length gene sequences of the NCBI database. Because alignment favors the best match, sample reads from true genes generally align to the true genes in the reference gene set, which improves alignment accuracy and effectively avoids the influence of homologous genes or pseudogenes on variant detection in true genes.
  • this application includes the following steps when constructing a reference gene set: 1) first, collect the latest updated full-length gene sequences from the NCBI database: download the gene sequence source files, decompress and merge them, and then format the merged file to obtain a reference gene sequence file in fasta format in which every sequence line has the same length. 2) Create a gene alignment index file: sample sequences are aligned to the reference gene sequences with the mem module of the bwa software tool, which uses an alignment algorithm based on the Burrows-Wheeler transform (BWT), a block-sorting compression technique, so the fasta file of the reference gene sequences must be indexed.
  • BWT Burrows-Wheeler transform (block-sorting compression)
  • the index module of the bwa tool is therefore used in this application to process the reference gene sequence file and create the gene alignment index file.
  • 3) Create a dictionary to obtain the gene sequence information files: the fai file and the dict file are files that the GATK tools depend on for base variant detection. Therefore, in this application, samtools and picard are used to create, for the reference gene sequence file, gene sequence information files comprising a fai file and a dict file.
  • this application randomly obtains raw data (FASTQ format) from no fewer than 30 normal samples to create the control set, uses the cutadapt software to perform quality control on the raw data, and then aligns the quality-controlled raw data to the reference gene set.
  • because of deviations introduced by experimental handling, sequencer runs, and other processes, the raw sequencing data contain invalid sequence data such as primer sequences, erroneous sequences, noise sequences, and low-quality sequences. These data contribute nothing to subsequent analysis and instead reduce the accuracy of the analysis results.
  • quality control of the raw data removes residual primer sequences and filters out low-quality and erroneous sequences to yield clean, valid sequence data, which improves the accuracy of the analysis results, saves some computing resources, and reduces analysis time.
  • this application includes the following steps when performing quality control on the sample data in the control set: 1) first remove adapter sequences and/or bases with a quality value below 30 at both ends of a read and/or reads containing more than 5 N bases; 2) then remove reads shorter than 35 bp (base pairs).
  • in step 1, the adapter sequences, the bases with a quality value below 30 at both ends, and the reads with more than 5 N bases may be removed in any order, wholly or in part; once all of them have been removed, high-quality data are obtained.
  • the alignment of control-set samples to the reference gene set in this application includes the following steps: 1) align the quality-controlled control samples (Clean Fastq format) to the reference gene set with the bwa software to obtain the raw alignment result file (raw.bam); 2) sort the raw alignment result file to generate a sorted result file (sort.bam); 3) remove duplicate sequences from the sorted result file to generate a deduplicated sorted result file; 4) perform local realignment and base quality correction on the deduplicated sorted result file to obtain the alignment result.
  • first, the short sequencing reads of the sample are aligned so that the exact coordinates of each read in the reference genome are located.
  • the reads in the resulting alignment file appear in random coordinate order, so each read must be sorted according to the base numbering of the chromosomes of the reference genome.
  • subsequent variant detection walks through each base of the chromosome in order to decide whether it is a variant, so sorting the raw alignment result file into a sorted file is an important step. High-throughput sequencing includes a sequence amplification step that copies each sequence and produces duplicate sequences.
  • after sample sequences are aligned to the reference genome, many sequences align at each position and the base quality values in those sequences differ, so the base quality values need to be corrected once to improve the accuracy of subsequent variant detection.
  • the construction of the control-set variant site frequency data in this application includes the following steps: 1) perform variant detection on each reference base of each control sample to obtain the variant detection result files of all samples in the control set; 2) merge the variant sites across those files to obtain the variant result file of the control-set population; 3) based on that file, compute the frequency of each variant site to obtain the population variant frequency statistics.
  • the mutation frequency of all the mutation sites can be obtained.
  • after the raw data of the test sample are obtained, they first undergo quality control; the quality-controlled data are then aligned to the reference gene set and site variant detection is performed.
  • the quality control method and purpose for the test sample are the same as for the control-set samples: residual primer sequences are removed and low-quality and erroneous sequences are filtered out to obtain clean, valid sequence data, which improves the accuracy of the analysis results, saves computing resources, and reduces analysis time.
  • each variant site of the test sample is judged against the control-set variant frequency statistics: when the control-set variant frequency at a site is ≥0.5, the corresponding variant site of the test sample is classified as a false-positive site; when the control-set variant frequency is ≥0.1 and <0.5, it is classified as a population polymorphic site; when the control-set variant frequency is <0.5, it is classified as a unique variant site. High-quality variant sites are obtained in this way.

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

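The invention provides a method for homologous pseudogene variant detection. A reference gene set is constructed from the most recently updated gene sequences; raw data from normal samples are obtained at random to create a control set; the raw data of the control-set normal samples are aligned to the reference gene set to obtain control-set alignment results; variant detection is performed on each sample in the control set to build control-set variant site frequency data; raw data of the test sample are obtained and aligned to the reference gene set, and variant site detection is performed on the test-sample alignment results to obtain test-sample variant site detection results; these results are then screened site by site against the control-set variant site frequency data to obtain the gene variant sites of the test sample. Compared with the prior art, the method addresses the lack of synchronization between reference genome sequences and updated gene sequences, improves the accuracy of variant detection at gene sites, and shortens the detection cycle.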

Description

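Method for homologous pseudogene variant detection
Technical Field
The invention relates to the field of genetic testing in biology and precision medicine, and in particular to a method for homologous pseudogene variant detection.
Background
In biology and precision medicine, clinical diagnosis of genetic disease in an individual usually requires genetic testing of that individual. Common approaches are whole genome sequencing (WGS), whole exome sequencing (WES), and target region sequencing (TRS). The associated analysis workflow is as follows: 1) after high-throughput sequencing, short-fragment sequence information of the genome is obtained; 2) the short sequences are aligned to a reference genome to locate the genomic coordinates of each one; 3) the alignment results are sorted by genomic coordinate, deduplicated, realigned, and base-quality corrected; 4) variant detection and genotype evaluation are performed at each base of the genome; 5) finally, the individual's genome variant detection results are obtained.
This workflow has become the recommended process for detecting gene variants in individual samples with next-generation sequencing (NGS), i.e. high-throughput sequencing. It still has several problems, such as:
1) The workflow depends on a reference genome; the current version is Genome Reference Consortium Human Genome Build 38 (GRCh38). The reference genome is updated slowly, whereas published reference sequences for human genes are revised continually as research progresses, so the reference genome sequence and the latest gene sequences fall out of sync.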
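As shown in Figure 1, the sequence labeled NCBI_chr22_NM033517.1 is the SHANK3 target gene region extracted from the GRCh38 genome; the sequence labeled NM_033517.1 is the latest coding sequence of the SHANK3 target gene region held in the National Center for Biotechnology Information (NCBI) database. The alignment shows that the SHANK3 gene derived from GRCh38 and the SHANK3 gene held in the NCBI database differ significantly at key positions.
2) Homologous sequences cause false positives and false negatives in variant detection. The human reference genome contains many homologous regions, such as homologous genes and pseudogenes, and reads produced by current NGS technology are usually short. During genome-wide alignment, homologous regions therefore produce non-unique alignments, which lead to many false-positive variants.
As shown in Figure 2, the two genes associated with spinal muscular atrophy (SMA), survival motor neuron gene 1 (SMN1) and survival motor neuron gene 2 (SMN2), are homologous genes that differ at only 5 base positions. As shown in Figure 3, when reads from these two genes are aligned to the human reference genome GRCh38, they align to the homologous region, so the source of a true variant cannot be confirmed and the variant is filtered out. When the reads are aligned to the latest updated gene sequences in the NCBI database, an insertion variant is detected in the Exon1 homologous region of SMN1.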
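The non-unique alignment problem described above can be illustrated with a minimal Python sketch. The two sequences are invented toy strings that differ at one base; they are not real SMN1 or SMN2 sequence, and exact substring matching stands in for a real aligner. The names `exact_hits`, `geneA`, and `geneB` are hypothetical.

```python
def exact_hits(read, references):
    """Return the names of the reference sequences that contain the read exactly."""
    return [name for name, seq in references.items() if read in seq]

# Toy references (invented, not real gene sequence): identical except for one base.
references = {
    "geneA": "ACGTTGCAAGGCTTACGATCGGA",
    "geneB": "ACGTTGCAAGGCTTACGATTGGA",
}

shared_read = "TTGCAAGGCTTAC"   # lies entirely inside the identical stretch
specific_read = "CTTACGATCGGA"  # spans the single differing base

ambiguous = exact_hits(shared_read, references)   # matches both references
unique = exact_hits(specific_read, references)    # matches only one reference
```

A read that falls inside the shared stretch cannot be assigned to one gene, which is the situation in which a variant's source cannot be confirmed.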
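3) Because the human reference genome is about 3 Gb (giga base pairs) in size, sequence alignment is time-consuming, which lengthens the variant detection cycle for clinical samples.
Summary of the Invention
The purpose of the invention is to provide a method for homologous pseudogene variant detection that addresses the lack of synchronization between the commonly used reference genome sequence and updated gene sequences, the inaccurate variant detection caused by abnormal alignment in homologous regions, and the long detection cycle of current methods.
To achieve this purpose, the invention proposes the following technical solution: a method for homologous pseudogene variant detection, including the following steps: 1) select true genes from the gene sequences of the NCBI database to construct a reference gene set; 2) randomly obtain raw data of normal samples to create a control set, and align the raw data of the control-set normal samples to the reference gene set to obtain the control-set alignment results; 3) based on the control-set alignment results, perform variant detection on each sample in the control set to construct the control-set variant site frequency data; 4) obtain the raw data of the test sample and align them to the reference gene set to obtain the test-sample alignment results; perform variant site detection on the test-sample alignment results to obtain the test-sample variant site detection results; 5) screen the test-sample variant site detection results site by site against the control-set variant site frequency data to remove false-positive sites and obtain the variant sites of the test sample.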
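In this application, the reference gene set is built independently from the latest updated full-length gene sequences of the National Center for Biotechnology Information (NCBI) database. First, this avoids the lack of synchronization between the currently published human reference genome sequence and the continuously updated gene sequences, and improves the accuracy of variant detection.
In addition, the commonly used GRCh38 reference genome covers all gene sequence information, including true genes, pseudogenes, and homologous genes. When samples are aligned, the pseudogenes and homologous genes present cause false-positive calls at variant sites, or make the source of a variant unidentifiable so that the variant is missed. The reference gene set built in this application from the latest updated full-length gene sequences of the NCBI database extracts only true genes. Because alignment favors the best match, sample reads from true genes generally align to the true genes in the reference gene set, which improves alignment accuracy and effectively avoids the influence of homologous genes or pseudogenes on variant detection in true genes.
Second, the commonly used GRCh38 reference genome contains intergenic and non-functional sequences and is about 3 Gb (giga base pairs) in size, whereas the reference gene set built in this application contains only true gene sequences and is only 1 Gb in size, which greatly improves alignment efficiency and shortens the detection cycle when samples are aligned.
This application sets up a control set. After the control-set raw data are aligned to the reference gene set, variant detection at each reference base of each sample yields the variant site frequencies of the control set. After the test sample is aligned to the reference gene set and variants are detected, errors in experiments, sequencing, and algorithms mean that the variant detection results inevitably contain some false positives. Screening each gene variant site of the test sample against the control-set variant site frequencies removes the false positives and yields the variant sites of the test sample.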
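Further, this application includes the following steps when constructing the reference gene set: 1) download and collect the latest updated full-length gene sequences from the NCBI database and create a text file; 2) create a gene alignment index file; 3) create a gene sequence information file. Only true genes are to be selected when constructing the reference gene set.
Further, after randomly obtaining raw data of normal samples to create the control set, this application first performs quality control on those raw data and then aligns the quality-controlled raw data to the reference gene set. Because of deviations introduced by experimental handling, sequencer runs, and other processes, the raw sequencing data contain invalid sequence data that reduce the accuracy of the analysis results. Quality control of the raw data therefore improves the accuracy of the analysis results and reduces analysis time.
Further, this application includes the following steps for quality control of the sample data: 1) first remove adapter sequences and/or bases with a quality value below 30 at both ends of a read and/or reads containing more than 5 N bases; 2) then remove reads shorter than 35 bp. In step 1, the adapter sequences, the bases with a quality value below 30 at both ends, and the reads with more than 5 N bases may be removed in any order, wholly or in part; once all of them have been removed, high-quality data are obtained.
Further, the alignment of control-set samples to the reference gene set in this application includes the following steps: 1) align the quality-controlled control samples to the reference gene set to obtain the raw alignment result file; 2) sort the raw alignment result file to generate a sorted result file; 3) remove duplicate sequences from the sorted result file to generate a deduplicated sorted result file; 4) perform local realignment and base quality correction on the deduplicated sorted result file to obtain the alignment result.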
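Further, the construction of the control-set variant site frequency data in this application includes the following steps: 1) perform variant detection on each reference base of each control sample in the control set to obtain the variant detection result files of all samples in the control set; 2) merge the variant sites across those files to obtain the variant result file of the control-set population; 3) based on that file, compute the frequency of each variant site to obtain the population variant frequency statistics.
Detecting the variant sites of every sample in the control set and collecting the control-set variant site frequency data gives the variant frequency of every variant site.
Further, after the raw data of the test sample are obtained, they first undergo quality control; the quality-controlled data are then aligned to the reference gene set and site variant detection is performed. The quality control method and purpose for the test sample are the same as for the control-set samples.
Further, each variant site of the test sample is judged against the control-set variant frequency statistics: when the control-set variant frequency at a site is ≥0.5, the corresponding variant site of the test sample is classified as a false-positive site; when the control-set variant frequency is ≥0.1 and <0.5, it is classified as a population polymorphic site; when the control-set variant frequency is <0.5, it is classified as a unique variant site. The variant sites of the test sample are obtained in this way.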
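Beneficial effects:
As the above technical solution shows, the invention provides a method for homologous pseudogene variant detection with the following beneficial effects:
1) By selecting the complete sequences of all genes most recently updated in the NCBI database to construct a new reference genome, the method avoids the lack of synchronization between the currently published human reference genome sequence and the continuously updated gene sequences, and improves the accuracy of variant detection. For example, for a SHANK3 gene variant, the GRCh38 reference genome description chr22:50721359-50721359G>T corresponds to the transcript variant NM_033517.1:exon21:c.3484G>T (p.Glu1162*), in which both the base position c.3484G>T and the amino acid position p.Glu1162* are wrong. After analysis with this workflow, the transcript variant is described correctly as NM_033515.1:exon21:c.3526G>T (p.Glu1176*).
2) For a whole-exome sample with 10 G of sequencing data and a mean sequencing depth of about 100X, alignment to the GRCh38 reference genome takes about 5 to 10 hours. With the reference gene set sequences built in this application, alignment can in practice be shortened to 3 hours, which effectively improves the efficiency of sequence alignment and variant detection and greatly shortens the analysis cycle for clinical samples.
3) The new reference gene set constructed by this method collects the complete sequences of all updated genes, which avoids the influence on alignment of homologous regions or pseudogenes in the human genome used in the prior art. For example, the SMN1/SMN2 gene variant chr5:70925124-70925124C>CA occurs within a homologous gene, and routine analysis workflows miss this variant site. With the analysis workflow of this application, the variant at this site is reported and annotated as SMN1:NM_000344.3:c.22dupA:p.(Ser8Lysfs*23); it is recorded in the clinical database HGMD, where it is described as DM, that is, a harmful variant site.
4) The method builds a control set of normal samples, obtains the variant site frequency data of the control-set samples, and evaluates the variant sites of the test sample against them, which avoids the influence of homologous sequences (including homologous regions, pseudogenes, etc.) on alignment and improves the accuracy of gene variant site calls.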
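It should be understood that all combinations of the foregoing concepts and the additional concepts described in more detail below may be regarded as part of the inventive subject matter of this disclosure, provided that such concepts are not mutually contradictory.
The foregoing and other aspects, embodiments, and features of the teachings of the invention can be understood more fully from the following description taken with the accompanying drawings. Additional aspects of the invention, such as features and/or beneficial effects of the exemplary embodiments, will be apparent from the description below or learned through practice of specific embodiments according to the teachings of the invention.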
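Brief Description of the Drawings
The drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component shown in the various figures may be denoted by the same reference numeral. For clarity, not every component is labeled in every figure. Embodiments of various aspects of the invention will now be described by way of example with reference to the drawings, in which:
Figure 1 shows the key differences in the alignment between the SHANK3 gene sequence and GRCh38;
Figure 2 shows the differences between SMN1 and SMN2;
Figure 3 is an alignment diagram of the variant in the Exon1 region of SMN1 and SMN2 shown in Figure 2;
Figure 4 is a flowchart of the homologous pseudogene variant detection method of the invention;
Figure 5 is a flowchart of gene set construction in the invention;
Figure 6 is a flowchart of quality control of the control-set sample data in the invention;
Figure 7 is a flowchart of alignment of the control-set sample data in the invention;
Figure 8 is a flowchart of constructing the control-set variant site frequency data in the invention;
Figure 9 is a flowchart of variant detection and site screening for the test sample in the invention.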
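Detailed Description
For a better understanding of the technical content of the invention, specific embodiments are described below with reference to the accompanying drawings.
Aspects of the invention are described in this disclosure with reference to the drawings, which show a number of illustrative embodiments. The embodiments of this disclosure are not necessarily intended to include all aspects of the invention. It should be understood that the concepts and embodiments introduced above, and those described in more detail below, may be implemented in any of many ways, because the concepts and embodiments disclosed herein are not limited to any particular implementation. In addition, some aspects disclosed herein may be used alone or in any suitable combination with other aspects disclosed herein.
To address the inaccurate variant detection caused in the prior art by the lack of synchronization between the reference genome sequence and updated gene sequences and by abnormal alignment in homologous regions, as well as the long detection cycle of current methods, in a specific implementation, as shown in Figure 4, the invention proposes a method for homologous pseudogene variant detection, including the following steps: 1) construct a reference gene set (CG-RefGenome) from the latest updated gene sequences of the NCBI database; 2) randomly obtain raw data of normal samples to create a control set (Fastq format files), and align the raw data of the control-set normal samples to the reference gene set to obtain the control-set alignment results (BAM files); 3) based on the control-set alignment results, perform variant detection on each sample in the control set to construct the control-set variant site frequency data (VCF file); 4) obtain the raw data of the test sample (Fastq format file) and align them to the reference gene set to obtain the test-sample alignment result (BAM file); perform variant site detection on the test-sample alignment result to obtain the test-sample variant site detection result (VCF file); 5) screen the test-sample variant site detection result site by site against the control-set variant site frequency data to remove false-positive sites and obtain the gene variant sites of the test sample.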
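In this application, the reference gene set is built independently from the latest updated full-length gene sequences of the NCBI database. First, this avoids the lack of synchronization between the currently published human reference genome sequence and the continuously updated gene sequences, and improves the accuracy of variant detection.
In addition, the commonly used GRCh38 reference genome covers all gene sequence information, including true genes, pseudogenes, and homologous genes. When samples are aligned, the pseudogenes and homologous genes present cause false-positive calls at variant sites, or make the source of a variant unidentifiable so that the variant is missed. The reference gene set built in this application from the latest updated full-length gene sequences of the NCBI database extracts only true genes. Because alignment favors the best match, sample reads from true genes generally align to the true genes in the reference gene set, which improves alignment accuracy and effectively avoids the influence of homologous genes or pseudogenes on variant detection in true genes.
Second, the commonly used GRCh38 reference genome contains intergenic and non-functional sequences and is about 3 Gb (giga base pairs) in size, whereas the reference gene set built in this application contains only true gene sequences and is only 1 Gb in size, which greatly improves alignment efficiency and shortens the detection cycle when samples are aligned.
This application sets up a control set. After the control-set raw data are aligned to the reference gene set, variant detection at each reference base of each sample yields the variant site frequencies of the control set. After the test sample is aligned to the reference gene set and variants are detected, errors in experiments, sequencing, and algorithms mean that the gene variant detection results inevitably contain some false positives. Screening each gene variant site of the test sample against the control-set variant site frequencies removes the false positives and yields high-quality variant sites.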
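In a specific implementation, as shown in Figure 5, this application includes the following steps when constructing the reference gene set: 1) First, collect the latest updated full-length gene sequences from the NCBI database: download the gene sequence source files, decompress and merge them, and then format the merged file to obtain a reference gene sequence file in fasta format in which every sequence line has the same length. 2) Create a gene alignment index file: sample sequences are aligned to the reference gene sequences with the mem module of the bwa software tool, which uses an alignment algorithm based on the Burrows-Wheeler transform (BWT), a block-sorting compression technique, so the fasta file of the reference gene sequences must be indexed. The index module of the bwa tool is therefore used in this application to process the reference gene sequence file and create the gene alignment index file. 3) Create a dictionary to obtain the gene sequence information files: the fai file and the dict file are files that the GATK tools depend on for base variant detection. Therefore, in this application, samtools and picard are used to create, for the reference gene sequence file, gene sequence information files comprising a fai file and a dict file.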
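The three construction steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the file name `CG-RefGenome.fa`, the 60-character line width, the helper names, and the availability of a `picard` wrapper command on the PATH are all assumptions. The sketch only builds the command lines; it does not run them.

```python
def wrap_fasta(records, width=60):
    """Yield FASTA lines so that every sequence line has the same length
    (only the last line of a record may be shorter)."""
    for name, seq in records:
        yield f">{name}"
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

def index_commands(fasta_path):
    """Return the indexing commands as argument lists; nothing is executed here."""
    dict_path = fasta_path.rsplit(".", 1)[0] + ".dict"
    return [
        ["bwa", "index", fasta_path],        # BWT index used by bwa mem
        ["samtools", "faidx", fasta_path],   # writes the .fai file
        # Assumes a `picard` wrapper script; legacy KEY=VALUE argument style.
        ["picard", "CreateSequenceDictionary", f"R={fasta_path}", f"O={dict_path}"],
    ]

# Toy record (invented sequence) wrapped at 10 characters per line.
lines = list(wrap_fasta([("toy_gene", "ACGT" * 4)], width=10))
commands = index_commands("CG-RefGenome.fa")  # hypothetical file name
```

Each command list could be passed to `subprocess.run(cmd, check=True)` once the FASTA file has been written.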
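In a specific implementation, this application randomly obtains raw data (FASTQ format) from no fewer than 30 normal samples to create the control set, uses the cutadapt software to perform quality control on the raw data of the control-set normal samples, and then aligns the quality-controlled raw data to the reference gene set. Because of deviations introduced by experimental handling, sequencer runs, and other processes, the raw sequencing data contain invalid sequence data such as primer sequences, erroneous sequences, noise sequences, and low-quality sequences. These data contribute nothing to subsequent analysis and instead reduce the accuracy of the analysis results. Quality control of the raw data therefore removes residual primer sequences and filters out low-quality and erroneous sequences to yield clean, valid sequence data, which improves the accuracy of the analysis results, saves some computing resources, and reduces analysis time.
In a specific implementation, as shown in Figure 6, this application includes the following steps when performing quality control on the sample data in the control set: 1) first remove adapter sequences and/or bases with a quality value below 30 at both ends of a read and/or reads containing more than 5 N bases; 2) then remove reads shorter than 35 bp (base pairs). In step 1, the adapter sequences, the bases with a quality value below 30 at both ends, and the reads with more than 5 N bases may be removed in any order, wholly or in part; once all of them have been removed, high-quality data are obtained.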
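The quality-control thresholds above can be illustrated with a simplified pure-Python filter. This is a sketch only: the source performs this step with cutadapt, whose trimming algorithm differs, adapter removal is omitted here, and Phred+33 quality encoding is assumed. The function names are hypothetical.

```python
def trim_low_quality_ends(seq, qual, min_q=30, offset=33):
    """Trim bases whose Phred quality is below min_q from both ends of a read.
    Assumes Phred+33 encoded quality strings."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]

def quality_control(seq, qual, max_n=5, min_len=35):
    """Return the trimmed read, or None if it has more than max_n N bases
    or is shorter than min_len after trimming."""
    seq, qual = trim_low_quality_ends(seq, qual)
    if seq.upper().count("N") > max_n:
        return None
    if len(seq) < min_len:
        return None
    return seq, qual

# Toy reads: '#' encodes Phred 2 and 'I' encodes Phred 40 under Phred+33.
kept = quality_control("AC" + "G" * 36 + "TT", "##" + "I" * 36 + "##")
too_short = quality_control("G" * 30, "I" * 30)
too_many_n = quality_control("N" * 6 + "G" * 40, "I" * 46)
```

The first read loses its two low-quality bases at each end and is kept at 36 bases; the other two reads are discarded.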
具体实施时,如图7本申请中对照集样本与参考基因集进行数据比对包括以下步骤:1)将质控后的对照样本(Clean Fastq格式)基于bwa软件与原始参考基因集进行对比,获得原始比对结果文件(raw.bam);2)对所述原始比对结果文件进行排序,产生排序结果文件(sort.bam);3)对所述排序结果文件进行去除重复序列处理,产生去重排序结果文件;4)对所述去重排序结果文件进行局部重排和碱基质量矫正,得到对比结果。
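First, alignment locates the exact coordinates of each short sequencing read of the sample in the reference genome. The alignment result file produced for the short reads records their coordinates in arbitrary order, so each read must be sorted according to the base numbering of the reference chromosomes. Subsequent variant detection examines each base in chromosome order to decide whether it is a variant, which makes sorting the raw alignment result file and writing the sorted file an essential step. High-throughput sequencing includes a sequence amplification step that copies each fragment and produces duplicate reads. These duplicates are not genuine sequences from the genome and must be removed. In addition, samples contain insertion and deletion variants, which disturb the alignment of bases in the neighboring region and cause false positives in subsequent variant detection. The reads in such regions must therefore be realigned in advance to obtain a correct and reasonable base alignment.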
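After the sample reads have been aligned to the reference genome, many reads cover each position and the base quality values in those reads differ. The base quality values therefore need to be recalibrated once to improve the accuracy of subsequent variant detection.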
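To make the sorting and duplicate-removal steps concrete, here is a toy Python sketch that operates on in-memory alignment records rather than BAM files. The `Alignment` record, its fields, and the rule of keeping the highest-quality read per start position and strand are simplifying assumptions; production tools such as samtools and Picard use richer criteria.

```python
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read_name: str
    contig: str        # gene or chromosome the read aligned to
    pos: int           # leftmost aligned position
    strand: str        # "+" or "-"
    mean_quality: float


def coordinate_sort(
    alignments: List[Alignment], contig_order: List[str]
) -> List[Alignment]:
    """Sort reads by reference order of contigs, then by position."""
    rank = {name: i for i, name in enumerate(contig_order)}
    return sorted(alignments, key=lambda a: (rank[a.contig], a.pos))


def remove_duplicates(sorted_alignments: List[Alignment]) -> List[Alignment]:
    """Keep one read per (contig, position, strand): the highest-quality one."""
    best: Dict[Tuple[str, int, str], Alignment] = {}
    for aln in sorted_alignments:
        key = (aln.contig, aln.pos, aln.strand)
        if key not in best or aln.mean_quality > best[key].mean_quality:
            best[key] = aln
    # Preserve the sorted order while dropping the duplicate copies.
    return [a for a in sorted_alignments
            if best[(a.contig, a.pos, a.strand)] is a]
```

Sorting before deduplication mirrors the order of the steps in the text and keeps reads that start at the same position adjacent to one another.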
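In a specific implementation, as shown in Figure 8, constructing the control-set variant site frequency data comprises the following steps: 1) performing variant detection at each reference base of each control sample in the control set to obtain variant detection result files for all samples in the control set; 2) merging the variant sites across these per-sample result files to obtain a variant result file for the control-set population; 3) computing frequency statistics for each variant site from the population variant result file to obtain the mutation frequency statistics of the control-set population.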
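Detecting the variant sites of every sample in the control set and compiling the control-set variant site frequency data gives the variant frequency of every variant site.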
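A minimal Python sketch of the merge-and-count step follows. It assumes that the frequency of a site is the fraction of control samples carrying the variant; the source does not state whether sample-level or allele-level frequency is intended. The `Variant` tuple layout and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# (gene or contig, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def control_set_frequencies(
    per_sample_variants: Iterable[Set[Variant]],
) -> Dict[Variant, float]:
    """Merge per-sample variant calls and return each site's carrier fraction."""
    samples = list(per_sample_variants)
    if not samples:
        raise ValueError("control set is empty")
    # Each sample contributes at most once per variant because it is a set.
    counts = Counter(v for sample in samples for v in sample)
    return {v: n / len(samples) for v, n in counts.items()}
```

With the minimum of 30 control samples mentioned in the text, the smallest non-zero frequency this definition can produce is 1/30.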
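In a specific implementation, as shown in Figure 9, after the raw data of the test sample are obtained, quality control is first performed on them; the quality-controlled test-sample data are then aligned against the reference gene set and variant sites are detected. The quality control method and its purpose are the same for the test sample as for the control-set samples: residual primer sequences are removed and low-quality and erroneous sequences are filtered out to obtain clean, valid sequence data, which improves the accuracy of the analysis, saves computing resources, and reduces analysis time.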
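In a specific implementation, each variant site of the test sample is judged against the control-set mutation frequency statistics for the corresponding site: when the control-set variant frequency at a site is ≥0.5, the corresponding variant site of the test sample is classified as a false-positive site; when the control-set variant frequency at a site is ≥0.1 and <0.5, the corresponding variant site is classified as a population polymorphism site; when the control-set variant frequency at a site is <0.1, the corresponding variant site is classified as a unique variant site. High-quality variant sites are obtained in this way.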
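The screening rule can be sketched in Python as below. The sketch assumes that the unique-variant class covers control-set frequencies below 0.1, so that the three classes do not overlap, and that a site absent from the control set has frequency 0. The label strings and function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (gene or contig, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def classify_site(control_frequency: float) -> str:
    """Classify a test-sample variant by its frequency in the control set."""
    if control_frequency >= 0.5:
        return "false_positive"
    if control_frequency >= 0.1:
        return "population_polymorphism"
    return "unique_variant"


def screen_variants(
    sample_variants: Iterable[Variant],
    control_frequencies: Dict[Variant, float],
) -> Dict[Variant, str]:
    """Drop false-positive sites and label the remaining variant sites."""
    kept: Dict[Variant, str] = {}
    for variant in sample_variants:
        # Assumption: a site never seen in the control set has frequency 0.
        freq = control_frequencies.get(variant, 0.0)
        label = classify_site(freq)
        if label != "false_positive":
            kept[variant] = label
    return kept
```

Checking the highest threshold first means each frequency falls into exactly one class.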
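Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit the invention. Those of ordinary skill in the art to which the invention pertains may make various changes and modifications without departing from the spirit and scope of the invention. The scope of protection of the invention shall therefore be as defined by the claims.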

Claims (9)

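  1. A method for detecting homologous pseudogene variants, characterized by comprising the following steps:
    selecting true genes from the gene sequences of the NCBI database to construct a reference gene set;
    randomly obtaining raw data of normal samples to create a control set, and aligning the raw data of the control-set normal samples against the reference gene set to obtain control-set alignment results;
    based on the control-set alignment results, performing variant detection on each sample in the control set and constructing control-set variant site frequency data;
    obtaining raw data of a test sample, and aligning the raw data of the test sample against the reference gene set to obtain a test-sample alignment result; performing variant site detection on the test-sample alignment result to obtain test-sample variant site detection results;
    screening the test-sample variant site detection results site by site against the control-set variant site frequency data and removing false-positive sites to obtain the gene variant sites of the test sample.
  2. The method for detecting homologous pseudogene variants according to claim 1, characterized in that constructing the reference gene set comprises the following steps:
    collecting full-length gene sequences from the NCBI database and creating a text file;
    creating alignment index files;
    creating gene sequence information files.
  3. The method for detecting homologous pseudogene variants according to claim 2, characterized in that, after the raw data of normal samples are randomly obtained to create the control set, quality control is first performed on the raw data of the control-set normal samples.
  4. The method for detecting homologous pseudogene variants according to claim 3, characterized in that the quality control comprises the following steps:
    removing adapter sequences, and/or bases with a quality value below 30 at either end of a read, and/or reads containing more than 5 N bases;
    discarding reads shorter than 35 base pairs.
  5. The method for detecting homologous pseudogene variants according to claim 4, characterized in that the alignment against the reference gene set comprises the following steps:
    aligning the quality-controlled control-set samples against the original reference gene set to obtain a raw alignment result file;
    sorting the raw alignment result file to produce a sorted result file;
    removing duplicate reads from the sorted result file to produce a deduplicated sorted result file;
    performing local realignment and base quality recalibration on the deduplicated sorted result file to obtain the alignment result.
  6. The method for detecting homologous pseudogene variants according to claim 5, characterized in that constructing the control-set variant site frequency data comprises the following steps:
    performing variant detection at each reference base of each control sample in the control set to obtain variant detection result files for all samples in the control set;
    merging the variant sites across the variant detection result files of all samples in the control set to obtain a variant result file for the control-set population;
    computing frequency statistics for each variant site from the variant result file of the control-set population to obtain control-set mutation frequency statistics.
  7. The method for detecting homologous pseudogene variants according to any one of claims 1 to 6, characterized in that, after the raw data of the test sample are obtained, quality control is first performed on the raw data of the test sample.
  8. The method for detecting homologous pseudogene variants according to claim 7, characterized in that the test-sample variant results are compared against the control-set mutation frequency statistics for site screening, and false positives are removed to obtain the variant sites of the test sample.
  9. The method for detecting homologous pseudogene variants according to claim 8, characterized in that the site screening judges each site according to the control-set mutation frequency statistics: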
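    when the control-set variant frequency at a site is ≥0.5, the corresponding variant site of the test sample is classified as a false-positive site; when the control-set variant frequency at a site is ≥0.1 and <0.5, the corresponding variant site of the test sample is classified as a population polymorphism site;
    when the control-set variant frequency at a site is <0.1, the corresponding variant site of the test sample is classified as a unique variant site.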
PCT/CN2020/092903 2019-12-20 2020-05-28 Method for detecting homologous pseudogene variants WO2021120529A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911328534.6 2019-12-20
CN201911328534.6A CN111081315B (zh) 2019-12-20 2019-12-20 Method for detecting homologous pseudogene variants

Publications (1)

Publication Number Publication Date
WO2021120529A1 true WO2021120529A1 (zh) 2021-06-24

Family

ID=70316422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092903 WO2021120529A1 (zh) 2019-12-20 2020-05-28 Method for detecting homologous pseudogene variants

Country Status (2)

Country Link
CN (1) CN111081315B (zh)
WO (1) WO2021120529A1 (zh)

Also Published As

Publication number Publication date
CN111081315A (zh) 2020-04-28
CN111081315B (zh) 2023-06-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20903849

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20903849

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20-01-2023)
