WO2023124779A1 - Third-generation sequencing data analysis method and device for point mutation detection - Google Patents

Third-generation sequencing data analysis method and device for point mutation detection

Info

Publication number
WO2023124779A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
point mutation
short
data
analysis
Prior art date
Application number
PCT/CN2022/136275
Other languages
French (fr)
Chinese (zh)
Inventor
郎继东
孙继国
Original Assignee
成都齐碳科技有限公司
Priority date
Filing date
Publication date
Application filed by 成都齐碳科技有限公司
Publication of WO2023124779A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis

Definitions

  • the application belongs to the field of sequencing technology and bioinformatics analysis of sequencing data, and in particular relates to a method for detecting point mutations based on third-generation sequencing data, and the application also relates to a device for detecting point mutations based on third-generation sequencing data.
  • a point mutation is a change in only one base pair.
  • In the broad sense, point mutations can be base substitutions, single-base insertions or base deletions; in the narrow sense, a point mutation is a single-base substitution.
  • Base substitutions are further divided into two types: transitions and transversions.
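The distinction between transitions and transversions can be illustrated with a short sketch. It relies on the standard textbook definition (a transition swaps a purine for a purine or a pyrimidine for a pyrimidine; a transversion swaps one class for the other), which the text above does not spell out; the function name `classify_substitution` is hypothetical.

```python
# Standard nucleotide classes (textbook definition, not taken from the application).
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("ref and alt must each be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical; this is not a substitution")
    # Both bases in the same chemical class -> transition; otherwise transversion.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, `classify_substitution("A", "G")` is a transition, while `classify_substitution("A", "T")` is a transversion.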
  • common methods for detecting gene point mutations include PCR, Sanger sequencing (first-generation sequencing) and next-generation sequencing.
  • the PCR method is highly sensitive and technically mature, but each primer pair can detect only one mutation, many samples and sites cannot be tested at the same time, and throughput is low.
  • Sanger sequencing is inexpensive, but it requires a large amount of sample and has low sensitivity for low-frequency mutations.
  • Next-generation sequencing offers high throughput, and its cost is falling year by year, but the tools commonly used to detect point mutations either have limited specificity (such as Varscan), have low sensitivity for low-frequency variants (such as Mutect), or use a local assembly step that makes run times too long (such as Mutect2), so they do not meet the needs of point mutation detection well.
  • Third-generation sequencing technology, also known as single-molecule real-time DNA sequencing technology, is a technology that sequences each DNA molecule individually without PCR amplification.
  • the principles of third-generation sequencing technology fall into two main categories: single-molecule fluorescence sequencing, represented by PacBio's SMRT technology, and nanopore sequencing, represented by the nanopore electrophoresis technology of Oxford Nanopore and Qitan Technology.
  • One main technical characteristic of third-generation sequencing is that it achieves the intrinsic reaction speed of DNA polymerase: it can read 10 bases per second, 20,000 times faster than chemical sequencing. A second is that it achieves the intrinsic processivity of DNA polymerase, so a single reaction can read very long sequences: second-generation sequencing reads hundreds of bases, whereas third-generation sequencing reads thousands. Furthermore, third-generation sequencing needs no PCR amplification or chemical labeling when sequencing DNA or RNA molecules in real time, which avoids erroneous mutations introduced during handling and gives high fidelity; the sequencing speed can reach 450 bp/s for DNA and 70 nt/s for RNA, and reads can reach ultra-long lengths of several megabases.
  • the purpose of this application is to address the shortcomings of related technologies and provide an analysis method for detecting point mutations based on third-generation sequencing data.
  • the method provided by this application addresses the above problems at the data analysis level: based on the data characteristics, it effectively avoids the false negatives caused by low alignment rates that result from random indels or high sequencing error.
  • the design combines the theoretical view that bases are "more accurate in the middle, worse on both sides" of a sequencing read,
  • UMI/UID molecular biological labels
  • the present application provides an analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:
  • step 1) a short sequence with a fixed length L is extracted N times, such that the position of the point mutation to be detected on each extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D, wherein N, D and L are all integers; a first sequence subset is finally obtained, which includes N short sequences containing the point mutation to be detected;
  • step 2) extracting seed sequences from the first sequence subset of step 1), the extraction positions being the M bases at the beginning and at the end of each short sequence, to obtain a second sequence subset, which includes N pairs of seed sequences of length M; the seed sequences do not contain the point mutation to be detected;
  • step 4) using the seed sequences of the second sequence subset obtained in step 2) to extract target sequences from the original data set obtained in step 3), obtaining N data sets containing the target sequence;
  • each result includes the mutation frequency F of the site to be detected, the number AO of reads supporting the point mutation, and the sequencing depth DP at the point mutation position;
  • D represents the base distance between positions of point mutations in any extracted sequence.
  • the fixed distance D can be any integer greater than 1. Without being bound by any particular theory, those skilled in the art can set the value of D as needed, for example 5 ≤ D ≤ 20 or 8 ≤ D ≤ 15; for example, D can be any integer between 5 and 20.
  • D0 can be understood as the position of the point mutation to be detected on the short sequence obtained by the first extraction; for example, D0 can be the first, second, third or fourth base of that short sequence, and so on; in an optional embodiment, D0 ≤ L/4 and/or D0 ≥ D, for example, D0 may be D, D+1, D+2, etc.
  • for example, the positions of the point mutation to be detected are the 11th, 21st and 31st bases, and so on, on the successively extracted short sequences; here D0 is 11, D is 10, and X, the index of the extraction, takes the values 1, 2 and 3.
  • in step 1), the number of extractions N needs to be determined according to the fixed length L and the fixed distance D.
  • when N is an even number, two of the N short sequences obtained place the point mutation to be detected at, or closest to, the middle of the short sequence compared with its position on the other short sequences; when N is an odd number, one of the N short sequences does so.
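The position arithmetic described above can be sketched as follows. The sketch assumes 1-based positions and that the X-th extraction places the mutation at D0 + (X - 1)·D, which matches the 11th/21st/31st example. The text does not state how N is derived from L and D, so the rule in `max_extractions` (keep the mutation at least `margin` bases away from the end of the short sequence) is an assumption, as are all function names and the value L = 101 used in the usage note.

```python
def mutation_positions(d0: int, d: int, n: int) -> list[int]:
    """1-based position of the mutation on the X-th short sequence, X = 1..n."""
    return [d0 + (x - 1) * d for x in range(1, n + 1)]


def max_extractions(length: int, d0: int, d: int, margin: int = 0) -> int:
    """Largest N whose last mutation position is <= length - margin.

    Assumed rule, not taken from the application text.
    """
    last_allowed = length - margin
    if d0 > last_allowed:
        return 0
    return (last_allowed - d0) // d + 1


def most_central(positions: list[int], length: int) -> list[int]:
    """1-based extraction indices whose mutation lies closest to the middle."""
    middle = (length + 1) / 2
    best = min(abs(p - middle) for p in positions)
    return [x for x, p in enumerate(positions, start=1) if abs(p - middle) == best]
```

With the illustrative values L = 101, D0 = 11, D = 10 and a 10-base margin, `max_extractions` gives N = 9, and `most_central` reports the 5th extraction, where the mutation sits at position 51.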
  • the fixed length L of each short sequence can be chosen as needed, from as short as 35 bp to as long as 250 bp, optionally 76-151 bp.
  • M may be any suitable integer; based on practical considerations, M may be 2, 3, 4 or 5, and optionally M ≥ 5.
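Steps 1) and 2) can be sketched together under stated assumptions: the reference is held as a plain string, the mutation site is given as a 0-based index into it, and positions on a short sequence are 1-based. The function names and the boundary check are illustrative, not the application's implementation.

```python
def extract_short_sequences(reference: str, site: int, length: int,
                            d0: int, d: int, n: int) -> list[str]:
    """Cut n windows of `length` bases around `site` (0-based reference index).

    On the X-th window the site sits at 1-based position d0 + (X - 1) * d.
    """
    shorts = []
    for x in range(1, n + 1):
        pos = d0 + (x - 1) * d       # 1-based position of the site in this window
        start = site - (pos - 1)     # 0-based window start on the reference
        end = start + length
        if start < 0 or end > len(reference):
            raise ValueError("window falls outside the reference")
        shorts.append(reference[start:end])
    return shorts


def seed_pair(short_seq: str, m: int, pos: int) -> tuple[str, str]:
    """Return the first and last m bases of a short sequence.

    `pos` is the 1-based mutation position; the check enforces the stated
    condition that neither seed contains the point mutation.
    """
    if not (m < pos <= len(short_seq) - m):
        raise ValueError("a seed of this length would contain the mutation site")
    return short_seq[:m], short_seq[-m:]
```

Each short sequence then yields one (head, tail) seed pair, giving N pairs in total.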
  • the raw data is long-read data obtained by nanopore sequencing.
  • data preprocessing is performed on the original third-generation sequencing data, including using Porechop and NanoFilt to remove the adapter and barcode sequences added during experimental library construction and to filter out low-quality and too-short sequencing reads, to obtain the desired original data set (clean data).
  • the low-quality threshold includes but is not limited to Q5; for example, the threshold may be Q7 or higher, where Q represents the average quality value of a sequencing read, that is, the per-base accuracy values of the read summed and averaged. Those skilled in the art know that the threshold can be adjusted to the actual situation. For the specific parameters, see https://en.wikipedia.org/wiki/FASTQ_format, which is incorporated herein by reference.
  • the sequence length threshold of too short sequencing reads includes but is not limited to 100 bp; for example, the threshold can be 50 bp, 200 bp, 300 bp, etc. Those skilled in the art can adjust the threshold according to actual conditions.
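The quality and length thresholds above can be illustrated with a minimal filter. This is a sketch, not the behavior of Porechop or NanoFilt: it assumes Phred+33 encoded FASTQ quality strings and interprets the "average quality" as the Phred score of the mean per-base error probability, which is one possible reading of the text. The function names and defaults (Q7, 100 bp) are illustrative choices taken from the example values mentioned above.

```python
import math


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Mean Phred quality, computed by averaging per-base error probabilities."""
    if not quality_string:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def keep_read(sequence: str, quality_string: str,
              min_quality: float = 7.0, min_length: int = 100) -> bool:
    """True if the read passes both the length and the mean-quality threshold."""
    if len(sequence) < min_length:
        return False
    return mean_read_quality(quality_string) >= min_quality
```

Averaging error probabilities rather than raw Phred scores keeps a few very poor bases from being hidden by many good ones; a plain arithmetic mean of the scores would be the other obvious choice.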
  • in step 4), considering the interference from the characteristics of third-generation sequencing data, the length L' of the extracted target sequence is limited to L' ≤ L+50.
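A seed-pair extraction with the length limit L' ≤ L + 50 might look like the sketch below. It uses exact string matching on the forward strand only, which is a simplification: real nanopore reads contain errors and may come from either strand, and the text does not say how seeds are matched. The function name and the choice of the farthest admissible tail seed are assumptions.

```python
def extract_targets(read: str, head_seed: str, tail_seed: str,
                    length: int, slack: int = 50) -> list[str]:
    """Return substrings of `read` that start with head_seed, end with tail_seed
    and are at most length + slack bases long (exact matching, forward strand)."""
    max_len = length + slack
    targets = []
    start = read.find(head_seed)
    while start != -1:
        search_from = start + len(head_seed)
        window_end = min(len(read), start + max_len)
        # Farthest tail seed that still fits entirely inside the allowed window.
        tail_at = read.rfind(tail_seed, search_from, window_end)
        if tail_at != -1:
            targets.append(read[start:tail_at + len(tail_seed)])
        start = read.find(head_seed, start + 1)
    return targets
```

Running this once per seed pair over the clean reads would give the N target-sequence data sets; the extracted fragments are short enough to be handled like short-read data.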
  • the N data sets containing the target sequence, obtained after the foregoing steps of the present application, can be analyzed with standard or mature mainstream pipelines for point mutation analysis of next-generation sequencing data, such as GATK Best Practice.
  • point mutation detection and analysis are performed on the N data sets containing the target sequence, and N results are obtained; each result includes the mutation frequency F, the number AO of reads supporting the point mutation, and the sequencing depth DP at the point mutation position.
  • the results of the first data set include the mutation frequency F_1, the read support number AO_1 of the point mutation, and the sequencing depth DP_1 of the point mutation position;
  • the results of the second data set include the mutation frequency F_2, the read support number AO_2 of the point mutation, and the sequencing depth DP_2 of the point mutation position;
  • the results of the Nth data set include the mutation frequency F_N, the read support number AO_N of the point mutation, and the sequencing depth DP_N of the point mutation position.
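The per-data-set results can be held in a small record type. The text lists F, AO and DP side by side without stating how they relate; the sketch below assumes the conventional relation F = AO / DP, and the class name `SiteResult` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteResult:
    """Result for one of the N data sets at the site to be detected."""
    ao: int  # number of reads supporting the point mutation
    dp: int  # sequencing depth at the point mutation position

    @property
    def frequency(self) -> float:
        """Mutation frequency, assuming F = AO / DP (0.0 when there is no coverage)."""
        return self.ao / self.dp if self.dp > 0 else 0.0
```

A list of N such records, one per seed pair, is then the input to the weighting step.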
  • step 7 the formula is
  • the inventor combined the theoretical view that bases are "more accurate in the middle, worse on both sides" of a sequencing read, the idea of molecular biological labels (UMI/UID) at the data analysis level, and "weight" statistics
  • the method of the present application includes the following steps:
  • step 1) a short sequence of fixed length L is extracted N times; in the short sequence obtained by the first extraction, the position of the point mutation to be detected is D0, and between successive short sequences the position of the point mutation to be detected on the extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D; a first sequence subset is finally obtained, which includes N short sequences containing the point mutation to be detected;
  • L is any integer between 76 and 151 (bp)
  • D is any integer between 8 and 15
  • N is any integer between 4 and 18
  • D0 is any integer between 5 and 14;
  • step 2) extracting a seed sequence pair from each sequence in the first sequence subset obtained in step 1), the extraction positions being the M bases at each end of each sequence, finally obtaining a second sequence subset of N seed sequence pairs, where 5 ≤ M ≤ D0;
  • step 4) according to the seed sequence pairs obtained in step 2), extracting the corresponding target sequences from the original data set obtained in step 3); considering the interference from the characteristics of third-generation sequencing data, the length of the extracted target sequence is limited to L' ≤ L+50; finally, N data sets containing the target sequences extracted according to the seed sequence pairs are obtained;
  • step 5) performing point mutation detection and analysis on the N data sets containing the target sequence obtained in step 4), using analysis pipelines such as, but not limited to, GATK Best Practice, to obtain the final detection results for the N target-site data sets; for each, the detected mutation frequency is recorded as F_N, the number of reads supporting the mutation at the site as AO_N, and the sequencing depth at the position as DP_N;
  • step 7) weighting and error-correcting the target point mutation results and their frequencies obtained in step 5), defined
  • F_correct is the final detected mutation frequency of the site
  • the present application also provides a device for detecting point mutations based on third-generation sequencing data, wherein the device includes:
  • a seed sequence extraction module, configured to obtain a second sequence subset comprising seed sequence pairs
  • a preprocessing module, used to preprocess the third-generation sequencing data to obtain an original data set of the expected quality
  • a primary analysis module, used to extract data sets containing the target sequence from the preprocessed original data set using the seed sequence pairs of the second sequence subset, and then to perform point mutation detection and analysis and obtain data;
  • an advanced analysis module, used to further weight and correct the obtained results and obtain the final analysis results
  • a reporting module, used to output results based on the data.
  • the seed sequence extraction module is used to extract, from the reference genome, a first sequence subset comprising N short sequences containing the point mutation to be detected, and then to extract from the first sequence subset a second sequence subset comprising seed sequence pairs; wherein the seed sequence pairs are obtained according to the data processing method described in this application.
  • the preprocessing module is used to filter low-quality and too short sequencing reads, and may include, for example, Porechop software and NanoFilt software.
  • the data obtained by the primary analysis module has similar characteristics to the second-generation NGS sequencing data, and standard or mature mainstream analysis processes for point mutation analysis of NGS data, such as GATK Best Practice, can be used.
  • the advanced analysis module includes a program or software for assigning weight to each result.
  • the weight assignment follows the theoretical view that bases are "more accurate in the middle, worse on both sides" of a sequencing read, the idea of molecular biological labels (UMI/UID) at the data analysis level, and the method of "weight" statistics.
  • the inventors of the present application have, at the data analysis level, addressed the problems that point mutation detection from third-generation sequencing data is limited by sequencing quality and by the alignment algorithm or the data distribution of the deep learning training set relied upon, that the applicable scenarios are not broad enough, and that robustness is insufficient.
  • the method of this application is compatible with the current standard or mature mainstream pipelines for point mutation analysis of second-generation sequencing data, such as GATK Best Practice. It enriches the technical means available for point mutation analysis of third-generation sequencing data and largely resolves the insufficient accuracy of point mutation detection by third-generation sequencing. While making full use of the long read lengths of third-generation sequencing data, it also further promotes the application of third-generation sequencing in scientific research, and it is especially suitable for mutation detection targeting hotspot panels.
  • FIG. 1 shows a flow chart of an analysis method for detecting point mutations based on third-generation sequencing data in one embodiment of the present application
  • FIG. 2 is a structural block diagram of a device for detecting point mutations based on third-generation sequencing data in one embodiment of the present application.
  • the present application provides an analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:
  • S1: a short sequence with a fixed length L is extracted N times, such that the position of the point mutation to be detected on each extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D, wherein N, D and L are all integers; a first sequence subset is finally obtained, which includes N short sequences containing the point mutation to be detected;
  • S2: extract seed sequences from the first sequence subset of S1, the extraction positions being the M bases at the beginning and at the end of each short sequence, to obtain a second sequence subset, which includes N pairs of seed sequences of length M; the seed sequences do not contain the point mutation to be detected;
  • S3: preprocess the original third-generation sequencing data to obtain an original data set of the expected quality
  • each result includes the mutation frequency F of the site to be detected, the number AO of reads supporting the point mutation, and the sequencing depth DP at the point mutation position;
  • a device for detecting point mutations based on third-generation sequencing data includes: a seed sequence extraction module 101 for obtaining the second sequence subset comprising seed sequence pairs; a preprocessing module 102 for preprocessing the third-generation sequencing data to obtain an original data set of the expected quality; a primary analysis module 103 for using the seed sequence pairs of the second sequence subset to extract data sets containing the target sequence from the preprocessed original data set and then performing point mutation detection and analysis to obtain data; an advanced analysis module 104 for further weighting and correcting the obtained results to obtain the final analysis result; and a report module 105 for outputting a result according to the data.
  • the seed sequence extraction module is used to extract, from the reference genome, a first sequence subset comprising N short sequences containing the point mutation to be detected, and then to extract from the first sequence subset the second sequence subset comprising seed sequence pairs; wherein the seed sequence pairs are obtained according to the data processing method described in this application.
  • the preprocessing module is used to filter low-quality and too-short sequencing reads, and may include, for example, Porechop software and NanoFilt software.
  • the data obtained by the primary analysis module has characteristics similar to second-generation (NGS) sequencing data, so standard or mature mainstream pipelines for point mutation analysis of NGS data, such as GATK Best Practice, can be used.
  • the advanced analysis module includes a program or software for assigning weight to each result.
  • the weight assignment follows the theoretical view that bases are "more accurate in the middle, worse on both sides" of a sequencing read, the idea of molecular biological labels (UMI/UID) at the data analysis level, and the method of "weight" statistics.
  • Embodiment 1: analyzing data using the method of the present application
  • the standard sample containing BRAF-V600E, EGFR-L858R, EGFR-T790M, KRAS-G13D and AKT1-E17K and the negative control standard sample NA12878 each underwent experimental library preparation in triplicate and were sequenced on the QNome-9604 nanopore sequencer, yielding six original long-read sequencing data sets, of which HUM964, HUM965 and HUM966 were positive control data and HUM967, HUM968 and HUM969 were negative control data.
  • a seed sequence pair is extracted for each set of short sequence fragments, the extraction positions being the 10 bases at the beginning and at the end of each short sequence for each target site, finally giving 9 sets of short-sequence seed pairs for the target sites.
  • step 5: from the clean data obtained in step 4, extract the corresponding target sequences according to the short-sequence seed pairs obtained in step 3; considering the interference from the characteristics of third-generation sequencing data, the target sequence length is limited to L' ≤ 151, finally giving nine target sequence data sets extracted according to the seed sequence pairs.
  • step 6: perform point mutation detection and analysis on the 9 data sets obtained in step 5.
  • GATK Best Practice is used to detect point mutations, giving the final detection results for the 9 target-site data sets; for each, the detected mutation frequency is recorded as F_N, the number of reads supporting the mutation at the site as AO_N, and the sequencing depth at the position as DP_N.
  • step 7: weight and error-correct the target point mutation results and frequencies obtained in step 6, defined
  • F_correct is the final detected mutation frequency of the site; if F_correct ≥ 1%, the site is positive, otherwise it is negative.
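The weighting formula itself is not reproduced in this text, so the following is only an illustrative sketch of the general idea ("more accurate in the middle, worse on both sides"), not the application's formula. It assumes a triangular weight that peaks when the mutation sits at the middle of the short sequence, and a weighted pooling of AO and DP; both choices, and all function names, are hypothetical. Only the 1% positivity threshold comes from the text above.

```python
def triangular_weights(positions: list[int], length: int) -> list[float]:
    """Hypothetical weights: largest when the mutation is at the middle of the
    short sequence, decreasing linearly toward either end; normalized to sum to 1."""
    middle = (length + 1) / 2
    raw = [1.0 - abs(p - middle) / middle for p in positions]
    total = sum(raw)
    return [w / total for w in raw]


def corrected_frequency(results: list[tuple[int, int]], weights: list[float]) -> float:
    """Weighted pooled frequency from (AO, DP) pairs (assumed form, for illustration)."""
    if len(results) != len(weights):
        raise ValueError("one weight is needed per result")
    supporting = sum(w * ao for w, (ao, _) in zip(weights, results))
    depth = sum(w * dp for w, (_, dp) in zip(weights, results))
    return supporting / depth if depth > 0 else 0.0


def call_site(f_correct: float, threshold: float = 0.01) -> str:
    """Positive when the corrected frequency reaches the 1% threshold."""
    return "positive" if f_correct >= threshold else "negative"
```

With nine data sets whose mutation positions are 11, 21, ..., 91 on a 101-base short sequence (illustrative values), the fifth data set receives the largest weight.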
  • Nano2NGS denotes the method described in this application. The data in Table 1 show that, using the method of this application, the BRAF-V600E, EGFR-L858R, EGFR-T790M, KRAS-G13D and AKT1-E17K mutations were detected in all three replicates, with good reproducibility among the three results and no significant difference from the expected frequencies.
  • the Longshot method, published in Nature Communications (DOI: 10.1038/s41467-019-12493-y), is a point mutation detection method for third-generation sequencing data developed by the University of California that incorporates a hidden Markov model; the data in Table 1 show that point mutation data could not be obtained with this method.
  • the DeepVariant method (the PEPPER-Margin-DeepVariant method, doi: https://doi.org/10.1101/2021.03.04.433952, published on bioRxiv and developed and optimized from the Google team's DeepVariant) cannot be used directly for detecting point mutations in third-generation sequencing data.
  • the method of this application not only effectively avoids, based on the data characteristics, the false negatives caused by low alignment rates that result from random indels or high sequencing error; its design also combines the theoretical view that bases are "more accurate in the middle, worse on both sides" of a sequencing read, the idea of molecular biological labels (UMI/UID) at the data analysis level, and the method of "weight" statistics to evaluate, error-correct and adjust the detection results as a whole, which controls false positive results more effectively.
  • the method of this application is compatible with the current standard or mature mainstream pipelines for point mutation analysis of second-generation sequencing data, such as GATK Best Practice. It enriches the technical means available for point mutation analysis of third-generation sequencing data and largely resolves the insufficient accuracy of point mutation detection by third-generation sequencing. While making full use of the long read lengths of third-generation sequencing data, it also further promotes the application of third-generation sequencing in scientific research, especially for mutation detection targeting hotspot panels.
  • B corresponding to A means that B is associated with A, and B can be determined according to A.
  • determining B according to A does not mean determining B only according to A, and B may also be determined according to A and/or other information.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present application provides a third-generation sequencing data analysis method and device for point mutation detection. The analysis method of the present application comprises: 1) extracting a first sequence subset containing a point mutation to undergo detection; 2) extracting seed sequences from the first sequence subset to obtain a second sequence subset; 3) obtaining an original data set having a desired quality; 4) using seed sequence pairs of the second sequence subset to obtain N data sets containing a target sequence; 5) performing point mutation detection and analysis on the N data sets containing the target sequence; 6) allocating a weight W to each point mutation result in N detection results; and 7) calculating a point mutation result and a frequency thereof according to a formula.

Description

Analysis method and device for detecting point mutations based on third-generation sequencing data
Cross-Reference to Related Applications
This application claims priority to Chinese patent application 202111616129.1, entitled "Analysis method and device for detecting point mutations based on third-generation sequencing data", filed on December 28, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The application belongs to the field of sequencing technology and bioinformatics analysis of sequencing data, and in particular relates to a method for detecting point mutations based on third-generation sequencing data; the application also relates to a device for detecting point mutations based on third-generation sequencing data.
Background
点突变指只有一个碱基对发生改变。广义点突变可以是碱基替换,单碱基插入或碱基缺失;狭义点突变也称作单碱基替换(base substitution)。碱基替换又分为转换(transitions)和颠换(transversions)两类。目前常见的检测基因点突变的方法有PCR法、Sanger测序法(一代测序)和二代测序。PCR法具有敏感性高的特点,且技术已经成熟,但每对引物只能检测一种突变,无法同时检测太多样品和位点,通量较低。Sanger测序法的成本较低,但所需样品用量大,且对低频突变的检测敏感性低。二代测序具有通量高的特点,测序成本也在逐年下降,但目前检测点突变常用的方法工具检测特异性不高(如Varscan),对低频的检测敏 感性也偏低(如Mutect),或者使用了局部组装步骤导致运行时间过长(如Mutect2),不能很好地满足点突变检测的需求。A point mutation is a change in only one base pair. General point mutations can be base substitutions, single base insertions or base deletions; narrow sense point mutations are also called single base substitutions. Base substitutions are further divided into two types: transitions and transversions. At present, common methods for detecting gene point mutations include PCR method, Sanger sequencing method (first-generation sequencing) and next-generation sequencing. The PCR method has the characteristics of high sensitivity, and the technology is mature, but each pair of primers can only detect one mutation, cannot detect too many samples and sites at the same time, and the throughput is low. The cost of Sanger sequencing is low, but the amount of sample required is large, and the detection sensitivity for low-frequency mutations is low. Next-generation sequencing has the characteristics of high throughput, and the cost of sequencing is also decreasing year by year, but the current methods and tools commonly used to detect point mutations are not high in detection specificity (such as Varscan), and the sensitivity to low-frequency detection is also low (such as Mutect). Or the use of local assembly steps leads to too long running time (such as Mutect2), which cannot well meet the needs of point mutation detection.
第三代测序技术,又称三代测序技术(Third generation sequencing)或单分子实时DNA测序技术,是一种在DNA测序时,不需要经过PCR扩增即可实现对每一条DNA分子的单独测序的技术。目前第三代测序技术原理主要分为以Pacbio的SMRT技术为代表的单分子荧光测序以及以牛津纳米孔公司和齐碳科技公司的纳米孔电泳技术为代表的纳米孔测序。三代测序的主要的技术特点之一是实现了DNA聚合酶内在自身的反应速度,一秒可以测10个碱基,测序速度是化学法测序的2万倍;其二是实现了DNA聚合酶内在自身的延续性,一个反应就可以测非常长的序列;二代测序可以测到上百个碱基,但是三代测序就可以测几千个碱基。进一步地,三代测序对DNA或RNA分子进行实时测序时无需进行PCR扩增或化学标记,避免在操作过程中引入的错误突变,高度保真,并且测序速度可以达到DNA为450bp/秒,RNA为70nt/秒,整体可以达到几兆碱基级别的超长读长。Third-generation sequencing technology, also known as third-generation sequencing technology (Third generation sequencing) or single-molecule real-time DNA sequencing technology, is a method that can individually sequence each DNA molecule without PCR amplification during DNA sequencing. technology. At present, the principles of the third-generation sequencing technology are mainly divided into single-molecule fluorescence sequencing represented by Pacbio's SMRT technology and nanopore sequencing represented by the nanopore electrophoresis technology of Oxford Nanopore and Qitan Technology. One of the main technical characteristics of the third-generation sequencing is to realize the internal reaction speed of DNA polymerase, which can measure 10 bases per second, and the sequencing speed is 20,000 times that of chemical sequencing; With its own continuity, one reaction can measure very long sequences; second-generation sequencing can measure hundreds of bases, but third-generation sequencing can measure thousands of bases. Furthermore, the third-generation sequencing does not need PCR amplification or chemical labeling when sequencing DNA or RNA molecules in real time, avoiding erroneous mutations introduced during the operation, high fidelity, and the sequencing speed can reach 450 bp/s for DNA and 450 bp/s for RNA. 70nt/s, the whole can reach the ultra-long read length of several megabases.
目前基于三代测序检测点突变(包含胚系突变以及体细胞突变)的方法还不是很成熟,但全球范围内已经有一些研究课题组致力于开发一些算法来精确识别三代测序数据中的点突变(SNV和InDel),例如发表于Nature Communications杂志上的加利福尼亚大学开发的结合隐马尔可夫链模型的Longshot方法(DOI:10.1038/s41467-019-12493-y),发表于Nature Machine Intelligence杂志上的香港大学开发的结合深度神经网络模型的Clair方法(doi:https://doi.org/10.1038/s42256-020-0167-4),公开于bioRxiv上基于google团队的DeepVariant开发优化的PEPPER-Margin-DeepVariant方法(doi:https://doi.org/10.1101/2021.03.04.433952)等。这些 研究成果不仅仅丰富了基于三代测序数据的突变检测手段,更重要的是为三代测序的广阔发展及广泛的实际应用提供了技术保障。At present, the method for detecting point mutations (including germline mutations and somatic cell mutations) based on three-generation sequencing is not very mature, but some research groups around the world have devoted themselves to developing some algorithms to accurately identify point mutations in three-generation sequencing data ( SNV and InDel), such as the Longshot method combined with hidden Markov chain model developed by the University of California published in Nature Communications (DOI: 10.1038/s41467-019-12493-y), published in Nature Machine Intelligence in Hong Kong The Clair method combined with the deep neural network model developed by the university (doi: https://doi.org/10.1038/s42256-020-0167-4), published on bioRxiv based on the google team's DeepVariant development and optimization of PEPPER-Margin-DeepVariant method (doi: https://doi.org/10.1101/2021.03.04.433952), etc. These research results not only enrich the mutation detection methods based on the third-generation sequencing data, but more importantly, provide technical support for the extensive development and practical application of the third-generation sequencing.
However, detecting point mutations from third-generation sequencing data still poses substantial methodological challenges. It is well known that the single-base accuracy of third-generation sequencing data remains limited. Many factors contribute, such as sample quality, the stability of the current through the "motor" protein, and the accuracy of the basecalling model. At the data level, this shows up as low sequencing quality, sequencing errors, and randomly distributed indels. In analyzing third-generation sequencing data, it is therefore particularly important to detect point mutations reliably while keeping both false positives and false negatives under control, which places heavy demands on the sensitivity and specificity of the detection algorithm. Although some methods for detecting point mutations from third-generation sequencing data exist (as noted above), each has clear shortcomings: chiefly, they are limited by sequencing quality and by the alignment algorithms or the data distribution of the deep-learning training sets they depend on, their applicable scenarios are not broad enough, and they lack robustness.
It is therefore of great importance to further improve the existing analysis methods for detecting point mutations from third-generation sequencing data, so that point mutations are detected reliably while false positives and false negatives are kept under control.
Summary of the invention
The purpose of this application is therefore to address the shortcomings of the related art by providing an analysis method for detecting point mutations based on third-generation sequencing data. The method resolves the problems described above at the data-analysis level. On the one hand, it uses the characteristics of the data to avoid the false negatives that arise when random indels or a high sequencing error rate lower the alignment rate. On the other hand, it combines three ideas to evaluate, error-correct and adjust the detection results as a whole, which controls false positives more effectively: the theoretical view that bases are called "more accurately in the middle and less accurately at the two ends" of a sequencing read; the concept of molecular tags (UMI/UID) applied at the data-analysis level; and "weight"-based statistics.
The purpose of this application is achieved through the following technical solutions:
In one aspect, this application provides an analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:
1) extracting from a reference genome a first sequence subset containing the point mutation to be detected;
Short sequences of fixed length L are extracted from the reference genome N times, such that the position of the point mutation to be detected in each extracted short sequence differs from its position in the previously extracted short sequence by a fixed distance D,
Figure PCTCN2022136275-appb-000001
where N, D and L are all integers. The result is the first sequence subset, which contains N short sequences containing the point mutation to be detected;
2) extracting seed sequences from the first sequence subset of step 1), taking M bases from each of the head and tail ends of every short sequence, to obtain a second sequence subset containing N pairs of seed sequences of length M, where the seed sequences do not contain the point mutation to be detected;
3) preprocessing the raw third-generation sequencing data to obtain a raw data set of the desired quality;
4) using the seed sequence pairs of the second sequence subset obtained in step 2) to extract target sequences from the raw data set obtained in step 3), obtaining N data sets containing target sequences;
5) performing point mutation detection and analysis on each of the N data sets of step 4), obtaining N results, where each result includes the mutation frequency F at the site to be detected, the number of reads supporting the point mutation AO, and the sequencing depth at the point mutation position DP;
6) assigning a weight W to the result for each point mutation among the N detection results of step 5);
7) calculating the point mutation result and its frequency according to the formula
Figure PCTCN2022136275-appb-000002
If $F_{correct}$ ≥ 1%, the result is positive; otherwise it is negative.
According to an embodiment of one aspect of this application, in step 1), D is the distance in bases between the positions that the point mutation occupies in the extracted sequences. The fixed distance D may be any integer greater than 1; without being bound by any particular theory, D is optionally set as
Figure PCTCN2022136275-appb-000003
Without being limited by any theory, those skilled in the art may choose the value of D freely, for example 5 ≤ D ≤ 20 or 8 ≤ D ≤ 15; for instance, D may be any integer between 5 and 20.
Those skilled in the art will understand that if the point mutation to be detected lies at position $D_0$ in the first extracted short sequence, then in the X-th extraction its position $L_x$ in the extracted short sequence satisfies $L_x = D_0 + (X-1)D$.
According to an embodiment of one aspect of this application, in $L_x = D_0 + (X-1)D$, $D_0$ is the position of the point mutation to be detected in the short sequence obtained by the first extraction. For example, $D_0$ may be the first, second, third or fourth base of the first extracted short sequence, and so on. In optional embodiments, $D_0 \le L/4$ and/or $D_0 \ge D$; for example, $D_0$ may be D, D+1, D+2, and so on.
According to an embodiment of one aspect of this application, the point mutation to be detected lies, for example, at the 11th, 21st and 31st base of successive extracted short sequences; this corresponds to $D_0 = 11$, $D = 10$ and $X = 1, 2, 3$.
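The position rule $L_x = D_0 + (X-1)D$ can be illustrated with a minimal sketch. It assumes 1-based positions and a reference held as a plain string; the function names `mutation_positions` and `extract_short_sequences` are hypothetical and are not part of this application.

```python
def mutation_positions(d0: int, d: int, n: int) -> list:
    """1-based position L_x = D0 + (X-1)*D of the site in the X-th short sequence."""
    return [d0 + (x - 1) * d for x in range(1, n + 1)]


def extract_short_sequences(reference: str, site: int, length: int,
                            d0: int, d: int, n: int) -> list:
    """Cut n windows of `length` bases so that the site (1-based reference
    coordinate) falls at position L_x of the x-th window."""
    windows = []
    for pos in mutation_positions(d0, d, n):
        if pos > length:
            raise ValueError("site would fall outside the short sequence")
        start = site - pos  # 0-based start of the window on the reference
        if start < 0 or start + length > len(reference):
            raise ValueError("window falls outside the reference")
        windows.append(reference[start:start + length])
    return windows


print(mutation_positions(11, 10, 3))  # [11, 21, 31]
```

With $D_0 = 11$ and $D = 10$, each later window starts 10 bases further upstream on the reference, so the same site moves 10 bases toward the tail of the short sequence.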
According to an embodiment of one aspect of this application,
Figure PCTCN2022136275-appb-000004
According to an embodiment of one aspect of this application, in step 1), the number of extractions N is determined by the fixed length L and the fixed distance D.
According to an embodiment of one aspect of this application, when N is even, among the N short sequences obtained, the
Figure PCTCN2022136275-appb-000005
-th and
Figure PCTCN2022136275-appb-000006
-th extracted short sequences are those in which the point mutation to be detected lies in the middle of the short sequence, or closest to the middle, compared with its position in the other short sequences; when N is odd, the
Figure PCTCN2022136275-appb-000007
-th extracted short sequence is the one in which the point mutation to be detected lies in the middle of the short sequence, or closest to the middle, compared with its position in the other short sequences.
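A small sketch of which extractions are the central ones. The indices used here, N/2 and N/2+1 for even N and (N+1)/2 for odd N, are inferred from the weight subscripts given later in the text ($W_{N/2}$, $W_{N/2+1}$, $W_{(N+1)/2}$), because the formula figures are not reproduced; the helper name `central_indices` is hypothetical.

```python
def central_indices(n: int) -> list:
    """1-based indices of the extraction(s) whose target site sits nearest the middle."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n % 2 == 0:
        return [n // 2, n // 2 + 1]  # two central extractions for even N
    return [(n + 1) // 2]            # one central extraction for odd N


print(central_indices(4))  # [2, 3]
print(central_indices(5))  # [3]
```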
According to an embodiment of one aspect of this application, in step 1), the fixed length L of each sequence may be chosen freely; it may be as short as 35 bp or as long as 250 bp, and is optionally 76-151 bp.
According to an embodiment of one aspect of this application, in step 2), M may be any integer; for practical reasons M may be 2, 3, 4 or 5, and optionally M ≥ 5.
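Seed extraction in step 2) can be sketched as follows. The sketch assumes the short sequence is a plain string and that the site position is 1-based; the name `seed_pair` and its arguments are illustrative only.

```python
def seed_pair(short_seq: str, m: int, site_pos: int) -> tuple:
    """Return (head_seed, tail_seed): the first and last m bases of a short sequence.

    site_pos is the 1-based position of the target site in short_seq; the
    seeds must not cover it, as required in step 2).
    """
    if m < 1 or 2 * m > len(short_seq):
        raise ValueError("m is incompatible with the short sequence length")
    if site_pos <= m or site_pos > len(short_seq) - m:
        raise ValueError("a seed would contain the target site")
    return short_seq[:m], short_seq[-m:]


print(seed_pair("ACGTACGTAC", 3, 5))  # ('ACG', 'TAC')
```

The second check is why the detailed embodiment requires M to be smaller than $D_0$: the head seed must end before the site in the first extracted short sequence.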
According to an embodiment of one aspect of this application, in step 3), the raw data are long-read data obtained by nanopore sequencing.
According to an embodiment of one aspect of this application, preprocessing the raw third-generation sequencing data includes using, for example, Porechop and NanoFilt to remove the adapter and barcode sequences added during library construction and to filter out low-quality and overly short sequencing reads, yielding the desired data set (clean data).
According to an embodiment of one aspect of this application, the low-quality threshold includes but is not limited to Q5; for example, the threshold may be Q7 or higher. Here Q is the mean quality value of a sequencing read, that is, the value obtained by summing and averaging the accuracy of every base in the read. As those skilled in the art know, the threshold can be adjusted as circumstances require; for details of the parameters see https://en.wikipedia.org/wiki/FASTQ_format, which is incorporated herein by reference.
According to an embodiment of one aspect of this application, the length threshold for overly short sequencing reads includes but is not limited to 100 bp; for example, the threshold may be 50 bp, 200 bp or 300 bp. Those skilled in the art can adjust the threshold as circumstances require.
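The two filters can be sketched without any external tool. This is not how Porechop or NanoFilt are invoked; it is a stand-alone illustration that assumes Phred+33 quality strings and interprets the mean read quality as the Phred value of the average per-base error probability. The defaults of Q5 and 100 bp come from the text; the function names are hypothetical.

```python
import math


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Mean read quality: average the per-base error probabilities, then convert back to Phred."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(seq: str, qual: str, min_q: float = 5.0, min_len: int = 100) -> bool:
    """Keep a read only if it is long enough and its mean quality reaches min_q."""
    return len(seq) >= min_len and mean_read_quality(qual) >= min_q
```

Averaging error probabilities rather than raw Phred scores keeps a few very poor bases from being hidden by many good ones.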
According to an embodiment of one aspect of this application, in step 4), to allow for interference from the characteristics of third-generation sequencing data, the length L' of each extracted target sequence is restricted to L' ≤ L + 50.
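A minimal sketch of pulling a target sequence out of one long read with a seed pair. The text does not say how seeds are matched against reads, so exact substring matching is an assumption made here for simplicity; the name `extract_target` and the `slack` parameter (the 50 in L' ≤ L + 50) are illustrative.

```python
from typing import Optional


def extract_target(read: str, head_seed: str, tail_seed: str,
                   length: int, slack: int = 50) -> Optional[str]:
    """Return the first segment of `read` that starts with head_seed, ends with
    tail_seed, and is no longer than length + slack; None if there is none."""
    max_len = length + slack
    start = read.find(head_seed)
    while start != -1:
        # Nearest tail seed after this head seed gives the shortest candidate.
        end = read.find(tail_seed, start + len(head_seed))
        if end != -1 and end + len(tail_seed) - start <= max_len:
            return read[start:end + len(tail_seed)]
        start = read.find(head_seed, start + 1)
    return None
```

Applying this to every read in the clean data set, once per seed pair, would give the N data sets of step 4). Because the length cap is the only constraint, reads carrying small random indels between the seeds are still captured.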
According to an embodiment of one aspect of this application, in step 5), the N data sets containing target sequences obtained through the preceding steps can be analyzed with standard or mature mainstream pipelines for point mutation analysis of second-generation sequencing data, such as GATK Best Practice.
According to an embodiment of one aspect of this application, point mutation detection and analysis is performed on the N data sets containing target sequences, giving N results; each result includes the mutation frequency F, the number of reads supporting the point mutation AO, and the sequencing depth at the point mutation position DP.
For example, the result for the first data set includes the mutation frequency $F_1$, the number of reads supporting the point mutation $AO_1$, and the sequencing depth at the point mutation position $DP_1$;
the result for the second data set includes the mutation frequency $F_2$, the number of reads supporting the point mutation $AO_2$, and the sequencing depth at the point mutation position $DP_2$;
...
the result for the N-th data set includes the mutation frequency $F_N$, the number of reads supporting the point mutation $AO_N$, and the sequencing depth at the point mutation position $DP_N$.
According to an embodiment of one aspect of this application, in step 6), a weight is assigned to the result for each point mutation among the N detection results, namely $W_1, W_2, W_3, \ldots, W_{N-1}, W_N$, with $W_1 + W_2 + W_3 + \cdots + W_{N-1} + W_N = 1$. Among the N short sequences obtained in step 1), the closer the point mutation lies to the middle of the fixed length L of a short sequence, the larger the weight assigned to the detection result associated with that short sequence.
According to an embodiment of one aspect of this application, when N is even, the
Figure PCTCN2022136275-appb-000008
-th and
Figure PCTCN2022136275-appb-000009
-th data sets (that is, the data sets obtained with the seed sequences derived from the
Figure PCTCN2022136275-appb-000010
-th and
Figure PCTCN2022136275-appb-000011
-th extracted short sequences) have the largest weight, $W_{N/2} = W_{N/2+1}$, and then $W_N = W_1$, $W_{N-1} = W_2$, $W_{N-2} = W_3$, and so on. When N is odd, the
Figure PCTCN2022136275-appb-000012
-th data set (that is, the data set obtained with the seed sequences derived from the
Figure PCTCN2022136275-appb-000013
-th extracted short sequence) has the largest weight, $W_{(N+1)/2}$, and then $W_N = W_1$, $W_{N-1} = W_2$, $W_{N-2} = W_3$, and so on.
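The text fixes only three properties of the weights: they sum to 1, they are symmetric ($W_N = W_1$, $W_{N-1} = W_2$, ...), and they peak at the central data set(s). The concrete values are not given, so the triangular scheme below is just one assumed choice that satisfies those properties; `triangular_weights` is a hypothetical name.

```python
from fractions import Fraction


def triangular_weights(n: int) -> list:
    """Symmetric weights that peak in the middle and sum to exactly 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    raw = [min(i, n + 1 - i) for i in range(1, n + 1)]  # 1, 2, ..., peak, ..., 2, 1
    total = sum(raw)
    return [Fraction(r, total) for r in raw]


print(triangular_weights(4))  # [Fraction(1, 6), Fraction(1, 3), Fraction(1, 3), Fraction(1, 6)]
```

Using `Fraction` keeps the sum exactly 1; a weighting based on the actual distance of $L_x$ from L/2 would be an equally valid alternative.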
According to an embodiment of one aspect of this application, in step 7), the formula is
Figure PCTCN2022136275-appb-000014
In this formula, the inventors combine the theoretical view that bases are called "more accurately in the middle and less accurately at the two ends" of a sequencing read, the concept of molecular tags (UMI/UID) applied at the data-analysis level, and "weight"-based statistics to evaluate, error-correct and adjust the detection results as a whole, which controls false positives more effectively.
According to an embodiment of one aspect of this application, the method of this application comprises the following steps:
1) extracting from a reference genome a first sequence subset containing the point mutation to be detected;
Short sequences of fixed length L are extracted from the reference genome N times. In the short sequence obtained by the first extraction, the point mutation to be detected lies at position $D_0$; the position of the point mutation in each extracted short sequence differs from its position in the previously extracted short sequence by a fixed distance D. The result is the first sequence subset, which contains N short sequences containing the point mutation to be detected;
where L is any integer from 76 to 151 (bp), D is any integer from 8 to 15, N is any integer from 4 to 18, and $D_0$ is any integer from 5 to 14;
2) extracting seed sequences from each sequence of the first sequence subset obtained in step 1), taking M bases from each end of every sequence, to obtain a second sequence subset of N seed sequence pairs, where $5 \le M < D_0$;
3) preprocessing the raw third-generation sequencing data, using for example Porechop and NanoFilt to remove the adapter and barcode sequences added during library construction and to filter out low-quality and overly short sequencing reads, to obtain a raw data set of the desired quality;
4) extracting the corresponding target sequences from the raw data set obtained in step 3) according to the seed sequence pairs obtained in step 2), restricting the length of each extracted target sequence to L' ≤ L + 50 to allow for interference from the characteristics of third-generation sequencing data, to obtain N data sets of target sequences extracted according to the seed sequence pairs;
5) performing point mutation detection and analysis on each of the N data sets obtained in step 4), using but not limited to pipelines such as GATK Best Practice, to obtain the final results of N detections of the target site, denoting the mutation frequency of each detection as $F_N$, the number of mutant reads supporting the site as $AO_N$, and the sequencing depth at the position as $DP_N$;
6) assigning a weight to the result for each point mutation among the N detection results of step 5), namely $W_1, W_2, W_3, \ldots, W_{N-1}, W_N$. When N is even, the
Figure PCTCN2022136275-appb-000015
-th and
Figure PCTCN2022136275-appb-000016
-th data sets (that is, the data sets obtained with the seed sequences derived from the
Figure PCTCN2022136275-appb-000017
-th and
Figure PCTCN2022136275-appb-000018
-th extracted short sequences) have the largest weight, $W_{N/2} = W_{N/2+1}$, and then $W_N = W_1$, $W_{N-1} = W_2$, $W_{N-2} = W_3$, and so on. When N is odd, the
Figure PCTCN2022136275-appb-000019
-th data set (that is, the data set obtained with the seed sequences derived from the
Figure PCTCN2022136275-appb-000020
-th extracted short sequence) has the largest weight, $W_{(N+1)/2}$, and then $W_N = W_1$, $W_{N-1} = W_2$, $W_{N-2} = W_3$, and so on;
7) weighting, error-correcting and adjusting the target point mutation results and their frequencies obtained in step 5), defining
Figure PCTCN2022136275-appb-000021
where $F_{correct}$ is the final detected mutation frequency at the site;
if $F_{correct}$ ≥ 1%, the result is positive; otherwise it is negative.
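The final call can be sketched as below. The application's formula for $F_{correct}$ is contained in a figure that is not reproduced in the text, so this sketch does not implement it. As a loudly labeled stand-in it uses a plain weighted sum of the per-data-set frequencies, and it assumes each frequency is AO/DP. Only the 1% threshold is taken from the text; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteResult:
    ao: int  # reads supporting the point mutation
    dp: int  # sequencing depth at the site

    @property
    def f(self) -> float:
        # ASSUMPTION: frequency taken as AO / DP; the text lists F, AO and DP separately.
        return self.ao / self.dp if self.dp else 0.0


def corrected_frequency(results, weights) -> float:
    """ASSUMED stand-in for F_correct: the weighted sum of per-data-set frequencies."""
    if len(results) != len(weights):
        raise ValueError("need one weight per result")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * r.f for w, r in zip(weights, results))


def call_site(results, weights, threshold: float = 0.01) -> str:
    """Positive if the corrected frequency reaches the 1% threshold, else negative."""
    return "positive" if corrected_frequency(results, weights) >= threshold else "negative"
```

Whatever the exact formula, the decision step has the same shape: N per-data-set results and N weights go in, and one frequency compared against 1% comes out.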
This application also provides a device for detecting point mutations based on third-generation sequencing data, the device comprising:
a seed sequence extraction module, configured to obtain the second sequence subset containing seed sequence pairs;
a preprocessing module, configured to preprocess the third-generation sequencing data to obtain a raw data set of the desired quality;
a primary analysis module, configured to use the seed sequence pairs of the second sequence subset to extract data sets containing target sequences from the preprocessed raw data set, and then to perform point mutation detection and analysis and obtain data;
an advanced analysis module, configured to further weight and adjust the results obtained and to produce the final analysis result; and
a report module, configured to output the result based on the data.
According to an embodiment of one aspect of this application, the seed sequence extraction module is configured to extract from the reference genome a first sequence subset containing N short sequences containing the point mutation to be detected, and then to extract from the first sequence subset a second sequence subset containing seed sequence pairs, where the seed sequence pairs are obtained according to the data processing method described in this application.
According to an embodiment of one aspect of this application, the preprocessing module is configured to filter out low-quality and overly short sequencing reads, and may include, for example, Porechop and NanoFilt.
According to an embodiment of one aspect of this application, the data obtained by the primary analysis module have characteristics similar to second-generation (NGS) sequencing data, so standard or mature mainstream pipelines for point mutation analysis of NGS data, such as GATK Best Practice, can be used.
According to an embodiment of one aspect of this application, the advanced analysis module contains a program or software for assigning a weight to each result. The weight assignment follows the theoretical view that bases are called "more accurately in the middle and less accurately at the two ends" of a sequencing read, the concept of molecular tags (UMI/UID) applied at the data-analysis level, and "weight"-based statistics.
Building on the specific characteristics of third-generation sequencing data, the inventors have addressed, at the data-analysis level, the problems that analysis of such data is limited by sequencing quality and by the alignment algorithms or the data distribution of the deep-learning training sets it depends on, that applicable scenarios are not broad enough, and that robustness is insufficient. The method of this application uses the characteristics of the data to avoid the false negatives that arise when random indels or a high sequencing error rate lower the alignment rate. It also combines the theoretical view that bases are called "more accurately in the middle and less accurately at the two ends" of a sequencing read, the concept of molecular tags (UMI/UID) applied at the data-analysis level, and "weight"-based statistics to evaluate, error-correct and adjust the detection results as a whole, which controls false positives more effectively. The method is compatible with the standard or mature mainstream pipelines currently used for point mutation analysis of second-generation sequencing data, such as GATK Best Practice. It enriches the technical means available for point mutation analysis of third-generation sequencing data and goes a long way toward resolving the insufficient accuracy of point mutation detection by third-generation sequencing. While making full use of the long reads of third-generation sequencing, it also further promotes the use of third-generation sequencing in research, and it is particularly suited to mutation detection with targeted hotspot panels.
Brief description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a flow chart of the analysis method for detecting point mutations based on third-generation sequencing data in one embodiment of this application;
Figure 2 is a structural block diagram of the device for detecting point mutations based on third-generation sequencing data in one embodiment of this application.
Detailed description
Features and exemplary embodiments of various aspects of this application are described in detail below. The detailed description sets out many specific details to give a thorough understanding of this application. It will be apparent to those skilled in the art, however, that this application can be practiced without some of these details. The description of the embodiments below is intended only to give a better understanding of this application by showing examples of it.
It should be noted that, where there is no conflict, the embodiments of this application and the features of the embodiments can be combined with one another. The embodiments are described in detail below with reference to the drawings.
The single-base accuracy of third-generation sequencing data remains limited, which shows up at the data level as low sequencing quality, sequencing errors, and randomly distributed indels. In downstream data analysis it is therefore particularly important to detect point mutations reliably while keeping false positives and false negatives under control.
With reference to Figures 1 and 2, this application provides an analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:
S1: extracting from a reference genome a first sequence subset containing the point mutation to be detected;
Short sequences of fixed length L are extracted from the reference genome N times, such that the position of the point mutation to be detected in each extracted short sequence differs from its position in the previously extracted short sequence by a fixed distance D,
Figure PCTCN2022136275-appb-000022
where N, D and L are all integers. The result is the first sequence subset, which contains N short sequences containing the point mutation to be detected;
S2: extracting seed sequences from the first sequence subset of S1, taking M bases from each of the head and tail ends of every short sequence, to obtain a second sequence subset containing N pairs of seed sequences of length M, where the seed sequences do not contain the point mutation to be detected;
S3: preprocessing the raw third-generation sequencing data to obtain a raw data set of the desired quality;
S4: using the seed sequence pairs of the second sequence subset obtained in S2 to extract target sequences from the raw data set obtained in S3, obtaining N data sets containing target sequences;
S5: performing point mutation detection and analysis on each of the N data sets of S4, obtaining N results, where each result includes the mutation frequency F at the site to be detected, the number of reads supporting the point mutation AO, and the sequencing depth at the point mutation position DP;
S6: assigning a weight W to the result for each point mutation among the N detection results of S5;
S7: calculating the point mutation result and its frequency according to the formula
Figure PCTCN2022136275-appb-000023
If $F_{correct}$ ≥ 1%, the result is positive; otherwise it is negative.
As can be seen from the above method, the inventors of the present application prepare seed sequences and, taking the characteristics of the sequencing data into account, perform multiple sampling extractions that convert the long reads of third-generation sequencing into short sequences. Point mutation analysis is then performed in the same way as for NGS data. The multiple sampling results are integrated, evaluated, error-corrected, and adjusted by combining the idea of experimental single-molecule tagging (UMI/UID) with weighted statistics, and the final call is made on that basis. This effectively avoids the problem of insufficient accuracy of third-generation sequencing in detecting point mutations.
Further, as shown in FIG. 2, one embodiment of the present application provides a device for detecting point mutations based on third-generation sequencing data, the device comprising: a seed sequence extraction module 101 for obtaining the second sequence subset comprising the seed sequence pairs; a preprocessing module 102 for preprocessing the third-generation sequencing data to obtain a raw data set of the desired quality; a primary analysis module 103 for using the seed sequence pairs of the second sequence subset to extract data sets containing the target sequences from the preprocessed raw data set, and then performing point mutation detection and analysis to obtain data; an advanced analysis module 104 for further weighting and correcting the obtained results to produce the final analysis result; and a report module 105 for outputting results based on the data.
In the device according to the present application, the seed sequence extraction module is used to extract, from the reference genome, a first sequence subset comprising N short sequences containing the point mutation to be detected, and then to extract from the first sequence subset a second sequence subset comprising the seed sequence pairs; the seed sequence pairs are obtained according to the data processing method described in the present application.
In the device according to the present application, the preprocessing module is used to filter out low-quality and overly short sequencing reads, and may include, for example, the Porechop software and the NanoFilt software.
In the device according to the present application, the data obtained by the primary analysis module have characteristics similar to those of second-generation (NGS) sequencing data, so standard or mature mainstream workflows for point mutation analysis of NGS data, such as GATK Best Practice, can be used.
In the device according to the present application, the advanced analysis module comprises a program or software for assigning a weight to each result. The weight assignment follows the theoretical view that bases are called "more accurately in the middle and less accurately toward the two ends" of a sequencing read, the idea of molecular tags (UMI/UID) at the data analysis level, and the method of weighted statistics.
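The filtering performed by the preprocessing module described above (dropping low-quality and overly short reads) can be sketched in plain Python as follows. This is an illustrative stand-in, not the behavior of Porechop or NanoFilt: the function names, the Phred+33 quality encoding, and the choice to average quality through error probabilities are assumptions, and the default thresholds (Q7, 100 bp) are the ones quoted in Example 1.

```python
import math
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string); assumed layout


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a read, averaged over per-base error probabilities.

    Assumes Phred+33 encoded quality characters.
    """
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(reads: Iterable[Read], min_quality: float = 7.0,
                 min_length: int = 100) -> Iterator[Read]:
    """Yield only reads that are long enough and of sufficient mean quality."""
    for name, seq, qual in reads:
        if len(seq) >= min_length and mean_quality(qual) >= min_quality:
            yield name, seq, qual
```

Adapter and barcode trimming, which the text assigns to the same module, is deliberately left out of this sketch.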
Example 1: Analyzing data using the method of the present application
1. Standard samples containing BRAF-V600E, EGFR-L858R, EGFR-T790M, KRAS-G13D, and AKT1-E17K, together with the negative control standard NA12878, were subjected to library preparation in triplicate and sequenced on a QNome-9604 nanopore sequencer, yielding six raw long-read sequencing data sets: HUM964, HUM965, and HUM966 are positive control data, and HUM967, HUM968, and HUM969 are negative control data.
2. For each of the 5 target sites to be detected from step 1, short sequences of fixed length 101 bp were extracted from the genome 9 times according to the site's position, with the target site fixed at the 11th, 21st, 31st, 41st, 51st, 61st, 71st, 81st, and 91st base of the extracted short sequence, respectively (i.e., D = 10 bp). This gave 9 sets of short sequence fragments, each set covering the 5 target sites, with every fragment 101 bp long.
3. Seed sequences were extracted from each set of short sequence fragments by taking the first and last 10 bases of the short sequence for each target site, giving 9 sets of seed sequence pairs for the short sequences containing the target sites.
4. The raw third-generation sequencing data were preprocessed: using, for example, the Porechop and NanoFilt software, the adapter and barcode sequences added during library construction were removed, and reads of low quality (Q7) or shorter than 100 bp were filtered out, yielding clean data.
5. From the clean data obtained in step 4, the corresponding target sequences were extracted according to the seed sequence pairs obtained in step 3. To account for interference from the characteristics of third-generation sequencing data, the length of the extracted target sequences was limited to L' < 151, giving 9 target sequence data sets extracted with the seed sequence pairs.
6. Point mutation detection and analysis were performed separately on the 9 data sets obtained in step 5; in this example, GATK Best Practice was used to detect point mutations. This gave 9 detection results for each target site. For each result, the detected mutation frequency is denoted F_N, the number of reads supporting the mutation at the site AO_N, and the sequencing depth at the position DP_N.
7. Because the data sets containing target sequences of length L' obtained in step 5 have characteristics similar to data obtained by second-generation sequencing, this step treats the short target sequence data from step 5 as second-generation sequencing platform data when assigning weights. Following the second-generation sequencing characteristic that bases are called "more accurately in the middle and less accurately toward the two ends" of a read, a weight is assigned to each point mutation result among the 9 detection results, namely W_1, W_2, W_3, W_4, W_5, W_6, W_7, W_8, and W_9, with W_1 + W_2 + W_3 + W_4 + W_5 + W_6 + W_7 + W_8 + W_9 = 1, W_5 = 0.25, W_1 = W_9 = 0.05, W_2 = W_8 = 0.075, W_3 = W_7 = 0.1, and W_4 = W_6 = 0.15.
The target point mutation results and frequencies obtained in step 6 are weighted, error-corrected, and adjusted by defining
Figure PCTCN2022136275-appb-000024
where F_correct is the final detected mutation frequency at the site; if F_correct ≥ 1%, the result is positive, otherwise it is negative.
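The weighting in step 7 and the final call can be sketched as follows. The exact formula for F_correct is given only in the figure above and is not reproduced in the text, so this sketch ASSUMES the simplest reading: F_correct is the weighted sum of the nine per-extraction frequencies. The function names are hypothetical; the weights and the 1% threshold are taken from the example.

```python
from typing import List, Sequence, Tuple

# Weights W_1..W_9 as listed in step 7 of Example 1.
EXAMPLE_WEIGHTS: List[float] = [0.05, 0.075, 0.1, 0.15, 0.25,
                                0.15, 0.1, 0.075, 0.05]


def weighted_frequency(freqs: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Combine per-extraction frequencies F_1..F_N with weights W_1..W_N.

    ASSUMPTION: a plain weighted sum; the patent's formula is in a figure
    that the text does not reproduce and may differ (e.g. it may use AO and DP).
    """
    if len(freqs) != len(weights):
        raise ValueError("need exactly one weight per frequency")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * f for w, f in zip(weights, freqs))


def call_site(freqs: Sequence[float], weights: Sequence[float],
              threshold: float = 0.01) -> Tuple[float, str]:
    """Return (F_correct, call) using the 1% positivity threshold."""
    f_correct = weighted_frequency(freqs, weights)
    return f_correct, ("positive" if f_correct >= threshold else "negative")
```

The listed weights do sum to one: 0.25 + 2 × (0.05 + 0.075 + 0.1 + 0.15) = 0.25 + 0.75 = 1.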
The results are summarized in Table 1. The method of the present application detected each of the known mutations with high sensitivity, consistent with expectations, and outperformed the current mainstream algorithms and software for analyzing point mutations in third-generation sequencing data, effectively controlling false negatives and false positives; the method of the present application is therefore feasible.
Table 1. Summary of the mutations detected by the method of the present application and their frequencies.
Figure PCTCN2022136275-appb-000025
Here, Nano2NGS denotes the method of the present application. The data in Table 1 show that, using this method, the BRAF-V600E, EGFR-L858R, EGFR-T790M, KRAS-G13D, and AKT1-E17K mutations were detected in all three replicates, with good reproducibility among the three results and no significant difference from the expected frequencies.
The Longshot method, published in Nature Communications (DOI: 10.1038/s41467-019-12493-y), is a point mutation detection method for third-generation sequencing developed at the University of California that incorporates a hidden Markov model. The data in Table 1 show that no point mutation data could be obtained when analyzing with this method.
The DeepVariant method (the PEPPER-Margin-DeepVariant method published on bioRxiv, developed and optimized on the basis of the Google team's DeepVariant; doi: https://doi.org/10.1101/2021.03.04.433952) likewise could not be used directly for point mutation detection in third-generation sequencing data.
Although the iGDA method can be used directly for point mutation detection in third-generation sequencing data, it also detected point mutations in the negative control samples, giving false positive results.
Therefore, the method of the present application not only uses the characteristics of the data to avoid the false negatives caused by low alignment rates that result from random indels or high sequencing error rates, but also combines the theoretical view that bases are called "more accurately in the middle and less accurately toward the two ends" of a read, the idea of molecular tags (UMI/UID) at the data analysis level, and weighted statistics to evaluate, error-correct, and adjust the detection results as a whole, thereby controlling false positives more effectively. The method is compatible with current standard or mature mainstream workflows for point mutation analysis of second-generation sequencing data, such as GATK Best Practice. It enriches the technical means for analyzing point mutations in third-generation sequencing data and largely resolves the insufficient accuracy of third-generation sequencing in detecting point mutations. While taking full advantage of the long read length of third-generation sequencing, it further promotes the application of third-generation sequencing in scientific research and is particularly suitable for mutation detection in targeted hotspot panels.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
It should be understood that, in the embodiments of the present application, "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood, however, that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed herein, and all such modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.

Claims (15)

  1. An analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:
    1) extracting, from a reference genome, a first sequence subset comprising the point mutation to be detected;
    short sequences of fixed length L are extracted N times from the reference genome, such that the position of the point mutation to be detected on each extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D, and
    Figure PCTCN2022136275-appb-100001
    where N, D, and L are all integers, finally obtaining the first sequence subset, which comprises N short sequences containing the point mutation to be detected;
    2) extracting seed sequences from the first sequence subset of step 1), taking M bases from each of the two ends of every short sequence, to obtain a second sequence subset comprising N pairs of seed sequences of length M;
    3) preprocessing the raw third-generation sequencing data to obtain a raw data set of the desired quality;
    4) using the seed sequence pairs of the second sequence subset obtained in step 2) to extract target sequences from the raw data set obtained in step 3), obtaining N data sets containing the target sequences;
    5) performing point mutation detection and analysis separately on each of the N data sets containing the target sequences from step 4) to obtain N results, wherein each result comprises the mutation frequency F at the site to be detected, the number of reads supporting the point mutation AO, and the sequencing depth DP at the point mutation position;
    6) assigning a weight W to each point mutation result among the N detection results of step 5);
    7) calculating the point mutation result and its frequency according to the formula
    Figure PCTCN2022136275-appb-100002
    if F_correct ≥ 1%, the result is positive, otherwise it is negative, where F_correct is the final detected mutation frequency at the site.
  2. The method according to claim 1, wherein, in step 1),
    Figure PCTCN2022136275-appb-100003
  3. The method according to claim 1, wherein, in step 1), the position of the point mutation to be detected on the first extracted short sequence is D_0, and at the X-th extraction the position L_x of the point mutation on the X-th extracted short sequence satisfies L_x = D_0 + (X-1)D;
    where
    Figure PCTCN2022136275-appb-100004
  4. The method according to claim 1, wherein L is 76-151 bp.
  5. The method according to claim 1, wherein, in step 2), M ≥ 5.
  6. The analysis method according to claim 1, wherein, in step 3), preprocessing the raw third-generation sequencing data comprises filtering out low-quality and overly short sequencing reads;
    wherein the low-quality threshold is Q5, and/or the length threshold for overly short sequencing reads is 100 bp.
  7. The analysis method according to claim 1, wherein, in step 4), the length L' of the target sequence satisfies L' ≤ L+50.
  8. The analysis method according to claim 1, wherein, in step 5), the analysis uses the GATK Best Practice workflow.
  9. The analysis method according to claim 1, wherein, in step 6), assigning a weight to each point mutation result among the N detection results comprises:
    the weights W_1 to W_N sum to 1; and
    among the N short sequences obtained in step 1), the closer the point mutation lies to the middle of the fixed length L of a short sequence, the greater the weight assigned to the detection result associated with that short sequence.
  10. The analysis method according to claim 9, wherein, in step 6), a weight is assigned to each point mutation result among the N detection results,
    wherein, when N is even, the
    Figure PCTCN2022136275-appb-100005
    -th and the
    Figure PCTCN2022136275-appb-100006
    -th data sets have the largest weight, W_{N/2} = W_{N/2+1}, then W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on;
    wherein, when N is odd, the
    Figure PCTCN2022136275-appb-100007
    -th data set has the largest weight, W_{(N+1)/2}, then W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on.
  11. An analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:
    1) extracting, from a reference genome, a first sequence subset comprising the point mutation to be detected;
    short sequences of fixed length L are extracted N times from the reference genome; in the short sequence obtained by the first extraction, the point mutation to be detected is located at position D_0, and the position of the point mutation to be detected on each extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D, finally obtaining the first sequence subset, which comprises N short sequences containing the point mutation to be detected;
    where L is any integer from 76 to 151 bp, D is any integer from 8 to 15, N is any integer from 4 to 18, and D_0 is any integer from 5 to 14;
    2) extracting a seed sequence from each sequence in the first sequence subset obtained in step 1), taking M bases from each of the two ends of every sequence, finally obtaining a second sequence subset of N seed sequence pairs, where 5 ≤ M < D_0;
    3) preprocessing the raw third-generation sequencing data, using the Porechop software and the NanoFilt software to remove the adapter and barcode sequences added during library construction and to filter out low-quality and overly short sequencing reads, obtaining a raw data set of the desired quality;
    4) according to the seed sequence pairs obtained in step 2), extracting the corresponding target sequences from the raw data set obtained in step 3), the length of the target sequences satisfying L' ≤ L+50, finally obtaining N data sets containing the target sequences;
    5) using the GATK Best Practice workflow to perform point mutation detection and analysis separately on each of the N data sets containing the target sequences obtained in step 4), obtaining N final detection results for the target site, where for each result the detected mutation frequency is denoted F_N, the number of reads supporting the mutation at the site AO_N, and the sequencing depth at the position DP_N;
    6) assigning a weight to each point mutation result among the N detection results of step 5), the weights W_1 to W_N summing to 1;
    wherein, when N is even, the
    Figure PCTCN2022136275-appb-100008
    -th and the
    Figure PCTCN2022136275-appb-100009
    -th data sets have the largest weight, W_{N/2} = W_{N/2+1}, then W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on;
    wherein, when N is odd, the
    Figure PCTCN2022136275-appb-100010
    -th data set has the largest weight, W_{(N+1)/2}, then W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on;
    7) weighting, error-correcting, and adjusting the target point mutation results and their frequencies obtained in step 5) by defining
    Figure PCTCN2022136275-appb-100011
    where F_correct is the final detected mutation frequency at the site;
    if F_correct ≥ 1%, the result is positive, otherwise it is negative.
  12. A device for detecting point mutations based on third-generation sequencing data, comprising:
    a seed sequence extraction module for extracting, from a reference genome, a first sequence subset comprising N short sequences containing the point mutation to be detected, and then extracting from the first sequence subset a second sequence subset comprising seed sequence pairs;
    a preprocessing module for preprocessing the third-generation sequencing data to obtain a raw data set of the desired quality;
    a primary analysis module for using the seed sequence pairs of the second sequence subset to extract, from the preprocessed raw data set, N data sets containing the target sequences, and then performing point mutation detection and analysis to obtain N results, wherein each result comprises the mutation frequency F at the site to be detected, the number of reads supporting the point mutation AO, and the sequencing depth DP at the point mutation position;
    an advanced analysis module for further weighting and correcting the obtained results to produce the final analysis result; and
    a report module for outputting results based on the data;
    wherein the advanced analysis module is used to assign a weight W to each point mutation result among the N detection results and to calculate the point mutation result and its frequency according to the formula
    Figure PCTCN2022136275-appb-100012
    if F_correct ≥ 1%, the result is positive, otherwise it is negative, where F_correct is the final detected mutation frequency at the site;
    and the report module is used to output the point mutation result and its frequency.
  13. The device according to claim 12, wherein the preprocessing module is used to filter out low-quality and overly short sequencing reads and includes the Porechop software and the NanoFilt software.
  14. The device according to claim 12, wherein the primary analysis module includes the GATK Best Practice workflow.
  15. The device according to claim 12, wherein the advanced analysis module includes a program or software for assigning a weight to each result.
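As an informal illustration only, and not part of the claims, the weight constraints recited in claims 9 and 10 can be checked mechanically. The sketch below uses a hypothetical function name and reads "the greater the weight" as strictly increasing toward the middle, which is an assumption about the claim wording.

```python
from typing import Sequence


def check_weights(weights: Sequence[float], tol: float = 1e-9) -> bool:
    """Return True if the weights fit the shape described in claims 9 and 10.

    Checks (1) the weights sum to 1, (2) they mirror around the centre
    (W_1 = W_N, W_2 = W_{N-1}, ...), and (3) they strictly increase from
    the edge toward the middle (an assumed reading of "greater").
    """
    n = len(weights)
    if n == 0 or abs(sum(weights) - 1.0) > tol:
        return False
    # Symmetry: compare each weight with its mirror image.
    if any(abs(weights[i] - weights[n - 1 - i]) > tol for i in range(n // 2)):
        return False
    # First half, including the single middle element when N is odd.
    half = weights[: (n + 1) // 2]
    return all(a < b for a, b in zip(half, half[1:]))
```

The nine weights of Example 1 (0.05, 0.075, 0.1, 0.15, 0.25, 0.15, 0.1, 0.075, 0.05) pass this check, as does an even-length vector such as 0.1, 0.4, 0.4, 0.1, where the two middle weights are equal.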
PCT/CN2022/136275 2021-12-28 2022-12-02 Third-generation sequencing data analysis method and device for point mutation detection WO2023124779A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111616129.1A CN114005489B (en) 2021-12-28 2021-12-28 Analysis method and device for detecting point mutation based on third-generation sequencing data
CN202111616129.1 2021-12-28

Publications (1)

Publication Number Publication Date
WO2023124779A1 true WO2023124779A1 (en) 2023-07-06

Family

ID=79932112

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/136275 WO2023124779A1 (en) 2021-12-28 2022-12-02 Third-generation sequencing data analysis method and device for point mutation detection

Country Status (2)

Country Link
CN (1) CN114005489B (en)
WO (1) WO2023124779A1 (en)


Also Published As

Publication number Publication date
CN114005489B (en) 2022-03-22
CN114005489A (en) 2022-02-01

Similar Documents

Publication Publication Date Title
US11702708B2 (en) Systems and methods for analyzing viral nucleic acids
CN109767810B (en) High-throughput sequencing data analysis method and device
CN107423578B (en) Device for detecting somatic cell mutation
CN109994155B (en) Gene variation identification method, device and storage medium
US20210257050A1 (en) Systems and methods for using neural networks for germline and somatic variant calling
CN113744807B (en) Macrogenomics-based pathogenic microorganism detection method and device
EP3143537A1 (en) Rare variant calls in ultra-deep sequencing
CN109658983A (en) Method and apparatus for identifying and eliminating false positives in variant detection
García-López et al. Fragmentation and coverage variation in viral metagenome assemblies, and their effect in diversity calculations
US11842794B2 (en) Variant calling in single molecule sequencing using a convolutional neural network
US20210332354A1 (en) Systems and methods for identifying differential accessibility of gene regulatory elements at single cell resolution
CN111462823A (en) Homologous recombination defect judgment method based on DNA sequencing data
WO2023124779A1 (en) Third-generation sequencing data analysis method and device for point mutation detection
CN115083521A (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
CN115631789A (en) Pangenome-based group joint variation detection method
CN115458052A (en) Gene mutation analysis method, equipment and storage medium based on first generation sequencing
US20190108311A1 (en) Site-specific noise model for targeted sequencing
WO2019132010A1 (en) Method, apparatus and program for estimating base type in base sequence
CN105849284B (en) Method and apparatus for separating quality levels in sequence data and sequencing longer reads
CN113160895A (en) Colorectal cancer risk assessment model and system
Becker et al. TensorSV: structural variation inference using tensors and variable topology neural networks
CN116312798B (en) Metagenome sequencing data species verification method and application
CN114242158B (en) Method, device, storage medium and equipment for detecting ctDNA single nucleotide variation site
CN115862744B (en) Whole genome parallel splicing method established based on relational graph
CN113793641B (en) Method for rapidly judging sample gender from FASTQ file

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22914034; Country of ref document: EP; Kind code of ref document: A1)