WO2023124779A1

WO2023124779A1 - Third-generation sequencing data analysis method and device for point mutation detection

Info

Publication number: WO2023124779A1
Application number: PCT/CN2022/136275
Authority: WO
Inventors: 郎继东; 孙继国
Original assignee: 成都齐碳科技有限公司
Priority date: 2021-12-28
Filing date: 2022-12-02
Publication date: 2023-07-06
Also published as: CN114005489B; CN114005489A

Abstract

The present application provides a third-generation sequencing data analysis method and device for point mutation detection. The analysis method of the present application comprises: 1) extracting a first sequence subset containing a point mutation to undergo detection; 2) extracting seed sequences from the first sequence subset to obtain a second sequence subset; 3) obtaining an original data set having a desired quality; 4) using seed sequence pairs of the second sequence subset to obtain N data sets containing a target sequence; 5) performing point mutation detection and analysis on the N data sets containing the target sequence; 6) allocating a weight W to each point mutation result in N detection results; and 7) calculating a point mutation result and a frequency thereof according to a formula.

Description

Analysis method and device for detecting point mutations based on third-generation sequencing data

Cross References to Related Applications

This application claims priority to the Chinese patent application 202111616129.1 entitled "Analysis method and device for detecting point mutations based on third-generation sequencing data" filed on December 28, 2021, the entire content of which is incorporated herein by reference.

Technical Field

The application belongs to the field of sequencing technology and bioinformatics analysis of sequencing data. In particular, it relates to a method for detecting point mutations based on third-generation sequencing data, and also to a device for detecting point mutations based on third-generation sequencing data.

Background

A point mutation is a change in only one base pair. In the broad sense, point mutations include base substitutions and single-base insertions or deletions; in the narrow sense, the term refers to single-base substitutions only. Base substitutions are further divided into two types: transitions and transversions. Common methods for detecting gene point mutations currently include PCR, Sanger sequencing (first-generation sequencing), and next-generation sequencing. PCR is a mature technology with high sensitivity, but each primer pair detects only one mutation, so many samples and sites cannot be tested at the same time and throughput is low. Sanger sequencing is inexpensive, but it requires a large amount of sample and has low sensitivity for low-frequency mutations. Next-generation sequencing offers high throughput at a cost that falls year by year, but the methods and tools commonly used with it to detect point mutations have limited specificity (for example Varscan) or limited sensitivity for low-frequency mutations (for example Mutect), or they rely on local assembly steps that make the running time too long (for example Mutect2), so they do not fully meet the needs of point mutation detection.

Third-generation sequencing, also known as single-molecule real-time DNA sequencing, sequences each DNA molecule individually without PCR amplification. At present, its principles fall mainly into two classes: single-molecule fluorescence sequencing, represented by PacBio's SMRT technology, and nanopore sequencing, represented by the nanopore electrophoresis technology of Oxford Nanopore and Qitan Technology. One main technical characteristic of third-generation sequencing is that it runs at the intrinsic reaction speed of DNA polymerase, which can read 10 bases per second, a sequencing speed 20,000 times that of chemical sequencing. Another is that it exploits the polymerase's own continuity (processivity), so one reaction can read very long sequences: second-generation sequencing reads hundreds of bases, whereas third-generation sequencing reads thousands. Furthermore, third-generation sequencing needs neither PCR amplification nor chemical labeling when sequencing DNA or RNA molecules in real time, which avoids erroneous mutations introduced during handling and gives high fidelity. The sequencing speed can reach 450 bp/s for DNA and 70 nt/s for RNA, and read lengths can reach several megabases.

At present, methods for detecting point mutations (both germline and somatic) from third-generation sequencing data are not yet mature, but several research groups around the world have developed algorithms to accurately identify point mutations (SNVs and InDels) in such data. Examples include Longshot, which uses a hidden Markov model, developed at the University of California and published in Nature Communications (DOI: 10.1038/s41467-019-12493-y); Clair, which uses a deep neural network model, developed at the University of Hong Kong and published in Nature Machine Intelligence (doi: https://doi.org/10.1038/s42256-020-0167-4); and PEPPER-Margin-DeepVariant, developed and optimized on the basis of the Google team's DeepVariant and published on bioRxiv (doi: https://doi.org/10.1101/2021.03.04.433952). These results not only enrich the mutation detection methods available for third-generation sequencing data but, more importantly, provide technical support for the broad development and practical application of third-generation sequencing.

However, detecting point mutations from third-generation sequencing data still faces considerable challenges. It is well known that the accuracy of single-base recognition in third-generation sequencing data remains limited. At the data level this shows up as low sequencing quality or sequencing errors and as data characteristics such as randomly distributed indels. In data analysis based on third-generation sequencing it is therefore particularly important to detect point mutations stably while keeping false positives and false negatives under control, which places heavy demands on the sensitivity and specificity of detection algorithms. Although some methods for detecting point mutations from third-generation sequencing data exist at this stage (as mentioned above), their shortcomings are evident. Most importantly, they are limited by sequencing quality and by the alignment algorithms or deep-learning training-set data distributions they rely on, their applicable scenarios are not broad enough, and their robustness is insufficient.

Therefore, it is of great significance to further improve the analysis methods for detecting point mutations from third-generation sequencing data in the related technologies, so that point mutations can be detected stably while false positives and false negatives are better controlled.

Summary of the Invention

Therefore, the purpose of this application is to address the shortcomings of the related technologies and provide an analysis method for detecting point mutations based on third-generation sequencing data. The method provided by this application addresses the above problems at the data analysis level. Starting from the data characteristics, it more effectively avoids false negatives caused by the low alignment rates that result from random indels or high sequencing error. At the same time, by design it combines the theoretical view that bases are "more accurate in the middle and poorer at both ends" of a sequencing read, the idea of molecular biological tags (UMI/UID) at the data analysis level, and "weight" statistics to evaluate, error-correct, and adjust the detection results as a whole, controlling false positive results more effectively.

The purpose of this application is achieved through the following technical solutions:

In one aspect, the present application provides an analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:

1) Extracting a first sequence subset comprising the point mutation to be detected from the reference genome;

On the reference genome, a short sequence of fixed length L is extracted N times, such that the position of the point mutation to be detected on each extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D,

Wherein N, D, and L are all integers; this finally yields the first sequence subset, which includes N short sequences containing the point mutation to be detected;

2) Extracting seed sequences from the first sequence subset of step 1), the extraction positions being the M bases at the beginning and at the end of each short sequence, to obtain a second sequence subset, which includes N pairs of seed sequences of length M; the seed sequences do not contain the point mutation to be detected;

3) Preprocessing the original third-generation sequencing data to obtain an original data set with expected quality;

4) using the seed sequence of the second sequence subset obtained in step 2) to extract the target sequence from the original data set obtained in step 3), and obtain N data sets containing the target sequence;

5) Performing point mutation detection and analysis on each of the N data sets containing the target sequence from step 4) to obtain N results, wherein each result includes the mutation frequency F of the site to be detected, the number AO of reads supporting the point mutation, and the sequencing depth DP at the point mutation position;

6) assign weight W to the result of each point mutation in the N detection results of step 5);

7) Calculate the point mutation result and its frequency according to the formula;

If F_correct ≥ 1%, the result is positive; otherwise it is negative.

According to an embodiment of one aspect of the present application, in step 1), D represents the base distance between positions of point mutations in any extracted sequence. The fixed distance D can be any integer greater than 1, not limited to any particular theory, but optionally the distance D is set as

Without any theoretical limitation, those skilled in the art can optionally set the value of D, for example, 5≤D≤20, 8≤D≤15, etc., for example, D can be any integer between 5 and 20.

Those skilled in the art can understand that if the position of the point mutation to be detected in the short sequence extracted the first time is D₀, then at the X-th extraction the position L_x of the point mutation in the extracted short sequence satisfies L_x = D₀ + (X-1)D.

According to an embodiment of one aspect of the present application, in L_x = D₀ + (X-1)D, D₀ is the position of the point mutation to be detected in the short sequence obtained by the first extraction; for example, D₀ can be the first, second, third, or fourth base of that short sequence, and so on; in an optional embodiment, D₀ ≤ L/4 and/or D₀ ≥ D, for example, D₀ may be D, D+1, D+2, etc.

According to an embodiment of one aspect of the present application, for example, the point mutation to be detected is located at the 11th, 21st, and 31st base, and so on, of the successive extracted short sequences; that is, D₀ is 11, D is 10, and X is 1, 2, and 3.
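
The position rule above can be sketched in Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names `mutation_positions` and `extract_windows`, the 0-based `site` index, and the use of a plain string as the reference are all hypothetical choices made for the example.

```python
def mutation_positions(d0: int, d: int, n: int) -> list[int]:
    """1-based position of the site inside the X-th short sequence, X = 1..n,
    following L_x = D0 + (X - 1) * D."""
    return [d0 + (x - 1) * d for x in range(1, n + 1)]


def extract_windows(reference: str, site: int, length: int,
                    d0: int, d: int, n: int) -> list[str]:
    """Cut n windows of `length` bases from `reference` so that the 0-based
    reference index `site` lands at window positions d0, d0 + d, d0 + 2d, ..."""
    windows = []
    for pos in mutation_positions(d0, d, n):
        start = site - (pos - 1)  # 0-based start of this window
        if start < 0 or start + length > len(reference):
            raise ValueError("window falls outside the reference")
        windows.append(reference[start:start + length])
    return windows
```

With D₀ = 11 and D = 10, `mutation_positions(11, 10, 3)` gives the positions 11, 21, and 31 named in the text.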

According to an embodiment of an aspect of the application

According to an embodiment of an aspect of the present application, in step 1), the extraction times N need to be determined according to the fixed length L and the fixed distance D.

According to an embodiment of one aspect of the present application, when N is an even number, among the N short sequences obtained, the (N/2)-th and (N/2+1)-th extracted short sequences are those in which, compared with the other short sequences, the point mutation to be detected lies at or closest to the middle of the short sequence; when N is an odd number, the ((N+1)/2)-th extracted short sequence is the one in which, compared with the other short sequences, the point mutation to be detected lies at or closest to the middle of the short sequence.

According to an embodiment of one aspect of the present application, in step 1), the fixed length L of each sequence can be an optional length, and the length can be as short as 35bp, or as long as 250bp, optionally 76-151bp.

According to an embodiment of an aspect of the present application, in step 2), M may be an optional integer, but based on practical considerations, M may be 2, 3, 4 or 5, and optionally, M≥5.
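
The seed extraction of step 2) can be sketched as follows. The function name `seed_pair` and the explicit checks are assumptions for illustration; the source only states that the seeds are the M bases at each end and that they do not contain the point mutation to be detected.

```python
def seed_pair(short_seq: str, m: int, mutation_pos: int) -> tuple[str, str]:
    """Return (head, tail): the first and last m bases of short_seq.

    mutation_pos is the 1-based position of the site inside short_seq.
    """
    if 2 * m >= len(short_seq):
        raise ValueError("seeds would overlap or cover the whole sequence")
    if mutation_pos <= m or mutation_pos > len(short_seq) - m:
        raise ValueError("a seed would contain the site to be detected")
    return short_seq[:m], short_seq[-m:]
```

Rejecting a site that falls inside either seed mirrors the stated requirement that the seed sequences must not contain the point mutation.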

According to an embodiment of one aspect of the present application, in step 3), the raw data is long-read data obtained by nanopore sequencing.

According to an embodiment of one aspect of the present application, data preprocessing of the raw third-generation sequencing data includes using Porechop and NanoFilt to remove the adapters and barcode sequences added during experimental library construction and to filter out low-quality and overly short sequencing reads, yielding the desired raw data set (clean data).

According to an embodiment of one aspect of the present application, the low-quality threshold includes but is not limited to Q5; for example, the threshold may be Q7 or higher. Here Q denotes the average quality value of a sequencing read, that is, the per-base quality values of the read summed and averaged. Those skilled in the art know that the threshold can be adjusted according to the actual situation. For the specific parameters, see https://en.wikipedia.org/wiki/FASTQ_format, which is incorporated herein by reference.

According to an embodiment of one aspect of the present application, the sequence length threshold of too short sequencing reads includes but is not limited to 100 bp; for example, the threshold can be 50 bp, 200 bp, 300 bp, etc. Those skilled in the art can adjust the threshold according to actual conditions.
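
The quality and length thresholds can be illustrated with a small filter. This sketch is not how Porechop or NanoFilt work internally; it assumes Phred+33 encoded FASTQ quality strings and takes Q as the arithmetic mean of the per-base scores, as the text describes, whereas real tools may compute the average differently (for example from error probabilities). The names `mean_quality` and `keep_read` are hypothetical.

```python
def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Arithmetic mean of the per-base Phred scores (Phred+33 assumed)."""
    if not quality_line:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def keep_read(sequence: str, quality_line: str,
              min_q: float = 7.0, min_len: int = 100) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough.

    The defaults mirror the Q7 and 100 bp thresholds mentioned in the text;
    both are meant to be tuned.
    """
    return len(sequence) >= min_len and mean_quality(quality_line) >= min_q
```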

According to an embodiment of one aspect of the present application, in step 4), to account for interference from the characteristics of third-generation sequencing data, the length L' of each extracted target sequence is limited to L' ≤ L+50.
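
A minimal sketch of target extraction with a seed pair and the L' ≤ L+50 limit follows. It assumes exact string matching of both seeds and ignores reverse-complement reads; the source does not say how seeds are matched, so these choices and the name `extract_target` are assumptions.

```python
from typing import Optional


def extract_target(read: str, head: str, tail: str,
                   length: int, slack: int = 50) -> Optional[str]:
    """Return the first segment of `read` that starts with `head`, ends with
    `tail`, and is at most length + slack bases long; None if there is none."""
    max_len = length + slack
    start = read.find(head)
    while start != -1:
        end = read.find(tail, start + len(head))
        if end == -1:
            return None  # no tail after this head, so none after later heads
        if end + len(tail) - start <= max_len:
            return read[start:end + len(tail)]
        # The nearest tail is already too far away; try the next head.
        start = read.find(head, start + 1)
    return None
```

The length cap discards segments in which the two seeds match far apart, which is one way to read the text's concern about interference from the characteristics of long-read data.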

According to an embodiment of one aspect of the present application, in step 5), the N data sets containing the target sequence obtained from the preceding steps can be analyzed with standard or mature mainstream pipelines for point mutation analysis of next-generation sequencing data, such as GATK Best Practice.

According to an embodiment of one aspect of the present application, point mutation detection and analysis are performed on the N data sets containing the target sequence to obtain N results; each result includes the mutation frequency F, the number AO of reads supporting the point mutation, and the sequencing depth DP at the point mutation position.

For example, the result of the first data set includes the mutation frequency F₁, the number AO₁ of reads supporting the point mutation, and the sequencing depth DP₁ at the point mutation position;

The result of the second data set includes the mutation frequency F₂, the number AO₂ of reads supporting the point mutation, and the sequencing depth DP₂ at the point mutation position;

...

The result of the N-th data set includes the mutation frequency F_N, the number AO_N of reads supporting the point mutation, and the sequencing depth DP_N at the point mutation position.

According to an embodiment of one aspect of the present application, in step 6), a weight is assigned to each point mutation result among the N detection results, namely W₁, W₂, W₃, ..., W_(N-1), W_N, with W₁ + W₂ + W₃ + ... + W_(N-1) + W_N = 1, wherein, among the N short sequences obtained in step 1), the closer the point mutation lies to the middle of the fixed length L of a short sequence, the greater the weight assigned to the detection result associated with that short sequence.

According to an embodiment of one aspect of the present application, when N is an even number, the (N/2)-th and (N/2+1)-th data sets (that is, the data sets obtained with the seed sequences taken from the (N/2)-th and (N/2+1)-th extracted short sequences) have the largest weight, W_(N/2) = W_(N/2+1), and W_N = W₁, W_(N-1) = W₂, W_(N-2) = W₃, and so on. When N is an odd number, the ((N+1)/2)-th data set (that is, the data set obtained with the seed sequences taken from the ((N+1)/2)-th extracted short sequence) has the largest weight, W_((N+1)/2), and W_N = W₁, W_(N-1) = W₂, W_(N-2) = W₃, and so on.
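
One weight profile that satisfies these constraints (symmetric, largest in the middle, summing to 1) is a normalized triangle. The text does not give concrete weight values, so the triangular shape and the name `triangular_weights` are assumptions chosen only to make the constraints concrete.

```python
from fractions import Fraction


def triangular_weights(n: int) -> list[Fraction]:
    """Symmetric weights W_1..W_n that peak in the middle and sum to exactly 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    raw = [min(i, n + 1 - i) for i in range(1, n + 1)]  # 1, 2, ..., peak, ..., 2, 1
    total = sum(raw)
    return [Fraction(r, total) for r in raw]
```

For N = 4 this yields 1/6, 1/3, 1/3, 1/6, so the two middle data sets share the largest weight; for odd N a single middle data set carries it.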

According to an embodiment of one aspect of the present application, in step 7), the formula is

In this formula, the inventors combine the theoretical view that bases are "more accurate in the middle and poorer at both ends" of a sequencing read, the idea of molecular biological tags (UMI/UID) at the data analysis level, and "weight" statistics to evaluate, error-correct, and adjust the detection results as a whole, controlling false positive results more effectively.
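
The formula itself is not reproduced in this text, so the following is only a sketch of one plausible weighted combination and should not be read as the claimed formula. It assumes that each per-data-set frequency is F = AO / DP and that F_correct is the weighted sum of those frequencies; only the 1% decision threshold comes from the text. The names `SiteResult`, `corrected_frequency`, and `call_site` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteResult:
    ao: int  # reads supporting the point mutation (AO)
    dp: int  # sequencing depth at the site (DP)

    @property
    def frequency(self) -> float:
        # Assumption: F = AO / DP; the text lists F, AO and DP without relating them.
        return self.ao / self.dp if self.dp else 0.0


def corrected_frequency(results: list[SiteResult], weights: list[float]) -> float:
    """Assumed form: F_correct = sum of W_i * F_i over the N data sets."""
    if len(results) != len(weights):
        raise ValueError("need exactly one weight per result")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * r.frequency for r, w in zip(results, weights))


def call_site(f_correct: float, threshold: float = 0.01) -> str:
    """Positive if F_correct is at least 1%, otherwise negative (from the text)."""
    return "positive" if f_correct >= threshold else "negative"
```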

According to an embodiment of one aspect of the present application, the method of the present application includes the following steps:

1) On the reference genome, extracting a short sequence of fixed length L a total of N times, wherein in the short sequence obtained by the first extraction the position of the point mutation to be detected is D₀, and the position of the point mutation on each extracted short sequence differs from its position on the previously extracted short sequence by a fixed distance D, finally obtaining a first sequence subset comprising N short sequences containing the point mutation to be detected;

Wherein L is any integer between 76 and 151 bp, D is any integer between 8 and 15, N is any integer between 4 and 18, and D₀ is any integer between 5 and 14;

2) Extracting seed sequences from each sequence in the first sequence subset obtained in step 1), the extraction positions being the M bases at each end of each sequence, to finally obtain a second sequence subset of N seed sequence pairs, where 5 ≤ M < D₀;

3) Preprocessing the raw third-generation sequencing data: using software such as Porechop and NanoFilt to remove the adapters and barcode sequences added during experimental library construction and to filter out low-quality and overly short sequencing reads, obtaining a raw data set of the desired quality;

4) Using the seed sequence pairs obtained in step 2), extracting the corresponding target sequences from the raw data set obtained in step 3); to account for interference from the characteristics of third-generation sequencing data, the length L' of each extracted target sequence is limited to L' ≤ L+50; this finally yields N data sets containing the target sequences extracted according to the seed sequence pairs;

5) Performing point mutation detection and analysis on the N data sets containing the target sequence obtained in step 4), using pipelines such as, but not limited to, GATK Best Practice, to obtain N final detection results for the target site; for the N-th result, the mutation frequency detected at the target site is recorded as F_N, the number of reads supporting the mutation at this site as AO_N, and the sequencing depth at this position as DP_N;

6) Assigning a weight to each point mutation result among the N detection results of step 5), namely W₁, W₂, W₃, ..., W_(N-1), W_N. When N is an even number, the (N/2)-th and (N/2+1)-th data sets (that is, the data sets obtained with the seed sequences taken from the (N/2)-th and (N/2+1)-th extracted short sequences) have the largest weight, W_(N/2) = W_(N/2+1), and W_N = W₁, W_(N-1) = W₂, W_(N-2) = W₃, and so on. When N is an odd number, the ((N+1)/2)-th data set (that is, the data set obtained with the seed sequences taken from the ((N+1)/2)-th extracted short sequence) has the largest weight, W_((N+1)/2), and W_N = W₁, W_(N-1) = W₂, W_(N-2) = W₃, and so on;

7) Weighting and error-correcting the target point mutation results and their frequencies obtained in step 5), where

F_correct is the final detected mutation frequency of this site;

If F_correct ≥ 1%, the result is positive; otherwise it is negative.

The present application also provides a device for detecting point mutations based on third-generation sequencing data, wherein the device includes:

A seed sequence extraction module, used to obtain a second sequence subset comprising seed sequence pairs;

A preprocessing module, used to preprocess the third-generation sequencing data to obtain a raw data set of the expected quality;

A primary analysis module, used to extract data sets containing the target sequence from the preprocessed raw data set by means of the seed sequence pairs of the second sequence subset, and then to perform point mutation detection and analysis and obtain data;

An advanced analysis module, used to further weight and correct the obtained results and obtain the final analysis results; and

A reporting module, used to output results based on the data.

According to an embodiment of one aspect of the present application, the seed sequence extraction module is used to extract a first sequence subset comprising N short sequences containing point mutations to be detected from the reference genome, and then from the first sequence subset Extracting a second sequence subset comprising a seed sequence pair; wherein the seed sequence pair is obtained according to the data processing method described in this application.

According to an embodiment of one aspect of the present application, the preprocessing module is used to filter low-quality and too short sequencing reads, and may include, for example, Porechop software and NanoFilt software.

According to an embodiment of one aspect of the present application, the data obtained by the primary analysis module has characteristics similar to second-generation (NGS) sequencing data, so standard or mature mainstream pipelines for point mutation analysis of NGS data, such as GATK Best Practice, can be used.

According to an embodiment of one aspect of the present application, the advanced analysis module includes a program or software for assigning a weight to each result. The weight assignment follows the theoretical view that bases are "more accurate in the middle and poorer at both ends" of a sequencing read, the idea of molecular biological tags (UMI/UID) at the data analysis level, and the method of "weight" statistics.

Based on the unique data characteristics of third-generation sequencing, the inventors of the present application address, at the data analysis level, the problems that existing analyses of third-generation sequencing data are limited by sequencing quality and by the alignment algorithms or deep-learning training-set data distributions they depend on, that their applicable scenarios are not broad enough, and that their robustness is insufficient. Starting from the data characteristics, the method of this application effectively avoids false negatives caused by the low alignment rates that result from random indels or high sequencing error. By design it also combines the theoretical view that bases are "more accurate in the middle and poorer at both ends" of a sequencing read, the idea of molecular biological tags (UMI/UID) at the data analysis level, and "weight" statistics to evaluate, error-correct, and adjust the detection results as a whole, controlling false positive results more effectively. The method is compatible with current standard or mature mainstream pipelines for point mutation analysis of second-generation sequencing data, such as GATK Best Practice. It enriches the technical means for point mutation analysis of third-generation sequencing data and largely resolves the insufficient accuracy of point mutation detection by third-generation sequencing. While exploiting the long read length of third-generation sequencing data, it also further promotes the application of third-generation sequencing in scientific research, and it is especially suitable for targeted mutation detection with hotspot panels.

Brief Description of the Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following will briefly introduce the accompanying drawings that need to be used in the embodiments of the present application. Obviously, the accompanying drawings described below are only some embodiments of the present application. Those of ordinary skill in the art can also obtain other drawings based on these drawings without making creative efforts.

FIG. 1 shows a flow chart of an analysis method for detecting point mutations based on third-generation sequencing data in one embodiment of the present application;

FIG. 2 is a structural block diagram of a device for detecting point mutations based on third-generation sequencing data in one embodiment of the present application.

Detailed Description

Features and exemplary embodiments of various aspects of the present application will be described in detail below. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the application. It will be apparent, however, to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is only to provide a better understanding of the present application by showing examples of the present application.

It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other. The embodiments will be described in detail below in conjunction with the accompanying drawings.

In third-generation sequencing data, the accuracy of single-base recognition is still limited, which shows up at the data level as low sequencing quality or sequencing errors and as data characteristics such as randomly distributed indels. Therefore, in downstream data analysis, it is particularly important to detect point mutations stably while better controlling false positive and false negative results.

With reference to Figure 1 and Figure 2 of the present application, the present application provides an analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:

S1: extracting the first sequence subset containing the point mutation to be detected from the reference genome;

S2: Extract seed sequences from the first sequence subset of S1, the extraction positions being the M bases at the beginning and at the end of each short sequence, to obtain a second sequence subset, which includes N pairs of seed sequences of length M; the seed sequences do not contain the point mutation to be detected;

S3: Preprocess the original third-generation sequencing data to obtain an original data set with expected quality;

S4: using the seed sequence pair of the second sequence subset obtained in S2 to extract the target sequence from the original data set obtained in S3, and obtain N data sets containing the target sequence;

S5: Perform point mutation detection and analysis on each of the N data sets containing the target sequence from S4 to obtain N results, wherein each result includes the mutation frequency F of the site to be detected, the number AO of reads supporting the point mutation, and the sequencing depth DP at the point mutation position;

S6: Assign a weight W to each point mutation result in the N detection results of S5;

S7: Calculate the point mutation result and its frequency according to the formula;

If F_correct ≥ 1%, the result is positive; otherwise it is negative.

As the above method shows, the inventors of the present application prepare seed sequences and, taking the characteristics of the sequencing data into account, sample and extract multiple times, converting the long reads of third-generation sequencing into short sequences. Point mutation analysis is then performed as for NGS data, and the ideas of experimental single-molecule tagging (UMI/UID) and weight statistics are combined to integrate, evaluate, error-correct, and adjust the results of the multiple samplings before the final call is made. This effectively avoids the insufficient accuracy of point mutation detection by third-generation sequencing.

Further, as shown in FIG. 2, one embodiment of the present application provides a device for detecting point mutations based on third-generation sequencing data, wherein the device includes: a seed sequence extraction module 101, used to obtain a second sequence subset comprising seed sequence pairs; a preprocessing module 102, used to preprocess the third-generation sequencing data to obtain a raw data set of the expected quality; a primary analysis module 103, used to extract data sets containing the target sequence from the preprocessed raw data set by means of the seed sequence pairs of the second sequence subset, and then to perform point mutation detection and analysis to obtain data; an advanced analysis module 104, used to further weight and correct the obtained results and obtain the final analysis result; and a report module 105, used to output results according to the data.

According to the device described in the present application, the seed sequence extraction module is used to extract from the reference genome a first sequence subset comprising N short sequences containing the point mutation to be detected, and then to extract from the first sequence subset a second sequence subset comprising seed sequence pairs; the seed sequence pairs are obtained according to the data processing method described in this application.

According to the device described in the present application, wherein the preprocessing module is used to filter low-quality and too-short sequencing reads, and may include, for example, Porechop software and NanoFilt software.

According to the device described in the present application, wherein, the data obtained by the primary analysis module has similar characteristics to the second-generation NGS sequencing data, and the standard or mature mainstream analysis process of point mutations can be analyzed using NGS data, such as GATK Best Practice, etc. .

The device according to the present application, wherein the advanced analysis module includes a program or software for assigning a weight to each result. The weight distribution combines the theoretical observation that base calls are "more accurate in the middle, worse at both ends" of a sequencing read, the idea of molecular tags (UMI/UID) applied at the data analysis level, and a "weight"-based statistical method.

Embodiment 1: Analyzing data using the method of the present application

1. A standard sample containing BRAF-V600E, EGFR-L858R, EGFR-T790M, KRAS-G13D and AKT1-E17K and the negative control standard sample NA12878 were each subjected to experimental library preparation in triplicate and sequenced on a QNome-9604 nanopore sequencer, yielding six original long-read sequencing data sets, among which HUM964, HUM965 and HUM966 were positive control data, and HUM967, HUM968 and HUM969 were negative control data.

2. For each of the 5 target sites to be detected in step 1, extract 9 short sequences with a fixed length of 101 bp from the genome according to the position of the site, such that the position of the target site on the extracted short sequences is fixed at the 11th, 21st, 31st, 41st, 51st, 61st, 71st, 81st and 91st base, respectively (i.e., D = 10 bp). This yields, for each of the 5 target sites, a final set of 9 short sequence fragments, each 101 bp long.

3. Extract seed sequences from each set of short sequence fragments, the extraction positions being the 10 bases at the beginning and the 10 bases at the end of each short sequence of each target site, finally obtaining 9 sets of seed sequence pairs for the short sequences containing the target site.

4. Preprocess the original third-generation sequencing data: use software such as Porechop and NanoFilt to remove the adapter and barcode sequences added during experimental library construction, and filter out low-quality reads (below Q7) and reads shorter than 100 bp, to obtain clean data.

5. From the clean data obtained in step 4, extract the corresponding target sequences according to the seed sequence pairs obtained in step 3. To account for the characteristic noise of third-generation sequencing data, the length of the extracted target sequences is limited to L' < 151. Finally, 9 target sequence data sets extracted with the seed sequence pairs are obtained.

6. Perform point mutation detection and analysis on the 9 data sets obtained in step 5. In this embodiment, GATK Best Practice is used to detect point mutations, and the final results of the 9 detections of each target site are obtained. For each target site, the detected mutation frequency is recorded as F_N, the number of reads supporting the mutation at this site as AO_N, and the sequencing depth at this position as DP_N.

7. Since the data sets containing target sequences of length L' obtained in step 5 have characteristics similar to data obtained by next-generation sequencing, this step treats the target short sequence data obtained in step 5 as data from a next-generation sequencing platform and assigns weights accordingly. Based on the characteristic of next-generation sequencing data that base calls are "more accurate in the middle and worse at both ends" of a read, a weight is assigned to the result of each point mutation in the 9 detection results, namely W_1, W_2, W_3, W_4, W_5, W_6, W_7, W_8, W_9, where W_1 + W_2 + W_3 + W_4 + W_5 + W_6 + W_7 + W_8 + W_9 = 1, W_5 = 0.25, W_1 = W_9 = 0.05, W_2 = W_8 = 0.075, W_3 = W_7 = 0.1, and W_4 = W_6 = 0.15.

The target point mutation results and frequencies obtained in step 6 are weighted and error-corrected, defined as

where F_correct is the final detected mutation frequency of the site; if F_correct ≥ 1%, the site is positive, otherwise it is negative.

The results are summarized in Table 1. The method of this application detects each known mutation with high sensitivity, consistent with expectations, and gives better results than the current mainstream algorithms and software for point mutation analysis of third-generation sequencing data, effectively controlling false negatives and false positives. The method of the present application is therefore feasible.

Table 1 Statistics of each mutation detected by the method of this application and its frequency.

Here, Nano2NGS denotes the method described in this application. The data in Table 1 show that, using the method of this application, the BRAF-V600E, EGFR-L858R, EGFR-T790M, KRAS-G13D and AKT1-E17K mutations were detected in all three replicates, with good reproducibility among the three results and no significant difference from the expected frequencies.

The Longshot method, published in the journal Nature Communications (DOI: 10.1038/s41467-019-12493-y), is a point mutation detection method for third-generation sequencing data developed at the University of California that incorporates a hidden Markov model. As the data in Table 1 show, no point mutation data could be obtained using this method.

The DeepVariant method (the PEPPER-Margin-DeepVariant method, published on bioRxiv (doi: https://doi.org/10.1101/2021.03.04.433952), developed and optimized on the basis of the Google team's DeepVariant) could not be used directly for the detection of point mutations in the third-generation sequencing data.

Although the iGDA method can be used directly for the detection of point mutations in third-generation sequencing data, it also detected point mutations in the negative control samples, that is, it produced false positive results.

Therefore, the method of this application not only exploits the characteristics of the data to effectively avoid false negatives caused by low alignment rates resulting from random indels or high sequencing error rates, but also combines the theoretical observation that base calls are "more accurate in the middle, worse at both ends" of a sequencing read, the idea of molecular tags (UMI/UID) applied at the data analysis level, and a "weight"-based statistical method to evaluate, error-correct and adjust the detection results as a whole, thereby controlling false positive results more effectively. The method of this application is well compatible with the current standard or mature mainstream analysis workflows for point mutation analysis of second-generation sequencing data, such as GATK Best Practice. It enriches the technical means available for point mutation analysis of third-generation sequencing data and largely addresses the insufficient accuracy of third-generation sequencing in detecting point mutations. While making full use of the long read length of third-generation sequencing, it further promotes the application of third-generation sequencing in scientific research, especially for mutation detection targeting relevant hotspot panels.

In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

It should be understood that in this embodiment of the present application, "B corresponding to A" means that B is associated with A, and B can be determined according to A. However, it should also be understood that determining B according to A does not mean determining B only according to A, and B may also be determined according to A and/or other information.

The above is only a specific embodiment of the application, but the scope of protection of the application is not limited thereto. Any person familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the application, and these modifications or replacements shall be covered by the scope of protection of this application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.

Claims

An analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:

1) extracting the first sequence subset comprising the point mutation to be detected from the reference genome;

On the reference genome, a short sequence with a fixed length L is extracted N times, such that there is a fixed distance D between the position of the point mutation to be detected on each extracted short sequence and its position on the previously extracted short sequence, and
wherein N, D, and L are all integers, and finally the first sequence subset is obtained, which includes N short sequences containing the point mutation to be detected;

2) Extracting seed sequences from the first sequence subset of step 1), the extraction positions being the M bases at the beginning and the M bases at the end of each short sequence, and obtaining the second sequence subset, which includes N pairs of seed sequences with a length of M;

3) Preprocessing the original third-generation sequencing data to obtain an original data set with expected quality;

4) using the seed sequence of the second sequence subset obtained in step 2) to extract the target sequence from the original data set obtained in step 3), and obtain N data sets containing the target sequence;

5) Performing point mutation detection and analysis on each of the N data sets containing the target sequence from step 4), and obtaining N results; wherein each result includes the mutation frequency F of the site to be detected, the number AO of reads supporting the point mutation, and the sequencing depth DP at the position of the point mutation;

6) assign weight W to the result of each point mutation in the N detection results of step 5);

7) Calculate the point mutation result and its frequency according to the formula;

If F_correct ≥ 1%, the site is positive, otherwise it is negative, where F_correct is the final detected mutation frequency of this site.
The method according to claim 1, wherein, in step 1),
The method according to claim 1, wherein, in step 1), in the short sequence extracted for the first time, the position of the point mutation to be detected on the short sequence is D_0, and in the X-th extraction, the position L_x of the point mutation in the short sequence extracted for the X-th time satisfies L_x = D_0 + (X-1)D;

in,
The method according to claim 1, wherein L is 76-151 bp.
The method according to claim 1, wherein, in step 2), M≥5.
The analysis method according to claim 1, wherein, in step 3), data preprocessing is performed on the original third-generation sequencing data, including filtering low-quality and too short sequencing reads;

Wherein, the low quality threshold is Q5; and/or the sequence length threshold of too short sequencing reads is 100bp.
The analysis method according to claim 1, wherein, in step 4), the length L' of the target sequence satisfies L' ≤ L + 50.
The analysis method according to claim 1, wherein, in step 5), the analysis uses GATK Best Practice analysis process.
The analysis method according to claim 1, wherein, in step 6), the weight distribution to the result of each point mutation in the N detection results comprises:

the sum of the weights W_1 to W_N is 1; and

Among the N short sequences obtained in step 1), the closer the point mutation is to the middle of the fixed length L of the short sequences, the greater the weight assigned to the detection results related to the short sequences.
The analysis method according to claim 9, wherein, in step 6), a weight is assigned to the result of each point mutation in the N detection results,

wherein, when N is an even number, the (N/2)-th and the (N/2+1)-th data sets have the largest weight, W_{N/2} = W_{N/2+1}, and W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on;

wherein, when N is an odd number, the ((N+1)/2)-th data set has the largest weight, W_{(N+1)/2}, and W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on.
An analysis method for detecting point mutations based on third-generation sequencing data, the method comprising the following steps:

1) extracting the first sequence subset comprising the point mutation to be detected from the reference genome;

Short sequences of fixed length L are extracted N times from the reference genome; in the short sequence obtained by the first extraction, the position of the point mutation to be detected is D_0, and there is a fixed distance D between the position of the point mutation to be detected on each extracted short sequence and its position on the previously extracted short sequence; finally a first sequence subset is obtained, which includes N short sequences containing the point mutation to be detected;

Wherein, L is any integer between 76 and 151 bp, D is any integer between 8 and 15, N is any integer between 4 and 18, and D_0 is any integer between 5 and 14;

2) Extracting a seed sequence pair from each sequence in the first sequence subset obtained in step 1), the extraction positions being the M bases at each of the two ends of each sequence, and finally obtaining a second sequence subset of N seed sequence pairs, where 5 ≤ M < D_0;

3) Perform data preprocessing on the original third-generation sequencing data: use Porechop software and NanoFilt software to remove the adapter and barcode sequences added during experimental library construction, filter low-quality and too short sequencing reads, and obtain the original data set with the desired quality;

4) According to the seed sequence pairs obtained in step 2), extract the corresponding target sequences from the original data set obtained in step 3), the length of the target sequence satisfying L' ≤ L + 50, and finally obtain N data sets containing the target sequence;

5) Use the GATK Best Practice analysis process to perform point mutation detection and analysis on the N data sets containing the target sequence obtained in step 4), and obtain the final results of N detections of the target site; for each detection, the detected mutation frequency of the site is recorded as F_N, the number of reads supporting the mutation at this site as AO_N, and the sequencing depth at this site as DP_N;

6) The result of each point mutation in the N detection results of step 5) is assigned a weight, and the sum of the weights W_1 to W_N is 1;

wherein, when N is an even number, the (N/2)-th and the (N/2+1)-th data sets have the largest weight, W_{N/2} = W_{N/2+1}, and W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on;

wherein, when N is an odd number, the ((N+1)/2)-th data set has the largest weight, W_{(N+1)/2}, and W_N = W_1, W_{N-1} = W_2, W_{N-2} = W_3, and so on;

7) The target point mutation results and their frequencies obtained in step 5) are weighted and error-corrected, defined as
where F_correct is the final detected mutation frequency of this site;

If F_correct ≥ 1%, the site is positive, otherwise it is negative.
A device for detecting point mutations based on third-generation sequencing data, comprising:

The seed sequence extraction module is used to extract, from the reference genome, a first sequence subset comprising N short sequences containing the point mutation to be detected, and then to extract from the first sequence subset a second sequence subset comprising seed sequence pairs;

The preprocessing module is used to preprocess the third-generation sequencing data to obtain the original data set with the expected quality;

The primary analysis module is used to use the seed sequence pair of the second sequence subset to extract a data set containing the target sequence from the preprocessed original data set, obtain N data sets containing the target sequence, and then perform point mutation detection analysis and obtain N results; wherein, each result includes the mutation frequency F of the site to be detected, the reads support number AO of the point mutation, and the sequencing depth DP of the point mutation position;

An advanced analysis module, which is used to further weight and correct the obtained results, and obtain the final analysis results; and

A report module for outputting results based on data;

The advanced analysis module is used to assign a weight W to the result of each point mutation in the N detection results, and calculate the point mutation result and its frequency according to the formula;

If F_correct ≥ 1%, the site is positive, otherwise it is negative, where F_correct is the final detected mutation frequency of this site;

The reporting module is used to output point mutation results and their frequencies.
The device according to claim 12, wherein the preprocessing module is used to filter low-quality and too short sequencing reads, including Porechop software and NanoFilt software.
The device according to claim 12, wherein said primary analysis module comprises a GATK Best Practice analysis process.
The apparatus of claim 12, wherein the advanced analysis module includes a program or software for assigning a weight to each result.