CN114420214A

CN114420214A - Quality evaluation method and screening method of nucleic acid sequencing data

Info

Publication number: CN114420214A
Application number: CN202210104023.1A
Authority: CN
Inventors: 周文雄; 黄家蔚; 司二玲; 陈子天; 吴思彧
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-04-29

Abstract

The invention provides a quality evaluation method and a screening method of nucleic acid sequencing data, which are specially suitable for the quality evaluation method and the screening method of error correction code sequencing by utilizing the difference between a sequencing signal and a corrected signal to evaluate the base sequencing quality or screen the nucleic acid sequencing data, particularly utilizing the difference between a signal after phase loss correction and a signal after error correction.

Description

Quality evaluation method and screening method of nucleic acid sequencing data

Technical Field

The invention relates to a quality evaluation method and a screening method of nucleic acid sequencing data, belonging to the field of gene sequencing.

Background

High-throughput sequencing is a technique for sequencing thousands of DNA sequences simultaneously, and is widely used in basic biological and medical research, as well as in vitro diagnostics. Typically, high throughput sequencers, in addition to outputting the DNA sequence being measured, give each base output a quality value to characterize the accuracy of the measurement of that base. Typically, this quality value is given in the form of Phred, i.e. when the base is measured with accuracy D, the quality value given is:

q＝-10 log₁₀(1-p)

for example, a quality value of 10 corresponds to 90% accuracy, a quality value of 20 corresponds to 99% accuracy, and a quality value of 30 corresponds to 99.9% accuracy.

Currently, the quality value assessment method commonly used in the field of high-throughput sequencing is the Phred algorithm (Ewing B, Green P.Base-solving of automated sequence transactions using Phred. II. error properties. genome Res.8: 186-194 (1998)). The Phred algorithm is commonly used in the field of DNA sequencing, but the predictors used by different sequencing techniques differ. For example, the predictors used by iontorent are: the position of the base in the sequence; the length of the homomultimer in which the base is located; the degree to which the sequencing signal of the base is close to an integer; the degree of phase loss of the sequencing signal of the base, and the like. Illumina also discloses a quality evaluation method using a neural network, that is, a neural network is trained to predict the quality of bases by the values of predictors including online overlap, purity, phasing, start5, hexamer score, motif accumulation, endness, approximate homopolymer, intensity decay, penultimate degree of purification, signal overlap with background (sobb) and/or purity G adjustment of offset, etc. (patent CN 202080005431.0).

A good scoring algorithm should have a high degree of discrimination, i.e., high quality bases can be screened out and scored high, low quality bases can be screened out and scored low. When the existing algorithm is used for evaluating the quality of sequencing data sequenced by the error correcting code, the problem of insufficient discrimination/low discrimination exists. Therefore, it is required to develop a base quality evaluation method suitable for error correction code sequencing data and having higher discrimination.

High throughput sequencers can generate large amounts of data in one run. However, not all generated data is high quality data due to defects in optical imaging, signal acquisition, cloning by DNA amplification, chemical reactions, mechanical systems, algorithmic processing, etc., and noise interference. If the data provided to the user after the user is off-line contains low-quality and unreliable sequencing data, the analysis of the sample by the user can be influenced, false positive or false negative misjudgment can be caused, the clinical diagnosis result can be seriously influenced, and the disease condition can be delayed. Therefore, screening, retaining and outputting high quality sequencing data is an essential step for high throughput sequencing.

Illumina used "purity" to screen its sequencing data (US 13006206). Specifically, Illumina labels different fluorescence for four bases respectively, the base with the strongest fluorescence is the base to be detected during sequencing, the purity of the base is defined as the strongest fluorescence intensity of the four fluorescence divided by the sum of the four fluorescence intensities, and the sequence with the purity higher than a given threshold is considered as a high-quality sequence. Ion torrent uses two indicators to screen its sequencing data: SSQ (euclidean distance between the correction signal and its integer value) and PPF (proportion of non-zero signals) (patent EP19181402, wo us11067959), sequences in which both indices are smaller than a given threshold value are considered to be high quality sequences.

In practical applications, high-throughput sequencers are commonly used to screen sequencing data for the quality values assigned to the bases of the DNA sequence being tested. For example, the fastp software would take a moving average of the quality values for each sequence and cut out the sequences where the average is lower. That is, the screening process must be performed after the quality values corresponding to the bases are obtained, which increases the waiting time, and there is a need to develop a screening method for sequencing data that does not depend on the quality values of base sequencing.

Disclosure of Invention

In one aspect, the present invention provides a method for evaluating the quality of nucleic acid sequencing data, comprising:

a) sequencing a reference nucleic acid sample to obtain a set of sequencing signals s; sequencing a nucleic acid sample to be tested by using the same sequencing method to obtain a group of sequencing signals s';

b) performing signal correction on the sequencing signal s to obtain a correction signal c, wherein the correction signal c can be directly converted into a base sequence;

c) comparing the converted base sequence with a reference sequence to obtain a comparison result, and marking the base as correct sequencing or wrong sequencing according to the comparison result;

d) comparing the signal difference of the base with the correct sequencing or the wrong sequencing in the sequencing signal s and the corresponding part in the corrected sequence c, and establishing the relation between the signal difference and the base sequencing accuracy;

e) performing signal correction on the sequencing signal s ' in the same correction mode as the step b) to obtain a correction signal c ', and converting the c ' into a base sequence;

f) comparing the signal difference of the corresponding parts in s 'and c' of each base in the converted base sequence, and predicting the sequencing accuracy of the base by using the relation between the signal difference and the base accuracy established in the step d).

According to a preferred embodiment, the type of signal generated by the sequencing may be an optical signal or an electrical signal; the optical signal may be monochromatic or polychromatic.

According to a preferred embodiment, the nucleic acid to be tested comprises DNA, or RNA or other types of nucleic acid molecules.

According to a preferred embodiment, the genomic sequence of the species to which the reference nucleic acid sample belongs is known; when the reference nucleic acid is DNA, the reference sequence is a genomic sequence of the species to which the reference nucleic acid belongs; where the reference nucleic acid is RNA, the reference sequence is a transcriptome sequence of the species to which the reference nucleic acid belongs.

According to a preferred embodiment, the converted base sequences are compared with a reference sequence to obtain a comparison result, high-quality compared base sequences are further screened out, and bases in the high-quality compared base sequences are marked as sequencing correct or sequencing error; the high quality comparison requires a specific selection of a range of quality values based on the comparison software or algorithm used.

According to a preferred embodiment, the sequencing signal is a signal directly acquired by a sequencer or a normalized signal; correspondingly, the calibration process for the sequencing signal may be phase loss calibration, and the calibration signal is a phase loss calibrated signal.

According to a preferred embodiment, in error correction code sequencing or SOLiD-based sequencing technologies using precise chemical readout, the sequencing signal may be a phase-loss corrected signal; correspondingly, the correction process for the sequencing signal may be an error correction process, and the correction signal is an error correction corrected signal.

According to a preferred embodiment, in single molecule real-time sequencing using circular consistent sequencing mode, or nanopore sequencing technology using DNA replication, repeated multiple sequencing to improve accuracy, the sequencing signal may be the original sequence; accordingly, the process of correcting the sequencing signal may be a process of finding a consensus sequence.

According to a preferred embodiment, the method of correlating signal differences to base accuracy is performed by constructing a look-up table of signal differences to base accuracy.

According to a preferred embodiment, the method for establishing the link between the signal difference and the base accuracy rate is to divide one or more predictors into a plurality of intervals, and count the accuracy rate of the base in each interval and the quality value corresponding to the accuracy rate; the method of evaluation may be to calculate the range of which predictor each base in the nucleic acid to be detected falls into, and assign the value of mass corresponding to that range to that base.

According to a preferred embodiment, the method of establishing a link between signal difference and base accuracy and the corresponding assessment method is the Phred algorithm.

According to a preferred embodiment, the method of correlating signal differences to base accuracy is machine learning.

According to a preferred embodiment, a plurality of different differences of the sequencing signal s and the correction signal c can be compared as a plurality of predictors to jointly assess base quality; let s be(s)₁，s₂，...，s_n)，c＝(c₁，c₂，...，c_n) Then the plurality of different differences in the sequencing signal s and the calibration signal c comprises:

1)

2)

3)

4)max|s_i-c_i|

5)min|s_i-c_i|

6)

7)

according to a preferred embodiment, the comparison of the difference between the sequencing signal s and the correction signal c, which may be the difference of a portion of the two signals, comprises:

1) differences in all sub-signals;

2) differences of the first several sub-signals;

3) differences of the latter several sub-signals;

4) differences of the middle several sub-signals;

5) differences in odd-numbered sub-signals;

6) differences in even-numbered sub-signals;

7) the difference of sub-signals in s or c that are greater than a given threshold;

8) the difference of sub-signals in s or c that is less than a given threshold;

9) a combination of the above options, including but not limited to differences of the first several odd numbered sub-signals.

According to a preferred embodiment, local differences of the sequencing signal s and the correction signal c may be compared; the local difference is for the sequencing signal s numbered i_iAnd correction signal c_iBefore and after itA number of sub-signals(s)_i-m，s_i-m+1，s_i-m+2，…，s_i+m-1，s_i+m) And (c)_i-m，c_i-m+1，c_i-m+2，…，c_i+m-1，c_i+m) The difference between them, namely, a local difference can be calculated for each sub-signal to obtain a group of local differences; m is an integer less than i.

According to a preferred embodiment, the base quality is jointly assessed using other predictors including, but not limited to:

1) the position of the base in the sequence;

2) the length of the homomultimer in which the base is located;

3) the position of the base in the homomeric polymer;

4) the degree to which the sequencing signal of the base is close to an integer;

5) the degree of dephasing of the sequencing signal of the base;

6) the attenuation degree of the sequencing signal of the base;

7) parameters obtained by estimating the sequencing signal of the base in the correction process comprise a unit signal, a background signal, an attenuation coefficient, a lead coefficient and a lag coefficient.

According to a preferred embodiment, in ECC sequencing, the base quality is jointly evaluated using the difference between the sequencing signal s and the calibration signal c, in combination with the length of the degenerate polymer in which the bases are located as a predictor.

According to a preferred embodiment, in ECC sequencing, the base quality is jointly evaluated using the difference between the sequencing signal s and the calibration signal c, in combination with the number of bases in the degenerate polymer where the bases are located as a predictor.

The invention provides a quality evaluation system of nucleic acid sequencing data, which is characterized by comprising the following components:

a processor, a memory, and a program for quality assessment of nucleic acid sequencing data, the program comprising instructions for:

a) performing signal correction on a sequencing signal s obtained by sequencing a reference nucleic acid sample to obtain a correction signal c, wherein the correction signal c can be directly converted into a base sequence;

b) comparing the converted base sequence with a reference sequence to obtain a comparison result, and marking the base as correct sequencing or wrong sequencing according to the comparison result;

c) comparing the signal difference of the corresponding parts of the bases with correct sequencing or incorrect sequencing in the sequencing signal s and the correction signal c, and establishing the relation between the signal difference and the base accuracy;

d) performing signal correction on a new group of sequencing signals s ' obtained by sequencing a nucleic acid sample to be detected in the same correction mode as the step a) to obtain correction signals c ', and converting the correction signals c ' into base sequences;

e) comparing the signal difference of the corresponding parts in s 'and c' of each base in the converted base sequence, and predicting the sequencing accuracy of the base by using the relation between the signal difference and the base accuracy established in the step c).

In another aspect, the present invention provides a method for screening nucleic acid sequencing data, comprising:

a) sequencing a nucleic acid sample to be detected to obtain a group of sequencing signals;

b) performing signal correction on the sequencing signal to obtain a correction signal, wherein the correction signal can be directly converted into a base sequence;

c) comparing the signal difference of the corresponding parts of the converted base sequences in the sequencing signal and the correction signal;

d) discarding the sequencing signal if said signal difference between the sequencing signal and the correction signal is greater than a given threshold, otherwise retaining.

According to a preferred embodiment, a plurality of different differences in the sequencing signal and the calibration signal can be compared; let sequence signal ═(s)₁，s₂，...，s_n) Correction signal ═ c₁，c₂，...，c_n) Then the plurality of different differences in the sequencing signal and the calibration signal comprises:

1)

2)

3)

4)max|s_i-c_i|

5)min|s_i-c_i|

6)

7)

according to a preferred embodiment, the comparing the difference between the sequencing signal and the calibration signal, which may be the difference of a portion of the two signals, comprises:

1) differences in all sub-signals;

2) differences of the first several sub-signals;

3) differences of the latter several sub-signals;

4) differences of the middle several sub-signals;

5) differences in odd-numbered sub-signals;

6) differences in even-numbered sub-signals;

8) the difference of sub-signals in s or c that is less than a given threshold;

According to a preferred embodiment, local differences in the sequencing signal and the correction signal can be compared; the local difference is for the sequencing signal s numbered i_iAnd correction signal c_iA number of sub-signals(s) preceding and following it_i-m，s_i-m+1，s_i-m+2，…，s_i+m-1，s_i+m) And (c)_i-m，c_i-m+1，c_i-m+2，...，c_i+m-1，c_i+m) The difference between them, namely, a local difference can be calculated for each sub-signal to obtain a group of local differences; m is an integer less than i; if a certain local difference in the group of local differences is larger than a given threshold value, cutting off a part with overlarge local difference in the sequence, and outputting a truncated sequence; if the length of the truncated sequence is smaller than a preset value, the whole sequence can be discarded.

According to a preferred embodiment, a plurality of different differences between the sequencing signal and the calibration signal can be compared, and the decision to discard or retain the sequence to be screened can be made after a comprehensive determination.

The invention also provides a screening system of nucleic acid sequencing data, which is characterized by comprising the following components:

a processor, a memory, and a program for screening nucleic acid sequencing data, the program comprising instructions for:

a) performing signal correction on a sequencing signal obtained by sequencing a nucleic acid sample to be detected to obtain a correction signal, wherein the correction signal can be directly converted into a base sequence;

b) comparing the signal difference of the corresponding parts of the converted base sequences in the sequencing signal and the correction signal;

c) discarding the sequencing signal if said signal difference between the sequencing signal and the correction signal is greater than a given threshold, otherwise retaining.

The invention has the advantages of

The quality evaluation method and the screening method of sequencing data disclosed by the invention take the signal difference between the sequencing signal and the correction signal as an important judgment standard, and have the following advantages that:

1. the invention selects the difference between the sequencing signal and the correction signal as a predictor, and the predictor contains the information quantity with the maximum accuracy of the representation base and is suitable for error correcting code sequencing.

2. Compared with the existing method, the method has higher discrimination, and the evaluated quality value can reach up to 45 and exceeds 41 of Illumina and 40 of Roche 454.

3. According to the sequencing data screening method, the quality value corresponding to each base does not need to be obtained first, and the sequencing data is screened only by utilizing the difference between the sequencing signal and the correction signal, so that the sequencing time is saved.

4. The sequencing data screening method can evaluate or screen sequencing data by utilizing the difference between the signals after phase loss correction and the signals after error correction, and is a sequencing data evaluation and screening method specially suitable for error correction code sequencing. The method has wide application range and is suitable for sequencing data obtained by various sequencing methods including error correcting code sequencing.

Drawings

In order to more clearly illustrate the embodiments of the present invention, the drawings to be used in the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some embodiments according to the present invention, and that those skilled in the art can also obtain corresponding drawings of other embodiments without inventive effort.

FIG. 1 is a graph showing the correspondence between the difference between a phase loss correction signal and an error correction signal and the base quality value.

FIG. 2 is a graph showing the discrimination of the evaluation results of the ECC sequencing data of the nucleic acid to be tested using the quality evaluation table.

FIG. 3 is an evaluation chart of the accuracy of the result scored by the evaluation method of the present invention.

FIG. 4 is a signal difference density profile of a sequencing signal and a calibration signal.

Detailed Description

In high-throughput sequencing technology, to amplify a sequencing signal, a DNA molecule to be detected is generally amplified to form a DNA cluster before sequencing. Ideally, all molecules in the cluster formed by amplifying the same DNA molecule should be synthesized into a new strand with the same length in each round of sequencing, so that the sequencing signal corresponds to the synthesized new strand one by one, and sequence information can be obtained. However, in practical sequencing, an inevitable limiting factor is phase loss (phasing), i.e., the loss of synchronism of the nascent strands of a batch of DNA molecules of the same sequence during synthesis. This is mainly caused by unexpected nucleotide incorporation by polymerase or incomplete extension. The phase loss includes two types: due to the presence of mismatches or contaminating nucleotides, the extension length of a fraction of the molecules in a DNA cluster is higher than expected, resulting in a "lead" (lead); some molecules are not extended by the polymerase because the reaction is incomplete and the extension length is below the expected value, resulting in a "lag" (lag). The lead and lag make the molecules in the DNA cluster have different extension lengths, and the difference can be continuously increased along with the progress of the sequencing reaction, so that the relation between the sequencing signal and the actual sequence is weaker and weaker, the sequencing error is caused, and the increase of the reading length is severely limited. In amplification-based high-throughput sequencing technologies, signal attenuation and phase loss are major reasons limiting sequencing read length. In addition, defects in optical imaging, signal acquisition, DNA amplification, polyclonal, mechanical systems, algorithm processing, etc., and noise interference can lead to eventual sequencing errors or poor quality sequencing data. Before the sequencing data is used for subsequent analysis, the quality evaluation and screening of the sequencing data need to be completed firstly, and sequences with high sequencing accuracy are screened out for subsequent analysis and calculation.

Among them, the method for evaluating the quality of sequencing data generally uses the Phred algorithm, and the algorithm lacks a proper predictor for Error Correcting Code (ECC) sequencing. The invention selects the difference between the sequencing signal and the correction signal as a predictor to evaluate the sequencing quality of the basic group, and has higher discrimination.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. For a better disclosure of the methods and contents of the present invention, the more critical terms of the invention are explained in detail herein.

1.Interpretation of terms

Error Correcting Code (ECC) sequencing and correction

The sequencing reaction of a nucleic acid sequence to be detected consists of more than two (for example three) dual base sample injection flows, the three flows are mutually orthogonal, each flow provides half of the information content of the DNA to be detected, signals of the three flows are integrated, and the information redundancy is utilized, so that the sequence of the DNA to be detected can be deduced, the possible sequencing errors can be detected and corrected, and the gene sequencing precision is greatly improved. This correction process is an error correction.

Phase loss correction

In high-throughput sequencing, each sequencing unit contains thousands to tens of thousands of different DNA molecules which are obtained by copying the same DNA template and have the same sequence, and sequencing signals are amplified to the level capable of being accurately detected. However, some DNA molecules cannot guarantee sufficient reaction in each round in the sequencing process, which leads to a lag phenomenon, and others lead to a lead phenomenon due to advanced reaction caused by substrate doping or mismatching and the like. The extension of these DNA molecules is gradually out of sync as sequencing progresses, a phenomenon known as "dephasing" in sequencing. The process of algorithmically fitting and re-correcting a dephasing sequencing signal according to a model of the sequencing reaction is referred to as "dephasing correction".

Accurate chemical readout

Precise chemical readout (Exact Call Chemistry) of SOLiD next generation sequencing platform sequencing is performed by using a ligation method to obtain a SOLiD color coding sequence based on a 'double-base coding principle', and then data analysis is performed to compare an original color sequence with a reference sequence converted into a color code, so that the SOLiD color sequence is positioned on the reference, and a sequencing error is corrected.

Circular consensus sequencing

The Circular Consensus Sequencing (CCS) mode of PacBio corporation is a method in which a consensus sequence can be calculated by repeatedly sequencing the same molecule, thereby improving accuracy. Circular Consensus Sequences (CCS) reads were generated by aligning libraries from individual ZMWs. Note that the CCS read produced does not include or require alignment with the reference sequence. The CCS reads generated require at least two rounds of reads of subreads from the insert using the CCS algorithm.

Nanopore sequencing technology with improved accuracy

The technique, represented by intramolecular-ligated nanopore consensus sequencing (INC-Seq), first requires circularization of a template DNA molecule, followed by amplification of the molecule using rolling circle amplification, resulting in a circular molecule consisting of multiple repeat units, which are subsequently broken into linear DNA strands prior to sequencing. DNA reads from the same original sequence are sequenced for multiple times, and accurate reads of the original DNA sequence can be obtained by combining the information through calculation. In addition, nanopore sequencing technologies that utilize DNA replication, repeated multiple sequencing to improve accuracy also include the Circ-seq, R2C2, and others.

Comparison of

Alignment (alignment or alignment): "alignment" is a common concept in bioinformatics, where alignment is often used to compare similarities between different nucleic acids or between different proteins. The alignment in the present invention refers to comparing a base sequence obtained by sequencing with a reference sequence, thereby determining whether the base sequence obtained by sequencing is correct or not. Common sequence alignment algorithms and software include, but are not limited to, for example, the Smith-Waterman algorithm, Bowtie, BWA, SOAP, Needleman-Wunch algorithm, Bowtie2, BLAST, ELAND, TMAP, MAQ, minimap2, SHRIMP, and the like.

Comparison quality

The base sequences obtained by sequencing are aligned to the reference sequence by utilizing the conventional alignment software or algorithm, taking bowtie2 as an example, bowtie2 measures the alignment quality by using a scoring mechanism rather than complete mismatch, and the higher the score is, the higher the probability of correct alignment is. In the high quality comparison described in the present invention, the range of quality values needs to be specifically selected according to the comparison software or algorithm used; for example, when the sequence alignment is performed using BWA, a base sequence aligned with high quality refers to a base sequence having an alignment quality of greater than 0, or equal to or greater than 10, or equal to or greater than 20, or equal to or greater than 30, or equal to or greater than 40, or equal to or greater than 50, or equal to or greater than 60.

Normalizing the signal

The original sequencing signal is subjected to attenuation correction to obtain a more accurate unit signal, and the ratio of the sequencing signal subjected to attenuation correction to the unit signal of each sequencing site is the normalized signal of the site in each sequencing wheel.

Predictor

Predictors, which are base features, in the present invention include, for example: the difference between the sequencing signal s and the calibration signal c, the position of the base in the sequence, the length of the homologous polymer where the base is located, the position of the base in the homologous polymer where the base is located, the number of more bases in the degenerate polymer where the base is located, the degree of approximation of the sequencing signal where the base is located to an integer, the degree of dephasing of the sequencing signal where the base is located, the degree of attenuation of the sequencing signal where the base is located, and parameters estimated during calibration of the sequencing signal where the base is located, such as a unit signal, a background signal, an attenuation coefficient, a lead coefficient, a lag coefficient, and the like. The error probability is assigned to the base recognition based on the above-described features (predictors) associated with the base reads. In this method, a training data set is first used, and features in the training data set are matched to a known error rate (error rate is typically determined by alignment with a reference genome). The above method can then be applied to map this feature to an error probability (or quality value, which is a logarithmic transformation of the error probability) for a target dataset. That is, the predictor does not help improve the quality of the base reads, but can help distinguish between low quality and high quality base reads. In the prior art, the assignment of error probabilities to corresponding base reads has been achieved using a number of different methods, such as position-specific error probability, Phred, Pyrobayes, logistic regression, SVM, vector quantization, etc.

Length of degenerate multimers

Degenerate sequencing is a multi-base sequencing, which differs from single-base sequencing in that only one nucleotide molecule is extended in each reaction, the number of nucleotides extended in each reaction can be multiple, the intensity of the fluorescence signal released by the sequencing reaction is positively correlated with the number of released fluorophores, and under the ideal condition of no attenuation and no phase loss, the fluorescence signal of each reaction reflects the number of bases extended in the reaction, which is called the length of Degenerate Polymer (DPL).

Degenerate basesBase of

In the present invention, the degenerate bases are represented by letters of Table 1 below, for example, the letter M represents A and/or C, according to the IUPAC notation nomenclature (Nucleic acid specification).

TABLE 1

Letters	The base represented by
		M	A/C
K	G/T
		R	A/G
Y	C/T
		W	A/T
S	C/G
		B	C/G/T
D	A/G/T
		H	A/C/T
V	A/C/G

Base quality value

The value may be an integer map of the probability of base recognition errors. The current common representation is: formula q-10 log₁₀(1-p) in which: p is the measurement accuracy of the base, and a higher value indicates more reliable base recognition and less possibility of base misdetection. For example, a mass value of 20 (frequently written Q20) calculated as P of 0.99 with a sequencing accuracy of 99%, a mass value of 30 (frequently written Q30) calculated as P of 0.999 with a sequencing accuracy of 99.9%. Currently, the highest quality value obtained by the Illumina sequencer is 41 and the highest quality value obtained by Roche454 is 40. It is understood that the quality value is merely a representation of the accuracy of base sequencing, and the representation itself is not critical and can be directly expressed in terms of accuracy.

2.Detailed Description

Specifically, the first aspect of the present invention provides a method for evaluating the quality of nucleic acid sequencing data, comprising:

According to a preferred embodiment, the sequencing method includes, but is not limited to: dideoxynucleotide termination (Sanger), pyrosequencing (454), semiconductor sequencing (Ion torrent), reversible cycle termination (Illumina), fluorogenic sequencing (cena biotechnology), error correcting code sequencing (ECC), fuzzy sequencing (patent CN201611088606.0), coupled probe-anchored ligation (huada), oligonucleotide ligation sequencing by detection (SOLiD), single-molecule fluorescent sequencing (helios), single-molecule real-time sequencing (PacBio), Nanopore sequencing (Nanopore), etc.; upgrading technologies based on various sequencing methods are also included, for example, a Circular Consensus Sequencing (CCS) method improved by PacBio corporation on the basis of a single-molecule real-time sequencing method is adopted, that is, a consensus sequence can be calculated by repeatedly sequencing the same molecule, so that the accuracy is improved; intramolecular-ligated nanopore consensus sequencing (INC-Seq) based on nanopore sequencing method, and the like.

According to a preferred embodiment, the nucleic acid to be detected comprises DNA, or RNA or other types of nucleic acid molecules, such as X-nucleic acids (or XNAs), Peptide Nucleic Acids (PNA), locked ribonucleic acids (LNA), etc. The type of nucleic acid molecule to be detected does not change the core content of the present invention.

In some embodiments, the reference nucleic acid sample and the test nucleic acid sample are sequenced simultaneously, in alternative embodiments, the reference nucleic acid sample and the test nucleic acid sample may be sequenced at different times, e.g., the reference nucleic acid sample may be sequenced first, and then the test nucleic acid sample may be sequenced using the same sequencing method; the order of sequencing is not critical and must be satisfied if the nucleic acid sample is sequenced using the same sequencing method, e.g., error correction coding.

According to a preferred embodiment, the type of signal generated by the sequencing may be an optical signal or an electrical signal; the optical signal may be monochromatic or polychromatic. For example, for semiconductor sequencing, the signal generated by sequencing is an electrical signal; for pyrosequencing, reversible cycle termination, fluorogenic sequencing, and the like, the sequencing signal generated is a fluorescent signal, i.e., an optical signal.

In a specific implementation mode, a base sequence obtained by converting a correction signal is compared with a reference sequence to obtain a comparison result, and then the base is marked as correct sequencing or wrong sequencing according to the comparison result; preferably, the base sequences with high quality alignment are further screened from the alignment results, and the bases in the base sequences with high quality alignment are labeled as sequencing correct or sequencing error, and undetermined bases (i.e. bases which cannot be successfully aligned to the reference sequence or bases with lower alignment quality) are ignored. According to the alignment result, the base with the alignment result of matching is marked as 'sequencing correct', and the base with the alignment result of 'mismatch', 'insertion' or 'deletion' is marked as 'sequencing error'. In the high quality comparison described in the present invention, the range of quality values needs to be specifically selected according to the comparison software or algorithm used; for example, when the sequence alignment is performed using BWA, a base sequence aligned with high quality refers to a base sequence having an alignment quality of greater than 0, or equal to or greater than 10, or equal to or greater than 20, or equal to or greater than 30, or equal to or greater than 40, or equal to or greater than 50, or equal to or greater than 60.

In the present invention, when the reference nucleic acid is DNA, the reference sequence is a genomic sequence of the species to which the reference nucleic acid belongs; when the reference nucleic acid is RNA, the reference sequence is a transcriptome sequence of the species to which the reference nucleic acid belongs. It should be noted that the reference nucleic acid sample must be known from the genome sequence of the species to which it belongs, and the base mutation rate should be selected to be low, so as to determine whether the difference between the sequence to be sequenced and the reference sequence is due to sequencing error or normal base mutation. For example, the standard nucleic acid sample may be selected from nucleic acids of lambda phage, E.coli, s.cerevisiae, and the like.

According to a preferred embodiment, the sequencing signal may be a signal directly acquired by the sequencer or a signal normalized by the sequencer; correspondingly, the correction process of the sequencing signal can be phase loss correction, and the correction signal is a signal after phase loss correction; the signal correction for the sequencing signal s is the same as the correction for the sequencing signal s'.

According to a preferred embodiment, in single molecule real-time sequencing using circular consensus sequencing, or some nanopore sequencing (e.g., Circ-seq, INC-seq, R2C2, etc.) that uses DNA replication, repeated multiple sequencing to improve accuracy, the sequencing signal may be the original sequence; correspondingly, the process of correcting the sequencing signal can be a process of finding a consistent sequence; the signal correction for the sequencing signal s is the same as the correction for the sequencing signal s'.

According to a preferred embodiment, in error correction code sequencing, SOLiD and other sequencing technologies using an exact chemical read (exact call chemistry) technology, the sequencing signal may be a signal after phase loss correction; accordingly, the calibration process for the sequencing signal may be an error correction calibration process. The signal correction for the sequencing signal s is the same as the correction for the sequencing signal s'. For the ecc sequencing, see patent CN201510944878.5 for a specific error correction process, the error correction process itself is not the key point of the present invention, and it is the signal difference between the corrected signal after error correction and the phase-loss corrected signal before correction that is the key focus of the present invention. The signal differences can be characterized in two broad categories, the first category is general differences, the general differences can be differences of all or part of the sub-signals, and the second category is local differences, which are described in detail below. It should be understood that the signal difference is characterized by a specific application of mathematical formula, and does not affect the core content of the present invention.

According to a preferred embodiment, a plurality of different differences of the sequencing signal s and the correction signal c can be compared as a plurality of predictors for the joint assessment of the baseThe base mass; let s be(s)₁，s₂，...，s_n)，c＝(c₁，c₂，...，c_n) Then the plurality of different differences in the sequencing signal s and the calibration signal c comprises:

1)

2)

3)

4)max|s_i-c_i|

5)min|s_i-c_i|

6)

7)

it is also possible to calculate the signal difference by taking a part of the sub-signals of the sequencing signal s and the correction signal c, which include the following types:

1) differences in all sub-signals;

2) differences of the first several sub-signals;

3) differences of the latter several sub-signals;

4) differences of the middle several sub-signals;

5) differences in odd-numbered sub-signals;

6) differences in even-numbered sub-signals;

8) the difference of sub-signals in s or c that is less than a given threshold;

In a preferred embodiment, local differences in the sequencing signal s and the correction signal c may be compared; the local difference refers to the sequencing signal s for number i_iAnd correction signal c_iA number of sub-signals(s) preceding and following it_i-m，s_i-m+1，s_i-m+2，…，s_i+m-1，s_i+m) And (c)_i-m，c_i-m+1，c_i-m+2，…，c_i+m-1，c_i+m) The difference between them, namely, a local difference can be calculated for each sub-signal to obtain a group of local differences; m is an integer less than i.

In certain implementations, a plurality of different differences in the sequencing signal s and the correction signal c can be compared as a plurality of predictors to jointly assess base quality.

In certain implementations, base quality can be co-evaluated using other reported predictors, including but not limited to:

1) the position of the base in the sequence;

2) the length of the homomultimer in which the base is located;

3) the position of the base in the homomeric polymer;

5) the degree of dephasing of the sequencing signal of the base;

6) the attenuation degree of the sequencing signal of the base;

7) and (3) estimating parameters such as unit signal, background signal, attenuation coefficient, lead coefficient and lag coefficient of the sequencing signal where the base is positioned in the correction process.

In a preferred embodiment, the method of correlating signal differences to base accuracy may be by constructing a look-up table of signal differences to base accuracy. For example, when the signal difference is 0.05, the corresponding accuracy is 99.9%; when the signal difference is 0.1, the corresponding accuracy is 99%; by analogy, a comparison table is constructed. In the step f) of the basic step, the accuracy of the detected basic group can be predicted only by inquiring the accuracy corresponding to the signal difference in the comparison table. In some implementations, the signal differences in the look-up table may be expressed in intervals, for example, when the signal differences are between 0.05 and 0.08, the corresponding accuracy is 99.9%.

In the basic step, the method for establishing the relation between the signal difference and the accuracy of the base can be that one or more predictors are divided into a plurality of intervals, and the accuracy of the base in each interval and the quality value corresponding to the accuracy are counted; the method of evaluation may be to calculate the interval of which predictor each base in the nucleic acid being tested falls into, and then assign the value of mass corresponding to that interval to that base. Specifically, the quality evaluation table can be queried line by line, if the predictor values of the base to be evaluated are all smaller than the predictor threshold value recorded in a certain line in the table, the quality value recorded in the line is assigned to the base, otherwise, the next line is queried continuously.

In the basic step, the method of establishing the link between signal difference and base accuracy, and the corresponding assessment method, may be the Phred algorithm (Ewing B, Green P.base-calibrating of automated sequence processers using Phred. II. error reagents. genome Res.8: 186-194 (1998)). The method comprises two steps of training and evaluation, wherein the training step is briefly described as follows:

1. each base is labeled as correct or incorrect using some supervised method.

2. Using the sequencing signal, several predictor values were calculated for each base. Let a total of m predictors (p)₁，p₂，...，p_m) The value of each predictor monotonically increases as the error rate increases.

3. Each predictor is partitioned into several thresholds. Let each predictor divide n separately₁，n₂，...，n_mA threshold value, all predictors are divided into N₁×n₂×...×n_mAnd (4) each interval.

4. And calculating the error rate of the corresponding base of each interval. Let a threshold value in a certain interval be (t)₁，t₂，...，t_m) If the base corresponding to the interval is the predictor value, p is satisfied₁≤t₁，p₂≤t₂，...，p_m≤t_mThe base of (1).

5. And selecting an interval with the lowest error rate, and recording a predictor threshold value of the interval, the corresponding error rate and the quality value into a quality evaluation table.

6. And deleting the selected interval and the corresponding base in the previous step.

7. If all the intervals or all the bases are deleted, finishing the training and outputting a quality evaluation table, otherwise returning to the step 4.

The evaluation procedure is briefly described as follows:

1. the value of the predictor of the base to be evaluated is calculated.

2. And inquiring the quality evaluation table line by line, if the predictor values of the base to be evaluated are all smaller than the predictor threshold value recorded in a certain line in the table, assigning the quality value recorded in the line to the base, and if not, continuously inquiring the next line.

3. If the quality assessment table is checked and no quality value is assigned to the base to be assessed, a uniform quality value, such as 0, 10 or 20, may be assigned.

In the basic procedure, the method for establishing the relationship between the signal difference and the base accuracy can be machine learning. Specifically, one or more values of the predictor are used as input, whether the base is correct or not is used as output, and a classifier between the input and the output is obtained by training through a machine learning method. The evaluation method can be that a predictor value of the base in the DNA to be tested is calculated and input into a trained classifier, and the accuracy rate of the base and the corresponding quality value thereof are obtained according to the output of the classifier. Calculating the classification accuracy from the output of the classifier is a common technique in machine learning, and a common way is, for example, soft-max, that is, if the outputs representing "correct" and "incorrect" in the output of the classifier are a and b, respectively, the accuracy is:

in some preferred embodiments, for ECC sequencing, the length of the degenerate multimer in which the bases are located can be used as a predictor in conjunction with the difference between the sequencing signal s and the calibration signal c to assess base quality. For example, in sequence ACTTGAAATC, base 5G is in three degenerate multimers TTG, GAAA, G, the corresponding degenerate multimers being 3, 4, 1 in length, respectively.

In some preferred embodiments, for ECC sequencing, the base quality can be jointly evaluated by using the difference between the sequencing signal s and the calibration signal c in combination with the number of bases in the degenerate polymer where the bases are located as a predictor. For example, in sequence ACTTGAAATC, the 5 th base G is in three degenerate multimers TTG, GAAA, G, the greater of which is T, A, G, the number of which is 2, 3, 1, respectively.

The invention also provides a quality evaluation system of nucleic acid sequencing data, which is characterized by comprising the following components: a processor, a memory, and a program for quality assessment of nucleic acid sequencing data, the program comprising instructions for:

Each of the features discussed in the detailed description of the first aspect of the invention is equally applicable to the detailed implementation of the quality assessment system for nucleic acid sequencing data of the invention. As noted above, all other features are not repeated here and should be considered repeated by reference. Those of ordinary skill in the art will understand how features identified in these implementations may be readily combined with the basic set of features identified in other implementations.

After the base quality assessment is completed, the nucleic acid sequence to be tested may be further screened according to the base quality value, according to the general practice in the art. For example, if the average mass value of bases on a sequence is below a given threshold, the sequence may be discarded and not output; if the average quality value of the bases of the tail of a sequence is below a given threshold, the tail can be truncated and only the high quality portion of the head is output. There are many tools that can accomplish this, such as FASTP software, FASTQC software, etc. However, this screening method must be performed after obtaining the base quality value, which wastes a lot of waiting time, and it is most desirable for a user who needs sequencing to obtain a high quality sequencing result after screening in as short a time as possible. Therefore, there is a need to develop a method for screening high-quality sequencing data without scoring the base quality.

According to a preferred embodiment, the method of sequencing a nucleic acid sample to be tested includes, but is not limited to: dideoxynucleotide termination (Sanger), pyrosequencing (454), semiconductor sequencing (Ion torrent), reversible cycle termination (Illumina), fluorogenic sequencing (cena biotechnology), error correcting code sequencing (ECC), fuzzy sequencing (patent CN201611088606.0), coupled probe-anchored ligation (huada), oligonucleotide ligation sequencing by detection (SOLiD), single-molecule fluorescent sequencing (helios), single-molecule real-time sequencing (PacBio), Nanopore sequencing (Nanopore), etc. Upgrading technologies based on various sequencing methods are also included, for example, a Circular Consensus Sequencing (CCS) method improved by PacBio corporation on the basis of a single-molecule real-time sequencing method is adopted, that is, a consensus sequence can be calculated by repeatedly sequencing the same molecule, so that the accuracy is improved; intramolecular-ligated nanopore consensus sequencing (INC-Seq) based on nanopore sequencing method, and the like.

According to a preferred embodiment, the sequencing signal may be a signal directly acquired by the sequencer or a signal normalized by the sequencer; correspondingly, the calibration process for the sequencing signal may be phase loss calibration, and the calibration signal is a phase loss calibrated signal.

According to a preferred embodiment, in single molecule real-time sequencing using circular consensus sequencing, or some nanopore sequencing (e.g., Circ-seq, INC-seq, R2C2, etc.) that uses DNA replication, repeated multiple sequencing to improve accuracy, the sequencing signal may be the original sequence; accordingly, the process of correcting the sequencing signal may be a process of finding a consensus sequence.

According to a preferred embodiment, in error correction code sequencing, SOLiD and other sequencing technologies using an exact chemical read (exact call chemistry) technology, the sequencing signal may be a signal after phase loss correction; accordingly, the calibration process for the sequencing signal may be an error correction calibration process. For the ecc sequencing, see patent CN201510944878.5 for a specific error correction process, the error correction process itself is not the key point of the present invention, and it is the signal difference between the corrected signal after error correction and the phase-loss corrected signal before correction that is the key focus of the present invention. The signal differences can be characterized in two broad categories, the first category is general differences, the general differences can be differences of all or part of the sub-signals, and the second category is local differences, which are described in detail below. It should be understood that the signal difference is characterized by a specific application of mathematical formula, and does not affect the core content of the present invention.

According to a preferred embodiment, a plurality of different differences between the sequencing signal and the calibration signal can be compared, and the sequencing signal is(s)₁，s₂，...，s_n) Correction signal ═ c₁，c₂，...，c_n) Then the plurality of different differences in the sequencing signal and the calibration signal comprises:

1)

2)

3)

4)max|s_i-c_i|

5)min|s_i-c_i|

6)

7)

the difference is calculated by taking all the sub-signals of the sequencing signal and the correction signal, and how to take a part of the sub-signals to calculate the difference is easily given by following the formula.

According to a preferred embodiment, the comparison of the difference between the sequencing signal and the calibration signal can be a difference in a portion of the two signals, including but not limited to:

1) differences in all sub-signals;

2) differences of the first several sub-signals;

3) differences of the latter several sub-signals;

4) differences of the middle several sub-signals;

5) differences in odd-numbered sub-signals;

6) differences in even-numbered sub-signals;

7) a difference in a sub-signal greater than a given threshold in the sequencing signal or the correction signal;

8) a difference in a sub-signal of less than a given threshold in the sequencing signal or the correction signal;

According to a preferred embodiment, local differences in the sequencing signal and the correction signal can be compared; the local difference refers to the sequencing signal s for number i_iAnd correction signal c_iA number of sub-signals(s) preceding and following it_i-m，s_i-m+1，s_i-m+2，…，s_i+m-1，s_i+m) And (c)_i-m，c_i-m+1，c_i-m+2，…，c_i+m-1，c_i+m) The difference between them, namely, a local difference can be calculated for each sub-signal to obtain a group of local differences; m is an integer less than i; if a certain local difference in the group of local differences is larger than a given threshold value, cutting off a part with overlarge local difference in the sequence, and outputting a truncated sequence; the threshold is not a fixed value, and for different nucleic acid sequencing reactions, the difference distribution of the sequencing signal and the correction signal may be different, and a threshold needs to be determined for the difference distribution obtained in each sequencing reaction, and then the sequencing signal larger than the threshold is discarded. In some implementations, if the length of the truncated sequence is smaller than a predetermined value, the whole sequence may be discarded, for example, the value may be 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, and an appropriate value is selected according to a specific application scenario.

According to a preferred embodiment, a plurality of different differences between the sequencing signal and the calibration signal can be compared, and the decision to discard or retain the sequence to be screened can be made after a comprehensive determination. For example, the sum of the absolute difference values of each sequencing signal and each calibration signal and the sum of the squared difference values of each sequencing signal and each calibration signal can be calculated separately, a difference profile can be generated based on the two differences, the high quality signals with the smaller differences are retained, and the low quality signals with the larger differences are discarded.

Each of the features discussed in the detailed description of the methods for screening nucleic acid sequencing data of the present invention is equally applicable to the detailed description of the system for screening nucleic acid sequencing data. As noted above, all other features are not repeated here and should be considered repeated by reference. Those of ordinary skill in the art will understand how features identified in these implementations may be readily combined with the basic set of features identified in other implementations.

Example 1

Genomic DNA of lambda phage was purchased from New England Biolabs, and error correction code sequencing was performed after pooling. The error correcting code sequencing comprises 3 rounds (one round refers to the process of completely sequencing a sequence to be detected by using a sequencing substrate combination (such as MK)), such as MK (two groups of sequencing substrates are respectively M (A, C), K (G, T)), RY (two groups of sequencing substrates are respectively R (A, G), Y (C, T)), WS (two groups of sequencing substrates are respectively W (A, T), S (G, C)), and the like, and in the 3 rounds of sequencing, a signal (namely a sequencing signal S) obtained after an original fluorescent signal is subjected to signal normalization and phase loss correction is respectively a sequencing signal S

The signals obtained after error correction (i.e. correction signals c) are respectively

A certain base to be evaluated in quality is respectively located in Cycle i, Cycle j and Cycle k in sequencing of 3 rounds, and the difference between a lost phase correction signal and an error correction signal of the base is defined as follows:

the resulting DNA sequences were aligned to the reference genome using BWA-MEM software, ignoring unaligned sequences. According to the alignment result, all bases with the alignment result of matching are marked as correct, and the bases with the alignment result of mismatching, inserting or deleting are marked as error. Dividing the difference between the phase loss correction signal and the error correction signal into a plurality of intervals according to the value, counting the accuracy rate p of the base in each interval, and obtaining the base with the formula q-10 log₁₀(1-p) converting it to a mass value. The correspondence between the "difference between the phase loss correction signal and the error correction signal" and the quality value is shown in fig. 1, in which the highest quality value reaches 43 minutes, and the correspondence accuracy is 99.995%.

Subsequently, error correction code sequencing was performed on the genomic DNA of another bacteriophage lambda. For each base detected, the difference between the phase loss corrected signal and the error correction corrected signal was calculated in the same manner, and the quality value of each base was evaluated based on the correspondence shown in FIG. 1.

Example 2

Genomic DNA of lambda phage was purchased from New England Biolabs, and error correction code sequencing was performed after pooling. For each base, the following values for 4 predictors were calculated:

1. the position of the base in the sequence;

2. the length of the homomultimer in which the base is located;

3. the difference between the phase loss correction signal and the error correction signal calculated as in example 1;

4. the Euclidean distance between the phase loss correction signal of the base and the rounding signal;

the resulting DNA sequences were aligned to the reference genome using BWA-MEM software, ignoring unaligned sequences. According to the alignment result, all bases with the alignment result of matching are marked as correct, and the bases with the alignment result of mismatching, inserting or deleting are marked as error. And (3) constructing a quality evaluation table by using the 4 predictors through a Phred algorithm. The table has 447 rows, the first 20 of which are as follows:

predictor 1	Predictor 2	Predictor 3	Predictor 4	Base quality value
					138	4.5	0.6022	0.1764	44
60	1.5	0.7943	0.1305	45
					158	1.5	0.7943	0.1513	44
180	5.5	0.6022	0.1305	44
					158	4.5	0.5207	0.1596	44
180	1.5	0.7943	0.1305	44
					40	5.5	0.6022	0.1678	44
180	4.5	0.6909	0.2101	43
					118	1.5	0.9238	0.1678	43
79	3.5	0.5207	0.2386	43
					180	1.5	0.7943	0.1513	43
158	4.5	0.7943	0.1305	43
					138	1.5	0.9238	0.1513	43
118	5.5	0.4406	0.1764	43
					138	3.5	0.7943	0.1966	42
158	1.5	0.7943	0.1596	43
					118	3.5	0.6022	0.2386	42
180	1.5	0.7943	0.1764	42
					158	3.5	0.7943	0.1678	42
158	1.5	0.9238	0.1513	42

This quality evaluation table was used to evaluate error correction code sequencing data of another piece of genomic DNA of bacteriophage lambda. FIG. 2 shows the proportion of bases with quality values above a certain threshold in the evaluation results, wherein bases with quality values above 40 account for 55% and quality values up to 45. Fig. 3 shows the accuracy of scoring in the evaluation result, in which the abscissa represents the mass value predicted from the mass evaluation table, the ordinate represents the mass value actually measured after the comparison, the disc size represents the number of bases, the dotted line is a straight line expressed by the equation y ═ x, the two parallel lines are straight lines expressed by the equation y ═ x ± 3, and the gray parallelogram enclosed by them is the fluctuation range allowed for the mass evaluation. The result shows that the obtained quality evaluation table can accurately evaluate the sequencing quality of the base.

The substitution error (mutation) pattern of bases with quality values above 40 in the statistical sequencing data is shown in table 2:

TABLE 2

Error correcting code sequencing down to 10^-5Or even 10^-6The error rate of the order of magnitude makes the PCR detection reagent have good performance in detecting single-nucleotide polymorphism (SNP) and other gene variations.

Example 3

Genomic DNA of lambda phage was purchased from New England Biolabs, and error correction code sequencing was performed after pooling. For each base, the following 5 predictor values were calculated:

1. the position of the base in the sequence;

2. the difference between the phase loss correction signal and the error correction signal calculated as in example 1;

the number of more one base in the degenerate polymer in which the base is located in the MK round;

the number of more one base in the degenerate polymer in which this base is located in the RY round;

the number of more one base in the degenerate multimer in which this base is located in the WS round.

The resulting DNA sequences were aligned to the reference genome using BWA-MEM software, ignoring unaligned sequences. According to the alignment result, all bases with the alignment result of matching are marked as correct, and the bases with the alignment result of mismatching, inserting or deleting are marked as error. And (3) constructing a quality evaluation table by using the 5 predictors through a Phred algorithm. The table has 301 rows, the first 20 of which are as follows:

the quality evaluation table performed similarly to the quality evaluation table in example 2.

Example 4

Genome DNA samples of lambda phage were purchased from New England Biolabs, a sequencing library was constructed and error correction code sequencing was performed, wherein the MK, RY and WS three rounds each reacted for 50 cycles, and a total of 1797082 bright spots with complete sequencing signals were collected. Wherein, the signals after MK, RY and WS round phase loss correction in the sequencing signal of each bright spot are respectively

The ECC-corrected signals are respectively

Two differences are defined:

wherein the above-mentioned differences were calculated for the sequencing signal of each bright spot, and the distribution of the differences is shown in FIG. 4. Obviously, the signal can be divided into two clusters with obvious intervals according to the two differences, wherein the two differences of the cluster at the lower left are smaller, and the high-quality signal is reserved; the two differences in the upper right cluster are large and are low quality signals, and are deleted. Therefore, the screening of high-quality sequencing signals is completed.

The above embodiments are only preferred embodiments of the present invention, and those skilled in the art can make variations and modifications to the above embodiments, therefore, the present invention is not limited to the above embodiments, and any obvious improvements, substitutions or modifications made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims

1. A method for assessing the quality of nucleic acid sequencing data, comprising:

2. A method of screening nucleic acid sequencing data, comprising:

3. The method of claim 1, wherein the genomic sequence of the species to which the reference nucleic acid sample belongs is known; when the reference nucleic acid is DNA, the reference sequence is a genomic sequence of the species to which the reference nucleic acid belongs; where the reference nucleic acid is RNA, the reference sequence is a transcriptome sequence of the species to which the reference nucleic acid belongs.

4. The method of claim 1, wherein the converted base sequences are aligned to a reference sequence to obtain an alignment result, and then the high-quality aligned base sequences are selected, and the bases in the high-quality aligned base sequences are labeled as sequencing-correct or sequencing-incorrect.

5. The method of claim 1 or 2, wherein the sequencing signal is a signal directly acquired by a sequencer or a normalized signal; correspondingly, the calibration process for the sequencing signal may be phase loss calibration, and the calibration signal is a phase loss calibrated signal.

6. The method according to claim 1 or 2, wherein in error correction code sequencing or SOLiD-on-demand sequencing technologies employing precise chemical readout, the sequencing signal can be a phase-loss corrected signal; correspondingly, the correction process for the sequencing signal may be an error correction process, and the correction signal is an error correction corrected signal.

7. The method of claim 1 or 2, wherein in single molecule real-time sequencing using circular consistent sequencing mode, or nanopore sequencing technology using DNA replication, repeated multiple sequencing to improve accuracy, the sequencing signal may be the original sequence; accordingly, the process of correcting the sequencing signal may be a process of finding a consensus sequence.

8. The method of claim 1, wherein the step of correlating the signal difference with the base accuracy comprises constructing a look-up table of signal difference and base accuracy.

9. The method of claim 1, wherein the relationship between the signal difference and the accuracy of the base is established by dividing one or more predictors into a plurality of intervals, and counting the accuracy of the base and the quality value corresponding to the accuracy in each interval; the evaluation method is to calculate the section of which predictor each base in the detected nucleic acid falls into, and then assign the quality value corresponding to the section to the base.

10. The method of claim 1, wherein the method of correlating signal differences to base accuracy and the corresponding method of evaluation is the Phred algorithm.

11. The method of claim 1, wherein the method of correlating signal differences to base accuracy is machine learning.

12. The method according to any of claims 8-11, wherein the base quality is jointly assessed using other predictors on the basis of the difference using the sequencing signal s and the correction signal c, including but not limited to:

1) the position of the base in the sequence;

2) the length of the homomultimer in which the base is located;

3) the position of the base in the homomeric polymer;

5) the degree of dephasing of the sequencing signal of the base;

6) the attenuation degree of the sequencing signal of the base;

13. The method according to claim 1, wherein in ECC sequencing, the base quality is jointly evaluated by using the difference between the sequencing signal s and the calibration signal c, in combination with the length of the degenerate polymer where the bases are located as a predictor.

14. The method according to claim 1, wherein in error correction code sequencing, the base quality is jointly evaluated by using the difference between the sequencing signal s and the correction signal c in combination with the number of the base in the degenerate polymer where the base is located as a predictor.

15. A system for quality assessment of nucleic acid sequencing data, comprising:

16. The method of claim 2, wherein local differences in the sequencing signal and the calibration signal can be compared; the local difference is for the sequencing signal s numbered i_iAnd correction signal c_iA number of sub-signals(s) preceding and following it_i-m,s_i-m+1,s_i-m+2,…,s_i+m-1,s_i+m) And (c)_i-m,c_i-m+1,c_i-m+2,…,c_i+m-1,c_i+m) The difference between them, namely, a local difference can be calculated for each sub-signal to obtain a group of local differences; m is an integer less than i; if a certain local difference in the group of local differences is larger than a given threshold value, cutting off a part with overlarge local difference in the sequence, and outputting a truncated sequence; if the length of the truncated sequence is smaller than a preset value, the whole sequence can be discarded.

17. A system for screening nucleic acid sequencing data, comprising: