CN116246703A - Quality assessment method for nucleic acid sequencing data - Google Patents

Info

Publication number
CN116246703A
Authority
CN
China
Prior art keywords
sequencing
nucleic acid
polymer
sequence
signal
Legal status
Pending
Application number
CN202310295466.8A
Other languages
Chinese (zh)
Inventor
周文雄
李雷
李昂
Current Assignee
Saina Biotechnology Guangzhou Co ltd
Original Assignee
Saina Biotechnology Guangzhou Co ltd
Application filed by Saina Biotechnology Guangzhou Co ltd
Priority to CN202310295466.8A
Publication of CN116246703A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a quality assessment method for nucleic acid sequencing data. The method evaluates quality using the polymer in a nucleic acid sequence as the basic unit, rather than the single base used by existing methods, and is therefore better suited to sequences obtained by sequencing methods with an open 3' end.

Description

Quality assessment method for nucleic acid sequencing data
Technical Field
The invention relates to a quality assessment method for nucleic acid sequencing data and belongs to the field of gene sequencing.
Background
Gene sequencing technology determines the sequence of genetic material and is widely used in clinical tumor typing, microorganism identification, genetic disease diagnosis, and other fields. In addition to producing the sequence of the nucleic acid sample under test, current mainstream nucleic acid sequencing technologies assign a quality value to each called base to indicate the accuracy of the call. This quality value is generally expressed as a Phred value:
$$q = -10\log_{10}(1 - a)$$
where a is the accuracy of the base call and q is the Phred value. For example, accuracies of 99%, 99.9% and 99.99% correspond to Phred values of 20, 30 and 40, respectively.
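As an illustration of the Phred formula above, the following minimal Python sketch converts between accuracy and Phred value. The function names are chosen here for illustration and do not come from the patent.

```python
import math


def accuracy_to_phred(a: float) -> float:
    """Return q = -10 * log10(1 - a) for an accuracy a with 0 <= a < 1."""
    if not 0.0 <= a < 1.0:
        raise ValueError("accuracy must satisfy 0 <= a < 1")
    return -10.0 * math.log10(1.0 - a)


def phred_to_accuracy(q: float) -> float:
    """Invert the Phred formula: a = 1 - 10 ** (-q / 10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


# The three accuracies quoted in the text map to roughly 20, 30 and 40.
for a in (0.99, 0.999, 0.9999):
    print(a, round(accuracy_to_phred(a), 6))
```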
In bioinformatic analysis of nucleic acid sequencing data, quality values play a very important role. For example, when identifying gene mutations, if a base in a read differs from the corresponding base in the reference sequence, the difference is called a mutation when the quality value of the base is high; when the quality value is low, the difference is attributed to a sequencing error and no mutation is called.
For sequencing technologies such as 454, Ion Torrent and fluorogenic sequencing, the reaction proceeds polymer by polymer, and each polymer is a single unit of the sequencing chemistry. However, the existing practice of assigning a quality value to each base is a legacy of the Sanger sequencing era and matches only the sequencing chemistry of Illumina. Although widely used, it is poorly suited to sequencing technologies with an open 3' end and has several defects. For example, many sequencing technologies are prone to insertion or deletion errors when reading longer homopolymers, such as reading AAAA as AAAAA or AAA. Such errors belong to the homopolymer as a whole, so it is difficult to assign an accurate quality value to each individual base in it. In some cases, bases that are not prone to substitution errors, and that should therefore be retained when identifying single-base substitutions, receive low quality values because they are prone to insertion or deletion errors; they are then discarded during mutation identification, causing false negatives. There is therefore a need for a quality assessment method better suited to sequences from 3' open-end sequencing.
Disclosure of Invention
The invention discloses a quality assessment method for nucleic acid sequencing data. The method evaluates quality using the polymer in a nucleic acid sequence as the basic unit, rather than the single base used by existing methods, and is therefore better suited to sequences obtained by sequencing methods with an open 3' end.
Specifically, the invention provides a quality assessment method for nucleic acid sequencing data, characterized by comprising the following steps:
providing a nucleic acid sequence to be assessed and, taking the polymer as the basic unit, calculating the sequencing signal features of each polymer;
predicting a quality score of the polymer from the sequencing signal features using a trained and calibrated quantization scheme;
wherein the trained and calibrated quantization scheme comprises:
for a provided standard nucleic acid sequence, taking the polymer as the basic unit, calculating the sequencing signal features of each polymer, and labeling each polymer as correctly or incorrectly sequenced according to the result of comparing the standard nucleic acid sequence with its reference; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their labels.
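The training and prediction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: logistic regression is just one of the classifiers the text permits, the two features (called polymer length and the distance of the normalized signal from the nearest integer) are picked for illustration, and the training records, function names, learning rate and Q60 cap are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def label_polymer(called_length: int, reference_length: int) -> int:
    """Label a polymer 1 (sequenced correctly) or 0 (sequenced incorrectly)."""
    return 1 if called_length == reference_length else 0


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent (assumed classifier)."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def quality_score(w, b, x, q_cap=60.0):
    """Convert the predicted error probability into a Phred-style score."""
    p_correct = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p_error = max(1.0 - p_correct, 10.0 ** (-q_cap / 10.0))  # cap at q_cap
    return -10.0 * math.log10(p_error)


# Hypothetical records from a standard sample:
# (called length, reference length, offset of signal from nearest integer)
records = [
    (1, 1, 0.02), (1, 1, 0.05), (2, 2, 0.10), (2, 2, 0.08), (3, 3, 0.12),
    (1, 1, 0.30), (3, 4, 0.40), (4, 5, 0.45), (5, 4, 0.48), (4, 3, 0.35),
]
X = [(float(called), offset) for called, _, offset in records]
y = [label_polymer(called, ref) for called, ref, _ in records]
w, b = train_logistic(X, y)

# Score two polymers from a sequence under assessment.
print(quality_score(w, b, (1.0, 0.03)), quality_score(w, b, (5.0, 0.47)))
```

With real data the feature vector would presumably include more of the signal features listed in the description, and the classifier output would need calibration against held-out polymers before being reported as a quality score.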
According to a preferred embodiment, the polymer comprises a homopolymer, a binary copolymer, a terpolymer, or the like.
According to a preferred embodiment, the sequencing signal features of a polymer are features of the signal generated when the polymer undergoes the sequencing chemical reaction during sequencing, including, but not limited to, the base types composing the polymer, the length of the polymer, the cycle number of the sequencing chemical reaction, the signal intensity, the degree to which the signal intensity (and the intensities of neighboring signals) is close to an integer, the parameters of the sequencing signal (unit signal, background signal, lead coefficient, lag coefficient, decay coefficient), the degree of dephasing when the polymer is detected, and the like.
According to a preferred embodiment, fitting the relationship between the sequencing signal features of the polymers and their labels comprises converting the fitting result of the classifier into a quality score.
According to a preferred embodiment, a standard nucleic acid sample is a nucleic acid sample whose source and sequence are both known and which is highly homozygous at almost all loci of the genome, such as lambda phage DNA, E. coli DNA, Saccharomyces cerevisiae DNA, etc.
According to a preferred embodiment, the quality score is a value characterizing the sequencing accuracy of the polymer, selected from the group consisting of accuracy, error rate, Phred value, and the like.
According to a preferred embodiment, the quality score is a logarithmic function of the probability that the polymer is detected incorrectly, and the quality score comprises Q10, Q15, Q20, Q25, Q30, Q35, Q40, Q45, Q50, Q55, Q60, etc.
According to a preferred embodiment, the classifier includes, but is not limited to, linear regression, polynomial regression, logistic regression, support vector machines, artificial neural networks, random forests, Phred algorithms, ensemble learning, and the like.
According to a preferred embodiment, training the classifier includes dividing the polymers into several classes based on their sequencing signal features and computing the sequencing accuracy of each class of polymers.
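The class-based training described in the preceding embodiment can be sketched as a lookup table: polymers are binned by their signal features, and the empirical error rate of each bin is converted to a Phred-style score. The binning rule (length capped at 5, offset in steps of 0.1) and the pseudocount smoothing are assumptions made here for illustration; the patent does not specify them.

```python
import math
from collections import defaultdict


def bin_key(length: int, offset: float, offset_step: float = 0.1):
    """Assign a polymer to a class from its length and its signal offset."""
    return (min(length, 5), int(offset / offset_step))


def build_quality_table(records, pseudocount: float = 1.0):
    """Map each class to a Phred-style score from its observed error rate.

    records: iterable of (length, offset, is_correct) for labeled polymers.
    The pseudocount (an assumption) keeps bins with zero observed errors
    from receiving an infinite score.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_total, n_error]
    for length, offset, is_correct in records:
        entry = counts[bin_key(length, offset)]
        entry[0] += 1
        entry[1] += 0 if is_correct else 1
    table = {}
    for key, (n_total, n_error) in counts.items():
        p_error = (n_error + pseudocount) / (n_total + 2.0 * pseudocount)
        table[key] = -10.0 * math.log10(p_error)
    return table


# Hypothetical labeled polymers from a standard sample.
training = [(1, 0.05, True)] * 98 + [(4, 0.45, True)] * 8 + [(4, 0.45, False)] * 2
quality_table = build_quality_table(training)
print(quality_table)
```

A polymer from a sequence under assessment is then scored by computing its `bin_key` and reading the table; bins never seen in training would need a fallback value.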
According to a preferred embodiment, the classifier is trained by maximum likelihood on a probability distribution model; the probability distribution is a distribution with a unimodal shape, including, but not limited to, the two-point distribution, binomial distribution, negative binomial distribution, Poisson distribution, geometric distribution, exponential distribution, normal distribution, gamma distribution, chi-square distribution, t distribution, F distribution, beta distribution, lognormal distribution, and high-dimensional extensions of the above.
According to a preferred embodiment, the method further comprises performing a bioinformatic analysis of the nucleic acid sequence under test based on the quality scores of the polymers.
According to a preferred embodiment, the bioinformatic analysis comprises screening for high-quality nucleic acid sequences based on the assigned quality values. Screening methods include, but are not limited to, selecting nucleic acid sequences whose quality values are all above or below a certain threshold, selecting nucleic acid sequences whose mean quality value is above or below a certain threshold, selecting regions of nucleic acid sequences whose quality values are above or below a certain threshold, selecting regions of nucleic acid sequences whose mean quality value is above or below a certain threshold, and the like.
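The screening criteria listed above can be sketched as three small filters over a list of per-polymer quality values. The function names and the example values are hypothetical.

```python
def passes_all(quals, threshold):
    """True if every quality value in the sequence reaches the threshold."""
    return all(q >= threshold for q in quals)


def passes_mean(quals, threshold):
    """True if the mean quality value of the sequence reaches the threshold."""
    return sum(quals) / len(quals) >= threshold


def high_quality_regions(quals, threshold):
    """Return half-open (start, end) index ranges whose values all reach the threshold."""
    regions, start = [], None
    for i, q in enumerate(quals):
        if q >= threshold and start is None:
            start = i                      # a high-quality run begins
        elif q < threshold and start is not None:
            regions.append((start, i))     # the run ended just before i
            start = None
    if start is not None:
        regions.append((start, len(quals)))
    return regions


# Hypothetical per-polymer quality values for one read.
read_quals = [30, 32, 12, 35, 36, 38]
print(passes_all(read_quals, 20))
print(passes_mean(read_quals, 20))
print(high_quality_regions(read_quals, 20))
```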
According to a preferred embodiment, the bioinformatic analysis comprises aligning the nucleic acid sequence to a reference sequence according to the assigned quality values.
According to a preferred embodiment, the bioinformatic analysis comprises identifying genetic variations based on the alignment and the quality values assigned to the aligned sequences.
According to a preferred embodiment, certain features of the alignment may be used to remove potential false positive or false negative results when identifying genetic variations.
According to a preferred embodiment, the bioinformatic analysis comprises assembling the nucleic acid sequences into longer nucleic acid sequences according to the assigned quality values.
According to a preferred embodiment, the bioinformatic analysis comprises performing at least two orthogonal rounds of degenerate sequencing, obtaining quality values for the degenerate polymer lengths, and performing error correction using the quality values.
The invention also provides a quality assessment method for nucleic acid sequencing data, characterized by comprising the following steps: performing fuzzy sequencing or deletion sequencing on a nucleic acid sample under test to obtain input data, generating degenerate polymer length information from the input data, and calculating sequencing signal features of the degenerate polymer lengths;
predicting a quality score of the polymer from the sequencing signal features using a trained and calibrated quantization scheme;
wherein the trained and calibrated quantization scheme comprises:
sequencing a standard nucleic acid sample to obtain a nucleic acid sequence, calculating sequencing signal features of the nucleic acid sequence, aligning the nucleic acid sequence to a reference sequence, and labeling each polymer of the nucleic acid sequence as correctly or incorrectly sequenced; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their labels.
Advantageous Effects of the Invention
Compared with the prior art, the method has the following advantages:
the method takes the polymer as the basic unit of sequencing quality assessment and assigns a quality value to each polymer composing the nucleic acid sequence rather than to each base. It is particularly suitable for sequencing reactions with an open 3' end and helps obtain more accurate bioinformatic analysis results, including variant identification.
Drawings
The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description and drawings, in which illustrative embodiments of the principles of the invention are utilized, of which:
FIG. 1 illustrates an example of sequencing signal features.
Detailed Description
To further illustrate the core of the present invention, the present invention will now be described by way of the following examples. The examples are provided to further illustrate the summary of the invention and are not intended to limit the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. To better disclose the method and content of the present invention, the key terms are explained in detail below.
Interpretation of terms
Each
The term "each" is intended to identify a single item in a collection, but does not necessarily refer to every item in the collection. Exceptions may occur if the disclosure explicitly states otherwise or the context clearly dictates otherwise.
Comprising
The term "comprising" is intended herein to be open-ended, including not only the recited elements, but also any additional elements.
Sample
The term "sample" in the present invention refers to a specimen comprising a nucleic acid or a mixture of nucleic acids, typically derived from a biological fluid, cell, tissue, organ or organism, and containing at least one nucleic acid sequence to be sequenced and/or phased. Such samples include, but are not limited to, blood fractions, sputum/oral fluid, amniotic fluid, biopsy samples (e.g., surgical biopsies, fine needle biopsies, etc.), urine, peritoneal fluid, pleural fluid, tissue explants, organ cultures, and any other tissue or cell preparation, or fractions or derivatives thereof or isolated therefrom. Although the sample is typically taken from a human subject (e.g., a patient), the sample may be taken from any organism having a chromosome, including, but not limited to, bacteria, viruses, fungi, birds, mammals, and the like. The sample may be used as obtained from the biological source or after pretreatment to alter its properties. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids, and the like. Pretreatment methods may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, lysis, and the like.
Standard nucleic acid sample
A standard nucleic acid sample is a nucleic acid sample whose source and sequence are both known and which is highly homozygous at almost all loci in the genome. The advantage of sequencing a standard nucleic acid sample is that differences between the sequencing result and the reference genome can be attributed more reliably to either true variation or sequencing error. For example, the standard nucleic acid sample may be nucleic acid from Escherichia coli or Saccharomyces cerevisiae, or lambda phage DNA produced by New England Biolabs.
Error-correcting code (ECC) sequencing and error correction
In this application, error-correcting code sequencing has the following features:
the sequencing method requires multiple rounds of sequencing; the information obtained in each round is incomplete, while the total information obtained across the rounds is redundant; this redundancy is used to detect and correct potential sequencing errors and obtain a high-accuracy sequence. Taking 2+2 sequencing as an example, the sequencing reagents are divided into three groups of paired two-base combinations (for example, MK, RY and WS), and three independent sequencing rounds are performed on the DNA sequence under test, producing three degenerate sequence codes. The three codes check one another, the true base sequence can be deduced by decoding, and errors at sites in a single round can be corrected. This correction process is error correction.
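The redundancy among the three degenerate codes can be illustrated with a minimal Python sketch. It is a simplification made here for illustration: it works base by base on letter codes rather than on the signal-derived polymer lengths a real 2+2 run produces, and it only flags inconsistent positions as N instead of correcting them. All names are hypothetical.

```python
# Which bases each degenerate letter stands for (IUPAC two-base codes).
IUPAC = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
}
# Base-to-letter maps for the three 2+2 modes.
MODES = {
    "MK": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def encode(seq: str, mode: str) -> str:
    """Degenerate code an error-free round in the given mode would yield."""
    return "".join(MODES[mode][base] for base in seq)


def decode(mk: str, ry: str, ws: str) -> str:
    """Intersect the three codes position by position.

    Consistent codes leave exactly one base; an empty intersection means
    at least one round is wrong at that position, reported here as 'N'.
    """
    if not len(mk) == len(ry) == len(ws):
        raise ValueError("the three codes must have equal length")
    bases = []
    for m, r, w in zip(mk, ry, ws):
        common = IUPAC[m] & IUPAC[r] & IUPAC[w]
        bases.append(next(iter(common)) if len(common) == 1 else "N")
    return "".join(bases)


seq = "ACGTTGCA"
codes = {mode: encode(seq, mode) for mode in MODES}
print(codes)
print(decode(codes["MK"], codes["RY"], codes["WS"]))
```

Any two of the three codes already determine each base, so the third acts as a check. An actual correction step would additionally need to decide which round is in error, for which per-polymer quality values are a natural input.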
Sequence
In the present invention, a "sequence" is a nucleic acid sequence, comprising or representing a chain of nucleotides coupled to each other. It may be a definite nucleotide sequence or a degenerate base sequence, and may be based on DNA or RNA. It should be understood that a sequence may include multiple subsequences. For example, a single sequence (e.g., the sequence of a PCR amplicon) may have 350 nucleotides, and a sample read may include multiple subsequences within those 350 nucleotides. In the present invention, a sequence is divided into polymers rather than single bases as its basic units, and the length of a polymer may be 1 bp or longer.
Degenerate bases
Degenerate bases are written using the IUPAC nucleic acid notation with the letters in Table 1 below; for example, the letter M stands for A and/or C. The degenerate polymer MMKKK has a length of 5, i.e., a DPL of 5.
TABLE 1
Letter    Bases represented
M         A/C
K         G/T
R         A/G
Y         C/T
W         A/T
S         C/G
B         C/G/T
D         A/G/T
H         A/C/T
V         A/C/G
Reference sequence
A reference sequence is any particular known genomic sequence, whether partial or complete, of any organism that can be used as a reference for sequences identified from a subject. For example, reference genomes for human subjects and many other organisms are available from the National Center for Biotechnology Information (NCBI). "Genome" refers to the complete genetic information of an organism or virus, expressed as nucleic acid sequences. The genome includes both genes and non-coding DNA sequences. The reference sequence may be larger than the reads aligned to it. For example, the reference sequence may be at least about 100 times, at least about 1000 times, at least about 10,000 times, at least about 10^5 times, at least about 10^6 times, or at least about 10^7 times larger than the aligned reads. In one example, the reference genomic sequence is the sequence of a full-length human genome. In another example, the reference genomic sequence is limited to a particular human chromosome, such as chromosome 13. In some implementations, the reference chromosome is a chromosomal sequence from human genome version hg19. Such sequences may be referred to as chromosomal reference sequences, but the term reference genome is intended to encompass them. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), and the like of any species. In various implementations, the reference genome is a consensus sequence or other combination derived from multiple individuals. However, in some applications, the reference sequence may be taken from a particular individual. In other embodiments, "genome" also encompasses so-called "graph genomes", which use a specific storage format and representation of the genomic sequence. In one implementation, a graph genome stores its data in a linear file.
Alignment
Alignment is a common concept in bioinformatics, where it is often used to compare the similarity between different nucleic acids or between different proteins. In the present invention, alignment refers to the process of comparing a nucleic acid sequence with a reference sequence to determine whether the reference sequence contains the nucleic acid sequence. Common sequence alignment algorithms and software include, but are not limited to, the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, Bowtie, Bowtie2, BWA, SOAP, BLAST, ELAND, TMAP, MAQ, minimap2, SHRiMP, and the like.
Quality score
The quality score, also called the quality value, is a value that characterizes sequencing accuracy. It may be expressed in different mathematical forms, such as accuracy, error rate, or Phred value. For example, accuracies of 99%, 99.9% and 99.99% correspond to error rates of 1%, 0.1% and 0.01% and to Phred values of 20, 30 and 40, respectively. In some implementations, for ease of recording and storage, 33 is added to the Phred value and the result is converted to an ASCII character; for example, Phred values of 20, 30 and 40 are converted to the characters '5', '?' and 'I', respectively. The form in which the quality value is expressed does not affect the essence of the present invention.
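The ASCII storage convention described above (Phred value plus 33) can be sketched in a few lines of Python. The function names are illustrative, and the range check against printable ASCII is an assumption added here.

```python
def phred_to_char(q: int, offset: int = 33) -> str:
    """Encode an integer Phred value as one printable ASCII character."""
    code = q + offset
    if not 33 <= code <= 126:
        raise ValueError("encoded value falls outside printable ASCII")
    return chr(code)


def char_to_phred(c: str, offset: int = 33) -> int:
    """Decode one ASCII character back to its integer Phred value."""
    return ord(c) - offset


def encode_quality_string(quals) -> str:
    """Encode a list of Phred values, one character per polymer or base."""
    return "".join(phred_to_char(q) for q in quals)


print(encode_quality_string([20, 30, 40]))  # prints 5?I
```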
Fuzzy sequencing
In the present invention, fuzzy sequencing refers to single-round 2+2 sequencing, single-round 1+3 sequencing, or single-round 3×4 sequencing. Specifically, 2+2 sequencing uses two different sequencing reagents, a first sequencing reagent and a second sequencing reagent, which are added alternately in cycles. The first sequencing reagent comprises two different nucleotide monomers bearing a detectable label; the second sequencing reagent comprises two different nucleotide monomers bearing a detectable label that differ from the nucleotide monomers in the first sequencing reagent. The second sequencing reagent is provided after the first, and each time nucleotide monomers are incorporated into the nucleic acid under test, the signal generated by the detectable label (e.g., a fluorescent signal) is detected. In 2+2 sequencing, a combination of fluorescently labeled nucleotides can be used to obtain fluorescent signal values associated with the target DNA sequence. Possible combinations are as follows:
M/K mode: odd-numbered cycles supply dA4P and dC4P, and even-numbered cycles supply dG4P and dT4P, or vice versa;
R/Y mode: odd-numbered cycles supply dA4P and dG4P, and even-numbered cycles supply dC4P and dT4P, or vice versa;
W/S mode: odd-numbered cycles supply dA4P and dT4P, and even-numbered cycles supply dC4P and dG4P, or vice versa.
Sequencing the nucleic acid under test for multiple cycles using one of these modes (e.g., M/K mode) is referred to as one round (a single round).
1+3 sequencing is similar to 2+2 sequencing. It uses two different sequencing reagents, a first sequencing reagent and a second sequencing reagent, which are added alternately in cycles. The first sequencing reagent comprises three different nucleotide monomers bearing a detectable label; the second sequencing reagent comprises one nucleotide monomer bearing a detectable label that differs from the nucleotide monomers in the first sequencing reagent. The second sequencing reagent is provided after the first, and each time a nucleotide monomer is incorporated into the nucleic acid under test, the signal generated by the detectable label (e.g., a fluorescent signal) is detected.
3×4 sequencing uses four different sequencing reagents, a first, a second, a third and a fourth sequencing reagent, which are added in turn in cycles. Each sequencing reagent comprises three different nucleotide monomers bearing a detectable label. The second sequencing reagent is provided after the first, the third after the second, and the fourth after the third; after each reagent is provided and nucleotide monomers are incorporated into the nucleic acid under test, the signal generated by the detectable label is detected. For example, the four sequencing reagents are B, D, H and V, respectively.
Ambiguous sequence information is obtained from the signal. Ambiguous sequence information is sequence information from which a definite base sequence cannot be determined, where definite base sequence information is nucleic acid sequence information encoded with A, G, T, C or with A, G, U, C; a base may be a methylated base. Ambiguous base sequences are a common concept in the scientific field, such as the use of the letter W for the bases A and/or T. Relevant definitions can also be found on Wikipedia (https://en.wikipedia.org/wiki/Nucleotide).
Deletion sequencing and deletion sequences
A plurality of sequencing chemical reaction cycles are performed on the nucleic acid molecules under test on a gene sequencing chip. At least one cycle is preselected, and signal acquisition is performed only in the preselected cycles; in the other cycles, only the sequencing chemical reaction is carried out, without signal acquisition. The acquired signals are encoded into a sequence, yielding a nucleic acid sequence with missing information. This process is deletion sequencing.
A deletion sequence is a sequence with missing information, that is, part of the sequence information in the resulting sequence is absent. For example, ATTCGNNTTT is a deletion sequence, in which N represents unknown (missing) sequence information.
Multi-base sequencing
Multi-base sequencing is a sequencing reaction in which the 3' end of the nucleotide substrate is a free hydroxyl group, so that extension can proceed freely and, in principle, one sequencing reaction can extend by one or more nucleotides. Common multi-base sequencing methods include semiconductor sequencing by Ion Torrent, pyrosequencing by 454, fuzzy sequencing by Saina Biotechnology, and the like.
Unit signal
The unit signal is the increase in the signal detected by the sequencer for each base by which the DNA is extended. It depends on the number of DNA molecules undergoing the extension reaction, the camera exposure time, the excitation light intensity, the camera sensitivity, and so on. In other words, the unit signal is the signal intensity produced by extending one base at a sequencing site; it is a physical quantity proportional to the number of template DNA molecules at that site, and the intensity of the sequencing signal equals the unit signal multiplied by the number of bases extended in the reaction. For any sequencing technique, the accuracy with which the unit signal is measured directly affects the accuracy of the sequencing result.
Background signal
The background signal is the baseline signal detected by the sequencer when no base extension occurs. It depends on factors such as the chip material and the spontaneous hydrolysis of sequencing reaction substrates, and it may also change as the sequencing read length increases. Background signals are generally defined.
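To make the roles of the unit signal and the background signal concrete, the sketch below calls a polymer length from a raw intensity. It assumes a simple additive model, raw = background + unit × n, and ignores lead, lag and decay; this model and the function names are assumptions for illustration. The returned offset is one way to express "the degree to which the signal intensity is close to an integer".

```python
import math


def normalized_signal(raw: float, unit: float, background: float) -> float:
    """Express a raw intensity in units of bases extended (assumed linear model)."""
    if unit <= 0:
        raise ValueError("unit signal must be positive")
    return (raw - background) / unit


def call_length(raw: float, unit: float, background: float):
    """Return (called polymer length, distance of the signal from that integer)."""
    x = normalized_signal(raw, unit, background)
    n = max(0, math.floor(x + 0.5))  # round half up, never below zero
    return n, abs(x - n)


# Hypothetical intensities: unit signal 1000, background 100.
print(call_length(3150.0, 1000.0, 100.0))  # length 3, small offset
print(call_length(2600.0, 1000.0, 100.0))  # offset near 0.5: ambiguous call
```

An offset close to 0 suggests a confident length call, while an offset close to 0.5 means the signal sits between two integers, which is the situation where a polymer-level quality score should be low.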
Degenerate polymer length (DPL)
Degenerate sequencing is a type of multi-base sequencing. Unlike single-base sequencing, in which only one nucleotide is extended per reaction cycle, degenerate sequencing may extend multiple nucleotides per reaction cycle. The intensity of the fluorescent signal released by the sequencing reaction is positively correlated with the number of fluorophores released. In the absence of signal decay and dephasing, the fluorescent signal of each reaction cycle reflects the number of bases extended in that cycle, which is called the degenerate polymer length (DPL).
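As a small illustration of the definition, the sketch below computes the ideal degenerate polymer lengths of a known base sequence under one of the 2+2 modes by run-length encoding its degenerate letters. It models the decay-free, in-phase case only; the function name is hypothetical, and the sketch does not model which reagent is supplied first.

```python
from itertools import groupby

# Base-to-letter maps for the three 2+2 modes.
MODES = {
    "MK": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},
}


def degenerate_polymer_lengths(seq: str, mode: str = "MK"):
    """Return [(degenerate letter, DPL), ...] for an ideal sequencing round."""
    letters = (MODES[mode][base] for base in seq)
    return [(letter, len(list(run))) for letter, run in groupby(letters)]


print(degenerate_polymer_lengths("AACGTTTCA", "MK"))
# [('M', 3), ('K', 4), ('M', 2)]
```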
Cycle
A cycle, also called a sequencing cycle or sequencing reaction cycle, is one sequencing reaction: the process of providing a sequencing reagent, incorporating a nucleotide bearing a detectable label into the nucleic acid under test, and then detecting the signal generated by the detectable label.
Polymer
The polymer of the present invention includes homopolymers, binary copolymers and terpolymers, and the length of the polymer may be 1 bp or longer.
Homopolymer
A homopolymer in the present invention is a polymer composed of multiple identical nucleotide monomers; for example, AAAA and TTTTT are both homopolymers.
Binary copolymer
A binary copolymer is a polymer formed from two different monomers and thus containing two kinds of monomer unit. In the present invention it specifically refers to a polymer composed of two kinds of nucleotide monomer. For example, in MK degenerate sequencing, M comprises the two nucleotide monomers A and C and K comprises the two nucleotide monomers G and T, so ACAC and GGTT are binary copolymers.
Terpolymer
Like a binary copolymer, a terpolymer is a polymer formed from three different monomers. In the present invention it specifically refers to a polymer composed of three kinds of nucleotide monomer. For example, in AB degenerate sequencing, B comprises the three nucleotide monomers C, G and T, so TCGG and GTGCC are terpolymers.
Classifier
Classifier classification is a very important method of data mining, and in machine learning, the classifier is used for judging a new class to which an observation sample belongs based on training data marked with the class.
The construction and implementation of the classifier generally requires the following steps:
Selecting samples (including positive samples and negative samples), and dividing all the samples into training samples and test samples;
Executing a classifier algorithm on the training samples to generate a classification model;
Executing the classification model on the test samples to generate prediction results.
Preferably, the necessary evaluation index is calculated according to the prediction result, and the performance of the classification model is evaluated.
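The steps above can be illustrated with a minimal sketch. The nearest-centroid rule, the single numeric feature, the toy data, and all function names below are assumptions chosen for illustration; they are not the classifier of the invention.

```python
import random

def split_samples(samples, test_fraction=0.25, seed=0):
    """Step 1: shuffle labeled samples and split them into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_centroid_classifier(train):
    """Step 2: 'train' by storing the mean feature value of each class."""
    model = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        model[label] = sum(values) / len(values)  # assumes both classes are present
    return model

def predict(model, x):
    """Step 3: assign the class whose centroid is closest to the feature value."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, test):
    """Step 4: evaluation index, here the fraction of correct predictions."""
    return sum(predict(model, x) == y for x, y in test) / len(test)

# Hypothetical toy data: (feature, label), label 1 = positive, 0 = negative.
samples = [(0.01 * i, 1) for i in range(10)] + [(0.5 + 0.01 * i, 0) for i in range(10)]
train, test = split_samples(samples)
model = train_centroid_classifier(train)
print(accuracy(model, test))
```

Any other classifier algorithm could be substituted in the training step; the split, train, predict, and evaluate structure stays the same.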
Genetic variation
Genetic variation refers to a nucleic acid sequence that differs from a reference sequence. Typical variations include, but are not limited to, single nucleotide variations (SNV), short insertion and deletion polymorphisms (indels), copy number variations (CNV), epigenetic variations, microsatellite markers or short tandem repeats, and structural variations. Somatic mutation detection is the task of identifying mutations that are present at low frequency in a DNA sample. It is of particular interest in the context of cancer treatment, because cancer is caused by the accumulation of mutations in DNA. DNA samples from tumors are often heterogeneous, containing some normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells (with more mutations). Because of this heterogeneity, somatic mutations typically appear at low frequency when tumors (e.g., from FFPE samples) are sequenced. For example, an SNV may be seen in only 10% of the reads covering a given base. A variation to be classified as somatic or germline by a variation classifier is also referred to herein as a "genetic variation".
Threshold value
A threshold herein means a numeric or non-numeric value used as a cut-off for characterizing a sample, a nucleic acid, or a portion thereof (e.g., a read). The threshold may vary based on empirical analysis. The threshold may be compared with a measured or calculated value to determine whether the source that generated that value should be classified in a particular manner. The threshold may be identified empirically or analytically. The choice of threshold depends on the confidence level with which the user wishes to classify. The threshold may be selected for a particular purpose (e.g., to balance sensitivity and selectivity). As used herein, the term "threshold" indicates a point at which an analysis process may be changed and/or a point at which an action may be triggered. The threshold need not be a predetermined number; it may instead be, for example, a function of several factors. The threshold may be adjusted as appropriate. Further, the threshold may indicate an upper limit, a lower limit, or a range between limits.
The foregoing outlines the general meaning of the terms used in the present invention. These terms all carry their conventional meaning in the art and are restated here only to avoid ambiguity; no special meaning is intended.
Detailed Description
Quality assessment of nucleic acid sequencing data is very important for subsequent bioinformatic analysis. For example, when identifying gene mutations, if a base has a high quality value and differs from the corresponding base in the reference sequence, a gene mutation is identified at that position; when the quality value of the base is low, the difference is attributed to a sequencing error and no gene mutation is called. In the prior art, assigning a quality value to each base, with the single base as the basic unit, has various defects. For example, many sequencing techniques are prone to insertion or deletion errors when reading longer homomultimers, such as reading TTTT as TTT or TTTTT. Since such errors occur on the homomultimer as a whole, it is difficult to evaluate the quality value of each individual base in the homomultimer accurately. As another example, some bases are unlikely to suffer substitution errors and should be retained when single-base substitution mutations are identified, but these bases often receive low quality values because they are prone to insertion or deletion errors, and are therefore easily discarded during mutation identification, causing false negatives. To overcome these defects of the prior art, the invention discloses a novel method for quality assessment of sequencing data that uses the polymer, rather than the single base, as the basic scoring unit. The method is particularly suitable for sequencing reactions with an open 3' end and helps obtain more accurate bioinformatic analysis results.
Specifically, the invention provides a quality assessment method of nucleic acid sequencing data, which is characterized by comprising the following steps:
providing a nucleic acid sequence to be detected, taking the polymer as a basic unit, and calculating sequencing signal characteristics of the polymer;
predicting a quality score of the polymer based on the sequencing signal features, using a training-calibrated quantization scheme; the training-calibrated quantization scheme includes:
for a provided standard nucleic acid sequence, taking the polymer as the basic unit, calculating the sequencing signal features of each polymer, and marking each polymer as correctly or incorrectly sequenced according to the result of aligning the standard nucleic acid sequence to its reference; and training a classifier to fit the relationship between the sequencing signal features of the polymers and their marks.
In the present invention, the sequencing methods for obtaining a nucleic acid sequence include the dideoxynucleotide chain-termination method (Sanger sequencing), the chemical degradation method (Maxam-Gilbert method), pyrosequencing, semiconductor sequencing, cyclic reversible termination (cyclic reversible terminator), fluorogenic sequencing, error-correction code sequencing, fuzzy sequencing, deletion sequencing (CN 2022101040373), combinatorial probe-anchor ligation, combinatorial probe-anchor polymerization, sequencing by oligonucleotide ligation and detection, sequencing-by-binding, single-molecule fluorescence sequencing, single-molecule real-time sequencing, nanopore sequencing, and the like.
According to a preferred embodiment, the sequencing method is selected from sequencing reactions with an open 3' end, including but not limited to pyrosequencing, semiconductor sequencing, fluorogenic sequencing, error-correction code sequencing, fuzzy sequencing, deletion sequencing, nanopore sequencing, etc. In these methods, one cycle of sequencing chemistry may extend the strand by one or more nucleotide molecules; that is, each cycle proceeds in units of polymers rather than single bases. Using the polymer as the basic unit of quality assessment therefore reflects the actual sequencing process more faithfully and yields more accurate quality assessment results.
In the present invention, the nucleic acid sample includes deoxyribonucleic acid (DNA), ribonucleic acid (RNA), peptide nucleic acid (PNA), xylose nucleic acid (XNA), locked nucleic acid (LNA), and the like.
In some embodiments, the nucleic acid sample to be tested and the standard nucleic acid sample are labeled separately and then sequenced simultaneously with the same sequencing method, and the corresponding nucleic acid sequences are obtained for each. In some embodiments, the standard nucleic acid sample is sequenced first to obtain its nucleic acid sequence, and the nucleic acid sample to be tested is then sequenced with the same sequencing method to obtain its nucleic acid sequence. In some embodiments, the nucleic acid sample to be tested is sequenced first, and the standard nucleic acid sample is sequenced afterwards. The order in which the sample to be tested and the standard nucleic acid sample are sequenced can be exchanged; what matters is that both are sequenced and base-called with the same sequencing method.
According to a preferred embodiment, the input data is image data derived from sequencing images generated by a sequencer during a sequencing run. For example, input data generated by sequencing methods such as pyrosequencing (pyrosequencing), fluorogenic sequencing (fluorogenic sequencing), error correction code sequencing (error-correction code sequencing), cyclic reversible termination (cyclic reversible terminator), and combined probe-anchor ligation (combinatorial probe-anchor ligation) are image data.
In some embodiments, the input data is based on a pH change due to release of hydrogen ions during extension of the nucleotide substrate molecule, which is detected and converted to a voltage change proportional to the number of nucleotides incorporated, as is the case, for example, with Ion Torrent's semiconductor sequencing.
In some embodiments, the input data is generated by nanopore sensing, which uses a biosensor to measure the disruption of current as an analyte passes through or near a nanopore, and thereby determines the identity of the bases. For example, Oxford Nanopore Technologies (ONT) sequencing is based on the following concept: single-stranded DNA (or RNA) is passed through a membrane via a nanopore while a voltage difference is applied across the membrane. The nucleotides present in the pore affect the electrical resistance of the pore, so current measurements over time can indicate the sequence of DNA bases passing through the pore. This current signal (called a "squiggle" because of its appearance when plotted) is the raw data collected by the ONT sequencer. The measurements are stored as 16-bit integer data acquisition (DAC) values sampled at a frequency of, for example, 4 kHz. At a DNA strand speed of about 450 bases per second, this gives on average about nine raw observations per base. The signal is then processed to identify the breaks in the open-pore signal that correspond to individual reads. The most important use of the raw signal is base calling, i.e., conversion of the DAC values into a DNA base sequence. In some implementations, the input data includes normalized or scaled DAC values.
In the present invention, polymers include homomultimers, binary copolymers, ternary copolymers, and the like. The nucleic acid sequence to be tested is divided into a plurality of polymers. When the sequence is divided, all polymers may be homomultimers, all may be binary copolymers, or all may be ternary copolymers; alternatively, some may be homomultimers, some binary copolymers, and others ternary copolymers. The length of each polymer may be 1 bp or more, for example 5 bp, 10 bp, or more than 10 bp.
The division into polymers should correspond to the order in which reagents are introduced (the sample injection procedure) during sequencing. For example, if the sequencing method is MK degenerate sequencing, the resulting nucleic acid sequence may be MMKKKMKMMM, which is a degenerate sequence, and the polymers should be divided according to the actual extension steps of MK sequencing, namely (MM) (KKK) (M) (K) (MMM). If the sequencing method is AB degenerate sequencing, the resulting sequence may be ABBBAAABB, and the polymers are divided according to the actual extension steps of AB sequencing, namely (A) (BBB) (AAA) (BB). If the sequencing method is 1x4 sequencing, the resulting sequence may be ACCCTTGGATT, which is a determined nucleotide sequence, and the polymers, taken as basic units, should be divided according to the actual extension steps of 1x4 sequencing, namely (A) (CCC) (TT) (GG) (A) (TT).
In a preferred embodiment, the nucleic acid sequence is a determined nucleotide sequence, i.e., a sequence represented by A, G, C, T or by A, G, C, U.
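The division rule in the three examples above amounts to grouping consecutive identical symbols of the called sequence. The following minimal sketch reproduces the examples; the function name is hypothetical, and the sketch assumes that each maximal run of one symbol corresponds to one extension step.

```python
from itertools import groupby

def split_into_polymers(sequence):
    """Split a called sequence into maximal runs of one symbol (one polymer per run)."""
    return ["".join(group) for _, group in groupby(sequence)]

print(split_into_polymers("MMKKKMKMMM"))   # MK degenerate sequencing
print(split_into_polymers("ABBBAAABB"))    # AB degenerate sequencing
print(split_into_polymers("ACCCTTGGATT"))  # 1x4 sequencing
```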
In some embodiments, the nucleic acid sequence is a degenerate sequence, or fuzzy sequence, i.e., a sequence containing indeterminate information, with degenerate bases represented by M, K, R, Y, W, S, B, D, H, V, etc.; it is understood that part of the sequence may be determined.
In the present invention, a standard nucleic acid sample refers to a nucleic acid sample whose source and sequence are both known and which is highly homozygous at almost all loci in the genome, for example lambda phage DNA produced by New England Biolabs. Because the reference sequence of a standard nucleic acid is known, the sequenced nucleic acid sequences can be aligned to the corresponding reference sequence and each polymer can be marked as correct or incorrect. For example, a sequenced nucleic acid sequence ATTGGCCAAAT is divided into 3 binary copolymers, (ATT) (GGCC) (AAAT); if the reference sequence is ATTGGCCAAAA, the first and second polymers are marked as correct and the third polymer is marked as incorrect.
In a specific embodiment, the nucleic acid sequence obtained by sequencing is aligned to the corresponding reference nucleic acid sequence to obtain an alignment result, and each polymer is marked as correctly or incorrectly sequenced according to the alignment result. Preferably, high-quality aligned sequences are first selected from the alignment result, the polymers in those sequences are then marked as sequencing correct or sequencing error, and undetermined sequences (i.e., bases that cannot be aligned to the reference sequence or bases with low alignment quality) are ignored. Based on the alignment, polymers that "match" are marked "sequencing correct", and polymers with a "mismatch", "insertion", or "deletion" are marked "sequencing error". The quality value range that defines a high-quality alignment in the present invention must be chosen according to the alignment software or algorithm used; for example, when sequence alignment is performed with BWA, high-quality aligned sequences are those with an alignment quality greater than 0, or greater than or equal to 10, 20, 30, 40, 50, or 60.
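The marking in the ATTGGCCAAAT example can be sketched as follows. This is a simplified illustration that assumes the read and the reference have the same length and are already in register (no insertions or deletions); in practice the correspondence would come from an aligner. The function name is hypothetical.

```python
def label_polymers(polymers, reference):
    """Mark each polymer True (correct) or False (incorrect) against the reference.

    Simplifying assumption: the concatenated polymers and the reference are
    position-by-position comparable, i.e. the read contains no indels.
    """
    labels = []
    start = 0
    for polymer in polymers:
        end = start + len(polymer)
        labels.append(polymer == reference[start:end])
        start = end
    return labels

# Example from the text: the third polymer differs from the reference.
print(label_polymers(["ATT", "GGCC", "AAAT"], "ATTGGCCAAAA"))  # [True, True, False]
```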
According to a preferred embodiment, the sequencing signal features of a polymer are the features of the signal generated when the polymer undergoes sequencing chemistry during sequencing. FIG. 1 gives examples of sequencing signal features, which include but are not limited to: the type of the base, i.e., which of A, G, C, T (or U) the base is; the position of the base in the sequence, i.e., its ordinal position in the nucleotide sequence containing it (for single-ended sequencing, for example, earlier bases typically have higher sequencing quality values than later bases); the length of the polymer containing the base, i.e., the number of bases in the homomultimer or degenerate polymer containing it (generally, the shorter the polymer, the higher the sequencing quality value); the position of the base within its polymer, i.e., the distance from the base to the nearest end of the homomultimer or degenerate polymer containing it; the round of sequencing chemistry in which the base is incorporated, i.e., the cycle number at which the base is incorporated into the nucleotide chain (usually, the smaller the cycle number, the higher the quality value); the signal intensity, which may be the intensity of the signal collected directly by the sequencer, such as brightness, voltage, or current, or may be a normalized signal or a signal after phase loss (dephasing) correction; the closeness of the signal intensity (and of adjacent signal intensities) to an integer, i.e., the difference between the normalized, dephasing-corrected, or error-corrected signal and the nearest integer (in general, the smaller the difference, the higher the accuracy); the parameters of the sequencing signal, i.e., unit signal, background signal, lead coefficient, lag coefficient, attenuation coefficient, and the like; the degree of dephasing when the base is detected (generally, the lower the dephasing, the higher the accuracy); and so on.
According to a preferred embodiment, one or more sequencing signal features of the polymer are calculated, e.g., only one sequencing signal feature may be selected for calculation, two sequencing signal features may be selected, or more sequencing signal features.
Further, after each polymer of the standard nucleic acid sequence has been marked, a classifier is trained to fit the relationship between the sequencing signal features of the polymers and their marks. Classifiers are a conventional concept in the field of pattern recognition and include, but are not limited to, linear regression, polynomial regression, logistic regression, support vector machines, artificial neural networks, random forests, the Phred algorithm, ensemble learning, and the like. With the development of pattern recognition, various novel classifier algorithms have been proposed in recent years. The use of novel classifier algorithms does not alter the essence of the invention.
Specifically, a classifier is trained to fit the relationship between the sequencing signal features of each polymer and its mark; the classifier can divide the polymers into several classes according to their sequencing signal features and compute the accuracy of each class. For example, polymers of length 1, 2, 3, 4, 5, and more than 5 may each form one class, or polymers with unit signals in the ranges 100 to 199, 200 to 299, 300 to 399, and >400 may each form one class. When multiple sequencing signal features are used, the classes can be formed by orthogonal partitioning: polymers with a length of 1 and a unit signal within 100 to 199 form one class, polymers with a length of 1 and a unit signal within 200 to 299 form another class, and so on.
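The orthogonal partitioning described above can be sketched as a simple binning classifier that counts, for each class, the fraction of polymers marked correct. The bin boundaries follow the example ranges in the text; the function names, the "other" bin for values outside the listed ranges, the assumption of integer unit signals, and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def length_bin(length):
    """Lengths 1..5 each form one bin; longer polymers share one bin."""
    return str(length) if length <= 5 else ">5"

def unit_signal_bin(unit_signal):
    """Bins from the text; integer signals assumed, anything else goes to 'other'."""
    for low in (100, 200, 300):
        if low <= unit_signal <= low + 99:
            return f"{low}-{low + 99}"
    return ">400" if unit_signal > 400 else "other"

def accuracy_by_class(polymers):
    """polymers: iterable of (length, unit_signal, is_correct) tuples."""
    counts = defaultdict(lambda: [0, 0])  # class -> [number correct, total]
    for length, unit_signal, is_correct in polymers:
        key = (length_bin(length), unit_signal_bin(unit_signal))
        counts[key][0] += int(is_correct)
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}

# Hypothetical labeled polymers from a standard sample.
labeled = [(1, 150, True), (1, 150, True), (1, 150, False), (1, 250, True), (7, 450, False)]
for key, acc in sorted(accuracy_by_class(labeled).items()):
    print(key, round(acc, 3))
```

The per-class accuracy obtained this way is the quantity that is later converted into a quality score.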
In a preferred embodiment, after the fitting is completed, the fitting results of the classifier are converted into quality scores. Many literature reports describe how to convert the predictions of a classifier into quality values. Taking the well-known softmax algorithm as an example, let the output of a classifier be (a, b), where (1, 0) represents correct and (0, 1) represents incorrect. Because of factors such as the training accuracy of the classifier or numerical error at prediction time, the output of the classifier during prediction is not always exactly (1, 0) or (0, 1), but rather a value close to one of them, such as (0.9, 0.05) or (0.1, 0.99). The softmax algorithm then converts the output (a, b) into the probability of being correct using the following equation:
Figure BDA0004142955640000141
With the development of the pattern recognition field, various novel conversion algorithms have been proposed in recent years, including, for example, sparse-softmax, log-softmax, Taylor softmax, log-Taylor softmax, soft-margin softmax, SM-Taylor softmax, etc.; the use of such novel conversion algorithms does not change the essence of the present invention.
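The equation image referenced above is not reproduced in this text. As an assumption about its content, the standard two-class softmax would give the probability of being correct as exp(a) / (exp(a) + exp(b)). The sketch below implements that standard form; the function name is hypothetical.

```python
import math

def softmax_correct_probability(a, b):
    """Two-class softmax: probability assigned to the 'correct' output a.

    Subtracting the maximum before exponentiating avoids overflow and does
    not change the result.
    """
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

# Outputs close to (1, 0) and (0, 1), as in the example in the text.
print(softmax_correct_probability(0.9, 0.05))
print(softmax_correct_probability(0.1, 0.99))
```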
For a sample of nucleic acid to be tested, the quality score of each polymer is predicted based on the sequencing signal characteristics of the nucleic acid to be tested by using the method for converting the fitting result of the classifier into the quality score.
In some embodiments, the quality score is a value that characterizes sequencing accuracy, selected from the base-calling accuracy, the base-calling error rate, the Phred value, and the like. It will be appreciated that the quality score merely expresses the accuracy of base sequencing; the form of expression itself is not critical, and the quality score may be expressed directly as an accuracy without affecting the essence of the invention.
In some embodiments, the quality scores are logarithmic transforms of base-calling error probabilities, and the plurality of quality scores comprises Q10, Q15, Q20, Q25, Q30, Q35, Q40, Q45, Q50, Q55, Q60.
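For reference, the widely used Phred convention relates a quality score Q to an error probability P by Q = -10 log10(P). The sketch below converts in both directions; the cap of 60 mirrors the highest score listed above and is otherwise an arbitrary assumption, as are the function names.

```python
import math

def phred_from_error_probability(p_error, max_q=60.0):
    """Phred quality Q = -10 * log10(P_error), capped at max_q."""
    if p_error <= 0:
        return max_q
    return min(max_q, -10.0 * math.log10(p_error))

def error_probability_from_phred(q):
    """Inverse relation: P_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# Error probabilities corresponding to the scores listed in the text.
for q in range(10, 65, 5):
    print(f"Q{q}: P_error = {error_probability_from_phred(q):.1e}")
```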
In some preferred embodiments, the training-calibrated quantization scheme may be performed in advance, and the trained classifier may be stored in the system as a configuration file and retrieved when the nucleic acid sequence to be tested is scored.
In some preferred embodiments, the standard nucleic acid sample and the nucleic acid sample to be tested may be tagged with different molecular labels and mixed together for simultaneous sequencing. After sequencing, the two samples are separated by their molecular labels, the training-calibrated quantization scheme is carried out on the standard sample to obtain a trained classifier, and the classifier is then applied to the nucleic acid sample to be tested.
The invention also discloses a method according to any of the preceding embodiments, wherein sequencing is performed using a reaction solution containing nucleotide substrate molecules. The nucleotide substrate molecules are any one, two, or three of the A, G, C, T nucleotide substrate molecules, or any one, two, or three of the A, G, C, U nucleotide substrate molecules.
Disclosed herein are sequencing methods using nucleotide substrate molecules with fluorophore labels according to any of the preceding embodiments, wherein one set of reaction solutions is used per sequencing run, each set comprising at least two reaction solutions, and each reaction solution comprising at least one of the A, G, C, T nucleotide substrate molecules or at least one of the A, G, C, U nucleotide substrate molecules. In one aspect, the method includes immobilizing the nucleotide sequence fragment to be tested, introducing one reaction solution of a set, and recording fluorescence information. In one aspect, the method includes introducing one reaction solution at a time and then introducing the other reaction solutions of the same set in turn. In one aspect, each set of reaction solutions contains at least one reaction solution comprising two or three kinds of nucleotide molecules.
Disclosed herein are sequencing methods using nucleotide substrate molecules with fluorophore labels, wherein one set of reaction solutions is used per sequencing run, each set comprising two reaction solutions, and each reaction solution comprising two nucleotides with different bases. In one aspect, the nucleotides in one reaction solution are complementary to two of the bases on the nucleotide sequence to be tested, and the nucleotides in the other reaction solution are complementary to the other two bases. In one aspect, the method includes immobilizing the nucleotide sequence fragment to be tested, introducing the first reaction solution of a set, and then introducing the second reaction solution of the same set. The two reaction solutions may be added alternately, and the coding information of the nucleotide sequence to be tested is obtained from the fluorescence information. When the two nucleotides in each reaction solution carry different fluorescent labels (i.e., two-color 2+2 sequencing), two orthogonal two-color 2+2 sequencing runs are performed; when the two nucleotides in each reaction solution carry the same fluorescent label (i.e., single-color 2+2 sequencing), preferably three orthogonal single-color 2+2 sequencing runs are performed. Alternatively, one reaction solution contains three nucleotides and the other contains the remaining nucleotide, and four orthogonal sequencing runs (1+3 sequencing) may be performed alternately.
According to a preferred embodiment, the method of the invention further comprises performing a bioinformatic analysis of the measured nucleic acid sequence based on the quality scores of the polymers.
According to a preferred embodiment, the bioinformatic analysis refers to performing two or more rounds of orthogonal degenerate sequencing, obtaining a quality value for each degenerate polymer length, and then using the quality values for error-correction code decoding (or error correction). For example, three rounds of orthogonal degenerate sequencing (MK, RY, WS) are performed on the nucleic acid to be tested, and the resulting nucleic acid sequence is ACCGTTTGC. Taking the polymer CC as an example: in the MK round, the degenerate polymer containing it is ACC; in the RY round, the degenerate polymer containing it is CC; in the WS round, the degenerate polymer containing it is CCG. The features of CC in each round, such as the degenerate polymer length (3, 2, and 3, respectively), are determined, its quality scores are calculated, and error correction (or error-correction decoding) is performed according to the quality scores, thereby determining the final nucleotide sequence.
Taking 2+2 degenerate sequencing as an example to illustrate the principle of ECC decoding: ECC sequencing consists of three rounds of degenerate sequencing, MK, RY, and WS, each of which yields a degenerate sequence containing half of the information of the DNA to be tested. By taking the intersection of the three degenerate bases at the same position of the three degenerate sequences, the exact base composition of the DNA to be tested can be obtained. The intersection has 8 possible cases in total:
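The ACCGTTTGC example can be checked with a short sketch that projects the determined sequence onto each degenerate alphabet and reports the degenerate polymer containing a given position. M = A/C and K = G/T are as defined in the text; R = A/G, Y = C/T, W = A/T, and S = C/G follow the standard IUPAC degenerate codes. Function and variable names are hypothetical.

```python
from itertools import groupby

DEGENERATE_CODES = {
    "MK": {"A": "M", "C": "M", "G": "K", "T": "K"},
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "WS": {"A": "W", "T": "W", "C": "S", "G": "S"},
}

def degenerate_polymer_spans(sequence, scheme):
    """Return (start, end) spans of the degenerate polymers under one scheme."""
    mapping = DEGENERATE_CODES[scheme]
    spans, start = [], 0
    for _, group in groupby(sequence, key=mapping.__getitem__):
        length = len(list(group))
        spans.append((start, start + length))
        start += length
    return spans

def containing_polymer(sequence, scheme, position):
    """Return the bases of the degenerate polymer that covers `position`."""
    for start, end in degenerate_polymer_spans(sequence, scheme):
        if start <= position < end:
            return sequence[start:end]
    raise IndexError(position)

sequence = "ACCGTTTGC"
for scheme in ("MK", "RY", "WS"):
    polymer = containing_polymer(sequence, scheme, 1)  # position of the first C of CC
    print(scheme, polymer, len(polymer))
```

For the CC polymer this yields ACC, CC, and CCG with lengths 3, 2, and 3, matching the example.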
Figure BDA0004142955640000161
Of these 8 cases of intersection, 4 are legal, each yielding one of the four bases, and 4 are illegal, with the intersection being the empty set. Under ideal conditions without sequencing errors, the intersections of the three degenerate sequences are all legal, and the sequence of the DNA to be tested is obtained. When the DPL calculated by the signal processing algorithm contains errors, illegal cases appear near the sequencing errors. An illegal case in the intersection of the degenerate sequences therefore indicates the presence of a sequencing error. To correct sequencing errors, the probabilities of the different sequencing error modes are first obtained statistically from sequencing data of a standard sample. Then, based on the maximum likelihood principle, the ECC decoding algorithm attempts to correct the DPL calculated by the signal processing algorithm so as to achieve 2 goals: 1. after correction, all intersections of the three degenerate sequences are legal; 2. subject to the first condition, the correction has the maximum probability given the probabilities of the different sequencing error modes. An optimization method can be used to achieve these 2 goals.
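The 8 intersection cases can be enumerated with a few lines of code. This sketch only covers the intersection step, not the maximum-likelihood correction; the degenerate base sets follow the standard IUPAC codes, and the function names are hypothetical.

```python
from itertools import product

DEGENERATE_BASES = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
}

def decode_position(mk, ry, ws):
    """Intersect three degenerate bases; return the base, or None if illegal."""
    common = DEGENERATE_BASES[mk] & DEGENERATE_BASES[ry] & DEGENERATE_BASES[ws]
    return next(iter(common)) if common else None

# All 8 combinations: 4 legal (one base each) and 4 illegal (empty set).
table = {codes: decode_position(*codes) for codes in product("MK", "RY", "WS")}
for codes, base in table.items():
    print("".join(codes), "->", base if base else "empty set (illegal)")

def decode_sequence(mk_seq, ry_seq, ws_seq):
    """Decode position by position; illegal positions are reported as None."""
    return [decode_position(m, r, w) for m, r, w in zip(mk_seq, ry_seq, ws_seq)]

print(decode_sequence("MMMKKKKKM", "RYYRYYYRY", "WSSSWWWSS"))
```

A `None` in the decoded list marks a position at which a sequencing error must have occurred in at least one of the three rounds.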
In certain embodiments, the bioinformatic analysis refers to screening for high-quality nucleic acid sequences based on the assigned quality values. Screening methods include, but are not limited to: selecting nucleic acid sequences in which all quality values are above or below a certain threshold; selecting nucleic acid sequences in which the mean of all quality values is above or below a certain threshold; selecting regions of nucleic acid sequences in which the quality values are above or below a certain threshold; selecting regions of nucleic acid sequences in which the mean of the quality values is above or below a certain threshold; and the like. The high-quality nucleic acid sequences obtained by screening can be used for detecting gene variation, detecting gene expression levels, detecting RNA alternative splicing states, detecting gene modification states, identifying the species or individual from which the nucleic acid is derived, detecting the three-dimensional structure of the genome, detecting nucleic acid-nucleic acid interactions, detecting nucleic acid-protein interactions, detecting chromatin accessibility, analyzing RNA structure, and the like.
In some embodiments, the bioinformatic analysis refers to aligning nucleic acid sequences to reference sequences according to the assigned quality values. Alignment is a conventional concept in bioinformatics and can be performed using software or algorithms such as BWA, the Smith-Waterman algorithm, Bowtie, SOAP, the Needleman-Wunsch algorithm, Bowtie2, BLAST, ELAND, TMAP, MAQ, minimap2, and SHRiMP. Methods for utilizing quality values in an alignment include, but are not limited to:
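Three of the screening criteria listed above can be sketched as follows, with one quality value per polymer (or base). The thresholds, the toy quality values, and the function names are illustrative assumptions.

```python
def passes_all(qualities, threshold):
    """True if every quality value is at or above the threshold."""
    return all(q >= threshold for q in qualities)

def passes_mean(qualities, threshold):
    """True if the mean quality value is at or above the threshold."""
    return sum(qualities) / len(qualities) >= threshold

def high_quality_regions(qualities, threshold):
    """Return (start, end) index spans of contiguous values at or above the threshold."""
    regions, start = [], None
    for i, q in enumerate(qualities):
        if q >= threshold and start is None:
            start = i
        elif q < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(qualities)))
    return regions

# Hypothetical per-polymer quality values of one read.
read_qualities = [30, 35, 10, 40, 40, 12]
print(passes_all(read_qualities, 20))            # False
print(passes_mean(read_qualities, 25))           # True
print(high_quality_regions(read_qualities, 20))  # [(0, 2), (3, 5)]
```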
1. selecting a subsequence with high quality values as the seed for preliminary positioning of the sequence;
2. when multiple alignments are possible, preferentially treating polymers/bases with low quality values as the mismatched portion of the alignment.
In some embodiments, the bioinformatic analysis refers to identifying genetic variations based on the alignment results and the quality values assigned to the aligned sequences. Genetic variation is a common concept in biology and includes, but is not limited to, single nucleotide polymorphisms, copy number variations, epigenetic variations, large-scale structural variations, and the like. Methods for using quality values in identifying genetic variations include, but are not limited to:
1. for the genomic locus at which variation is to be identified, selecting, from all bases of the sequences aligned to the locus, the bases whose polymer/base quality values are higher, and using them for identification;
2. stating the null hypothesis that there is no genetic variation at the locus; calculating the probability that the null hypothesis holds from the quality values and the alignment result; accepting the null hypothesis if the probability is greater than a given significance level; and otherwise rejecting it, in which case the locus is considered to carry a genetic variation.
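One possible way to carry out the test in item 2 is sketched below. It assumes that reads are independent, that quality values are Phred-scaled, and that the probability under the null hypothesis is the chance of observing at least the seen number of mismatching reads from sequencing errors alone (a Poisson-binomial tail). These modeling choices and all names are assumptions for illustration, not the method prescribed by the invention.

```python
def prob_at_least_k_errors(error_probs, k):
    """P(at least k of the independent reads are erroneous), by dynamic programming."""
    dist = [1.0]  # dist[j] = probability of exactly j errors among reads seen so far
    for p in error_probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1.0 - p)
            new[j + 1] += mass * p
        dist = new
    return sum(dist[k:])

def has_variation(phred_qualities, n_mismatching, significance=0.01):
    """Reject 'no variation' when the null probability is not above the significance level."""
    error_probs = [10.0 ** (-q / 10.0) for q in phred_qualities]
    p_null = prob_at_least_k_errors(error_probs, n_mismatching)
    return p_null <= significance

# Hypothetical locus covered by 10 reads, all with quality Q20.
qualities = [20] * 10
print(has_variation(qualities, 1))  # one mismatching read
print(has_variation(qualities, 3))  # three mismatching reads
```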
In some embodiments, certain features of the alignment may be used to remove potential false positive or false negative results when identifying genetic variations. These are all routine manipulations in bioinformatics, the addition of which does not affect the essence of the invention. Such features include, but are not limited to:
1. the genetic variation is concentrated in the forward-aligned (or reverse-aligned) sequences and rare in the reverse-aligned (or forward-aligned) sequences;
2. the genetic variation is concentrated at the two ends of sequences and rare in the middle of sequences;
3. when paired-end sequencing is used, read1 mainly shows G-to-T at the site while read2 mainly shows C-to-A, or read1 mainly shows C-to-T while read2 mainly shows G-to-A;
4. other, different genetic variations frequently occur around the genetic variation.
In some embodiments, bioinformatic analysis refers to the assembly of nucleic acid sequences into longer nucleic acid sequences based on the assigned quality values.
According to a preferred embodiment, the classifier may be trained on the basis of a maximum-likelihood probability distribution model. The probability distribution is one with a unimodal shape, including but not limited to the two-point distribution, binomial distribution, negative binomial distribution, Poisson distribution, geometric distribution, exponential distribution, normal distribution, Γ distribution, chi-square distribution, t distribution, F distribution, beta distribution, lognormal distribution, and high-dimensional extensions of the above. In this model, the expectation or peak of the probability distribution is related to the sequencing signal features of the polymer, and the variance or bias is related to the quality value of the polymer. From basic statistics, each probability distribution is completely determined by a set of parameters; for example, a normal distribution is completely determined by its mean and standard deviation. After the sequenced sequences are aligned to the reference genome, the correspondence between the sequencing signal features of each polymer and the length of the polymer (denoted n) can be determined from the alignment. Once the probability distribution has been fully determined by given parameters, its integrated area between n-0.5 and n+0.5 represents the probability that the sequencing signal feature corresponds to polymer length n. The likelihood function is obtained by computing, with the distribution determined by the given parameters, the probability that each polymer's sequencing signal feature corresponds to its length n, and multiplying these probabilities together.
Maximum likelihood means that a set of parameters is found such that after the probability distribution is determined with the set of parameters, the resulting likelihood function is maximized.
The maximum likelihood based probability distribution model may divide the polymer sequencing signal features into several populations, each population separately applying the maximum likelihood based probability distribution model.
The likelihood function may be mathematically transformed for simplicity of calculation. Mathematical transformations transform probability multiplication into probability addition, for example, by taking the logarithm.
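The maximum-likelihood procedure described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, a normal distribution whose mean is the signal multiplied by a hypothetical `gain` parameter and whose standard deviation is a hypothetical `sigma`; the function names, the toy observations, and the grid search are likewise assumptions and are not taken from the patent.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of Normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def length_probability(signal, n, gain, sigma):
    """Area of Normal(gain * signal, sigma) between n - 0.5 and n + 0.5."""
    mu = gain * signal  # assumed link between signal and distribution mean
    return normal_cdf(n + 0.5, mu, sigma) - normal_cdf(n - 0.5, mu, sigma)

def log_likelihood(observations, gain, sigma):
    """Sum of log-probabilities (the logarithm of the product likelihood)."""
    total = 0.0
    for signal, n in observations:
        p = length_probability(signal, n, gain, sigma)
        total += math.log(max(p, 1e-300))  # floor avoids log(0)
    return total

def fit_by_grid_search(observations, gains, sigmas):
    """Return the (gain, sigma) pair on the grid with the highest log-likelihood."""
    return max(
        ((g, s) for g in gains for s in sigmas),
        key=lambda params: log_likelihood(observations, *params),
    )

# Toy (signal, aligned polymer length n) pairs; the last one is a miscall.
observations = [(2.1, 1), (3.9, 2), (6.2, 3), (1.8, 1), (8.1, 4), (4.8, 3)]
best_gain, best_sigma = fit_by_grid_search(
    observations, gains=[0.3, 0.4, 0.5, 0.6], sigmas=[0.1, 0.2, 0.3, 0.5]
)
print(best_gain, best_sigma)
```

In this sketch the fitted spread is finite only because one observation falls outside its aligned window; a real fit would use a proper optimizer rather than a coarse grid.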
The invention also provides a quality assessment method for nucleic acid sequencing data, comprising the following steps: performing fuzzy sequencing or deletion sequencing on a nucleic acid sample to be tested to obtain input data, generating degenerate polymer length information from the input data, and calculating sequencing signal features of the degenerate polymer length;
predicting a quality score of the polymer based on the sequencing signal features using a trained and calibrated quantization scheme;
the trained and calibrated quantization scheme includes:
sequencing a standard nucleic acid sample to obtain a nucleic acid sequence, calculating sequencing signal features of the nucleic acid sequence, aligning the nucleic acid sequence to a reference sequence, and labeling each polymer of the nucleic acid sequence as correctly or incorrectly sequenced; and training a classifier to fit the relationship between the sequencing signal features of the polymer and its label.
According to a preferred embodiment, the sequencing signal features of the degenerate polymer length include: the length of the polymer in which the base is located, i.e., the number of bases in the degenerate polymer containing the base (in general, the shorter the polymer, the higher the sequencing quality value); the position of the base within its polymer, i.e., the distance from the base to the nearest end of the degenerate polymer containing it; and the like.
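The two per-base features named above can be sketched as follows. The two-class degenerate encoding (A/C mapped to M, G/T mapped to K), the zero-based distance convention, and the function name are assumptions made for illustration; the source does not specify them.

```python
from itertools import groupby

# Hypothetical degenerate encoding (assumption, not from the source):
# A and C are indistinguishable (M), G and T are indistinguishable (K).
DEGENERATE_CLASS = {"A": "M", "C": "M", "G": "K", "T": "K"}

def per_base_features(sequence, encoding=DEGENERATE_CLASS):
    """Return (degenerate symbol, polymer length, distance to nearest end) per base.

    A degenerate polymer is taken to be a maximal run of bases sharing the
    same degenerate symbol. Distance is zero-based: a terminal base has 0.
    """
    encoded = [encoding[base] for base in sequence]
    features = []
    for symbol, run in groupby(encoded):
        length = len(list(run))
        for i in range(length):
            distance_to_end = min(i, length - 1 - i)
            features.append((symbol, length, distance_to_end))
    return features

features = per_base_features("ACCGT")
print(features)
```

Under this encoding, "ACCGT" forms one M polymer of length 3 followed by one K polymer of length 2.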
Specifically, a classifier is trained to fit the relationship between the sequencing signal features of each polymer and its label. The classifier can divide the polymers into several classes according to their sequencing signal features and count the sequencing accuracy of each class. For example, polymers of length 1, 2, 3, 4, 5, and greater than 5 may each form a separate class. When multiple sequencing signal features are used, orthogonal partitioning may be performed.
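A minimal sketch of this class-based counting is given below. It interprets "orthogonal partitioning" as taking the cross product of the bins of each feature, which is an assumption; the second feature (base type), the function names, and the toy data are also hypothetical.

```python
from collections import defaultdict

def length_bin(length, cap=5):
    """Lengths 1..cap each get their own bin; longer polymers share one bin."""
    return str(length) if length <= cap else f">{cap}"

def accuracy_by_class(polymers):
    """Count sequencing accuracy per (length bin, base type) class.

    polymers: iterable of (length, base_type, is_correct) tuples, where
    is_correct is the label obtained from alignment to the reference.
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for length, base_type, is_correct in polymers:
        key = (length_bin(length), base_type)  # cross product of the two features
        counts[key][0] += int(is_correct)
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}

labeled = [
    (1, "A", True), (1, "A", True), (1, "A", False),
    (6, "T", False), (7, "T", True),
]
table = accuracy_by_class(labeled)
print(table)
```

The resulting table maps each class to its observed accuracy, which can then be looked up for polymers in the sample to be tested.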
In a preferred embodiment, after fitting is complete, the fitted results of the classifier are converted into quality scores. Methods for converting classifier predictions into quality values have been widely reported in the literature.
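One widely used conversion is the Phred scale, Q = -10 log10(p), where p is the estimated error probability. The source does not state which conversion it uses, so the sketch below is only an assumed example; the add-one smoothing, the score cap, and the function names are hypothetical choices.

```python
import math

def smoothed_error_rate(correct, total):
    """Error rate with add-one smoothing so that an error-free class is not scored as p = 0."""
    return (total - correct + 1) / (total + 2)

def phred_score(error_probability, max_q=60):
    """Convert an error probability to a Phred-scaled integer quality score."""
    if error_probability <= 0.0:
        return max_q
    q = -10.0 * math.log10(error_probability)
    return min(max_q, round(q))

# Example: a class with 998 correctly sequenced polymers out of 1000.
p_error = smoothed_error_rate(998, 1000)
print(phred_score(p_error))
```

The cap `max_q` keeps classes with very few observed errors from receiving unbounded scores.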
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A method for quality assessment of nucleic acid sequencing data, comprising:
providing a nucleic acid sequence to be tested, taking the polymer as the basic unit, and calculating sequencing signal features of the polymer; predicting a quality score of the polymer based on the sequencing signal features using a trained and calibrated quantization scheme;
the trained and calibrated quantization scheme includes:
for a provided standard nucleic acid sequence, taking the polymer as the basic unit, calculating the sequencing signal features of the polymer, and labeling the polymer as correctly or incorrectly sequenced according to the alignment result of the standard nucleic acid sequence; and training a classifier to fit the relationship between the sequencing signal features of the polymer and its label.
2. The method of claim 1, wherein the sequencing signal features of the polymer refer to features of the signal generated when the polymer undergoes the sequencing chemical reaction during sequencing, including but not limited to the base types constituting the polymer, the length of the polymer, the number of sequencing reaction cycles, the signal intensity, the degree to which the signal intensity approaches an integer, parameters of the sequencing signal, and the degree of dephasing when the polymer is detected.
3. The method according to claim 1 or 2, wherein the polymer comprises a homopolymer, a bipolymer, a terpolymer, or the like.
4. The method according to any one of claims 1-3, wherein the classifier is trained based on a maximum-likelihood probability distribution model.
5. The method of claim 4, wherein training the classifier comprises classifying the polymers into a plurality of classes based on sequencing signal characteristics of the polymers, and counting sequencing accuracy for each class of polymers.
6. The method of claim 5, wherein the classifier includes, but is not limited to, linear regression, polynomial regression, logistic regression, support vector machines, artificial neural networks, random forests, Phred algorithms, ensemble learning, and the like.
7. The method of any one of claims 1-6, wherein the standard nucleic acid sequence is a sequence obtained by sequencing a standard nucleic acid sample; the standard nucleic acid sample refers to a nucleic acid sample whose source and sequence are both known and which is highly homozygous at almost all sites of the genome, and includes lambda phage DNA, E. coli DNA, Saccharomyces cerevisiae DNA, etc.
8. The method of claim 7, further comprising performing a bioinformatic analysis of the nucleic acid sequence to be tested based on the quality score of the polymer.
9. The method of claim 8, wherein the bioinformatic analysis comprises identifying genetic variations based on the alignment and the quality value assigned to the aligned sequences.
10. The method of claim 9, wherein the bioinformatic analysis comprises performing at least two orthogonal degenerate sequencing runs to obtain a quality value for a degenerate polymer length, and correcting with the quality value.
11. A method for quality assessment of nucleic acid sequencing data, comprising:
performing fuzzy sequencing or deletion sequencing on a nucleic acid sample to be tested to obtain input data, generating degenerate polymer length information from the input data, and calculating sequencing signal features of the degenerate polymer length; predicting a quality score of the polymer based on the sequencing signal features using a trained and calibrated quantization scheme;
the trained and calibrated quantization scheme includes:
sequencing a standard nucleic acid sample to obtain a nucleic acid sequence, calculating sequencing signal features of the nucleic acid sequence, aligning the nucleic acid sequence to a reference sequence, and labeling each polymer of the nucleic acid sequence as correctly or incorrectly sequenced; and training a classifier to fit the relationship between the sequencing signal features of the polymer and its label.
CN202310295466.8A 2023-03-24 2023-03-24 Quality assessment method for nucleic acid sequencing data Pending CN116246703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310295466.8A CN116246703A (en) 2023-03-24 2023-03-24 Quality assessment method for nucleic acid sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310295466.8A CN116246703A (en) 2023-03-24 2023-03-24 Quality assessment method for nucleic acid sequencing data

Publications (1)

Publication Number Publication Date
CN116246703A true CN116246703A (en) 2023-06-09

Family

ID=86633250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310295466.8A Pending CN116246703A (en) 2023-03-24 2023-03-24 Quality assessment method for nucleic acid sequencing data

Country Status (1)

Country Link
CN (1) CN116246703A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117594130A (en) * 2024-01-19 2024-02-23 北京普译生物科技有限公司 Nanopore sequencing signal evaluation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110870016B (en) Verification method and system for sequence variant calls
CN114999573B (en) Genome variation detection method and detection system
US20140256571A1 (en) Systems and Methods for Determining Copy Number Variation
EP3052651A1 (en) Systems and methods for detecting structural variants
CN116434843A (en) Base sequencing quality assessment method
CN115083521B (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
CN116246703A (en) Quality assessment method for nucleic acid sequencing data
CN113249453A (en) Method for detecting copy number change
CN108460248B (en) Method for detecting long tandem repeat sequence based on Bionano platform
US20220364080A1 (en) Methods for dna library generation to facilitate the detection and reporting of low frequency variants
WO2021226523A2 (en) Genome sequencing and detection techniques
CN113614832A (en) Method for detecting chaperone unknown gene fusions
US20040219593A1 (en) Biochip and method of designing probes
CN114420214A (en) Quality evaluation method and screening method of nucleic acid sequencing data
US20220356513A1 (en) Synthetic polynucleotides and method of use thereof in genetic analysis
WO2024163553A1 (en) Methods for detecting gene level copy number variation in brca1 and brca2
CN118207309A (en) Sequencing method and analysis method for short tandem repeat sequence
JP2021114903A (en) DNA probe for quadruplex structure of nucleic acid
CN115762641A (en) Fingerprint spectrum construction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination