CN111667882B - Sequencing fuzzy sequence information comparison method - Google Patents

Publication number
CN111667882B
CN111667882B CN202010525168.XA CN202010525168A CN111667882B CN 111667882 B CN111667882 B CN 111667882B CN 202010525168 A CN202010525168 A CN 202010525168A CN 111667882 B CN111667882 B CN 111667882B
Authority
CN
China
Prior art keywords
sequencing
reaction
sequence information
fuzzy
nucleotide
Prior art date
Legal status
Active
Application number
CN202010525168.XA
Other languages
Chinese (zh)
Other versions
CN111667882A (en)
Inventor
周文雄
陈子天
康力
乔朔
段海峰
黄岩谊
Current Assignee
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202010525168.XA
Publication of CN111667882A
Application granted
Publication of CN111667882B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for comparing fuzzy sequence information obtained by sequencing, comprising the following steps: immobilizing the nucleotide fragment to be detected and obtaining fuzzy sequence information through a sequencing reaction; and aligning the fuzzy sequence information with a reference genome, whereby mutations can also be identified. The method does not require the complete nucleic acid base sequence: alignment and variant discovery are performed using only the fuzzy information obtained by sequencing with multi-base reaction solutions, which reduces sequencing cost and speeds up alignment.

Description

Sequencing fuzzy sequence information comparison method
Technical Field
The invention relates to a method and a system for comparing sequencing fuzzy sequence information, belonging to the field of gene sequencing.
Background
High-throughput sequencing technology, also known as next-generation sequencing (NGS), is a class of sequencing technology developed in recent years. It is a revolutionary change from traditional sequencing technology, sequencing tens of thousands to millions of nucleic acid molecules simultaneously. High-throughput sequencing produces large amounts of data, and processing and using that data is an important part of high-throughput sequencing.
High-throughput sequencing can detect genetic variation and thereby provide a basis for clinical diagnosis, screening and the like. Genetic variations include single nucleotide variations (SNV), copy number variations (CNV), chromosome ploidy variations, DNA modification variations (e.g., DNA methylation), and the like. Clinical diagnosis requires rapid, accurate and low-cost detection of genetic variation. However, existing detection methods based on high-throughput sequencing all need to obtain the complete DNA sequence first and only then search for variation, which increases both time and monetary cost. The invention provides a fuzzy comparison method that uses fuzzy nucleic acid sequences to perform alignment and variant discovery rapidly.
Disclosure of Invention
The present invention provides a method for obtaining partial information of a DNA sequence, aligning the partial information to a reference genome, and using the partial information to find/identify genetic variations.
The invention provides a method for comparing fuzzy sequence information obtained by sequencing, characterized by:
immobilizing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction;
comparing the fuzzy sequence information with a reference nucleic acid sequence;
wherein the reaction solution of the sequencing reaction contains nucleotide substrate molecules with two different bases;
the sequencing uses nucleotide substrate molecules whose 5'-terminal polyphosphate is modified with a fluorophore having fluorescence-switching properties;
the fluorescence-switching property means that the fluorescence signal changes significantly after the sequencing reaction compared with before it;
the sequencing reaction is a sequencing method in which the 3' end is not blocked;
comparing the fuzzy sequence information with the reference nucleic acid sequence means encoding the fuzzy sequence information and the reference nucleic acid sequence in the same manner and then aligning them.
The invention provides a method for comparing fuzzy sequence information obtained by sequencing, characterized by:
immobilizing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction;
comparing the fuzzy sequence information with a reference nucleic acid sequence;
wherein the reaction solution of the sequencing reaction contains nucleotide substrate molecules with three different bases;
the sequencing uses nucleotide substrate molecules whose 5'-terminal polyphosphate is modified with a fluorophore having fluorescence-switching properties;
the fluorescence-switching property means that the fluorescence signal changes significantly after the sequencing reaction compared with before it;
the sequencing reaction is a sequencing method in which the 3' end is not blocked;
comparing the fuzzy sequence information with the reference nucleic acid sequence means encoding the fuzzy sequence information and the reference nucleic acid sequence in the same manner and then aligning them.
According to a preferred embodiment, one set of reaction solutions is used for each sequencing, each set comprising two or more reaction solutions, each reaction solution comprising nucleotide substrate molecules of at least two different bases.
According to a preferred embodiment, the ambiguous sequence information is a combination of degenerate sequence information and non-degenerate sequence information.
According to a preferred embodiment, the ambiguous sequence information obtained by sequencing is encoded into one of its possible base sequence information.
According to a preferred embodiment, all of the ambiguous sequence information obtained by sequencing is encoded as numbers.
According to a preferred embodiment, the ambiguous sequence information is encoded simultaneously or sequentially with the reference nucleic acid sequence.
According to a preferred embodiment, the 5'-terminal polyphosphoric acid-modified fluorophore having fluorescence switching properties is a 5'-terminal polyphosphoric acid-modified fluorophore.
According to a preferred embodiment, the sequencing is performed using nucleotide substrate molecules modified, at the terminal or a middle phosphate of the 5' polyphosphate, with a fluorophore having fluorescence-switching properties;
the fluorescence-switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it;
each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules with two different bases;
the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases;
first, the nucleotide sequence fragment to be detected is immobilized in a reaction chamber, and one reaction solution of the set is introduced;
an enzyme releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching;
the second reaction solution of the same set is then introduced;
an enzyme again releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching;
the two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.
The invention provides a system for comparing fuzzy sequence information obtained by sequencing, which comprises a computing system and is characterized in that it uses the method of any one of the preceding claims to compare the fuzzy sequence information obtained by sequencing with a reference nucleic acid sequence.
The invention provides a method for comparing and identifying mutation by fuzzy sequence information obtained by sequencing, which comprises the following steps: fixing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction; comparing the fuzzy sequence information with a reference genome; wherein the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two or more different bases.
The reaction solution of the sequencing reaction comprises nucleotide substrate molecules with two or more different bases. Each sequencing reaction therefore yields sequence information corresponding to the nucleotide substrate molecules in that reaction solution. This information may reflect the number of incorporated bases of two or more kinds, and is thus ambiguous rather than specific sequence information.
According to a preferred embodiment of the present invention, the sequencing is performed using nucleotide substrate molecules whose 5'-terminal polyphosphate is modified with a fluorophore having fluorescence-switching properties; the fluorescence-switching property means that the fluorescence signal changes significantly after the sequencing reaction compared with before it.
According to a preferred embodiment of the invention, the sequencing is a sequencing-by-synthesis method.
According to a preferred embodiment of the invention, it further comprises encoding the ambiguous sequence information and the reference genome in the same way and then aligning.
According to a preferred embodiment of the invention, it further comprises encoding the ambiguous sequence information or the reference genome and then aligning. In the coding process, the change of the base arrangement order may be involved, and other letters or symbols can be used instead, so that the same form is adopted and the alignment is facilitated.
According to a preferred embodiment of the invention, it further comprises encoding the reference genome, altering its sequence information and then aligning with the ambiguous sequence information.
According to a preferred embodiment of the invention, the reference genome is encoded, its sequence information is modified, and then aligned with the encoding of the ambiguous sequence information.
According to a preferred embodiment of the present invention, the ambiguous sequence information refers to sequence information from which the complete base sequence of the nucleotide sequence cannot be derived.
According to a preferred embodiment of the present invention, the complete base sequence information refers to nucleic acid sequence information expressed with A, G, T, C or with A, G, U, C; the bases may be methylated bases.
According to a preferred embodiment of the invention, the ambiguous sequence information may be a degenerate sequence represented using M, K, R, Y, W, S, B, D, H, V letters.
According to a preferred embodiment of the present invention, the ambiguous sequence information may be a combination of degenerate sequence information and non-degenerate sequence information.
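The degenerate letters named above are the standard IUPAC nucleotide codes. As a minimal sketch (the names `IUPAC` and `matches` are illustrative and do not come from the patent), the mapping, together with a check of whether a concrete sequence is consistent with a sequence that mixes degenerate and non-degenerate letters, could be written as:

```python
# Standard IUPAC degenerate nucleotide codes for the DNA alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def matches(concrete: str, fuzzy: str) -> bool:
    """Return True if every base of `concrete` is allowed by the code at the same position in `fuzzy`."""
    if len(concrete) != len(fuzzy):
        return False
    return all(base in IUPAC[code] for base, code in zip(concrete, fuzzy))

print(matches("ACGT", "MMKK"))  # True: fully degenerate MK description
print(matches("ACGT", "WWSS"))  # False: C is not a W base
print(matches("ACGT", "AMKT"))  # True: mix of degenerate and non-degenerate letters
```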
According to a preferred embodiment of the present invention, the method further comprises encoding a reference genome and then comparing the encoding of the ambiguous sequence information with the encoding of the reference genome.
According to a preferred embodiment of the present invention, the encoding of the ambiguous sequence information and the encoding of the reference genome result in the same representation.
According to a preferred embodiment of the invention, the sequencing is a sequencing method in which the 3' end is not blocked.
According to a preferred embodiment of the present invention, the reaction solution used for sequencing comprises nucleotide substrate molecules of two or more different bases.
According to a preferred embodiment of the present invention, nucleotide substrate molecules of two or more different bases in a reaction solution used for sequencing are labeled with the same or different fluorescent molecules.
According to a preferred embodiment of the present invention, the reaction solution used for sequencing is a set of reaction solutions, each set of reaction solutions containing two or more reaction solutions.
According to a preferred embodiment of the present invention, the sequencing reaction fluid is a set of reaction fluid sets, each set of reaction fluid sets comprising two reaction fluids, each reaction fluid comprising nucleotides of two different bases; the nucleotide in one reaction liquid can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide in the other reaction liquid can be complementary with the other two bases on the nucleotide sequence to be detected.
According to a preferred embodiment of the invention, the encoded ambiguous sequence information is aligned to the encoded reference genome using the Smith-Waterman algorithm, Bowtie, BWA or SOAP.
According to a preferred embodiment of the invention, mutated genes are found from the alignment result using a common method for finding genetic mutations, preferably one or more of MuTect, Strelka, Control-FREEC and cns-seq.
According to a preferred embodiment of the present invention, the ambiguous sequence information obtained by sequencing is encoded into one of its possible base sequence information.
According to a preferred embodiment of the invention, all ambiguous sequence information in the ambiguous sequence information obtained by sequencing is encoded into numbers.
According to a preferred embodiment of the invention, the coding of the ambiguous sequence information and the coding order of the reference genome are exchangeable.
According to a preferred embodiment of the invention, the fluorescence switching property means that after each sequencing reaction, the fluorescence signal is significantly increased or significantly decreased or the emitted light frequency range is significantly changed compared to before the sequencing reaction.
According to a preferred embodiment of the present invention, the 5'-terminal polyphosphoric acid-modified fluorescent group-modified nucleotide substrate molecule refers to a 5'-terminal polyphosphoric acid-modified fluorescent group-modified nucleotide substrate molecule.
According to a preferred embodiment of the present invention, the sequencing is performed using nucleotide substrate molecules modified, at the terminal or a middle phosphate of the 5' polyphosphate, with a fluorophore having fluorescence-switching properties; the fluorescence-switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it; each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules with two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases; first, the nucleotide sequence fragment to be detected is immobilized in a reaction chamber, and one reaction solution of the set is introduced; an enzyme releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching; the second reaction solution of the same set is then introduced; an enzyme again releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching; the two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.
The invention provides a sequencing reagent, which is characterized in that a nucleotide fragment to be detected is fixed, and fuzzy sequence information is obtained through the reaction of the sequencing reagent and the fixed nucleotide fragment; wherein the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two or more different bases.
According to a preferred embodiment of the present invention, sequencing is performed using a nucleotide substrate molecule sequencing reagent modified with a fluorophore having fluorescence switching properties at the 5' end of the polyphosphate; the fluorescence switching property refers to that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction.
According to a preferred embodiment of the present invention, the nucleotide substrate molecules of two or more different bases in the reaction reagent are labeled with the same or different fluorescent molecules.
According to a preferred embodiment of the present invention, the reaction reagent is a set of reaction solutions, each set of reaction solutions containing two or more reaction solutions.
According to a preferred embodiment of the present invention, the sequencing reagent is a set of reaction solutions, each set of reaction solutions comprising two reaction solutions, each reaction solution comprising nucleotides of two different bases; the nucleotide in one reaction liquid can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide in the other reaction liquid can be complementary with the other two bases on the nucleotide sequence to be detected.
According to a preferred embodiment of the present invention, the sequencing is performed using nucleotide substrate molecules modified, at the terminal or a middle phosphate of the 5' polyphosphate, with a fluorophore having fluorescence-switching properties; the fluorescence-switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it; each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules with two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases; first, the nucleotide sequence fragment to be detected is immobilized, and one reaction solution of the set is introduced; an enzyme releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching; the second reaction solution of the same set is then introduced; an enzyme again releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching; the two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.
The invention provides a nucleic acid sequencing method for obtaining fuzzy nucleic acid coding information, which is characterized in that a nucleotide fragment to be detected is fixed, and a sequencing reagent reacts with the fixed nucleotide fragment to obtain fuzzy sequence information; wherein the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two or more different bases.
According to a preferred embodiment of the present invention, sequencing is performed using a nucleotide substrate molecule sequencing reagent modified with a fluorophore having fluorescence switching properties at the 5' end of the polyphosphate;
The fluorescence switching property refers to that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction.
According to a preferred embodiment of the present invention, the nucleotide substrate molecules of two or more different bases in the reaction reagent are labeled with the same or different fluorescent molecules.
According to a preferred embodiment of the present invention, the reaction reagent is a set of reaction solutions, each set of reaction solutions containing two or more reaction solutions.
According to a preferred embodiment of the present invention, the sequencing reagent is a set of reaction solutions, each set of reaction solutions comprising two reaction solutions, each reaction solution comprising nucleotides of two different bases; the nucleotide in one reaction liquid can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide in the other reaction liquid can be complementary with the other two bases on the nucleotide sequence to be detected.
According to a preferred embodiment of the present invention, the sequencing is performed using nucleotide substrate molecules modified, at the terminal or a middle phosphate of the 5' polyphosphate, with a fluorophore having fluorescence-switching properties; the fluorescence-switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it; each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules with two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases; first, the nucleotide sequence fragment to be detected is immobilized, and one reaction solution of the set is introduced; an enzyme releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching; the second reaction solution of the same set is then introduced; an enzyme again releases the fluorophore from the nucleotide substrate bearing it, causing fluorescence switching; the two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.
The invention provides a system for comparing and identifying mutation of fuzzy sequence information obtained by sequencing, which comprises a computing system and is used for comparing and/or identifying mutation by utilizing the fuzzy sequence information obtained by sequencing.
Ambiguous (fuzzy) sequence information refers to sequence information from which the base sequence of the nucleotide sequence cannot be uniquely determined. Ambiguous base sequences are a common concept in the field; for example, the letter W denotes the base A or T. Relevant definitions can also be found on Wikipedia (https://en.wikipedia.org/wiki/Nucleotide).
Fuzzy coding means that different DNA sequences may have identical coding results. Conversely, the same encoding result may have multiple different sources.
Encoding fuzzy information refers to an operation on DNA sequences under which different sequences may give identical results. Encoding a reference genome refers to the same kind of operation on the reference genome sequence: reference genomes that differ locally may give identical results. One such encoding is a simple rearrangement that ignores the actual order of the bases within each local part of the sequence and orders them by base type. A local part of a sequence is the region of the sequence corresponding to one sequencing reaction (one sequencing run consists of multiple sequencing reactions).
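The many-to-one nature of fuzzy coding can be illustrated with a small sketch. It assumes one particular encoding, in which A and C (the M bases) are rewritten as `A` and G and T (the K bases) as `T`; the function names `encode_mk` and `sources` are hypothetical and chosen only for this illustration.

```python
from itertools import product

M, K = "AC", "GT"  # IUPAC: M = A or C, K = G or T

def encode_mk(seq: str) -> str:
    """Rewrite every M base (A/C) as 'A' and every K base (G/T) as 'T'."""
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    return "".join("A" if base in M else "T" for base in seq)

def sources(code: str) -> list[str]:
    """List every DNA sequence whose MK encoding equals `code` (made of 'A' and 'T' only)."""
    choices = [M if letter == "A" else K for letter in code]
    return ["".join(bases) for bases in product(*choices)]

# Two different DNA sequences give the same encoding result.
print(encode_mk("ACGT"), encode_mk("CAGG"))  # AATT AATT
# Conversely, one encoding result has several possible sources.
print(sources("AT"))  # ['AG', 'AT', 'CG', 'CT']
```

An encoded string of length n has 2^n possible source sequences under this particular scheme, which is why the sequencing result and the reference must be encoded in the same way before they are compared.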
The 2+2 sequencing method of the invention means that each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules with two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases. For example, one set contains two reaction solutions, the first containing A and T substrate molecules and the second containing G and C substrate molecules. A 2+2 sequencing method can be named after the nucleotide composition of its two reaction solutions. For example, if the first reaction solution contains A and T substrate molecules (collectively, W) and the second contains G and C substrate molecules (collectively, S), 2+2 sequencing with this set is called WS sequencing. 2+2 sequencing comprises three combinations, MK, RY and WS, each of which can be further divided into single-color and two-color sequencing.
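As a rough illustration of what single-color 2+2 sequencing observes, the sketch below computes an idealized signal as the number of bases incorporated in each alternating reaction. It is a simplification under stated assumptions: the input string is taken to be the strand being synthesized, each reaction extends through every consecutive base in its pair, and errors, signal decay and dephasing are ignored. The names `PAIRS` and `theoretical_signal` are hypothetical.

```python
# The three 2+2 combinations, each as (first reaction solution, second reaction solution).
PAIRS = {"MK": ("AC", "GT"), "RY": ("AG", "CT"), "WS": ("AT", "CG")}

def theoretical_signal(seq: str, scheme: str = "WS") -> list[int]:
    """Idealized per-reaction incorporation counts for single-color 2+2 sequencing."""
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    solutions = [set(bases) for bases in PAIRS[scheme]]
    signal, pos, turn = [], 0, 0
    while pos < len(seq):
        run = 0
        # The current reaction solution extends through all consecutive bases it can supply.
        while pos < len(seq) and seq[pos] in solutions[turn]:
            run += 1
            pos += 1
        signal.append(run)
        turn ^= 1  # switch to the other reaction solution of the set
    return signal

print(theoretical_signal("AATCGGTA", "WS"))  # [3, 3, 2]
print(theoretical_signal("TTACCCAT", "WS"))  # [3, 3, 2]  (different sequence, same signal)
print(theoretical_signal("GAT", "WS"))       # [0, 1, 2]  (first reaction incorporates nothing)
```

Two different sequences producing the same list is exactly the ambiguity that the encoding step has to preserve.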
The 1+3 sequencing of the invention means that each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, wherein the nucleotide substrate molecules in one reaction solution are complementary to one base on the nucleotide sequence to be detected, and the nucleotide substrate molecules in the other reaction solution are complementary to the other three bases. For example, one set contains two reaction solutions, the first containing A substrate molecules and the second containing G, C and T substrate molecules.
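A 1+3 result can be written with one non-degenerate letter and one three-base IUPAC letter (B, D, H or V, meaning "not A", "not C", "not G" and "not T" respectively). The following is a minimal sketch under the same idealized assumptions as the rest of this description; `encode_1_plus_3` and `run_lengths` are hypothetical helper names.

```python
from itertools import groupby

# IUPAC three-base codes: B = not A, D = not C, H = not G, V = not T.
NOT_CODE = {"A": "B", "C": "D", "G": "H", "T": "V"}

def encode_1_plus_3(seq: str, single: str = "A") -> str:
    """Describe `seq` using the single base and the IUPAC code for the other three bases."""
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    other = NOT_CODE[single]
    return "".join(single if base == single else other for base in seq)

def run_lengths(fuzzy: str) -> list[tuple[str, int]]:
    """Collapse a fuzzy string into (code, count) pairs, one pair per sequencing reaction."""
    return [(code, len(list(group))) for code, group in groupby(fuzzy)]

fuzzy = encode_1_plus_3("AACGTA", single="A")
print(fuzzy)               # AABBBA
print(run_lengths(fuzzy))  # [('A', 2), ('B', 3), ('A', 1)]
```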
The method provided by the invention has the following features. Only one round of 2+2 or 1+3 sequencing is performed; repeated 2+2 or 1+3 sequencing of the same DNA sequence is not required. The nucleotide substrates in the reaction solutions used for each round of sequencing can be labeled with the same fluorophore or with different fluorophores. The invention can encode both the sequencing result and the reference genome, and the encoding is such that if the theoretical sequencing signals of two DNA sequences are identical, their encoding results are identical. The invention uses general-purpose sequence alignment and variant identification methods to align the encoded sequencing result to the encoded reference genome and identify genetic variation. When encoding two-color 2+2 sequencing information, the method requires discarding the first and last substrings of each sequence. The invention provides, for the first time, an application of 2+2 or 1+3 fuzzy sequencing information.
All terms used in the present invention are intended to be given their ordinary meanings in the field of gene sequencing, unless otherwise indicated.
Detailed Description
The compounds, sequencing steps, alignment methods, etc. described in the disclosure are merely further illustrative of the invention, and the terms used are merely used to describe specific forms and are not limiting factors of the invention.
The basic steps of the invention are as follows:
1. The DNA sample is subjected to one round of 2+2 or 1+3 sequencing.
2. The sequencing result and the reference genome are encoded in the same way. The defining feature of the encoding is that if the theoretical sequencing signals of two DNA sequences are identical, their encoding results are identical (even if the two sequences themselves differ). The result of the encoding is one or more strings (or sequences).
3. The encoded sequencing results are aligned to the encoded reference genome using a commonly used sequence alignment method (e.g., the Smith-Waterman algorithm, Bowtie, BWA, SOAP, etc.).
4. Genetic variation is found from the alignment results of step 3 using a commonly used variant discovery method (e.g., MuTect, Strelka, Control-FREEC, cns-seq, GATK, etc.).
5. The genetic variation found in step 4 is interpreted according to the encoding method of step 2.
Theoretical sequencing signals are the signals that would be obtained under ideal conditions, disregarding anomalies such as sequencing errors, signal attenuation, and loss of synchrony among DNA molecules. Theoretical sequencing signals directly reflect the base composition of the DNA sequence.
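As an illustration of a theoretical sequencing signal, the following Python sketch simulates one monochrome 2+2 round. It rests on an assumption not spelled out in this paragraph: each reagent addition extends the strand through every consecutive base the bottle can supply, and the signal equals the number of bases incorporated. The function name `theoretical_signal` and its parameters are hypothetical.

```python
def theoretical_signal(read, first="AC", second="GT", max_flows=100):
    """Return the per-addition incorporation counts for `read`.

    `read` is the newly synthesized strand (5'->3'); `first` and `second`
    are the base sets supplied by the two bottles, added alternately.
    Assumption: an ideal reaction with no errors, decay or loss of synchrony.
    """
    bottles = (set(first), set(second))
    signal, pos, flow = [], 0, 0
    while pos < len(read) and flow < max_flows:
        allowed = bottles[flow % 2]
        start = pos
        # Extend through every consecutive base this bottle can supply.
        while pos < len(read) and read[pos] in allowed:
            pos += 1
        signal.append(pos - start)
        flow += 1
    return signal


print(theoretical_signal("AAGTGGCACT"))  # [2, 4, 3, 1]
print(theoretical_signal("CAGTGGACCT"))  # [2, 4, 3, 1]
```

The two sequences differ, yet their theoretical signals coincide, so any valid coding must map them to the same string.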
The above coding method may or may not satisfy the following "coding and reverse complement exchangeable" property: encoding a DNA sequence and then taking the reverse complement gives the same result as taking the reverse complement and then encoding. For example, for monochrome MK sequencing of a DNA sequence, let the coding scheme be: every measured M is rewritten as A and every measured K is rewritten as T. Then, taking the sequence AAGTGGCACT as an illustration, encoding gives AATTTTAAAT, whose reverse complement is ATTTAAAATT; the reverse complement of the sequence is AGTGCCACTT, which is also encoded as ATTTAAAATT.
It can be seen that this coding satisfies the "coding and reverse complement exchangeable" property. However, suppose the coding scheme is defined as: every measured M is rewritten as A and every measured K is rewritten as C.
Then, for the same sequence, encoding gives AACCCCAAAC, whose reverse complement is GTTTGGGGTT, whereas the reverse complement AGTGCCACTT is encoded as ACCCAAAACC.
This coding does not satisfy the "coding and reverse complement exchangeable" property.
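The property can be checked mechanically. The sketch below is a minimal Python illustration for monochrome MK coding; the helper names (`reverse_complement`, `encode_mk`, `commutes`) and the test sequence are illustrative choices, not part of the patent.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def encode_mk(seq, m_char, k_char):
    # Monochrome MK: A and C are read as M, G and T are read as K.
    return "".join(m_char if base in "AC" else k_char for base in seq)


def commutes(seq, m_char, k_char):
    """True if encoding and reverse complementation can be exchanged."""
    encode_then_rc = reverse_complement(encode_mk(seq, m_char, k_char))
    rc_then_encode = encode_mk(reverse_complement(seq), m_char, k_char)
    return encode_then_rc == rc_then_encode


print(commutes("AAGTGGCACT", "A", "T"))  # True:  M->A, K->T
print(commutes("AAGTGGCACT", "A", "C"))  # False: M->A, K->C
```

The scheme works when the two replacement characters are complementary to each other, because the complement of an M base (A or C) is always a K base (T or G).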
If the selected coding scheme does not satisfy the "coding and reverse complement exchangeable" property, then in step 2 both the reference genome and its reverse complement must be encoded, and in step 3 the encoded sequencing result of each DNA molecule is aligned against both encodings and the better alignment is kept. If the coding scheme satisfies the property, only the reference genome needs to be encoded in step 2; its reverse complement does not.
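The two-strand procedure for a non-exchangeable scheme can be sketched as follows. This is a toy Python illustration only: it substitutes an exhaustive ungapped mismatch search for a real aligner such as Bowtie or BWA, and the function names and the short reference are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def encode(seq):
    # Non-exchangeable scheme from the text: M (A/C) -> A, K (G/T) -> C.
    return "".join("A" if base in "AC" else "C" for base in seq)


def best_ungapped_hit(query, target):
    """Return (mismatches, offset) of the best ungapped placement."""
    best = (len(query) + 1, -1)
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        mismatches = sum(q != t for q, t in zip(query, window))
        best = min(best, (mismatches, offset))
    return best


def align_both_strands(read, reference):
    """Align the encoded read to both encoded strands; keep the better hit."""
    encoded_read = encode(read)
    candidates = {
        "+": best_ungapped_hit(encoded_read, encode(reference)),
        "-": best_ungapped_hit(encoded_read, encode(reverse_complement(reference))),
    }
    strand = min(candidates, key=candidates.get)
    return strand, candidates[strand]


reference = "ACGATGTCAACGCTGA"
read = reverse_complement(reference[2:12])  # a read from the opposite strand
print(align_both_strands(read, reference))  # ('-', (0, 4))
```

With an exchangeable scheme the second search becomes unnecessary, since a standard aligner already handles the reverse-complement strand of the encoded reference.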
Examples of coding schemes consistent with the "code and reverse complement exchangeable" property in monochrome 2+2 sequencing:
MK sequencing: 1) M is rewritten as A and K is rewritten as T; or 2) M is rewritten as C and K is rewritten as G;
RY sequencing: 1) R is rewritten to A, Y is rewritten to T; or 2) R is rewritten as C and Y is rewritten as G;
WS sequencing: a coding method for monochrome WS sequencing that satisfies the "coding and reverse complement exchangeable" property is to encode each W character as the string AT and each S character as the string CG. Thus WW is encoded as ATAT, SS as CGCG, WWW as ATATAT, SSS as CGCGCG, and so on.
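WS needs a two-base code because W (A/T) and S (C/G) are each their own complement, and no single base is its own complement, whereas the strings AT and CG are their own reverse complements. A minimal Python sketch of this coding follows; the helper names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def to_ws_calls(seq):
    # Monochrome WS observes only the class of each base: W = A/T, S = C/G.
    return "".join("W" if base in "AT" else "S" for base in seq)


def encode_ws(calls):
    # Each call becomes a two-base string that is its own reverse complement.
    return "".join({"W": "AT", "S": "CG"}[call] for call in calls)


seq = "AAGTGGCACT"
encoded = encode_ws(to_ws_calls(seq))
print(encode_ws("WWW"), encode_ws("SSS"))  # ATATAT CGCGCG
# Encoding and reverse complementation can be exchanged:
print(reverse_complement(encoded) == encode_ws(to_ws_calls(reverse_complement(seq))))  # True
```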
Examples of coding schemes consistent with the "code and reverse complement exchangeable" property in two-color 2+2 sequencing:
1. The sequence is sequentially partitioned into a number of substrings, each substring containing only bases corresponding to the 2+2 sequencing combination. For example, in two-color MK sequencing, each substring consists of A and/or C only, or G and/or T only. For example, sequence AAGTGGCACT is partitioned into (AA, GTGG, CAC, T).
2. The bases within each substring are sorted in ascending alphabetical order. For example, (AA, GTGG, CAC, T) is rearranged to (AA, GGGT, ACC, T).
3. The rearranged substrings are concatenated in order to form a new string, which is taken as the coding result. For example, (AA, GGGT, ACC, T) is concatenated into the string AAGGGTACCT.
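The three steps above can be sketched in Python as follows. The function name `encode_two_color` and its `first` parameter (the bases supplied by the first bottle) are illustrative, not taken from the patent.

```python
from itertools import groupby


def encode_two_color(seq, first="AC"):
    """Partition, sort and concatenate, following steps 1-3.

    `first` lists the two bases of one 2+2 group ("AC" for MK, "AG" for RY,
    "AT" for WS); the other two bases form the second group.
    """
    # Step 1: split into maximal runs of bases belonging to the same group.
    parts = ["".join(run) for _, run in groupby(seq, key=lambda b: b in first)]
    # Step 2: sort the bases inside each run alphabetically.
    sorted_parts = ["".join(sorted(part)) for part in parts]
    # Step 3: concatenate the sorted runs in their original order.
    return parts, sorted_parts, "".join(sorted_parts)


parts, sorted_parts, code = encode_two_color("AAGTGGCACT")
print(parts)         # ['AA', 'GTGG', 'CAC', 'T']
print(sorted_parts)  # ['AA', 'GGGT', 'ACC', 'T']
print(code)          # AAGGGTACCT
```

Sorting discards the order of bases within a run, which two-color 2+2 sequencing cannot observe, while keeping the count of each base, which it can.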
This two-color coding scheme satisfies the "coding and reverse complement exchangeable" property. For example, the reverse complement of AAGTGGCACT is AGTGCCACTT, which is encoded as AGGTACCCTT, and this is exactly the reverse complement of AAGGGTACCT.
To improve alignment accuracy in step 3, the first and last substrings of each sequence in the two-color 2+2 encoding may need to be discarded, because these two parts are prone to alignment errors. In the above example, sequence AAGTGGCACT is then encoded as GGGTACC.
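A short Python sketch of the trimmed encoding, which also checks that dropping the two end substrings preserves the strand symmetry. The helper names and the regular-expression-based partitioning are illustrative choices.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def encode_trimmed(seq, pattern=r"[AC]+|[GT]+"):
    """Two-color MK encoding with the first and last substrings discarded."""
    runs = re.findall(pattern, seq)              # maximal same-group runs
    sorted_runs = ["".join(sorted(run)) for run in runs]
    return "".join(sorted_runs[1:-1])            # drop both end substrings


seq = "AAGTGGCACT"
print(encode_trimmed(seq))                      # GGGTACC
print(encode_trimmed(reverse_complement(seq)))  # GGTACCC
# The two results are reverse complements of each other:
print(reverse_complement(encode_trimmed(seq)) == encode_trimmed(reverse_complement(seq)))  # True
```

The symmetry holds because complementing a sorted run reverses its alphabetical order (A, C, G, T map to T, G, C, A), and reversing the run restores ascending order.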
Unless otherwise specified, the monochrome and two-color 2+2 encodings used in the following examples are those given above. dMK, dRY and dWS denote two-color MK, two-color RY and two-color WS sequencing; sMK and sRY denote monochrome MK and monochrome RY sequencing, respectively. The following specific embodiments are presented to further illustrate the invention. The specific parameters, steps, etc. involved are conventional in the art. The detailed description and examples do not limit the scope of the invention. Unless specifically indicated, all terms used in this application have their ordinary meaning in the art. Unless otherwise specified, all gene sequences referred to in the present invention are commercially synthesized sequences. Many companies routinely synthesize such sequences, for example Invitrogen.
Example 1
According to the description of the invention, human genomic DNA samples (Human CEPH Genomic DNA reagent in the Ion PI™ Controls Kit from Thermo Inc., Cat. No. 4488985) were sequenced by two-color MK, two-color RY, two-color WS, monochrome MK and monochrome RY sequencing, one million DNA sequences each. The results were aligned to the correspondingly encoded genome using Bowtie2, and the proportion of DNA sequences that aligned to a unique position on the encoded genome (unique alignment) was counted. The results were compared with the sequencing results of an Illumina sequencer (HiSeq 2000), which yields complete DNA sequence information. The unique alignment rates are as follows:
In the table, dMK denotes the two-color MK sequencing method; the lowercase letters d and s denote two-color and monochrome sequencing, respectively.
Example 2
According to the description of the invention, E. coli genomic DNA samples (Thermo E. coli DNA Control, Cat. No. 4458450) were sequenced by two-color MK, two-color RY, two-color WS, monochrome MK and monochrome RY sequencing, one million DNA sequences each. The results were aligned to the correspondingly encoded genome using Bowtie2, and the proportion of DNA sequences that aligned to a unique position on the encoded genome (unique alignment) was counted. The results were compared with the sequencing results of an Illumina sequencer, which yields complete DNA sequence information. The unique alignment rates are shown in the following table:
Example 3
Since the present invention infers genetic variation from only partial information about the DNA sequence, some genetic variations cannot, even in theory, be found by the invention. For example, in monochrome MK sequencing the point mutation A→C cannot be found (although it can in theory be found by monochrome RY sequencing); in two-color MK sequencing, a mutation that swaps two adjacent bases AC into CA cannot be found either. We counted the proportion of all human SNVs known to date (downloaded from the dbSNP database: https://www.ncbi.nlm.nih.gov/pnp, file name All_2015105.vcf.gz) that cannot in theory be detected by the present invention, as shown in the following table:
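Whether a given variant is detectable in theory follows directly from the coding: it is invisible exactly when the reference and variant sequences are encoded identically. The Python sketch below illustrates this test on short sequences; the function names are hypothetical, the monochrome encoder uses the characters 0 and 1 for the two base groups purely for illustration, and this is not the procedure used to produce the table.

```python
import re


def encode_mono(seq, first="AC"):
    # Monochrome 2+2: only the group of each base is observed.
    return "".join("0" if base in first else "1" for base in seq)


def encode_two_color(seq, first="AC"):
    # Two-color 2+2: base counts within each same-group run are observed.
    second = "".join(sorted(set("ACGT") - set(first)))
    runs = re.findall(f"[{first}]+|[{second}]+", seq)
    return "".join("".join(sorted(run)) for run in runs)


def detectable(ref, alt, encoder, **options):
    """A variant is detectable in theory iff the two encodings differ."""
    return encoder(ref, **options) != encoder(alt, **options)


print(detectable("GGATT", "GGCTT", encode_mono))              # False: A->C under MK
print(detectable("GGATT", "GGCTT", encode_mono, first="AG"))  # True:  A->C under RY
print(detectable("GGACTT", "GGCATT", encode_two_color))       # False: AC->CA under two-color MK
```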
Example 4
Three rounds of monochrome 2+2 sequencing: three sets of reaction solutions are prepared. Each set comprises two bottles, and each bottle contains two types of fluorophore-labeled bases; the fluorophores are those commonly used for nucleic acid labeling. The two bottles of a set together contain exactly the four bases. No two of the six bottles have the same composition.
The complete sequencing process comprises three rounds, performed one after another, and each round uses one of the three sets of reagents. Apart from the reagents, the rounds are identical (the same sequencing primer and the same reaction conditions).
Each round of sequencing comprises:
1. Hybridize the sequencing primer to the prepared DNA array.
2. Start the sequencing process. Steps 2.1-2.4 are repeated a finite number of times.
2.1 Introduce the first bottle of reagent, let the reaction proceed, and collect the fluorescence signal.
2.2 Wash all residual reaction solution and generated fluorescent molecules out of the flow cell.
2.3 Introduce the second bottle of reagent, let the reaction proceed, and collect the fluorescence signal.
2.4 Wash all residual reaction solution and generated fluorescent molecules out of the flow cell.
3. Remove the extended sequencing primer by denaturation.
The next round of sequencing can then be performed.
Preparing a reaction solution:
Prepare the sequencing reaction wash solution (wash solution for short), which contains:
20 mM Tris-HCl pH 8.8
10 mM (NH4)2SO4
50 mM KCl
2 mM MgSO4
0.1% Tween 20
Prepare the sequencing reaction mother solution (mother solution for short), which contains:
20 mM Tris-HCl pH 8.8
10 mM (NH4)2SO4
50 mM KCl
2 mM MgSO4
0.1% Tween 20
8000 unit/mL Bst polymerase
100 unit/mL CIP
Three sets of sequencing reaction solutions were prepared, six bottles in total, as follows:
1A: mother solution + 20 µM dA4P-TG + 20 µM dC4P-TG
1B: mother solution + 20 µM dG4P-TG + 20 µM dT4P-TG
2A: mother solution + 20 µM dA4P-TG + 20 µM dG4P-TG
2B: mother solution + 20 µM dC4P-TG + 20 µM dT4P-TG
3A: mother solution + 20 µM dA4P-TG + 20 µM dT4P-TG
3B: mother solution + 20 µM dC4P-TG + 20 µM dG4P-TG
The prepared reaction solutions and mother solution are kept in a 4 °C refrigerator or on ice until use.
Hybridization sequencing primer:
The sequencing chip was filled with sequencing primer solution (10 µM, dissolved in 1× SSC buffer), heated to 90 °C and cooled to 40 °C at a rate of 5 °C per minute. The sequencing primer solution was then rinsed off with wash solution.
First round of sequencing:
The sequencing chip was placed on the sequencer.
Sequencing was performed using the first set of reaction solutions, following the procedure below.
1. Introduce 10 mL of wash solution to wash the chip.
2. Cool the chip to 4 °C.
3. Introduce 100 µL of reaction solution 1A.
4. Heat the chip to 65 °C.
5. Wait for 1 min.
6. Excite with a 473 nm laser and capture a fluorescence image.
7. Introduce 10 mL of wash solution to wash the chip.
8. Cool the chip to 4 °C.
9. Introduce 100 µL of reaction solution 1B.
10. Heat the chip to 65 °C.
11. Wait for 1 min.
12. Excite with a 473 nm laser and capture a fluorescence image.
Steps 1-12 were repeated 50 times to obtain 100 fluorescence signals.
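The cycle arithmetic can be made explicit with a small Python sketch that expands the twelve steps into a flat schedule. The function name and the step strings are illustrative only and do not correspond to any instrument control software.

```python
def first_round_schedule(cycles=50, bottles=("1A", "1B")):
    """Expand the 12-step cycle into a flat list of (cycle, action) pairs."""
    steps = []
    for cycle in range(1, cycles + 1):
        for bottle in bottles:
            steps += [
                (cycle, "wash chip with 10 mL wash solution"),
                (cycle, "cool chip to 4 C"),
                (cycle, f"introduce 100 uL of reaction solution {bottle}"),
                (cycle, "heat chip to 65 C"),
                (cycle, "wait 1 min"),
                (cycle, f"excite at 473 nm and capture image ({bottle})"),
            ]
    return steps


schedule = first_round_schedule()
images = [action for _, action in schedule if "capture image" in action]
print(len(schedule), len(images))  # 600 100
```

Each cycle yields one image per bottle, so 50 cycles give the 100 fluorescence signals stated above.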
Example 5
Three rounds of two-color 2+2 sequencing: three sets of reaction solutions are prepared. Each set comprises two bottles, and each bottle contains two bases. The two bases in a bottle are labeled with different fluorophores, which have different emission wavelengths, so that they can be distinguished.
In this example, two fluorophores, X and Y, are used for all bases. The two bottles of a set together contain exactly the four bases. No two of the six bottles have the same composition.
              First bottle    Second bottle
First set     AX+CY           GX+TY
Second set    AX+GY           CX+TY
Third set     AX+TY           CX+GY
(X and Y each denote a fluorophore commonly used for nucleic acid labeling.)
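The stated constraints on the reagent sets (each set covers all four bases, each bottle uses both fluorophores, no bottle is repeated) can be verified with a short Python sketch. The data structure and the name `check_sets` are illustrative.

```python
# Each bottle maps a base to its fluorophore, as in the table above.
SETS = {
    "first":  ({"A": "X", "C": "Y"}, {"G": "X", "T": "Y"}),
    "second": ({"A": "X", "G": "Y"}, {"C": "X", "T": "Y"}),
    "third":  ({"A": "X", "T": "Y"}, {"C": "X", "G": "Y"}),
}


def check_sets(sets):
    """Validate the reagent sets; return the number of distinct bottles."""
    seen = set()
    for name, (bottle_1, bottle_2) in sets.items():
        # The two bottles of a set split the four bases between them.
        assert set(bottle_1) | set(bottle_2) == set("ACGT"), name
        assert not set(bottle_1) & set(bottle_2), name
        for bottle in (bottle_1, bottle_2):
            # The two bases in one bottle carry different fluorophores.
            assert sorted(bottle.values()) == ["X", "Y"], name
            bases = frozenset(bottle)
            assert bases not in seen, name  # no bottle is repeated
            seen.add(bases)
    return len(seen)


print(check_sets(SETS))  # 6
```

The three sets correspond to the MK (A+C / G+T), RY (A+G / C+T) and WS (A+T / C+G) groupings.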
The complete sequencing process comprises three rounds, performed one after another, and each round uses one of the three sets of reagents. Apart from the reagents, the rounds are identical.
Each round of sequencing comprises:
1. Hybridize the sequencing primer to the prepared DNA array.
2. Start the sequencing process. Steps 2.1-2.4 are repeated a finite number of times.
2.1 Introduce the first bottle of reagent, let the reaction proceed, and collect the fluorescence signals at the two wavelengths.
2.2 Wash all residual reaction solution and generated fluorescent molecules out of the flow cell.
2.3 Introduce the second bottle of reagent, let the reaction proceed, and collect the fluorescence signals at the two wavelengths.
2.4 Wash all residual reaction solution and generated fluorescent molecules out of the flow cell.
3. Remove the extended sequencing primer by denaturation.
The next round of sequencing can then be performed.
Example 6
Examples 4 and 5 are complete sequencing schemes. It is generally held that complete, unambiguous sequence information is obtained with the sequencing workflows of Examples 4 and 5, or with at least two rounds of sequencing. When a reference genome is available, however, a single round of sequencing, which yields only fuzzy sequence information, is enough to align against the reference genome or to find variation.
Starting from Example 4, only one of the three sets of reaction solutions needs to be prepared, and one round of sequencing is performed with its two bottles of reaction solution. The specific sequencing steps may be the same as in Example 4.
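The reason two rounds resolve every base while one round does not can be illustrated with a short Python sketch. It assumes, for simplicity, that each round has already been converted into one degenerate call per base; the names `CLASSES` and `decode` are illustrative.

```python
# Degenerate base classes used by the three 2+2 groupings.
CLASSES = {"M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG"}


def decode(calls_a, calls_b):
    """Combine per-base calls from two different 2+2 rounds."""
    if len(calls_a) != len(calls_b):
        raise ValueError("both rounds must cover the same bases")
    bases = []
    for call_a, call_b in zip(calls_a, calls_b):
        common = set(CLASSES[call_a]) & set(CLASSES[call_b])
        # Two different groupings share exactly one base; otherwise report N.
        bases.append(common.pop() if len(common) == 1 else "N")
    return "".join(bases)


mk_round = "MMKKKKMMMK"  # MK calls for AAGTGGCACT
ry_round = "RRRYRRYRYY"  # RY calls for the same sequence
print(decode(mk_round, ry_round))  # AAGTGGCACT
print(decode(mk_round, mk_round))  # NNNNNNNNNN
```

A single round leaves two candidates per base, which is the fuzzy sequence information that the encoding and alignment steps are designed to use directly.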
Example 7
On the basis of example 5, only any one of three sets of reaction solutions was prepared, and one round of sequencing was performed using two bottles of the reaction solutions. The specific sequencing steps may be the same as in example 5.
For further details of the sequencing method of the present invention, reference may be made to the applicant's previously filed patents CN201510822361.9 or CN 2015110815685. X, which are not described in detail here. It is expressly stated that the specific sequencing steps do not limit the scope of the present invention.

Claims (10)

1. A method for comparing sequencing fuzzy sequence information is characterized in that,
Fixing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction;
comparing the fuzzy sequence information with a reference nucleic acid sequence;
Wherein, the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two different bases;
sequencing refers to sequencing by utilizing a nucleotide substrate molecule of which the 5' -end is modified with a fluorophore with fluorescence switching property on polyphosphoric acid;
the fluorescence switching property means that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction;
The sequencing reaction is a sequencing method in which the 3' end is not blocked;
comparing the fuzzy sequence information with the reference nucleic acid sequence means that the fuzzy sequence information and the reference nucleic acid sequence are encoded in the same mode and then are compared;
wherein, the comparing of the fuzzy sequence information and the reference nucleic acid sequence comprises the following steps:
(1) Encoding the sequencing result and the reference nucleic acid sequence by the same method;
(2) Comparing the encoded sequencing result to the encoded reference nucleic acid sequence;
(3) Identifying genetic variation from the comparison results.
2. A method for comparing fuzzy sequence information by sequencing is characterized in that,
Fixing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction;
comparing the fuzzy sequence information with a reference nucleic acid sequence;
Wherein, the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with three different bases;
sequencing refers to sequencing by utilizing a nucleotide substrate molecule of which the 5' -end is modified with a fluorophore with fluorescence switching property on polyphosphoric acid;
the fluorescence switching property means that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction;
The sequencing reaction is a sequencing method in which the 3' end is not blocked;
comparing the fuzzy sequence information with the reference nucleic acid sequence means that the fuzzy sequence information and the reference nucleic acid sequence are encoded in the same mode and then are compared;
wherein, the comparing of the fuzzy sequence information and the reference nucleic acid sequence comprises the following steps:
(1) Encoding the sequencing result and the reference nucleic acid sequence by the same method;
(2) Comparing the encoded sequencing result to the encoded reference nucleic acid sequence;
(3) Identifying genetic variation from the comparison results.
3. A method according to claim 1 or 2, characterized in that,
Each sequencing uses one set of reaction liquid group, wherein each set of reaction liquid group comprises two or more reaction liquids, and each reaction liquid comprises nucleotide substrate molecules with at least two different bases.
4. A method according to claim 1, characterized in that,
The ambiguous sequence information is a combination of degenerate sequence information and non-degenerate sequence information.
5. A method according to claim 1 or 2, characterized in that,
The ambiguous sequence information obtained by sequencing is encoded into one of its possible base sequence information.
6. A method according to claim 1 or 2, characterized in that,
In the fuzzy sequence information obtained by sequencing, all the fuzzy sequence information is encoded as numbers.
7. A method according to claim 1 or 2, characterized in that,
The ambiguous sequence information and the reference nucleic acid sequence are encoded simultaneously or sequentially.
8. A method according to claim 1, characterized in that,
The nucleotide substrate molecule of the fluorophore with the fluorescent switching property modified by the 5 '-terminal polyphosphoric acid refers to the nucleotide substrate molecule of the fluorophore with the fluorescent switching property modified by the 5' -terminal polyphosphoric acid.
9. A method according to claim 1, characterized in that,
Sequencing using a nucleotide substrate molecule modified with a fluorophore having a fluorescence switching property at the 5' polyphosphate end or the middle phosphate;
The fluorescence switching property means that the fluorescence signal intensity is obviously increased after sequencing compared with that before sequencing reaction;
Each set of reaction liquid group is used for sequencing, each set of reaction liquid group comprises two reaction liquids, and each reaction liquid contains nucleotide substrate molecules with two different bases;
The nucleotide substrate molecules in one reaction liquid can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide substrate molecules in the other reaction liquid can be complementary with the other two bases on the nucleotide sequence to be detected;
Firstly, fixing a nucleotide sequence fragment to be detected in a reaction chamber, and then introducing one reaction solution in a set of reaction solution groups;
Releasing the fluorophore on the nucleotide substrate having the fluorophore with fluorescence switching properties using an enzyme, thereby resulting in fluorescence switching;
then introducing a second reaction liquid in the same set of reaction liquid groups;
Releasing the fluorophore on the nucleotide substrate of the fluorophore having fluorescence switching properties using an enzyme, thereby causing fluorescence switching;
and (3) circularly adding the two reaction solutions, and obtaining fuzzy coding information of the nucleotide substrate to be detected through fluorescence information.
10. A system for comparing ambiguous sequence information obtained from sequencing comprises a computing system, wherein,
Use of the method of any of the preceding claims; comparing the fuzzy sequence information obtained by sequencing with a reference nucleic acid sequence.
CN202010525168.XA 2016-12-01 2016-12-01 Sequencing fuzzy sequence information comparison method Active CN111667882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010525168.XA CN111667882B (en) 2016-12-01 2016-12-01 Sequencing fuzzy sequence information comparison method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010525168.XA CN111667882B (en) 2016-12-01 2016-12-01 Sequencing fuzzy sequence information comparison method
CN201611088606.0A CN108165616B (en) 2016-12-01 2016-12-01 Method and system for comparing and identifying variation by using fuzzy nucleic acid sequencing information

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201611088606.0A Division CN108165616B (en) 2016-12-01 2016-12-01 Method and system for comparing and identifying variation by using fuzzy nucleic acid sequencing information

Publications (2)

Publication Number Publication Date
CN111667882A CN111667882A (en) 2020-09-15
CN111667882B true CN111667882B (en) 2024-05-14

Family

ID=62525863

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202010525787.9A Active CN111575355B (en) 2016-12-01 2016-12-01 Sequencing fuzzy sequence analysis method
CN202010525168.XA Active CN111667882B (en) 2016-12-01 2016-12-01 Sequencing fuzzy sequence information comparison method
CN201611088606.0A Active CN108165616B (en) 2016-12-01 2016-12-01 Method and system for comparing and identifying variation by using fuzzy nucleic acid sequencing information

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010525787.9A Active CN111575355B (en) 2016-12-01 2016-12-01 Sequencing fuzzy sequence analysis method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201611088606.0A Active CN108165616B (en) 2016-12-01 2016-12-01 Method and system for comparing and identifying variation by using fuzzy nucleic acid sequencing information

Country Status (1)

Country Link
CN (3) CN111575355B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102883B (en) * 2020-08-20 2023-12-08 深圳华大生命科学研究院 Base sequence coding method and system in FASTQ file compression
CN114561453A (en) * 2022-01-28 2022-05-31 赛纳生物科技(北京)有限公司 Method for qualitatively or quantitatively analyzing target sample through degenerate sequencing
CN114540471B (en) * 2022-01-28 2024-05-14 赛纳生物科技(北京)有限公司 Method and system for performing comparison by using missing nucleic acid sequencing information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102329884A (en) * 2011-10-20 2012-01-25 东南大学 Synchronous synthesis and DNA sequencing method for two nucleotides and application thereof
CN103951724A (en) * 2014-04-30 2014-07-30 南京普东兴生物科技有限公司 Specially modified nucleotide as well as application thereof in high-throughput sequencing
CN104662165A (en) * 2012-03-30 2015-05-27 加利福尼亚太平洋生物科学股份有限公司 Methods and composition for sequencing modified nucleic acids
CN104910229A (en) * 2015-04-30 2015-09-16 北京大学 Poly phosphoric acid end fluorescent labeled nucleotide and application thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100130368A1 (en) * 1998-07-30 2010-05-27 Shankar Balasubramanian Method and system for sequencing polynucleotides
US20100035249A1 (en) * 2008-08-05 2010-02-11 Kabushiki Kaisha Dnaform Rna sequencing and analysis using solid support
CN102634586B (en) * 2012-04-27 2013-10-30 东南大学 Decoding and sequencing method by real-time synthesis of two nucleotides into deoxyribonucleic acid (DNA)
CN106755292B (en) * 2015-11-19 2019-06-18 赛纳生物科技(北京)有限公司 A kind of nucleic acid molecule sequencing approach of phosphoric acid modification fluorogen

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
第二代测序序列比对方法综述;杨烨;刘娟;;武汉大学学报(理学版)(第05期);全文 *

Also Published As

Publication number Publication date
CN111667882A (en) 2020-09-15
CN111575355A (en) 2020-08-25
CN108165616A (en) 2018-06-15
CN108165616B (en) 2020-09-29
CN111575355B (en) 2023-03-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant