CN111667882B

CN111667882B - Sequencing fuzzy sequence information comparison method

Info

Publication number: CN111667882B
Application number: CN202010525168.XA
Authority: CN
Inventors: 周文雄; 陈子天; 康力; 乔朔; 段海峰; 黄岩谊
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2016-12-01
Filing date: 2016-12-01
Publication date: 2024-05-14
Anticipated expiration: 2036-12-01
Also published as: CN111667882A; CN111575355A; CN108165616A; CN108165616B; CN111575355B

Abstract

The invention provides a method for comparing sequencing fuzzy sequence information, which comprises the following steps: fixing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction; comparing the fuzzy sequence information with a reference genome; mutations can be identified at the same time. The method provided by the invention does not require the complete nucleic acid base sequence; alignment and variant discovery are carried out using only the fuzzy information obtained by sequencing with multi-base reaction solutions, thereby reducing the cost of sequencing and accelerating the alignment.

Description

Sequencing fuzzy sequence information comparison method

Technical Field

The invention relates to a method and a system for comparing sequencing fuzzy sequence information, belonging to the field of gene sequencing.

Background

High-throughput sequencing technology, also known as next generation sequencing technology (NGS), is a new class of sequencing technology developed in recent years. High throughput sequencing technology is a revolutionary change to traditional sequencing technology, with simultaneous sequencing of tens of thousands to millions of nucleic acid molecules. High throughput sequencing can produce large amounts of data. The processing and utilization of data is an important component of high throughput sequencing.

High-throughput sequencing technology can detect genetic variation and thereby provide a basis for clinical diagnosis, screening and the like. Genetic variations include single nucleotide variations (SNV), copy number variations (CNV), fold-over-chromosome variations, DNA modification variations (e.g., DNA methylation), and the like. Clinical diagnosis requires the ability to detect genetic variation rapidly, accurately and at low cost. However, the existing genetic variation detection methods based on high-throughput sequencing all need to obtain the complete DNA sequence first and only then look for variation, which increases both time and monetary cost. The invention provides a fuzzy comparison method that uses fuzzy nucleic acid sequences to perform alignment and search for variation rapidly.

Disclosure of Invention

The present invention provides a method for obtaining partial information of a DNA sequence, aligning the partial information to a reference genome, and using the partial information to find/identify genetic variations.

The invention provides a method for comparing sequencing fuzzy sequence information, which is characterized in that,

Fixing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction;

comparing the fuzzy sequence information with a reference nucleic acid sequence;

Wherein, the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two different bases;

the sequencing is performed using nucleotide substrate molecules whose 5'-terminal polyphosphate is modified with a fluorophore having fluorescence switching properties;

the fluorescence switching property means that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction;

The sequencing reaction is a sequencing method in which the 3' end is unblocked;

The alignment of the ambiguous sequence information with the reference nucleic acid sequence refers to encoding the ambiguous sequence information with the reference nucleic acid sequence in the same manner and then aligning.

The invention provides a method for comparing fuzzy sequence information by sequencing, which is characterized in that,

Wherein, the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with three different bases;

The sequencing reaction is a sequencing method in which the 3' end is unblocked;

According to a preferred embodiment, one set of reaction solutions is used for each sequencing, each set comprising two or more reaction solutions, each reaction solution comprising nucleotide substrate molecules of at least two different bases.

According to a preferred embodiment, the ambiguous sequence information is a combination of degenerate sequence information and non-degenerate sequence information.

According to a preferred embodiment, the ambiguous sequence information obtained by sequencing is encoded into one of its possible base sequence information.

According to a preferred embodiment, all of the ambiguous sequence information obtained by sequencing is encoded as numbers.

According to a preferred embodiment, the ambiguous sequence information is encoded simultaneously or sequentially with the reference nucleic acid sequence.

According to a preferred embodiment, the 5'-terminal polyphosphoric acid-modified fluorophore having fluorescence switching properties is a 5'-terminal polyphosphoric acid-modified fluorophore.

According to a preferred embodiment, the sequencing is performed using nucleotide substrate molecules in which the terminal or a middle phosphate of the 5' polyphosphate is modified with a fluorophore having fluorescence switching properties;

The fluorescence switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it;

One set of reaction solutions is used for each round of sequencing, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules of two different bases;

The nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and the nucleotide substrate molecules in the other reaction solution are complementary to the other two bases on the nucleotide sequence to be detected;

First, the nucleotide sequence fragment to be detected is fixed in a reaction chamber, and one reaction solution of the set is introduced;

An enzyme releases the fluorophore from the nucleotide substrate bearing the fluorophore with fluorescence switching properties, thereby causing fluorescence switching;

The second reaction solution of the same set is then introduced;

An enzyme again releases the fluorophore from the nucleotide substrate bearing the fluorophore with fluorescence switching properties, thereby causing fluorescence switching;

The two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.

The invention provides a system for comparing fuzzy sequence information obtained by sequencing, which comprises a computing system and is characterized in that,

Using the method of any one of the preceding claims; comparing the fuzzy sequence information obtained by sequencing with a reference nucleic acid sequence.

The invention provides a method for comparing and identifying mutation by fuzzy sequence information obtained by sequencing, which comprises the following steps: fixing the nucleotide fragment to be detected, and obtaining fuzzy sequence information through a sequencing reaction; comparing the fuzzy sequence information with a reference genome; wherein the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two or more different bases.

The reaction liquid of the sequencing reaction comprises two or more nucleotide substrate molecules with different bases. When it is subjected to a sequencing reaction, sequence information corresponding to the nucleotide substrate molecules in the sequencing reaction solution is obtained each time. The information may contain two or more kinds of base number information, and is not specific sequence information but ambiguous sequence information.

According to a preferred embodiment of the present invention, the sequencing is performed using 5' -terminal polyphosphoric acid modified with a fluorophore having fluorescence switching properties; the fluorescence switching property refers to that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction.

According to a preferred embodiment of the invention, the sequencing is a sequencing-by-synthesis method.

According to a preferred embodiment of the invention, it further comprises encoding the ambiguous sequence information and the reference genome in the same way and then aligning.

According to a preferred embodiment of the invention, it further comprises encoding the ambiguous sequence information or the reference genome and then aligning. In the coding process, the change of the base arrangement order may be involved, and other letters or symbols can be used instead, so that the same form is adopted and the alignment is facilitated.

According to a preferred embodiment of the invention, it further comprises encoding the reference genome, altering its sequence information and then aligning with the ambiguous sequence information.

According to a preferred embodiment of the invention, the reference genome is encoded, its sequence information is modified, and then aligned with the encoding of the ambiguous sequence information.

According to a preferred embodiment of the present invention, the ambiguous sequence information refers to sequence information from which the complete base sequence of the nucleotide sequence cannot be derived.

According to a preferred embodiment of the present invention, the complete base sequence information refers to nucleic acid sequence information encoded by A, G, T, C or nucleic acid sequence information encoded by A, G, U, C can be obtained; wherein the base may be a methylated base.

According to a preferred embodiment of the invention, the ambiguous sequence information may be a degenerate sequence represented using M, K, R, Y, W, S, B, D, H, V letters.
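As a minimal sketch of what such a degenerate representation means, the following Python snippet lists the standard IUPAC codes named above and checks whether a concrete base sequence is one of the sequences a degenerate string can stand for. The function name `matches` is an illustrative assumption, not something defined by the invention; letters outside the table (A, C, G, T) are treated as non-degenerate positions.

```python
# IUPAC degenerate nucleotide codes and the bases each one can stand for.
IUPAC = {
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def matches(degenerate: str, sequence: str) -> bool:
    """Return True if `sequence` is one of the sequences `degenerate` can represent.

    Hypothetical helper for illustration. A non-degenerate letter (A, C, G, T)
    in `degenerate` only matches itself, so mixed strings are supported.
    """
    if len(degenerate) != len(sequence):
        return False
    return all(base in IUPAC.get(code, code)
               for code, base in zip(degenerate, sequence))


print(matches("MKA", "CGA"))  # C is an M base, G is a K base, A matches A
print(matches("MKA", "GGA"))  # G is not an M base
```

The second call fails because G belongs to K, not M, which is exactly the kind of information a degenerate read still carries.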

According to a preferred embodiment of the present invention, the ambiguous sequence information may be a combination of degenerate sequence information and non-degenerate sequence information.

According to a preferred embodiment of the present invention, the method further comprises encoding a reference genome and then comparing the encoding of the ambiguous sequence information with the encoding of the reference genome.

According to a preferred embodiment of the present invention, the encoding of the ambiguous sequence information and the encoding of the reference genome result in the same representation.

According to a preferred embodiment of the invention, the sequencing is a sequencing method in which the 3' end is unblocked.

According to a preferred embodiment of the present invention, the reaction solution used for sequencing comprises nucleotide substrate molecules of two or more different bases.

According to a preferred embodiment of the present invention, nucleotide substrate molecules of two or more different bases in a reaction solution used for sequencing are labeled with the same or different fluorescent molecules.

According to a preferred embodiment of the present invention, the reaction solution used for sequencing is a set of reaction solutions, each set of reaction solutions containing two or more reaction solutions.

According to a preferred embodiment of the present invention, the sequencing reaction fluid is a set of reaction fluid sets, each set of reaction fluid sets comprising two reaction fluids, each reaction fluid comprising nucleotides of two different bases; the nucleotide in one reaction liquid can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide in the other reaction liquid can be complementary with the other two bases on the nucleotide sequence to be detected.

According to a preferred embodiment of the invention, the encoded ambiguous sequence information is aligned to the encoded reference genome using the Smith-Waterman algorithm, Bowtie, BWA or SOAP.

According to a preferred embodiment of the invention, mutated genes are found from the alignment result using a common method for finding genetic mutations, preferably one or more of MuTect, Strelka, Control-FREEC and cns-seq.

According to a preferred embodiment of the present invention, the ambiguous sequence information obtained by sequencing is encoded into one of its possible base sequence information.

According to a preferred embodiment of the invention, all ambiguous sequence information in the ambiguous sequence information obtained by sequencing is encoded into numbers.

According to a preferred embodiment of the invention, the coding of the ambiguous sequence information and the coding order of the reference genome are exchangeable.

According to a preferred embodiment of the invention, the fluorescence switching property means that after each sequencing reaction, the fluorescence signal is significantly increased or significantly decreased or the emitted light frequency range is significantly changed compared to before the sequencing reaction.

According to a preferred embodiment of the present invention, the 5'-terminal polyphosphoric acid-modified fluorescent group-modified nucleotide substrate molecule refers to a 5'-terminal polyphosphoric acid-modified fluorescent group-modified nucleotide substrate molecule.

According to a preferred embodiment of the present invention, the sequencing is performed using nucleotide substrate molecules in which the terminal or a middle phosphate of the 5' polyphosphate is modified with a fluorophore having fluorescence switching properties; the fluorescence switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it; one set of reaction solutions is used for each round of sequencing, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules of two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases; first, the nucleotide sequence fragment to be detected is fixed in a reaction chamber, and one reaction solution of the set is introduced; an enzyme releases the fluorophore from the nucleotide substrate bearing it, thereby causing fluorescence switching; the second reaction solution of the same set is then introduced; an enzyme again releases the fluorophore from the nucleotide substrate bearing it, thereby causing fluorescence switching; the two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.

The invention provides a sequencing reagent, which is characterized in that a nucleotide fragment to be detected is fixed, and fuzzy sequence information is obtained through the reaction of the sequencing reagent and the fixed nucleotide fragment; wherein the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two or more different bases.

According to a preferred embodiment of the present invention, sequencing is performed using a nucleotide substrate molecule sequencing reagent modified with a fluorophore having fluorescence switching properties at the 5' end of the polyphosphate; the fluorescence switching property refers to that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction.

According to a preferred embodiment of the present invention, the nucleotide substrate molecules of two or more different bases in the reaction reagent are labeled with the same or different fluorescent molecules.

According to a preferred embodiment of the present invention, the reaction reagent is a set of reaction solutions, each set of reaction solutions containing two or more reaction solutions.

According to a preferred embodiment of the present invention, the sequencing reagent is a set of reaction solutions, each set of reaction solutions comprising two reaction solutions, each reaction solution comprising nucleotides of two different bases; the nucleotide in one reaction liquid can be complementary with two bases on the nucleotide sequence to be detected, and the nucleotide in the other reaction liquid can be complementary with the other two bases on the nucleotide sequence to be detected.

According to a preferred embodiment of the present invention, the sequencing is performed using nucleotide substrate molecules in which the terminal or a middle phosphate of the 5' polyphosphate is modified with a fluorophore having fluorescence switching properties; the fluorescence switching property means that the fluorescence signal intensity increases significantly after the sequencing reaction compared with before it; one set of reaction solutions is used for each round of sequencing, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules of two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and those in the other reaction solution are complementary to the other two bases; first, the nucleotide sequence fragment to be detected is fixed, and one reaction solution of the set is introduced; an enzyme releases the fluorophore from the nucleotide substrate bearing it, thereby causing fluorescence switching; the second reaction solution of the same set is then introduced; an enzyme again releases the fluorophore from the nucleotide substrate bearing it, thereby causing fluorescence switching; the two reaction solutions are added cyclically, and the fuzzy coding information of the nucleotide sequence to be detected is obtained from the fluorescence information.

The invention provides a nucleic acid sequencing method for obtaining fuzzy nucleic acid coding information, which is characterized in that a nucleotide fragment to be detected is fixed, and a sequencing reagent reacts with the fixed nucleotide fragment to obtain fuzzy sequence information; wherein the reaction liquid of the sequencing reaction contains nucleotide substrate molecules with two or more different bases.

According to a preferred embodiment of the present invention, sequencing is performed using a nucleotide substrate molecule sequencing reagent modified with a fluorophore having fluorescence switching properties at the 5' end of the polyphosphate;

The fluorescence switching property refers to that the fluorescence signal is obviously changed after sequencing compared with that before sequencing reaction.

The invention provides a system for comparing and identifying mutation of fuzzy sequence information obtained by sequencing, which comprises a computing system and is used for comparing and/or identifying mutation by utilizing the fuzzy sequence information obtained by sequencing.

Ambiguous sequencing information refers to sequence information from which the exact base sequence of the nucleotide sequence cannot be determined. Ambiguous base sequences are a common concept in the scientific field, such as the use of the letter W for the bases A and/or T. Relevant definitions can also be found on Wikipedia (https://en.wikipedia.org/wiki/Nucleotide).

Fuzzy coding means that different DNA sequences may have identical coding results. Conversely, the same encoding result may have multiple different sources.
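To illustrate this many-to-one property, the sketch below enumerates every DNA sequence of a given length and groups the sequences by their MK code (M for A or C, K for G or T). The names `fuzzy_code` and `sources_by_code` are hypothetical helpers chosen for this illustration, and MK is used only as one example of a fuzzy coding.

```python
from collections import defaultdict
from itertools import product

# Example fuzzy coding: each base is replaced by its MK class.
MK = {"A": "M", "C": "M", "G": "K", "T": "K"}


def fuzzy_code(sequence, classes=MK):
    """Replace every base by the letter of the class it belongs to."""
    return "".join(classes[base] for base in sequence)


def sources_by_code(length, classes=MK):
    """Group all DNA sequences of the given length by their fuzzy code."""
    groups = defaultdict(list)
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        groups[fuzzy_code(seq, classes)].append(seq)
    return dict(groups)


groups = sources_by_code(3)
print(len(groups))           # number of distinct codes of length 3
print(sorted(groups["MMK"]))  # all sequences that share the code MMK
```

For length 3 the 64 possible sequences collapse onto 8 codes with 8 sources each, so a single code never pins down a unique sequence.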

Ambiguous information encoding refers to an operation on DNA sequences under which different sequences may yield identical results. Encoding a reference genome refers to applying such an operation to the reference genome sequence; reference genomes that differ locally may yield identical results. In one form, ambiguous information encoding is a simple rearrangement of the sequence that ignores the actual base order within each local part and orders the bases by type. A local part of a sequence is the region of the sequence corresponding to one sequencing reaction (one sequencing run consists of a plurality of sequencing reactions).

The 2+2 sequencing method of the invention means that each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, and each reaction solution containing nucleotide substrate molecules of two different bases; the nucleotide substrate molecules in one reaction solution are complementary to two of the bases on the nucleotide sequence to be detected, and the nucleotide substrate molecules in the other reaction solution are complementary to the other two bases. For example, one set contains two reaction solutions, the first containing A and T substrate molecules and the second containing G and C substrate molecules. A 2+2 sequencing method can be named after the nucleotide composition of its two reaction solutions. For example, if the first reaction solution of a set contains A and T substrate molecules (collectively, W) and the second contains G and C substrate molecules (collectively, S), 2+2 sequencing using that set is called WS sequencing. 2+2 sequencing comprises three combinations, MK, RY and WS, each of which can be further divided into single-color and two-color sequencing.
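The following sketch simulates the theoretical signal of single-color 2+2 sequencing under simplifying assumptions: `read` is taken to be the newly synthesized strand, the two reaction solutions are applied alternately starting with the first, and each reaction extends the strand through every consecutive base that the current solution contains. The function name `monochrome_signal` and its defaults (WS sequencing) are illustrative choices, not part of the invention.

```python
def monochrome_signal(read, first="AT", second="CG"):
    """Number of bases incorporated in each successive reaction (ideal case)."""
    reagents = (set(first), set(second))
    unknown = set(read) - reagents[0] - reagents[1]
    if unknown:
        # Guard against symbols that neither solution could ever extend.
        raise ValueError(f"bases not covered by either solution: {sorted(unknown)}")
    signal, pos, flow = [], 0, 0
    while pos < len(read):
        start = pos
        # Extend through every consecutive base present in the current solution.
        while pos < len(read) and read[pos] in reagents[flow % 2]:
            pos += 1
        signal.append(pos - start)
        flow += 1
    return signal


print(monochrome_signal("AATGCCGA"))  # [3, 4, 1]
print(monochrome_signal("TTAGCCGA"))  # [3, 4, 1], a different read with the same signal
print(monochrome_signal("GAT"))       # [0, 1, 2], the first W reaction incorporates nothing
```

Passing `first="AC", second="GT"` or `first="AG", second="CT"` gives the corresponding MK and RY signals.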

The 1+3 sequencing of the invention means that each round of sequencing uses one set of reaction solutions, each set comprising two reaction solutions, wherein the nucleotide substrate molecule in one reaction solution is complementary to one base on the nucleotide sequence to be detected, and the nucleotide substrate molecules in the other reaction solution are complementary to the other three bases. For example, one set contains two reaction solutions, the first containing A substrate molecules and the other containing G, C and T substrate molecules.
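A 1+3 result can be written as a mixture of one exact base and one degenerate letter. The sketch below is an illustrative assumption about how such a result could be represented: it uses the IUPAC "not X" letters (B, D, H, V) for the three-base solution and collapses the string into runs with `itertools.groupby`. The helper name `one_plus_three` is hypothetical.

```python
from itertools import groupby

# IUPAC letters for "any base except X".
NOT_BASE = {"A": "B", "C": "D", "G": "H", "T": "V"}


def one_plus_three(read, single="A"):
    """Return the degenerate string and its run lengths for 1+3 sequencing."""
    other = NOT_BASE[single]
    degenerate = "".join(base if base == single else other for base in read)
    runs = [(symbol, len(list(group))) for symbol, group in groupby(degenerate)]
    return degenerate, runs


print(one_plus_three("AACGTA"))
# ('AABBBA', [('A', 2), ('B', 3), ('A', 1)])
```

Each run corresponds to one reaction of the alternating two-solution cycle, so the run lengths are what an ideal single-color 1+3 experiment would report.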

The method provided by the invention has the following advantages: the 2+2 or 1+3 sequencing is performed only once, and repeated 2+2 or 1+3 sequencing for the same DNA sequence is not required. The nucleotide substrates in the reaction liquid used for each round of sequencing can be marked with the same fluorescent groups, or can be respectively marked with different fluorescent groups. The invention can encode the sequencing result and the reference genome at the same time. The characteristic of the coding is that if the theoretical sequencing signals of the two DNA sequences are identical, the coding results are identical. The invention uses general sequence comparison and identification method to compare the coded sequencing result to the coded reference genome and identify the genetic variation. The method provided by the invention requires discarding the first and last substrings of each sequence in the encoding of bicolor 2+2 sequencing information. The invention provides application of 2+2 or 1+3 fuzzy sequencing information for the first time.

All terms used in the present invention are intended to be given their ordinary meanings in the field of gene sequencing, unless otherwise indicated.

Detailed Description

The compounds, sequencing steps, alignment methods, etc. described in the disclosure are merely further illustrative of the invention, and the terms used are merely used to describe specific forms and are not limiting factors of the invention.

The basic steps of the invention are as follows:

1. The DNA sample is subjected to one round of 2+2 or 1+3 sequencing.

2. The sequencing result and the reference genome are encoded in the same way. The feature of the coding is that if the theoretical sequencing signals of the two DNA sequences are identical, the coding results are identical (even if the two sequences themselves are different). The result of the encoding is one or more strings (or sequences).

3. The encoded sequencing results are aligned to the encoded reference genome using commonly used sequence alignment methods (e.g., the Smith-Waterman algorithm, Bowtie, BWA, SOAP, etc.).

4. Genetic variation is found from the alignment result of step 3 using commonly used methods for discovering genetic variation (e.g., MuTect, Strelka, Control-FREEC, cns-seq, GATK, etc.).

5. The genetic variation found in step 4 is interpreted according to the coding method of step 2.

Theoretical sequencing signals refer to signals that should be theoretically sequenced in ideal situations without taking into account anomalies such as sequencing errors, signal attenuation, and dyssynchrony of DNA molecules. Theoretical sequencing signals directly reflect the base composition of the DNA sequence.
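For two-color 2+2 sequencing, the theoretical signal of each reaction can be modeled as a pair of counts, one per color. The sketch below assumes that `read` is the synthesized strand, that the two solutions alternate starting with the first, and that each base in a solution has its own color; the name `two_color_signal` is a hypothetical helper. It shows that two different sequences can produce identical theoretical signals when they differ only in base order within a reaction.

```python
def two_color_signal(read, first="AC", second="GT"):
    """Per-reaction counts (count of solution base 1, count of solution base 2)."""
    reagents = (first, second)
    unknown = set(read) - set(first) - set(second)
    if unknown:
        raise ValueError(f"bases not covered by either solution: {sorted(unknown)}")
    signal, pos, flow = [], 0, 0
    while pos < len(read):
        reagent = reagents[flow % 2]
        start = pos
        while pos < len(read) and read[pos] in reagent:
            pos += 1
        run = read[start:pos]
        signal.append((run.count(reagent[0]), run.count(reagent[1])))
        flow += 1
    return signal


print(two_color_signal("AAGTGGCACT"))  # [(2, 0), (3, 1), (1, 2), (0, 1)]
print(two_color_signal("AAGGGTACCT"))  # same signal, different base order
```

Because the signal records only how many bases of each color were incorporated per reaction, any coding that is a function of these counts satisfies the requirement that identical theoretical signals give identical codes.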

The above coding method may or may not satisfy the following "coding and reverse complement exchangeable" property: the same result is obtained whether the DNA sequence is encoded first and then reverse-complemented, or reverse-complemented first and then encoded. For example, for single-color MK sequencing of a DNA sequence, let the coding scheme be defined as: every measured M is rewritten as A and every measured K is rewritten as T. Then:

it can be seen that this coding is consistent with the "code and reverse complement exchangeable" nature. However, if the coding scheme is defined as: all measured M was rewritten as A and all measured K was rewritten as C.

Then:

that is not in accordance with the "code and reverse complement exchangeable" nature.
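The two cases can be checked mechanically. The sketch below builds both single-color MK codings (M and K rewritten as A and T, or as A and C) and tests whether encoding and reverse complementation commute on a sample sequence. The helper names and the sample sequence are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(sequence):
    """Reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def make_mk_encoder(m_char, k_char):
    """Coding that rewrites M bases (A, C) as m_char and K bases (G, T) as k_char."""
    table = str.maketrans("ACGT", m_char * 2 + k_char * 2)
    return lambda sequence: sequence.translate(table)


def commutes(encode, sequence):
    """True if encoding then reverse-complementing equals the opposite order."""
    return revcomp(encode(sequence)) == encode(revcomp(sequence))


seq = "ACGGT"
print(commutes(make_mk_encoder("A", "T"), seq))  # True
print(commutes(make_mk_encoder("A", "C"), seq))  # False
```

The first coding works because complementation swaps M and K, and A and T are themselves complementary; A and C are not, so the second coding breaks the property.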

If the selected coding mode does not meet the property of 'coding and reverse complementation exchange', the reference genome and the reverse complementation sequence thereof are required to be coded simultaneously in the step 2, and the (coded) sequencing result of each DNA molecule is compared with the coding result of the reference genome and the reverse complementation sequence thereof in the step 3, and a better comparison result is selected. If the coding scheme is chosen to meet the property of "coding and reverse complement interchangeability", then only the reference genome need be encoded in step 2, and its reverse complement need not be encoded.
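When the coding does not commute with reverse complementation, the selection between the two strands might look like the sketch below. It uses the non-commuting MK coding (M as A, K as C) and, purely for brevity, replaces a real aligner with a simple ungapped best-match count; `best_ungapped`, `choose_strand` and the toy sequences are hypothetical names and data introduced for this illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
MK_AC = str.maketrans("ACGT", "AACC")  # M -> A, K -> C (does not commute)


def revcomp(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def encode(sequence):
    return sequence.translate(MK_AC)


def best_ungapped(query, target):
    """Highest number of matching positions over all ungapped placements.

    Stand-in for a real aligner; assumes len(target) >= len(query).
    """
    return max(
        sum(q == t for q, t in zip(query, target[offset:offset + len(query)]))
        for offset in range(len(target) - len(query) + 1)
    )


def choose_strand(read, reference):
    """Score the encoded read against both encoded strands and keep the better one."""
    code = encode(read)
    forward = best_ungapped(code, encode(reference))
    reverse = best_ungapped(code, encode(revcomp(reference)))
    return ("+", forward) if forward >= reverse else ("-", reverse)


reference = "ACAGCTGGTG"
read = revcomp(reference)[1:8]  # a read that originates from the reverse strand
print(read)                            # ACCAGCT
print(choose_strand(read, reference))  # ('-', 7)
```

With a coding that does satisfy the property, the second encoded reference is unnecessary, because the aligner's own reverse-complement handling already covers it.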

Examples of coding schemes consistent with the "code and reverse complement exchangeable" property in monochrome 2+2 sequencing:

MK sequencing: 1) M is rewritten as A and K is rewritten as T; or 2) M is rewritten as C and K is rewritten as G;

RY sequencing: 1) R is rewritten as A and Y is rewritten as T; or 2) R is rewritten as C and Y is rewritten as G;

WS sequencing: a coding for single-color WS sequencing that satisfies the "coding and reverse complement exchangeable" property is to encode each W as the string AT and each S as the string CG; thus WW is encoded as ATAT, SS as CGCG, WWW as ATATAT, SSS as CGCGCG, and so on.
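The WS case differs from MK and RY because A/T and C/G are each closed under complementation, so a single letter cannot serve as a code. A minimal sketch of the two-letter coding described above, together with a check of the exchangeable property, is given below; the function names are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
WS_UNIT = {"A": "AT", "T": "AT", "C": "CG", "G": "CG"}  # W -> AT, S -> CG


def revcomp(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def encode_ws(sequence):
    """Rewrite every W base as AT and every S base as CG."""
    return "".join(WS_UNIT[base] for base in sequence)


seq = "ATGCA"  # WS classes: W W S S W
print(encode_ws(seq))                                     # ATATCGCGAT
print(revcomp(encode_ws(seq)) == encode_ws(revcomp(seq)))  # True
```

The property holds because each unit, AT or CG, is its own reverse complement, so reverse-complementing the code merely reverses the order of the units.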

Examples of coding schemes consistent with the "code and reverse complement exchangeable" property in two-color 2+2 sequencing:

1. The sequence is sequentially partitioned into a number of substrings, each substring containing only bases corresponding to the 2+2 sequencing combination. For example, in two-color MK sequencing, each substring consists of A and/or C only, or G and/or T only. For example, sequence AAGTGGCACT is partitioned into (AA, GTGG, CAC, T).

2. Each sub-string is rearranged from small to large in alphabetical order, respectively. For example, (AA, GTGG, CAC, T) is rearranged to (AA, GGGT, ACC, T).

3. And sequentially connecting the rearranged sub-strings to form a new string, and taking the new string as a coding result. For example, (AA, GGGT, ACC, T) are concatenated into a string AAGGGTACCT.
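The three steps above can be sketched directly with `itertools.groupby`. The function name `two_color_encode` and the `scheme` parameter (the two base pairs of the 2+2 combination, MK by default) are assumptions made for this illustration.

```python
from itertools import groupby


def two_color_encode(sequence, scheme=("AC", "GT")):
    """Partition into same-class substrings, sort each one, and concatenate."""
    class_of = {base: index for index, bases in enumerate(scheme) for base in bases}
    # Step 1: consecutive bases of the same class form one substring.
    # Step 2: each substring is sorted alphabetically.
    parts = ["".join(sorted(run))
             for _, run in groupby(sequence, key=class_of.__getitem__)]
    # Step 3: the sorted substrings are joined in their original order.
    return parts, "".join(parts)


parts, code = two_color_encode("AAGTGGCACT")
print(parts)  # ['AA', 'GGGT', 'ACC', 'T']
print(code)   # AAGGGTACCT
```

Passing `scheme=("AG", "CT")` or `scheme=("AT", "CG")` gives the corresponding two-color RY and WS codings.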

The two-color coding mode accords with the property of 'coding and reverse complementary exchangeable':

To improve alignment accuracy in step 3, the first and last substrings of each sequence in the two-color 2+2 encoding may need to be discarded, because these two parts are prone to alignment errors. In the above example, the sequence AAGTGGCACT would then be encoded as GGGTACC.
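One way to see why the end substrings are unreliable is that a read usually starts and stops in the middle of a run, so its first and last substrings are truncated relative to the reference. The sketch below demonstrates this with a toy reference; the helper `two_color_code`, its `trim_ends` flag and the sequences are hypothetical, and the reference is assumed to be encoded without trimming.

```python
from itertools import groupby


def two_color_code(sequence, scheme=("AC", "GT"), trim_ends=False):
    """Two-color 2+2 code; optionally drop the first and last substrings."""
    class_of = {base: index for index, bases in enumerate(scheme) for base in bases}
    parts = ["".join(sorted(run))
             for _, run in groupby(sequence, key=class_of.__getitem__)]
    if trim_ends:
        parts = parts[1:-1]
    return "".join(parts)


reference = "CAAGTGGCACTGA"
read = reference[2:11]  # AGTGGCACT, cut in the middle of the first and last runs

ref_code = two_color_code(reference)            # AACGGGTACCGTA
full = two_color_code(read)                     # AGGGTACCT
trimmed = two_color_code(read, trim_ends=True)  # GGGTACC

print(full in ref_code)     # False: the truncated end substrings do not fit
print(trimmed in ref_code)  # True: the interior substrings match exactly
```

The untrimmed code fails to match even though the read contains no sequencing error, whereas the trimmed code is an exact substring of the encoded reference.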

Unless otherwise specified, single-color and two-color 2+2 results in the following examples are encoded as described in the preceding examples. dMK, dRY and dWS denote two-color MK, two-color RY and two-color WS, and sMK and sRY denote single-color MK and single-color RY, respectively. The following specific embodiments are presented to further illustrate the invention. The specific parameters, steps, etc. involved are conventional in the art. The detailed description and examples do not limit the scope of the invention. Except where specifically indicated, all terms used in this application have their ordinary meaning in the art. Unless otherwise specified, all gene sequences referred to in the present invention are commercially available, artificially synthesized sequences. Many companies routinely synthesize such sequences, for example Invitrogen.

Example 1

According to the description of the invention, a human genomic DNA sample (the reagent Human CEPH Genomic DNA in the Ion PI™ Controls Kit from Thermo Inc., cat. No. 4488985) was sequenced with two-color MK, two-color RY, two-color WS, single-color MK and single-color RY, one million DNA sequences each. The results were aligned to the correspondingly encoded genome using Bowtie2, and the proportion of DNA sequences that aligned to a unique position on the encoded genome (the unique alignment rate) was counted. The results were compared with sequencing results from an Illumina sequencer (HiSeq 2000), which yields complete DNA sequence information. The unique alignment rates are as follows:

In the table, dMK represents the two-color MK sequencing method; the lowercase letters d and s represent two-color and single-color sequencing, respectively.

Example 2

According to the description of the invention, an E. coli genomic DNA sample (Thermo E. coli DNA Control, cat. No. 4458450) was sequenced with two-color MK, two-color RY, two-color WS, single-color MK and single-color RY, one million DNA sequences each. The results were aligned to the correspondingly encoded genome using Bowtie2, and the proportion of DNA sequences that aligned to a unique position on the encoded genome (the unique alignment rate) was counted. The results were compared with sequencing results from an Illumina sequencer, which yields complete DNA sequence information. The unique alignment rates are shown in the following table:

Example 3

Since the present invention infers genetic variations from only partial information about the DNA sequence, some genetic variations cannot, even in theory, be found by the present invention. For example, in single-color MK sequencing, the point mutation A→C cannot be found (although it can theoretically be found in single-color RY sequencing); in two-color MK sequencing, a mutation that swaps two adjacent bases AC into CA likewise cannot be found in theory. We counted the proportion of all human SNVs known to date (downloaded from the dbSNP database: https://www.ncbi.nlm.nih.gov/pnp; filename: all_2015105.vcf.gz) that cannot theoretically be detected by the present invention, as shown in the following table:
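The rule behind the single-color case can be stated compactly: a single-base substitution leaves the theoretical single-color signal unchanged exactly when the reference and alternate bases belong to the same reaction solution. The sketch below encodes that rule; the function name and the `SCHEMES` table are illustrative assumptions, and the sketch does not reproduce the dbSNP statistics.

```python
# The two reaction solutions of each 2+2 combination.
SCHEMES = {"MK": ("AC", "GT"), "RY": ("AG", "CT"), "WS": ("AT", "CG")}


def snv_detectable_single_color(ref, alt, scheme):
    """True if the substitution ref -> alt changes the single-color signal.

    The signal only records how many bases each solution incorporates, so a
    substitution within one solution's base pair is invisible in theory.
    """
    first, _second = SCHEMES[scheme]
    return (ref in first) != (alt in first)


for scheme in SCHEMES:
    print(scheme, snv_detectable_single_color("A", "C", scheme))
# MK False
# RY True
# WS True
```

Under this rule each single-color combination misses one of the three substitution families: MK misses A/C and G/T exchanges, RY misses transitions (A/G and C/T), and WS misses A/T and C/G exchanges.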

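The detectability rule for single-base substitutions in one-color sequencing can be sketched as follows. The sketch assumes that a substitution is visible only when it moves the base from one degenerate group to the other; it does not cover the two-color case or adjacent-base swaps, and it uses a hand-written list of substitutions rather than the dbSNP file. All names are hypothetical.

```python
# The two base groups of each 2+2 scheme.
GROUPS = {
    "MK": ("AC", "GT"),
    "RY": ("AG", "CT"),
    "WS": ("AT", "CG"),
}

def snv_detectable_one_color(ref_base, alt_base, scheme):
    """True if the substitution changes the degenerate group of the base."""
    for group in GROUPS[scheme]:
        if ref_base in group:
            return alt_base not in group
    raise ValueError(f"unknown base: {ref_base}")

def undetectable_fraction(snvs, scheme):
    """Fraction of (ref, alt) substitutions that a one-color scheme cannot see."""
    missed = sum(
        1 for ref, alt in snvs
        if not snv_detectable_one_color(ref, alt, scheme)
    )
    return missed / len(snvs)
```

For instance, `snv_detectable_one_color("A", "C", "MK")` is false while the same substitution under `"RY"` is true, which matches the A→C example in the text.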

Example 4

One-color 2+2 sequencing, three rounds: three sets of reaction solutions are prepared. Each set comprises two bottles, and each bottle contains two kinds of bases labeled with fluorescent groups; the fluorescent groups are all ones commonly used for nucleic acid labeling. The two bottles of a set together contain exactly the four bases. No two of the six bottles have the same composition.

The complete sequencing process consists of three rounds performed one after another. Each round uses one of the three sets of reagents. Apart from this, the rounds are identical (the same sequencing primer and the same reaction conditions are used).

Each round of sequencing comprises:

1. Hybridize the sequencing primer to the prepared DNA array.

2. Start the sequencing process. Steps 2.1-2.4 are repeated a finite number of times.

2.1 Introduce the first bottle of reagent; react and collect the fluorescent signal.

2.2 Wash all residual reaction solution and generated fluorescent molecules out of the flow cell.

2.3 Introduce the second bottle of reagent; react and collect the fluorescent signal.

3. Melt off the extended sequencing primer.

The next round of sequencing can then be performed.
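The signal produced by one such round can be illustrated with an idealized simulation. It assumes that each injection extends the primer through every consecutive base present in that bottle and that the fluorescent signal is exactly proportional to the number of bases incorporated (no dephasing or noise). The function name and arguments are hypothetical.

```python
def one_color_signals(seq, first_bottle, second_bottle, cycles):
    """Simulate per-injection signals for one round of one-color 2+2 sequencing.

    seq is the strand being synthesized (5'->3'); first_bottle and
    second_bottle are strings of the bases each bottle contains, e.g. "AC"
    and "GT". Returns one incorporation count per injection.
    """
    bottles = (set(first_bottle), set(second_bottle))
    signals, pos = [], 0
    for i in range(2 * cycles):
        bottle = bottles[i % 2]  # alternate first and second bottle
        n = 0
        while pos < len(seq) and seq[pos] in bottle:
            pos += 1
            n += 1
        signals.append(n)
    return signals
```

With the MK bottles, the strand `AACGTTA` gives three incorporations in the first injection (A, A, C), three in the second (G, T, T), and one in the third (A).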

Preparation of reaction solutions:

Prepare the sequencing wash buffer (wash buffer for short), which contains:

20 mM Tris-HCl, pH 8.8

10 mM (NH4)2SO4

50 mM KCl

2 mM MgSO4

0.1% Tween 20

Prepare the sequencing reaction stock solution (stock solution for short), which contains:

20 mM Tris-HCl, pH 8.8

10 mM (NH4)2SO4

50 mM KCl

2 mM MgSO4

0.1% Tween 20

8000 unit/mL Bst polymerase

100 unit/mL CIP

Three sets of sequencing reaction solutions were prepared, six bottles in total, as follows:

1A: stock solution + 20 μM dA4P-TG + 20 μM dC4P-TG

1B: stock solution + 20 μM dG4P-TG + 20 μM dT4P-TG

2A: stock solution + 20 μM dA4P-TG + 20 μM dG4P-TG

2B: stock solution + 20 μM dC4P-TG + 20 μM dT4P-TG

3A: stock solution + 20 μM dA4P-TG + 20 μM dT4P-TG

3B: stock solution + 20 μM dC4P-TG + 20 μM dG4P-TG

The prepared reaction solutions and stock solution are kept in a 4 °C refrigerator or on ice until use.
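The two constraints stated for the reaction solutions (each set contains all four bases, and no two bottles repeat) can be checked mechanically. The sketch below assumes the bottle contents follow the MK, RY and WS groupings and ignores the fluorophore label; the dictionary layout and function name are hypothetical.

```python
from itertools import combinations

# Bases present in each bottle (fluorophore label omitted).
BOTTLES = {
    "1A": {"A", "C"}, "1B": {"G", "T"},
    "2A": {"A", "G"}, "2B": {"C", "T"},
    "3A": {"A", "T"}, "3B": {"C", "G"},
}

def check_sets(bottles):
    """True if every A/B pair covers A, C, G, T exactly once
    and no two bottles share the same composition."""
    set_ids = sorted({name[0] for name in bottles})
    for s in set_ids:
        a, b = bottles[s + "A"], bottles[s + "B"]
        if a & b or (a | b) != set("ACGT"):
            return False
    return all(x != y for x, y in combinations(bottles.values(), 2))
```

A set whose two bottles share a base, or which omits one of the four bases, fails the check.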

Hybridization of the sequencing primer:

The sequencing chip was filled with sequencing primer solution (10 μM in 1× SSC buffer), heated to 90 °C, and cooled to 40 °C at a rate of 5 °C per minute. The sequencing primer solution was then rinsed off with wash buffer.

First round of sequencing:

The sequencing chip was placed on the sequencer.

Sequencing was performed using the first set of reaction solutions, according to the following procedure.

1. Introduce 10 mL of wash buffer to wash the chip.

2. Cool the chip to 4 °C.

3. Introduce 100 μL of reaction solution 1A.

4. Heat the chip to 65 °C.

5. Wait 1 min.

6. Excite with a 473 nm laser and take a fluorescence image.

7. Introduce 10 mL of wash buffer to wash the chip.

8. Cool the chip to 4 °C.

9. Introduce 100 μL of reaction solution 1B.

10. Heat the chip to 65 °C.

11. Wait 1 min.

12. Excite with a 473 nm laser and take a fluorescence image.

Steps 1-12 were repeated 50 times to obtain 100 fluorescence signals.
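The step schedule for one round can be written out programmatically to confirm the count of fluorescence signals. This is only a bookkeeping sketch: the step wording is paraphrased from the protocol above, and the function names are hypothetical.

```python
def injection_steps(bottle):
    """The six protocol steps performed for one bottle of reaction solution."""
    return [
        "wash chip with 10 mL wash buffer",
        "cool chip to 4 °C",
        f"introduce 100 μL of reaction solution {bottle}",
        "heat chip to 65 °C",
        "wait 1 min",
        "excite with 473 nm laser and take fluorescence image",
    ]

def round_schedule(set_id, repeats):
    """Full step list for one round: bottles A and B alternate `repeats` times."""
    steps = []
    for _ in range(repeats):
        for suffix in ("A", "B"):
            steps.extend(injection_steps(f"{set_id}{suffix}"))
    return steps

def image_count(steps):
    """Number of fluorescence images, i.e. one signal per injection."""
    return sum(1 for s in steps if "fluorescence image" in s)
```

With `repeats=50` the schedule contains 600 steps and 100 imaging steps, in agreement with the 100 fluorescence signals stated above.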

Example 5

Two-color 2+2 sequencing, three rounds: three sets of reaction solutions are prepared. Each set comprises two bottles, and each bottle contains two kinds of bases. The two bases in a bottle are labeled with different fluorophores, which have different emission wavelengths, so that they can be distinguished.

In this example, two fluorophores, X and Y, are used for all bases. The two bottles of a set together contain exactly the four bases. No two of the six bottles have the same composition.

	First bottle	Second bottle
First set	AX+CY	GX+TY
Second set	AX+GY	CX+TY
Third set	AX+TY	CX+GY

(X and Y denote fluorescent groups commonly used for labeling nucleic acids.)

The complete sequencing process consists of three rounds performed one after another. Each round uses one of the three sets of reagents. Apart from this, the rounds are identical.

Each round of sequencing comprises:

1 Hybridize the sequencing primer to the prepared DNA array.

2 Start the sequencing process. Steps 2.1-2.4 are repeated a finite number of times.

2.1 Introduce the first bottle of reagent; react and collect the fluorescent signals at the two wavelengths.

2.3 Introduce the second bottle of reagent; react and collect the fluorescent signals at the two wavelengths.

3 Melt off the extended sequencing primer.

The next round of sequencing can then be performed.
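An idealized simulation shows what a two-color round records. It assumes that each injection extends through every consecutive base present in the bottle and that the signal in each color channel equals the number of bases of that color incorporated, so the composition of each run is known but not the order of bases within it. The function name and the dictionary representation of a bottle are hypothetical.

```python
def two_color_signals(seq, first_bottle, second_bottle, cycles):
    """Simulate per-injection (X, Y) counts for one round of two-color sequencing.

    Each bottle maps base -> fluorophore, e.g. {"A": "X", "C": "Y"}.
    seq is the strand being synthesized (5'->3').
    """
    bottles = (first_bottle, second_bottle)
    signals, pos = [], 0
    for i in range(2 * cycles):
        bottle = bottles[i % 2]  # alternate first and second bottle
        counts = {"X": 0, "Y": 0}
        while pos < len(seq) and seq[pos] in bottle:
            counts[bottle[seq[pos]]] += 1
            pos += 1
        signals.append((counts["X"], counts["Y"]))
    return signals
```

Under these assumptions the strands `AC` and `CA` give the same signal with the first set (AX+CY, GX+TY), since both produce one X count and one Y count in the first injection.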

Example 6

Examples 4 and 5 are complete sequencing schemes. It is generally accepted that complete, unambiguous sequence information can be obtained with the sequencing workflows of Examples 4 and 5, or with at least two rounds of sequencing. When a reference genome is available, however, a single round of sequencing suffices: the fuzzy sequence information it yields can be aligned to the reference genome and used to find variations.

On the basis of Example 4, only one of the three sets of reaction solutions needs to be prepared, and its two bottles are used to carry out a single round of sequencing. The specific sequencing steps may be the same as in Example 4.
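Why two one-color rounds can pin down every base may be seen from a small sketch: each round assigns every position to one of two base groups, and two different groupings intersect in exactly one base. The sketch assumes ideal, error-free integer signals of equal total length; the function names are hypothetical and this is not presented as the patent's decoding procedure.

```python
CLASS_BASES = {
    "M": {"A", "C"}, "K": {"G", "T"},
    "R": {"A", "G"}, "Y": {"C", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
}

def expand(signals, first_code, second_code):
    """Turn per-injection incorporation counts into a per-base degenerate string."""
    codes = (first_code, second_code)
    return "".join(codes[i % 2] * n for i, n in enumerate(signals))

def combine(fuzzy_a, fuzzy_b):
    """Intersect two degenerate strings position by position to recover bases."""
    if len(fuzzy_a) != len(fuzzy_b):
        raise ValueError("rounds cover different read lengths")
    bases = []
    for a, b in zip(fuzzy_a, fuzzy_b):
        common = CLASS_BASES[a] & CLASS_BASES[b]
        if len(common) != 1:
            raise ValueError(f"codes {a} and {b} do not pin down a single base")
        bases.append(next(iter(common)))
    return "".join(bases)
```

For example, MK signals `[3, 3, 1]` expand to `MMMKKKM`, RY signals `[2, 1, 1, 2, 1]` expand to `RRYRYYR`, and combining the two yields `AACGTTA`. A single round on its own leaves two candidate bases at every position, which is the fuzzy information that is aligned against the reference.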
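Finding variation from a single one-color round can be sketched as a comparison in the encoded alphabet. The sketch assumes the read has already been aligned at a known offset, uses the MK grouping by default, and reports only substitutions that change the degenerate group; the function names are hypothetical.

```python
MK = {"A": "M", "C": "M", "G": "K", "T": "K"}

def encode(seq, table=MK):
    """Collapse a base sequence into its degenerate encoding."""
    return "".join(table[b] for b in seq)

def find_variants(fuzzy_read, reference, offset, table=MK):
    """0-based reference positions where the fuzzy read disagrees
    with the identically encoded reference window."""
    window = encode(reference[offset:offset + len(fuzzy_read)], table)
    return [
        offset + i
        for i, (r, w) in enumerate(zip(fuzzy_read, window))
        if r != w
    ]
```

With the reference `ACGTACGT`, a sample read `GTAGGT` aligned at offset 2 is flagged at position 5 (C to G changes the group), whereas `GTAAGT` (C to A) is not flagged, consistent with the theoretical limits discussed in Example 3.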

Example 7

On the basis of Example 5, only one of the three sets of reaction solutions needs to be prepared, and its two bottles are used to carry out a single round of sequencing. The specific sequencing steps may be the same as in Example 5.

For further details of the sequencing method of the present invention, reference may be made to the applicant's previously filed patent applications CN201510822361.9 or CN 2015110815685.X, which are not described in detail here. It is expressly noted that the specific sequencing steps do not limit the scope of the present invention.

Claims

1. A method for comparing fuzzy sequence information obtained by sequencing, characterized in that,

the sequencing reaction is a sequencing method in which the 3' end is not blocked;

comparing the fuzzy sequence information with the reference nucleic acid sequence means that the fuzzy sequence information and the reference nucleic acid sequence are encoded in the same manner and then compared;

wherein the comparing of the fuzzy sequence information with the reference nucleic acid sequence comprises the following steps:

(1) encoding the sequencing result and the reference nucleic acid sequence by the same method;

(2) comparing the encoded sequencing result with the encoded reference nucleic acid sequence;

(3) finding genetic variation in the comparison results.

2. A method for comparing fuzzy sequence information obtained by sequencing, characterized in that,

the sequencing reaction is a sequencing method in which the 3' end is not blocked;

(3) finding genetic variation in the comparison results.

3. A method according to claim 1 or 2, characterized in that,

each sequencing run uses one set of reaction solutions, wherein each set comprises two or more reaction solutions, and each reaction solution comprises nucleotide substrate molecules with at least two different bases.

4. The method according to claim 1, characterized in that,

The ambiguous sequence information is a combination of degenerate sequence information and non-degenerate sequence information.

5. A method according to claim 1 or 2, characterized in that,

The ambiguous sequence information obtained by sequencing is encoded into one of its possible base sequence information.

6. A method according to claim 1 or 2, characterized in that,

all of the fuzzy sequence information obtained by sequencing is encoded as numbers.

7. A method according to claim 1 or 2, characterized in that,

The ambiguous sequence information and the reference nucleic acid sequence are encoded simultaneously or sequentially.

8. The method according to claim 1, characterized in that,

The nucleotide substrate molecule of the fluorophore with the fluorescent switching property modified by the 5'-terminal polyphosphoric acid refers to the nucleotide substrate molecule of the fluorophore with the fluorescent switching property modified by the 5'-terminal polyphosphoric acid.

9. The method according to claim 1, characterized in that,

sequencing is performed using a nucleotide substrate molecule modified with a fluorophore having a fluorescence switching property at the 5' polyphosphate end or the middle phosphate.

10. A system for comparing fuzzy sequence information obtained by sequencing, comprising a computing system, characterized in that,

the method of any of the preceding claims is used to compare the fuzzy sequence information obtained by sequencing with a reference nucleic acid sequence.