CN110016498A

CN110016498A - Method for determining single nucleotide polymorphisms in Sanger sequencing

Info

Publication number: CN110016498A
Application number: CN201910332899.XA
Authority: CN
Inventors: 杨志凯; 张延明; 杜楠; 王柏婧; 张萱; 朱政英; 王忠杰; 许志华; 万丽君; 周鑫峰
Original assignee: Sinogenomax Co Ltd
Current assignee: Sinogenomax Co Ltd
Priority date: 2019-04-24
Filing date: 2019-04-24
Publication date: 2019-07-16
Anticipated expiration: 2039-04-24
Also published as: CN110016498B

Abstract

The present invention provides a method for determining single nucleotide polymorphisms (SNPs) in Sanger sequencing. In the method, the high-quality base sequence files derived from the same sequence to be measured are spliced and assembled by building a k-mer hash table, in which each k-mer sequence is a key and the number of times that k-mer occurs in the whole sequence is the value corresponding to the key; splicing and assembly of the high-quality base sequences are carried out using the k-mer value. The method automates the complete analysis workflow: only the relevant parameters need to be set, no manual intervention is required in between, and the result files are output directly.

Description

Method for determining single nucleotide polymorphisms in Sanger sequencing

Technical field

The present invention relates to the field of biotechnology, and in particular to a method for determining single nucleotide polymorphisms in Sanger sequencing.

Background art

Sanger sequencing is the "gold standard" of DNA sequencing technology. It played a crucial role in advancing the Human Genome Project and is still used today to obtain highly accurate and reliable sequencing data.

During Sanger sequencing, a DNA polymerase copies a single-stranded DNA template by adding nucleotides to a growing chain (the extension product). Chain elongation occurs at the 3' end of the primer, and the deoxynucleotide added to the extension product is selected by base-pair complementarity with the template.

Sanger sequencing uses a DNA polymerase to extend a primer bound to the template of unknown sequence until a chain-terminating nucleotide is incorporated. Each sequencing run consists of a set of four separate reactions; each reaction contains all four deoxynucleoside triphosphates (dNTPs) plus a limited amount of one different dideoxynucleoside triphosphate (ddNTP). Because a ddNTP lacks the 3'-OH group required for extension, the elongating oligonucleotide is selectively terminated at G, A, T or C, the termination point being determined by the dideoxynucleotide present in the reaction. The relative concentrations of the dNTPs and ddNTPs can be adjusted so that each reaction yields a set of chain-termination products hundreds to thousands of bases long.

At present, base quality in Sanger sequencing data must be judged manually from the chromatogram peak value of each base in the abi file, and low-quality bases are removed by hand, which introduces manual error. In addition, existing methods can process only one sequence at a time at this step: the result must be output before the next sequence is handled, which seriously slows the analysis. The same problem arises when splicing sequences: only one sample can be spliced at a time, and the next sample can be analyzed only after the current result is complete. When SNPs are detected in first-generation Sanger sequencing data, the multiple peak values at the same position in the abi file must be compared, and a base position is judged to carry an SNP variation only when the ratio of the secondary peak to the main peak reaches a certain range. Because current methods cannot quantify the peak values numerically, the ratio can only be estimated by eye from the peak heights, which causes serious manual error; for long sequences, manual SNP identification is also very time-consuming and working efficiency cannot be improved. Moreover, analysis methods based on first-generation Sanger sequencing data cannot be improved on the basis of existing methods to achieve fully automated quality control, splicing and SNP detection.

Summary of the invention

In one embodiment, the present invention provides a method for determining single nucleotide polymorphisms in Sanger sequencing, the method comprising the following steps:

Step 1: according to the length of each sequence to be measured, design N primer pairs for PCR amplification, where N is an integer not less than 2 and the N primer pairs completely cover the sequence to be measured;

Step 2: perform Sanger sequencing on the amplification products of each sequence to be measured, each sequence to be measured generating 2N Sanger sequencing abi files; name the abi files of each sequence to be measured so that files derived from the same sequence to be measured can be identified by name;

Step 3: convert the Sanger sequencing abi files into text format files and normalize the base signal values;

Step 4: delete, by a sliding-window method, the sequence regions whose base quality is below a preset base quality value, and synchronously remove the base quality values corresponding to those regions, to obtain high-quality base sequences and their corresponding base quality values;

Step 5: for the high-quality base sequence files derived from the same sequence to be measured, build a k-mer hash table with the k-mer sequence as the key and the number of times that k-mer occurs in the whole sequence as the value corresponding to the key, and splice and assemble the high-quality base sequences using the k-mer value; and

Step 6: based on the high-quality base sequences and corresponding base quality values obtained in step 4, obtain for each base position the ratio of the second-largest base signal value to the largest base signal value; when this ratio is greater than a preset value, evaluate the position as a polymorphic site of the spliced and assembled sequence, then store and output the polymorphic site information of each sequence to be measured.

In one embodiment, the sliding window used in the sliding-window method is 5-20 bp, preferably 5-10 bp.

In one embodiment, the preset base quality value is 30-60, preferably 50-60.

In one embodiment, in step 2, each abi file of the sequence to be measured is named in the form "name of the sequence to be measured + primer name".

In one embodiment, the high-quality sequence files derived from the same sequence to be measured are sorted according to the primer name in the sequence name, and a k-mer hash table is built for each of two adjacent sequences to be spliced, with the k-mer sequence as the key and the number of times that k-mer occurs in the whole sequence as the value corresponding to the key. The keys of the hash tables of the two adjacent sequences are searched for an identical key representing the same k-mer sequence, where the key must correspond to a unique value. When no identical key exists in the two sequences, the k-mer value is decreased by 1 and the search continues until the maximum k-mer value is found. Based on the position information of all k-mer sequences that are present in both adjacent sequences and are unique, the maximum overlapping interval between the two sequences is located and its position in each of the two sequences is obtained.

In one embodiment, the k-mer value is in the range 90-150 bp.

In one embodiment, the preset value is not less than 0.25.

In one embodiment, in step 6, the SNP site identification results are output in Excel file format, recording the coordinate position of each SNP site in the spliced sequence and the corresponding base.

With the method of the present invention for determining single nucleotide polymorphisms in Sanger sequencing, an ordinary Windows computer can automatically read the multiple abi-format sequence files in the data input folder. According to the file names, the multiple sequences derived from the same sample are automatically filtered by the sliding-window method to remove low-quality bases at both ends; after sequence splicing is completed, SNP detection is performed, and the degenerate bases at the SNP sites, the spliced sequence and the corresponding chromatogram results are output to the data result files. The data files in the input folder are then processed in turn until all data files have been completed.

The method of the present invention for determining single nucleotide polymorphisms in Sanger sequencing automates the complete analysis workflow: only the relevant parameters need to be set, no manual intervention is required in between, and the result files are output directly. The method avoids the errors produced by manual interpretation during sequence quality-control filtering and SNP detection, so the results are more accurate. The fully automated workflow also saves a large amount of analysis time. When commercial methods are used for sequence quality control and splicing, the sequences of only one sample can be input for analysis at a time, and the corresponding results are output only after the SNP sites have been interpreted manually; one sample takes about 3-5 minutes. With the method of the present invention the same work is completed in only 5-6 seconds, which greatly improves working efficiency.

Brief description of the drawings

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some of the embodiments described in this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 shows the 5' and 3' end chromatograms of the raw sequencer output of sample LSYZ2017020_SeqF3 and of the sequence after removal of low-quality bases with sliding-window lengths (win) of 5 bp, 10 bp, 15 bp and 20 bp;

Fig. 2 shows the 5' and 3' end chromatograms of the raw sequencer output of sample LSYZ2017020_SeqF3 and of the sequence after removal of regions whose mean base quality is below the preset base quality value (QC) of 30, 40, 50 and 60, with the sliding-window length (win) fixed at 5 bp;

Fig. 3 shows the assembly result sequences obtained when the six filtered high-quality sequence files of sample LSYZ2017020 are spliced and assembled in the order SeqR3-SeqR4-SeqRT-R2-SeqPRT-F2-SeqF3-SeqF4 with Kmer set to 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 160, 180 and 200;

Fig. 4 shows a schematic of the splicing and assembly of the six filtered high-quality sequence files of sample LSYZ2017020, together with the assembly of the actual data;

Fig. 5 shows the chromatograms of the SNP sites detected in sample LSYZ2017020 at different SNP cutoff values;

Fig. 6 shows the data splicing and SNP site identification results for sample LSYZ2017020;

Fig. 7 shows the 5' and 3' end chromatograms of the raw sequencer output of sample LSYZ2017032_SeqF3 and of the sequence after removal of low-quality bases with sliding-window lengths (win) of 5 bp, 10 bp, 15 bp and 20 bp;

Fig. 8 shows the 5' and 3' end chromatograms of the raw sequencer output of sample LSYZ2017032_SeqF3 and of the sequence after removal of regions whose mean base quality is below the preset base quality value (QC) of 30, 40, 50 and 60, with the sliding-window length (win) fixed at 5 bp;

Fig. 9 shows the assembly result sequences obtained when the six filtered high-quality sequence files of LSYZ2017032 are spliced and assembled in the order SeqR3-SeqR4-SeqRT-R2-SeqPRT-F2-SeqF3-SeqF4 with Kmer set to 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 160, 180 and 200;

Fig. 10 shows the assembly result sequences obtained when the six filtered high-quality sequence files of LSYZ2017041 are spliced and assembled in the order SeqR3-SeqR4-SeqRT-R2-SeqPRT-F2-SeqF3-SeqF4 with Kmer set to 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 160, 180 and 200;

Fig. 11 shows a schematic of the splicing and assembly of the six filtered high-quality sequence files of sample LSYZ2017032, together with the assembly of the actual data;

Fig. 12 shows a schematic of the splicing and assembly of the six filtered high-quality sequence files of sample LSYZ2017041, together with the assembly of the actual data;

Fig. 13 shows the chromatograms of the SNP sites detected in sample LSYZ2017032 at different SNP cutoff values;

Fig. 14 shows the chromatograms of the SNP sites detected in sample LSYZ2017041 at different SNP cutoff values; and

Fig. 15 shows the data splicing and SNP site identification results for samples LSYZ2017032 and LSYZ2017041.

Detailed description of the embodiments

In order that those skilled in the art may better understand the technical solutions of the present application, the invention is further described below with reference to examples. Obviously, the described examples are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application. The invention is further described below with reference to the drawings and examples.

Example 1: Determining single nucleotide polymorphisms in the sequences of a single sample using the method of the invention

I. The target sequence to be measured is about 1200 bp long; three pairs of sequencing primers were designed, as shown in Table 1.

Table 1. Amplification primer information

Primer	Sequence (5'-3')	Position (HXB2)
PRT-F2	CTTTARCTTCCCTCARATCACTCT	2243-2266
RT-R2	CTTCTGTATGTCATTGACAGTCC	3326-3304
SeqF3	AGTCCTATTGARACTGTRCCAG	2556-2577
SeqR3	TTTYTCTTCTGTCAATGGCCA	2639-2619
SeqF4	CAGTACTGGATGTGGGRGAYG	2869-2889
SeqR4	TACTAGGTATGGTAAATGCAGT	2952-2931

II. The amplification products of the sequence to be measured are subjected to Sanger sequencing, yielding six Sanger sequencing abi files. Each file is named as sample name + primer name; the sequence names are LSYZ2017020-(SeqPRT-F2), LSYZ2017020-(SeqF3), LSYZ2017020-(SeqF4), LSYZ2017020-(SeqRT-R2), LSYZ2017020-(SeqR3) and LSYZ2017020-(SeqR4). These sequence files are placed in the same data input folder; the computer automatically identifies, from the sequence file names, the sequencing files that belong to the same sequence to be measured, and carries out the subsequent analysis.
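The name-based grouping described above can be sketched as follows. This is only an illustration, not the patented implementation: the function name `group_by_sample`, the `.ab1` file extension and the rule of splitting at the first hyphen are assumptions chosen to match the example names in the text.

```python
from collections import defaultdict
from pathlib import Path


def group_by_sample(filenames):
    """Map sample name -> {primer name: file name}.

    Assumes names of the form "<sample>-(<primer>).ab1" (hypothetical layout).
    """
    groups = defaultdict(dict)
    for name in filenames:
        stem = Path(name).stem                    # drop the file extension
        sample, _, primer = stem.partition("-")   # split at the first hyphen only
        primer = primer.strip(" ()")              # "(SeqPRT-F2)" -> "SeqPRT-F2"
        groups[sample][primer] = name
    return dict(groups)


files = [
    "LSYZ2017020-(SeqPRT-F2).ab1",
    "LSYZ2017020-(SeqF3).ab1",
    "LSYZ2017032-(SeqR4).ab1",
]
grouped = group_by_sample(files)
```

Splitting at the first hyphen matters because primer names such as SeqPRT-F2 contain a hyphen themselves.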

III. The Sanger sequencing abi files are converted into text format files, and the base quality signal values are normalized.
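The text does not say which normalization is used. As one possible reading, the sketch below scales the four channel signals at each base position so that they sum to 1; the function name `normalize_signals` and the per-position dictionary layout are assumptions made for illustration only.

```python
def normalize_signals(signals):
    """Scale the four channel values at each position so they sum to 1.

    signals: list of dicts such as {"A": 300, "C": 100, "G": 0, "T": 0},
    one dict per base position (hypothetical text-format layout).
    """
    normalized = []
    for channels in signals:
        total = sum(channels.values())
        if total == 0:
            # No signal at this position: keep zeros rather than divide by zero.
            normalized.append({base: 0.0 for base in channels})
        else:
            normalized.append({base: value / total for base, value in channels.items()})
    return normalized
```

A per-position scaling of this kind leaves the ratio between the second-largest and largest signal unchanged, so it would not alter the SNP test of step 6.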

IV. Sequence regions below the preset base quality value are deleted by the sliding-window method, and the base quality values corresponding to those regions are removed synchronously, giving high-quality base sequences and their corresponding base quality values. The detailed procedure is as follows:

1. Sliding-window lengths of 5 bp, 10 bp, 15 bp and 20 bp are set to examine the mean base quality of the sequence. When the mean base quality within the window is greater than the preset base quality value (default 50), sliding stops; the sequence below the preset base quality value is deleted, and the related information corresponding to the low-quality region is removed synchronously, giving the high-quality base sequence and its related data. The six Sanger sequencing abi files in the data input folder are processed in turn by this method: LSYZ2017020-(SeqPRT-F2), LSYZ2017020-(SeqF3), LSYZ2017020-(SeqF4), LSYZ2017020-(SeqRT-R2), LSYZ2017020-(SeqR3) and LSYZ2017020-(SeqR4).

Taking sample LSYZ2017020_SeqF3 as an example, sliding windows of 5 bp, 10 bp, 15 bp and 20 bp are applied to the raw sequencing read to examine the mean base quality. When the mean base quality is greater than the preset base quality value (default 50), sliding stops; the sequence below the preset base quality value is deleted, the related information corresponding to the low-quality region is removed synchronously, the high-quality base sequence and its related data are obtained, and a chromatogram file is exported. Because the sequencing read is too long to display in full, the 5' and 3' end chromatograms of the raw read and of the read after low-quality base removal with the 5 bp, 10 bp, 15 bp and 20 bp windows are excerpted and compared side by side, as shown in Fig. 1, so that the quality-assessment effect of the different window lengths can be compared. All the chromatogram comparisons below use the same 5' and 3' end excerpts. The remaining sequences LSYZ2017020-SeqF4, LSYZ2017020-SeqPRT-F2, LSYZ2017020-SeqR3, LSYZ2017020-SeqR4 and LSYZ2017020-SeqRT-R2 are processed in turn by the same method for low-quality base removal and the corresponding chromatogram comparison.

The 5' and 3' end chromatograms of the original sequence and of the sequences after base quality assessment and low-quality base removal with the four window lengths were compared. Based on the number of low-quality bases removed and the peak information of the high-quality bases retained, a sliding window of 5-10 bp is recommended for data quality filtering.
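A minimal sketch of the end trimming described above, under stated assumptions: the window slides inward from each end one base at a time, stops at the first window whose mean quality exceeds the threshold, and everything outside the two stopping windows is discarded together with its quality values. The function name `trim_ends` and its parameter names are hypothetical; the defaults follow the window length and default quality value given in the text.

```python
def trim_ends(bases, quals, win=5, min_mean_q=50):
    """Trim low-quality ends; returns (trimmed bases, trimmed quality values)."""
    n = len(bases)

    # Slide from the 5' end until a window's mean quality exceeds the threshold.
    start = None
    for i in range(0, n - win + 1):
        if sum(quals[i:i + win]) / win > min_mean_q:
            start = i
            break
    if start is None:
        return "", []          # no window passes: the whole read is low quality

    # Slide from the 3' end in the same way (j is the exclusive window end).
    end = n
    for j in range(n, win - 1, -1):
        if sum(quals[j - win:j]) / win > min_mean_q:
            end = j
            break

    # Bases and quality values are cut with the same indices, so they stay in sync.
    return bases[start:end], quals[start:end]


read = "NNACGTACGTNN"
qualities = [10, 10, 60, 60, 60, 60, 60, 60, 60, 60, 10, 10]
trimmed_bases, trimmed_quals = trim_ends(read, qualities, win=4, min_mean_q=50)
```

A shorter window reacts to a single poor base near the ends, while a longer one averages it away, which is consistent with the recommendation of 5-10 bp.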

2. The sliding-window length is set to 5 bp to examine the mean base quality of the sequence. When the mean base quality is greater than the preset base quality value of 30, 40, 50 or 60, sliding stops; the sequence below the preset base quality value is deleted, and the related information corresponding to the low-quality region is removed synchronously, giving the high-quality base sequence and its related data. The six Sanger sequencing abi files in the data input folder are processed in turn by this method: LSYZ2017020-(SeqPRT-F2), LSYZ2017020-(SeqF3), LSYZ2017020-(SeqF4), LSYZ2017020-(SeqRT-R2), LSYZ2017020-(SeqR3) and LSYZ2017020-(SeqR4).

Taking sample LSYZ2017020_SeqF3 as an example, the sliding-window length is set to 5 bp to examine the mean base quality. When the mean base quality is greater than the preset base quality value of 30, 40, 50 or 60, sliding stops; the sequence below the preset base quality value is deleted, the related information corresponding to the low-quality region is removed synchronously, the high-quality base sequence and its related data are obtained, and a chromatogram file is exported. The 5' and 3' end chromatograms of the raw read and of the read after removal of regions whose mean base quality is below the preset value of 30, 40, 50 and 60 are excerpted and compared side by side, as shown in Fig. 2. The remaining sequences LSYZ2017020-SeqF4, LSYZ2017020-SeqPRT-F2, LSYZ2017020-SeqR3, LSYZ2017020-SeqR4 and LSYZ2017020-SeqRT-R2 are processed in turn by the same method for low-quality base removal and the corresponding chromatogram comparison. The 5' and 3' end chromatograms of the original sequence and of the sequences after base quality assessment and low-quality base removal with the four base quality values were compared. Based on the number of low-quality bases removed and the peak information of the high-quality bases retained, a quality value of 50-60 is recommended for data quality filtering.

V. For the high-quality base sequence files derived from the same sequence to be measured, a k-mer hash table is built with the k-mer sequence as the key and the number of times that k-mer occurs in the whole sequence as the value corresponding to the key, and the high-quality base sequences are spliced and assembled using the k-mer value.

1. The six filtered high-quality sequence files of one sample are sorted according to the primer name in the sample name (seq1-seq2-seq3-seq4-seq5-seq6), and a k-mer hash table is built for each of the two sequences to be spliced (seq1-seq2, seq2-seq3, seq3-seq4, etc.), with the k-mer sequence as the key and the number of times that k-mer occurs in the whole sequence as the value corresponding to the key. The keys of the hash tables of the two sequences are searched for an identical key representing the same k-mer sequence, where the key must correspond to a unique value. When no identical key exists in the two sequences, the k-mer value is decreased by 1 and the search continues until the maximum k-mer value is found. Based on the position information of all k-mer sequences that are present in both sequences and are unique, the maximum overlapping interval between the two sequences is located and its position in each of the two sequences is obtained;
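The search described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the hash table stores the start positions of each k-mer (the occurrence count is the length of that list), a k-mer counts as "unique" when it occurs exactly once in each sequence, and the lower bound `k_min` is a hypothetical safeguard that the text does not mention.

```python
from collections import defaultdict


def kmer_table(seq, k):
    """Hash table: k-mer -> list of start positions (list length = occurrence count)."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def find_overlap(seq_a, seq_b, k_start=100, k_min=10):
    """Decrease k from k_start until the two sequences share a unique k-mer.

    Returns the k used and the overlap interval (start, end) in each sequence,
    or None when no shared unique k-mer exists down to k_min.
    """
    k = min(k_start, len(seq_a), len(seq_b))
    while k >= k_min:
        table_a, table_b = kmer_table(seq_a, k), kmer_table(seq_b, k)
        shared = sorted(
            (table_a[kmer][0], table_b[kmer][0])
            for kmer in table_a
            if len(table_a[kmer]) == 1 and len(table_b.get(kmer, ())) == 1
        )
        if shared:
            (first_a, first_b), (last_a, last_b) = shared[0], shared[-1]
            return {"k": k, "a": (first_a, last_a + k), "b": (first_b, last_b + k)}
        k -= 1                      # no shared unique k-mer: shorten k and retry
    return None


left = "GATTACAGGCTTAACG"
right = "GCTTAACGTTCCAGTA"
overlap = find_overlap(left, right, k_start=10, k_min=4)
```

In this toy case the search fails at k = 10 and k = 9 and succeeds at k = 8, locating the shared segment at the 3' end of `left` and the 5' end of `right`.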

Taking sample LSYZ2017020 as an example, the six filtered high-quality sequence files of this sample are identified and sorted according to the primer name in the sample name as SeqR3-SeqR4-SeqPRT-F2-SeqRT-R2-SeqF3-SeqF4, as shown in Fig. 3. Sequence splicing and assembly are then carried out by the above method with Kmer set to 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 160, 180 and 200, and the assembly results are evaluated.

The 16 assembly result sequences obtained with the different Kmer values were aligned using MEGA software. As shown in Fig. 3, the assembled sequence is 1061 bp long under all Kmer conditions. The left column of the figure lists the names of the assembly results for the different Kmer values, and the column to its right shows the alignment of the assembled sequences. The four bases A, T, C and G are marked in green, red, light blue and purple, respectively, and degenerate bases are not colored. Where all aligned sequences have the same base at a position, the row at the top of the figure is marked with *; otherwise it is blank. The alignment shows that the 16 assembly results are identical in length, but that SNP identification in the assembled sequences differs.

Table 2. SNP statistics for the assembly results of sample LSYZ2017020 at different Kmer values

The SNP sites identified in the 16 assembly results and their corresponding bases were counted, as shown in Table 2. The left column gives the position coordinate of each SNP in the sequence, the top row gives the 16 Kmer values, and the remaining columns give the bases corresponding to each SNP; Ref denotes the reference sequence base and Alt denotes the mutated base in the assembled sequence. Analysis of the table shows that when the Kmer value exceeds 150 bp, some SNP sites go undetected. Taking the test results of multiple samples together, a k-mer value of 90-150 bp is recommended when this method is used for sequence splicing and assembly.

2. Display of the splicing and assembly results of the six high-quality sequences

As shown in Fig. 4, the six high-quality sequences from the same sample are sorted according to the primer name in the sample name as SeqR3-SeqR4-SeqPRT-F2-SeqRT-R2-SeqF3-SeqF4 (Fig. 4-A), and sequence splicing and assembly are then carried out by the above Kmer method. Fig. 4-B shows the actual alignment in which SeqR3, SeqR4 and the 5' end of SeqPRT-F2 agree; according to the splicing scheme, the aligned sequences in this region produce one consensus sequence after assembly, which serves as the spliced and assembled sequence. Fig. 4-C shows the actual alignment in which the 3' end of SeqR3, the 3' end of SeqR4, the 5' end of SeqPRT-F2 and the 5' end of SeqRT-R2 agree. Fig. 4-D shows the actual alignment in which the 3' end of SeqR4, the 3' end of SeqPRT-F2, the 3' end of SeqRT-R2, the 3' end of SeqF3 and the 5' end of SeqF4 agree.

VI. Based on the high-quality base sequences and corresponding base quality values obtained in step IV, the ratio of the second-largest base signal value to the largest base signal value is obtained for each base position; when this ratio is greater than the preset value, the position is evaluated as a polymorphic site of the spliced and assembled sequence, and the polymorphic site information of each sequence to be measured is then stored and output.

1. Based on the high-quality sequences and associated data files obtained after filtering in step IV, the ratio of the second-largest base signal value to the largest base signal value is obtained for each base position; when this ratio is greater than the preset value (set to 0.2, 0.25, 0.33 and 0.5 in turn), the position is evaluated as a polymorphic site;
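The ratio test can be sketched as follows. This is an illustration under stated assumptions: signals are given as one dictionary of four channel values per position, positions are reported 1-based, and the two strongest bases are summarized by the standard IUPAC two-base ambiguity code. The function name `call_snps` and the tuple layout of the result are hypothetical.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_snps(signals, cutoff=0.25):
    """Return (position, main base, secondary base, IUPAC code) for each SNP.

    A position is called when second-largest / largest signal > cutoff.
    """
    calls = []
    for pos, channels in enumerate(signals, start=1):
        ranked = sorted(channels.items(), key=lambda kv: kv[1], reverse=True)
        (top_base, top), (second_base, second) = ranked[0], ranked[1]
        if top > 0 and second / top > cutoff:
            code = IUPAC[frozenset((top_base, second_base))]
            calls.append((pos, top_base, second_base, code))
    return calls


peaks = [
    {"A": 1000, "C": 20, "G": 30, "T": 10},   # ratio 0.03: clean A
    {"A": 600, "C": 10, "G": 400, "T": 5},    # ratio 0.67: A/G mixture
    {"A": 10, "C": 800, "G": 5, "T": 200},    # ratio 0.25: borderline C/T
]
strict = call_snps(peaks, cutoff=0.25)
loose = call_snps(peaks, cutoff=0.2)
```

The third position illustrates how the choice of cutoff changes the result: its ratio equals 0.25, so it is reported at a cutoff of 0.2 but not at 0.25, because the test is a strict "greater than".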

Using the four SNP evaluation preset values (cutoffs), the SNP sites in the assembled sequence and their corresponding bases were identified; the statistics are shown in Table 3. The left column gives the coordinate position of each SNP in the sequence, the top row gives the preset values, and the remaining columns give the bases detected at each SNP site; Ref denotes the reference sequence base, Alt denotes the mutated base in the assembled sequence, and entries with no detected SNP are marked as such. Table 3 shows that the SNP detection result at the same position differs among the preset values 0.2, 0.25, 0.33 and 0.5. The base chromatograms at coordinate positions 5, 401, 463, 554 and 812 (Fig. 5) were selected for detailed analysis; in the figure the left column gives the preset value and the columns to its right give the chromatograms of the bases at the 5 coordinate positions. Judging from the actual chromatograms, a cutoff value of at least 0.25 is recommended for polymorphic site analysis.

Table 3. Detection results at different SNP cutoff values

2. Sanger sequencing data splicing and SNP site identification results

The result output files are named after the input files (Fig. 6-A). The splicing result of the six sequences from the same sample is output in fasta format (Fig. 6-B), with SNP sites recorded as degenerate bases. The base quality values of the splicing result are converted into graphical signals and output as a pdf file (Fig. 6-C), with SNP sites marked by blue bars. The SNP site identification results are also output in Excel file format (Fig. 6-D), recording the coordinate position of each SNP site in the spliced sequence and the corresponding base.
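The two text outputs described above can be sketched as follows. The sketch is illustrative only: a tab-separated table is used here as a plain-text stand-in for the Excel file named in the text, and the function names, column headers and 60-character line width are assumptions.

```python
import csv
import io


def format_fasta(name, seq, width=60):
    """Return a FASTA record; degenerate bases at SNP sites are kept as-is."""
    lines = [f">{name}"] + [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def format_snp_table(calls):
    """Return a tab-separated table of (position, ref, alt) rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["position", "ref", "alt"])   # hypothetical column names
    writer.writerows(calls)
    return buffer.getvalue()


fasta_text = format_fasta("LSYZ2017020", "ACGTR", width=4)
table_text = format_snp_table([(5, "A", "G")])
```

Both functions return strings so that the caller decides where the files are written, for example under a name derived from the input file name.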

Example 2: Determining single nucleotide polymorphisms in the sequences of two samples using the method of the invention

I. Taking samples LSYZ2017032 and LSYZ2017041 as examples, sequencing of each sample yields six Sanger sequencing abi files. Each file is named as sample name + primer name; the sequence names are LSYZ2017032-(SeqPRT-F2), LSYZ2017032-(SeqF3), LSYZ2017032-(SeqF4), LSYZ2017032-(SeqRT-R2), LSYZ2017032-(SeqR3) and LSYZ2017032-(SeqR4); and LSYZ2017041-(SeqPRT-F2), LSYZ2017041-(SeqF3), LSYZ2017041-(SeqF4), LSYZ2017041-(SeqRT-R2), LSYZ2017041-(SeqR3) and LSYZ2017041-(SeqR4). These sequence files are placed in the same data input folder; the computer automatically identifies, from the sequence file names, the sequencing files that belong to the same sequence to be measured, and carries out the subsequent analysis.

II. The Sanger sequencing abi files are converted into text format files, and the base quality signal values are normalized.

III. Sequence regions below the preset base quality value are deleted by the sliding-window method, and the base quality values corresponding to those regions are removed synchronously, giving high-quality base sequences and their corresponding base quality values:

1. Sliding-window lengths of 5 bp, 10 bp, 15 bp and 20 bp are set to examine the mean base quality of the sequence. When the mean base quality within the window is greater than the preset base quality value (default 50), sliding stops; the sequence below the preset base quality value is deleted, and the related information corresponding to the low-quality region is removed synchronously, giving the high-quality base sequence and its related data. The 12 Sanger sequencing abi files in the data input folder are processed in turn by this method: LSYZ2017032-(SeqPRT-F2), LSYZ2017032-(SeqF3), LSYZ2017032-(SeqF4), LSYZ2017032-(SeqRT-R2), LSYZ2017032-(SeqR3) and LSYZ2017032-(SeqR4); and LSYZ2017041-(SeqPRT-F2), LSYZ2017041-(SeqF3), LSYZ2017041-(SeqF4), LSYZ2017041-(SeqRT-R2), LSYZ2017041-(SeqR3) and LSYZ2017041-(SeqR4).

Taking sample LSYZ2017032_SeqF3 as an example, sliding windows of 5 bp, 10 bp, 15 bp and 20 bp are applied to the raw sequencing read to examine the mean base quality. When the mean base quality is greater than the preset base quality value (default 50), sliding stops; the sequence below the preset base quality value is deleted, the related information corresponding to the low-quality region is removed synchronously, the high-quality base sequence and its related data are obtained, and a chromatogram file is exported. Because the sequencing read is too long to display in full, the 5' and 3' end chromatograms of the raw read and of the read after low-quality base removal with the 5 bp, 10 bp, 15 bp and 20 bp windows are excerpted and compared side by side (Fig. 7), so that the quality-assessment effect of the different window lengths can be compared. All the chromatogram comparisons below use the same 5' and 3' end excerpts. The remaining sequences LSYZ2017032-SeqF4, LSYZ2017032-SeqPRT-F2, LSYZ2017032-SeqR3, LSYZ2017032-SeqR4 and LSYZ2017032-SeqRT-R are processed in turn by the same method for low-quality base removal and the corresponding chromatogram comparison. The six sequences of sample LSYZ2017041 are likewise processed by the same method for low-quality base removal and the corresponding chromatogram comparison.

2. setting sliding window length is 5bp detection sequence base mass average value, preset when the average value of base quality is greater than Base mass value 30,40,50 and 60 when, stop sliding, delete the sequence lower than default base mass value, the low alkali of synchronization removal Relevant information corresponding to matrix amount sequence area obtains the base sequence and its related data information of high quality；Using this side Method successively handle data input file folder in 12 Sanger sequencing abi file, LSYZ2017032- (SeqPRT-F2), LSYZ2017032- (SeqF3), LSYZ2017032- (SeqF4), LSYZ2017032- (SeqRT-R2), LSYZ2017032- (SeqR3) and LSYZ2017032- (SeqR4)；LSYZ2017041- (SeqPRT-F2), LSYZ2017041- (SeqF3), LSYZ2017041- (SeqF4), LSYZ2017041- (SeqRT-R2), LSYZ2017041- (SeqR3) and LSYZ2017041- (SeqR4)。

Taking sample LSYZ2017032_SeqF3 as an example, the sliding window length was set to 5 bp and the mean base quality value was computed. When the mean base quality exceeds the preset base quality value (30, 40, 50 or 60), sliding stops; the sequence below the preset base quality value is deleted, the associated information for the low-quality region is removed at the same time, and the high-quality base sequence with its associated data is obtained and exported as a chromatogram file. The 5' and 3' end chromatograms of the raw read, and of the reads from which sequence with a mean base quality below the preset values 30, 40, 50 and 60 had been removed, were extracted and compared side by side (Fig. 8). The remaining sequences LSYZ2017032-SeqF4, LSYZ2017032-SeqPRT-F2, LSYZ2017032-SeqR3, LSYZ2017032-SeqR4 and LSYZ2017032-SeqRT-R2 were processed in turn by the same method for low-quality base removal and the corresponding chromatogram comparison. The six sequences of sample LSYZ2017041 were likewise processed by the above method.

The 5' and 3' end chromatograms of the original sequence and of the sequences obtained after base quality assessment and low-quality base removal at the four base quality values were compared. Based on the number of low-quality bases removed and the peak information of the high-quality bases retained, a quality value in the range 50-60 is recommended for data quality filtering.
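As an illustration only, the sliding-window filtering described above might be sketched in Python as follows. The function name `trim_low_quality`, its parameters, and the choice to trim inward from both the 5' and 3' ends are assumptions made for this sketch; the text does not prescribe a particular implementation.

```python
from typing import List, Optional, Tuple


def trim_low_quality(
    bases: str,
    quals: List[int],
    window: int = 5,          # sliding window length in bp (5-20 bp in the text)
    min_mean_q: float = 50.0  # preset base quality value (30-60 in the text)
) -> Tuple[str, List[int]]:
    """Trim both ends of a read until a window's mean quality exceeds min_mean_q.

    Hypothetical helper: bases and quality values are trimmed together so that
    the per-base information stays synchronized with the retained sequence.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")

    def first_good_window(q: List[int]) -> Optional[int]:
        # Slide one base at a time; stop at the first window whose mean passes.
        for i in range(len(q) - window + 1):
            if sum(q[i:i + window]) / window > min_mean_q:
                return i
        return None

    start = first_good_window(quals)
    if start is None:
        return "", []  # no window reaches the preset quality value

    # Repeat the scan from the 3' end by reversing the quality list.
    from_end = first_good_window(quals[::-1])
    end = len(quals) - from_end
    return bases[start:end], quals[start:end]
```

With a 5 bp window and a preset value of 50, a read whose first and last two bases have quality 5 and whose remaining bases have quality 60 is reduced to the central high-quality stretch, and its quality values are trimmed to match.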

Four, for the high-quality base sequence files derived from the same sequence to be measured, build a kmer hash table in which each kmer sequence is a key and the number of times that kmer sequence occurs in the whole sequence is the value for that key, and use the kmer value to splice and assemble the high-quality base sequences:

1. Sort the six filtered high-quality sequence files of a sample according to the Primer in the sample name (seq1-seq2-seq3-seq4-seq5-seq6), and build kmer hash tables for each pair of sequences to be spliced (seq1-seq2, seq2-seq3, seq3-seq4, etc.), with each kmer sequence as a key of the hash table and the number of times that kmer sequence occurs in the whole sequence as the corresponding value. Search the keys of the two sequences' hash tables for an identical key, that is, the same kmer sequence, that occurs only once in each sequence. If the two sequences share no identical key, decrease the kmer value by 1 and search again, until the largest kmer value is found. Based on the position information of all kmer sequences that are present in both sequences and unique in each, locate the maximum overlap interval between the two sequences and obtain the position of that interval in each of the two sequences.
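The kmer search above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `kmer_table`, `find_overlap` and `splice` are hypothetical, the table stores the start positions of each kmer (so the occurrence count is the length of that list), and `splice` assumes the overlap lies at the 3' end of the first sequence and the 5' end of the second.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates


def kmer_table(seq: str, k: int) -> Dict[str, List[int]]:
    """Hash table: kmer sequence -> start positions (count = len of the list)."""
    table: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def find_overlap(a: str, b: str, k_start: int,
                 k_min: int = 1) -> Optional[Tuple[int, Interval, Interval]]:
    """Decrease k from k_start until a kmer unique in both sequences is shared.

    Returns (k, interval in a, interval in b) spanning all shared unique kmers,
    or None if no shared unique kmer exists down to k_min.
    """
    k = min(k_start, len(a), len(b))
    while k >= k_min:
        ta, tb = kmer_table(a, k), kmer_table(b, k)
        # Keep kmers that occur exactly once in each of the two sequences.
        shared = [(ta[key][0], tb[key][0]) for key in ta
                  if len(ta[key]) == 1 and len(tb.get(key, ())) == 1]
        if shared:
            a_pos = [p for p, _ in shared]
            b_pos = [p for _, p in shared]
            return (k,
                    (min(a_pos), max(a_pos) + k),
                    (min(b_pos), max(b_pos) + k))
        k -= 1  # no common key: subtract 1 from the kmer value and retry
    return None


def splice(a: str, b: str, k_start: int) -> Optional[str]:
    """Join a and b through their overlap (assumes a's 3' end meets b's 5' end)."""
    hit = find_overlap(a, b, k_start)
    if hit is None:
        return None
    _, (_, a_end), (_, b_end) = hit
    return a[:a_end] + b[b_end:]
```

The sketch simply takes the first sequence's bases across the overlap; building a consensus from the aligned region, as described for the assembled result, would require comparing the two reads base by base and is not shown.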

Taking samples LSYZ2017032 and LSYZ2017041 as examples, the six filtered high-quality sequence files identified for each sample were sorted by the Primer in the sample name in the order SeqR3-SeqR4-SeqPRT-F2-SeqRT-R2-SeqF3-SeqF4 and then spliced and assembled by the method described above. Kmer was set to 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 160, 180 and 200 in turn, and the assembly results were evaluated.

The assembly results obtained with the 16 Kmer values for each of samples LSYZ2017032 and LSYZ2017041 were aligned using MEGA software. The results (Fig. 9 and 10) show that under all Kmer conditions the assembled sequence lengths were 1056 bp and 1057 bp, respectively. The left-hand column of each figure shows the sequence names of the assembly results for the different Kmer values, and the right-hand column shows the alignment of the assembled sequences. After alignment, the four bases A, T, C and G are marked in green, red, light blue and purple, respectively, and degenerate bases are not colored. Where all aligned sequences have the same base at a given position, the row at the top of the figure is marked with *; otherwise it is left blank. The alignment shows that, for each sample, the assembled lengths obtained with the 16 Kmer values are identical, but SNP identification in the assembled sequences differs.

Table 4. SNP statistics of the assembly results for sample LSYZ2017032 at different Kmer values

Table 5. SNP statistics of the assembly results for sample LSYZ2017041 at different Kmer values

The SNP sites identified in the assembly results for the 16 Kmer values of the two samples, together with their corresponding bases, were tallied separately, as shown in Tables 4 and 5. The left-hand column gives the position coordinate of the SNP in the sequence, the top row gives the 16 Kmer values, and the right-hand columns give the base corresponding to each SNP; Ref denotes the reference sequence base and Alt denotes the variant base in the assembled sequence. Analysis of the tables shows that when the Kmer value exceeds 150 bp, some SNP sites go undetected. Combined with the test results from multiple samples, a kmer value in the range 90-150 bp is recommended when the method of the invention is used for sequence splicing and assembly.

The six high-quality sequences from the same sample were sorted by the Primer in the sample name in the order SeqR3-SeqR4-SeqPRT-F2-SeqRT-R2-SeqF3-SeqF4 (Figure 11-A and 12-A) and then spliced and assembled by the Kmer method described above. Figure 11-B and 12-B show the region in which SeqR3, SeqR4 and the 5' end of SeqPRT-F2 align consistently; following the splicing scheme, the aligned sequences in this region yield one consensus sequence after assembly, which serves as the spliced sequence. Figure 11-C and 12-C show the region in which the 3' end of SeqR3, the 3' end of SeqR4, the 5' end of SeqPRT-F2 and the 5' end of SeqRT-R2 align consistently. Figure 11-D and 12-D show the region in which the 3' end of SeqR4, the 3' end of SeqPRT-F2, the 3' end of SeqRT-R2, the 3' end of SeqF3 and the 5' end of SeqF4 align consistently.

Five, based on the high-quality sequences and associated data files obtained after filtering in step 4, obtain for each base position the ratio of the second-largest base signal value to the largest base signal value; when this ratio exceeds the preset value (set to 0.2, 0.25, 0.33 and 0.5 in turn), the position is assessed as a polymorphic site.
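The ratio test above might be sketched as follows, purely as an illustration. The names `secondary_peak_ratio` and `call_polymorphic_sites`, the per-position dictionary of four signal values, and the 1-based coordinates are assumptions of this sketch; the default cut-off of 0.25 is one of the preset values evaluated in the text.

```python
from typing import Dict, List, Tuple


def secondary_peak_ratio(signals: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (primary base, secondary base, second-largest / largest signal)."""
    ranked = sorted(signals.items(), key=lambda kv: kv[1], reverse=True)
    (primary, top), (secondary, second) = ranked[0], ranked[1]
    ratio = second / top if top > 0 else 0.0  # guard against an empty trace
    return primary, secondary, ratio


def call_polymorphic_sites(
    trace: List[Dict[str, float]],
    cutoff: float = 0.25,
) -> List[Tuple[int, str, str, float]]:
    """Flag positions whose secondary/primary signal ratio exceeds the cut-off.

    trace holds one {'A': .., 'C': .., 'G': .., 'T': ..} dict per base position;
    returned positions are 1-based (an assumption of this sketch).
    """
    sites = []
    for pos, signals in enumerate(trace, start=1):
        primary, secondary, ratio = secondary_peak_ratio(signals)
        if ratio > cutoff:
            sites.append((pos, primary, secondary, ratio))
    return sites
```

Because the comparison is strict, a position whose ratio equals the cut-off exactly is not reported; lowering the cut-off reports more candidate sites, which is the trade-off examined with the four preset values.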

Using the four SNP assessment preset values (cut-off values), the SNP sites in the assembled sequences and their corresponding bases were identified. The statistics for samples LSYZ2017032 and LSYZ2017041 are shown in Tables 6 and 7: the left-hand column gives the coordinate position of the SNP in the sequence, the top row gives the preset values, and the right-hand columns give the base detected at the SNP site; Ref denotes the reference sequence base, Alt denotes the variant base in the assembled sequence, and sites where no SNP was detected are marked as such. In Table 6 (sample LSYZ2017032), the SNP detection results at the same position differ across the preset values 0.2, 0.25, 0.33 and 0.5; the base chromatograms at coordinate positions 28, 100, 254 and 1045 (Figure 13) were selected for detailed analysis. In Table 7 (sample LSYZ2017041), the SNP detection results at the same position likewise differ across the preset values 0.2, 0.25, 0.33 and 0.5; the base chromatograms at coordinate positions 459, 653, 656, 683, 696 and 736 (Figure 14) were selected for detailed analysis. In the figures, the left-hand column gives the preset value and the right-hand columns give the base chromatograms at the 5 different coordinate positions. Judging from the actual chromatograms, a minimum cut-off value of 0.25 is recommended for polymorphic site analysis.

Table 6. Detection results for sample LSYZ2017032 at different SNP cut-off values

Table 7. Detection results for sample LSYZ2017041 at different SNP cut-off values

The result output files are named after the input file name (Figure 15-A). The splicing result for the six sequences derived from sample LSYZ2017032 is output in fasta format (Figure 15-B), with single nucleotide polymorphism sites recorded as degenerate bases; the base quality values of the splicing result are converted into a graphical signal and output as a pdf file (Figure 15-D), with single nucleotide polymorphism sites marked by blue bars; and the single nucleotide polymorphism site identification results are output as an Excel file (Figure 15-F), recording the coordinate position of each SNP site in the spliced sequence and the corresponding base. Likewise, the splicing result for the six sequences derived from sample LSYZ2017041 is output in fasta format (Figure 15-C), with single nucleotide polymorphism sites recorded as degenerate bases; the base quality values of the splicing result are converted into a graphical signal and output as a pdf file (Figure 15-E), with single nucleotide polymorphism sites marked by blue bars; and the single nucleotide polymorphism site identification results are output as an Excel file (Figure 15-G), recording the coordinate position of each SNP site in the spliced sequence and the corresponding base.
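Recording a two-base polymorphic site as a degenerate base in the fasta output can be illustrated with the standard IUPAC ambiguity codes (R = A/G, Y = C/T, S = C/G, W = A/T, K = G/T, M = A/C). The sketch below is an assumption-laden example rather than the output routine of the invention: the function names, the 1-based site coordinates and the 60-character line width are all hypothetical.

```python
from typing import Iterable, Tuple

# Standard IUPAC two-base ambiguity codes.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def apply_degenerate_codes(seq: str,
                           sites: Iterable[Tuple[int, str, str]]) -> str:
    """Replace each (1-based position, base1, base2) site with its IUPAC code."""
    out = list(seq)
    for pos, base1, base2 in sites:
        if base1 == base2:
            continue  # not polymorphic; leave the called base unchanged
        out[pos - 1] = IUPAC_TWO_BASE[frozenset((base1, base2))]
    return "".join(out)


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one record as fasta text with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, a site at position 2 with bases C and T turns the sequence ACGT into AYGT before it is written out.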

It should be understood that the disclosed invention is not limited to the specific methods, schemes and materials described, as these may vary. It should also be understood that the terminology used herein is solely for the purpose of describing specific embodiments and is not intended to limit the scope of the invention, which is limited only by the appended claims.

Those skilled in the art will also recognize, or be able to ascertain using no more than routine experimentation, many equivalents of the specific embodiments of the invention described herein. Such equivalents are also encompassed by the appended claims.

Claims

1. A method for determining single nucleotide polymorphisms in Sanger sequencing, characterized in that the method comprises the following steps:

Step 1: according to the length of each sequence to be measured, design N pairs of primers for PCR amplification, where N is an integer not less than 2 and the N pairs of primers completely cover the sequence to be measured;

Step 2: perform Sanger sequencing on the amplification products of each sequence to be measured, each sequence to be measured producing 2N Sanger sequencing abi files, and name the abi files of each sequence to be measured so that files derived from the same sequence to be measured can be identified by name;

Step 3: convert the Sanger sequencing abi files to text-format files and normalize the base signal values;

Step 4: delete the sequence below a preset base quality value by a sliding window method, simultaneously removing the base quality values corresponding to the sequence region below the preset base quality value, to obtain a high-quality base sequence and the corresponding base quality values;

Step 5: for the high-quality base sequence files derived from the same sequence to be measured, build a kmer hash table in which each kmer sequence is a key and the number of times that kmer sequence occurs in the whole sequence is the value for that key, and use the kmer value to splice and assemble the high-quality base sequences; and

Step 6: based on the high-quality base sequence and corresponding base quality values obtained in step 4, obtain for each base position the ratio of the second-largest base signal value to the largest base signal value; when this ratio exceeds a preset value, assess the polymorphic sites of each spliced and assembled sequence, and then store and output the polymorphic site information for each sequence to be measured.
2. The method according to claim 1, characterized in that, in step 4, the sliding window used in the sliding window method ranges from 5 to 20 bp, preferably from 5 to 10 bp.
3. The method according to claim 1, characterized in that, in step 4, the preset base quality value is 30-60, preferably 50-60.
4. The method according to claim 1, characterized in that, in step 2, each abi file of the sequence to be measured is named in the form "name of the sequence to be measured + Primer".
5. The method according to claim 4, characterized in that the high-quality sequence files derived from the same sequence to be measured are sorted according to the Primer in the name of the sequence to be measured; kmer hash tables are built for each pair of adjacent sequences to be spliced, with each kmer sequence as a key of the hash table and the number of times that kmer sequence occurs in the whole sequence as the corresponding value; the keys of the hash tables of the two adjacent sequences are searched for an identical key representing the same kmer sequence that occurs only once in each sequence; when the two sequences share no identical key, the kmer value is decreased by 1 and the search continues until the largest kmer value is found; and, based on the position information of all kmer sequences that are present in both adjacent sequences and unique in each, the maximum overlap interval between the two sequences is located and the position of that interval in each of the two sequences is obtained.
6. The method according to claim 5, characterized in that the kmer value ranges from 90 to 150 bp.
7. The method according to claim 1, characterized in that, in step 6, the preset value is not less than 0.25.
8. The method according to claim 1, characterized in that, in step 6, the single nucleotide polymorphism site identification results are output in Excel file format, recording the coordinate position of each single nucleotide polymorphism site in the spliced sequence and the corresponding base.