Summary of the Invention
The main purpose of the present invention is to provide a processing method for high-throughput sequencing data, a processing device, a storage medium, and a processor, so as to solve the problem that existing processing results contain many false-positive sites.
To achieve the above purpose, according to one aspect of the present invention, a processing method for high-throughput sequencing data is provided. The processing method includes: obtaining level-two sequencing sequences, a level-two sequencing sequence being a sequencing sequence in the high-throughput sequencing data that can be identified by an amplification primer of a target fragment and from which the corresponding amplification primer has been removed; aligning the level-two sequencing sequences against a reference genome sequence to obtain a primary variation result; and correcting the primary variation result using the mutation data in known mutation data to obtain a processing result.
Further, the step of obtaining the level-two sequencing sequences includes: filtering low-quality sequencing data out of the off-machine high-throughput sequencing data to obtain level-one sequencing sequences, low-quality sequencing data being sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and removing the corresponding amplification primers from the identified sequences to obtain the level-two sequencing sequences.
Further, the step of aligning the level-two sequencing sequences against the reference genome sequence to obtain the primary variation result includes: extracting, from the reference genome sequence, the reference alignment sequence of each corresponding target fragment according to the position information of the amplification primers of the target fragment; and aligning the level-two sequencing sequences against the reference alignment sequence to obtain the primary variation result.
Further, after the level-two sequencing sequences are aligned against the reference alignment sequence and before the primary variation result is obtained, the processing method further includes: aligning the level-two sequencing sequences against the reference alignment sequence to obtain aligned sequences; determining, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence that is inconsistent with the information of the reference alignment sequence; and, if abnormal sequences exist, filtering them out of the aligned sequences and counting, at each position, the matches and mismatches between the remaining sequences and the reference alignment sequence to obtain the primary variation result.
Further, the step of correcting the primary variation result using the mutation data in the known mutation data to obtain the processing result includes: screening, from the known mutation data, the known mutation data located in the regions corresponding to the target fragments to obtain local sequences of known mutations; screening, from the primary variation result, the variant sites that are also present in the known mutation data to form local sequences of the primary variation result; and comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.
Further, the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result includes: comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain a level-two variation result; and correcting the level-two variation result to obtain the processing result. The step of correcting the level-two variation result includes: determining whether adjacent mutation sites exist in the level-two variation result; if so, determining whether the variation frequencies of the adjacent mutation sites differ significantly and whether supporting sequences exist; and, if there is no significant difference and supporting sequences exist, merging the adjacent mutation sites to obtain the processing result.
Further, the step of identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain the identified sequences includes: Step A, looping over all amplification primers of the target fragments, extracting specific sequences of length L starting from the 5' end of each amplification primer, and recording, for each pair of amplification primers, the number of specific sequences, the corresponding specific sequences, and the length of the primer sequence remaining after each specific sequence; Step B, varying the length L and repeating Step A to obtain sets containing different numbers of specific sequences for all amplification primers, and selecting the length L corresponding to the set with the most specific sequences, together with that set of specific sequences, for subsequent analysis; Step C, looping over every sequence in the level-one sequencing sequences, taking the first 25 to 35 bp of each sequence, and, starting from the 5' end, extracting subsequences of the length L corresponding to the set with the most specific sequences, to obtain a set of extracted sequencing subsequences; Step D, looking up the amplification primers corresponding to the specific sequences in the set with the most specific sequences, finding the amplification primer that occurs most often in the set of extracted sequencing subsequences and its number of occurrences, and, when this maximum number of occurrences exceeds a set second threshold, considering that the level-one sequencing sequence was amplified by the most frequent amplification primer and recording the level-one sequencing sequence as an identified sequence.
Further, the step of removing the corresponding amplification primer from the identified sequence to obtain the level-two sequencing sequence includes: removing the amplification primer from the identified sequence according to the position at which a specific sequence of the most frequent amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, to obtain the level-two sequencing sequence.
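Steps A to D and the primer removal step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a "specific sequence" is interpreted here as a length-L substring that occurs in exactly one primer, and the names `specific_kmers`, `choose_k`, `identify_and_trim`, the 30 bp window, and the default `second_threshold` are all hypothetical.

```python
from collections import Counter, defaultdict


def specific_kmers(primers, k):
    """Step A (assumed reading): map each k-mer that occurs in exactly one
    primer to (primer name, number of primer bases left after the k-mer)."""
    owners = defaultdict(set)
    for name, seq in primers.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    table = {}
    for name, seq in primers.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if owners[kmer] == {name}:
                table[kmer] = (name, len(seq) - (i + k))
    return table


def choose_k(primers, k_values):
    """Step B: keep the length L that yields the most specific sequences."""
    return max(k_values, key=lambda k: len(specific_kmers(primers, k)))


def identify_and_trim(read, table, k, window=30, second_threshold=3):
    """Steps C and D plus primer removal. Returns (primer name, trimmed read),
    or None when no primer is hit more than second_threshold times."""
    head = read[:window]  # first 25 to 35 bp of the read; 30 is an assumption
    hits = Counter()
    primer_end = {}
    for i in range(len(head) - k + 1):
        hit = table.get(head[i:i + k])
        if hit is not None:
            name, remaining = hit
            hits[name] += 1
            # End of the primer = end of the last specific k-mer seen
            # plus the primer bases remaining after that k-mer.
            primer_end[name] = i + k + remaining
    if not hits:
        return None
    best, count = hits.most_common(1)[0]
    if count <= second_threshold:
        return None
    return best, read[primer_end[best]:]
```

For example, with two hypothetical primers `{"P1": "ACGTACGGTTCA", "P2": "TTGACCATGGCA"}` and L = 6, a read that starts with the P1 primer is assigned to P1 and returned without its first 12 bases.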
Further, the step of screening, from the known mutation data, the known mutation data located in the regions corresponding to the target fragments to obtain the local sequences of known mutations includes: screening the regions of the known mutation data that correspond to the target fragments to form known mutation regions; recording the start position and end position of each known mutation in the known mutation regions, extending from the start position and the end position toward both ends, and recording the start position and end position after extension; when the extended start position and end position lie within the region corresponding to the target fragment, taking the sequence between the extended start position and the extended end position as the local sequence of the known mutation; and, when the extended start position and/or end position lies beyond the region corresponding to the target fragment, taking the boundary of the region corresponding to the target fragment as the start position and/or end position of the local sequence of the known mutation.
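The extension and boundary rule above can be sketched as a small helper. The function name `local_window` and the extension length `flank` are hypothetical, since the text does not state how far the positions are extended; coordinates are treated as inclusive.

```python
def local_window(mut_start, mut_end, region_start, region_end, flank=5):
    """Extend a mutation's start and end positions by `flank` bases on each
    side, clamping to the boundary of the target-fragment region."""
    start = max(mut_start - flank, region_start)
    end = min(mut_end + flank, region_end)
    return start, end
```

A mutation well inside the region keeps its full extended window, while a mutation near the edge of the region has its window cut at the region boundary.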
Further, the step of screening, from the primary variation result, the variant sites that are also present in the known mutation data to form the local sequences of the primary variation result includes: screening, from the primary variation result, the variant sites that are also present in the known mutation data; recording the start position and end position of each variant site; and extending from the start position and the end position toward both ends until the positions corresponding to the local sequence of the known mutation are reached, the resulting sequence being the local sequence of the primary variation result.
Further, the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the level-two variation result includes: searching whether a primary variation result exists in the local sequence of the known mutation corresponding to each target fragment; if one primary variation result exists, extending from the start position and end position of that variation result toward both ends to form one sample-mutation local sequence; if multiple primary variation results exist, determining whether the variation frequencies of the multiple primary variation results differ significantly; if all of them differ significantly, extending from the start position and end position of each primary variation result toward both ends to form respective sample-mutation local sequences; if some primary variation results show no significant difference, preliminarily judging those primary variation results to be linked and merging them into a single sample-mutation local sequence, while each remaining primary variation result that does differ significantly forms its own sample-mutation local sequence; determining whether each sample-mutation local sequence is identical to the local sequence of the known mutation and, if identical, calibrating the primary variation result to the known mutation result, or, if different, performing no calibration; and merging the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the level-two variation result.
Further, the step of correcting the level-two variation result includes a step of determining whether the multiple primary variation results judged to be linked contain a false positive. This step includes: extracting the sequences that cover the multiple variation results simultaneously, and counting the proportion of level-two sequencing sequences that support the multiple variation results simultaneously; if the proportion of level-two sequencing sequences supporting the multiple variation results simultaneously does not differ significantly from the proportion of sequences supporting each individual variation result among them, confirming that the multiple primary variation results occur in linkage, recalculating the mutation frequency as a linked mutation, and obtaining a corrected mutation result when the recalculated mutation frequency meets a third threshold; if the proportion of level-two sequencing sequences supporting the multiple variation results simultaneously differs significantly from the proportion of sequences supporting each individual variation result among them, confirming that the linkage of the multiple primary variation results is a false positive, splitting the merged variation results and recalculating the mutation frequency, and obtaining a corrected mutation result when the recalculated mutation frequency meets the third threshold; and merging the corrected mutation results with the uncorrected mutation results to obtain the processing result.
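The linkage check can be illustrated with a short sketch. The text does not name the statistical test used to decide whether a "significant difference" exists, so the pooled two-proportion z-test below is purely an assumption chosen because it needs only the standard library; it also treats the two counts as independent samples, which is a simplification. The names `proportions_differ`, `classify_linkage`, and `alpha` are hypothetical.

```python
import math


def proportions_differ(k1, n1, k2, n2, alpha=0.05):
    """Two-sided pooled two-proportion z-test (an assumed stand-in for the
    unspecified significance test). True if k1/n1 and k2/n2 differ."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return False  # both proportions are 0 or 1, hence identical
    z = (k1 / n1 - k2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_value < alpha


def classify_linkage(joint_support, single_supports, depth, alpha=0.05):
    """Return 'linked' when the number of reads supporting all variants at
    once is consistent with the support of every individual variant, and
    'split' (linkage judged a false positive) otherwise."""
    for support in single_supports:
        if proportions_differ(joint_support, depth, support, depth, alpha):
            return "split"
    return "linked"
```

With 1000 covering reads, joint support of 48 reads against individual supports of 50 and 49 is classified as linked, whereas joint support of 5 reads against 50 and 60 is classified as split. In either case the mutation frequency would then be recalculated and checked against the third threshold as described above.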
To achieve the above purpose, according to one aspect of the present invention, a processing device for high-throughput sequencing data is provided. The processing device includes: a level-two sequencing sequence acquiring unit, configured to obtain level-two sequencing sequences, a level-two sequencing sequence being a sequencing sequence in the high-throughput sequencing data that can be identified by an amplification primer of a target fragment and from which the corresponding amplification primer has been removed; a primary variation result acquiring unit, configured to align the level-two sequencing sequences against a reference genome sequence to obtain a primary variation result; and a correction unit, configured to correct the primary variation result using the mutation data in known mutation data to obtain a processing result.
Further, the level-two sequencing sequence acquiring unit includes: a filtering module, configured to filter low-quality sequencing data out of the off-machine high-throughput sequencing data to obtain level-one sequencing sequences, low-quality sequencing data being sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; an identification module, configured to identify the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and a removal module, configured to remove the corresponding amplification primers from the identified sequences to obtain the level-two sequencing sequences.
Further, the primary variation result acquiring unit includes: an extraction module, configured to extract, from the reference genome sequence, the reference alignment sequence of each corresponding target fragment according to the position information of the amplification primers of the target fragment; and a first alignment module, configured to align the level-two sequencing sequences against the reference alignment sequence to obtain the primary variation result.
Further, for operation after the level-two sequencing sequences are aligned against the reference alignment sequence and before the primary variation result is obtained, the first alignment module further includes: a first alignment submodule, configured to align the level-two sequencing sequences against the reference alignment sequence to obtain aligned sequences; a judging submodule, configured to determine, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence that is inconsistent with the information of the reference alignment sequence; and a filtering submodule, configured to, when the judging submodule determines that abnormal sequences exist, filter the abnormal sequences out of the aligned sequences and count, at each position, the matches and mismatches between the remaining sequences and the reference alignment sequence to obtain the primary variation result.
Further, the correction unit includes: a known-mutation local sequence module, configured to screen, from the known mutation data, the known mutation data located in the regions corresponding to the target fragments to obtain local sequences of known mutations; a primary-variation-result local sequence module, configured to screen, from the primary variation result, the variant sites that are also present in the known mutation data to form local sequences of the primary variation result; and a second alignment module, configured to compare the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.
Further, the second alignment module includes: a second alignment submodule, configured to compare the local sequences of the primary variation result with the local sequences of the known mutations to obtain a level-two variation result; and a correction submodule, configured to correct the level-two variation result to obtain the processing result. The correction submodule performs the following correction steps on the level-two variation result: determining whether adjacent mutation sites exist in the level-two variation result; if so, determining whether the variation frequencies of the adjacent mutation sites differ significantly and/or whether supporting sequences exist; and, if there is no significant difference and/or supporting sequences exist, merging the adjacent mutation sites to obtain the processing result.
Further, the identification module includes: a first amplification-primer specific-sequence submodule, configured to loop over all amplification primers of the target fragments, extract specific sequences of length L starting from the 5' end of each amplification primer, and record, for each pair of amplification primers, the number of specific sequences, the corresponding specific sequences, and the length of the primer sequence remaining after each specific sequence; a second amplification-primer specific-sequence submodule, configured to vary the length L and repeat the steps of the first amplification-primer specific-sequence submodule to obtain sets containing different numbers of specific sequences for all amplification primers, and to select the length L corresponding to the set with the most specific sequences, together with that set of specific sequences, for subsequent analysis; a sequencing sequence extraction submodule, configured to loop over every sequence in the level-one sequencing sequences, take the first 25 to 35 bp of each sequence, and, starting from the 5' end, extract subsequences of the length L corresponding to the set with the most specific sequences, to obtain a set of extracted sequencing subsequences; and a lookup submodule, configured to look up the amplification primers corresponding to the specific sequences in the set with the most specific sequences, find the amplification primer that occurs most often in the set of extracted sequencing subsequences and its number of occurrences, and, when this maximum number of occurrences exceeds a set second threshold, consider that the level-one sequencing sequence was amplified by the most frequent amplification primer and record the level-one sequencing sequence as an identified sequence.
Further, the removal module includes: a removal submodule, configured to remove the amplification primer from the identified sequence according to the position at which a specific sequence of the most frequent amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, to obtain the level-two sequencing sequence.
Further, the known-mutation local sequence module includes: a first screening submodule, configured to screen the regions of the known mutation data that correspond to the target fragments to form known mutation regions; a first recording submodule, configured to record the start position and end position of each known mutation in the known mutation regions, extend from the start position and the end position toward both ends, and record the start position and end position after extension; a first known-mutation sequence generating module, configured to, when the extended start position and end position lie within the region corresponding to the target fragment, record the sequence between the extended start position and the extended end position as the local sequence of the known mutation; and a second known-mutation sequence generating module, configured to, when the extended start position and/or end position lies beyond the region corresponding to the target fragment, take the boundary of the region corresponding to the target fragment as the start position and/or end position of the local sequence of the known mutation.
Further, the primary-variation-result local sequence module includes: a second screening submodule, configured to screen, from the primary variation result, the variant sites that are also present in the known mutation data; and a second recording submodule, configured to record the start position and end position of each variant site and extend from the start position and the end position toward both ends until the positions corresponding to the local sequence of the known mutation are reached, the resulting sequence being the local sequence of the primary variation result.
Further, the second alignment submodule includes: a first lookup sub-element, configured to search whether a primary variation result exists in the local sequence of the known mutation corresponding to each target fragment; a first sample-mutation local sequence generating element, configured to, when the first lookup sub-element finds one primary variation result, extend from the start position and end position of that variation result toward both ends to form one sample-mutation local sequence; a second sample-mutation local sequence generating element, configured to, when the first lookup sub-element finds multiple primary variation results, determine whether the variation frequencies of the multiple primary variation results differ significantly, and, if all of them differ significantly, extend from the start position and end position of each primary variation result toward both ends to form respective sample-mutation local sequences, or, if some primary variation results show no significant difference, preliminarily judge those primary variation results to be linked and merge them into a single sample-mutation local sequence, while each remaining primary variation result that does differ significantly forms its own sample-mutation local sequence; a calibration sub-element, configured to determine whether each sample-mutation local sequence is identical to the local sequence of the known mutation and, if identical, calibrate the primary variation result to the known mutation result, or, if different, perform no calibration; and a first merging sub-element, configured to merge the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the level-two variation result.
Further, the correction submodule includes a linkage false-positive judging sub-element, which includes: an extraction and counting sub-element, configured to extract the sequences that cover the multiple variation results simultaneously and count the proportion of level-two sequencing sequences that support the multiple variation results simultaneously; a linkage confirmation sub-element, configured to, when the proportion of level-two sequencing sequences supporting the multiple variation results simultaneously does not differ significantly from the proportion of sequences supporting each individual variation result among them, confirm that the multiple primary variation results occur in linkage, recalculate the mutation frequency as a linked mutation, and obtain a corrected mutation result when the recalculated mutation frequency meets a third threshold; a false-positive confirmation sub-element, configured to, when the proportion of level-two sequencing sequences supporting the multiple variation results simultaneously differs significantly from the proportion of sequences supporting each individual variation result among them, confirm that the linkage of the multiple primary variation results is a false positive, split the merged variation results and recalculate the mutation frequency, and obtain a corrected mutation result when the recalculated mutation frequency meets the third threshold; and a second merging sub-element, configured to merge the corrected mutation results with the uncorrected mutation results to obtain the processing result.
According to another aspect of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute any one of the above processing methods.
According to another aspect of the present invention, a processor is provided. The processor is configured to run a program, wherein any one of the above processing methods is executed when the program runs.
By applying the technical solution of the present invention, the primer portion of every sequence in the raw data obtained from high-throughput sequencing is removed according to known primer information, which reduces the false-positive processing results caused by primer-derived mutations in the overlapping regions of amplification products. In addition, identifying the primers also makes it possible to remove some incorrectly amplified sequences from the high-throughput sequencing data, which not only improves the accuracy of subsequent analysis and reduces false-positive results, but also reduces the overall data volume and improves the efficiency of the subsequent analysis steps.
Detailed Description of the Embodiments
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the embodiments.
Embodiments of the present application are described in detail below. The embodiments described are exemplary and are intended to explain the present application; they should not be construed as limiting it. In the description of the present application, terms such as "level-one", "level-two", "first", and "second" are used only for convenience of description and do not imply any order of importance. Where specific techniques or conditions are not indicated in the embodiments, they are carried out according to the techniques or conditions described in the literature of the relevant field or according to the product specifications. Reagents or instruments for which no specific note is given are conventional products that are commercially available.
As mentioned in the Background, when the prior art processes high-throughput sequencing data from multiplex amplification of specific target regions, the processing results often suffer from a large number of false-positive sites. To improve this situation, a typical embodiment of the present application provides a processing method for high-throughput sequencing data. The processing method includes: obtaining level-two sequencing sequences, a level-two sequencing sequence being a sequencing sequence in the high-throughput sequencing data that can be identified by an amplification primer of a target fragment and from which the corresponding amplification primer has been removed; aligning the level-two sequencing sequences against a reference genome sequence to obtain a primary variation result; and correcting the primary variation result using the mutation data in known mutation data to obtain a processing result.
When multiplex amplification is used to specifically capture target regions, the amplification intervals of different amplification primers may overlap, so the presence of amplification primers may interfere with mutation detection in the overlapping regions. For this reason, the above processing method of the present application removes the primer portion of every sequence in the raw data obtained from high-throughput sequencing according to known primer information, which reduces the false-positive processing results caused by primer-derived mutations in the overlapping regions of amplification products. In addition, identifying the primers also makes it possible to remove some incorrectly amplified sequences from the high-throughput sequencing data, which not only improves the accuracy of subsequent analysis but also reduces the overall data volume and improves the efficiency of the subsequent analysis steps.
In the above processing method, the step of obtaining the level-two sequencing sequences includes, in addition to identifying the sequencing sequences and removing primers from them using known amplification primer information, conventional preprocessing of the high-throughput sequencing data performed beforehand, such as removal of low-quality sequencing sequences. In a preferred embodiment of the present application, as shown in the detailed flowchart of the processing method for high-throughput sequencing data in Fig. 1, the step of obtaining the level-two sequencing sequences includes: filtering low-quality sequencing data out of the off-machine high-throughput sequencing data to obtain level-one sequencing sequences, low-quality sequencing data being sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and removing the corresponding amplification primers from the identified sequences to obtain the level-two sequencing sequences.
It should be noted here that, in the present application, off-machine high-throughput sequencing data refers to data in FASTQ or BAM format obtained from the sequencer.
In the above step of obtaining the level-two sequencing sequences, the raw sequencing data are filtered according to sequencing quality and base-calling status. This prevents low-quality data produced during sequencing from interfering with subsequent data analysis and improves the accuracy of the subsequent analysis results.
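The low-quality filter can be sketched as follows. This is a minimal illustration that assumes "Q20" means the fraction of bases in a read with a Phred quality score of at least 20, and that each read is supplied as a (name, sequence, list of integer quality scores) tuple already parsed from FASTQ; the function names are hypothetical.

```python
def is_low_quality(seq, quals, min_q20_fraction=0.80, max_n_fraction=0.10):
    """True if the read has Q20 below 80% or more than 10% N bases.
    Empty reads or reads with mismatched quality lengths are rejected."""
    if not seq or len(seq) != len(quals):
        return True
    q20_fraction = sum(q >= 20 for q in quals) / len(quals)
    n_fraction = seq.upper().count("N") / len(seq)
    return q20_fraction < min_q20_fraction or n_fraction > max_n_fraction


def filter_reads(reads):
    """Keep only the reads that pass the filter (the level-one sequences)."""
    return [r for r in reads if not is_low_quality(r[1], r[2])]
```

A 20-base read with 15 bases at Q30 and 5 bases at Q10 has a Q20 of 75% and would be discarded, while the same read with 16 bases at Q30 sits exactly at 80% and is kept.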
In the above processing method of the present application, the alignment step can be carried out using alignment procedures that are conventional in the art. To further increase alignment speed and data-processing efficiency, in a preferred embodiment of the present application, the step of aligning the level-two sequencing sequences against the reference genome sequence to obtain the primary variation result includes: extracting, from the reference genome sequence, the reference alignment sequence of each corresponding target fragment according to the position information of the amplification primers of the target fragment; and aligning the level-two sequencing sequences against the reference alignment sequence to obtain the primary variation result.
Since most of the sequences obtained by multiplex amplification should be fragments of the target regions, the present application extracts the reference sequences according to the amplification regions of the primers when performing alignment. This not only saves computing resources but also greatly speeds up the alignment. Specifically, the alignment mode is global alignment, and the specific algorithm is as follows:
(1) Parameter setting: define the scores used during alignment for a base match, a base mismatch, a base insertion/deletion, and the extension of a base insertion/deletion;
(2) Scoring matrix initialization:
A. Each base of the reference sequence serves as a column of the scoring matrix, with the first column left blank;
B. Each base of the sequencing sequence serves as a row of the scoring matrix, with the first row left blank;
C. Scoring matrix filling: the scoring matrix is filled from left to right and from top to bottom according to the following rule:
D. For each empty cell, the scores obtained by extending from the left, from above, and from the upper left are calculated separately. For extension from the upper left, it must be determined whether the sequenced base and the reference base at the current position are the same: if they are the same, the base match score is added; if they differ, the base mismatch score is added. For extension from the left or from above, it must be determined whether the previous step was also an insertion/deletion: if so, the insertion/deletion extension score is added; otherwise, the insertion/deletion score is added.
(3) The highest of the three scores calculated above is taken as the alignment score at this position, and the path from which the highest score came is recorded.
(4) Optimal path traceback: starting from the lower-right corner of the scoring matrix, trace back according to the path source recorded at each position to obtain the alignment results, and select the optimal alignment result.
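The steps above can be sketched as a single-matrix global alignment. This is an illustrative reading of the description rather than a definitive implementation: the score values are hypothetical defaults, the first row and column are initialized with gap penalties (an assumption, since the text only says they are left blank), and ties are broken in favor of the diagonal move.

```python
def global_align(read, ref, match=2, mismatch=-3, gap_open=-5, gap_extend=-1):
    """Global alignment of `read` (rows) against `ref` (columns).
    Returns (score, aligned_read, aligned_ref)."""
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    source = [[""] * cols for _ in range(rows)]  # "D", "L" or "U" per cell
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + (gap_extend if j > 1 else gap_open)
        source[0][j] = "L"
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + (gap_extend if i > 1 else gap_open)
        source[i][0] = "U"
    for i in range(1, rows):
        for j in range(1, cols):
            same = read[i - 1] == ref[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            # A gap continues at the extension cost only if the previous
            # step on that path was a gap in the same direction.
            left = score[i][j - 1] + (
                gap_extend if source[i][j - 1] == "L" else gap_open)
            up = score[i - 1][j] + (
                gap_extend if source[i - 1][j] == "U" else gap_open)
            best = max(diag, left, up)
            score[i][j] = best
            source[i][j] = "D" if best == diag else ("L" if best == left else "U")
    # Trace back from the lower-right corner using the recorded sources.
    i, j = len(read), len(ref)
    out_read, out_ref = [], []
    while i > 0 or j > 0:
        move = source[i][j]
        if move == "D":
            out_read.append(read[i - 1]); out_ref.append(ref[j - 1])
            i -= 1; j -= 1
        elif move == "L":
            out_read.append("-"); out_ref.append(ref[j - 1])
            j -= 1
        else:
            out_read.append(read[i - 1]); out_ref.append("-")
            i -= 1
    return score[-1][-1], "".join(reversed(out_read)), "".join(reversed(out_ref))
```

Because only one matrix is kept, this follows the "was the previous step also an insertion/deletion" rule literally; a three-matrix affine-gap formulation would be the more rigorous alternative when exact optimality under affine penalties matters.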
In a preferred embodiment of the present application, after the level-two sequencing sequences are aligned against the reference alignment sequence and before the primary variation result is obtained, the above processing method further includes: aligning the level-two sequencing sequences against the reference alignment sequence to obtain aligned sequences; determining, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence that is inconsistent with the information of the reference alignment sequence; and, if abnormal sequences exist, filtering them out of the aligned sequences and counting, at each position, the matches and mismatches between the remaining sequences and the reference alignment sequence to obtain the primary variation result.
Because the reference sequence has been truncated, some non-target amplified sequences may be forcibly aligned to the truncated reference sequence, which can easily interfere with subsequent mutation detection. In addition, the sequences amplified by each primer pair should lie within the corresponding amplification region, so their alignment positions should be essentially consistent with the position of the amplified target region. Based on these two points, in the above preferred embodiment the alignment results are preliminarily filtered after the sequence alignment, which helps improve the accuracy of the subsequent analysis results.
Among the above abnormal sequences, a sequence whose alignment quality is below the first threshold is defined as follows: an alignment quality value is obtained from the alignment algorithm and a threshold is set; a sequence whose alignment quality value is below the threshold is a sequence of too low alignment quality. Based on practical experience, the first threshold is usually 5. A sequence inconsistent with the information of the reference alignment sequence is a sequence that, according to the alignment algorithm, is not completely aligned to the reference alignment sequence.
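The preliminary filtering described above might be sketched as follows. The Alignment record, its field names, and the position tolerance are hypothetical and introduced only for illustration; the default quality threshold of 5 is the value mentioned in the text.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    read_id: str
    mapq: int            # alignment quality value reported by the aligner
    fully_aligned: bool  # False if part of the read failed to align
    start: int           # first aligned position on the truncated reference
    end: int             # last aligned position on the truncated reference

def is_abnormal(aln, amplicon_start, amplicon_end, min_mapq=5, tolerance=5):
    """Return True if the aligned sequence should be filtered out."""
    if aln.mapq < min_mapq:      # alignment quality below the first threshold
        return True
    if not aln.fully_aligned:    # not completely aligned to the reference
        return True
    # Alignment position should agree with the amplified target region
    # (the tolerance is an assumed parameter).
    if aln.start < amplicon_start - tolerance or aln.end > amplicon_end + tolerance:
        return True
    return False

def filter_alignments(alignments, amplicon_start, amplicon_end):
    """Keep only the aligned sequences that are not abnormal."""
    return [a for a in alignments
            if not is_abnormal(a, amplicon_start, amplicon_end)]
```

In practice the quality value and alignment completeness would come from the aligner's output records; the sketch only shows the decision logic.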
In existing methods for detecting variation in high-throughput sequencing data, the specific information of a mutation is usually taken from the alignment result. The drawback of this approach is that, when several alignments are possible near the position of a mutation, the reported alignment may be inconsistent with the information in existing databases, so that subsequent association with those databases cannot be carried out. In addition, for relatively complex mutations, the alignment process may split the mutation into several smaller mutations in order to obtain the optimal alignment score, which does not reflect the real variation. As a result, the final analysis result is also inaccurate.
To improve on this situation, in a preferred embodiment of the present application, the step of correcting the primary variation result using the mutation data in the known mutation data to obtain the processing result comprises: screening out, from the known mutation data, the known mutation data in the regions corresponding to the target fragments, to obtain the local sequences of the known mutations; screening out, from the primary variation result, the variant sites that are also present in the known mutation data, to form the local sequences of the primary variation result; and comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.
By correcting and calibrating the primary variation result against the local sequences of the known mutations, the final result obtained by the processing is made more accurate. Specifically, the local sequences of the known mutations corresponding to the amplification regions need to be generated in advance; they can be obtained by extracting from the known mutation data according to the specific information of the amplification primers. Sequences in the primary variation result that lie within a known mutation region form the local sequences of the primary variation result, whereas those not within a known mutation region are not corrected or calibrated.
Although some complex mutations that may occur can be calibrated according to the known mutation data, in actual detection there may also be complex mutations that are not recorded in the known mutation database. Therefore, after the mutation result is obtained, it is judged whether it contains multiple mutations that may occur in linkage, and these are corrected, so as to further improve the accuracy of the analysis result. In a preferred embodiment of the present application, as shown in Fig. 1 and Fig. 2, the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result comprises: comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain a secondary variation result; and correcting the secondary variation result to obtain the processing result. The step of correcting the secondary variation result comprises: judging whether neighboring mutation sites exist in the secondary variation result; if so, judging whether the variation frequencies of the neighboring mutation sites differ significantly and whether supporting sequences exist; and, if there is no significant difference and supporting sequences exist, merging the neighboring mutation sites to obtain the processing result.
In the above processing method, the basic principle of identifying sequencing sequences using known amplification primers is to use the specific sequences of each amplification primer as specific markers of the corresponding primer. When the specific sequences of a pair of primers occur multiple times within the first 25~35 bp of a sequencing sequence, the sequence is considered to have been amplified by the corresponding amplification primer. After the corresponding amplification primer has been identified, it can be removed according to the length of the amplification primer. Based on this principle, different specific algorithms can be designed to implement the above primer identification and removal.
In a preferred embodiment of the present application, the step of identifying the first-level sequencing sequences using the amplification primers of the target fragments to obtain identified sequences comprises:
Step A: loop over all amplification primers of the target fragments; starting from the 5' end of each amplification primer, extract specific sequences of length L, and record the number of specific sequences of each pair of amplification primers, the corresponding specific sequences, and the length of the primer sequence remaining after each specific sequence.
Step B: vary the length L and repeat step A to obtain sets containing different numbers of specific sequences for all amplification primers; select the length L corresponding to the set with the largest number of specific sequences, together with that set of specific sequences, for subsequent analysis.
Step C: loop over every sequence in the first-level sequencing sequences; take the first 25~35 bp of each sequence and, starting from the 5' end, extract subsequences using the length L corresponding to the set with the largest number of specific sequences, to obtain a set of extracted sequencing subsequences.
Step D: among the amplification primers corresponding to the specific sequences in the set with the largest number of specific sequences, find the amplification primer that occurs most often in the set of extracted sequencing subsequences and its number of occurrences; when this maximum number exceeds a set second threshold (usually 3), the first-level sequencing sequence is considered to have been amplified by the most frequently occurring amplification primer, and the first-level sequencing sequence is recorded as an identified sequence.
The amplification primer identification and removal method of the present application can effectively remove the interference caused by primer mutations as well as erroneous amplified sequences generated during library construction and/or sequencing. On the one hand this improves the accuracy of the processing result; on the other hand it reduces the overall data volume and improves the efficiency of the subsequent analysis steps.
It should be noted that, in the above preferred embodiment, a specific sequence is a sequence that has no identical counterpart among all the amplification primers, that is, it is unique. If an identical sequence can be extracted from different amplification primers, that sequence cannot be called a specific sequence.
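Steps A to D above can be sketched in a few functions. The function names, the example primer names, and the 30 bp prefix length (a value inside the stated 25~35 bp range) are assumptions for illustration; the hit threshold of 3 is the usual second threshold mentioned in the text.

```python
from collections import Counter

def unique_kmers(primers, k):
    """Step A: map each k-mer found in exactly one primer to
    (primer name, length of primer sequence remaining after the k-mer)."""
    owners = {}
    for name, seq in primers.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(name)
    table = {}
    for name, seq in primers.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if len(owners[kmer]) == 1:  # specific: not shared with another primer
                table[kmer] = (name, len(seq) - (i + k))
    return table

def best_kmer_table(primers, k_values):
    """Step B: try several lengths L and keep the one yielding the most specific sequences."""
    tables = {k: unique_kmers(primers, k) for k in k_values}
    best_k = max(tables, key=lambda k: len(tables[k]))
    return best_k, tables[best_k]

def identify_primer(read, k, table, prefix_len=30, min_hits=3):
    """Steps C and D: count specific-sequence hits in the read prefix and
    return the best-supported primer, or None if support is too low."""
    prefix = read[:prefix_len]
    hits = Counter()
    for i in range(len(prefix) - k + 1):
        entry = table.get(prefix[i:i + k])
        if entry is not None:
            hits[entry[0]] += 1
    if not hits:
        return None
    name, count = hits.most_common(1)[0]
    return name if count > min_hits else None
```

A read for which identify_primer returns None would not be recorded as an identified sequence and would therefore drop out of the downstream analysis.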
In a preferred embodiment of the present application, the step of removing the corresponding amplification primer from the identified sequences to obtain the second-level sequencing sequences comprises: removing the amplification primer from each identified sequence according to the position at which a specific sequence of the most frequently occurring amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, to obtain the second-level sequencing sequences.
Specifically, the number of specific sequences that can be extracted from a given pair of amplification primers depends on the length of the amplification primers on the one hand and on the chosen extraction length on the other. The longer the amplification primer, the more specific sequences can be extracted; the longer the extraction length, the fewer specific sequences can be extracted. The length of the amplification primer removed from an identified sequence here is not necessarily identical to the full length of the amplification primer used during library construction. Because of sequencing errors or errors from other unknown causes, in every sequencing sequence the sequence preceding the last position of a specific sequence identified for a pair of amplification primers, together with the sequence of the remaining length that follows it, is regarded as primer sequence introduced by the identified amplification primer, regardless of whether it is actually identical to the amplification primer sequence, and therefore needs to be removed. In other words, the sequence of the remaining length after the last position of an identified specific sequence may not be completely identical to the actual remaining sequence of the amplification primer, but a base sequence of the same length still needs to be removed. The present application has verified through multiple tests that this removal has no effect on the final processing result.
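The removal rule can be shown with a small sketch. The lookup table below is written out by hand for a hypothetical 8 bp primer "ACGTTGCA" with extraction length 4; each entry maps a specific sequence to the primer name and the primer length remaining after it. The function name and the 30 bp search window are assumptions.

```python
# Specific sequences of the hypothetical primer P1 = "ACGTTGCA" (L = 4),
# each with the number of primer bases remaining after it.
TABLE = {
    "ACGT": ("P1", 4),
    "CGTT": ("P1", 3),
    "GTTG": ("P1", 2),
    "TTGC": ("P1", 1),
    "TGCA": ("P1", 0),
}

def trim_primer(read, primer_name, k, table, prefix_len=30):
    """Cut the read after the last specific-sequence hit plus the
    primer length that remains after that hit."""
    cut = None
    for i in range(min(len(read), prefix_len) - k + 1):
        entry = table.get(read[i:i + k])
        if entry is not None and entry[0] == primer_name:
            cut = i + k + entry[1]  # later hits overwrite earlier ones
    return read if cut is None else read[cut:]

# Exact primer followed by the insert.
print(trim_primer("ACGTTGCA" + "GGGTTTCCC", "P1", 4, TABLE))  # GGGTTTCCC
# Last primer base misread (A -> T): the same length is still removed.
print(trim_primer("ACGTTGCT" + "GGGTTTCCC", "P1", 4, TABLE))  # GGGTTTCCC
```

The second call illustrates the point made in the text: the bases after the last identified specific sequence are removed by length, even when they do not match the primer exactly.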
In a preferred embodiment of the present application, as shown in Fig. 2, the step of screening out, from the known mutation data, the known mutation data in the regions corresponding to the target fragments to obtain the local sequences of the known mutations comprises: screening the known mutation data for the regions corresponding to the target fragments to form known mutation regions; recording the start position and end position of each known mutation in the known mutation regions, extending from the start position and the end position toward both ends (preferably by 10~15 bp each), and then recording the extended start position and end position; when the extended start position and end position lie within the region corresponding to the target fragment, taking the sequence between the extended start position and the extended end position as the local sequence of the known mutation; and, when the extended start position and/or end position falls outside the region corresponding to the target fragment, taking the boundary of the region corresponding to the target fragment as the start position and/or end position of the local sequence of the known mutation.
As shown in Fig. 2, according to the mutation information described in the known mutation database (including the mutation start position and the type of the variant bases), a local reference sequence and the corresponding local variant sequence are generated by extending forward from the start position and backward from the end position. If the extended region goes beyond the known mutation region corresponding to the target fragment and its amplification primers, the boundary of the region corresponding to the target fragment is used as the start position and/or end position of the local sequence of the known mutation. For example, assume a point mutation extended by 10 bp on each side: if the extended start and end positions both lie within the target fragment, the local sequence of the mutation is 21 bp long. If, after extending to the right, the end position goes 5 bp beyond the target fragment, those 5 bp are discarded and the local sequence of the mutation is 16 bp long. Performing the comparison on local sequences ensures that the comparison result lies within the target fragment region, which makes the comparison result more accurate, avoids the interference that multiple possible local alignments cause to mutation detection, and facilitates subsequent association with various databases.
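The boundary handling in the example above amounts to clamping the extended window to the target fragment. The following sketch reproduces the 21 bp and 16 bp cases; the function names and the coordinates are illustrative assumptions, with positions treated as 1-based and inclusive.

```python
def local_window(mut_start, mut_end, target_start, target_end, flank=10):
    """Extend a mutation by `flank` bases on each side, clamped to the target fragment."""
    start = max(mut_start - flank, target_start)
    end = min(mut_end + flank, target_end)
    return start, end

def window_length(start, end):
    """Length of an inclusive interval."""
    return end - start + 1

# Point mutation at position 100, target fragment 50..200: full 21 bp window.
print(window_length(*local_window(100, 100, 50, 200)))  # 21
# Same mutation, target fragment ends at 105: 5 bp are cut off on the right.
print(window_length(*local_window(100, 100, 50, 105)))  # 16
```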
It should be noted that the step of generating the local sequences of the known mutations only needs to be carried out once. As long as the target fragment regions have not changed, the local sequences of the known mutations do not need to be regenerated for each analysis. Moreover, the timing of generating the local sequences of the known mutations is not limited, provided they are generated before the comparison step.
Because not all variant sites in the primary variation result are present in the known mutation data, in order to compare the variation result with the known variation data while preventing multiple primary variant sites from being matched to the positions of different known mutation data, the local sequences of the variant sites that are present both in the known mutation data and in the primary variation result can be compared individually with the local sequences of the known mutations. In a preferred embodiment of the present application, the step of screening out, from the primary variation result, the variant sites that are also present in the known mutation data to form the local sequences of the primary variation result comprises: screening out, from the primary variation result, the variant sites that are also present in the known mutation data; recording the start position and end position of each variant site; and extending from the start position and the end position toward both ends up to the positions corresponding to the local sequence of the known mutation, the resulting sequence being the local sequence of the primary variation result.
In a preferred embodiment of the present application, the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the secondary variation result comprises: searching whether a primary variation result exists within the local sequence of the known mutation corresponding to each target fragment; if one primary variation result exists, extending from the start position and end position of that variation result toward both ends to form one sample-mutation local sequence; if multiple primary variation results exist, judging whether the variation frequencies of the multiple primary variation results differ significantly; if all of them differ significantly, extending from the start position and end position of each primary variation result toward both ends to form the respective sample-mutation local sequences; if there are primary variation results with no significant difference, preliminarily judging those primary variation results to be linked and merging them to form a single sample-mutation local sequence, while the remaining primary variation results that do differ significantly each generate their own sample-mutation local sequence; judging whether each sample-mutation local sequence is identical to the local sequence of the known mutation; if identical, calibrating the primary variation result to the known mutation result; if not, performing no calibration; and merging the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the secondary variation result.
Specifically, in the above preferred embodiment, as shown in Fig. 2, the mutation type of ChrA in the primary variation result is the same as the mutation type in the known mutation database, both being an ATCG deletion, but the positions recorded in the primary variation result are x4 and y4, whereas the positions in the known mutation data are x1 and y1. The primary variation result of this mutation site is therefore calibrated to the known mutation result according to the mutation information in the known mutation database. In Fig. 2, the mutation information of ChrD is not present in the known mutation database and is therefore not calibrated.
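The ATCG example can be made concrete with a short sketch: in a repetitive context, the same deletion reported at two different positions yields the same local sequence, which is why comparing local sequences allows the coordinates to be calibrated. The function names, the example sequence, and the representation of a deletion as a (start, length) pair with 0-based coordinates are assumptions for illustration.

```python
def apply_deletion(ref_local, start, length):
    """Return the local sequence after deleting `length` bases at `start` (0-based)."""
    return ref_local[:start] + ref_local[start + length:]

def calibrate(ref_local, sample_del, known_del):
    """Report the known-database coordinates when both deletions
    produce the same local sequence; otherwise keep the sample call."""
    if apply_deletion(ref_local, *sample_del) == apply_deletion(ref_local, *known_del):
        return known_del
    return sample_del

ref_local = "GGATCGATCGTT"  # ATCG occurs twice in tandem

# Aligner reports the deletion at the second copy, database at the first copy.
print(calibrate(ref_local, (6, 4), (2, 4)))  # (2, 4)
# A different deletion (CGTT) does not match and is left uncalibrated.
print(calibrate(ref_local, (7, 4), (2, 4)))  # (7, 4)
```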
In a preferred embodiment of the present application, as shown in Fig. 3, the step of correcting the secondary variation result includes a step of judging whether multiple primary variation results that were judged to be linked are false positives. This step comprises: extracting the sequences that simultaneously cover the multiple variation results, and counting the proportion of second-level sequencing sequences that support all of the variation results simultaneously; if there is no significant difference between the proportion of second-level sequencing sequences supporting all of the variation results simultaneously and the proportion of sequences supporting each individual variation result, confirming that the multiple primary variation results occur in linkage, recalculating the mutation frequency as a linked mutation, and obtaining a corrected mutation result when the recalculated mutation frequency meets a third threshold; if there is a significant difference between the proportion of second-level sequencing sequences supporting all of the variation results simultaneously and the proportion of sequences supporting each individual variation result, confirming that the linkage of the multiple primary variation results is a false positive, splitting the merged variation results and recalculating the mutation frequencies, and obtaining a corrected mutation result when the recalculated mutation frequency meets the third threshold (usually 2%); and merging the corrected mutation results with the uncorrected mutation results to obtain the processing result.
Through the above steps, multiple individual mutations in the secondary variation result that are truly linked are corrected into a linked mutation, and mutations that were merged as a linkage false positive are split into individual mutations, making the mutation result more accurate.
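The linkage check above can be sketched as follows. The text does not name the statistical test, so the two-proportion z-test, the significance level of 0.05, and the reading of "meets the third threshold" as "frequency at least 2%" are all assumptions made for this illustration, as are the function names.

```python
import math

def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value of a two-proportion z-test (assumed test)."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def resolve_linkage(joint_support, single_supports, depth, alpha=0.05, min_freq=0.02):
    """joint_support: reads carrying all variants at once;
    single_supports: reads carrying each individual variant;
    depth: reads covering all variant sites simultaneously."""
    linked = all(
        two_proportion_p(joint_support, depth, s, depth) >= alpha
        for s in single_supports
    )
    if linked:
        # True linkage: report one merged mutation with its recalculated frequency.
        freq = joint_support / depth
        return [("linked", freq)] if freq >= min_freq else []
    # Linkage false positive: split and re-evaluate each mutation on its own.
    return [("single", s / depth) for s in single_supports if s / depth >= min_freq]
```

For instance, with 1000 covering reads, 95 reads carrying both variants and 100 and 98 reads carrying each one individually, the proportions do not differ significantly and a single linked mutation at 9.5% would be reported under these assumptions.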
In another typical embodiment of the present application, a processing device for high-throughput sequencing data is provided. The processing device comprises: a second-level sequencing sequence acquisition unit, configured to obtain second-level sequencing sequences, the second-level sequencing sequences being the sequencing sequences in the high-throughput sequencing data that can be identified by the amplification primers of the target fragments and from which the corresponding amplification primers have been removed; a primary variation result acquisition unit, configured to align the second-level sequencing sequences with the reference genome sequence to obtain a primary variation result; and a correction unit, configured to correct the primary variation result using the mutation data in the known mutation data to obtain the processing result.
In the above processing device of the present application, the second-level sequencing sequence acquisition unit obtains the sequencing sequences that can be identified by the amplification primers of the target fragments and from which the corresponding amplification primers have been removed; the primary variation result acquisition unit then aligns the obtained second-level sequencing sequences with the reference genome sequence; and the resulting primary variation result is subjected to the correction step by the correction unit, so that the variation result in the processing result obtained is more accurate.
In a preferred embodiment of the present application, the second-level sequencing sequence acquisition unit comprises: a filtering module, configured to filter out low-quality sequencing data from the high-throughput sequencing data off the sequencer to obtain first-level sequencing sequences, low-quality sequencing data being sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; an identification module, configured to identify the first-level sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and a removal module, configured to remove the corresponding amplification primers from the identified sequences to obtain the second-level sequencing sequences.
The above second-level sequencing sequence acquisition unit filters the raw sequencing data according to sequencing quality and base-calling status, which prevents low-quality data from the sequencing process from interfering with subsequent data analysis and improves the accuracy of the subsequent analysis results.
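The low-quality criterion used by the filtering module (Q20 below 80% or more than 10% N bases) can be sketched as below. The function name and the assumption that quality strings use the Phred+33 encoding common in FASTQ files are not taken from the application.

```python
def is_low_quality(seq, qual, phred_offset=33, min_q20=0.80, max_n=0.10):
    """Return True if the read should be filtered out.

    seq:  base string of the read
    qual: quality string of the same length (Phred+33 is assumed)
    """
    # Treat empty or malformed records as low quality.
    if not seq or len(seq) != len(qual):
        return True
    # Q20: fraction of bases whose Phred quality is at least 20.
    q20 = sum(1 for c in qual if ord(c) - phred_offset >= 20) / len(qual)
    n_ratio = seq.upper().count("N") / len(seq)
    return q20 < min_q20 or n_ratio > max_n

def filter_reads(reads):
    """reads: iterable of (seq, qual) pairs; keeps the first-level sequencing sequences."""
    return [(s, q) for s, q in reads if not is_low_quality(s, q)]
```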
In a preferred embodiment of the present application, the primary variation result acquisition unit comprises: a truncation module, configured to extract, from the reference genome sequence, the reference alignment sequence of the corresponding target fragment according to the position information of the amplification primers of the target fragment; and a first alignment module, configured to align the second-level sequencing sequences with the reference alignment sequence to obtain the primary variation result.
Because most of the sequences obtained by multiplex amplification should be fragments of the target regions, the above primary variation result acquisition unit truncates the reference sequence according to the amplification regions of the primers when performing the alignment, which not only saves computing resources but also greatly speeds up the alignment.
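A minimal sketch of the truncation performed by the truncation module is given below. The function name, the 1-based inclusive coordinates, and the choice to take the span from the start of the forward primer to the end of the reverse primer are assumptions; the application only states that the reference is truncated according to the primer amplification regions.

```python
def truncate_reference(chrom_seq, fwd_primer_start, rev_primer_end):
    """Return the reference alignment sequence for one amplicon.

    Coordinates are 1-based and inclusive (assumed convention).
    """
    return chrom_seq[fwd_primer_start - 1:rev_primer_end]

def build_amplicon_references(chrom_seq, amplicons):
    """amplicons: dict of hypothetical amplicon name -> (start, end) on chrom_seq."""
    return {name: truncate_reference(chrom_seq, start, end)
            for name, (start, end) in amplicons.items()}

refs = build_amplicon_references("AAAACCCCGGGGTTTT", {"amp1": (5, 12)})
print(refs["amp1"])  # CCCCGGGG
```

Each read identified as belonging to a given primer pair would then be aligned only against the short sequence stored for that amplicon.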
In a preferred embodiment of the present application, for the stage after the second-level sequencing sequences are aligned with the reference alignment sequence and before the primary variation result is obtained, the first alignment module further comprises: a first alignment sub-module, configured to align the second-level sequencing sequences with the reference alignment sequence to obtain aligned sequences; a judging sub-module, configured to judge, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below the first threshold or a sequence inconsistent with the information of the reference alignment sequence and the amplification primers; and a filtering sub-module, configured to, when the judging sub-module determines that abnormal sequences exist, filter the abnormal sequences out of the aligned sequences and count, for each position of the remaining sequences, the similarities and differences relative to the reference alignment sequence, to obtain the primary variation result.
Because the truncation module extracts the amplified sequences of the target fragments from the reference sequence, some non-target amplified sequences may be forcibly aligned to the truncated reference sequence, which can easily interfere with subsequent mutation detection. In addition, the sequences amplified by each primer pair should lie within the corresponding amplification region, so their alignment positions should be essentially consistent with the position of the amplified target region. Based on these two points, in the above preferred embodiment the judging sub-module and the filtering sub-module are provided to perform abnormality judgment and preliminary filtering, respectively, on the alignment results, which helps improve the accuracy of the subsequent analysis results.
In a preferred embodiment of the present application, the correction unit comprises: a known-mutation local sequence module, configured to screen out, from the known mutation data, the known mutation data in the regions corresponding to the target fragments, to obtain the local sequences of the known mutations; a primary-variation-result local sequence module, configured to screen out, from the primary variation result, the variant sites that are also present in the known mutation data, to form the local sequences of the primary variation result; and a second comparison module, configured to compare the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.
The correction unit corrects the local sequences of the primary variation result according to the local sequences of the known mutations, so that the variation result obtained by the processing is more accurate.
In a preferred embodiment of the present application, the second comparison module comprises: a second comparison sub-module, configured to compare the local sequences of the primary variation result with the local sequences of the known mutations to obtain a secondary variation result; and a correction sub-module, configured to correct the secondary variation result to obtain the processing result. The correction sub-module performs the following correction step on the secondary variation result: judging whether neighboring mutation sites exist in the secondary variation result; if so, judging whether the variation frequencies of the neighboring mutation sites differ significantly and/or whether supporting sequences exist; and, if there is no significant difference and/or supporting sequences exist, merging the neighboring mutation sites to obtain the processing result.
Although the above second comparison module can correct and calibrate some complex mutations that may occur according to the known mutation data, in actual detection there may also be complex mutations that are not recorded in the known mutation database. Therefore, after the mutation result is obtained, the above correction sub-module is executed to judge whether the result contains multiple mutations that may occur in linkage and to correct them, so as to further improve the accuracy of the analysis result.
In a preferred embodiment of the present application, the identification module comprises: a first amplification primer specific sequence sub-module, configured to loop over all amplification primers of the target fragments and, starting from the 5' end of each amplification primer, extract specific sequences of length L and record the number of specific sequences of each pair of amplification primers, the corresponding specific sequences, and the length of the primer sequence remaining after each specific sequence; a second amplification primer specific sequence sub-module, configured to vary the length L and repeat the step of the first amplification primer specific sequence sub-module to obtain sets containing different numbers of specific sequences for all amplification primers, and to select the length L corresponding to the set with the largest number of specific sequences, together with that set of specific sequences, for subsequent analysis; a sequencing sequence extraction sub-module, configured to loop over every sequence in the first-level sequencing sequences, take the first 25~35 bp of each sequence and, starting from the 5' end, extract subsequences using the length L corresponding to the set with the largest number of specific sequences, to obtain a set of extracted sequencing subsequences; and a search sub-module, configured to find, among the amplification primers corresponding to the specific sequences in the set with the largest number of specific sequences, the amplification primer that occurs most often in the set of extracted sequencing subsequences and its number of occurrences, and, when this maximum number exceeds a set second threshold (usually 3), to consider the first-level sequencing sequence to have been amplified by the most frequently occurring amplification primer and record the first-level sequencing sequence as an identified sequence.
In a preferred embodiment of the present application, the removal module comprises: a removal sub-module, configured to remove the amplification primer from each identified sequence according to the position at which a specific sequence of the most frequently occurring amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, to obtain the second-level sequencing sequences.
In a preferred embodiment of the present application, the known-mutation local sequence module comprises: a first screening sub-module, configured to screen the known mutation data for the regions corresponding to the target fragments to form known mutation regions; a first recording sub-module, configured to record the start position and end position of each known mutation in the known mutation regions, extend from the start position and the end position toward both ends, and then record the extended start position and end position; a first known mutation sequence generation module, configured to, when the extended start position and end position lie within the region corresponding to the target fragment, record the sequence between the extended start position and the extended end position as the local sequence of the known mutation; and a second known mutation sequence generation module, configured to, when the extended start position and/or end position falls outside the region corresponding to the target fragment, take the boundary of the region corresponding to the target fragment as the start position and/or end position of the local sequence of the known mutation.
In another preferred embodiment, the above primary-variation-result local sequence module comprises: a second screening sub-module, configured to screen out, from the primary variation result, the variant sites that are also present in the known mutation data; and a second recording sub-module, configured to record the start position and end position of each variant site and extend from the start position and the end position toward both ends up to the positions corresponding to the local sequence of the known mutation, the resulting sequence being the local sequence of the primary variation result.
In a preferred embodiment of the application, the second comparison submodule includes: a first search subcomponent, configured to search whether a primary variation result exists within the local sequence of the known mutation corresponding to each target fragment; a first sample-mutation local sequence generating element, configured to, when the first search subcomponent finds one primary variation result, extend from the initial position and final position of that variation result toward both ends to form one sample-mutation local sequence; a second sample-mutation local sequence generating element, configured to, when the first search subcomponent finds multiple primary variation results, judge whether the variation frequencies of the multiple primary variation results differ significantly, wherein, if all of them differ significantly, the initial position and final position of each primary variation result are extended toward both ends (preferably by 10 to 15 bp on each side) to form a separate sample-mutation local sequence for each, and, if some primary variation results show no significant difference, those results are preliminarily judged to be linked and are merged into a single sample-mutation local sequence, while the remaining primary variation results that do differ significantly each form their own sample-mutation local sequence; a calibration subcomponent, configured to judge whether each sample-mutation local sequence is identical to the local sequence of the known mutation and, if identical, to calibrate the primary variation result to the known mutation result, and otherwise to leave it uncalibrated; and a first merging subcomponent, configured to merge the mutation sites that have been calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the second-level variation result.
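The grouping step performed by the second sample-mutation local sequence generating element can be sketched as below. The application does not name the statistical test used to decide whether variation frequencies differ significantly; the two-proportion z-test, the significance level `alpha=0.05`, the `Variant` record, and the function names are all assumptions chosen for illustration. Because "no significant difference" is not transitive, the sketch merges variants by connected components, which is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    start: int      # initial position of the variant site
    end: int        # final position of the variant site
    alt_reads: int  # reads supporting the variant
    depth: int      # total reads covering the site (assumed > 0)


def frequencies_differ(a, b, alpha=0.05):
    """Two-sided two-proportion z-test on variant frequencies (assumed test)."""
    pooled = (a.alt_reads + b.alt_reads) / (a.depth + b.depth)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.depth + 1 / b.depth))
    if se == 0:  # both frequencies are exactly 0 or exactly 1
        return False
    z = (a.alt_reads / a.depth - b.alt_reads / b.depth) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_value < alpha


def group_linked(variants, alpha=0.05):
    """Merge variants whose frequencies do not differ significantly.

    Each returned group would yield one sample-mutation local sequence;
    a group with a single member is a variant handled individually.
    """
    groups = []
    for v in sorted(variants, key=lambda x: x.start):
        linked, rest = [], []
        for g in groups:
            differs_from_all = all(frequencies_differ(v, m, alpha) for m in g)
            (rest if differs_from_all else linked).append(g)
        merged = [v] + [m for g in linked for m in g]
        groups = rest + [sorted(merged, key=lambda x: x.start)]
    return groups


a = Variant(100, 100, 50, 1000)   # 5.0 %
b = Variant(105, 105, 52, 1000)   # 5.2 %
c = Variant(110, 110, 300, 1000)  # 30 %
groups = group_linked([a, b, c])  # a and b are merged; c stays separate
```

Each resulting group would then be extended to a sample-mutation local sequence and compared with the local sequence of the known mutation by the calibration subcomponent.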
In a preferred embodiment of the application, the correction submodule includes: a linkage false-positive judging subcomponent, which includes an extraction and statistics subcomponent, a linkage confirmation subcomponent, and a false-positive confirmation subcomponent; and a second merging subcomponent. The extraction and statistics subcomponent is configured to extract the sequences that simultaneously cover the multiple variation results and to count the proportion of second-level sequencing sequences that support all of the multiple variation results at once. The linkage confirmation subcomponent is configured to, when that proportion does not differ significantly from the proportion of sequences supporting each individual variation result among the multiple variation results, confirm that the multiple primary variation results occur in linkage, recalculate the mutation frequency as a linked mutation, and, when the recalculated mutation frequency meets a third threshold, obtain a corrected mutation result. The false-positive confirmation subcomponent is configured to, when that proportion differs significantly from the proportion of sequences supporting each individual variation result, confirm that the linkage of the multiple primary variation results is a false positive, split the merged variation results, recalculate the mutation frequency of each, and, when the recalculated mutation frequency meets the third threshold (usually 2%), obtain a corrected mutation result. The second merging subcomponent is configured to merge the corrected mutation results with the uncorrected mutation results to obtain the processing result.
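The linkage check described above can be sketched as follows. The representation of each spanning read as a tuple of booleans, the two-proportion z-test (which treats the compared proportions as independent, a simplification), the significance level `alpha`, the reading of "meets the third threshold" as "greater than or equal to", and all function names are assumptions made for illustration; only the 2% default comes from the text.

```python
import math


def proportions_differ(k1, n1, k2, n2, alpha=0.05):
    """Two-sided two-proportion z-test (assumed test; not named in the text)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return False
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2)) < alpha


def correct_linked(spanning_reads, min_freq=0.02, alpha=0.05):
    """Decide whether merged variants are truly linked and recompute frequencies.

    spanning_reads: one tuple of booleans per read that covers every
    variant site; element i is True if the read supports variant i.
    """
    n = len(spanning_reads)
    if n == 0:
        return {"linked": False, "frequencies": []}
    n_sites = len(spanning_reads[0])
    joint = sum(all(r) for r in spanning_reads)  # reads supporting all variants
    per_site = [sum(r[i] for r in spanning_reads) for i in range(n_sites)]
    linked = not any(proportions_differ(joint, n, k, n, alpha) for k in per_site)
    if linked:
        freqs = [joint / n]               # one frequency for the linked mutation
    else:
        freqs = [k / n for k in per_site]  # split and recompute each variant
    kept = [f for f in freqs if f >= min_freq]  # third threshold, usually 2 %
    return {"linked": linked, "frequencies": kept}


# Both variants always appear on the same reads: confirmed as linked.
linked_case = correct_linked([(True, True)] * 50 + [(False, False)] * 950)

# The variants rarely co-occur: the linkage is treated as a false positive.
split_case = correct_linked(
    [(True, True)] * 5 + [(True, False)] * 95
    + [(False, True)] * 10 + [(False, False)] * 890
)
```

In the second case the merged result is split, and only the variant whose recalculated frequency reaches the threshold is kept as a corrected mutation result.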
In a third typical embodiment of the application, a storage medium is further provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the above processing method for high-throughput sequencing data.
In a fourth typical embodiment of the application, a processor is further provided. The processor is configured to run a program, wherein the program, when run, executes the above processing method for high-throughput sequencing data.
As can be seen from the above description of the embodiments, the described device is merely schematic. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium.
Those skilled in the art will clearly understand that the application may be implemented by software together with the necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for the relevant details, refer to the corresponding parts of the method embodiments.
The advantageous effects of the application are further illustrated below with reference to specific examples.