CN108280325A

CN108280325A - Processing method, processing device, storage medium and processor for high-throughput sequencing data

Info

Publication number: CN108280325A
Application number: CN201711296903.9A
Authority: CN
Inventors: 李晖; 陈钊; 莫敏俐; 丁凤; 王淑娟
Original assignee: YAKANGBO BIOLOGICAL SCIENCE AND TECHNOLOGY Co Ltd BEIJING
Current assignee: YAKANGBO BIOLOGICAL SCIENCE AND TECHNOLOGY Co Ltd BEIJING; Beijing ACCB Biotech Ltd
Priority date: 2017-12-08
Filing date: 2017-12-08
Publication date: 2018-07-13
Anticipated expiration: 2037-12-08
Also published as: CN108280325B

Abstract

The present invention provides a processing method, a processing device, a storage medium and a processor for high-throughput sequencing data. The processing method includes: obtaining level-two sequencing sequences, which are the sequencing sequences in the high-throughput sequencing data that can be identified by the amplification primers of the target fragments and from which the corresponding amplification primers have been removed; comparing the level-two sequencing sequences with a reference genome sequence to obtain a primary variation result; and correcting the primary variation result using the mutation data in known-mutation data to obtain a processing result. Removing the primer portion of every sequence from the raw data obtained by high-throughput sequencing, according to known primer information, reduces false-positive results caused by primer-derived mutations in the overlapping regions of amplification products. Sequences produced by erroneous amplification can also be removed from the high-throughput sequencing data, which both improves the accuracy of subsequent analysis and reduces the overall amount of data, improving analysis efficiency.

Description

Processing method, processing device, storage medium and processor for high-throughput sequencing data

Technical field

The present invention relates to the field of high-throughput sequencing data processing, and in particular to a processing method, a processing device, a storage medium and a processor for high-throughput sequencing data.

Background art

At present, there are many methods for detecting mutations by gene sequencing. Among them, specifically amplifying particular target regions by multiplex amplification and then performing high-throughput sequencing on the products is a common, efficient and economical approach. However, high-throughput sequencing generates a large amount of sequence information, so how to process these sequencing data quickly and accurately has become a technical problem that urgently needs to be solved.

Although the prior art offers many methods for processing and analysing high-throughput sequencing data, these methods suffer from low accuracy of the processing results. The existing methods for processing sequencing data therefore still need to be improved.

Summary of the invention

The main object of the present invention is to provide a processing method, a processing device, a storage medium and a processor for high-throughput sequencing data, so as to solve the problem that existing processing results contain many false-positive sites.

To achieve the above object, according to one aspect of the present invention, a processing method for high-throughput sequencing data is provided. The processing method includes: obtaining level-two sequencing sequences, which are the sequencing sequences in the high-throughput sequencing data that can be identified by the amplification primers of the target fragments and from which the corresponding amplification primers have been removed; comparing the level-two sequencing sequences with a reference genome sequence to obtain a primary variation result; and correcting the primary variation result using the mutation data in known-mutation data to obtain a processing result.

Further, the step of obtaining the level-two sequencing sequences includes: filtering out low-quality sequencing data from the raw high-throughput sequencing data output by the sequencer to obtain level-one sequencing sequences, where low-quality sequencing data are sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and removing the corresponding amplification primers from the identified sequences to obtain the level-two sequencing sequences.

Further, the step of comparing the level-two sequencing sequences with the reference genome sequence to obtain the primary variation result includes: according to the position information of the amplification primers of the target fragments, cutting the reference alignment sequence of each corresponding target fragment out of the reference genome sequence; and aligning the level-two sequencing sequences against the reference alignment sequences to obtain the primary variation result.
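The cutting of reference alignment sequences can be illustrated with a minimal Python sketch. The `Amplicon` record, its field names and the 0-based half-open coordinates are hypothetical, and the choice to exclude the primer bases from the cut sequence (because the primers have already been removed from the level-two sequencing sequences) is an assumption rather than something the text specifies.

```python
from typing import Dict, NamedTuple


class Amplicon(NamedTuple):
    """Hypothetical record of one target fragment and its primer coordinates."""
    name: str
    chrom: str
    fwd_start: int  # 0-based start of the forward primer
    fwd_end: int    # 0-based exclusive end of the forward primer
    rev_start: int  # 0-based start of the reverse primer
    rev_end: int    # 0-based exclusive end of the reverse primer


def reference_for_amplicon(genome: Dict[str, str], amp: Amplicon) -> str:
    """Cut the reference alignment sequence lying between the two primers.

    Assumption: primer bases are excluded because reads are primer-trimmed.
    """
    return genome[amp.chrom][amp.fwd_end:amp.rev_start]


def build_references(genome: Dict[str, str], amplicons) -> Dict[str, str]:
    """Map each target fragment name to its short reference alignment sequence."""
    return {amp.name: reference_for_amplicon(genome, amp) for amp in amplicons}
```

Aligning each read only against the short sequence of its own target fragment is what keeps the later alignment step small compared with aligning against the whole genome.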

Further, after the level-two sequencing sequences are aligned against the reference alignment sequences and before the primary variation result is obtained, the processing method further includes: aligning the level-two sequencing sequences against the reference alignment sequences to obtain aligned sequences; judging, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence inconsistent with the information of the reference alignment sequence; and, if abnormal sequences exist, filtering them out of the aligned sequences and counting, position by position, the similarities and differences between the remaining sequences and the reference alignment sequence to obtain the primary variation result.

Further, the step of correcting the primary variation result using the mutation data in the known-mutation data to obtain the processing result includes: screening out, from the known-mutation data, the known-mutation data located in the regions corresponding to the target fragments to obtain local sequences of known mutations; screening out, from the primary variation result, the variant sites that are also present in the known-mutation data to form local sequences of the primary variation result; and comparing the local sequences of the primary variation result with the local sequences of known mutations to obtain the processing result.

Further, the step of comparing the local sequences of the primary variation result with the local sequences of known mutations to obtain the processing result includes: comparing the local sequences of the primary variation result with the local sequences of known mutations to obtain a level-two variation result; and correcting the level-two variation result to obtain the processing result. The step of correcting the level-two variation result includes: judging whether adjacent mutation sites exist in the level-two variation result; if so, judging whether the variation frequencies of the adjacent mutation sites differ significantly and whether supporting sequences exist; and, if there is no significant difference and supporting sequences exist, merging the adjacent mutation sites to obtain the processing result.

Further, the step of identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences includes:

Step A: looping over all amplification primers of the target fragments, cutting a specific sequence of length L from each amplification primer starting at its 5' end, and recording the number of specific sequences, the specific sequence corresponding to each pair of amplification primers, and the length of the primer sequence remaining after the specific sequence;

Step B: varying the length L and repeating step A to obtain sets containing different numbers of specific sequences for all amplification primers, and selecting, for subsequent analysis, the length L and the set of specific sequences corresponding to the set with the largest number of specific sequences;

Step C: looping over every sequence in the level-one sequencing sequences, taking the first 25 to 35 bp of each sequence and, starting from the 5' end, cutting it into sequences of the length L selected in step B to obtain a set of cut sequencing sequences;

Step D: finding, among the amplification primers corresponding to the specific sequences in the selected set, the amplification primer that occurs most often in the set of cut sequencing sequences together with its number of occurrences; when this maximum number of occurrences exceeds a set second threshold, the level-one sequencing sequence is considered to have been amplified by that amplification primer and is recorded as an identified sequence.
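Steps A to D can be sketched in Python as follows. This is only one possible reading: it assumes that a specific sequence is an L-length 5' prefix shared by no other primer, that the head of each read is cut into all overlapping windows of length L, and that the second threshold defaults to 0. The function names, the `head` default of 30 bp and the `min_hits` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def specific_prefixes(primers: Dict[str, str], length: int) -> Dict[str, str]:
    """Step A: keep the 5' prefixes of the given length that occur in exactly one primer."""
    prefixes = {name: seq[:length] for name, seq in primers.items() if len(seq) >= length}
    counts = Counter(prefixes.values())
    return {name: p for name, p in prefixes.items() if counts[p] == 1}


def choose_length(primers: Dict[str, str], lengths: Iterable[int]) -> Tuple[int, Dict[str, str]]:
    """Step B: pick the L that yields the most specific prefixes (first one on ties)."""
    best = max(lengths, key=lambda n: len(specific_prefixes(primers, n)))
    return best, specific_prefixes(primers, best)


def identify_read(read: str, specific: Dict[str, str], length: int,
                  head: int = 30, min_hits: int = 0) -> Optional[str]:
    """Steps C and D: return the name of the best-matching primer, or None."""
    window = read[:head]  # first 25 to 35 bp of the read; 30 is an assumed default
    kmers = Counter(window[i:i + length] for i in range(len(window) - length + 1))
    hits = {name: kmers[p] for name, p in specific.items()}
    if not hits:
        return None
    best = max(hits, key=hits.get)
    return best if hits[best] > min_hits else None
```

For example, with primers `ACGTACGTAA`, `ACGTTTGCAA` and `GGGTTTCCAA`, a prefix length of 4 distinguishes only the third primer, whereas a length of 5 distinguishes all three, so step B would select L = 5 from the candidates 4, 5 and 6.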

Further, the step of removing the corresponding amplification primer from the identified sequence to obtain the level-two sequencing sequence includes: removing the amplification primer from the identified sequence according to the position at which the specific sequence of the most frequently occurring amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, thereby obtaining the level-two sequencing sequence.
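The primer removal can be sketched as below. The function name and parameters are hypothetical, and restricting the search for the last occurrence to the first `head` bases of the read is an assumption added so that a chance match deep inside the read is not taken for the primer.

```python
def trim_primer(read: str, primer: str, length: int, head: int = 35) -> str:
    """Remove the primer from an identified read.

    The cut point is the last occurrence of the primer's specific prefix
    (its first `length` bases) within the first `head` bases of the read,
    plus the prefix itself, plus the primer bases remaining after the prefix.
    """
    specific = primer[:length]
    remaining = len(primer) - length
    pos = read.rfind(specific, 0, head)
    if pos < 0:
        return read  # no primer found: leave the read unchanged
    return read[pos + len(specific) + remaining:]
```

Anchoring on the specific prefix rather than on the full primer tolerates mismatches in the remaining primer bases, since those bases are removed by length and not by sequence identity.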

Further, the step of screening out, from the known-mutation data, the known-mutation data located in the regions corresponding to the target fragments to obtain the local sequences of known mutations includes: screening the regions in the known-mutation data that correspond to the target fragments to form known-mutation regions; recording the start position and end position of each known mutation in the known-mutation regions, extending outward from the start position and the end position toward both ends, and recording the extended start position and end position; when the extended start position and end position lie within the region corresponding to the target fragment, the sequence between the extended start position and the extended end position is the local sequence of the known mutation; and when the extended start position and/or end position goes beyond the region corresponding to the target fragment, the boundary of that region is used as the start position and/or end position of the local sequence of the known mutation.
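The extension and boundary handling described above amount to clamping an extended window to the target region. A minimal sketch follows; the function name and the `flank` length of 10 bases are hypothetical, since the text here does not state how far the positions are extended.

```python
from typing import Tuple


def local_window(mut_start: int, mut_end: int,
                 region_start: int, region_end: int,
                 flank: int = 10) -> Tuple[int, int]:
    """Extend a mutation interval by `flank` bases on both sides,
    then clamp it to the boundaries of the target fragment region."""
    start = max(mut_start - flank, region_start)
    end = min(mut_end + flank, region_end)
    return start, end
```

A mutation at positions 100 to 101 inside a region spanning 50 to 200 would give the window (90, 111), while a mutation at 55 to 56 would be clamped on the left and give (50, 66).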

Further, the step of screening out, from the primary variation result, the variant sites that are also present in the known-mutation data to form the local sequences of the primary variation result includes: screening out, from the primary variation result, the variant sites that are also present in the known-mutation data, recording the start position and end position of each variant site, and extending from the start position and the end position toward both ends up to the positions corresponding to the local sequence of the known mutation; the resulting sequence is the local sequence of the primary variation result.

Further, the step of comparing the local sequences of the primary variation result with the local sequences of known mutations to obtain the level-two variation result includes: checking whether a primary variation result exists within the local sequence of the known mutation corresponding to each target fragment; if one primary variation result exists, extending from its start position and end position toward both ends to form one sample-mutation local sequence; if multiple primary variation results exist, judging whether the variation frequencies of the multiple primary variation results differ significantly; if they all differ significantly, extending from the start position and end position of each primary variation result toward both ends to form a separate sample-mutation local sequence for each; if some primary variation results show no significant difference, preliminarily judging those primary variation results to be linked and merging them into a single sample-mutation local sequence, while the remaining primary variation results that do differ significantly each form their own sample-mutation local sequence; judging whether each sample-mutation local sequence is identical to the local sequence of the known mutation and, if identical, calibrating the primary variation result to the known mutation result, otherwise leaving it uncalibrated; and merging the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the level-two variation result.

Further, the step of correcting the level-two variation result includes a step of judging whether the multiple primary variation results judged to be linked contain false positives. This step includes: extracting the sequences that simultaneously cover the multiple variation results, and counting the proportion of level-two sequencing sequences that support the multiple variation results simultaneously; if this proportion does not differ significantly from the proportion of sequences supporting each individual variation result, confirming that the multiple primary variation results occur in linkage, recalculating the mutation frequency as a linked mutation, and obtaining a corrected mutation result when the recalculated mutation frequency meets a third threshold; if this proportion differs significantly from the proportion of sequences supporting each individual variation result, confirming that the linkage of the multiple primary variation results is a false positive, splitting the merged variation results and recalculating the mutation frequencies, and obtaining a corrected mutation result when the recalculated mutation frequency meets the third threshold; and merging the corrected mutation results with the uncorrected mutation results to obtain the processing result.
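The text does not name the statistical test used to decide whether two proportions differ significantly. Purely as an illustration, the sketch below substitutes a pooled two-proportion z-test with a significance level of 0.05; the test, the level, and the function and parameter names are all assumptions. Because the jointly supporting reads are a subset of the reads supporting each single variant, the two samples are not independent, so the p-value is only a rough guide.

```python
import math
from typing import Iterable


def two_proportion_p(k1: int, n1: int, k2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both proportions are 0 or 1: no evidence of a difference
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def is_linked(joint_support: int, single_supports: Iterable[int],
              depth: int, alpha: float = 0.05) -> bool:
    """Treat the variants as linked when the proportion of reads carrying all
    of them does not differ significantly from the proportion supporting each one."""
    return all(two_proportion_p(joint_support, depth, s, depth) >= alpha
               for s in single_supports)
```

With 1000 covering reads, 48 reads carrying both variants against 50 and 52 reads for the individual variants would be judged linked, whereas only 5 joint reads against the same individual counts would be judged a false-positive linkage and the merged result would be split.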

To achieve the above object, according to one aspect of the present invention, a processing device for high-throughput sequencing data is provided. The processing device includes: a level-two sequencing sequence acquisition unit for obtaining level-two sequencing sequences, which are the sequencing sequences in the high-throughput sequencing data that can be identified by the amplification primers of the target fragments and from which the corresponding amplification primers have been removed; a primary variation result acquisition unit for comparing the level-two sequencing sequences with a reference genome sequence to obtain a primary variation result; and a correction unit for correcting the primary variation result using the mutation data in known-mutation data to obtain a processing result.

Further, the level-two sequencing sequence acquisition unit includes: a filtering module for filtering out low-quality sequencing data from the raw high-throughput sequencing data output by the sequencer to obtain level-one sequencing sequences, where low-quality sequencing data are sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; an identification module for identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and a removal module for removing the corresponding amplification primers from the identified sequences to obtain the level-two sequencing sequences.

Further, the primary variation result acquisition unit includes: a cutting module for cutting, according to the position information of the amplification primers of the target fragments, the reference alignment sequence of each corresponding target fragment out of the reference genome sequence; and a first alignment module for aligning the level-two sequencing sequences against the reference alignment sequences to obtain the primary variation result.

Further, for the stage after the level-two sequencing sequences are aligned against the reference alignment sequences and before the primary variation result is obtained, the first alignment module further includes: a first alignment submodule for aligning the level-two sequencing sequences against the reference alignment sequences to obtain aligned sequences; a judging submodule for judging, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence inconsistent with the information of the reference alignment sequence; and a filtering submodule for, when the judging submodule finds that abnormal sequences exist, filtering them out of the aligned sequences and counting, position by position, the similarities and differences between the remaining sequences and the reference alignment sequence to obtain the primary variation result.

Further, the correction unit includes: a local sequence module of known mutations for screening out, from the known-mutation data, the known-mutation data located in the regions corresponding to the target fragments to obtain local sequences of known mutations; a local sequence module of the primary variation result for screening out, from the primary variation result, the variant sites that are also present in the known-mutation data to form local sequences of the primary variation result; and a second alignment module for comparing the local sequences of the primary variation result with the local sequences of known mutations to obtain the processing result.

Further, the second alignment module includes: a second alignment submodule for comparing the local sequences of the primary variation result with the local sequences of known mutations to obtain a level-two variation result; and a correction submodule for correcting the level-two variation result to obtain the processing result. The correction submodule corrects the level-two variation result as follows: judging whether adjacent mutation sites exist in the level-two variation result; if so, judging whether the variation frequencies of the adjacent mutation sites differ significantly and/or whether supporting sequences exist; and, if there is no significant difference and/or supporting sequences exist, merging the adjacent mutation sites to obtain the processing result.

Further, the identification module includes: a first amplification primer specific sequence submodule for looping over all amplification primers of the target fragments, cutting a specific sequence of length L from each amplification primer starting at its 5' end, and recording the number of specific sequences, the specific sequence corresponding to each pair of amplification primers, and the length of the primer sequence remaining after the specific sequence; a second amplification primer specific sequence submodule for varying the length L and repeating the step of the first amplification primer specific sequence submodule to obtain sets containing different numbers of specific sequences for all amplification primers, and selecting, for subsequent analysis, the length L and the set of specific sequences corresponding to the set with the largest number of specific sequences; a sequencing sequence cutting submodule for looping over every sequence in the level-one sequencing sequences, taking the first 25 to 35 bp of each sequence and, starting from the 5' end, cutting it into sequences of the selected length L to obtain a set of cut sequencing sequences; and a search submodule for finding, among the amplification primers corresponding to the specific sequences in the selected set, the amplification primer that occurs most often in the set of cut sequencing sequences together with its number of occurrences, and, when this maximum number of occurrences exceeds a set second threshold, considering the level-one sequencing sequence to have been amplified by that amplification primer and recording it as an identified sequence.

Further, the removal module includes: a removal submodule for removing the amplification primer from the identified sequence according to the position at which the specific sequence of the most frequently occurring amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, thereby obtaining the level-two sequencing sequence.

Further, the local sequence module of known mutations includes: a first screening submodule for screening the regions in the known-mutation data that correspond to the target fragments to form known-mutation regions; a first recording submodule for recording the start position and end position of each known mutation in the known-mutation regions, extending outward from the start position and the end position toward both ends, and recording the extended start position and end position; a first known-mutation sequence generation module for, when the extended start position and end position lie within the region corresponding to the target fragment, recording the sequence between the extended start position and the extended end position as the local sequence of the known mutation; and a second known-mutation sequence generation module for, when the extended start position and/or end position goes beyond the region corresponding to the target fragment, using the boundary of that region as the start position and/or end position of the local sequence of the known mutation.

Further, the local sequence module of the primary variation result includes: a second screening submodule for screening out, from the primary variation result, the variant sites that are also present in the known-mutation data; and a second recording submodule for recording the start position and end position of each variant site and extending from the start position and the end position toward both ends up to the positions corresponding to the local sequence of the known mutation, the resulting sequence being the local sequence of the primary variation result.

Further, the second alignment submodule includes: a first search subcomponent for checking whether a primary variation result exists within the local sequence of the known mutation corresponding to each target fragment; a first sample-mutation local sequence generation component for, when the first search subcomponent finds one primary variation result, extending from the start position and end position of that variation result toward both ends to form one sample-mutation local sequence; a second sample-mutation local sequence generation component for, when the first search subcomponent finds multiple primary variation results, judging whether the variation frequencies of the multiple primary variation results differ significantly, and, if they all differ significantly, extending from the start position and end position of each primary variation result toward both ends to form a separate sample-mutation local sequence for each, or, if some primary variation results show no significant difference, preliminarily judging those primary variation results to be linked and merging them into a single sample-mutation local sequence, while the remaining primary variation results that do differ significantly each form their own sample-mutation local sequence; a calibration subcomponent for judging whether each sample-mutation local sequence is identical to the local sequence of the known mutation and, if identical, calibrating the primary variation result to the known mutation result, otherwise leaving it uncalibrated; and a first merging subcomponent for merging the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the level-two variation result.

Further, the correction submodule includes a linkage false-positive judging subcomponent, which includes: an extraction and counting subcomponent for extracting the sequences that simultaneously cover the multiple variation results and counting the proportion of level-two sequencing sequences that support the multiple variation results simultaneously; a linkage confirmation subcomponent for, when this proportion does not differ significantly from the proportion of sequences supporting each individual variation result, confirming that the multiple primary variation results occur in linkage, recalculating the mutation frequency as a linked mutation, and obtaining a corrected mutation result when the recalculated mutation frequency meets a third threshold; a false-positive confirmation subcomponent for, when this proportion differs significantly from the proportion of sequences supporting each individual variation result, confirming that the linkage of the multiple primary variation results is a false positive, splitting the merged variation results and recalculating the mutation frequencies, and obtaining a corrected mutation result when the recalculated mutation frequency meets the third threshold; and a second merging subcomponent for merging the corrected mutation results with the uncorrected mutation results to obtain the processing result.

According to another aspect of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any one of the processing methods described above.

According to another aspect of the present invention, a processor is provided. The processor is configured to run a program, wherein the program, when run, executes any one of the processing methods described above.

By applying the technical solution of the present invention, the primer portion of every sequence is removed from the raw data obtained by high-throughput sequencing according to known primer information, which reduces false-positive results caused by primer-derived mutations in the overlapping regions of amplification products. In addition, identifying the primers also allows sequences produced by erroneous amplification to be removed from the high-throughput sequencing data, which not only improves the accuracy of subsequent analysis and reduces false-positive results, but also reduces the overall amount of data and improves the efficiency of subsequent analysis steps.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 shows a flow diagram of the processing of high-throughput sequencing data in a preferred embodiment of the present application;

Fig. 2 shows a schematic diagram of mutation calibration in a preferred embodiment of the present application; and

Fig. 3 shows a schematic diagram of linkage merging in a preferred embodiment of the present application.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the embodiments.

The embodiments of the present application are described in detail below. The embodiments described are exemplary; they are intended to explain the present application and should not be construed as limiting it. In the description of the present application, terms such as "level-one", "level-two", "first" and "second" are used for convenience of description and do not imply any order of importance. Where specific techniques or conditions are not indicated in the embodiments, they follow the techniques or conditions described in the literature of the relevant field or the product specifications. Reagents or instruments not otherwise specified are conventional, commercially available products.

As mentioned in the background, when the prior art processes high-throughput sequencing data from multiplex amplification of specific target regions, the processing results often contain many false-positive sites. To improve this situation, a typical embodiment of the present application provides a processing method for high-throughput sequencing data. The processing method includes: obtaining level-two sequencing sequences, which are the sequencing sequences in the high-throughput sequencing data that can be identified by the amplification primers of the target fragments and from which the corresponding amplification primers have been removed; comparing the level-two sequencing sequences with a reference genome sequence to obtain a primary variation result; and correcting the primary variation result using the mutation data in known-mutation data to obtain a processing result.

When multiplex amplification is used to specifically capture target regions, the amplification regions of different amplification primers may overlap, so the presence of amplification primers may interfere with mutation detection in the overlapping regions. For this reason, the above processing method of the present application removes the primer portion of every sequence from the raw data obtained by high-throughput sequencing according to known primer information, which reduces false-positive results caused by primer-derived mutations in the overlapping regions of amplification products. In addition, identifying the primers also allows sequences produced by erroneous amplification to be removed from the high-throughput sequencing data, which not only improves the accuracy of subsequent analysis but also reduces the overall amount of data and improves the efficiency of subsequent analysis steps.

In the above processing method, the step of obtaining the level-two sequencing sequences includes, before the sequencing sequences are identified and trimmed using the known amplification primer information, conventional preprocessing of the high-throughput sequencing data, such as removing low-quality sequencing sequences. In a preferred embodiment of the present application, as shown in the detailed flow chart of the processing method in Fig. 1, the step of obtaining the level-two sequencing sequences includes: filtering out low-quality sequencing data from the raw high-throughput sequencing data output by the sequencer to obtain level-one sequencing sequences, where low-quality sequencing data are sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; identifying the level-one sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and removing the corresponding amplification primers from the identified sequences to obtain the level-two sequencing sequences.

It should be noted here that, in the present application, the off-machine high-throughput sequencing data refers to data in FASTQ or BAM format obtained from the sequencer.

In the above step of obtaining the second-level sequencing sequences, the raw sequencing data are filtered according to sequencing quality and base-calling results, which prevents low-quality data produced during sequencing from interfering with subsequent data analysis and improves the accuracy of the subsequent analysis results.

In the above processing method of the present application, the alignment step can be carried out with alignment procedures conventional in the art. To further increase alignment speed and data-processing efficiency, in a preferred embodiment of the present application, the step of aligning the second-level sequencing sequences with the reference genome sequence to obtain the primary variation result includes: intercepting, from the reference genome sequence, the reference alignment sequence of each corresponding target fragment according to the position information of the amplification primers of the target fragments; and aligning the second-level sequencing sequences with the reference alignment sequences to obtain the primary variation result.
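As a rough sketch of the interception step, the reference alignment sequences can be cut from the genome by amplicon coordinates. The dictionary layout, the 0-based half-open coordinates, and the function name are assumptions made for illustration only.

```python
def intercept_references(genome, amplicons):
    """Cut one reference alignment sequence per amplicon.

    genome: {chromosome name: sequence}
    amplicons: {amplicon name: (chromosome, start, end)}, 0-based half-open
    (hypothetical layout, derived from the primer position information).
    """
    references = {}
    for name, (chrom, start, end) in amplicons.items():
        if not 0 <= start < end <= len(genome[chrom]):
            raise ValueError(f"amplicon {name} lies outside {chrom}")
        references[name] = genome[chrom][start:end]
    return references
```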

Since most sequences obtained by multiplex amplification should be fragments of the target regions, the present application intercepts the reference sequences according to the amplification regions of the primers when aligning the data. This not only saves computing resources but also greatly speeds up the alignment. Specifically, the alignment is a global alignment, and the algorithm is as follows:

(1) Parameter setting: define the score values for a base match, a base mismatch, a base insertion/deletion (gap opening), and the extension of a base insertion/deletion (gap extension) during alignment;

(2) Scoring matrix initialization:

a. Each base of the reference sequence forms one column of the scoring matrix, with the first column left blank;

b. Each base of the sequencing sequence forms one row of the scoring matrix, with the first row left blank;

c. Filling the scoring matrix: the scoring matrix is filled from left to right and from top to bottom according to the following rules:

d. For each empty cell, the scores obtained by extending from the left, from above, and from the upper left are calculated separately. For the upper-left case, it is necessary to judge whether the sequenced base and the reference base at the current position are identical: if identical, the base-match score is added; if different, the base-mismatch score is added. For the left or above cases, it is necessary to judge whether the previous step was also an insertion/deletion: if so, the insertion/deletion extension score is added; otherwise the insertion/deletion score is added.

(3) The highest of the scores calculated for the three cases is taken as the alignment score of the cell, and the path source of the highest score is recorded.

(4) Optimal path traceback: tracing back from the lower-right corner of the scoring matrix according to the path source of each cell yields the alignment results, from which the optimal alignment result is chosen.
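Steps (1) to (4) can be illustrated with the sketch below. It uses the standard three-layer formulation of global alignment with affine gap penalties (often attributed to Gotoh), which may differ in bookkeeping from the single-matrix description above; the score values, the function name, and the layer labels are assumptions, not values given in the application.

```python
NEG_INF = float("-inf")


def global_align(ref, read, match=2, mismatch=-3, gap_open=-5, gap_extend=-1):
    """Global alignment with affine gaps; returns (score, aligned_ref, aligned_read)."""
    n, m = len(read), len(ref)
    # One score layer per "previous step": D = upper left (base vs base),
    # U = from above (gap in the reference), L = from the left (gap in the read).
    s = {k: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for k in "DUL"}
    s["D"][0][0] = 0
    for i in range(1, n + 1):
        s["U"][i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        s["L"][0][j] = gap_open + (j - 1) * gap_extend

    def gap_costs(state):
        # Continuing the same gap costs gap_extend; starting one costs gap_open.
        return {k: (gap_extend if k == state else gap_open) for k in "DUL"}

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            s["D"][i][j] = max(s[k][i - 1][j - 1] for k in "DUL") + sub
            s["U"][i][j] = max(s[k][i - 1][j] + c for k, c in gap_costs("U").items())
            s["L"][i][j] = max(s[k][i][j - 1] + c for k, c in gap_costs("L").items())

    # Traceback from the lower-right corner, recovering the best source of each cell.
    i, j = n, m
    state = max("DUL", key=lambda k: s[k][i][j])
    score = s[state][i][j]
    out_ref, out_read = [], []
    while i > 0 or j > 0:
        current = s[state][i][j]
        if state == "D":
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            out_ref.append(ref[j - 1])
            out_read.append(read[i - 1])
            i, j = i - 1, j - 1
            costs = {k: sub for k in "DUL"}
        elif state == "U":
            out_ref.append("-")
            out_read.append(read[i - 1])
            i -= 1
            costs = gap_costs("U")
        else:
            out_ref.append(ref[j - 1])
            out_read.append("-")
            j -= 1
            costs = gap_costs("L")
        state = next(k for k in "DUL" if s[k][i][j] + costs[k] == current)
    return score, "".join(reversed(out_ref)), "".join(reversed(out_read))
```

With these example scores, a read that lacks one base of an eight-base reference aligns with a single gap, so the deletion is reported as one event rather than as a run of mismatches.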

In a preferred embodiment of the present application, after the second-level sequencing sequences are aligned with the reference alignment sequences and before the primary variation result is obtained, the processing method further includes: aligning the second-level sequencing sequences with the reference alignment sequences to obtain aligned sequences; judging, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, where abnormal sequences are sequences whose alignment quality is below a threshold or sequences inconsistent with the information of the reference alignment sequences; and, if any exist, filtering the abnormal sequences out of the aligned sequences and counting the similarities and differences between each position of the remaining sequences and the reference alignment sequences to obtain the primary variation result.

Because the reference sequences are intercepted, some non-target amplified sequences may be forcibly aligned to the intercepted reference sequences, which easily interferes with subsequent mutation detection. In addition, the sequences amplified by each primer pair should be sequences within the corresponding amplification region, so their alignment positions should be essentially consistent with the position of the target amplification region. Based on these two points, the above preferred embodiment preliminarily filters the alignment results after sequence alignment, which helps improve the accuracy of the subsequent analysis results.

Among the above abnormal sequences, a sequence whose alignment quality is below the first threshold is determined as follows: an alignment quality value is obtained from the alignment algorithm, a threshold is set, and any sequence whose alignment quality value is below the threshold is regarded as having too low an alignment quality. Based on practical experience, the first threshold is usually 5. A sequence inconsistent with the information of the reference alignment sequences is a sequence that, according to the alignment algorithm, is not completely aligned to the reference alignment sequence.
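A minimal sketch of this preliminary filtering is given below. The record layout (dictionaries with `quality`, `start`, and `primer` keys), the position tolerance `max_offset`, and the function name are hypothetical; only the quality threshold of 5 comes from the text.

```python
def filter_abnormal(alignments, amplicon_starts, min_quality=5, max_offset=3):
    """Drop alignments that are low quality or not placed at their amplicon.

    alignments: list of dicts with "quality", "start" and "primer" keys.
    amplicon_starts: {primer name: expected alignment start}.
    max_offset is an assumed tolerance; the application does not give one.
    """
    kept = []
    for aln in alignments:
        low_quality = aln["quality"] < min_quality
        misplaced = abs(aln["start"] - amplicon_starts[aln["primer"]]) > max_offset
        if not (low_quality or misplaced):
            kept.append(aln)
    return kept
```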

In existing variation detection methods for high-throughput sequencing data, the specific information of a mutation is usually taken from the alignment result. The defect of this approach is that, when several alignments are possible near the position of a mutation, the output alignment result may be inconsistent with the information in existing databases, making subsequent association with those databases impossible. In addition, a relatively complex mutation may be split into several smaller mutations during alignment in order to obtain the optimal alignment score, which does not match the real variation. The final analysis result is therefore also inaccurate.

To improve on this situation, in a preferred embodiment of the present application, the step of correcting the primary variation result using the mutation data in the known mutation data to obtain the processing result includes: screening out, from the known mutation data, the known mutation data within the regions corresponding to the target fragments to obtain the local sequences of the known mutations; screening out, from the primary variation result, the variant sites that are also present in the known mutation data to form the local sequences of the primary variation result; and comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.

Correcting and calibrating the primary variation result against the local sequences of the known mutations makes the final processing result more accurate. Specifically, the local sequences of the known mutations corresponding to the amplification regions need to be generated in advance; they can be obtained by intercepting the known mutation data according to the specific information of the amplification primers. Sequences in the primary variation result that fall within a known mutation region form the local sequences of the primary variation result, whereas those outside the known mutation regions are not corrected or calibrated.

Although some complex mutations that may occur can be calibrated according to the known mutation data, actual detection may also encounter complex mutations that are not recorded in the known mutation database. Therefore, after the mutation result is obtained, it is judged whether it contains multiple mutations that may occur in linkage, and these are corrected to further improve the accuracy of the processing result. In a preferred embodiment of the present application, as shown in Fig. 1 and Fig. 2, the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result includes: comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain a second-level variation result; and correcting the second-level variation result to obtain the processing result; wherein the step of correcting the second-level variation result includes: judging whether adjacent mutation sites exist in the second-level variation result; if so, judging whether the variation frequencies of the adjacent mutation sites differ significantly and whether supporting sequences exist; and, if there is no significant difference and supporting sequences exist, merging the adjacent mutation sites to obtain the processing result.

In the above processing method, the basic principle of identifying sequencing sequences using known amplification primers is to use the sequences specific to each amplification primer as specific markers of that primer. When the specific sequences of a given primer pair occur multiple times within the first 25~35 bp of a sequencing sequence, the sequence is considered to have been amplified by that amplification primer. After the corresponding amplification primer is identified, it can be removed according to its length. Different specific algorithms can be designed on this principle to implement the primer identification and removal functions.

In a preferred embodiment of the present application, the step of identifying the first-level sequencing sequences using the amplification primers of the target fragments to obtain identified sequences includes: Step A, looping over all amplification primers of the target fragments and, starting from the 5' end of each amplification primer, intercepting specific sequences of length L, and recording for each amplification primer pair the number of specific sequences, the specific sequences themselves, and the length of the primer sequence remaining after each specific sequence; Step B, varying the length L and repeating Step A to obtain sets containing different numbers of specific sequences for all amplification primers, then selecting the length L corresponding to the set with the most specific sequences, together with that set, for subsequent analysis; Step C, looping over every sequence in the first-level sequencing sequences, taking the first 25~35 bp of each sequence and, starting from the 5' end, intercepting subsequences of the length L corresponding to the set with the most specific sequences, to obtain a set of intercepted sequencing subsequences; Step D, looking up the amplification primers corresponding to the specific sequences in the selected set, finding the amplification primer that occurs most often in the set of intercepted sequencing subsequences and its count, and, when the maximum count exceeds the set second threshold (usually 3), considering the first-level sequencing sequence to have been amplified by the most frequently occurring amplification primer and recording it as an identified sequence.
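Steps A to D can be sketched as follows. This is an illustrative reading, assuming that the specific sequences are all length-L windows of a primer that occur in no other primer, that the candidate lengths range from 4 to 12, and that the first 30 bp of each read are examined; the function names and these defaults are hypothetical.

```python
from collections import Counter


def specific_kmers(primers, length):
    """Step A: map each window unique to one primer to (primer name, bases left after it)."""
    owners = {}
    for name, seq in primers.items():
        for i in range(len(seq) - length + 1):
            owners.setdefault(seq[i:i + length], set()).add(name)
    table = {}
    for name, seq in primers.items():
        for i in range(len(seq) - length + 1):
            kmer = seq[i:i + length]
            if owners[kmer] == {name}:
                # If a window repeats inside one primer, the last position wins.
                table[kmer] = (name, len(seq) - (i + length))
    return table


def choose_length(primers, lengths=range(4, 13)):
    """Step B: pick the length L that yields the most specific sequences."""
    best = max(lengths, key=lambda length: len(specific_kmers(primers, length)))
    return best, specific_kmers(primers, best)


def identify_primer(read, length, table, prefix=30, min_hits=3):
    """Steps C and D: return the primer that amplified the read, or None."""
    head = read[:prefix]
    hits = Counter()
    for i in range(len(head) - length + 1):
        hit = table.get(head[i:i + length])
        if hit is not None:
            hits[hit[0]] += 1
    if not hits:
        return None
    name, count = hits.most_common(1)[0]
    return name if count > min_hits else None
```

A read for which `identify_primer` returns `None` would be discarded as a likely erroneous amplification product under this reading.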

The amplification primer identification and removal method of the present application can effectively remove the interference caused by primer-introduced mutations as well as erroneously amplified sequences generated during library construction and/or sequencing, which on the one hand improves the accuracy of the processing result and on the other hand reduces the overall data volume and improves the efficiency of the subsequent analysis steps.

It should be noted that, in the above preferred embodiment, a specific sequence is a sequence that has no identical counterpart among all the amplification primers, i.e., it is unique. If the same sequence can be intercepted from different amplification primers, that sequence cannot be called a specific sequence.

In a preferred embodiment of the present application, the step of removing the corresponding amplification primers from the identified sequences to obtain the second-level sequencing sequences includes: removing the amplification primer from each identified sequence according to the position at which a specific sequence of the most frequently occurring amplification primer last appears in the identified sequence and the length of the amplification primer sequence remaining after that position, thereby obtaining the second-level sequencing sequences.

Specifically, the number of specific sequences that can be intercepted from a given amplification primer pair depends on the length of the amplification primer on the one hand and on the set interception length on the other. The longer the amplification primer, the more specific sequences can be intercepted; the longer the set interception length, the fewer specific sequences can be intercepted. The length of the amplification primer removed from an identified sequence here is not necessarily identical to the full length of the amplification primer used during library construction. Because of sequencing errors or errors from other unknown causes, in each sequencing sequence the sequence before the last position of a specific sequence identified for a given amplification primer pair, together with the following sequence of the remaining length, is regarded as primer sequence amplified from the identified primer regardless of whether it is actually identical to the amplification primer sequence, and therefore needs to be removed. In other words, the sequence of the remaining length after the last position of the identified specific sequence may not be exactly identical to the actual remaining sequence of the amplification primer, but a base sequence of equal length must still be removed. The present application has verified through multiple tests that this removal has no effect on the final processing result.
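The removal rule can be sketched as below: the cut point is the end of the last specific sequence found for the identified primer plus the primer length recorded as remaining after it. The lookup table layout (specific sequence mapped to primer name and remaining length) and the function name are assumptions for illustration.

```python
def trim_primer(read, primer_name, length, table, prefix=30):
    """Remove the primer from an identified read.

    table: {specific sequence: (primer name, bases of primer remaining after it)}
    (hypothetical layout). The read is returned unchanged if no specific
    sequence of the given primer is found in its first `prefix` bases.
    """
    head = read[:prefix]
    cut = None
    for i in range(len(head) - length + 1):
        hit = table.get(head[i:i + length])
        if hit is not None and hit[0] == primer_name:
            # Later hits overwrite earlier ones, so the last occurrence decides.
            cut = i + length + hit[1]
    return read if cut is None else read[cut:]
```

In the example below the last primer base was misread, so the final window no longer matches, yet the same number of bases is removed because the cut is derived from the last matching window and its recorded remainder.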

In a preferred embodiment of the present application, as shown in Fig. 2, the step of screening out, from the known mutation data, the known mutation data within the regions corresponding to the target fragments to obtain the local sequences of the known mutations includes: screening the regions corresponding to the target fragments in the known mutation data to form known mutation regions; recording the start position and end position of each known mutation in the known mutation regions, extending from the start position and the end position toward the two ends respectively (preferably by 10~15 bp each), and then recording the extended start position and end position; when the extended start position and end position lie within the region corresponding to the target fragment, the sequence between the extended start position and the extended end position is the local sequence of the known mutation; when the extended start position and/or end position lies beyond the region corresponding to the target fragment, the boundary of that region is used as the start position and/or end position of the local sequence of the known mutation.

As shown in Fig. 2, according to the mutation information recorded in the known mutation database (including the mutation start position and the variant base type), a local reference sequence and the corresponding local variant sequence are generated by extending forward from the start position and backward from the end position. If the extended region goes beyond the region corresponding to the target fragment carrying the amplification primers, the boundary of the region corresponding to the target fragment is used as the start position and/or end position of the local sequence of the known mutation. For example, suppose there is a point mutation extended by 10 bp on each side: if the extended start and end positions are within the range of the target fragment, the local sequence of the mutation is 21 bp long. If, after extension to the right, the end position exceeds the target fragment by 5 bp, those 5 bp are discarded and the local sequence of the mutation is 16 bp long. Comparing local sequences ensures that the comparison result lies within the target fragment region, which makes the comparison result more accurate, avoids the interference of multiple local alignment results with mutation detection, and facilitates subsequent association with various databases.
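The extension and boundary rule, including the 21 bp and 16 bp cases of the example, can be sketched as follows. Coordinates are taken as 1-based and inclusive, which is an assumption, as is the function name.

```python
def local_window(mut_start, mut_end, target_start, target_end, flank=10):
    """Extend a mutation by `flank` bases on each side, clipped to the target fragment.

    All coordinates are 1-based and inclusive (assumed convention).
    Returns the (start, end) of the local sequence.
    """
    start = max(mut_start - flank, target_start)
    end = min(mut_end + flank, target_end)
    return start, end


def window_length(window):
    """Length of an inclusive (start, end) window."""
    start, end = window
    return end - start + 1
```

For a point mutation at position 50 of a target fragment spanning 1 to 100 the window is 40 to 60 (21 bp); for a point mutation at position 95 the right-hand extension is clipped at 100, giving 85 to 100 (16 bp).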

It should be noted that the step of generating the local sequences of the known mutations only needs to be carried out once. If the target fragment regions do not change, the local sequences of the known mutations need not be regenerated for each analysis. Moreover, the timing of generating the local sequences of the known mutations is not limited, as long as they are available before the comparison step.

Since not all variant sites in the primary variation result are present in the known mutation data, in order to compare the variation result with the known variation data while preventing multiple primary variant sites from being matched to different known mutation positions, the local sequences of the variant sites present in both the known mutation data and the primary variation result can be compared individually with the local sequences of the known mutations. In a preferred embodiment of the present application, the step of screening out, from the primary variation result, the variant sites also present in the known mutation data to form the local sequences of the primary variation result includes: screening out, from the primary variation result, the variant sites that are also present in the known mutation data; recording the start position and end position of each variant site; and extending from the start position and end position toward the two ends respectively up to the positions corresponding to the local sequence of the known mutation, the result being the local sequence of the primary variation result.

In a preferred embodiment of the present application, the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the second-level variation result includes: checking, for the local sequence of the known mutation corresponding to each target fragment, whether any primary variation result falls within it; if one primary variation result is present, forming a sample-mutation local sequence by extending from the start position and end position of that variation result toward the two ends respectively; if multiple primary variation results are present, judging whether the variation frequencies of the multiple primary variation results differ significantly; if all of them differ significantly, forming a separate sample-mutation local sequence for each primary variation result by extending from its start position and end position toward the two ends respectively; if some primary variation results show no significant difference, preliminarily judging those primary variation results to be linked and merging them to form one sample-mutation local sequence, while the remaining primary variation results that do differ significantly each form their own sample-mutation local sequence; judging whether each sample-mutation local sequence is identical to the local sequence of the known mutation and, if identical, calibrating the primary variation result to the known mutation result, or, if different, leaving it uncalibrated; and merging the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the second-level variation result.

Specifically, in the above preferred embodiment, as shown in Fig. 2, the mutation type of ChrA in the primary variation result is identical to the mutation type in the known mutation database (both are ATCG deletions), but the positions recorded in the primary variation result are x4 and y4, whereas the positions in the known mutation data are x1 and y1. The primary variation result at this mutation site is therefore calibrated to the known mutation result according to the mutation information in the known mutation database. In Fig. 2, the mutation information of ChrD is not present in the known mutation database, so no calibration is performed.
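The reason why two differently positioned calls can be calibrated to each other is that, inside a repeat, deleting the same bases at different offsets yields the same local sequence. The sketch below illustrates this for deletions only; the example sequence, the 0-based (start, length) encoding, and the function names are hypothetical and are not the data of Fig. 2.

```python
def apply_deletion(local_ref, start, length):
    """Return the local sequence after deleting `length` bases at 0-based `start`."""
    return local_ref[:start] + local_ref[start + length:]


def calibrate(local_ref, primary_call, known_calls):
    """Replace a primary deletion call by an equivalent known call, if any.

    Calls are (start, length) tuples relative to the local reference sequence.
    Two calls are treated as equivalent when they produce the same local sequence.
    """
    sample_sequence = apply_deletion(local_ref, *primary_call)
    for known in known_calls:
        if apply_deletion(local_ref, *known) == sample_sequence:
            return known
    return primary_call  # not in the known data: left uncalibrated
```

In a local reference such as GGC ATCG ATCG ATCG TTA, deleting ATCG at offset 11 or at offset 3 gives the same sequence, so a call reported at offset 11 would be calibrated to a database entry at offset 3.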

In a preferred embodiment of the present application, as shown in Fig. 3, the step of correcting the second-level variation result includes a step of judging whether multiple primary variation results judged to be linked contain false positives, which includes: extracting the sequences that simultaneously cover the multiple variation results, and counting the proportion of second-level sequencing sequences that simultaneously support the multiple variation results; if this proportion does not differ significantly from the proportion of sequences supporting each individual variation result, confirming that the multiple primary variation results occur in linkage, recalculating the mutation frequency as a linked mutation, and obtaining a corrected mutation result when the recalculated mutation frequency meets the third threshold; if this proportion differs significantly from the proportion of sequences supporting each individual variation result, confirming that the linkage of the multiple primary variation results is a false positive, splitting the merged variation results, recalculating the mutation frequencies, and obtaining a corrected mutation result when the recalculated mutation frequencies meet the third threshold (usually 2%); and merging the corrected mutation results with the uncorrected mutation results to obtain the processing result.
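The linkage check can be sketched as below. The application does not name the significance test, so a pooled two-proportion z-test with a 0.05 significance level is used here purely as an assumption; the function names, the input layout, and the reading of "meets the third threshold" as "at least 2%" are likewise hypothetical.

```python
import math


def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def check_linkage(both, spanning, singles, alpha=0.05, min_freq=0.02):
    """Decide whether variants merged as linked are truly linked.

    both: reads carrying all variants at once; spanning: reads covering all sites;
    singles: list of (supporting reads, covering reads), one pair per variant.
    Returns ("linked", [frequency]) or ("split", [frequencies]), keeping only
    frequencies of at least min_freq.
    """
    linked = all(two_proportion_p(both, spanning, k, n) >= alpha for k, n in singles)
    if linked:
        freq = both / spanning
        return "linked", [freq] if freq >= min_freq else []
    return "split", [k / n for k, n in singles if k / n >= min_freq]
```

With 48 of 1000 spanning reads carrying both variants and about 5% of reads supporting each variant individually, the sketch reports a linked mutation; with only 5 of 1000 reads carrying both, the merged call is split back into individual mutations.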

Through the above steps, multiple individual mutations in the second-level variation result that are truly linked are corrected into a linked mutation, and mutations merged as a false-positive linkage are split into individual mutations, making the mutation result more accurate.

In another typical embodiment of the present application, a processing device for high-throughput sequencing data is provided. The processing device includes: a second-level sequencing sequence acquiring unit for acquiring second-level sequencing sequences, the second-level sequencing sequences being the sequencing sequences in the high-throughput sequencing data that can be identified by the target fragment amplification primers and from which the corresponding amplification primers have been removed; a primary variation result acquiring unit for aligning the second-level sequencing sequences with the reference genome sequence to obtain the primary variation result; and a correcting unit for correcting the primary variation result using the mutation data in the known mutation data to obtain the processing result.

The above processing device of the present application acquires, by means of the second-level sequencing sequence acquiring unit, the sequencing sequences that are identified by the target fragment amplification primers and from which the corresponding amplification primers have been removed; the primary variation result acquiring unit then aligns the acquired second-level sequencing sequences with the reference genome sequence; and the correcting unit performs the correction step on the resulting primary variation result, so that the variation results in the final processing result are more accurate.

In a preferred embodiment of the present application, the second-level sequencing sequence acquiring unit includes: a filtering module for filtering out the low-quality sequencing data in the off-machine high-throughput sequencing data to obtain first-level sequencing sequences, where low-quality sequencing data refers to sequencing sequences whose Q20 is below 80% or whose proportion of N bases exceeds 10%; an identification module for identifying the first-level sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and a removal module for removing the corresponding amplification primers from the identified sequences to obtain the second-level sequencing sequences.

The above second-level sequencing sequence acquiring unit filters the raw sequencing data according to sequencing quality and base-calling results, which prevents low-quality data produced during sequencing from interfering with subsequent data analysis and improves the accuracy of the subsequent analysis results.

In a preferred embodiment of the present application, the primary variation result acquiring unit includes: an interception module for intercepting, from the reference genome sequence, the reference alignment sequence of each corresponding target fragment according to the position information of the amplification primers of the target fragments; and a first comparing module for aligning the second-level sequencing sequences with the reference alignment sequences to obtain the primary variation result.

Since most sequences obtained by multiplex amplification should be fragments of the target regions, the above primary variation result acquiring unit intercepts the reference sequences according to the amplification regions of the primers when aligning the data, which not only saves computing resources but also greatly speeds up the alignment.

In a preferred embodiment of the present application, the first comparing module, for use after the second-level sequencing sequences are aligned with the reference alignment sequences and before the primary variation result is obtained, further includes: a first comparing submodule for aligning the second-level sequencing sequences with the reference alignment sequences to obtain aligned sequences; a judging submodule for judging, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, where abnormal sequences are sequences whose alignment quality is below the first threshold or sequences inconsistent with the information of the reference alignment sequences and the amplification primers; and a filtering submodule for, when the judging submodule determines that abnormal sequences exist, filtering the abnormal sequences out of the aligned sequences and counting the similarities and differences between each position of the remaining sequences and the reference alignment sequences to obtain the primary variation result.

Because the interception module intercepts the amplified target fragment sequences from the reference sequence, some non-target amplified sequences may be forcibly aligned to the intercepted reference sequences, which easily interferes with subsequent mutation detection. In addition, the sequences amplified by each primer pair should be sequences within the corresponding amplification region, so their alignment positions should be essentially consistent with the position of the target amplification region. Based on these two points, the above preferred embodiment provides the judging submodule and the filtering submodule, which respectively judge abnormalities in and preliminarily filter the sequence alignment results, helping to improve the accuracy of the subsequent analysis results.

In a preferred embodiment of the present application, the correcting unit includes: a known-mutation local sequence module for screening out, from the known mutation data, the known mutation data within the regions corresponding to the target fragments to obtain the local sequences of the known mutations; a primary-variation-result local sequence module for screening out, from the primary variation result, the variant sites that are also present in the known mutation data to form the local sequences of the primary variation result; and a second comparing module for comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.

The correcting unit corrects the local sequences of the primary variation result according to the local sequences of the known mutations, so that the variation results obtained by the processing are more accurate.

In a preferred embodiment of the present application, the second comparing module includes: a second comparing submodule for comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the second-level variation result; and a correcting submodule for correcting the second-level variation result to obtain the processing result; wherein the correcting submodule performs the following correction step on the second-level variation result: judging whether adjacent mutation sites exist in the second-level variation result; if so, judging whether the variation frequencies of the adjacent mutation sites differ significantly and/or whether supporting sequences exist; and, if there is no significant difference and/or supporting sequences exist, merging the adjacent mutation sites to obtain the processing result.

Although the above second comparing module can correct and calibrate some complex mutations that may occur according to the known mutation data, actual detection may also encounter complex mutations that are not recorded in the known mutation database. Therefore, after the mutation result is obtained, the above correcting submodule is executed to judge whether the result contains multiple mutations that may occur in linkage and to correct them, further improving the accuracy of the processing result.

In a preferred embodiment of the present application, the identification module includes: a first amplification primer specific sequence submodule for looping over all amplification primers of the target fragments and, starting from the 5' end of each amplification primer, intercepting specific sequences of length L, and recording for each amplification primer pair the number of specific sequences, the specific sequences themselves, and the length of the primer sequence remaining after each specific sequence; a second amplification primer specific sequence submodule for varying the length L and repeating the step of the first amplification primer specific sequence submodule to obtain sets containing different numbers of specific sequences for all amplification primers, and for selecting the length L corresponding to the set with the most specific sequences, together with that set, for subsequent analysis; a sequencing sequence interception submodule for looping over every sequence in the first-level sequencing sequences, taking the first 25~35 bp of each sequence and, starting from the 5' end, intercepting subsequences of the length L corresponding to the set with the most specific sequences, to obtain a set of intercepted sequencing subsequences; and a lookup submodule for looking up the amplification primers corresponding to the specific sequences in the selected set, finding the amplification primer that occurs most often in the set of intercepted sequencing subsequences and its count, and, when the maximum count exceeds the set second threshold (usually 3), considering the first-level sequencing sequence to have been amplified by the most frequently occurring amplification primer and recording it as an identified sequence.

In a preferred embodiment of the application, the removal module includes: a removal submodule, configured to remove the amplification primer from the identified sequence according to the position at which a specific sequence of the most frequent amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, thereby obtaining the secondary sequencing sequence.
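The identification and removal steps described above can be sketched as follows. This is a minimal illustration and not the applicant's implementation: it assumes that a "specific sequence" is a length-L subsequence (k-mer) that occurs in exactly one primer, and the function names, the candidate range for L, and the 30 bp head length (within the stated 25 to 35 bp) are all hypothetical choices.

```python
from collections import Counter, defaultdict

def specific_kmers(primers, L):
    """Map each length-L k-mer that occurs in exactly one primer (assumed
    meaning of "specific sequence") to (primer name, bases left after it)."""
    owners = defaultdict(set)
    for name, seq in primers.items():
        for i in range(len(seq) - L + 1):
            owners[seq[i:i + L]].add((name, len(seq) - (i + L)))
    # Keep only k-mers with a single, unambiguous (primer, remainder) record.
    return {k: next(iter(v)) for k, v in owners.items() if len(v) == 1}

def best_kmer_table(primers, lengths=range(6, 13)):
    """Try several values of L and keep the one giving the most specific k-mers."""
    tables = {L: specific_kmers(primers, L) for L in lengths}
    best_L = max(tables, key=lambda L: len(tables[L]))
    return best_L, tables[best_L]

def trim_primer(read, L, table, head_len=30, min_hits=3):
    """Identify the primer at the 5' end of a read and cut it off.
    Returns (primer name, trimmed read), or None if no primer is identified."""
    head = read[:head_len]
    counts, cut = Counter(), {}
    for i in range(len(head) - L + 1):
        hit = table.get(head[i:i + L])
        if hit is not None:
            name, remaining = hit
            counts[name] += 1
            # Last occurrence wins: k-mer end plus the primer bases after it.
            cut[name] = i + L + remaining
    if not counts:
        return None
    name, n = counts.most_common(1)[0]
    if n <= min_hits:  # the count must exceed the second threshold
        return None
    return name, read[cut[name]:]
```

Because the lookup is a dictionary access per window rather than an alignment, the cost per read is proportional to the head length, which is consistent with the efficiency argument made for identifying primers by specific sequences.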

In a preferred embodiment of the application, the known mutation local sequence module includes: a first screening submodule, configured to screen the known mutation data for the regions corresponding to the target fragments, forming a known mutation region; a first recording submodule, configured to record the start position and end position of each known mutation in the known mutation region, extend from the start position and end position toward both ends, and then record the extended start position and end position; a first known mutation sequence generating module, configured to record the sequence between the extended start position and the extended end position as the local sequence of the known mutation when both extended positions lie within the region corresponding to the target fragment; and a second known mutation sequence generating module, configured to use the boundary of the region corresponding to the target fragment as the start position and/or end position of the local sequence of the known mutation when the extended start position and/or end position falls outside that region.
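The extension and boundary handling described above amounts to clamping a flanked window to the target region. The sketch below assumes 0-based inclusive coordinates and a flank of 10 bp; the flank length for known mutations is not stated in this passage, so that default, like the function names, is an assumption made only for illustration.

```python
def local_window(start, end, region_start, region_end, flank=10):
    """Extend [start, end] by `flank` on both sides, then clamp the result
    to the target region [region_start, region_end] (all 0-based, inclusive)."""
    return max(start - flank, region_start), min(end + flank, region_end)

def local_sequence(reference, start, end, region_start, region_end, flank=10):
    """Return the reference bases inside the clamped window."""
    lo, hi = local_window(start, end, region_start, region_end, flank)
    return reference[lo:hi + 1]
```

A mutation near the edge of the amplicon therefore gets a shorter local sequence on that side rather than one that runs past the target fragment.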

In another preferred embodiment, the local sequence module for the primary variation result includes: a second screening submodule, configured to screen the primary variation result for variant sites that are also present in the known mutation data; and a second recording submodule, configured to record the start position and end position of each such variant site and extend from them toward both ends up to the positions corresponding to the local sequence of the known mutation, the resulting sequence being the local sequence of the primary variation result.

In a preferred embodiment of the application, the second comparison submodule includes: a first lookup subcomponent, configured to look up whether any primary variation result falls within the local sequence of the known mutations corresponding to each target fragment; a first sample mutation local sequence generating component, configured, when the first lookup subcomponent finds exactly one primary variation result, to extend from the start position and end position of that variation result toward both ends, forming one sample mutation local sequence; a second sample mutation local sequence generating component, configured, when the first lookup subcomponent finds multiple primary variation results, to determine whether the variation frequencies of the multiple primary variation results differ significantly: if all of them differ significantly from one another, each primary variation result is extended from its start position and end position toward both ends (preferably by 10 to 15 bp each) to form its own sample mutation local sequence; if some primary variation results show no significant difference, those results are tentatively judged to be linked and are merged into a single sample mutation local sequence, while the remaining primary variation results that do differ significantly each form their own sample mutation local sequence; a calibration subcomponent, configured to determine whether each sample mutation local sequence is identical to the local sequence of a known mutation, calibrating the primary variation result to the known mutation result if they are identical and leaving it uncalibrated otherwise; and a first merging subcomponent, configured to merge the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites, obtaining the secondary variation result.
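The grouping of primary variation results by variation frequency can be illustrated as below. The text does not name the significance test, so this sketch substitutes a two-proportion z-test at the 5% level purely as an assumption; the dictionary keys (`counts`, `start`, `end`), the greedy grouping strategy, and the function names are likewise hypothetical.

```python
import math

def freqs_differ(a, b, z_crit=1.96):
    """Two-proportion z-test on (variant_reads, depth) pairs (assumed test).
    Returns True when the two variant frequencies differ significantly."""
    (x1, n1), (x2, n2) = a, b
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:  # both frequencies are 0 or both are 1
        return False
    return abs(x1 / n1 - x2 / n2) / se > z_crit

def group_variants(variants):
    """Greedily place each variant into the first group whose members all
    have a frequency not significantly different from its own."""
    groups = []
    for v in variants:
        for g in groups:
            if all(not freqs_differ(v["counts"], m["counts"]) for m in g):
                g.append(v)
                break
        else:
            groups.append([v])
    return groups

def merged_window(group, flank=10):
    """Window spanning every variant in a group, extended by `flank` per side."""
    return (min(v["start"] for v in group) - flank,
            max(v["end"] for v in group) + flank)
```

Groups with more than one member correspond to the tentatively linked variants that share one sample mutation local sequence; single-member groups correspond to variants that each form their own.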

In a preferred embodiment of the application, the correction submodule includes a linkage false-positive judgment subcomponent, which in turn includes: an extraction and counting subcomponent, configured to extract the sequences that cover multiple variation results simultaneously and to count the proportion of secondary sequencing sequences that support all of those variation results simultaneously; a linkage confirmation subcomponent, configured, when that proportion does not differ significantly from the proportion of sequences supporting each individual variation result, to confirm that the multiple primary variation results occur in linkage and to recalculate the mutation frequency as a linked mutation, a corrected mutation result being obtained when the recalculated mutation frequency meets a third threshold; a false-positive confirmation subcomponent, configured, when that proportion does differ significantly from the proportion of sequences supporting each individual variation result, to confirm that the linkage of the multiple primary variation results is a false positive, to split the merged variation results and recalculate their mutation frequencies, a corrected mutation result being obtained when the recalculated mutation frequency meets the third threshold (usually 2%); and a second merging subcomponent, configured to merge the corrected mutation results with the uncorrected mutation results, obtaining the processing result.
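The linkage check above can be sketched as follows, under stated assumptions. Each read that spans all variant sites is represented as the set of variant identifiers it carries; the significance test is again an assumed two-proportion z-test (a rough choice, since the joint and single counts come from the same reads), and `resolve_linkage` and its parameters are hypothetical names. The 2% default follows the "usually 2%" third threshold.

```python
import math

def differs(x1, x2, n, z_crit=1.96):
    """Assumed two-proportion z-test for two counts out of the same depth n."""
    p = (x1 + x2) / (2 * n)
    se = math.sqrt(p * (1 - p) * 2 / n)
    return se > 0 and abs(x1 - x2) / n / se > z_crit

def resolve_linkage(reads, variants, min_freq=0.02):
    """reads: one set of variant ids per read spanning all variant sites.
    Returns [(variant ids, frequency)] for calls meeting the threshold."""
    depth = len(reads)
    if depth == 0:
        return []
    joint = sum(1 for r in reads if all(v in r for v in variants))
    singles = {v: sum(1 for r in reads if v in r) for v in variants}
    linked = all(not differs(joint, c, depth) for c in singles.values())
    if linked:
        # Report one linked mutation with the jointly supported frequency.
        calls = [(tuple(variants), joint / depth)]
    else:
        # False-positive linkage: split and report each variant on its own.
        calls = [((v,), c / depth) for v, c in singles.items()]
    return [(ids, f) for ids, f in calls if f >= min_freq]
```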

In a third typical embodiment of the application, a storage medium is also provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the above processing method for high-throughput sequencing data.

In a fourth typical embodiment of the application, a processor is also provided. The processor is configured to run a program, wherein the program, when run, executes the above processing method for high-throughput sequencing data.

As can be seen from the above description of the embodiments, the described device is merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.

If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.

Those skilled in the art will clearly understand that the application can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The embodiments in this specification are described progressively; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.

The advantageous effects of the application are further illustrated below with reference to specific embodiments.

Embodiment 1: Detecting mutation information in targeted sequencing products of the human genome

Using the method and device of the application, 33-gene sequencing data obtained by targeted sequencing of 216 samples were analyzed. The samples comprised 170 cancer patient samples (whole blood, FFPE, pleural effusion, fresh tissue), whole blood samples from 32 healthy voluntary blood donors, and 14 quality-control samples (external quality assessment samples and Horizon reference standards). Sequencing was performed on the Ion PGM platform, yielding 216 sample BAM sequencing files.

For one cancer patient sample (FFPE), one Horizon reference standard, and one healthy individual, the specific results detected by the present invention are compared with the specific detection results of Torrent Suite, as shown in Tables 1 and 2 below. Table 1 shows the results for sample 1 (cancer patient, FFPE) detected using the method of the application.

Table 1:

*: Yes (~2%): "Yes" indicates agreement with the actual mutation result; the percentage in parentheses is the actual mutation frequency. This shows that the method of the application can detect low-frequency mutations, and that the detection results are accurate and agree with the actual mutation results.

Table 2 shows the results for sample 1 (cancer patient, FFPE) detected using Torrent Suite.

Table 2:

Note: "Yes" has the same meaning as in Table 1. Synonymous: indicates a synonymous mutation. Because the clinical significance of synonymous mutations for guiding medication is currently unclear, such results are not explicitly listed in Table 1, although the method of the application can also detect them.

Comparing Tables 1 and 2 for sample 1 shows that the method of the application not only reports linked mutations in linked form, making the mutation results more accurate, but is also more sensitive than the existing method, accurately detecting low-frequency mutations below 5%.

Table 3 below shows the results for sample 2 (Horizon reference standard) detected using the method of the application.

Table 3:

Table 4 below shows the results for sample 2 (Horizon reference standard) detected using Torrent Suite.

Table 4:

Comparing Tables 3 and 4 shows that, when sample 2 is processed with the method of the application rather than the prior art, the results not only report linked mutations accurately but also exclude false-positive mutations.

Table 5 below shows the results for sample 3 (healthy individual, whole blood) detected using the method of the application.

Table 5:

Table 6 below shows the results for sample 3 (healthy individual, whole blood) detected using Torrent Suite.

Table 6:

Comparing Tables 5 and 6 shows that, when sample 3 is processed with the method of the application rather than the prior art, the results exclude false-positive mutations.

The hotspot-site results of the 216 samples were tallied; the detection results of the method of the application are compared with those of Torrent Suite in Table 7 below.

Table 7:

As can be seen from the above description, conventional mutation detection methods for multiplex amplicon sequencing in the prior art align against the whole genome as the reference sequence and therefore run slowly. In addition, for insertions and deletions, differences in alignment position may cause the results to differ from those in databases, so that they cannot be directly associated with the biological significance recorded in existing databases. By comparison, the processing method and processing device for sequencing data of the application have the following advantages:

1) High efficiency. Primers are identified from specific sequences rather than by alignment, so the primer corresponding to each sequencing sequence can be identified quickly and efficiently, which greatly saves computing resources.

2) Accuracy. The application calibrates mutation results against known mutations and uses the local sequence of each mutation during calibration, which effectively avoids the interference that multiple possible local alignments cause for mutation detection and makes subsequent association with various databases convenient.

It should also be noted that the application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they may each be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A processing method for high-throughput sequencing data, characterized in that the processing method comprises:

obtaining secondary sequencing sequences, the secondary sequencing sequences being sequencing sequences in the high-throughput sequencing data that can be identified by amplification primers of target fragments and from which the corresponding amplification primers have been removed;

comparing the secondary sequencing sequences with a reference genome sequence to obtain a primary variation result; and

correcting the primary variation result using the mutation data in known mutation data to obtain a processing result.

2. The processing method according to claim 1, characterized in that the step of obtaining the secondary sequencing sequences comprises:

filtering out low-quality sequencing data from the raw high-throughput sequencing data as output by the sequencer to obtain primary sequencing sequences, the low-quality sequencing data being sequencing sequences whose Q20 is less than 80% or whose proportion of N bases is greater than 10%;

identifying the primary sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and

removing the corresponding amplification primers from the identified sequences to obtain the secondary sequencing sequences.
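For illustration only, the low-quality criterion in the filtering step can be written as a short predicate. The sketch assumes that Q20 means the fraction of bases with a Phred quality score of at least 20 and that qualities arrive as a list of integers; the function name and parameter names are hypothetical.

```python
def is_low_quality(seq, quals, min_q20=0.80, max_n=0.10):
    """Return True when a read should be filtered out: fewer than 80% of its
    bases reach Phred 20, or more than 10% of its bases are N."""
    if not seq or not quals:
        return True  # treat empty reads as unusable
    q20 = sum(q >= 20 for q in quals) / len(quals)
    n_ratio = seq.upper().count("N") / len(seq)
    return q20 < min_q20 or n_ratio > max_n
```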

3. The processing method according to claim 1, characterized in that the step of comparing the secondary sequencing sequences with the reference genome sequence to obtain the primary variation result comprises:

intercepting, from the reference genome sequence and according to the position information of the amplification primers of the target fragments, reference alignment sequences of the corresponding target fragments; and

comparing the secondary sequencing sequences with the reference alignment sequences to obtain the primary variation result.

4. The processing method according to claim 3, characterized in that, after the secondary sequencing sequences are compared with the reference alignment sequences and before the primary variation result is obtained, the processing method further comprises:

comparing the secondary sequencing sequences with the reference alignment sequences to obtain aligned sequences;

determining, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence inconsistent with the information of the reference alignment sequences; and

if abnormal sequences exist, filtering them out of the aligned sequences and counting, at each position, the agreements and differences between the remaining sequences and the reference alignment sequences to obtain the primary variation result.

5. The processing method according to any one of claims 1 to 4, characterized in that the step of correcting the primary variation result using the mutation data in the known mutation data to obtain the processing result comprises:

screening the known mutation data for the known mutation data within the regions corresponding to the target fragments to obtain local sequences of known mutations;

screening the primary variation result for variant sites that are also present in the known mutation data to form local sequences of the primary variation result; and

comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.

6. The processing method according to claim 5, characterized in that the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result comprises:

comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain a secondary variation result; and

correcting the secondary variation result to obtain the processing result;

wherein the step of correcting the secondary variation result comprises:

determining whether neighboring mutation sites exist in the secondary variation result and, if so, determining whether the variation frequencies of the neighboring mutation sites differ significantly and whether supporting sequences exist; if there is no significant difference and supporting sequences exist, merging the neighboring mutation sites to obtain the processing result.

7. The processing method according to claim 2, characterized in that the step of identifying the primary sequencing sequences using the amplification primers of the target fragments to obtain the identified sequences comprises:

Step A: looping over all amplification primers of the target fragments and, starting from the 5' end of each amplification primer, intercepting specific sequences of length L, and recording for each pair of amplification primers the number of specific sequences, the corresponding specific sequences, and the length of the primer sequence remaining after each specific sequence;

Step B: varying the length L and repeating Step A to obtain sets containing different numbers of specific sequences for all amplification primers, and selecting the length L and the specific sequence set corresponding to the set with the most specific sequences for subsequent analysis;

Step C: looping over every sequence in the primary sequencing sequences, taking the first 25 to 35 bp of each sequence and, starting from the 5' end, intercepting subsequences of the length L corresponding to the set with the most specific sequences, to obtain a set of intercepted sequencing sequences; and

Step D: finding, among the amplification primers corresponding to the specific sequences in the set with the most specific sequences, the amplification primer that occurs most often in the set of intercepted sequencing sequences and its number of occurrences; when that number of occurrences exceeds a set second threshold, the primary sequencing sequence is considered to have been amplified by the most frequent amplification primer and is recorded as an identified sequence.

8. The processing method according to claim 7, characterized in that the step of removing the corresponding amplification primers from the identified sequences to obtain the secondary sequencing sequences comprises:

removing the amplification primer from the identified sequence according to the position at which a specific sequence of the most frequent amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, to obtain the secondary sequencing sequence.

9. The processing method according to claim 5, characterized in that the step of screening the known mutation data for the known mutation data within the regions corresponding to the target fragments to obtain the local sequences of known mutations comprises:

screening the known mutation data for the regions corresponding to the target fragments to form a known mutation region;

recording the start position and end position of each known mutation in the known mutation region, extending from the start position and end position toward both ends, and then recording the extended start position and end position; and

when the extended start position and end position lie within the region corresponding to the target fragment, taking the sequence between the extended start position and the extended end position as the local sequence of the known mutation;

when the extended start position and/or end position falls outside the region corresponding to the target fragment, using the boundary of the region corresponding to the target fragment as the start position and/or end position of the local sequence of the known mutation.

10. The processing method according to claim 9, characterized in that the step of screening the primary variation result for variant sites that are also present in the known mutation data to form the local sequences of the primary variation result comprises:

screening the primary variation result for variant sites that are also present in the known mutation data, recording the start position and end position of each variant site, and extending from the start position and end position toward both ends up to the positions corresponding to the local sequence of the known mutation, the resulting sequence being the local sequence of the primary variation result.

11. The processing method according to claim 6, characterized in that the step of comparing the local sequences of the primary variation result with the local sequences of the known mutations to obtain the secondary variation result comprises:

looking up whether any primary variation result falls within the local sequence of the known mutations corresponding to each target fragment;

if there is one primary variation result, extending from the start position and end position of that variation result toward both ends to form one sample mutation local sequence;

if there are multiple primary variation results, determining whether the variation frequencies of the multiple primary variation results differ significantly; if all of them differ significantly, extending from the start position and end position of each primary variation result toward both ends to form its own sample mutation local sequence; if some primary variation results show no significant difference, tentatively judging those primary variation results to be linked and merging them into a single sample mutation local sequence, while the remaining primary variation results that do differ significantly each form their own sample mutation local sequence;

determining whether each sample mutation local sequence is identical to the local sequence of a known mutation; if identical, calibrating the primary variation result to the known mutation result; if not, leaving it uncalibrated; and

merging the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites to obtain the secondary variation result.

12. The processing method according to claim 11, characterized in that the step of correcting the secondary variation result comprises a step of determining whether the linkage of multiple primary variation results is a false positive;

wherein the step of determining whether the linkage of multiple primary variation results is a false positive comprises:

extracting the sequences that cover multiple variation results simultaneously, and counting the proportion of secondary sequencing sequences that support all of the variation results simultaneously;

if that proportion does not differ significantly from the proportion of sequences supporting each individual variation result, confirming that the multiple primary variation results occur in linkage and recalculating the mutation frequency as a linked mutation, a corrected mutation result being obtained when the recalculated mutation frequency meets a third threshold;

if that proportion differs significantly from the proportion of sequences supporting each individual variation result, confirming that the linkage of the multiple primary variation results is a false positive, splitting the merged variation results and recalculating their mutation frequencies, a corrected mutation result being obtained when the recalculated mutation frequency meets the third threshold; and

merging the corrected mutation results with the uncorrected mutation results to obtain the processing result.

13. A processing device for high-throughput sequencing data, characterized in that the processing device comprises:

a secondary sequencing sequence acquiring unit, configured to obtain secondary sequencing sequences, the secondary sequencing sequences being sequencing sequences in the high-throughput sequencing data that can be identified by amplification primers of target fragments and from which the corresponding amplification primers have been removed;

a primary variation result acquiring unit, configured to compare the secondary sequencing sequences with a reference genome sequence to obtain a primary variation result; and

a correction unit, configured to correct the primary variation result using the mutation data in known mutation data to obtain a processing result.

14. The processing device according to claim 13, characterized in that the secondary sequencing sequence acquiring unit comprises:

a filtering module, configured to filter out low-quality sequencing data from the raw high-throughput sequencing data as output by the sequencer to obtain primary sequencing sequences, the low-quality sequencing data being sequencing sequences whose Q20 is less than 80% or whose proportion of N bases is greater than 10%;

an identification module, configured to identify the primary sequencing sequences using the amplification primers of the target fragments to obtain identified sequences; and

a removal module, configured to remove the corresponding amplification primers from the identified sequences to obtain the secondary sequencing sequences.

15. The processing device according to claim 13, characterized in that the primary variation result acquiring unit comprises:

an interception module, configured to intercept, from the reference genome sequence and according to the position information of the amplification primers of the target fragments, reference alignment sequences of the corresponding target fragments; and

a first comparison module, configured to compare the secondary sequencing sequences with the reference alignment sequences to obtain the primary variation result.

16. The processing device according to claim 15, characterized in that the first comparison module, for use after the secondary sequencing sequences are compared with the reference alignment sequences and before the primary variation result is obtained, further comprises:

a first comparison submodule, configured to compare the secondary sequencing sequences with the reference alignment sequences to obtain aligned sequences;

a judging submodule, configured to determine, according to the position information of the amplification primers, whether abnormal sequences exist among the aligned sequences, an abnormal sequence being a sequence whose alignment quality is below a first threshold or a sequence inconsistent with the information of the reference alignment sequences; and

a filtering submodule, configured, when the judging submodule determines that abnormal sequences exist, to filter them out of the aligned sequences and to count, at each position, the agreements and differences between the remaining sequences and the reference alignment sequences to obtain the primary variation result.

17. The processing device according to any one of claims 13 to 16, characterized in that the correction unit comprises:

a known mutation local sequence module, configured to screen the known mutation data for the known mutation data within the regions corresponding to the target fragments to obtain local sequences of known mutations;

a primary variation result local sequence module, configured to screen the primary variation result for variant sites that are also present in the known mutation data to form local sequences of the primary variation result; and

a second comparison module, configured to compare the local sequences of the primary variation result with the local sequences of the known mutations to obtain the processing result.

18. The processing unit according to claim 17, characterized in that the second comparing module comprises:

a second comparing submodule, configured to compare the local sequence of the primary variation result with the local sequence of the known mutations, obtaining a secondary variation result; and

a correction submodule, configured to correct the secondary variation result, obtaining the processing result;

wherein the correction submodule corrects the secondary variation result as follows:

judging whether neighboring mutation sites exist in the secondary variation result; if so, judging whether the variant frequencies of the neighboring mutation sites differ significantly and/or whether supporting sequences exist; and, if there is no significant difference and/or supporting sequences exist, merging the neighboring mutation sites to obtain the processing result.
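
The merging step of claim 18 can be sketched as follows. This is only an illustration under stated assumptions: variants are restricted to substitutions given as `(pos, ref, alt, freq)` tuples, "no significant difference" is approximated by a fixed tolerance `max_freq_diff` rather than a statistical test, supporting sequences are represented by a hypothetical set `co_supported` of position pairs observed together on at least one read, both conditions are required (the claim allows "and/or"), and the merged frequency is taken as the mean of the two. None of these choices comes from the claim itself.

```python
def merge_adjacent(variants, co_supported, max_freq_diff=0.05):
    """Merge directly neighboring substitutions that look linked.

    variants: list of (pos, ref, alt, freq) tuples.
    co_supported: set of (left_pos, right_pos) pairs for which at least
        one read carries both variants (hypothetical input).
    """
    merged = []
    for pos, ref, alt, freq in sorted(variants):
        if merged:
            p_pos, p_ref, p_alt, p_freq = merged[-1]
            p_end = p_pos + len(p_ref) - 1  # last base of the previous call
            if (pos == p_end + 1
                    and abs(freq - p_freq) <= max_freq_diff
                    and (p_end, pos) in co_supported):
                # Treat the two calls as a single multi-base variant.
                merged[-1] = (p_pos, p_ref + ref, p_alt + alt,
                              (p_freq + freq) / 2)
                continue
        merged.append((pos, ref, alt, freq))
    return merged
```

With calls at positions 10 and 11 at frequencies 0.30 and 0.31 and a read carrying both, the function returns one two-base substitution; a distant call at position 20 is left unchanged.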

19. The processing unit according to claim 14, characterized in that the identification module comprises:

an amplification primer specific sequence first submodule, configured to loop over all amplification primers of the target fragment and, starting from the 5' end of each amplification primer, intercept specific sequences of length L, recording for each pair of amplification primers the number of specific sequences, the corresponding specific sequences, and the length of the primer sequence remaining after each specific sequence;

an amplification primer specific sequence second submodule, configured to change the length L and repeat the step of the amplification primer specific sequence first submodule, obtaining sets that contain different numbers of specific sequences for all amplification primers, and to select the length L and the specific sequence set corresponding to the set with the largest number of specific sequences for subsequent analysis;

a sequencing sequence interception submodule, configured to loop over every sequence in the primary sequencing sequence, take the first 25 to 35 bp of each sequence and, starting from the 5' end, intercept sequences of the length L corresponding to the set with the largest number of specific sequences, obtaining a sequencing interception sequence set; and

a lookup submodule, configured to find, among the amplification primers corresponding to the specific sequences in the set with the largest number of specific sequences, the amplification primer that occurs most often in the sequencing interception sequence set, together with its number of occurrences; when that number exceeds a set second threshold, the primary sequencing sequence is considered to have been amplified by the most frequently occurring amplification primer and is recorded as an identified sequence.
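
One possible reading of claim 19 is a k-mer lookup, sketched below in Python. The interpretation is an assumption: a "specific sequence" is taken to be any length-L substring that occurs in exactly one primer, and a read is assigned to the primer whose specific sequences are hit most often within the read prefix. The function names, the 30 bp prefix (chosen from the claimed 25 to 35 bp range) and the `min_hits` default are illustrative, not taken from the claim.

```python
from collections import Counter, defaultdict


def specific_kmers(primers, L):
    """Map each length-L substring found in exactly one primer to
    (primer name, number of primer bases remaining after the substring)."""
    owners = defaultdict(set)
    for name, seq in primers.items():
        for i in range(len(seq) - L + 1):
            owners[seq[i:i + L]].add(name)
    index = {}
    for name, seq in primers.items():
        for i in range(len(seq) - L + 1):
            kmer = seq[i:i + L]
            if owners[kmer] == {name}:
                index[kmer] = (name, len(seq) - (i + L))
    return index


def best_length(primers, lengths):
    """Pick the L that yields the largest set of specific sequences."""
    return max(lengths, key=lambda L: len(specific_kmers(primers, L)))


def identify(read, index, L, prefix_len=30, min_hits=3):
    """Return the primer name if its hit count exceeds min_hits, else None."""
    prefix = read[:prefix_len]
    hits = Counter()
    for i in range(len(prefix) - L + 1):
        owner = index.get(prefix[i:i + L])
        if owner is not None:
            hits[owner[0]] += 1
    if not hits:
        return None
    name, count = hits.most_common(1)[0]
    return name if count > min_hits else None
```

Because several overlapping substrings vote for the same primer, a single sequencing error inside the primer removes only some of the hits, so the read can still be identified.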

20. The processing unit according to claim 19, characterized in that the removal module comprises:

a removal submodule, configured to remove the amplification primer from the identified sequence according to the position at which a specific sequence of the most frequently occurring amplification primer last occurs in the identified sequence and the length of the amplification primer sequence remaining after that position, obtaining the secondary sequencing sequence.
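
The trimming rule of claim 20 amounts to computing a cut point from the last matching specific sequence. A minimal sketch, assuming a hypothetical dictionary `kmer_remaining` that maps each specific sequence of the winning primer to the number of primer bases that follow it, and assuming the same 30 bp search prefix as an illustrative default:

```python
def trim_primer(read, kmer_remaining, L, prefix_len=30):
    """Remove the primer from the 5' end of an identified read.

    kmer_remaining: {specific sequence: primer bases remaining after it}
        for the primer the read was assigned to (hypothetical input).
    """
    prefix = read[:prefix_len]
    last = None
    for i in range(len(prefix) - L + 1):
        if prefix[i:i + L] in kmer_remaining:
            last = i  # keep overwriting so the final value is the last hit
    if last is None:
        return read  # no specific sequence found; leave the read untouched
    # Cut after the last hit plus whatever primer sequence follows it.
    cut = last + L + kmer_remaining[prefix[last:last + L]]
    return read[cut:]
```

For the primer `ACGTTGCA` with L = 4, a read whose final primer base was mis-sequenced (`ACGTTGCT...`) still matches `TTGC` at offset 3 with one primer base remaining, so the cut point is again 8 and the full primer is removed.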

21. The processing unit according to claim 17, characterized in that the known-mutation local sequence module comprises:

a first screening submodule, configured to screen the known mutation data for the region corresponding to the target fragment, forming a known mutation region;

a first recording submodule, configured to record the start position and end position of each known mutation in the known mutation region, extend outward from the start position and end position toward both ends, and then record the extended start position and end position;

a first known mutation sequence generating module, configured to, when the extended start position and end position lie within the region corresponding to the target fragment, record the sequence between the extended start position and the extended end position as the local sequence of the known mutation; and

a second known mutation sequence generating module, configured to, when the extended start position and/or end position lies beyond the region corresponding to the target fragment, use the boundary of that region as the start position and/or end position of the local sequence of the known mutation.
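
The extension and boundary handling in claim 21 reduce to extending an interval and clamping it to the target fragment region. A short sketch, in which the function name and the `flank` extension length are assumptions (the claim does not state how far the positions are extended):

```python
def local_window(mut_start, mut_end, region_start, region_end, flank=10):
    """Extend a mutation interval by `flank` bases on each side and
    clamp the result to the target fragment region (inclusive bounds)."""
    start = max(mut_start - flank, region_start)
    end = min(mut_end + flank, region_end)
    return start, end
```

For a target region 100 to 200, a mutation at 150 gives the window (140, 160), while a mutation at 105 to 107 gives (100, 117) because the left side is cut off at the region boundary.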

22. The processing unit according to claim 17, characterized in that the primary variation result local sequence module comprises:

a second screening submodule, configured to screen the primary variation result for variant sites that are also present in the known mutation data; and

a second recording submodule, configured to record the start position and end position of each variant site and extend outward from the start position and end position toward both ends, up to the positions corresponding to the local sequence of the known mutations, the result being the local sequence of the primary variation result.

23. The processing unit according to claim 18, characterized in that the second comparing submodule comprises:

a first lookup subcomponent, configured to look up whether the primary variation result is present within the local sequence of the known mutations corresponding to each target fragment;

a first sample mutation local sequence generating component, configured to, when the first lookup subcomponent finds one primary variation result, take the start position and end position of that variation result, extend outward from them toward both ends, and form one sample mutation local sequence;

a second sample mutation local sequence generating component, configured to, when the first lookup subcomponent finds multiple primary variation results, judge whether the variant frequencies of the multiple primary variation results differ significantly; if all of them differ significantly, take the start position and end position of each primary variation result, extend outward from them toward both ends, and form a separate sample mutation local sequence for each; if some primary variation results show no significant difference, preliminarily judge those primary variation results to be linked and merge them into a single sample mutation local sequence, while each remaining primary variation result that does differ significantly forms its own sample mutation local sequence;

a calibration subcomponent, configured to judge whether each sample mutation local sequence is identical to the local sequence of the known mutations; if identical, the primary variation result is calibrated to the known mutation result; if not, no calibration is performed; and

a first merging subcomponent, configured to merge the mutation sites calibrated to known mutation results with the remaining uncalibrated mutation sites, obtaining the secondary variation result.
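
The calibration step of claim 23 compares local sequences rather than variant coordinates, which lets two different descriptions of the same change be recognized as equal. The sketch below illustrates this under assumptions: variants are `(pos, ref, alt)` tuples with 0-based positions, the window is a half-open `(start, end)` pair, and the helper names are invented for illustration.

```python
def apply_variant(reference, window, variant):
    """Return the window of `reference` with the variant applied."""
    start, end = window
    pos, ref, alt = variant
    if reference[pos:pos + len(ref)] != ref:
        raise ValueError("variant does not match the reference")
    return reference[start:pos] + alt + reference[pos + len(ref):end]


def calibrate(sample_variant, known_variant, reference, window):
    """Report the known mutation if both local sequences are identical,
    otherwise keep the sample call unchanged."""
    sample_local = apply_variant(reference, window, sample_variant)
    known_local = apply_variant(reference, window, known_variant)
    return known_variant if sample_local == known_local else sample_variant
```

For example, deleting the first or the last `A` of a run `AAA` yields the same local sequence, so a sample call written at one position is calibrated to the known mutation recorded at the other.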

24. The processing unit according to claim 18, characterized in that the correction submodule comprises a linkage false-positive judging subcomponent, and the linkage false-positive judging subcomponent comprises:

an extraction and statistics subcomponent, configured to extract the sequences that simultaneously cover multiple variation results and compute the proportion of secondary sequencing sequences that support all of those variation results simultaneously;

a linkage confirmation subcomponent, configured to, when the proportion of secondary sequencing sequences supporting all of the variation results simultaneously does not differ significantly from the proportion of sequences supporting each individual variation result, confirm that the multiple primary variation results occur in linkage, recalculate the mutation frequency as a linked mutation, and, when the recalculated mutation frequency meets a third threshold, obtain a corrected mutation result;

a false-positive confirmation subcomponent, configured to, when the proportion of secondary sequencing sequences supporting all of the variation results simultaneously differs significantly from the proportion of sequences supporting each individual variation result, confirm that the linkage of the multiple primary variation results is a false positive, split the merged variation results and recalculate the mutation frequencies, and, when a recalculated mutation frequency meets the third threshold, obtain a corrected mutation result; and

a second merging subcomponent, configured to merge the corrected mutation results with the uncorrected mutation results, obtaining the processing result.
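
The linkage check of claim 24 can be sketched by comparing the joint support with the per-variant support among reads that cover all sites. Everything specific in the sketch is an assumption: each covering read is represented as the set of variant labels it carries, "significant difference" is approximated by a fixed tolerance `tol` instead of a statistical test, and `min_freq` stands in for the third threshold.

```python
def resolve_linkage(read_alleles, sites, tol=0.05, min_freq=0.01):
    """Decide whether variants at `sites` are linked or a false linkage.

    read_alleles: one set per read covering all sites, holding the labels
        of the variants that read supports (hypothetical representation).
    Returns a list of (tuple_of_sites, frequency) calls.
    """
    n = len(read_alleles)
    if n == 0:
        return []
    # Proportion of reads supporting every variant at once.
    joint = sum(1 for r in read_alleles if all(s in r for s in sites)) / n
    # Proportion of reads supporting each variant on its own.
    single = {s: sum(1 for r in read_alleles if s in r) / n for s in sites}
    linked = all(abs(f - joint) <= tol for f in single.values())
    if linked:
        calls = [(tuple(sites), joint)]              # one linked mutation
    else:
        calls = [((s,), f) for s, f in single.items()]  # split the calls
    return [c for c in calls if c[1] >= min_freq]
```

If 10 of 100 covering reads carry both variants and no read carries only one, the result is a single linked call at frequency 0.1; if another 30 reads carry only the first variant, the calls are split into separate frequencies of 0.4 and 0.1.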

25. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the processing method according to any one of claims 1 to 12.

26. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the processing method according to any one of claims 1 to 12.