Automated analysis system and method for targeted-gene next-generation sequencing data
Technical field
The invention belongs to the field of biomedicine and relates to the detection of single nucleotide variants (SNV, Single Nucleotide Variant) and short insertions and deletions (InDel, Short Insertion-Deletion) in next-generation (second-generation) sequencing data, and specifically to an automated analysis system and method for SNV and InDel detection in targeted-gene next-generation sequencing data.
Background art
The core concept of next-generation gene sequencing is sequencing by synthesis. The four deoxyribonucleotides (A, T, C and G) are each labeled with a fluorophore of a different color. While the strand complementary to the gene template is synthesized by polymerase chain reaction (PCR), deoxyribonucleotides are added one at a time to the end of the complementary strand. Capturing the fluorescence signal at the end of the strand identifies the type of deoxyribonucleotide added and thereby determines the synthesized gene sequence. Next-generation sequencing is high-throughput: millions of sequences can be sequenced in a single run. Targeted-gene next-generation sequencing uses this technology to sequence targeted DNA sequences. To capture the target genes, probes complementary to the target genes are first designed and synthesized, and the target sequences are captured through complementary binding between the probes and the target sequences. A library is then built from the captured DNA and sequenced. The advantage of targeted sequencing is that only the DNA sequences of interest are sequenced, which reduces cost and improves data utilization.
The analysis of next-generation sequencing data comprises multiple steps. Taking SNV and InDel detection as an example, as shown in Fig. 3, the steps are sequence alignment, sorting of the alignment results, duplicate marking (Duplication), data quality assessment, InDel realignment, base quality recalibration, SNV and InDel detection, and filtering of the detection results. Such analyses are mostly run in one of two ways: 1) all steps are integrated into one pipeline, and a single analysis task submitted from the command line completes the data analysis; 2) an analysis task is submitted separately for each analysis step, and the data analysis is completed step by step. The drawback of both approaches is that tasks must be submitted manually, which consumes labor, lengthens the analysis cycle, and puts the stability of the analysis results at risk.
One class of existing automated analysis systems for next-generation sequencing data works as follows: the user enters attribute parameters and data; the system receives the user input, calls the relevant analysis software and scripts on the input data according to the input parameters, and then outputs the analysis results. For example, in patents CN106021993A, "Tumor exome sequencing analysis system and method", and CN105653893A, "Genome resequencing analysis system and method", the user must first enter the data to be analyzed and the relevant parameters in a Web application unit; a Java interaction unit then receives the user input and launches the relevant analysis scripts to carry out the data analysis. Patent CN105550536A, "Exome sequencing data analysis method and system based on a biological cloud platform", works in a similar way, except that the analysis platform is moved to the cloud. Systems of this class all require user input, perform automated analysis according to that input, and use a fixed analysis workflow. Their advantage is that analysis parameters can be adjusted flexibly; their disadvantage is that they are ill-suited to analyzing large numbers of samples.
Another class of automated analysis systems for next-generation sequencing data works as follows: a project is created, the relevant analysis modules and parameters are selected according to the project requirements, the sequencing data belonging to the project are analyzed with the selected modules and parameters, and the analysis results are then output. Related patents include CN104484750A, "Automatic matching method and system for product parameters of bioinformatics projects", and CN104484582A, "Automatic analysis method and system for bioinformatics projects implemented through modular selection". Systems of this class require the user to first create the project and select the analysis modules and other parameters the project needs. Their advantage is that the analysis content can be selected flexibly, which suits the management of different types of analysis projects. Their disadvantage is the large number of manual steps: creating the project, selecting the analysis content for the project, selecting the sequencing samples for the project, and so on.
Because existing automated analysis systems for next-generation sequencing data involve many manual operations and are not suited to the automated analysis of large-scale data, there is an urgent need for a system that is suited to automated analysis of large-scale data and that reduces manual errors and cost. In addition, because some genetic tests have strict turnaround-time requirements, there is an urgent need for a stable and fast analysis system.
Summary of the invention
The first technical problem to be solved by the present invention is to provide an automated analysis system for targeted-gene next-generation sequencing data that overcomes the drawbacks of the prior art, namely the large number of manual operations and the unsuitability for automated analysis of large-scale data.
The second technical problem to be solved by the present invention is to provide an implementation method based on the automated analysis system for targeted-gene next-generation sequencing data.
To solve the above technical problems, the present invention adopts the following technical solutions:
In one aspect of the invention, an automated analysis system for targeted-gene next-generation sequencing data is provided, comprising a to-be-analyzed data storage unit, an analysis mode decision unit, a cloud data analysis unit, a backup data analysis unit and an analysis result storage unit;
The to-be-analyzed data storage unit is used to store data to be analyzed; if the unit holds data, the data enter the analysis mode decision unit;
The analysis mode decision unit is used to decide whether the data are analyzed in the cloud or in backup mode, and the data accordingly enter the cloud data analysis unit or the backup data analysis unit;
The cloud data analysis unit analyzes the data uploaded to the cloud, sample by sample;
The backup data analysis unit analyzes data on a local analysis platform;
The analysis result storage unit is used to store the analysis results of data from the cloud data analysis unit and the backup data analysis unit that have passed quality control.
As a preferred technical solution of the present invention, the analysis mode decision unit is implemented as follows: the detected data to be analyzed are first uploaded to the cloud, sample by sample; if the upload succeeds, the data enter the cloud data analysis unit; if the upload fails, it may be retried, up to three times in total; if a retry succeeds, the data enter the cloud data analysis unit; if the upload ultimately fails, the data are copied to the local server and enter the backup data analysis unit.
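As a non-limiting illustration, the decision logic of the analysis mode decision unit can be sketched in Python as follows. The function name `choose_analysis_mode`, the `upload` callable and the treatment of `OSError` as a failed attempt are assumptions made for this sketch; the default of three retries after the first attempt is one reading of the text, and the value is left configurable.

```python
from typing import Callable


def choose_analysis_mode(upload: Callable[[], bool], max_retries: int = 3) -> str:
    """Return "cloud" if the upload succeeds, otherwise "backup".

    One initial attempt is made, followed by up to `max_retries` retries.
    """
    for _ in range(1 + max_retries):
        try:
            if upload():
                return "cloud"
        except OSError:
            # Assumption: I/O and network errors count as a failed attempt.
            continue
    # Every attempt failed: the data would be copied to the local server.
    return "backup"


# Hypothetical stand-in uploader that fails twice and then succeeds.
outcomes = iter([False, False, True])
mode = choose_analysis_mode(lambda: next(outcomes))
print(mode)
```

Passing the uploader as a callable keeps the decision logic independent of any particular cloud storage client, which this document does not name.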
As a preferred technical solution of the present invention, the cloud data analysis unit performs the data analysis; if the analysis fails, the data are analyzed a second time; if the second analysis also fails, manual troubleshooting and repair are required; if the analysis succeeds, quality control is performed on the data.
As a preferred technical solution of the present invention, the backup data analysis unit performs data analysis on the local analysis platform. The local analysis platform submits analysis tasks through the resource management software SGE, and different samples are analyzed in parallel; when resources are insufficient, analysis tasks must wait in a queue. The data to be analyzed that have been copied to the local server are analyzed sample by sample; if the analysis fails, manual troubleshooting and repair are required; if the analysis succeeds, quality control is performed on the data.
As a preferred technical solution of the present invention, the analysis result storage unit stores results sample by sample in a designated location, so that users can query and browse the data.
As a preferred technical solution of the present invention, the system further comprises a logging unit for recording every step of the data analysis, including data transfer, data analysis, quality control and result storage.
As a preferred technical solution of the present invention, the logging unit records every step of the data analysis; if any step fails, the unit automatically sends an e-mail to a designated mailbox reporting the specific failure; when all steps succeed, the unit automatically sends an e-mail to the designated mailbox reporting that the sample has completed successfully.
In another aspect of the invention, an implementation method for the automated analysis system for targeted-gene next-generation sequencing data is provided, comprising the following steps:
Step 1, the system automatically detects data to be analyzed by checking whether the to-be-analyzed data storage unit holds data; if it does, the data enter the analysis mode decision unit;
Step 2, the data are uploaded to the cloud; if the upload succeeds, go to step 3; if the upload fails, go to step 6;
Step 3, after a successful upload, the data enter the cloud data analysis unit and cloud data analysis starts;
Step 4, the state of the cloud data analysis is monitored; if the analysis fails, the data analysis task is restarted once;
Step 5, when the cloud analysis is complete, go to step 8;
Step 6, after a failed upload, the data are copied to the local server and enter the backup data analysis unit, and local data analysis starts;
Step 7, the state of the local data analysis is monitored; go to step 8;
Step 8, quality control is performed on the data;
Step 9, if the data pass quality control, they are placed in the analysis result storage unit.
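The branching among steps 1 to 9 can be made explicit with a small sketch that returns the sequence of step numbers visited for a given outcome. The function `trace_steps` and its boolean inputs are illustrative assumptions; the sketch models only the routing, not the analysis itself.

```python
from typing import List


def trace_steps(has_data: bool, upload_ok: bool, qc_ok: bool) -> List[int]:
    """Return the step numbers visited for one sample."""
    if not has_data:
        return [1]                      # nothing detected, nothing to do
    steps = [1, 2]
    # Cloud path (steps 3-5) on a successful upload, local path (6-7) otherwise.
    steps += [3, 4, 5] if upload_ok else [6, 7]
    steps.append(8)                     # quality control on either path
    if qc_ok:
        steps.append(9)                 # store results that passed QC
    return steps


print(trace_steps(has_data=True, upload_ok=True, qc_ok=True))
print(trace_steps(has_data=True, upload_ok=False, qc_ok=True))
```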
As a preferred technical solution of the present invention, in the upload of data to the cloud in step 2, if the first upload fails, the upload is retried three times.
As a preferred technical solution of the present invention, in step 3 and step 6, the data analysis comprises the following steps:
1) Sequence alignment: the sequencing data are aligned to the reference genome;
2) Sorting of alignment results: the alignment results are reordered by reference genome coordinate;
3) Duplicate marking (Duplication): reads in the alignment results that align to the same position are marked;
4) Data quality assessment: from the alignment results, the alignment rate, target region coverage depth, PCR duplication rate and similar information are calculated, from which the user can judge the quality of the sequencing data;
5) InDel realignment: regions where alignment errors arose because of InDels are realigned;
6) Base quality recalibration: base qualities are corrected using a machine learning method to obtain more accurate base quality values;
7) SNV detection and InDel detection: SNVs and InDels are detected separately from the processed alignment file;
8) SNV quality filtering and InDel quality filtering: the quality of the detected SNV and InDel sites is assessed, and the sites are marked with different labels.
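The quantities named in step 4) can be illustrated with a short calculation. The formulas below are common definitions and are assumptions of this sketch; the document does not state how the alignment rate, target region coverage depth or PCR duplication rate are computed, and individual tools may define them differently (for example, the denominator of the duplication rate). All input numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class QcMetrics:
    alignment_rate: float      # mapped reads / total reads
    mean_target_depth: float   # on-target bases / target region size
    duplication_rate: float    # duplicate reads / mapped reads


def compute_qc(total_reads: int, mapped_reads: int, duplicate_reads: int,
               on_target_bases: int, target_size_bp: int) -> QcMetrics:
    """Compute illustrative QC metrics from summary counts."""
    if total_reads <= 0 or target_size_bp <= 0:
        raise ValueError("total_reads and target_size_bp must be positive")
    return QcMetrics(
        alignment_rate=mapped_reads / total_reads,
        mean_target_depth=on_target_bases / target_size_bp,
        duplication_rate=duplicate_reads / mapped_reads if mapped_reads else 0.0,
    )


metrics = compute_qc(total_reads=1_000_000, mapped_reads=980_000,
                     duplicate_reads=98_000, on_target_bases=60_000_000,
                     target_size_bp=200_000)
print(metrics)
```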
Compared with the prior art, the beneficial effects of the present invention are:
1. Automation
Existing automated analysis systems for next-generation sequencing data require manual input up front, including the sequencing data, the data analysis modules and the analysis parameters, before automated data analysis can start. By contrast, this system requires no input: it automatically detects the data to be analyzed and starts the analysis. The system is therefore fully automated, with no human intervention at any stage, which saves labor, shortens the analysis cycle, lowers the probability of manual errors, and suits the batch analysis of large-scale data.
2. Traceable operating steps
Unlike other automated analysis systems for next-generation sequencing data, this system includes a logging unit that records the log file of every operating step for every sample entering the system. When a step fails, an e-mail alert is sent automatically so that the user can deal with it promptly. For samples that run successfully, an e-mail is likewise sent to inform the user. The system can therefore trace every operating step applied to the sequencing data and provides automatic notification.
3. Stability
The stability of the system is reflected in the following respects: 1) Stable results: all data analysis steps and their parameters are kept consistent, which ensures stable analysis results. 2) Stable function: the system monitors the key steps and integrates two analysis platforms, cloud and local; this dual-platform strategy ensures that automated data analysis is carried out stably and quickly. The first monitoring point is the upload of data to the cloud, which is attempted repeatedly. The second monitoring point is the cloud data analysis, which is attempted more than once. Furthermore, when data cannot be uploaded to the cloud for whatever reason, the system automatically switches to the local backup analysis platform, ensuring that the data analysis proceeds normally.
4. Easily managed analysis results
The system stores analysis results sample by sample in a designated location, which makes the results easy to retrieve and browse.
5. Suitability for large-scale data
The system automatically detects data to be analyzed and starts the analysis, which makes it better suited to fully automated processing of large-scale data. For the sake of platform stability, the system integrates two analysis platforms, a cloud analysis platform and a local analysis platform. The system selects the cloud platform first and can automatically switch to the local analysis platform when circumstances require, ensuring stable operation. The cloud platform has abundant computing resources, and different samples can be analyzed in parallel, so a large number of sequencing samples can be processed at once. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can also be analyzed in parallel, but it is limited by the computing resources of the local platform: when resources are insufficient, analysis tasks must wait in a queue. For reasons of turnaround time, the system gives priority to the cloud platform.
Brief description of the drawings
Fig. 1 is an architecture diagram of the automated analysis system for targeted-gene next-generation sequencing data of the present invention;
Fig. 2 is a detailed flowchart of the automated analysis method for targeted-gene next-generation sequencing data of the present invention;
Fig. 3 is a flowchart of the data analysis in the automated analysis system for targeted-gene next-generation sequencing data of the present invention.
Detailed description of the embodiments
The present invention is further illustrated below with reference to specific embodiments. These embodiments are intended only to illustrate the invention and do not limit its scope.
As shown in Fig. 1, the automated analysis system for targeted-gene next-generation sequencing data of the present invention comprises the following units:
1. To-be-analyzed data storage unit
This unit stores the data to be analyzed. At a predetermined time interval, the system checks whether the storage unit holds data; if it does, the data enter the analysis mode decision unit.
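The periodic check can be sketched as a directory scan that reports samples not seen in earlier scans. The layout of one subdirectory per sample and the function name `detect_new_samples` are assumptions of this sketch; the document does not describe how the storage unit is organized.

```python
import os
import tempfile
from typing import List, Set


def detect_new_samples(watch_dir: str, seen: Set[str]) -> List[str]:
    """Return sample directories that appeared since the previous scan."""
    with os.scandir(watch_dir) as entries:
        current = {entry.name for entry in entries if entry.is_dir()}
    new = sorted(current - seen)
    seen.update(new)  # remember them so they are reported only once
    return new


# Demonstration in a temporary directory standing in for the storage unit.
with tempfile.TemporaryDirectory() as watch_dir:
    seen: Set[str] = set()
    os.mkdir(os.path.join(watch_dir, "sample1"))
    first_scan = detect_new_samples(watch_dir, seen)
    second_scan = detect_new_samples(watch_dir, seen)

print(first_scan, second_scan)
```

In a running system this function would be called in a loop separated by `time.sleep(interval)`. A production version would also need a way to tell that a sample has finished copying before it is picked up, which is outside the scope of this sketch.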
2. Analysis mode decision unit
The function of this unit is to decide whether the data are analyzed in the cloud or in backup mode. The detected data to be analyzed are first uploaded to the cloud, sample by sample. If the upload succeeds, the data enter the cloud data analysis unit. If the upload fails, it may be retried, up to three times in total; if a retry succeeds, the data enter the cloud data analysis unit. If the upload ultimately fails, the data are copied to the local server and enter the backup data analysis unit.
3. Cloud data analysis unit
The cloud platform has abundant computing resources, and different samples can be analyzed in parallel, so a large number of sequencing samples can be processed at once. The data uploaded to the cloud are analyzed sample by sample. If the analysis fails, the data are analyzed a second time. If the second analysis also fails, manual troubleshooting and repair are required. If the analysis succeeds, quality control is performed on the data, and the analysis results of data that pass quality control are stored in the analysis result storage unit.
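A minimal sketch of this behavior, assuming each sample is handled as an independent task: samples are analyzed in parallel, a failed analysis is run a second time, and a sample that fails twice is flagged for manual repair. The thread pool stands in for the cloud scheduler, and `fake_run` is a hypothetical stand-in for the real analysis; neither is taken from the document.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def analyze_sample(sample_id: str, run: Callable[[str], bool], max_runs: int = 2) -> str:
    """Run the analysis; by default a failed run is repeated once."""
    for _ in range(max_runs):
        if run(sample_id):
            return "quality_control"
    return "manual_repair"


def analyze_batch(sample_ids: List[str], run: Callable[[str], bool]) -> Dict[str, str]:
    """Analyze samples in parallel, one task per sample."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        outcomes = list(pool.map(lambda s: analyze_sample(s, run), sample_ids))
    return dict(zip(sample_ids, outcomes))


# Stand-in analysis: sample2 always fails, sample3 succeeds on its second run.
run_counts: Dict[str, int] = {}


def fake_run(sample_id: str) -> bool:
    run_counts[sample_id] = run_counts.get(sample_id, 0) + 1
    if sample_id == "sample2":
        return False
    if sample_id == "sample3":
        return run_counts[sample_id] >= 2
    return True


status = analyze_batch(["sample1", "sample2", "sample3"], fake_run)
print(status)
```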
The data analysis steps are shown in Fig. 3 and are as follows:
1) Sequence alignment
The sequencing data are aligned to the reference genome; the software used is bwa.
2) Sorting of alignment results
The alignment results are reordered by reference genome coordinate; the software used is bamsormadup.
3) Duplicate marking (Duplication)
Reads in the alignment results that align to the same position are marked.
4) Data quality assessment
From the alignment results, the alignment rate, target region coverage depth, PCR duplication rate and similar information are calculated. The user can judge the quality of the sequencing data from this information.
5) InDel realignment
Regions where alignment errors arose because of InDels are realigned; the software used is GATK.
6) Base quality recalibration
Base qualities are corrected using a machine learning method to obtain more accurate base quality values; the software used is GATK.
7) SNV detection and InDel detection
SNVs and InDels are detected separately from the processed alignment file; the software used is GATK.
8) SNV quality filtering and InDel quality filtering
The quality of the detected SNV and InDel sites is assessed, and the sites are marked with different labels; the software used is GATK.
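The fixed order of the eight steps can be captured as data and driven by a small runner that stops at the first failure and reports which step failed. The step and tool names are those given above (`None` where no tool is named); the runner, the `execute` callback and the simulated failure are assumptions of this sketch, and no command-line options of bwa, bamsormadup or GATK are implied.

```python
from typing import Callable, List, Optional, Tuple

# (step name, software named in the description or None if not stated)
STEPS: List[Tuple[str, Optional[str]]] = [
    ("sequence alignment", "bwa"),
    ("alignment sorting", "bamsormadup"),
    ("duplicate marking", None),
    ("data quality assessment", None),
    ("InDel realignment", "GATK"),
    ("base quality recalibration", "GATK"),
    ("SNV and InDel detection", "GATK"),
    ("SNV and InDel quality filtering", "GATK"),
]


def run_pipeline(sample_id: str, execute: Callable[[str, str], bool]) -> Optional[str]:
    """Run all steps in order; return the first failed step, or None."""
    for name, _tool in STEPS:
        if not execute(sample_id, name):
            return name
    return None


# Hypothetical executor that records calls and fails at one step.
executed: List[str] = []


def fake_execute(sample_id: str, step: str) -> bool:
    executed.append(step)
    return step != "base quality recalibration"


failed_step = run_pipeline("sample1", fake_execute)
print(failed_step, len(executed))
```

Keeping the step list in one place is one way to guarantee that the cloud and local platforms run identical steps, as the description requires.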
4. Backup data analysis unit
The backup data analysis unit is the fallback of this system and performs data analysis on the local analysis platform. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can be analyzed in parallel, but the platform is limited by its local computing resources: when resources are insufficient, analysis tasks must wait in a queue. The data to be analyzed that have been copied to the local server are analyzed sample by sample, with the same analysis steps as in the cloud. If the analysis fails, manual troubleshooting and repair are required. If the analysis succeeds, quality control is performed on the data, and the analysis results of data that pass quality control are stored in the analysis result storage unit.
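As an illustration of per-sample submission through SGE, the sketch below builds (but does not run) a `qsub` command line for one sample. The options `-N`, `-cwd`, `-o` and `-e` are standard `qsub` options; the script name `run_pipeline.sh`, the job-name prefix and the log file layout are hypothetical.

```python
import shlex
from typing import List


def build_qsub_command(sample_id: str, script: str, log_dir: str) -> List[str]:
    """Build a qsub command that submits one analysis job for one sample."""
    return [
        "qsub",
        "-N", f"ngs_{sample_id}",            # job name
        "-cwd",                              # run in the current working directory
        "-o", f"{log_dir}/{sample_id}.out",  # standard output file
        "-e", f"{log_dir}/{sample_id}.err",  # standard error file
        script,
        sample_id,                           # passed to the script as its argument
    ]


cmd = build_qsub_command("sample2", "run_pipeline.sh", "logs")
print(shlex.join(cmd))
```

On a host with SGE installed, the list could be passed to `subprocess.run(cmd, check=True)`. Submitting one job per sample lets the scheduler run samples in parallel and queue the rest when resources run short.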
The data analysis steps are shown in Fig. 3 and are as follows:
1) Sequence alignment
The sequencing data are aligned to the reference genome; the software used is bwa.
2) Sorting of alignment results
The alignment results are reordered by reference genome coordinate; the software used is bamsormadup.
3) Duplicate marking (Duplication)
Reads in the alignment results that align to the same position are marked.
4) Data quality assessment
From the alignment results, the alignment rate, target region coverage depth, PCR duplication rate and similar information are calculated. The user can judge the quality of the sequencing data from this information.
5) InDel realignment
Regions where alignment errors arose because of InDels are realigned; the software used is GATK.
6) Base quality recalibration
Base qualities are corrected using a machine learning method to obtain more accurate base quality values; the software used is GATK.
7) SNV detection and InDel detection
SNVs and InDels are detected separately from the processed alignment file; the software used is GATK.
8) SNV quality filtering and InDel quality filtering
The quality of the detected SNV and InDel sites is assessed, and the sites are marked with different labels; the software used is GATK.
For the sake of platform stability, the system integrates two analysis platforms, a cloud analysis platform and a local analysis platform. The system selects the cloud platform first and can automatically switch to the local analysis platform when circumstances require, ensuring stable operation. The cloud platform has abundant computing resources, and different samples can be analyzed in parallel, so a large number of sequencing samples can be processed at once. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can also be analyzed in parallel, but it is limited by the computing resources of the local platform: when resources are insufficient, analysis tasks must wait in a queue. For reasons of turnaround time, the system gives priority to the cloud platform.
5. Analysis result storage unit
The analysis results of data that pass quality control are stored in a designated location, so that users can query and browse the data.
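One simple way to realize storage in a designated location is a results directory with one subdirectory per sample, as sketched below. The directory layout, the file name and the function `store_results` are assumptions made for illustration; the document states only that results are stored by sample in a designated location.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def store_results(result_files: Iterable[Path], results_root: Path, sample_id: str) -> Path:
    """Copy the result files of one sample into <results_root>/<sample_id>/."""
    sample_dir = results_root / sample_id
    sample_dir.mkdir(parents=True, exist_ok=True)
    for src in result_files:
        shutil.copy2(src, sample_dir / src.name)
    return sample_dir


# Demonstration with a temporary directory and a placeholder result file.
with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp) / "work"
    work_dir.mkdir()
    vcf = work_dir / "sample1.filtered.vcf"
    vcf.write_text("##fileformat=VCFv4.2\n")
    dest = store_results([vcf], Path(tmp) / "results", "sample1")
    stored_names = sorted(p.name for p in dest.iterdir())
    dest_parts = dest.relative_to(tmp).parts

print(dest_parts, stored_names)
```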
6. Logging unit
This unit records every step of the data analysis, sample by sample, including data transfer, data analysis, quality control and result storage. If any step fails, the unit automatically sends an e-mail to a designated mailbox reporting the specific failure, so that the responsible staff can deal with it promptly. When all steps succeed, the unit automatically sends an e-mail to the designated mailbox reporting that the sample has completed successfully. The logging function was added so that the state of each sample in the automated system can be monitored in real time. Ordinarily, the log files of the individual operating steps are scattered across different servers, which hinders batch management. In this system, after each operating step of each sample, the corresponding log file and the success or failure status of the step are sent to the logging unit, and for a failed step the system sends an e-mail alert in real time.
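The central collection of per-step status can be sketched as follows. The record fields, the class `LogCollector` and the returned notification kinds are hypothetical; the four step names are the ones listed above. Each report returns which notification, if any, would be triggered.

```python
from dataclasses import dataclass
from typing import Dict, List

REQUIRED_STEPS = ["data transfer", "data analysis", "quality control", "result storage"]


@dataclass
class StepRecord:
    sample_id: str
    step: str
    ok: bool
    log_file: str  # path of the step's log file on the reporting server


class LogCollector:
    """Collects step records from all servers in one place."""

    def __init__(self) -> None:
        self.records: Dict[str, List[StepRecord]] = {}

    def report(self, record: StepRecord) -> str:
        """Store a record and return the notification to send: failure, success or none."""
        self.records.setdefault(record.sample_id, []).append(record)
        if not record.ok:
            return "failure"
        done = {r.step for r in self.records[record.sample_id] if r.ok}
        return "success" if done.issuperset(REQUIRED_STEPS) else "none"


collector = LogCollector()
notices = [collector.report(StepRecord("sample1", step, True, f"/logs/sample1/{i}.log"))
           for i, step in enumerate(REQUIRED_STEPS)]
failure = collector.report(StepRecord("sample2", "data transfer", False, "/logs/sample2/0.log"))
print(notices, failure)
```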
As shown in Fig. 2, the implementation method of the automated analysis system for targeted-gene next-generation sequencing data of the present invention comprises the following process steps:
1. The system automatically (at a predetermined time interval) detects data to be analyzed by checking whether the to-be-analyzed data storage unit holds data; if it does, the data enter the analysis mode decision unit.
2. The data are uploaded to the cloud; if the first upload fails, the upload is retried three times. If the upload succeeds, go to step 3; if the upload fails, go to step 6.
3. After a successful upload, the data enter the cloud data analysis unit and cloud data analysis starts.
4. The state of the cloud data analysis is monitored; if the analysis fails, the data analysis task is restarted once.
5. When the cloud analysis is complete, go to step 8.
6. After a failed upload, the data are copied to the local server and enter the backup data analysis unit, and local data analysis starts.
7. The state of the local data analysis is monitored; go to step 8.
8. Quality control is performed on the data.
9. If the data pass quality control, they are placed in the analysis result storage unit.
Embodiment 1: Automated cloud analysis of targeted-gene sequencing data
1. The targeted-gene next-generation sequencing data of one sample (sample 1) are placed in the designated location as required.
2. The system automatically detects sample 1 as data to be analyzed.
3. The data are uploaded to the cloud.
4. The upload succeeds, and cloud data analysis starts.
5. The state of the cloud data analysis is monitored; the analysis succeeds.
6. The data pass quality control and are placed in the analysis result storage unit.
Embodiment 2: Automated local analysis of targeted-gene sequencing data
1. The targeted-gene next-generation sequencing data of one sample (sample 2) are placed in the designated location as required.
2. The system automatically detects sample 2 as data to be analyzed.
3. The data are uploaded to the cloud.
4. The upload fails, and the data are copied to the local server.
5. Local data analysis starts.
6. The state of the local data analysis is monitored; the analysis succeeds.
7. The data pass quality control and are placed in the analysis result storage unit.