CN106960135B

CN106960135B - Automatic analysis system and method for targeted-gene second-generation sequencing data

Info

Publication number: CN106960135B
Application number: CN201710160731.6A
Authority: CN
Inventors: 孟鑫; 祝鹏飞; 彭建龙; 戴珩
Original assignee: Plain (shanghai) Biotechnology Co Ltd
Current assignee: Shanghai Xuzhenda Biotechnology Co ltd
Priority date: 2017-03-17
Filing date: 2017-03-17
Publication date: 2018-06-26
Anticipated expiration: 2037-03-17
Also published as: CN106960135A

Abstract

The invention discloses an automatic analysis system for targeted-gene second-generation sequencing data, comprising: a storage unit for data to be analyzed, which stores the data to be analyzed and, when it holds data, passes control to an analysis mode determination unit; the analysis mode determination unit, which determines the mode in which the data are analyzed and routes them to either a cloud data analysis unit or a backup data analysis unit; the cloud data analysis unit, which analyzes the data uploaded to the cloud sample by sample; the backup data analysis unit, which performs data analysis on a local analysis platform; and an analysis result storage unit, which stores the analysis results of data from the cloud data analysis unit and the backup data analysis unit that pass quality testing. The invention addresses the heavy reliance on manual operation in the prior art, which makes it unsuitable for automated analysis of large-scale data, and adopts a strategy of automatic switching between cloud and local computing platforms, which ensures the robustness of the automated analysis.

Description

Automatic analysis system and method for targeted-gene second-generation sequencing data

Technical field

The invention belongs to the field of biomedicine and relates to the detection of single nucleotide variants (SNV, Single Nucleotide Variant) and short insertions and deletions (InDel, Short Insertion-Deletion) in second-generation sequencing data, and specifically to an automatic analysis system and method for SNV and InDel detection in targeted-gene second-generation sequencing data.

Background art

The core idea of second-generation gene sequencing technology is sequencing by synthesis. The four deoxyribonucleotides A, T, C, and G are each labeled with a fluorophore of a different color. While the complementary strand of the gene template is synthesized by polymerase chain reaction (PCR, Polymerase Chain Reaction), deoxyribonucleotides are added one after another to the end of the complementary strand, and the fluorescence signal captured at the end identifies the type of deoxyribonucleotide added, thereby determining the synthesized gene sequence. Second-generation gene sequencing is high-throughput and can sequence millions of sequences in a single run. Targeted-gene second-generation sequencing uses second-generation sequencing technology to sequence targeted DNA sequences. To capture the target genes, probes complementary to the target genes are first designed and synthesized, and the target gene sequences are captured through complementary binding between the probes and the target sequences. A library is then built from the captured DNA sequences and subjected to second-generation sequencing. The advantage of targeted-gene sequencing is that only the target DNA sequences are sequenced, which reduces cost and improves data utilization.

The analysis of second-generation sequencing data comprises multiple steps. Taking SNV and InDel detection as an example, as shown in Fig. 3, the steps are sequence alignment, sorting of the alignment results, marking of duplicates (Duplication), data quality assessment, InDel realignment, base quality recalibration, SNV and InDel detection, and filtering of the detection results. Analyses of second-generation sequencing data are mostly implemented in one of two ways: 1) all steps are integrated into one pipeline, and a single analysis task is submitted from the command line to complete the analysis; 2) an analysis task is submitted separately for each analysis step, and the analysis is completed step by step. The drawback of both approaches is that tasks must be submitted manually, which consumes labor, lengthens the analysis cycle, and puts the stability of the analysis results at risk.

The implementation of one existing class of automatic analysis systems for second-generation sequencing data can be summarized as follows: the user enters parameters and data; the system receives the user input, calls the relevant analysis software and scripts to analyze the input data according to the input parameters, and then outputs the analysis results. For example, in patents CN106021993A, "Tumor exome sequencing analysis system and method", and CN105653893A, "A genome resequencing analysis system and method", the user must first enter the data to be analyzed and the relevant parameters in a Web application unit; a Java interaction unit receives these inputs and starts the relevant analysis scripts to perform the analysis. The implementation in patent CN105550536A, "An exon sequencing data analysis method and system based on a biological cloud platform", is similar, except that the analysis platform is placed in the cloud. All of these patents require user input, perform automated analysis according to that input, and use a fixed analysis workflow. Their advantage is that analysis parameters can be adjusted flexibly; their disadvantage is that they are poorly suited to analyzing large numbers of samples.

The implementation of another class of automatic analysis systems for second-generation sequencing data can be summarized as follows: a project is created, the relevant analysis modules and parameters are selected according to the project requirements, the sequencing data corresponding to the project are analyzed according to the selected modules and parameters, and the analysis results are output. Related patents include CN104484750A, "Automatic matching method and system for product parameters of bioinformatics projects", and CN104484582A, "Automatic analysis method and system for bioinformatics projects implemented through modular selection". Such patents require the user to first create the project and select the analysis modules and other parameters the project needs. Their advantage is that the analysis content can be selected flexibly, which suits the management of analysis projects of different types. Their disadvantage is the large number of manual steps: creating the project, selecting the analysis content for the project, selecting the sequencing samples for the project, and so on.

Because existing automatic analysis systems for second-generation sequencing data involve many manual operations and are unsuitable for automated analysis of large-scale data, there is an urgent need for a system suited to automated analysis of large-scale data that reduces the errors and cost of manual operation. In addition, because some genetic tests have strict turnaround requirements, there is an urgent need for a stable and fast analysis system.

Summary of the invention

The first technical problem to be solved by the invention is to provide an automatic analysis system for targeted-gene second-generation sequencing data that overcomes the drawbacks of the prior art, namely its many manual operations and its unsuitability for automated analysis of large-scale data.

The second technical problem to be solved by the invention is to provide a method of implementing the automatic analysis system for targeted-gene second-generation sequencing data.

To solve the above technical problems, the invention adopts the following technical solutions:

In one aspect of the invention, an automatic analysis system for targeted-gene second-generation sequencing data is provided, comprising a storage unit for data to be analyzed, an analysis mode determination unit, a cloud data analysis unit, a backup data analysis unit, and an analysis result storage unit;

The storage unit for data to be analyzed stores the data to be analyzed; if the unit holds data, control passes to the analysis mode determination unit;

The analysis mode determination unit determines whether the data are analyzed in the cloud or in backup mode, and routes them to the cloud data analysis unit or the backup data analysis unit accordingly;

The cloud data analysis unit analyzes the data uploaded to the cloud, sample by sample;

The backup data analysis unit performs data analysis on a local analysis platform;

The analysis result storage unit stores the analysis results of data from the cloud data analysis unit and the backup data analysis unit that pass quality testing.

As a preferred technical solution of the invention, the analysis mode determination unit is implemented as follows: the detected data to be analyzed are first uploaded to the cloud, sample by sample; if the upload succeeds, the data enter the cloud data analysis unit; if the upload fails, it is retried, up to three times in total; if a retry succeeds, the data enter the cloud data analysis unit; if the upload ultimately fails, the data are copied to the local server and enter the backup data analysis unit.
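By way of illustration only, the routing decision described above might be sketched in Python as follows. The function name `choose_analysis_mode` and the `upload` callable are hypothetical and do not come from the patent; the sketch also assumes one reading of the retry rule, namely one initial attempt followed by up to three retries.

```python
from typing import Callable


def choose_analysis_mode(
    sample_id: str,
    upload: Callable[[str], bool],
    max_retries: int = 3,  # assumption: retries allowed after the first attempt
) -> str:
    """Return "cloud" if any upload attempt succeeds, otherwise "local"."""
    for _ in range(1 + max_retries):
        try:
            if upload(sample_id):
                return "cloud"
        except OSError:
            # A network or I/O error is treated like a failed attempt.
            pass
    # Every attempt failed: the data would be copied to the local server
    # and handed to the backup data analysis unit.
    return "local"
```

Passing the upload routine in as a callable keeps the decision logic independent of any particular cloud provider, which the patent does not name.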

As a preferred technical solution of the invention, the cloud data analysis unit performs data analysis; if the analysis fails, the data are analyzed a second time; if the second analysis also fails, manual error correction and repair are required; if the analysis succeeds, quality control testing is performed on the data.

As a preferred technical solution of the invention, the backup data analysis unit performs data analysis on a local analysis platform; the local analysis platform submits analysis tasks through the resource management software SGE, and different samples are analyzed in parallel; when resources are insufficient, analysis tasks wait in a queue; the data to be analyzed that were copied to the local server are analyzed sample by sample; if the analysis fails, manual error correction and repair are required; if the analysis succeeds, quality control testing is performed on the data.
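As a hedged sketch of per-sample submission to SGE, the following Python builds one `qsub` command line per sample. The script name `run_pipeline.sh`, the `logs` directory, and the job-name prefix are hypothetical; only the standard `qsub` options `-N`, `-cwd`, `-o`, and `-e` are used, and the commands are printed rather than executed.

```python
import shlex
from typing import List


def build_qsub_command(sample_id: str, script: str, log_dir: str) -> List[str]:
    """Build (but do not run) an SGE submission command for one sample."""
    return [
        "qsub",
        "-N", f"analysis_{sample_id}",        # job name
        "-cwd",                               # run in the current working directory
        "-o", f"{log_dir}/{sample_id}.out",   # standard output file
        "-e", f"{log_dir}/{sample_id}.err",   # standard error file
        script,
        sample_id,
    ]


# One independent job per sample; the scheduler queues jobs when
# local resources are insufficient.
for sample in ("sample1", "sample2"):
    print(shlex.join(build_qsub_command(sample, "run_pipeline.sh", "logs")))
```

In a real deployment each command list could be passed to `subprocess.run(cmd, check=True)` on a host where SGE is installed.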

As a preferred technical solution of the invention, the analysis result storage unit stores results sample by sample at a specific location, so that users can query and browse the data.

As a preferred technical solution of the invention, the system further comprises a log recording unit for recording all steps of the data analysis, including data transmission, data analysis, quality testing, and result storage.

As a preferred technical solution of the invention, the log recording unit records all steps of the data analysis; if any step fails, the unit automatically sends an email to a designated mailbox reporting the specific failure information; when all steps succeed, the unit automatically sends an email to the designated mailbox reporting that the sample has completed successfully.
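The notification behaviour described above could be sketched with Python's standard `email` package as follows. The function name, subject format, and addresses are hypothetical; the sketch only builds the message and leaves delivery to the caller.

```python
from email.message import EmailMessage


def build_notification(
    sample_id: str,
    step: str,
    succeeded: bool,
    detail: str,
    recipient: str,
    sender: str = "pipeline@example.org",  # placeholder address
) -> EmailMessage:
    """Build a success or failure notification for one sample and step."""
    status = "SUCCESS" if succeeded else "FAILURE"
    message = EmailMessage()
    message["Subject"] = f"[{status}] sample {sample_id}: {step}"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(detail)  # the specific failure or completion information
    return message
```

The resulting message could then be delivered with `smtplib.SMTP(host).send_message(message)`, where the mail host depends on the deployment.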

In another aspect of the invention, a method of implementing the automatic analysis system for targeted-gene second-generation sequencing data is provided, comprising the following steps:

Step 1: the system automatically checks for data to be analyzed by judging whether the storage unit for data to be analyzed holds data; if it does, control passes to the analysis mode determination unit;

Step 2: the data are uploaded to the cloud; if the upload succeeds, proceed to step 3; if the upload fails, proceed to step 6;

Step 3: the upload having succeeded, the data enter the cloud data analysis unit and cloud data analysis starts;

Step 4: the state of the cloud data analysis is monitored; if the analysis fails, the data analysis task is restarted once;

Step 5: when the cloud analysis completes, proceed to step 8;

Step 6: the upload having failed, the data are copied to the local server and enter the backup data analysis unit, and local data analysis starts;

Step 7: the state of the local data analysis is monitored, then proceed to step 8;

Step 8: quality testing is performed on the data;

Step 9: if the data pass quality testing, they are placed in the analysis result storage unit.
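Steps 2 to 9 can be read as a small control flow. The following Python sketch is one possible rendering, with every operation injected as a callable; all function and parameter names are hypothetical, and the upload retry count assumes one initial attempt plus three retries.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def run_workflow(
    sample_id: str,
    upload: Check,
    cloud_analyze: Check,
    local_analyze: Check,
    quality_check: Check,
    store: Callable[[str], None],
    upload_retries: int = 3,
) -> List[str]:
    """Run steps 2 to 9 for one sample and return a trace of what happened."""
    trace: List[str] = []
    # Step 2: upload, stopping at the first successful attempt.
    uploaded = any(upload(sample_id) for _ in range(1 + upload_retries))
    if uploaded:
        trace.append("cloud")
        # Steps 3 to 5: cloud analysis, restarted once on failure.
        analyzed = cloud_analyze(sample_id) or cloud_analyze(sample_id)
    else:
        trace.append("local")
        # Steps 6 and 7: backup analysis on the local platform.
        analyzed = local_analyze(sample_id)
    if not analyzed:
        trace.append("manual_repair")
        return trace
    # Steps 8 and 9: only data that pass quality testing are stored.
    if quality_check(sample_id):
        store(sample_id)
        trace.append("stored")
    else:
        trace.append("qc_failed")
    return trace
```

Returning a trace rather than a bare status makes each branch of the flow easy to log and to test.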

As a preferred technical solution of the invention, in the upload of data to the cloud in step 2, if the first upload fails, three further attempts are made.

As a preferred technical solution of the invention, in step 3 and step 6, the data analysis comprises the following steps:

1) Sequence alignment: the sequencing data are aligned to the reference genome;

2) Sorting of alignment results: the sequence alignment results are reordered by reference genome coordinate;

3) Marking of duplicates (Duplication): the reads in the alignment results that align to identical positions are marked;

4) Data quality assessment: from the sequence alignment results, information such as the alignment rate, the target-region coverage depth, and the PCR duplication ratio is calculated, from which the user can judge the quality of the sequencing data;

5) InDel realignment: regions in which alignment errors arose because of InDels are realigned;

6) Base quality recalibration: base quality is corrected using a machine learning method to obtain more accurate base quality values;

7) SNV detection and InDel detection: SNV detection and InDel detection are each performed on the processed sequence alignment file;

8) SNV quality filtering and InDel quality filtering: the quality of the detected SNV and InDel sites is assessed, and the sites are marked with different labels.
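To make steps 2) to 4) concrete, the following Python sketch sorts toy alignment records by coordinate, marks reads that share a position and strand as duplicates, and computes the three metrics named in step 4). It is a deliberately simplified illustration, not the behaviour of any real tool: the `Read` record, the fixed read length, and the single target interval are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Read:
    name: str
    chrom: Optional[str]  # None means the read did not align
    pos: int              # 0-based start on the reference
    strand: str
    duplicate: bool = False


def sort_and_mark_duplicates(reads: List[Read]) -> List[Read]:
    """Sort aligned reads by coordinate and flag repeats of the same position."""
    aligned = sorted((r for r in reads if r.chrom is not None),
                     key=lambda r: (r.chrom, r.pos))
    seen = set()
    for read in aligned:
        key = (read.chrom, read.pos, read.strand)
        read.duplicate = key in seen  # the first read at a position is kept
        seen.add(key)
    return aligned


def quality_metrics(reads: List[Read], target: Tuple[str, int, int],
                    read_len: int) -> Dict[str, float]:
    """Alignment rate, duplication ratio and mean depth over one target interval."""
    aligned = [r for r in reads if r.chrom is not None]
    chrom, start, end = target  # half-open interval [start, end)
    covered = 0
    for read in aligned:
        if read.chrom == chrom:
            overlap = min(read.pos + read_len, end) - max(read.pos, start)
            covered += max(overlap, 0)
    n_aligned = len(aligned)
    return {
        "alignment_rate": n_aligned / len(reads) if reads else 0.0,
        "duplication_ratio":
            sum(r.duplicate for r in aligned) / n_aligned if n_aligned else 0.0,
        "mean_target_depth": covered / (end - start),
    }
```

In this sketch duplicates still count toward depth; whether they should is a policy choice that the patent does not specify.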

Compared with the prior art, the beneficial effects of the invention are as follows:

1. Automation

Existing automatic analysis systems for second-generation sequencing data require manual input up front, including the sequencing data, the data analysis modules, and the analysis parameters, before automated analysis can start. By contrast, the present system requires no input operation: it automatically detects the data to be analyzed and starts the analysis. The system is therefore fully automated and needs no human intervention throughout, which saves labor, shortens the analysis cycle, reduces the probability of manual error, and suits batch analysis of large-scale data.

2. Traceable operation steps

Compared with other automatic analysis systems for second-generation sequencing data, the present system includes a log recording unit that records the log file of every operation step for every sample entering the system. When an operation fails, an email alert is sent automatically so that the user can respond promptly. When a sample runs successfully through the automated system, an email is likewise sent to notify the user. The system can therefore trace all operation steps applied to the sequencing data and provides automatic notification.

3. Stability

The stability of the system is reflected in the following: 1) Stable results: all data analysis steps and parameters are consistent, which ensures the stability of the analysis results. 2) Stable function: the system monitors key steps and integrates two analysis platforms, cloud and local; this dual-platform strategy ensures that automated data analysis is carried out stably and quickly. The first monitoring point is the upload of data to the cloud, where multiple upload attempts are made. The second monitoring point is the cloud data analysis, where the analysis is attempted more than once. Furthermore, when data cannot be uploaded to the cloud for whatever reason, the system automatically switches to the local backup analysis platform so that data analysis proceeds normally.

4. Analysis results are easy to manage

The system stores analysis results sample by sample at a specific location, which makes the results easy to retrieve and browse.

5. Suitable for large-scale data

The system adds the ability to detect data to be analyzed automatically and to start the analysis, and is therefore better suited to fully automated analysis and processing of large-scale data. For the sake of platform stability, the system integrates two analysis platforms, a cloud analysis platform and a local analysis platform. The system prefers the cloud platform and can switch automatically to the local analysis platform as circumstances require, which keeps the system running stably. The cloud platform has abundant computing resources, and different samples can be analyzed in parallel, so a large number of sequencing samples can be processed at once. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can also be analyzed in parallel, but the platform is limited by its local computing resources: when resources are insufficient, analysis tasks wait in a queue. For reasons of turnaround time, the system prefers the cloud platform.

Brief description of the drawings

Fig. 1 is an architecture diagram of the automatic analysis system for targeted-gene second-generation sequencing data of the invention;

Fig. 2 is a detailed flowchart of the automated analysis method for targeted-gene second-generation sequencing data of the invention;

Fig. 3 is a data analysis flowchart of the automatic analysis system for targeted-gene second-generation sequencing data of the invention.

Detailed description of the embodiments

The invention is further explained below with reference to specific embodiments. These embodiments serve only to illustrate the invention and do not limit its scope.

As shown in Fig. 1, the automatic analysis system for targeted-gene second-generation sequencing data of the invention comprises the following units:

1. Storage unit for data to be analyzed

This unit stores the data to be analyzed. At predetermined time intervals, the system checks whether the storage unit holds data; if it does, control passes to the analysis mode determination unit.
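A minimal polling sketch of this detection step is shown below. It assumes, purely for illustration, that each sample arrives as its own subdirectory of a watched directory; the function names, the `handle` callback, and the `max_cycles` argument (which exists only so the loop can terminate in a demonstration) are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, List, Optional, Set


def detect_samples(storage_dir: Path) -> List[str]:
    """List sample identifiers, assuming one subdirectory per sample."""
    return sorted(p.name for p in storage_dir.iterdir() if p.is_dir())


def poll(
    storage_dir: Path,
    handle: Callable[[str], None],
    interval_seconds: float = 60.0,
    max_cycles: Optional[int] = None,  # None means poll indefinitely
) -> None:
    """Check the storage directory at a fixed interval and dispatch new samples."""
    seen: Set[str] = set()
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for sample_id in detect_samples(storage_dir):
            if sample_id not in seen:
                seen.add(sample_id)  # dispatch each sample only once
                handle(sample_id)    # e.g. hand over to the mode determination unit
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval_seconds)
```

Tracking dispatched samples in memory is the simplest choice; a production system would more likely persist that state or remove data from the storage unit once it has been handed over.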

2. Analysis mode determination unit

The function of this unit is to determine whether the data are analyzed in the cloud or in backup mode. The detected data to be analyzed are first uploaded to the cloud, sample by sample. If the upload succeeds, the data enter the cloud data analysis unit. If the upload fails, it is retried, up to three times in total; if a retry succeeds, the data enter the cloud data analysis unit. If the upload ultimately fails, the data are copied to the local server and enter the backup data analysis unit.

3. Cloud data analysis unit

The cloud platform has abundant computing resources, and different samples can be analyzed in parallel, so a large number of sequencing samples can be processed at once. The data uploaded to the cloud are analyzed sample by sample. If the analysis fails, the data are analyzed a second time. If the second analysis also fails, manual error correction and repair are required. If the analysis succeeds, quality control testing is performed on the data, and the analysis results of data that pass quality testing are stored in the analysis result storage unit.

The data analysis steps are shown in Fig. 3 and are specifically:

1) Sequence alignment

The sequencing data are aligned to the reference genome; the software used is bwa.

2) Sorting of alignment results

The sequence alignment results are reordered by reference genome coordinate; the software used is Bamsormadup.

3) Marking of duplicates (Duplication)

The reads in the alignment results that align to identical positions are marked.

4) Data quality assessment

From the sequence alignment results, information such as the alignment rate, the target-region coverage depth, and the PCR duplication ratio is calculated. The user can judge the quality of the sequencing data from this information.

5) InDel realignment

Regions in which alignment errors arose because of InDels are realigned; the software used is GATK.

6) Base quality recalibration

Base quality is corrected using a machine learning method to obtain more accurate base quality values; the software used is GATK.

7) SNV detection and InDel detection

SNV detection and InDel detection are each performed on the processed sequence alignment file; the software used is GATK.

8) SNV quality filtering and InDel quality filtering

The quality of the detected SNV and InDel sites is assessed, and the sites are marked with different labels; the software used is GATK.
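Step 8) labels sites rather than discarding them. A toy Python illustration of that idea follows; the `Variant` record, the thresholds, and the label names `LowQual` and `LowDepth` are hypothetical and are not GATK's filters. The only outside convention borrowed is that of the VCF format, where a site passing every filter is marked `PASS` and multiple filter names are joined with semicolons.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float  # site quality score
    depth: int   # read depth at the site


def variant_type(variant: Variant) -> str:
    """Classify a site as SNV (single-base substitution) or InDel."""
    if len(variant.ref) == 1 and len(variant.alt) == 1:
        return "SNV"
    return "InDel"


def label_variant(variant: Variant, min_qual: float = 30.0,
                  min_depth: int = 10) -> str:
    """Return a filter label for a site instead of removing it."""
    labels = []
    if variant.qual < min_qual:
        labels.append("LowQual")
    if variant.depth < min_depth:
        labels.append("LowDepth")
    return ";".join(labels) if labels else "PASS"
```

Keeping low-quality sites with a label, rather than dropping them, lets a downstream reviewer see why a site was flagged.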

4. Backup data analysis unit

The backup data analysis unit is the fallback of the system and performs data analysis on the local analysis platform. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can be analyzed in parallel, but the platform is limited by its local computing resources: when resources are insufficient, analysis tasks wait in a queue. The data to be analyzed that were copied to the local server are analyzed sample by sample, with the same data analysis steps as in the cloud. If the analysis fails, manual error correction and repair are required. If the analysis succeeds, quality control testing is performed on the data, and the analysis results of data that pass quality testing are stored in the analysis result storage unit.

The data analysis steps are shown in Fig. 3 and are specifically:

1) Sequence alignment

The sequencing data are aligned to the reference genome; the software used is bwa.

2) Sorting of alignment results

3) Marking of duplicates (Duplication)

The reads in the alignment results that align to identical positions are marked.

4) Data quality assessment

5) InDel realignment

6) Base quality recalibration

7) SNV detection and InDel detection

8) SNV quality filtering and InDel quality filtering

For the sake of platform stability, the system integrates two analysis platforms, a cloud analysis platform and a local analysis platform. The system prefers the cloud platform and can switch automatically to the local analysis platform as circumstances require, which keeps the system running stably. The cloud platform has abundant computing resources, and different samples can be analyzed in parallel, so a large number of sequencing samples can be processed at once. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can also be analyzed in parallel, but the platform is limited by its local computing resources: when resources are insufficient, analysis tasks wait in a queue. For reasons of turnaround time, the system prefers the cloud platform.

5. Analysis result storage unit

The analysis results of data that pass quality testing are stored at a specific location, so that users can query and browse the data.

6. Log recording unit

This unit records all steps of the data analysis sample by sample, including data transmission, data analysis, quality testing, and result storage. If any step fails, the unit automatically sends an email to a designated mailbox reporting the specific failure information, so that the relevant personnel can respond promptly. When all steps succeed, the unit automatically sends an email to the designated mailbox reporting that the sample has completed successfully. The log recording function was added so that the state of each sample in the automated system can be monitored in real time. Ordinarily, the log files of the individual operation steps are scattered across different servers, which hinders batch management. In the present system, after each operation step of each sample, the corresponding log file and the success or failure status of that step are sent to the log recording unit, and for a failed step the system sends an email alert in real time.
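The centralised bookkeeping described above might be sketched as follows. The `LogUnit` class, its `report` method, and the message wording are hypothetical; the four step names are taken from the description, and the returned string stands in for the email that would be sent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# The four recorded steps named in the description.
STEPS = ("data transmission", "data analysis", "quality testing", "result storage")


@dataclass
class LogUnit:
    """Collects per-sample step records in one place."""
    records: Dict[str, List[Tuple[str, bool, str]]] = field(default_factory=dict)

    def report(self, sample_id: str, step: str, succeeded: bool,
               log_text: str) -> Optional[str]:
        """Record one step; return a notification text when one is due."""
        self.records.setdefault(sample_id, []).append((step, succeeded, log_text))
        if not succeeded:
            # A failed step triggers an immediate alert.
            return f"sample {sample_id} failed at {step}"
        done = {name for name, ok, _ in self.records[sample_id] if ok}
        if done.issuperset(STEPS):
            # Every step has succeeded at least once for this sample.
            return f"sample {sample_id} completed successfully"
        return None
```

Because every step reports to the same object, a failed step that later succeeds on a rerun still leaves its earlier failure in the sample's history.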

As shown in Fig. 2, the method of implementing the automatic analysis system for targeted-gene second-generation sequencing data of the invention comprises the following steps:

1. The system automatically (at predetermined time intervals) checks for data to be analyzed by judging whether the storage unit for data to be analyzed holds data; if it does, control passes to the analysis mode determination unit.

2. The data are uploaded to the cloud; if the first upload fails, three further attempts are made. If the upload succeeds, proceed to step 3; if the upload fails, proceed to step 6.

3. The upload having succeeded, the data enter the cloud data analysis unit and cloud data analysis starts.

4. The state of the cloud data analysis is monitored; if the analysis fails, the data analysis task is restarted once.

5. When the cloud analysis completes, proceed to step 8.

6. The upload having failed, the data are copied to the local server and enter the backup data analysis unit, and local data analysis starts.

7. The state of the local data analysis is monitored, then proceed to step 8.

8. Quality testing is performed on the data.

9. If the data pass quality testing, they are placed in the analysis result storage unit.

Embodiment 1: Automated cloud analysis of targeted-gene sequencing data

1. The targeted-gene second-generation sequencing data of one sample, sample 1, are placed at the designated location as required.

2. The system automatically detects sample 1 as data to be analyzed.

3. The data are uploaded to the cloud.

4. The upload succeeds, and cloud data analysis starts.

5. The state of the cloud data analysis is monitored; the analysis succeeds.

6. The data pass quality testing and are placed in the analysis result storage unit.

Embodiment 2: Automated local analysis of targeted-gene sequencing data

1. The targeted-gene second-generation sequencing data of one sample, sample 2, are placed at the designated location as required.

2. The system automatically detects sample 2 as data to be analyzed.

3. The data are uploaded to the cloud.

4. The upload fails, and the data are copied to the local server.

5. Local data analysis starts.

6. The state of the local data analysis is monitored; the analysis succeeds.

7. The data pass quality testing and are placed in the analysis result storage unit.

Claims

1. An automatic analysis system for targeted-gene second-generation sequencing data, characterized in that it comprises a storage unit for data to be analyzed, an analysis mode determination unit, a cloud data analysis unit, a backup data analysis unit, and an analysis result storage unit;

The storage unit for data to be analyzed stores the data to be analyzed; if the unit holds data, control passes to the analysis mode determination unit;

The analysis mode determination unit determines whether the data are analyzed in the cloud or in backup mode, and routes them to the cloud data analysis unit or the backup data analysis unit accordingly;

The cloud data analysis unit analyzes the data uploaded to the cloud, sample by sample;

The analysis result storage unit stores the analysis results of data from the cloud data analysis unit and the backup data analysis unit that pass quality testing;

The data analysis in the cloud data analysis unit and the backup data analysis unit comprises the following steps:

4) Data quality assessment: from the sequence alignment results, the alignment rate, the target-region coverage depth, and the PCR duplication percentage are calculated, from which the user can judge the quality of the sequencing data;

8) SNV quality filtering and InDel quality filtering: the quality of the detected SNV and InDel sites is assessed, and the sites are marked with different labels.

2. The system as claimed in claim 1, characterized in that the analysis mode determination unit is implemented as follows: the detected data to be analyzed are first uploaded to the cloud, sample by sample; if the first upload succeeds, the sample data stored in the storage unit for data to be analyzed are deleted and the data enter the cloud data analysis unit; if the first upload fails, it is retried, up to three times in total; if a retry succeeds, the data enter the cloud data analysis unit; if the upload ultimately fails, the data are copied to the local server and enter the backup data analysis unit.

3. The system as claimed in claim 1, characterized in that the cloud data analysis unit performs data analysis; if the analysis fails, the data are analyzed a second time; if the second analysis also fails, manual error correction and repair are required; if the analysis succeeds, quality control testing is performed on the data.

4. The system as claimed in claim 1, characterized in that the backup data analysis unit performs data analysis on a local analysis platform; the local analysis platform submits analysis tasks through the resource management software SGE, and different samples are analyzed in parallel; when resources are insufficient, analysis tasks wait in a queue; the data to be analyzed that were copied to the local server are analyzed sample by sample; if the analysis fails, manual error correction and repair are required; if the analysis succeeds, quality control testing is performed on the data.

5. The system as claimed in claim 1, characterized in that the analysis result storage unit stores results sample by sample at a specific location, so that users can query and browse the data.

6. The system as claimed in claim 1, characterized in that it further comprises a log recording unit for recording all steps of the data analysis, including data transmission, data analysis, quality testing, and result storage.

7. The system as claimed in claim 6, characterized in that the log recording unit records all steps of the data analysis; if any step fails, the unit automatically sends an email to a designated mailbox reporting the specific failure information; when all steps succeed, the unit automatically sends an email to the designated mailbox reporting that the sample has completed successfully.

8. A method of implementing an automatic analysis system for targeted-gene second-generation sequencing data, characterized in that it comprises the following steps:

Step 1: the system automatically checks for data to be analyzed by judging whether the storage unit for data to be analyzed holds data; if it does, control passes to the analysis mode determination unit;

Step 2: the data are uploaded to the cloud; if the upload succeeds, proceed to step 3; if the upload fails, proceed to step 6;

Step 5: when the cloud analysis completes, proceed to step 8;

Step 6: the upload having failed, the data are copied to the local server and enter the backup data analysis unit, and local data analysis starts;

Step 7: the state of the local data analysis is monitored, then proceed to step 8;

Step 8: quality testing is performed on the data;

Step 9: if the data pass quality testing, they are placed in the analysis result storage unit;

In step 3 and step 6, the data analysis comprises the following steps:

9. The method as claimed in claim 8, characterized in that, in the upload of data to the cloud in step 2, if the first upload fails, three further attempts are made.