CN106960135B - Automatic analysis system and method for targeted-gene second-generation sequencing data - Google Patents

Automatic analysis system and method for targeted-gene second-generation sequencing data

Info

Publication number
CN106960135B
CN106960135B CN201710160731.6A CN201710160731A CN106960135B CN 106960135 B CN106960135 B CN 106960135B CN 201710160731 A CN201710160731 A CN 201710160731A CN 106960135 B CN106960135 B CN 106960135B
Authority
CN
China
Prior art keywords
data
analysis
unit
clouds
analyzed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710160731.6A
Other languages
Chinese (zh)
Other versions
CN106960135A (en)
Inventor
孟鑫
祝鹏飞
彭建龙
戴珩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xuzhenda Biotechnology Co ltd
Original Assignee
Plain (shanghai) Biotechnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Plain (shanghai) Biotechnology Co Ltd filed Critical Plain (shanghai) Biotechnology Co Ltd

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Abstract

The invention discloses an automatic analysis system for targeted-gene second-generation sequencing data, comprising: a storage unit for data to be analyzed, which stores the data to be analyzed and, when it holds data, passes control to the analysis mode decision unit; an analysis mode decision unit, which decides how the data are to be analyzed and routes them to either the cloud data analysis unit or the backup data analysis unit; a cloud data analysis unit, which analyzes the data uploaded to the cloud sample by sample; a backup data analysis unit, which analyzes data on a local analysis platform; and an analysis result storage unit, which stores the analysis results of data from the cloud data analysis unit and the backup data analysis unit that pass the quality check. The invention solves the problem that the prior art involves much manual operation and is unsuitable for automated analysis of large-scale data, and it adopts a strategy of automatic switching between cloud and local computing platforms, which ensures the robustness of the automated analysis.

Description

Automatic analysis system and method for targeted-gene second-generation sequencing data
Technical field
The invention belongs to the field of biomedicine and relates to the detection of single nucleotide variants (SNV, Single Nucleotide Variant) and short insertions and deletions (InDel, Short Insertion-Deletion) in second-generation sequencing data, and specifically to an automatic analysis system and method for SNV and InDel detection in targeted-gene second-generation sequencing data.
Background art
The core idea of second-generation gene sequencing is sequencing by synthesis. The four deoxyribonucleotides A, T, C and G are each labeled with a fluorophore of a different color. While the complementary strand of the gene template is synthesized by polymerase chain reaction (PCR, Polymerase Chain Reaction), deoxyribonucleotides are added one at a time to the end of the complementary strand, and the fluorescence signal captured at that end identifies the type of deoxyribonucleotide added, which determines the synthesized gene sequence. Second-generation sequencing is high-throughput and can sequence millions of sequences in a single run. Targeted-gene second-generation sequencing uses this technology to sequence targeted DNA sequences. To capture the target genes, probes complementary to the target genes are first designed and synthesized, and the target sequences are captured through complementary binding between probe and target. A library is then built from the captured DNA and sequenced. The advantage of targeted sequencing is that only the target DNA sequences are sequenced, which lowers cost and improves data utilization.
The analysis of second-generation sequencing data comprises several steps. Taking SNV and InDel detection as an example, as shown in Fig. 3, the steps are sequence alignment, sorting of the alignment results, duplicate marking (Duplication), data quality assessment, InDel realignment, base quality recalibration, SNV and InDel detection, and filtering of the detection results. Such analyses are mostly implemented in one of two ways: 1) all steps are integrated into a single workflow, and one analysis task submitted from the command line completes the analysis; 2) a separate analysis task is submitted for each step, and the analysis is completed step by step. The drawback of both ways is that tasks must be submitted manually, which consumes labor, lengthens the analysis cycle and puts the stability of the analysis results at risk.
One existing class of automatic analysis systems for second-generation sequencing data works as follows: the user enters parameters and data; the system receives the input, calls the relevant analysis software and scripts on the input data according to the input parameters, and then outputs the analysis results. Examples are patents CN106021993A "tumour sequencing of extron group analysis system and method" and CN105653893A "a kind of genome resurveys sequence analysis system and method", in which the user first enters the data to be analyzed and the relevant parameters in a Web application unit, and a Java interaction unit receives this input and starts the relevant analysis scripts. Patent CN105550536A "a kind of extron sequencing data analysis method based on biological cloud platform and system" works similarly, except that the analysis platform is placed in the cloud. Patents of this class all require user input, run the automated analysis from that input, and use a fixed analysis workflow. Their advantage is that analysis parameters can be adjusted flexibly; their drawback is that they are poorly suited to analyzing large numbers of samples.
A second class of automatic analysis systems for second-generation sequencing data works as follows: a project is created, the analysis modules and parameters are selected according to the project requirements, the sequencing data belonging to the project are analyzed with the selected modules and parameters, and the analysis results are output. Related patents include CN104484750A "automatic of the product parameters of biological information project Method of completing the square and system" and CN104484582A "select the biological information project automatic analysis method realized by modularization and are System". Patents of this class require the user first to create the project and to select the analysis modules and other parameters it needs. Their advantage is that the analysis content can be chosen flexibly, which suits the management of analysis projects of different types. Their drawback is the number of manual steps: creating the project, selecting its analysis content, selecting its sequencing samples, and so on.
Because existing automatic analysis systems for second-generation sequencing data involve much manual operation and are unsuitable for automated analysis of large-scale data, a system suited to automated analysis of large-scale data is urgently needed in order to reduce manual errors and cost. In addition, because some genetic tests have strict turnaround-time requirements, a stable and fast analysis system is urgently needed.
Summary of the invention
The first technical problem to be solved by the invention is to provide an automated analysis system for targeted-gene second-generation sequencing data that overcomes the drawbacks of the prior art, namely the large amount of manual operation and the unsuitability for automated analysis of large-scale data.
The second technical problem to be solved by the invention is to provide a method implemented on the basis of that automated analysis system.
To solve the above technical problems, the invention adopts the following technical solutions:
In one aspect of the invention, an automatic analysis system for targeted-gene second-generation sequencing data is provided, comprising a storage unit for data to be analyzed, an analysis mode decision unit, a cloud data analysis unit, a backup data analysis unit and an analysis result storage unit;
The storage unit for data to be analyzed stores the data to be analyzed; if the unit holds data, control passes to the analysis mode decision unit;
The analysis mode decision unit decides whether the data are analyzed in the cloud or in backup mode, and routes them to the cloud data analysis unit or the backup data analysis unit accordingly;
The cloud data analysis unit analyzes the data uploaded to the cloud, sample by sample;
The backup data analysis unit analyzes data on a local analysis platform;
The analysis result storage unit stores the analysis results of data from the cloud data analysis unit and the backup data analysis unit that pass the quality check.
In a preferred technical solution of the invention, the analysis mode decision unit works as follows: the detected data to be analyzed are first uploaded to the cloud, sample by sample; if the upload succeeds, control passes to the cloud data analysis unit; if the upload fails, up to three further attempts are made; if a further attempt succeeds, control passes to the cloud data analysis unit; if the upload ultimately fails, the data are copied to the local server and control passes to the backup data analysis unit.
In a preferred technical solution of the invention, the cloud data analysis unit performs the data analysis; if the analysis fails, the data are analyzed a second time; if the second analysis also fails, manual troubleshooting and repair are required; if the analysis succeeds, the data undergo quality control checking.
In a preferred technical solution of the invention, the backup data analysis unit performs the data analysis on a local analysis platform, which submits analysis tasks through the resource management software SGE; different samples start their analyses in parallel, and when resources are insufficient, analysis tasks wait in a queue. The data to be analyzed that were copied to the local server are analyzed sample by sample; if the analysis fails, manual troubleshooting and repair are required; if the analysis succeeds, the data undergo quality control checking.
In a preferred technical solution of the invention, the analysis result storage unit stores results sample by sample at a designated location, so that users can query and browse the data.
In a preferred technical solution of the invention, the system further comprises a log unit for recording every step of the data analysis, including data transfer, data analysis, quality checking and result storage.
In a preferred technical solution of the invention, the log unit records every step of the data analysis; if any step fails, the unit automatically sends an e-mail to a designated mailbox reporting the specific failure; when all steps succeed, the unit automatically sends an e-mail to the designated mailbox reporting that the sample has completed successfully.
In another aspect of the invention, a method implemented by the automatic analysis system for targeted-gene second-generation sequencing data is provided, comprising the following steps:
Step 1: the system automatically detects data to be analyzed by checking whether the storage unit for data to be analyzed holds data; if it does, control passes to the analysis mode decision unit;
Step 2: the data are uploaded to the cloud; if the upload succeeds, go to step 3; if the upload fails, go to step 6;
Step 3: the upload has succeeded; control passes to the cloud data analysis unit and cloud data analysis starts;
Step 4: the state of the cloud data analysis is monitored; if the analysis fails, the analysis task is restarted once;
Step 5: the cloud analysis is complete; go to step 8;
Step 6: the upload has failed; the data are copied to the local server, control passes to the backup data analysis unit, and local data analysis starts;
Step 7: the state of the local data analysis is monitored; go to step 8;
Step 8: the data undergo a quality check;
Step 9: if the quality check is passed, the data are placed in the analysis result storage unit.
In a preferred technical solution of the invention, when the data are uploaded to the cloud in step 2, the upload is retried up to three more times if the first attempt fails.
In a preferred technical solution of the invention, the data analysis in step 3 and step 6 comprises the following steps:
1) Sequence alignment: the sequencing data are aligned to the reference genome;
2) Sorting of alignment results: the alignment results are reordered by reference genome coordinate;
3) Duplicate marking (Duplication): reads whose alignment positions are identical are marked in the alignment results;
4) Data quality assessment: from the alignment results, metrics such as the alignment rate, the coverage depth of the target region and the PCR duplication rate are calculated, from which the user can judge the quality of the sequencing data;
5) InDel realignment: regions that were misaligned because of InDels are realigned;
6) Base quality recalibration: base qualities are corrected with a machine learning method to obtain more accurate base qualities;
7) SNV detection and InDel detection: SNVs and InDels are each detected from the processed alignment file;
8) SNV quality filtering and InDel quality filtering: the quality of each detected SNV and InDel site is assessed and the site is labeled accordingly.
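The quality metrics named in step 4 can be illustrated with a short sketch. The source does not define the metrics or give pass thresholds, so the formulas below follow one common convention and every name (`QcMetrics`, `compute_qc`, the count parameters) is hypothetical; the read and base counts are assumed to come from the aligned, duplicate-marked data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QcMetrics:
    alignment_rate: float     # mapped reads / total reads
    mean_target_depth: float  # aligned on-target bases / target region size
    duplication_rate: float   # duplicate reads / mapped reads


def compute_qc(total_reads: int, mapped_reads: int, duplicate_reads: int,
               on_target_bases: int, target_size_bp: int) -> QcMetrics:
    """Compute the three metrics from plain counts (assumed definitions)."""
    if total_reads <= 0 or target_size_bp <= 0:
        raise ValueError("total_reads and target_size_bp must be positive")
    if not 0 <= duplicate_reads <= mapped_reads <= total_reads:
        raise ValueError("expected 0 <= duplicate_reads <= mapped_reads <= total_reads")
    # Avoid dividing by zero when nothing mapped at all.
    duplication = duplicate_reads / mapped_reads if mapped_reads else 0.0
    return QcMetrics(
        alignment_rate=mapped_reads / total_reads,
        mean_target_depth=on_target_bases / target_size_bp,
        duplication_rate=duplication,
    )
```

Whether a sample passes would then depend on thresholds chosen per assay, which the source leaves open.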
Compared with the prior art, the invention has the following beneficial effects:
1. Automation
Existing automatic analysis systems for second-generation sequencing data require manual input up front, including the sequencing data, the analysis modules and the analysis parameters, before automated analysis can begin. In contrast, this system needs no input operation: it detects data to be analyzed automatically and starts the analysis. The system is therefore fully automatic, requires no manual intervention at any stage, saves labor cost, shortens the analysis cycle, reduces the probability of manual error, and is suited to batch analysis of large-scale data.
2. Traceable operating steps
Unlike other automatic analysis systems for second-generation sequencing data, this system includes a log unit that records the log file of every operating step of every sample entering the system. When a run fails, an e-mail alert is sent automatically so that the user can deal with it promptly. When a sample runs successfully through the automated system, an e-mail is likewise sent to tell the user so. The system can therefore trace every operating step applied to the sequencing data and has an automatic alert function.
3. Stability
The stability of the system shows in two respects: 1) Stable results: all data analysis steps and parameters are consistent, which ensures the stability of the analysis results. 2) Stable function: the system monitors key steps and integrates two analysis platforms, cloud and local. This dual-platform strategy ensures that automated data analysis is carried out stably and quickly. The first monitoring point is the upload of data to the cloud, which is attempted several times. The second monitoring point is the cloud data analysis, which is also attempted more than once. Furthermore, when data cannot be uploaded to the cloud for whatever reason, the system automatically switches to the local backup analysis platform so that data analysis proceeds normally.
4. Analysis results are easy to manage
The system stores analysis results sample by sample at a designated location, which makes them easy to retrieve and browse.
5. Suitable for large-scale data
The system adds automatic detection of data to be analyzed and automatic start of the analysis, and is therefore better suited to fully automated processing of large-scale data. For the sake of platform stability, the system integrates two analysis platforms, a cloud analysis platform and a local analysis platform. The system prefers the cloud platform and can switch automatically to the local analysis platform when circumstances require, which keeps the system running stably. The cloud platform has abundant computing resources, and different samples can start their analyses in parallel, so a large number of sequencing samples can be processed at once. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can likewise start in parallel, but the platform is limited by its computing resources: when resources are insufficient, analysis tasks wait in a queue. For reasons of turnaround time, the system prefers the cloud platform.
Description of the drawings
Fig. 1 is the architecture diagram of the automatic analysis system for targeted-gene second-generation sequencing data of the invention;
Fig. 2 is the detailed flowchart of the automated analysis method for targeted-gene second-generation sequencing data of the invention;
Fig. 3 is the data analysis flowchart of the automatic analysis system for targeted-gene second-generation sequencing data of the invention.
Detailed description of the embodiments
The invention is further explained below with reference to specific embodiments. These embodiments only illustrate the invention and do not limit its scope.
As shown in Fig. 1, the automatic analysis system for targeted-gene second-generation sequencing data of the invention comprises the following units:
1. Storage unit for data to be analyzed
This unit stores the data to be analyzed. At a predetermined time interval, the system checks whether the storage unit holds data; if it does, control passes to the analysis mode decision unit.
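The periodic check described above can be sketched as a simple polling loop. This is only an illustration under stated assumptions: the source does not say how data are laid out or how often the unit is checked, so the one-directory-per-sample layout, the 60-second default and the names `find_pending_samples` and `poll_storage` are all hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, List, Set


def find_pending_samples(storage_dir: Path) -> List[Path]:
    """Return one directory per sample (assumed layout), sorted for determinism."""
    if not storage_dir.is_dir():
        return []
    return sorted(p for p in storage_dir.iterdir() if p.is_dir())


def poll_storage(storage_dir: Path,
                 handle_sample: Callable[[Path], None],
                 interval_seconds: float = 60.0,
                 max_cycles: int = 1) -> int:
    """Check the storage unit `max_cycles` times, waiting between checks.

    Each newly seen sample is handed to `handle_sample`, which stands in for
    the analysis mode decision unit. Returns the number of samples handled.
    """
    seen: Set[str] = set()
    for cycle in range(max_cycles):
        for sample_dir in find_pending_samples(storage_dir):
            if sample_dir.name not in seen:
                seen.add(sample_dir.name)
                handle_sample(sample_dir)
        if cycle < max_cycles - 1:
            time.sleep(interval_seconds)
    return len(seen)
```

A long-running service would loop indefinitely; `max_cycles` is bounded here only so the sketch terminates.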
2. Analysis mode decision unit
This unit decides whether the data are analyzed in the cloud or in backup mode. The detected data to be analyzed are first uploaded to the cloud, sample by sample. If the upload succeeds, control passes to the cloud data analysis unit. If the upload fails, up to three further attempts are made; if a further attempt succeeds, control passes to the cloud data analysis unit. If the upload ultimately fails, the data are copied to the local server and control passes to the backup data analysis unit.
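The decision logic above amounts to a bounded retry followed by a fallback. A minimal sketch, assuming the first upload is followed by up to three further attempts (the reading suggested by the method steps); `upload` and `copy_to_local` are hypothetical callables standing in for the real transfer mechanisms, which the source does not specify.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    CLOUD = "cloud"
    LOCAL_BACKUP = "local_backup"


def decide_analysis_mode(sample_id: str,
                         upload: Callable[[str], bool],
                         copy_to_local: Callable[[str], None],
                         retries: int = 3) -> Mode:
    """Try the upload once plus up to `retries` more times, else fall back."""
    for _attempt in range(1 + retries):
        try:
            if upload(sample_id):
                return Mode.CLOUD
        except OSError:
            # Assumption: a transport error counts as one failed attempt.
            pass
    # Every attempt failed: hand the sample to the local backup platform.
    copy_to_local(sample_id)
    return Mode.LOCAL_BACKUP
```

Keeping the transfer functions as parameters lets the same routing be exercised without a real cloud endpoint.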
3. Cloud data analysis unit
The cloud platform has abundant computing resources, and different samples can start their analyses in parallel, so a large number of sequencing samples can be processed at once. The data uploaded to the cloud are analyzed sample by sample. If the analysis fails, the data are analyzed a second time. If the second analysis also fails, manual troubleshooting and repair are required. If the analysis succeeds, the data undergo quality control checking, and the analysis results of data that pass the quality check are stored in the analysis result storage unit.
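The cloud unit's behavior (analyze, re-run once on failure, then either escalate to manual repair or continue to quality control) can be sketched as follows. `analyze` and `passes_qc` are hypothetical callables, and the `"qc_failed"` outcome is an assumption, since the source does not say what happens to data that fail the quality check.

```python
from typing import Callable


def run_cloud_analysis(sample_id: str,
                       analyze: Callable[[str], bool],
                       passes_qc: Callable[[str], bool],
                       max_runs: int = 2) -> str:
    """Return 'stored', 'qc_failed' or 'manual_repair' for one sample."""
    for _run in range(max_runs):
        if analyze(sample_id):
            # Only successfully analyzed data reach the quality check.
            return "stored" if passes_qc(sample_id) else "qc_failed"
    # Both the first run and the single re-run failed.
    return "manual_repair"
```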
The data analysis steps are shown in Fig. 3 and are as follows:
1) Sequence alignment
The sequencing data are aligned to the reference genome; the software used is bwa.
2) Sorting of alignment results
The alignment results are reordered by reference genome coordinate; the software used is bamsormadup.
3) Duplicate marking (Duplication)
Reads whose alignment positions are identical are marked in the alignment results.
4) Data quality assessment
From the alignment results, metrics such as the alignment rate, the coverage depth of the target region and the PCR duplication rate are calculated. The user can judge the quality of the sequencing data from these metrics.
5) InDel realignment
Regions that were misaligned because of InDels are realigned; the software used is GATK.
6) Base quality recalibration
Base qualities are corrected with a machine learning method to obtain more accurate base qualities; the software used is GATK.
7) SNV detection and InDel detection
SNVs and InDels are each detected from the processed alignment file; the software used is GATK.
8) SNV quality filtering and InDel quality filtering
The quality of each detected SNV and InDel site is assessed and the site is labeled accordingly; the software used is GATK.
4. Backup data analysis unit
The backup data analysis unit is the system's fallback and performs the data analysis on a local analysis platform. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can start their analyses in parallel, but the platform is limited by its computing resources: when resources are insufficient, analysis tasks wait in a queue. The data to be analyzed that were copied to the local server are analyzed sample by sample, with the same data analysis steps as in the cloud. If the analysis fails, manual troubleshooting and repair are required. If the analysis succeeds, the data undergo quality control checking, and the analysis results of data that pass the quality check are stored in the analysis result storage unit.
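The queueing behavior of the local platform can be imitated with a bounded worker pool. Note the substitution: this sketch uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in for SGE, purely to show that samples beyond the available slots wait their turn. The `slots` parameter, the `analyze` callable and the outcome labels are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def analyze_locally(sample_ids: List[str],
                    analyze: Callable[[str], bool],
                    slots: int = 2) -> Dict[str, str]:
    """Analyze samples in parallel with at most `slots` running at once."""
    outcomes: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # Submissions beyond `slots` are queued until a worker is free.
        futures = {sid: pool.submit(analyze, sid) for sid in sample_ids}
        for sid, future in futures.items():
            try:
                ok = future.result()
            except Exception:
                ok = False  # a crashed analysis counts as a failure
            # Unlike the cloud unit, a local failure goes straight to manual repair.
            outcomes[sid] = "quality_control" if ok else "manual_repair"
    return outcomes
```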
For the sake of platform stability, the system integrates two analysis platforms, a cloud analysis platform and a local analysis platform. The system prefers the cloud platform and can switch automatically to the local analysis platform when circumstances require, which keeps the system running stably. The cloud platform has abundant computing resources, and different samples can start their analyses in parallel, so a large number of sequencing samples can be processed at once. The local analysis platform can submit analysis tasks through the resource management software SGE, and different samples can likewise start in parallel, but the platform is limited by its computing resources: when resources are insufficient, analysis tasks wait in a queue. For reasons of turnaround time, the system prefers the cloud platform.
5. Analysis result storage unit
The analysis results of data that pass the quality check are stored at a designated location, so that users can query and browse the data.
6. Log unit
This unit records every step of the data analysis for each sample, including data transfer, data analysis, quality checking and result storage. If any step fails, the unit automatically sends an e-mail to a designated mailbox reporting the specific failure, so that the responsible staff can deal with it promptly. When all steps succeed, the unit automatically sends an e-mail to the designated mailbox reporting that the sample has completed successfully. The log function was added so that the state of each sample in the automated system can be monitored in real time. Ordinarily the log files of the individual operating steps would be scattered across different servers, which hinders batch management. In this system, after each operating step of each sample, the corresponding log file and the success or failure status of the step are sent to the log unit, and the system sends an e-mail alert in real time when a run fails.
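The log unit's two responsibilities, collecting per-step status in one place and notifying on failure or on full success, can be sketched as below. The `notify` callable is a hypothetical stand-in for sending e-mail (no mail server details appear in the source), and the step names and subject formats are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

STEPS = ("data transfer", "data analysis", "quality check", "result storage")


@dataclass
class LogUnit:
    notify: Callable[[str, str], None]  # (subject, body); stand-in for e-mail
    records: Dict[str, List[Tuple[str, bool, str]]] = field(default_factory=dict)

    def record(self, sample_id: str, step: str, ok: bool, detail: str = "") -> None:
        """Store one step result centrally and send a notification if due."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.records.setdefault(sample_id, []).append((step, ok, detail))
        if not ok:
            # Any failed step triggers an immediate alert with the details.
            self.notify(f"[FAILED] {sample_id}: {step}", detail)
        elif self._all_succeeded(sample_id):
            self.notify(f"[DONE] {sample_id}", "all steps succeeded")

    def _all_succeeded(self, sample_id: str) -> bool:
        succeeded = {step for step, ok, _ in self.records[sample_id] if ok}
        return succeeded == set(STEPS)
```

Because every step reports to the same object, the per-sample history stays in one place rather than being spread across servers.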
As shown in Fig. 2, the method implemented by the automatic analysis system for targeted-gene second-generation sequencing data of the invention comprises the following process steps:
1. The system automatically (at a predetermined time interval) detects data to be analyzed by checking whether the storage unit for data to be analyzed holds data; if it does, control passes to the analysis mode decision unit.
2. The data are uploaded to the cloud; if the first upload fails, it is retried up to three more times. If the upload succeeds, go to step 3; if the upload fails, go to step 6.
3. The upload has succeeded; control passes to the cloud data analysis unit and cloud data analysis starts.
4. The state of the cloud data analysis is monitored; if the analysis fails, the analysis task is restarted once.
5. The cloud analysis is complete; go to step 8.
6. The upload has failed; the data are copied to the local server, control passes to the backup data analysis unit, and local data analysis starts.
7. The state of the local data analysis is monitored; go to step 8.
8. The data undergo a quality check.
9. If the quality check is passed, the data are placed in the analysis result storage unit.
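The nine steps can be tied together in one routing function that records which steps a sample passes through. This is a sketch under stated assumptions: all callables are hypothetical stand-ins for the real units, the trace labels are illustrative, and the "manual repair" exit reflects the unit descriptions rather than the numbered steps.

```python
from typing import Callable, List


def run_workflow(sample_id: str,
                 upload: Callable[[str], bool],
                 analyze_cloud: Callable[[str], bool],
                 analyze_local: Callable[[str], bool],
                 passes_qc: Callable[[str], bool],
                 store: Callable[[str], None],
                 upload_retries: int = 3) -> List[str]:
    """Route one detected sample through steps 1 to 9 and return the trace."""
    trace = ["1:detected", "2:upload"]
    # any() stops at the first successful attempt (initial try plus retries).
    uploaded = any(upload(sample_id) for _ in range(1 + upload_retries))
    if uploaded:
        trace.append("3:cloud analysis")
        ok = analyze_cloud(sample_id)
        if not ok:
            trace.append("4:restart cloud analysis")
            ok = analyze_cloud(sample_id)
        if ok:
            trace.append("5:cloud analysis complete")
    else:
        trace.append("6:local analysis")
        ok = analyze_local(sample_id)
        if ok:
            trace.append("7:local analysis complete")
    if not ok:
        trace.append("manual repair")
        return trace
    trace.append("8:quality check")
    if passes_qc(sample_id):
        store(sample_id)
        trace.append("9:stored")
    return trace
```

With an `upload` that always fails, the trace runs through steps 1, 2, 6, 7, 8 and 9, the same path as the local-analysis embodiment.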
Embodiment 1: Cloud-based automated analysis of targeted-gene sequencing data
1. The targeted-gene second-generation sequencing data of one sample, sample 1, are placed in the designated location as required.
2. The system automatically detects sample 1 as data to be analyzed.
3. The data are uploaded to the cloud.
4. The upload succeeds and cloud data analysis starts.
5. The state of the cloud data analysis is monitored; the analysis succeeds.
6. The quality check is passed and the data are placed in the analysis result storage unit.
Embodiment 2: Local automated analysis of targeted-gene sequencing data
1. The targeted-gene second-generation sequencing data of one sample, sample 2, are placed in the designated location as required.
2. The system automatically detects sample 2 as data to be analyzed.
3. The data are uploaded to the cloud.
4. The upload fails and the data are copied to the local server.
5. Local data analysis starts.
6. The state of the local data analysis is monitored; the analysis succeeds.
7. The quality check is passed and the data are placed in the analysis result storage unit.

Claims (9)

1. A system for automated analysis of target gene second-generation sequencing data, characterized in that it comprises a to-be-analyzed data storage unit, an analysis mode decision unit, a cloud data analysis unit, a standby data analysis unit, and an analysis result storage unit;
The to-be-analyzed data storage unit is used to store the data to be analyzed; if the unit holds data, the flow proceeds to the analysis mode decision unit;
The analysis mode decision unit is used to decide whether the data are analyzed in the cloud or in standby mode, and the flow proceeds to the cloud data analysis unit or the standby data analysis unit accordingly;
The cloud data analysis unit analyzes the data uploaded to the cloud, sample by sample;
The standby data analysis unit performs data analysis on a local analysis platform;
The analysis result storage unit is used to store the analysis results, from the cloud data analysis unit and the standby data analysis unit, of data that pass quality inspection;
The data analysis in the cloud data analysis unit and the standby data analysis unit comprises the following steps:
1) Sequence alignment: the sequencing data are aligned to a reference genome;
2) Alignment sorting: the alignment results are reordered by reference genome coordinate;
3) Duplicate marking: reads aligned to identical positions in the alignment results are marked;
4) Data quality assessment: from the alignment results, the alignment rate, target region coverage depth, and PCR duplication percentage are calculated, and the user judges the sequencing data quality from this information;
5) InDel realignment: regions misaligned because of InDels during alignment are realigned;
6) Base quality recalibration: base qualities are corrected with a machine learning method to obtain more accurate base qualities;
7) SNV detection and InDel detection: SNVs and InDels are each detected from the processed alignment file;
8) SNV quality filtering and InDel quality filtering: the quality of each detected SNV and InDel site is assessed and the site is labeled accordingly.
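As an illustration of the eight analysis steps and of the three metrics named in step 4, a minimal Python sketch follows. The step identifiers, the `AlignmentCounts` fields, and the choice of mapped reads as the denominator for the duplication percentage are assumptions made for the example; the claim does not specify them.

```python
from dataclasses import dataclass

# Illustrative identifiers for the eight steps, in the order the claim lists them.
PIPELINE_STEPS = [
    "sequence_alignment",
    "alignment_sorting",
    "duplicate_marking",
    "quality_assessment",
    "indel_realignment",
    "base_quality_recalibration",
    "snv_and_indel_detection",
    "snv_and_indel_quality_filtering",
]


@dataclass
class AlignmentCounts:
    """Hypothetical per-sample counts taken from the alignment results."""
    total_reads: int
    mapped_reads: int
    duplicate_reads: int
    target_depth_sum: int      # per-base depth summed over the target region
    target_region_length: int  # number of bases in the target region


def quality_metrics(counts: AlignmentCounts) -> dict:
    """Compute the three step-4 metrics from raw counts."""
    return {
        "alignment_rate": counts.mapped_reads / counts.total_reads,
        "mean_target_depth": counts.target_depth_sum / counts.target_region_length,
        # Assumption: duplicates are expressed as a percentage of mapped reads.
        "pcr_duplication_percent": 100.0 * counts.duplicate_reads / counts.mapped_reads,
    }
```

A caller would build one `AlignmentCounts` per sample and pass it to `quality_metrics`; the thresholds used to judge the resulting values are left to the user, as in the claim.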
2. The system of claim 1, characterized in that the analysis mode decision unit is implemented as follows: the detected data to be analyzed are first uploaded to the cloud, sample by sample; if the first upload succeeds, the sample data held in the to-be-analyzed data storage unit are deleted and the flow proceeds to the cloud data analysis unit; if the first upload fails, the upload is attempted again, up to three times in total; if a retry succeeds, the flow proceeds to the cloud data analysis unit; if the upload ultimately fails, the data are copied to a local server and the flow proceeds to the standby data analysis unit.
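The upload-then-fall-back decision can be sketched as below. This is a hedged illustration only: the function name `choose_analysis_mode`, the `upload` callable, and the reading of the retry count as one first attempt plus up to three retries are assumptions, not details given in the claim.

```python
from typing import Callable


def choose_analysis_mode(upload: Callable[[], bool], max_retries: int = 3) -> str:
    """Return "cloud" if any upload attempt succeeds, otherwise "standby".

    `upload` is a hypothetical callable that uploads one sample and returns
    True on success. Assumption: one first attempt plus `max_retries` retries.
    """
    for _ in range(1 + max_retries):
        try:
            if upload():
                return "cloud"
        except OSError:
            # Treat network or I/O errors as a failed attempt and try again.
            continue
    return "standby"
```

On a "cloud" result the caller would delete the local copy of the sample, and on "standby" it would copy the data to the local server, as the claim describes.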
3. The system of claim 1, characterized in that the cloud data analysis unit performs the data analysis; if the analysis fails, the data are analyzed a second time; if the second analysis also fails, manual troubleshooting and repair are required; if the analysis succeeds, the data undergo quality control inspection.
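The single automatic re-run described here might look like the following sketch. The names `analyze_with_rerun` and `analyze`, and the string outcomes, are hypothetical and chosen for illustration.

```python
from typing import Callable


def analyze_with_rerun(analyze: Callable[[], bool], max_runs: int = 2) -> str:
    """Run the analysis up to `max_runs` times and report the next stage.

    `analyze` is a hypothetical callable returning True when the analysis
    succeeds. With max_runs=2 a failed first run is repeated once.
    """
    for _ in range(max_runs):
        if analyze():
            return "quality_control"
    # Every run failed: hand the sample over for manual troubleshooting.
    return "manual_repair"
```

Setting `max_runs=1` gives the behavior of an analysis that goes straight to manual repair after a single failure.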
4. The system of claim 1, characterized in that the standby data analysis unit performs data analysis on the local analysis platform; the local analysis platform submits analysis tasks through the resource management software SGE, and different samples are started and analyzed in parallel; when resources are insufficient, analysis tasks wait in a queue; the data to be analyzed that were copied to the local server are analyzed sample by sample; if the analysis fails, manual troubleshooting and repair are required; if the analysis succeeds, the data undergo quality control inspection.
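One way to submit one SGE job per sample is to build a `qsub` command line for each sample, as in this sketch. The script name `analyze_sample.sh`, the `ngs_` job-name prefix, and the log directory are assumptions for illustration; only the standard `qsub` options `-N`, `-cwd`, `-o`, and `-e` are used, and the commands are built but not executed here.

```python
import shlex
from typing import List


def build_qsub_command(sample_id: str,
                       script: str = "analyze_sample.sh",
                       log_dir: str = "logs") -> List[str]:
    """Build the argument list for submitting one sample as an SGE job."""
    return [
        "qsub",
        "-N", f"ngs_{sample_id}",            # job name (prefix avoids a leading digit)
        "-cwd",                              # run in the current working directory
        "-o", f"{log_dir}/{sample_id}.out",  # standard output log
        "-e", f"{log_dir}/{sample_id}.err",  # standard error log
        script, sample_id,                   # hypothetical per-sample script and its argument
    ]


def build_all(sample_ids: List[str]) -> List[str]:
    """Return one shell-quoted qsub command line per sample."""
    return [shlex.join(build_qsub_command(s)) for s in sample_ids]
```

Because each sample is a separate job, the scheduler can start samples in parallel and hold the remainder in its queue when resources run short, which matches the behavior the claim describes.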
5. The system of claim 1, characterized in that the analysis result storage unit stores results, sample by sample, at a designated location so that users can query and browse the data.
6. The system of claim 1, characterized in that it further comprises a logging unit for recording every step of the data analysis, including data transfer, data analysis, quality inspection, and result storage.
7. The system of claim 6, characterized in that the logging unit is used to record every step of the data analysis; if any step fails, the unit automatically sends an email to a designated mailbox reporting the specific failure information; when all steps succeed, the unit automatically sends an email to the designated mailbox reporting that the sample completed successfully.
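The notification rule (an email on the first failed step, or a success email once all four steps pass) can be sketched with Python's standard `email.message.EmailMessage`. The addresses, subject format, and the `build_notification` name are placeholders assumed for the example, and the message is only constructed, not sent.

```python
from email.message import EmailMessage
from typing import Dict, Optional

# The four recorded steps named in claim 6.
STEPS = ("data transfer", "data analysis", "quality inspection", "result storage")


def build_notification(sample_id: str,
                       results: Dict[str, bool],
                       recipient: str = "lab@example.com") -> Optional[EmailMessage]:
    """Return a failure or success email, or None while steps are still pending.

    `results` maps a step name to True (succeeded) or False (failed);
    steps that have not run yet are simply absent.
    """
    failed = [step for step in STEPS if results.get(step) is False]
    finished = all(results.get(step) is True for step in STEPS)
    if not failed and not finished:
        return None

    msg = EmailMessage()
    msg["From"] = "pipeline@example.com"  # placeholder sender address
    msg["To"] = recipient
    if failed:
        msg["Subject"] = f"[FAILED] sample {sample_id}: {failed[0]}"
        msg.set_content(f"Step '{failed[0]}' failed for sample {sample_id}.")
    else:
        msg["Subject"] = f"[OK] sample {sample_id} completed"
        msg.set_content(f"All steps succeeded for sample {sample_id}.")
    return msg
```

A returned message could then be handed to `smtplib.SMTP(...).send_message(msg)` against whatever mail server the site uses.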
8. A method implementing a system for automated analysis of target gene second-generation sequencing data, characterized by comprising the following steps:
Step 1: the system automatically detects data to be analyzed by checking whether the to-be-analyzed data storage unit holds data; if so, the flow proceeds to the analysis mode decision unit;
Step 2: the data are uploaded to the cloud; if the upload succeeds, go to step 3; if the upload fails, go to step 6;
Step 3: after a successful upload, the flow enters the cloud data analysis unit and cloud data analysis starts;
Step 4: the state of the cloud data analysis is monitored; if the analysis fails, the data analysis task is restarted once;
Step 5: when the cloud analysis completes, go to step 8;
Step 6: after a failed upload, the data are copied to a local server, the flow enters the standby data analysis unit, and local data analysis starts;
Step 7: the state of the data analysis is monitored, then go to step 8;
Step 8: quality inspection is performed on the data;
Step 9: if the data pass quality inspection, they are placed in the analysis result storage unit;
In step 3 and step 6, the data analysis comprises the following steps:
1) Sequence alignment: the sequencing data are aligned to a reference genome;
2) Alignment sorting: the alignment results are reordered by reference genome coordinate;
3) Duplicate marking: reads aligned to identical positions in the alignment results are marked;
4) Data quality assessment: from the alignment results, the alignment rate, target region coverage depth, and PCR duplication percentage are calculated, and the user judges the sequencing data quality from this information;
5) InDel realignment: regions misaligned because of InDels during alignment are realigned;
6) Base quality recalibration: base qualities are corrected with a machine learning method to obtain more accurate base qualities;
7) SNV detection and InDel detection: SNVs and InDels are each detected from the processed alignment file;
8) SNV quality filtering and InDel quality filtering: the quality of each detected SNV and InDel site is assessed and the site is labeled accordingly.
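The branching among steps 1 to 9 can be traced with a small sketch. The function `workflow_path` and its boolean inputs are hypothetical simplifications; real monitoring, re-runs, and manual repair are omitted.

```python
from typing import List


def workflow_path(has_data: bool, upload_ok: bool, qc_ok: bool) -> List[int]:
    """Return the step numbers visited for one sample.

    The three flags are simplified stand-ins for the outcome of data
    detection, cloud upload, and quality inspection.
    """
    path = [1]                 # step 1: detect data to be analyzed
    if not has_data:
        return path
    path.append(2)             # step 2: upload to the cloud
    if upload_ok:
        path += [3, 4, 5]      # cloud analysis, monitoring, completion
    else:
        path += [6, 7]         # copy to local server, local analysis, monitoring
    path.append(8)             # step 8: quality inspection
    if qc_ok:
        path.append(9)         # step 9: store the analysis result
    return path
```

Both branches converge on step 8, so the quality inspection and storage logic is shared between cloud and local analysis.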
9. The method of claim 8, characterized in that, in the upload of data to the cloud in step 2, if the first upload fails, the upload is retried three times.
CN201710160731.6A 2017-03-17 2017-03-17 Target gene two generations sequencing data automatic analysis system and method Active CN106960135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710160731.6A CN106960135B (en) 2017-03-17 2017-03-17 Target gene two generations sequencing data automatic analysis system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710160731.6A CN106960135B (en) 2017-03-17 2017-03-17 Target gene two generations sequencing data automatic analysis system and method

Publications (2)

Publication Number Publication Date
CN106960135A CN106960135A (en) 2017-07-18
CN106960135B true CN106960135B (en) 2018-06-26

Family

ID=59470363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710160731.6A Active CN106960135B (en) 2017-03-17 2017-03-17 Target gene two generations sequencing data automatic analysis system and method

Country Status (1)

Country Link
CN (1) CN106960135B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176276B (en) * 2019-04-12 2021-01-05 苏州赛美科基因科技有限公司 Biological information analysis process management method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023943A (en) * 2011-09-27 2013-04-03 中国移动通信集团公司 Method, device and terminal equipment for task processing
WO2015140794A1 (en) * 2014-03-20 2015-09-24 Ramot At Tel-Aviv University Ltd. Methods and systems for genome comparison
CN105550536A (en) * 2015-12-29 2016-05-04 北京百迈客生物科技有限公司 Exon sequencing data analysis method and system based on biological cloud platform
CN105653897A (en) * 2015-12-25 2016-06-08 北京百迈客生物科技有限公司 Biological platform-based lncRNA analysis system and method
CN106021979A (en) * 2016-05-12 2016-10-12 北京百迈客云科技有限公司 Analysis system and method for human genome re-sequencing data

Also Published As

Publication number Publication date
CN106960135A (en) 2017-07-18

Similar Documents

Publication Publication Date Title
Cheung et al. Current trends in flow cytometry automated data analysis software
CN107463800B (en) Intestinal microorganism information analysis method and system
CN107391963A (en) Interactive analysis system and method for reference-free eukaryotic transcriptomes based on a computing cloud platform
CN107220885A (en) Genetic test product reporting system and method
CN106874190A (en) User interface testing method and server
CN107368700A (en) Interactive microbial diversity analysis system and method based on a computing cloud platform
CN110991486A (en) Method and device for controlling quality of multi-person collaborative image annotation
CN102411540B (en) Automatic management system of workflow-based common software testing process
CN111354418B (en) High-throughput sequencing technology animal tRFs data analysis method based on reference genome annotation file
OROZCO‐terWENGEL et al. Genealogical lineage sorting leads to significant, but incorrect Bayesian multilocus inference of population structure
CN107367622A (en) Full-automatic blood type analysis system
CN101118243B (en) Method for controlling automatic analyzer
CN104484558A (en) Method and system for automatically generating analysis reports of biological information projects
CN104484582A (en) Method and system for automatically analyzing bioinformation items through modular selection
CN107944228A (en) Visualization method for gene sequencing variant sites
CN106960135B (en) Target gene two generations sequencing data automatic analysis system and method
CN110490613A (en) Blockchain-based product testing method and system
CN114493514A (en) Data processing method and device applied to human resources
CN112434032B (en) Automatic feature generation system and method
CN104484375A (en) Method and system for automatically building database in item analysis process
CN106444446A (en) Method and device for acidometer data collecting
Hertzum et al. The evaluator effect during first-time use of the cognitive walkthrough technique.
CN109406731B (en) Single-target gene editing T0 substitute tobacco strain screening method
Troein et al. [7] An Introduction to BioArray Software Environment
CN104484581A (en) Method and system for automatically analyzing biological information projects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221102

Address after: Room 607, Building 1, No. 55, Aona Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 200137

Patentee after: Shanghai xuzhenda Biotechnology Co.,Ltd.

Address before: 200131 Room D04, Floor 3, No. 207, Fute North Road, Free Trade Pilot Zone, Pudong New Area, Shanghai

Patentee before: WUXI NEXTCODE GENOMICS (SHANGHAI) CO.,LTD.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Automated Analysis System and Method for Targeted Gene Second Generation Sequencing Data

Effective date of registration: 20231130

Granted publication date: 20180626

Pledgee: Industrial Bank Co.,Ltd. Shanghai Zhangyang Sub branch

Pledgor: Shanghai xuzhenda Biotechnology Co.,Ltd.

Registration number: Y2023310000791