CN105095686B

CN105095686B - Quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware

Info

Publication number: CN105095686B
Application number: CN201410205571.9A
Authority: CN
Inventors: 周茜; 宁康; 苏晓泉; 徐健
Original assignee: Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Current assignee: Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Priority date: 2014-05-15
Filing date: 2014-05-15
Publication date: 2018-08-14
Anticipated expiration: 2034-05-15
Also published as: CN105095686A

Abstract

The present invention is a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware. The method comprises: processing the high-throughput transcriptome sequencing data in parallel on a multi-core CPU to obtain data from which sequences of low sequencing quality have been removed; using the multi-core CPU to predict and remove the rRNA sequences in those data and to qualitatively identify contaminating sequences; and computing statistics on, and evaluating, the sequence alignment results. Because it runs on a multi-core CPU computer, the invention overcomes the computational efficiency bottleneck of single-core CPU computers and can improve the efficiency of high-throughput transcriptome data quality control by a factor of 7 or more. Applying the invention will significantly improve the accuracy and speed of high-throughput transcriptome data quality control and broadly contribute to the rapid development of research related to transcriptome sequencing.

Description

Quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware

Technical field

The invention relates to bioinformatics, and specifically to a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware, which can rapidly perform quality control on high-throughput transcriptome sequencing data.

Background technology

High-throughput sequencing, also known as "next-generation" sequencing, is a revolutionary change from traditional sequencing: it can sequence hundreds of thousands to millions of DNA/RNA molecules at once, and it is being applied ever more widely in biological research. Compared with traditional Sanger sequencing, next-generation sequencing raises throughput by one to two orders of magnitude and produces much more data (100 MB to several GB). Transcriptome sequencing is an in-depth application of high-throughput sequencing that allows the transcription profile of a species to be analyzed in detail, in depth, and comprehensively. However, because of the limitations of high-throughput sequencing technology itself and operator error in manual experimental steps such as transcriptome extraction, the raw transcriptome data often contain some low-quality sequences, including low-quality bases, contaminating sequences, and ribosomal RNA (rRNA) sequences. These low-quality sequences can greatly affect the accuracy of downstream transcriptome analysis and may even lead to wrong conclusions. In addition, because downstream transcriptome analysis results are obtained after the sequences are aligned to a reference genome, the alignment quality of the transcriptome sequences is also one of the key factors in measuring the overall quality of transcriptome sequencing data. In summary, quality control is a necessary and critical step in the analysis of high-throughput transcriptome sequencing data. Existing transcriptome quality control methods focus mainly on quality evaluation at the alignment level and cannot perform comprehensive quality control of bases, sequences, contamination, and alignment quality at the same time.

High-throughput transcriptome sequencing generally requires measuring multiple samples collected under different conditions or at different time points, and each sample generally needs three or more biological and technical replicates, so the number of samples sequenced is large. A single sequencing run therefore often yields more than 20 samples and tens of GB of data, and quality control of such data requires a high-performance computer with adequate computing power and corresponding analysis software. With current general-purpose analysis methods, scanning and processing several hundred million sequences one by one on a single-CPU computer may take days or even a month, which makes the efficiency of data analysis a major bottleneck in related research.

Summary of the invention

Traditional analysis methods and computing systems cannot meet the requirements of high-throughput transcriptome sequencing data quality control comprehensively, accurately, and efficiently. To solve this problem, and because high-throughput transcriptome sequencing data lend themselves to parallel processing, the present invention proposes a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware.

The technical solution adopted by the present invention to achieve the above purpose is a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware, comprising the following steps:

Processing the high-throughput transcriptome sequencing data in parallel on a multi-core CPU to obtain data from which sequences of low sequencing quality have been removed;

Using the multi-core CPU to predict and remove the rRNA sequences in the data from which low-quality sequences have been removed, and qualitatively identifying contaminating sequences;

Computing statistics on, and evaluating, the sequence alignment results.

Removing sequences of low sequencing quality from the high-throughput transcriptome sequencing data using the multi-core CPU comprises the following steps:

Splitting the input file into several small sub-datasets using the Parallel-QC tool;

Assigning each sub-dataset to a different CPU core;

On the multiple CPU cores simultaneously, checking the base quality and adapter sequences of each sequence in the assigned sub-dataset, trimming the low-quality bases from both ends of each sequence according to a user-specified length, filtering out sequences that contain a user-specified proportion of low-quality bases, and deleting the adapter sequences;

Merging the processed sequences to obtain the data from which sequences of low sequencing quality have been removed.
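The split, distribute, trim/filter, and merge steps above can be sketched in Python as follows. This is a minimal illustration and not the Parallel-QC implementation: the function names, the in-memory read tuples `(name, sequence, qualities)`, the quality threshold `min_q`, the proportion `max_low_frac`, and trimming by quality threshold (rather than by a user-specified length) are all assumptions made for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def trim_read(seq, quals, min_q=20):
    """Trim bases whose quality is below min_q from both ends of one read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def remove_adapter(seq, quals, adapter):
    """Cut the read at the first occurrence of the adapter, if it is present."""
    pos = seq.find(adapter) if adapter else -1
    if pos >= 0:
        return seq[:pos], quals[:pos]
    return seq, quals


def process_chunk(task):
    """Quality-control one sub-dataset; this is the unit of work sent to a core."""
    chunk, min_q, max_low_frac, adapter = task
    kept = []
    for name, seq, quals in chunk:
        seq, quals = remove_adapter(seq, quals, adapter)
        seq, quals = trim_read(seq, quals, min_q)
        if not seq:
            continue  # nothing left after trimming
        low = sum(q < min_q for q in quals)
        if low / len(seq) <= max_low_frac:
            kept.append((name, seq, quals))
    return kept


def split_chunks(reads, n):
    """Split the read list into at most n sub-datasets of near-equal size."""
    size = max(1, -(-len(reads) // n))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def run_qc(reads, workers=4, min_q=20, max_low_frac=0.2, adapter=""):
    """Split, process the sub-datasets (in parallel if workers > 1), and merge."""
    tasks = [(c, min_q, max_low_frac, adapter) for c in split_chunks(reads, workers)]
    if workers == 1:
        results = [process_chunk(t) for t in tasks]
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(process_chunk, tasks))
    return [read for part in results for read in part]
```

`pool.map` returns results in task order, so the merged output keeps the input order. On platforms where worker processes are started with "spawn", `run_qc` with `workers > 1` should be called from under an `if __name__ == "__main__":` guard.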

Predicting and removing the rRNA sequences in the data from which low-quality sequences have been removed, and qualitatively identifying contaminating sequences, using the multi-core CPU comprises the following steps:

Building hidden Markov models from all rRNA sequences in the SILVA database; predicting and extracting rRNA from the transcriptome sequences by hidden Markov model search, and removing the predicted rRNA sequences from the transcriptome data;

Mapping the predicted and extracted 16S or 18S rRNA onto the known rRNA sequence database SILVA to obtain the species-of-origin information for all sequences, aggregating the annotation results for the 16S and 18S rRNA signature sequences separately, and generating the species composition, thereby obtaining all species that may be present in the transcriptome sequencing data together with the contamination information;

Predicting and extracting rRNA from the transcriptome sequences by hidden Markov model search, and removing the predicted rRNA sequences from the transcriptome data, comprises the following steps:

Splitting the data file that Parallel-QC has processed to remove low-quality sequences into small sub-datasets;

Assigning the different sub-datasets to different CPU cores;

Predicting the 16S, 18S, 23S, or 28S rRNA signature sequences of the subsequences simultaneously on the many CPU cores;

Merging the prediction results for all classes of signature sequence;

Loading the large-scale input data from external storage into memory in multiple batches, searching for and extracting sequences according to the signature sequence prediction results, and finally merging the search results.
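The final step, loading the input in batches and separating reads according to the prediction results, might look like the sketch below. It assumes FASTA input and assumes that the prediction step has already produced a set of read identifiers (`rrna_ids`); the function names and the batch size are hypothetical and are not taken from rRNA-filter.

```python
def read_fasta(lines):
    """Yield (read_id, sequence) records from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            fields = line[1:].split()
            name, parts = (fields[0] if fields else ""), []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def batches(records, batch_size):
    """Group a record stream into lists of at most batch_size records."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def split_by_prediction(lines, rrna_ids, batch_size=100000):
    """Hold one batch in memory at a time and merge the per-batch results.

    Returns (rrna_reads, remaining_reads), each a list of (read_id, sequence).
    """
    rrna, remaining = [], []
    for batch in batches(read_fasta(lines), batch_size):
        for name, seq in batch:
            (rrna if name in rrna_ids else remaining).append((name, seq))
    return rrna, remaining
```

An open file object can be passed as `lines`, so only one batch of records is parsed at a time. For data that do not fit in memory, the two result lists would be written to output files per batch instead of being accumulated.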

Computing statistics on, and evaluating, the results of aligning sequences to the reference genome comprises counting sequences, calculating sequence coverage, and summarizing paired-end alignment information.

Counting sequences covers all sequences, successfully aligned sequences, sequences aligned to particular genomic regions, and the proportion of all sequences that each of these represents.

Calculating sequence coverage covers the number of genes to which sequences were successfully aligned, the base coverage of each gene, and the distribution of successfully aligned sequences across genomic structures.

Summarizing paired-end alignment information covers the number of sequences aligned at both ends, the number of sequences aligned at only one end, and the insert sizes of paired-end aligned sequences.
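As an illustration of the counting and paired-end summaries, the sketch below derives them from the FLAG and TLEN columns of SAM records. It is not the SAM-stats tool: the function name and the returned dictionary keys are assumptions, pairs are counted once (on the first-in-pair record), and secondary and supplementary alignments are skipped so that each read is counted once.

```python
def sam_stats(lines):
    """Summarize alignment counts from SAM text lines (header lines are skipped)."""
    total = mapped = both_ends = one_end = 0
    insert_sizes = []
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (0x100 | 0x800):  # secondary or supplementary alignment
            continue
        total += 1
        is_mapped = not (flag & 0x4)  # 0x4: read unmapped
        if is_mapped:
            mapped += 1
        # Summarize each pair once, using the first-in-pair record (0x1 and 0x40).
        if (flag & 0x1) and (flag & 0x40):
            mate_mapped = not (flag & 0x8)  # 0x8: mate unmapped
            if is_mapped and mate_mapped:
                both_ends += 1
                tlen = int(fields[8])  # column 9: observed template length
                if tlen != 0:
                    insert_sizes.append(abs(tlen))
            elif is_mapped or mate_mapped:
                one_end += 1
    return {
        "total": total,
        "mapped": mapped,
        "mapped_fraction": mapped / total if total else 0.0,
        "both_ends": both_ends,
        "one_end": one_end,
        "insert_sizes": insert_sizes,
    }
```

Counting reads that fall in particular genomic regions would additionally need the RNAME and POS columns together with an annotation of those regions, which this sketch leaves out.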

The present invention has the following advantages and beneficial effects:

1. It achieves comprehensive and efficient transcriptome data quality control, including combined analysis and quality control of sequencing quality, rRNA sequences, contaminating sequences, and alignment results;

2. Because it runs on a multi-core CPU computer, it overcomes the computational efficiency bottleneck of single-core CPU computers and can improve the efficiency of high-throughput transcriptome data quality control by a factor of 7 or more;

3. Applying the invention will significantly improve the accuracy and speed of high-throughput transcriptome data quality control and broadly contribute to the rapid development of research related to transcriptome sequencing.

Description of the drawings

Fig. 1 is the hardware architecture diagram of the present invention, in which 1 denotes the DMI and PCIe 2.0 buses, 2 denotes the triple-channel DDR3 memory bus, and 3 denotes the SATA bus;

Fig. 2 is the software flow chart of the present invention, in which (1) is low sequencing quality data processing, (2) is the qualitative identification of rRNA sequences and contaminating sequences, and (3) is the evaluation and quality control of sequence alignment;

Fig. 3 shows the test results for the same transcriptome sequencing dataset processed with the present invention on a 16-core CPU and on a single-core CPU.

Detailed description of the embodiments

The present invention is described in further detail with reference to the accompanying drawings and embodiments.

The technical solution adopted by the present invention is a multi-core CPU computer and an efficient, unified software platform built on it. Its features are (1) a high-performance parallel computing and storage hardware system and (2) a full-featured, high-performance, unified, configurable, parallelized software platform.

(1) High-performance parallel computing and storage hardware

The hardware system uses multi-socket multi-core CPUs for large-scale parallel computation. Fig. 1 is the system structure diagram of the compute server:

First, multi-socket multi-core CPU parallel computation: four processors are used, connected to one another by QPI buses. Each processor has 8 independent compute cores and is equipped with triple-channel DDR3 RDIMM memory, which also meets the computing requirements of a cloud computing server.

Second, cache and high-speed bus: these meet the needs of scheduling concurrent sequencing data analysis tasks and of a cooperative working environment when distributing tasks at large scale.

Finally, the RAID disk array: storing data on a RAID disk array not only improves the response speed and stability of the central server but also facilitates occasional updates of the central server. It can also handle the backup and upgrade needs of a cloud computing server.

(2) Full-featured, high-performance, unified, configurable software platform

The high-performance software platform covers low sequencing quality data processing, qualitative identification of contaminating sequences, qualitative and quantitative identification of rRNA contaminating sequences, and assessment of sequence alignment quality (Fig. 2). The system is named the RNA-QC-Chain software system (http://www.computationalbioenergy.org/rna-qc-chain.html, independent intellectual property rights), and its data quality control steps are:

First, low sequencing quality data processing based on multi-core CPU parallel computation. The Parallel-QC tool (http://www.computationalbioenergy.org/parallel-qc.html, independent intellectual property rights) splits the input file into small sub-datasets, assigns the different sub-datasets to different CPU cores, and then, on the multiple CPU cores simultaneously, checks the base quality and adapter sequences of each sequence, trims the low-quality bases from both ends of each sequence according to a user-specified length, filters out sequences containing a given proportion of low-quality bases, and deletes the adapter sequences. Finally, the filtered sequences are merged to obtain the data from which sequences of low sequencing quality have been removed.

Second, qualitative identification of contaminating sequences based on multi-core CPU parallel computation. The rRNA-filter tool is first used to remove the rRNA sequences in the data. Hidden Markov models (HMMs) are built from all rRNA sequences in the public rRNA database SILVA (including 16S, 18S, 23S, and 28S rRNA sequences), rRNA is predicted in the transcriptome sequences by HMM search, and the predicted rRNA sequences are then removed from the transcriptome data. SILVA is currently one of the most comprehensive ribosomal RNA databases in the world and covers rRNA sequences from the three domains of bacteria, archaea, and eukaryotes. Our method can therefore remove as many of the rRNA sequences contained in the transcriptome data as possible. rRNA-filter splits the input file into small sub-datasets, assigns the different sub-datasets to different CPU cores, predicts the 16S, 18S, 23S, or 28S rRNA signature sequences of the subsequences simultaneously on the many CPU cores, and merges the prediction results for all classes of signature sequence. It then loads the large-scale input data from external storage into memory in multiple batches, searches for and extracts sequences according to the signature sequence prediction results, and finally merges the search results.

Next, 16S and 18S rRNA sequences are relatively short biomarker signature sequences that are widely used to identify prokaryotic and eukaryotic species. Based on the annotation of the predicted and extracted 16S or 18S rRNA, rRNA-filter qualitatively determines the species of origin of all sequences in the high-throughput sequencing data, aggregates the search results for the 16S and 18S rRNA signature sequences separately, and generates a graphical species composition, thereby obtaining all species that may be present in the transcriptome sequencing data together with the contamination information.
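The aggregation of 16S and 18S annotation results into a species composition can be sketched as a simple tally. The input format, a list of `(read_id, marker, taxon)` tuples produced by some upstream mapping against SILVA, and the function name are assumptions for illustration; the actual output format of rRNA-filter is not described in the text.

```python
from collections import Counter


def species_composition(annotations):
    """Tally taxa separately for 16S and 18S hits and convert counts to fractions.

    annotations: iterable of (read_id, marker, taxon); other markers are ignored.
    """
    counts = {"16S": Counter(), "18S": Counter()}
    for _read_id, marker, taxon in annotations:
        if marker in counts:
            counts[marker][taxon] += 1
    composition = {}
    for marker, counter in counts.items():
        total = sum(counter.values())
        composition[marker] = (
            {taxon: n / total for taxon, n in counter.most_common()} if total else {}
        )
    return composition
```

Taxa that are not expected in the sample would then stand out in the 16S or 18S composition as candidate contaminants; deciding which taxa are expected is left to the user.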

Third, comprehensive and accurate evaluation and quality control of the sequence alignment results. The independently developed SAM-stats tool takes alignment result files in SAM format and produces accurate, comprehensive statistics on, and an evaluation of, the alignment of the transcriptome sequences to (known) genome data. Its functions include:

Counting sequences, including all sequences, successfully aligned sequences, sequences aligned to particular genomic regions, and the proportion of all sequences that each of these represents;

Calculating sequence coverage, including the number of genes to which sequences were successfully aligned, the base coverage of each gene, and the distribution of successfully aligned sequences across genomic structures;

Summarizing paired-end alignment information, including the number of sequences aligned at both ends, the number of sequences aligned at only one end, and the insert sizes of paired-end aligned sequences.
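For the coverage calculation, a minimal sketch of per-gene base coverage is given below. It assumes that genes and alignments are given as half-open `(start, end)` intervals on the same reference sequence; these data structures and function names are hypothetical, and the sketch ignores spliced alignments, which real transcriptome alignments often contain.

```python
def base_coverage(gene_start, gene_end, alignments):
    """Return (fraction of gene bases covered at least once, mean depth)."""
    length = gene_end - gene_start
    if length <= 0:
        raise ValueError("gene interval must have positive length")
    depth = [0] * length
    for a_start, a_end in alignments:
        lo, hi = max(a_start, gene_start), min(a_end, gene_end)
        for pos in range(lo, hi):  # empty when the alignment misses the gene
            depth[pos - gene_start] += 1
    covered = sum(1 for d in depth if d > 0)
    return covered / length, sum(depth) / length


def coverage_report(genes, alignments):
    """genes: {name: (start, end)}. Returns (number of genes hit, per-gene coverage)."""
    report = {
        name: base_coverage(start, end, alignments)
        for name, (start, end) in genes.items()
    }
    genes_hit = sum(1 for breadth, _depth in report.values() if breadth > 0)
    return genes_hit, report
```

The per-base depth array keeps the example easy to follow; for whole genomes an interval-based or sorted sweep approach would use far less memory.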

In conclusion this software platform depends on multi-core CPU hardware platform, high efficiency can be played by only cooperating The function of transcript profile sequencing data quality control.

As shown in Fig. 1, the main components of the quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware are as follows. First, the multi-scale parallel computing capability of four multi-core CPUs, each with 8 independent compute cores and triple-channel memory. Second, cache and high-speed bus. Third, a RAID disk array, which not only improves the response speed and stability of the central server but also facilitates occasional updates of the central server. The basic configuration of the computing and storage hardware is: a single-socket CPU with at least 4 independent physical compute cores, at least 2 GB of dual-channel memory, a hard disk of at least 50 GB, and a high-speed interconnect between CPU and storage.

As shown in Fig. 2, the main steps of the workflow are as follows. First, the Parallel-QC software tool uses the multi-core CPU to process the transcriptome sequences: it trims the low-quality bases from both ends of the input sequences, filters out sequences containing a given proportion of low-quality bases, deletes the adapter sequences, and then merges the results into the high sequencing quality sequence data. Next, the rRNA-filter tool performs rRNA sequence prediction and qualitative detection of contaminating sequences on the data from the previous step: using parallel multithreaded computation, it extracts and removes the predicted rRNA sequences (16S/18S or 23S/28S) and maps the 16S or 18S sequences onto the known rRNA sequence database SILVA to obtain the species-of-origin information for all sequences (including possible contaminating species). Finally, for the results of aligning sequences to the reference genome (files in SAM format), the SAM-stats software tool counts and evaluates the quality of the transcriptome data from the perspective of sequence alignment, including the number of successfully aligned sequences, gene coverage, and the outcome of paired-end alignment. These results are combined to generate graphical analysis results and an analysis report. The basic configuration of the software platform is: Linux operating system, pre-installed GCC runtime environment, CUDA runtime environment (3.0 or later), RNA-QC-Chain software system version 1.0 or later, and Parallel-META software version 2.0 or later. RNA-QC-Chain and Parallel-META both run from the command line and come with electronic operating instructions. The official website (http://www.computationalbioenergy.org/software.html) also provides a long-term software update service.

The method of the present invention overcomes the computational efficiency bottleneck of single-core CPU computers and improves the efficiency of high-throughput transcriptome data quality control by a factor of 7 or more. As shown in Fig. 3, a test on the same transcriptome sequencing dataset shows that the entire quality control process completes in 23 minutes on a 16-core CPU, whereas it takes 180 minutes on a single-core CPU.
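The two reported timings can be expressed as a speedup and a parallel efficiency. These are the standard definitions (time on one core divided by time on p cores, and speedup divided by p); the symbols below are introduced here for illustration and do not appear in the original text.

```latex
\[
S = \frac{T_{1}}{T_{16}} = \frac{180\ \text{min}}{23\ \text{min}} \approx 7.8,
\qquad
E = \frac{S}{p} = \frac{7.8}{16} \approx 0.49
\]
```

A speedup of about 7.8 is consistent with the stated improvement of 7 times or more, and it corresponds to roughly half of the ideal 16-fold speedup on 16 cores.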

Claims

1. A quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware, characterized by comprising the following steps:

processing the high-throughput transcriptome sequencing data in parallel on a multi-core CPU to obtain data from which sequences of low sequencing quality have been removed;

using the multi-core CPU to predict and remove the rRNA sequences in the data from which low-quality sequences have been removed, and qualitatively identifying contaminating sequences;

computing statistics on, and evaluating, the sequence alignment results;

wherein predicting and removing the rRNA sequences in the data from which low-quality sequences have been removed, and qualitatively identifying contaminating sequences, using the multi-core CPU comprises the following steps:

building hidden Markov models from all rRNA sequences in the SILVA database; predicting and extracting rRNA from the transcriptome sequences by hidden Markov model search, and removing the predicted rRNA sequences from the transcriptome data;

mapping the predicted and extracted 16S or 18S rRNA onto the known rRNA sequence database SILVA to obtain the species-of-origin information for all sequences, aggregating the annotation results for the 16S and 18S rRNA signature sequences separately, and generating the species composition, thereby obtaining all species that may be present in the transcriptome sequencing data together with the contamination information;

wherein predicting and extracting rRNA from the transcriptome sequences by hidden Markov model search comprises the following steps:

splitting the data file that Parallel-QC has processed to remove low-quality sequences into small sub-datasets;

assigning the different sub-datasets to different CPU cores;

merging the prediction results for all classes of signature sequence;

loading the large-scale input data from external storage into memory in multiple batches, searching for and extracting sequences according to the signature sequence prediction results, and finally merging the search results.

2. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 1, characterized in that removing sequences of low sequencing quality from the high-throughput transcriptome sequencing data using the multi-core CPU comprises the following steps:

assigning each sub-dataset to a different CPU core;

on the multiple CPU cores simultaneously, checking the base quality and adapter sequences of each sequence in the assigned sub-dataset, trimming the low-quality bases from both ends of each sequence according to a user-specified length, filtering out sequences that contain a user-specified proportion of low-quality bases, and deleting the adapter sequences;

3. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 1, characterized in that computing statistics on, and evaluating, the results of aligning sequences to the reference genome comprises counting sequences, calculating sequence coverage, and summarizing paired-end alignment information.

4. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 3, characterized in that counting sequences covers all sequences, successfully aligned sequences, sequences aligned to particular genomic regions, and the proportion of all sequences that each of these represents.

5. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 3, characterized in that calculating sequence coverage covers the number of genes to which sequences were successfully aligned, the base coverage of each gene, and the distribution of successfully aligned sequences across genomic structures.

6. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 3, characterized in that summarizing paired-end alignment information covers the number of sequences aligned at both ends, the number of sequences aligned at only one end, and the insert sizes of paired-end aligned sequences.