CN105095686A - High-flux transcriptome sequencing data quality control method based on multi-core CPU (Central Processing Unit) hardware

Info

Publication number: CN105095686A
Application number: CN201410205571.9A
Authority: CN
Inventors: 周茜; 宁康; 苏晓泉; 徐健
Original assignee: Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Current assignee: Qingdao Institute of Bioenergy and Bioprocess Technology of CAS
Priority date: 2014-05-15
Filing date: 2014-05-15
Publication date: 2015-11-25
Anticipated expiration: 2034-05-15
Also published as: CN105095686B

Abstract

The present invention provides a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware. The method comprises: processing high-throughput transcriptome sequencing data in parallel on a multi-core CPU to obtain data from which sequences of low sequencing quality have been removed; using the multi-core CPU to predict and remove rRNA sequences from that data and to qualitatively identify contaminating sequences; and compiling statistics on and evaluating the sequence alignment results. Because the method runs on a multi-core CPU computer, it overcomes the computing-efficiency bottleneck of a single-core CPU computer and increases the efficiency of high-throughput transcriptome data quality control more than sevenfold. Applying the method significantly improves the accuracy and speed of high-throughput transcriptome data quality control and broadly facilitates the rapid development of research related to transcriptome sequencing.

Description

Quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware

Technical field

The present invention relates to bioinformatics, and specifically to a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware, which can rapidly perform quality control on high-throughput transcriptome sequencing data.

Background art

High-throughput sequencing, also known as "next-generation" sequencing, is a revolutionary change from traditional sequencing: it can sequence hundreds of thousands to millions of DNA/RNA molecules in a single run, and it is increasingly widely used in biological research. Compared with traditional Sanger sequencing, the throughput of next-generation sequencing is one to two orders of magnitude higher and the data volume is larger (100 MB to several GB). Transcriptome sequencing is an in-depth application of high-throughput sequencing that enables detailed, deep, and comprehensive analysis of the transcription profile of a species. However, owing to the limitations of high-throughput sequencing itself and to operator error in manual experimental steps such as transcriptome extraction, raw transcriptome data often contain some low-quality sequences, including low-quality bases, contaminating sequences, and ribosomal RNA (rRNA) sequences. These low-quality sequences greatly affect the accuracy of downstream transcriptome analysis and can even lead to wrong conclusions. In addition, because downstream transcriptome analysis depends on the results of aligning sequences to a reference genome, the alignment quality of the transcriptome sequences is also one of the key factors in assessing the overall quality of transcriptome sequencing data. In summary, quality control is a necessary and critical step in the analysis of high-throughput transcriptome sequencing data. Existing transcriptome quality control methods focus mainly on quality assessment at the sequence alignment level and cannot simultaneously perform comprehensive quality control of bases, sequences, contamination, and alignment quality.

Because high-throughput transcriptome sequencing generally requires multiple samples collected under different conditions or at different time points, and each sample generally requires three or more biological and technical replicates, the number of sequenced samples is large: a single sequencing run often yields more than 20 samples and tens of GB of data. Quality control of high-throughput transcriptome data therefore requires a high-performance computer with sufficient computing power and corresponding analysis software. With current general-purpose analysis methods, a single-CPU computer scans and processes hundreds of millions of sequences one by one, which may take several days or even a month, making data analysis efficiency a major bottleneck in related research.

Summary of the invention

To solve the problem that traditional analysis methods and computing systems cannot meet the requirement for comprehensive, accurate, and efficient quality control of high-throughput transcriptome sequencing data, the present invention exploits the fact that such data can be processed in parallel and proposes a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware.

To achieve the above objective, the present invention adopts the following technical solution: a quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware, comprising the following steps:

processing the high-throughput transcriptome sequencing data in parallel on a multi-core CPU to obtain data from which sequences of low sequencing quality have been removed;

using the multi-core CPU to predict and remove rRNA sequences from the data from which sequences of low sequencing quality have been removed, and qualitatively identifying contaminating sequences;

compiling statistics on and evaluating the sequence alignment results.

Using the multi-core CPU to remove sequences of low sequencing quality from the high-throughput transcriptome sequencing data comprises the following steps:

using the Parallel-QC tool to split the input file into several small sub-datasets;

assigning each sub-dataset to a different CPU core;

on the multiple CPU cores simultaneously, checking the base quality and adapter sequences of every sequence in each sub-dataset, trimming the low-quality bases at both ends of each sequence in turn according to a user-specified length, filtering out sequences that contain a user-specified proportion of low-quality bases, and deleting the adapter sequences;

merging the processed sequences to obtain data from which sequences of low sequencing quality have been removed.
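The four steps above (split, distribute, clean, merge) map naturally onto a process pool. The following is a minimal Python sketch under stated assumptions, not the Parallel-QC implementation: the thresholds `MIN_QUAL` and `MAX_LOW_FRAC`, the example adapter string, the in-memory `(name, sequence, qualities)` read representation, the `map_fn` hook, and all function names are hypothetical. It trims read ends by a quality threshold, whereas the method described here trims according to a user-specified length.

```python
from multiprocessing import Pool

# Hypothetical parameters; the method leaves these to the user.
MIN_QUAL = 20               # Phred score below which a base counts as low quality
MAX_LOW_FRAC = 0.3          # reads with a larger low-quality fraction are dropped
ADAPTER = "AGATCGGAAGAGC"   # example adapter (joint) sequence, an assumption


def split(reads, n_chunks):
    """Divide the read list into at most n_chunks roughly equal sub-lists."""
    size = max(1, -(-len(reads) // n_chunks))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def qc_read(read):
    """Clip the adapter, trim both ends, and filter one read; None if rejected."""
    name, seq, quals = read
    pos = seq.find(ADAPTER)
    if pos != -1:  # drop the adapter and everything after it
        seq, quals = seq[:pos], quals[:pos]
    start, end = 0, len(seq)
    while start < end and quals[start] < MIN_QUAL:    # trim the 5' end
        start += 1
    while end > start and quals[end - 1] < MIN_QUAL:  # trim the 3' end
        end -= 1
    seq, quals = seq[start:end], quals[start:end]
    if not seq:
        return None
    low = sum(q < MIN_QUAL for q in quals)
    if low / len(seq) > MAX_LOW_FRAC:  # too many low-quality bases remain
        return None
    return (name, seq, quals)


def qc_chunk(chunk):
    """Work done by one CPU core: clean every read in one sub-dataset."""
    return [r for r in map(qc_read, chunk) if r is not None]


def parallel_qc(reads, workers=4, map_fn=None):
    """Split, process the chunks (in a process pool by default), and merge."""
    chunks = split(reads, workers)
    if map_fn is not None:
        results = map_fn(qc_chunk, chunks)  # e.g. builtin map for serial runs
    else:
        with Pool(workers) as pool:
            results = pool.map(qc_chunk, chunks)
    return [read for chunk in results for read in chunk]
```

Passing `map_fn=map` runs the same pipeline serially, which is convenient for checking that the parallel and serial paths give identical output. When the default process pool is used from a script, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.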

Using the multi-core CPU to predict and remove rRNA sequences from the data from which sequences of low sequencing quality have been removed, and qualitatively identifying contaminating sequences, comprises the following steps:

building hidden Markov models (HMMs) from all rRNA sequences in the SILVA database; performing rRNA prediction and extraction on the transcriptome sequences by HMM search, and removing the predicted rRNA sequences from the transcriptome data;

mapping the predicted and extracted 16S or 18S rRNA to the known rRNA sequence database SILVA to obtain the species origin of all sequences, summarizing the annotation results of the 16S and 18S rRNA signature sequences separately, and generating a species composition result, thereby obtaining the species and contamination information that may be present in the transcriptome sequencing data.
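The summarizing step above can be illustrated with a short Python sketch. It assumes the mapping against SILVA has already produced one `(read_id, marker, taxon)` tuple per annotated read; that tuple layout, the function names, and the `min_fraction` cutoff are hypothetical and are not taken from rRNA-filter.

```python
from collections import Counter


def species_composition(annotations):
    """Summarize 16S and 18S annotations separately as taxon -> fraction of reads."""
    counts = {}
    for _read_id, marker, taxon in annotations:
        counts.setdefault(marker, Counter())[taxon] += 1
    report = {}
    for marker, counter in counts.items():
        total = sum(counter.values())
        report[marker] = {taxon: n / total for taxon, n in counter.most_common()}
    return report


def flag_contaminants(report, expected_taxa, min_fraction=0.01):
    """List (marker, taxon, fraction) for taxa that are not expected in the sample."""
    return sorted(
        (marker, taxon, fraction)
        for marker, composition in report.items()
        for taxon, fraction in composition.items()
        if taxon not in expected_taxa and fraction >= min_fraction
    )
```

For a yeast transcriptome, for example, one might call `flag_contaminants(report, {"Saccharomyces"})` and inspect any bacterial 16S hits that are returned as candidate contamination.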

Performing rRNA prediction and extraction on the transcriptome sequences by HMM search, and removing the predicted rRNA sequences from the transcriptome data, comprises the following steps:

splitting the data file produced by Parallel-QC, from which low-quality sequencing reads have been removed, into small sub-datasets;

assigning the different sub-datasets to different CPU cores;

on the multiple CPU cores simultaneously, predicting the 16S, 18S, 23S, or 28S rRNA signature sequences of the sequences in each sub-dataset;

merging the prediction results for each class of signature sequence;

according to the signature-sequence predictions, loading the large-scale input data from external storage into memory in multiple batches, searching for and extracting the predicted sequences, and finally merging the search results.
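The final step, loading the input in batches and extracting the predicted reads, can be sketched in Python as follows. This is an illustration only: it assumes FASTA input with non-empty header lines, assumes the HMM search has already produced a set of predicted rRNA read IDs, and uses hypothetical function names and a hypothetical `batch_size`. The HMM search itself is not shown.

```python
from itertools import islice


def read_fasta(handle):
    """Yield (read_id, sequence) pairs from an open FASTA file handle."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].split()[0], []  # ID is the first header token
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def batches(records, batch_size):
    """Group a record stream into lists of at most batch_size items."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def remove_rrna(in_handle, rrna_ids, kept_out, rrna_out, batch_size=100_000):
    """Stream the input batch by batch; route predicted rRNA reads to rrna_out.

    Returns (number of reads kept, number of rRNA reads extracted).
    """
    n_kept = n_rrna = 0
    for batch in batches(read_fasta(in_handle), batch_size):
        for read_id, seq in batch:
            if read_id in rrna_ids:
                rrna_out.write(f">{read_id}\n{seq}\n")
                n_rrna += 1
            else:
                kept_out.write(f">{read_id}\n{seq}\n")
                n_kept += 1
    return n_kept, n_rrna
```

Only one batch is held in memory at a time, so memory use is bounded by `batch_size` rather than by the size of the input file.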

Compiling statistics on and evaluating the results of aligning sequences to the reference genome comprises counting sequences, calculating sequence coverage, and summarizing paired-end alignment information.

Counting sequences covers all sequences, successfully aligned sequences, sequences aligned to particular genomic regions, and the proportion of all sequences that each of these represents.

Calculating sequence coverage covers the number of genes to which sequences align successfully, the base coverage of each gene, and the distribution of successfully aligned sequences across genome structures.

The summarized paired-end alignment information comprises the number of sequences for which both ends align successfully, the number of sequences for which only one end aligns successfully, and the insert size of paired-end aligned sequences.
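The counting and paired-end statistics above can be derived directly from the FLAG and TLEN columns of a SAM file. The Python sketch below is an assumption-laden illustration, not the SAM-stats tool: the function name and the returned keys are hypothetical, and it reports the median insert size as one possible summary. The flag bit values are those defined by the SAM format.

```python
from statistics import median

# Bit flags defined by the SAM format.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
FIRST_IN_PAIR, SECONDARY, SUPPLEMENTARY = 0x40, 0x100, 0x800


def sam_stats(lines):
    """Count reads, mapped reads, and paired-end outcomes from SAM text lines."""
    total = mapped = pairs_both = pairs_one_end = 0
    insert_sizes = []
    for line in lines:
        if not line.strip() or line.startswith("@"):  # blank or header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        if flag & (SECONDARY | SUPPLEMENTARY):  # count each read only once
            continue
        total += 1
        read_mapped = not flag & UNMAPPED
        mapped += read_mapped
        # Evaluate each pair once, on the record of its first read.
        if flag & PAIRED and flag & FIRST_IN_PAIR:
            mate_mapped = not flag & MATE_UNMAPPED
            if read_mapped and mate_mapped:
                pairs_both += 1
                if tlen != 0:
                    insert_sizes.append(abs(tlen))
            elif read_mapped or mate_mapped:
                pairs_one_end += 1
    return {
        "total": total,
        "mapped": mapped,
        "mapped_fraction": mapped / total if total else 0.0,
        "pairs_both_mapped": pairs_both,
        "pairs_one_end_mapped": pairs_one_end,
        "median_insert_size": median(insert_sizes) if insert_sizes else None,
    }
```

In practice the function would be called on an open file, for example `sam_stats(open("aligned.sam"))`, where the file name is a placeholder.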

The present invention has the following advantages and beneficial effects:

1. It achieves comprehensive and efficient transcriptome data quality control, covering combined analysis and quality control of sequencing quality, rRNA sequences, contaminating sequences, and alignment results.

2. Paired with a multi-core CPU computer, it overcomes the computing-efficiency bottleneck of a single-core CPU computer and can increase the efficiency of high-throughput transcriptome data quality control more than sevenfold.

3. Applying the present invention significantly improves the accuracy and speed of high-throughput transcriptome data quality control and broadly facilitates the rapid development of research related to transcriptome sequencing.

Brief description of the drawings

Fig. 1 is the hardware architecture diagram of the present invention, in which 1 is the DMI and PCIe 2.0 bus, 2 is the triple-channel DDR3 memory bus, and 3 is the SATA bus;

Fig. 2 is the software flow chart of the present invention, in which (1) is low-sequencing-quality data processing, (2) is the qualitative identification of rRNA sequences and contaminating sequences, and (3) is the evaluation and quality control of sequence alignment;

Fig. 3 shows the test results for the same transcriptome sequencing data when applying the present invention on a 16-core CPU and when using a single-core CPU.

Detailed description

The present invention is described in further detail below with reference to the drawings and embodiments.

The technical solution adopted by the present invention is a multi-core CPU computer and an efficient, unified software platform built on it. Its features are: (1) a high-performance parallel computing and storage hardware system; and (2) a fully functional, high-performance, unified, and configurable parallel software platform.

(1) High-performance parallel computing and storage hardware

The hardware system uses multi-socket, multi-core CPUs for large-scale parallel computing. Fig. 1 is the system architecture diagram of the compute server:

First, multi-socket, multi-core CPU parallel computing: four processors are used, connected to one another by the QPI bus. Each processor has 8 independent compute cores and is equipped with triple-channel DDR3 RDIMM memory, which also meets the computing requirements of a cloud computing server.

Second, cache and high-speed bus: these suit the scheduling of concurrent sequencing data analysis tasks and the needs of a collaborative working environment for large-scale task distribution.

Finally, a RAID disk array: storing data on a RAID disk array not only improves the response speed and stability of the central server but also facilitates irregular updates of the central server. It can also handle the backup and upgrade needs of a cloud computing server.

(2) Fully functional, high-performance, unified, and configurable software platform

The high-performance software platform comprises low-sequencing-quality data processing, qualitative identification of contaminating sequences, qualitative and quantitative identification of rRNA contaminating sequences, and assessment of sequence alignment quality (Fig. 2). The system is named the RNA-QC-Chain software system (http://www.computationalbioenergy.org/rna-qc-chain.html, independent intellectual property rights), and its data quality control steps are:

First, low-sequencing-quality data processing based on multi-core CPU parallel computing. The Parallel-QC tool (http://www.computationalbioenergy.org/parallel-qc.html, independent intellectual property rights) splits the input file into small sub-datasets and assigns the different sub-datasets to different CPU cores. The multiple CPU cores then simultaneously check the base quality and adapter sequences of every sequence, trim the low-quality bases at both ends of each sequence in turn according to a user-specified length, filter out sequences containing a certain proportion of low-quality bases, and delete the adapter sequences. Finally, the filtered sequences are merged to obtain the data from which sequences of low sequencing quality have been removed.

Second, qualitative identification of contaminating sequences based on multi-core CPU parallel computing. The rRNA-filter tool is first used to remove rRNA sequences from the data. Hidden Markov models (HMMs) are built from all rRNA sequences (including 16S, 18S, 23S, and 28S rRNA sequences) in the public rRNA database SILVA, rRNA prediction is performed on the transcriptome sequences by HMM search, and the predicted rRNA sequences are then removed from the transcriptome data. SILVA is currently one of the most comprehensive rRNA sequence databases in the world and covers rRNA sequences from the three domains of Bacteria, Archaea, and Eukarya. The method can therefore remove as many as possible of the rRNA sequences contained in the transcriptome data. rRNA-filter splits the input file into small sub-datasets, assigns the different sub-datasets to different CPU cores, simultaneously predicts the 16S, 18S, 23S, or 28S rRNA signature sequences of the sequences on the multiple CPU cores, and finally merges the prediction results for each class of signature sequence. Then, according to the signature-sequence predictions, it loads the large-scale input data from external storage into memory in multiple batches, searches for and extracts the predicted sequences, and finally merges the search results.

Next, 16S and 18S rRNA sequences are relatively short biomarker signature sequences that are widely used to identify prokaryotic and eukaryotic species. Based on the annotation of the predicted and extracted 16S or 18S rRNA, rRNA-filter qualitatively obtains the species origin of all sequences in the high-throughput sequencing data, summarizes the search results for the 16S and 18S rRNA signature sequences separately, and generates a graphical species composition result, thereby obtaining the species and contamination information that may be present in the transcriptome sequencing data.

Third, comprehensive and accurate evaluation and quality control of the sequence alignment results. The independently developed SAM-stats tool takes a sequence alignment result file in SAM format and compiles accurate, comprehensive statistics on and evaluates the alignment of the transcriptome sequences to the (known) genome data. Its functions include:

counting sequences, including all sequences, successfully aligned sequences, sequences aligned to particular genomic regions, and the proportion of all sequences that each of these represents;

calculating sequence coverage, including the number of genes to which sequences align successfully, the base coverage of each gene, and the distribution of successfully aligned sequences across genome structures;

summarizing paired-end alignment information, including the number of sequences for which both ends align successfully, the number of sequences for which only one end aligns successfully, and the insert size of paired-end aligned sequences.
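The per-gene base coverage listed among these functions can be computed by merging the aligned intervals that overlap each gene. The Python sketch below is one simple way to do this and is not taken from SAM-stats: it assumes half-open `(start, end)` coordinates on a single reference sequence, and the function names and the in-memory inputs are hypothetical.

```python
def covered_fraction(gene_start, gene_end, intervals):
    """Fraction of the bases in [gene_start, gene_end) covered by any interval."""
    # Keep only overlapping intervals, clipped to the gene boundaries.
    clipped = sorted(
        (max(s, gene_start), min(e, gene_end))
        for s, e in intervals
        if s < gene_end and e > gene_start
    )
    covered, current_end = 0, gene_start
    for s, e in clipped:
        if e > current_end:  # count only bases not already covered
            covered += e - max(s, current_end)
            current_end = e
    return covered / (gene_end - gene_start)


def gene_coverage(genes, alignments):
    """Map each gene name to its covered fraction; genes is {name: (start, end)}."""
    return {
        name: covered_fraction(start, end, alignments)
        for name, (start, end) in genes.items()
    }


def genes_hit(coverage):
    """Number of genes with at least one aligned base."""
    return sum(1 for fraction in coverage.values() if fraction > 0)
```

This version scans every alignment for every gene, which is adequate for illustration; a production tool would typically sort the alignments once or use an interval index.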

In summary, this software platform relies on the multi-core CPU hardware platform; only when the two work together can efficient quality control of transcriptome sequencing data be achieved.

As shown in Fig. 1, the main hardware components of the quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware are as follows. First, the multi-scale parallel computing capability of four multi-core CPUs, each with 8 independent compute cores and triple-channel memory. Second, cache and high-speed bus. Third, a RAID disk array, which not only improves the response speed and stability of the central server but also facilitates irregular updates of the central server. The basic computing and storage hardware configuration is: each CPU has at least 4 independent physical compute cores, at least 2 GB of dual-channel memory, a hard disk of at least 50 GB, and a high-speed interconnect between the CPU and storage.

As shown in Fig. 2, the main steps of the workflow are as follows. First, the Parallel-QC software tool uses the multi-core CPU to process the transcriptome sequences: it trims the low-quality bases at both ends of the input sequences in turn, filters out sequences containing a certain proportion of low-quality bases, deletes adapter sequences, and then merges the results as high-sequencing-quality sequence data. Next, the rRNA-filter tool performs rRNA sequence prediction and qualitative detection of contaminating sequences on the data from the previous step: using parallel, multithreaded computation, it extracts and removes the predicted rRNA sequences (16S/18S or 23S/28S) and maps the 16S or 18S sequences among them to the known rRNA sequence database SILVA to obtain the species origin (including possible contaminating species) of all sequences. Finally, for the results of aligning sequences to the reference genome (a file in SAM format), the SAM-stats software tool compiles statistics on and evaluates the quality of the transcriptome data from the perspective of sequence alignment, including the number of successfully aligned sequences, gene coverage, and the outcome of paired-end alignment. The above results are combined to generate graphical analysis results and an analysis report. The basic software platform configuration is: Linux operating system, preinstalled GCC runtime environment, CUDA runtime environment (3.0 or later), RNA-QC-Chain software system version 1.0 or later, and Parallel-META software version 2.0 or later. The RNA-QC-Chain and Parallel-META software systems run from a command-line interface and come with electronic user instructions. The official website (http://www.computationalbioenergy.org/software.html) also provides long-term software updates.

The method of the present invention overcomes the computing-efficiency bottleneck of a single-core CPU computer and increases the efficiency of high-throughput transcriptome data quality control more than sevenfold. As shown in Fig. 3, a test on the same transcriptome sequencing data shows that the entire quality control process completes in 23 minutes on a 16-core CPU, compared with 180 minutes on a single-core CPU.
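The "more than 7 times" figure follows directly from the two reported run times. The short Python sketch below reproduces the arithmetic and also derives the parallel efficiency (speedup divided by core count), a standard measure that is not reported in the test itself; the function names are illustrative.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-core run time to multi-core run time."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, cores):
    """Speedup per core; 1.0 would mean perfect linear scaling."""
    return speedup(t_serial, t_parallel) / cores


# Reported run times: 180 minutes on one core, 23 minutes on 16 cores.
print(round(speedup(180, 23), 2))                  # 7.83
print(round(parallel_efficiency(180, 23, 16), 2))  # 0.49
```

A speedup of about 7.8 on 16 cores corresponds to roughly 49% parallel efficiency, which is consistent with part of the workflow, such as file splitting, merging, and disk I/O, not scaling with the number of cores.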

Claims

1. A quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware, characterized by comprising the following steps:

compiling statistics on and evaluating the sequence alignment results.

2. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 1, characterized in that using the multi-core CPU to remove sequences of low sequencing quality from the high-throughput transcriptome sequencing data comprises the following steps:

assigning each sub-dataset to a different CPU core;

3. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 1, characterized in that using the multi-core CPU to predict and remove rRNA sequences from the data from which sequences of low sequencing quality have been removed, and qualitatively identifying contaminating sequences, comprises the following steps:

mapping the predicted and extracted 16S or 18S rRNA to the known rRNA sequence database SILVA to obtain the species origin of all sequences, summarizing the annotation results of the 16S and 18S rRNA signature sequences separately, and generating a species composition result, thereby obtaining the species and contamination information that may be present in the transcriptome sequencing data.

4. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 3, characterized in that performing rRNA prediction and extraction on the transcriptome sequences by hidden Markov model search, and removing the predicted rRNA sequences from the transcriptome data, comprises the following steps:

assigning the different sub-datasets to different CPU cores;

5. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 1, characterized in that compiling statistics on and evaluating the results of aligning sequences to the reference genome comprises counting sequences, calculating sequence coverage, and summarizing paired-end alignment information.

6. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 5, characterized in that counting sequences covers all sequences, successfully aligned sequences, sequences aligned to particular genomic regions, and the proportion of all sequences that each of these represents.

7. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 5, characterized in that calculating sequence coverage covers the number of genes to which sequences align successfully, the base coverage of each gene, and the distribution of successfully aligned sequences across genome structures.

8. The quality control method for high-throughput transcriptome sequencing data based on multi-core CPU hardware according to claim 5, characterized in that the summarized paired-end alignment information comprises the number of sequences for which both ends align successfully, the number of sequences for which only one end aligns successfully, and the insert size of paired-end aligned sequences.