CN108763869A

CN108763869A - A high-efficiency sequencing data processing method

Info

Publication number: CN108763869A
Application number: CN201810378325.1A
Authority: CN
Inventors: 常珊; 陆旭峰; 许磊; 张大为
Original assignee: Jiangsu University of Technology
Current assignee: Jiangsu University of Technology
Priority date: 2018-04-25
Filing date: 2018-04-25
Publication date: 2018-11-06

Abstract

The invention discloses a high-efficiency sequencing data processing method, belonging to the field of biomedical technology. The method comprises the following steps: preparing for parallel computation according to the high-throughput sequencing data; preparing reference sequences for the sequencing data; filtering out data of unacceptable quality by parallel computation; aligning the sequencing data against the reference sequences by parallel computation; and extracting the SNP mutation site information from the alignment data by parallel computation. The invention performs quality control on the sequencing data, aligns the quality-checked data against the reference sequences, and extracts SNP mutation information from the alignment results. The entire data-processing workflow uses the Hadoop framework: a cloud-computing analysis system for high-throughput sequencing data is developed on Hadoop's parallel computing framework, providing medical research with a fast, inexpensive, and easy-to-use analysis tool and greatly increasing data-processing speed.

Description

A high-efficiency sequencing data processing method

Technical field

The present invention relates to a data processing method, and more particularly to a high-efficiency sequencing data processing method, belonging to the field of biomedical technology.

Background art

The development of second-generation sequencing technology has brought a revolutionary breakthrough to the life sciences. Researchers can now obtain genomic sequence data quickly and easily, which offers unprecedented opportunities for understanding the mechanisms of life and realizing precision medicine. However, how to analyze this massive volume of sequencing data quickly enough to serve clinical diagnosis and treatment has become an urgent problem for biological researchers.

Hadoop is an implementation of the MapReduce computing model written in Java and a top-level project of the open-source Apache Software Foundation. Besides the core MapReduce computing framework, it also includes a distributed file system, HDFS. The Hadoop framework has the following characteristics: a simple development process, high efficiency, horizontal scalability, fault tolerance, load balancing, and being free and open source.

Applying Hadoop to the biomedical field can provide life-science and medical researchers with a fast, inexpensive, and easy-to-use tool for analyzing high-throughput sequencing data.

Summary of the invention

The main object of the present invention is to provide a high-efficiency sequencing data processing method that can complete the analysis of high-throughput sequencing data quickly, inexpensively, and conveniently, overcoming the low efficiency and high computing cost of existing high-throughput sequencing data processing.

The object of the present invention can be achieved by the following technical solution:

A high-efficiency sequencing data processing method, comprising the following steps:

Preparing for parallel computation according to the high-throughput sequencing data;

Preparing reference sequences for the sequencing data according to the high-throughput sequencing data;

Filtering out data of unacceptable quality by parallel computation;

Aligning the sequencing data against the reference sequences by parallel computation;

Extracting the SNP mutation site information from the alignment data by parallel computation.

Preferably, preparing for parallel computation according to the high-throughput sequencing data comprises the following steps:

Building a Hadoop cluster;

Uploading the sequencing reads and the reference sequences to HDFS;

Splitting the sequencing reads, and then sending the split reads to each slave node of the cluster for processing.
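The upload step can be sketched as follows. This is a minimal illustration rather than the procedure of the invention: it only builds the argument lists for the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` shell commands and prints them. The file names and the `/seq/input` directory are hypothetical.

```python
import shlex


def hdfs_upload_commands(local_files, hdfs_dir):
    """Return argv lists that create hdfs_dir and upload each local file to it."""
    # 'hdfs dfs -mkdir -p' creates the target directory and any missing parents.
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        # 'hdfs dfs -put' copies a local file into the HDFS directory.
        commands.append(["hdfs", "dfs", "-put", path, hdfs_dir])
    return commands


if __name__ == "__main__":
    # Hypothetical file names and target directory, for illustration only.
    for argv in hdfs_upload_commands(["reads.fastq", "reference.fa"], "/seq/input"):
        print(shlex.join(argv))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(argv, check=True)`; the sketch deliberately stops at building the commands.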

Preferably, splitting the sequencing reads and then sending the split reads to each slave node of the cluster for processing comprises the following steps:

Partitioning the data into blocks according to the block size configured in Hadoop, and sending the blocks to different Map tasks for processing;

In each Map task, using the read ID as the key, and the strand label, base sequence, and quality as the value;

After splitting is complete, using the sort function of Reduce to assemble the sequences that share the same key into a reads block and storing it in HDFS.
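The key/value layout described above can be sketched in plain Python, without Hadoop. The sketch assumes four-line FASTQ records and assumes that the strand label is the `/1` or `/2` suffix of the read name; neither detail is stated in the text. The function names `map_fastq` and `reduce_by_key` are hypothetical.

```python
from collections import defaultdict


def map_fastq(lines):
    """Map step: yield (read_id, (strand_label, bases, quality)) per FASTQ record."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, bases, _plus, quality = lines[i:i + 4]
        name = header[1:].split()[0]            # drop the leading '@' and any comment
        read_id, sep, label = name.rpartition("/")
        if not sep:                             # no '/1' or '/2' suffix present
            read_id, label = name, ""
        yield read_id, (label, bases, quality)


def reduce_by_key(pairs):
    """Reduce step: collect values that share a key, with keys and values sorted."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in sorted(grouped.items())}


if __name__ == "__main__":
    fastq = ["@r2/1", "ACGT", "+", "IIII",
             "@r1/1", "GGCC", "+", "IIHH",
             "@r2/2", "TTAA", "+", "HHII"]
    for read_id, values in reduce_by_key(map_fastq(fastq)).items():
        print(read_id, values)
```

Grouping by read ID keeps both mates of a pair in the same reads block, so a downstream task can process each pair without looking in other blocks.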

Preferably, preparing reference sequences for the sequencing data according to the high-throughput sequencing data comprises building an index for the reference sequences and uploading it to HDFS together with the sequencing data.

Preferably, filtering out data of unacceptable quality by parallel computation comprises the following steps:

Performing data quality control using the FastUniq software;

Each slave node of the cluster simultaneously performing quality control on the sequencing reads assigned to it;

Storing the sequencing reads that pass quality control on HDFS.
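The idea of each node filtering its own share of the reads can be sketched as below. This does not reproduce FastUniq: the mean Phred quality threshold is a stand-in criterion chosen for illustration, the Phred+33 encoding is an assumption, and a local thread pool stands in for the cluster's slave nodes. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0):
    """Keep (read_id, bases, quality) records whose mean quality meets the threshold."""
    return [r for r in records if r[2] and mean_phred(r[2]) >= min_mean_quality]


def filter_partitions(partitions, min_mean_quality=20.0):
    """Filter every partition independently; threads stand in for cluster nodes."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda part: filter_reads(part, min_mean_quality),
                             partitions))


if __name__ == "__main__":
    partitions = [
        [("a", "ACGT", "IIII"), ("b", "ACGT", "####")],
        [("c", "AC", "I#")],
    ]
    print(filter_partitions(partitions))
```

Because each partition is filtered without reference to any other, the work scales with the number of workers, which is the property the parallel quality-control step relies on.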

Preferably, aligning the sequencing data against the reference sequences by parallel computation comprises the following steps:

Using Bowtie2 as the sequence alignment software;

Splitting the reference sequences by genome and building an index for each part;

When aligning with Bowtie2, aligning the sequencing reads against each genome's reference sequence separately;

Storing the alignment results on HDFS.
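The indexing and alignment steps can be sketched as command lines. The sketch uses the `bowtie2-build` indexer and the `bowtie2` options `-x` (index base name), `-U` (unpaired reads), and `-S` (SAM output). The reference part names, file names, and the `index/` prefix are hypothetical, and single-end input is an assumption.

```python
def bowtie2_build_command(reference_fasta, index_base):
    """argv for building a Bowtie2 index from one reference FASTA file."""
    return ["bowtie2-build", reference_fasta, index_base]


def bowtie2_align_command(index_base, reads_fastq, sam_out):
    """argv for aligning unpaired reads against an index, writing SAM output."""
    return ["bowtie2", "-x", index_base, "-U", reads_fastq, "-S", sam_out]


def alignment_plan(reference_parts, reads_fastq):
    """For each reference part, build its index and then align the reads to it."""
    plan = []
    for name, fasta in reference_parts.items():
        index_base = f"index/{name}"             # hypothetical index location
        plan.append(bowtie2_build_command(fasta, index_base))
        plan.append(bowtie2_align_command(index_base, reads_fastq, f"{name}.sam"))
    return plan


if __name__ == "__main__":
    parts = {"partA": "partA.fa", "partB": "partB.fa"}   # hypothetical names
    for argv in alignment_plan(parts, "reads.filtered.fastq"):
        print(" ".join(argv))
```

Each index is built once and can then be reused by every node that aligns reads against that reference part.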

Preferably, extracting the SNP mutation site information from the alignment data by parallel computation comprises the following steps:

Preparing a dbSNP database;

Extracting the SNP mutation site information from the alignment data using the Samtools tool;

Saving the resulting files under the corresponding results directory on HDFS.
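The text does not state which Samtools subcommands are used. One plausible preparation sequence, given here only as an assumption, is to sort the alignments, index the sorted file, and produce a per-position pileup against the reference with `samtools sort -o`, `samtools index`, and `samtools mpileup -f`. The file names are hypothetical.

```python
def samtools_commands(sam_path, reference_fasta, prefix):
    """argv lists for sorting, indexing, and piling up one alignment file."""
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Sort alignments by coordinate and write them as BAM.
        ["samtools", "sort", "-o", sorted_bam, sam_path],
        # Index the sorted BAM so that regions can be accessed directly.
        ["samtools", "index", sorted_bam],
        # Write a per-position pileup against the reference to standard output.
        ["samtools", "mpileup", "-f", reference_fasta, sorted_bam],
    ]


if __name__ == "__main__":
    for argv in samtools_commands("partA.sam", "partA.fa", "partA"):
        print(" ".join(argv))
```

How SNP sites are then called from the pileup is not specified in the source, so the sketch stops at this point.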

Beneficial technical effects of the present invention: The high-efficiency sequencing data processing method provided by the invention performs quality control on the sequencing data, aligns the quality-checked data against the reference sequences, and extracts SNP mutation information from the alignment results. The entire data-processing workflow uses the Hadoop framework. The sequencing reads and reference sequences are first uploaded to HDFS; the sequencing reads are then split, and the split reads are sent to each slave node of the cluster for processing. This processing includes quality filtering of the split data and alignment of the filtered data against the reference sequences; finally, SNP mutation information is extracted from the alignment results. The method uses Hadoop's parallel computing framework to develop a cloud-computing analysis system for high-throughput sequencing data, providing medical researchers with a fast, inexpensive, and easy-to-use analysis tool and greatly increasing data-processing speed.

Description of the drawings

Fig. 1 is a flow chart of a preferred embodiment of the high-efficiency sequencing data processing method according to the invention.

Fig. 2 is a schematic diagram of the processing system of a preferred embodiment of the high-efficiency sequencing data processing method according to the invention.

Detailed description of the embodiments

To make the technical solution of the present invention clearer to those skilled in the art, the invention is described in further detail below with reference to an embodiment and the accompanying drawings; embodiments of the invention are not limited thereto.

In this embodiment, the relevant terms are explained as follows:

Hadoop: a distributed parallel computing framework developed by the Apache Software Foundation.

HDFS (Hadoop Distributed File System): the distributed file system implemented by Hadoop.

MapReduce: a programming model for parallel computation, in which Map functions process large amounts of key/value data and Reduce functions merge the results produced by the Map functions.

datanode: a worker node in the HDFS architecture, i.e., a worker process.
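The MapReduce model defined above can be illustrated with a single-process sketch. It is not Hadoop code: `map_reduce`, `count_bases`, and `total` are hypothetical names, and base counting is chosen only as a small example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)           # the 'shuffle': group by key
    return {key: reducer(key, values) for key, values in groups.items()}


def count_bases(read):
    """Mapper: emit (base, 1) for every base in a read."""
    for base in read:
        yield base, 1


def total(_key, values):
    """Reducer: sum the counts collected for one base."""
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["ACGT", "AACC"], count_bases, total))
```

In Hadoop the mapper calls run on different nodes and the grouping is done by the framework; the data flow is the same as in this sketch.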

In this embodiment, as shown in Fig. 1 and Fig. 2, the high-efficiency sequencing data processing method provided comprises the following steps:

Preparing for parallel computation according to the high-throughput sequencing data, which comprises the following steps:

Building a Hadoop cluster;

Uploading the sequencing reads and the reference sequences to HDFS;

Splitting the sequencing reads, and then sending the split reads to each slave node of the cluster for processing, which comprises the following steps:

Preparing reference sequences for the sequencing data according to the high-throughput sequencing data, building an index for the reference sequences, and uploading it to HDFS together with the sequencing data;

Filtering out data of unacceptable quality by parallel computation, which comprises the following steps:

Performing data quality control using the FastUniq software;

Storing the sequencing reads that pass quality control on HDFS;

Aligning the sequencing data against the reference sequences by parallel computation, which comprises the following steps:

Using Bowtie2 as the sequence alignment software;

Storing the alignment results on HDFS;

Extracting the SNP mutation site information from the alignment data by parallel computation, which comprises the following steps:

Preparing a dbSNP database;

The Hadoop cluster is built. Hadoop can be deployed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode. Preferably, the present invention uses the fully distributed mode; for the specific setup procedure, refer to the tutorial on the official website: http://hadoop.apache.org/;

After the Hadoop cluster has been built, the cluster is started and data preparation begins. Specifically, an index is built for the reference sequences, and the reference sequences and sequencing reads are uploaded to HDFS together;

The sequencing reads are split; the specific steps include:

Splitting the sequencing data using the MapReduce framework, partitioning it into blocks according to the block size configured in Hadoop, and sending the blocks to different Map tasks for processing;

For example, if the sequencing data is 2 GB in size and the Hadoop block size is set to 128 MB, the data is divided into 2*1024/128 = 16 blocks;

After splitting is complete, using the sort function of Reduce to assemble the sequences that share the same key into a reads block and storing it in HDFS;
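The block count in the 2 GB example can be checked with a short calculation. The sketch assumes binary units (1 GB = 1024 MB), as the arithmetic in the example implies, and rounds up because a file that is not an exact multiple of the block size still needs one final, smaller block. The function name `block_count` is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def block_count(data_bytes, block_bytes):
    """Number of blocks needed to hold data_bytes, rounding up (integer math)."""
    return -(-data_bytes // block_bytes)


if __name__ == "__main__":
    print(block_count(2 * GB, 128 * MB))      # the example from the text
    print(block_count(2 * GB + 1, 128 * MB))  # one extra byte needs one more block
```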

After the data blocks have been distributed, each block resides on a different datanode, i.e., on a different computer, and data quality control can now start in parallel. Specifically:

Parallel computation is carried out using the MapReduce framework, and each slave node of the cluster simultaneously performs quality control on the sequencing reads assigned to it using the FastUniq software;

The data that remains after quality control is stored on HDFS, and sequence alignment then begins. The specific steps include:

Parallel computation is carried out using the MapReduce framework, and the reference sequences are split by genome and an index is built for each part;

Each slave node of the cluster performs alignment using the Bowtie2 software, aligning the sequencing reads against each genome's reference sequence separately;
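Aligning every block of reads against every reference part amounts to forming all (read block, reference part) pairs. The sketch below assumes, for illustration only, that the reference is a multi-record FASTA file and that each record is one part; the source does not specify the split granularity. The function names and sample names are hypothetical.

```python
from itertools import product


def split_fasta(lines):
    """Split FASTA text lines into a dict of {record_name: sequence}."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]          # record name up to first whitespace
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def alignment_tasks(read_blocks, reference_names):
    """Every read block is paired with every reference part."""
    return list(product(read_blocks, reference_names))


if __name__ == "__main__":
    fasta = [">chr1 example", "ACGT", "TTAA", ">chr2", "GGCC"]
    parts = split_fasta(fasta)
    for task in alignment_tasks(["block0", "block1"], list(parts)):
        print(task)
```

Each resulting pair is an independent unit of work, so the pairs can be distributed across the slave nodes in any order.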

After alignment is complete, the result files are stored on HDFS, and extraction of the SNP mutation site information finally begins. The specific steps include:

Parallel computation is carried out using the MapReduce framework, and a dbSNP database is prepared;

Each slave node of the cluster calls the Samtools tool to extract the SNP mutation site information from the alignment data;

The resulting files are saved under the corresponding results directory on HDFS;

In this embodiment, the prepared dbSNP database is stored in a MySQL database.
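The source does not describe the table layout or how the dbSNP data is queried. Purely as an illustration, the sketch below assumes one row per known variant and a lookup by chromosome, position, and alternate allele. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a database server; the schema, the function names, and the sample rows (including their identifiers) are made-up placeholders, not real dbSNP records.

```python
import sqlite3


def build_dbsnp_table(rows):
    """Create an in-memory table of known variants (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbsnp ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, rsid TEXT, "
        "PRIMARY KEY (chrom, pos, alt))"
    )
    conn.executemany("INSERT INTO dbsnp VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def lookup_rsid(conn, chrom, pos, alt):
    """Return the identifier of a known variant at this site, or None if absent."""
    row = conn.execute(
        "SELECT rsid FROM dbsnp WHERE chrom = ? AND pos = ? AND alt = ?",
        (chrom, pos, alt),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    # Placeholder rows for illustration; these are not real dbSNP entries.
    conn = build_dbsnp_table([
        ("chr1", 100, "A", "G", "placeholder_id_1"),
        ("chr2", 250, "C", "T", "placeholder_id_2"),
    ])
    print(lookup_rsid(conn, "chr1", 100, "G"))
    print(lookup_rsid(conn, "chr1", 101, "G"))
```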

In summary, the high-efficiency sequencing data processing method provided by this embodiment performs quality control on the sequencing data, aligns the quality-checked data against the reference sequences, and extracts SNP mutation information from the alignment results. The entire data-processing workflow uses the Hadoop framework. The sequencing reads and reference sequences are first uploaded to HDFS; the sequencing reads are then split, and the split reads are sent to each slave node of the cluster for processing. This processing includes quality filtering of the split data and alignment of the filtered data against the reference sequences; finally, SNP mutation information is extracted from the alignment results. The method uses Hadoop's parallel computing framework to develop a cloud-computing analysis system for high-throughput sequencing data, providing medical researchers with a fast, inexpensive, and easy-to-use analysis tool and greatly increasing data-processing speed.

The above is only a further embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the scope disclosed by the present invention and according to its technical solution and concept, falls within the scope of protection of the present invention.

Claims

1. A high-efficiency sequencing data processing method, characterized by comprising the following steps:

Preparing for parallel computation according to the high-throughput sequencing data;

Filtering out data of unacceptable quality by parallel computation;

2. The high-efficiency sequencing data processing method according to claim 1, characterized in that preparing for parallel computation according to the high-throughput sequencing data comprises the following steps:

Building a Hadoop cluster;

Uploading the sequencing reads and the reference sequences to HDFS;

Splitting the sequencing reads, and then sending the split reads to each slave node of the cluster for processing.

3. The high-efficiency sequencing data processing method according to claim 2, characterized in that splitting the sequencing reads and then sending the split reads to each slave node of the cluster for processing comprises the following steps:

After splitting is complete, using the sort function of Reduce to assemble the sequences that share the same key into a reads block and storing it in HDFS.

4. The high-efficiency sequencing data processing method according to claim 1, characterized in that reference sequences are prepared for the sequencing data according to the high-throughput sequencing data, an index is built for the reference sequences, and it is uploaded to HDFS together with the sequencing data.

5. The high-efficiency sequencing data processing method according to claim 1, characterized in that filtering out data of unacceptable quality by parallel computation comprises the following steps:

Performing data quality control using the FastUniq software;

Storing the sequencing reads that pass quality control on HDFS.

6. The high-efficiency sequencing data processing method according to claim 1, characterized in that aligning the sequencing data against the reference sequences by parallel computation comprises the following steps:

Using Bowtie2 as the sequence alignment software;

Storing the alignment results on HDFS.

7. The high-efficiency sequencing data processing method according to claim 1, characterized in that extracting the SNP mutation site information from the alignment data by parallel computation comprises the following steps:

Preparing a dbSNP database;