CN108763869A - An efficient method for processing sequencing data - Google Patents

An efficient method for processing sequencing data

Info

Publication number
CN108763869A
Authority
CN
China
Prior art keywords
data
sequencing
sequence
parallel computation
reference sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810378325.1A
Other languages
Chinese (zh)
Inventor
常珊
陆旭峰
许磊
张大为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Technology
Original Assignee
Jiangsu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Technology filed Critical Jiangsu University of Technology
Priority to CN201810378325.1A priority Critical patent/CN108763869A/en
Publication of CN108763869A publication Critical patent/CN108763869A/en
Pending legal-status Critical Current

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses an efficient method for processing sequencing data, belonging to the field of biomedical technology, comprising the following steps: preparing for parallel computation according to the high-throughput sequencing data; preparing reference sequences for the sequencing data according to the high-throughput sequencing data; filtering out data of unacceptable quality by parallel computation; aligning the sequencing data against the reference sequences by parallel computation; and extracting SNP mutation site information from the alignment data by parallel computation. The invention performs quality control on the sequencing data, aligns the quality-controlled data against the reference sequences, and extracts SNP mutation information from the alignment results. The entire data-processing workflow uses the Hadoop framework: a cloud-computing analysis system for high-throughput sequencing data is developed on the Hadoop parallel computing framework, providing medical research with a fast, inexpensive, and easy-to-use tool for analyzing high-throughput sequencing data and greatly improving data processing speed.

Description

An efficient method for processing sequencing data
Technical field
The present invention relates to a data processing method, and in particular to an efficient method for processing sequencing data, belonging to the field of biomedical technology.
Background
The development of second-generation sequencing technology has brought revolutionary breakthroughs to the life sciences, allowing researchers to obtain genomic sequence data quickly and easily and thereby providing unprecedented opportunities for understanding the mechanisms of life and realizing precision medicine. However, how to analyze these massive volumes of sequencing data quickly enough to serve clinical diagnosis and treatment has become an urgent problem for biological researchers.
Hadoop is an implementation of the MapReduce computing model written in Java and a top-level project of the open-source Apache Software Foundation. In addition to the core MapReduce computing framework, it includes a distributed file system, HDFS. The Hadoop framework has the following characteristics: a simple development process, high efficiency, horizontal scalability, fault tolerance, load balancing, and being free and open source.
Applying Hadoop to the biomedical field can provide life-science and medical researchers with a fast, inexpensive, and easy-to-use tool for analyzing high-throughput sequencing data.
Summary of the invention
The main object of the present invention is to provide an efficient method for processing sequencing data that completes the analysis of high-throughput sequencing data quickly, inexpensively, and conveniently, overcoming the low efficiency and high computational cost of existing high-throughput sequencing data processing.
The object of the present invention can be achieved by the following technical solution:
An efficient method for processing sequencing data, comprising the following steps:
preparing for parallel computation according to the high-throughput sequencing data;
preparing reference sequences for the sequencing data according to the high-throughput sequencing data;
filtering out data of unacceptable quality by parallel computation;
aligning the sequencing data against the reference sequences by parallel computation;
extracting SNP mutation site information from the alignment data by parallel computation.
Preferably, preparing for parallel computation according to the high-throughput sequencing data comprises the following steps:
building a Hadoop cluster;
uploading the sequencing reads and the reference sequences to HDFS;
splitting the sequencing reads, then sending the split reads to each slave node of the cluster for processing;
Preferably, splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing comprises the following steps:
partitioning the data into blocks according to the block size configured in Hadoop, and sending the blocks to different Map tasks for processing;
in each Map task, using the ID of each read as the key, and the strand label, base sequence, and quality as the value;
after splitting is complete, using the sort function of Reduce to assemble the sequences that share the same key into one reads block and storing it in HDFS.
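The key/value layout described above can be illustrated with a minimal in-memory sketch. It assumes FASTQ input and a `readID/1`, `readID/2` header convention for the chain (strand) label; neither is specified in the text, and the names `map_fastq` and `reduce_by_key` are hypothetical. It simulates the map and sort/reduce steps in plain Python rather than calling Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Value layout taken from the text: (strand label, base sequence, quality).
Record = Tuple[str, str, str]


def map_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, Record]]:
    """Map step: emit (read ID, (strand label, bases, quality)) per FASTQ record."""
    buf = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(buf) - 3, 4):
        header, bases, _plus, quality = buf[i:i + 4]
        name = header[1:].split()[0]  # drop the leading '@'
        # Assumed naming convention: "readID/1" or "readID/2".
        read_id, _, strand = name.partition("/")
        yield read_id, (strand or "1", bases, quality)


def reduce_by_key(pairs: Iterable[Tuple[str, Record]]) -> Dict[str, List[Record]]:
    """Reduce step: sort by key, then group the records that share a read ID."""
    grouped: Dict[str, List[Record]] = defaultdict(list)
    for read_id, record in sorted(pairs):
        grouped[read_id].append(record)
    return dict(grouped)


if __name__ == "__main__":
    fastq = ["@r2/1", "ACGT", "+", "IIII", "@r1/2", "TTGA", "+", "IIIH"]
    print(reduce_by_key(map_fastq(fastq)))
```

In a real Hadoop job the sort and grouping are done by the framework's shuffle phase between Map and Reduce; the explicit `sorted` call here only stands in for that behavior.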
Preferably, preparing reference sequences for the sequencing data according to the high-throughput sequencing data comprises building an index for the reference sequences and uploading it to HDFS together with the sequencing data.
Preferably, filtering out data of unacceptable quality by parallel computation comprises the following steps:
performing data quality control with the FastUniq software;
each slave node of the cluster simultaneously performing quality control on the sequencing reads assigned to it;
storing the sequencing reads obtained after quality control on HDFS.
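To make the per-node filtering step concrete, here is a small sketch of a quality filter that each slave node could apply to its own block. This is not FastUniq and does not reproduce its behavior; the text does not describe the filtering criterion, so a mean Phred score threshold with Phred+33 encoding is assumed purely for illustration, and `filter_reads` and `min_mean_q` are hypothetical names.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, str]  # (read ID, base sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_mean_q: float = 20.0) -> List[Read]:
    """Keep reads whose mean quality reaches the assumed threshold."""
    return [r for r in reads if r[2] and mean_phred(r[2]) >= min_mean_q]


if __name__ == "__main__":
    block = [("a", "ACGT", "IIII"), ("b", "ACGT", "####")]
    print(filter_reads(block))
```

Because each read is judged independently, the same function can run on every block without any coordination between nodes, which is what makes this stage easy to parallelize.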
Preferably, aligning the sequencing data against the reference sequences by parallel computation comprises the following steps:
using Bowtie2 as the sequence alignment software;
splitting the reference sequences according to genome and building an index for each;
when aligning with Bowtie2, aligning the sequencing reads against the reference sequence of each genome separately;
storing the alignment results on HDFS.
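The alignment steps above can be sketched as the command lines a slave node might run for its block of reads. The file names, the `out_dir` layout, and the helper name `bowtie2_commands` are assumptions; the sketch only builds argument lists using the standard `bowtie2-build` invocation and the `-x`, `-U`, and `-S` options of `bowtie2`, and does not execute anything. Single-end reads are assumed.

```python
from typing import Dict, List


def bowtie2_commands(references: Dict[str, str], reads_fastq: str,
                     out_dir: str) -> List[List[str]]:
    """Build index and alignment commands, one pair per reference genome."""
    commands: List[List[str]] = []
    for name, fasta in sorted(references.items()):
        index_base = f"{out_dir}/{name}"
        # Index the reference sequence of this genome.
        commands.append(["bowtie2-build", fasta, index_base])
        # Align the unpaired reads against that index and write SAM output.
        commands.append(["bowtie2", "-x", index_base, "-U", reads_fastq,
                         "-S", f"{out_dir}/{name}.sam"])
    return commands


if __name__ == "__main__":
    for cmd in bowtie2_commands({"chr1": "chr1.fa"}, "block_0.fq", "out"):
        print(" ".join(cmd))
```

On a node with Bowtie2 installed, each list could be passed to `subprocess.run(cmd, check=True)`; in practice the index would be built once and shared rather than rebuilt for every block.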
Preferably, extracting SNP mutation site information from the alignment data by parallel computation comprises the following steps:
preparing a dbSNP database;
extracting the SNP mutation site information from the alignment data with the Samtools tools;
saving the resulting files to the corresponding result directory on HDFS.
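The idea behind extracting SNP sites from alignment data can be shown with a toy pileup. This is a simplified stand-in, not how Samtools works internally: it assumes gap-free alignments given as (0-based start position, read sequence) pairs, and the thresholds `min_depth` and `min_fraction` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def call_snps(reference: str, alignments: Iterable[Tuple[int, str]],
              min_depth: int = 2,
              min_fraction: float = 0.8) -> List[Tuple[int, str, str]]:
    """Return (position, reference base, observed base) for candidate SNPs."""
    pileup: Dict[int, Counter] = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1
    snps: List[Tuple[int, str, str]] = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        # Report a site when the dominant base differs from the reference.
        if base != reference[pos] and depth >= min_depth and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


if __name__ == "__main__":
    print(call_snps("ACGTAC", [(0, "ACGA"), (1, "CGAA"), (2, "GAAC")]))
```

Candidate sites produced this way would then be compared with the prepared dbSNP database to tell known variants from novel ones.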
Beneficial effects of the present invention: the efficient method for processing sequencing data provided by the invention performs quality control on the sequencing data, aligns the quality-controlled data against the reference sequences, and extracts SNP mutation information from the alignment results. The entire data-processing workflow uses the Hadoop framework. First, the sequencing reads and reference sequences are uploaded to HDFS; the sequencing reads are then split, and the split reads are sent to each slave node of the cluster for processing. This processing includes quality filtering of the split data and alignment of the filtered data against the reference sequences; finally, SNP mutation information is extracted from the alignment results. The method develops a cloud-computing analysis system for high-throughput sequencing data on the Hadoop parallel computing framework, providing medical researchers with a fast, inexpensive, and easy-to-use tool for analyzing high-throughput sequencing data and greatly improving data processing speed.
Brief description of the drawings
Fig. 1 is a flow chart of a preferred embodiment of the efficient method for processing sequencing data according to the invention.
Fig. 2 is a schematic diagram of the processing system of a preferred embodiment of the efficient method for processing sequencing data according to the invention.
Detailed description
To make the technical solution of the present invention clearer to those skilled in the art, the invention is described in further detail below with reference to an embodiment and the drawings; embodiments of the invention are not limited thereto.
In this embodiment, the relevant terms are explained as follows:
Hadoop: a distributed parallel computing framework developed by the Apache Software Foundation.
HDFS (Hadoop Distributed File System): the distributed file system implemented by Hadoop.
MapReduce: a programming model for parallel computation, in which Map functions process large volumes of key/value data and Reduce functions merge the results of the Map functions.
datanode: a worker node in the HDFS architecture, that is, a worker process.
In this embodiment, as shown in Fig. 1 and Fig. 2, the efficient method for processing sequencing data provided by this embodiment comprises the following steps:
Preparing for parallel computation according to the high-throughput sequencing data, comprising the following steps:
building a Hadoop cluster;
uploading the sequencing reads and the reference sequences to HDFS;
splitting the sequencing reads, then sending the split reads to each slave node of the cluster for processing, comprising the following steps:
partitioning the data into blocks according to the block size configured in Hadoop, and sending the blocks to different Map tasks for processing;
in each Map task, using the ID of each read as the key, and the strand label, base sequence, and quality as the value;
after splitting is complete, using the sort function of Reduce to assemble the sequences that share the same key into one reads block and storing it in HDFS.
Preparing reference sequences for the sequencing data according to the high-throughput sequencing data: building an index for the reference sequences and uploading it to HDFS together with the sequencing data;
Filtering out data of unacceptable quality by parallel computation, comprising the following steps:
performing data quality control with the FastUniq software;
each slave node of the cluster simultaneously performing quality control on the sequencing reads assigned to it;
storing the sequencing reads obtained after quality control on HDFS;
Aligning the sequencing data against the reference sequences by parallel computation, comprising the following steps:
using Bowtie2 as the sequence alignment software;
splitting the reference sequences according to genome and building an index for each;
when aligning with Bowtie2, aligning the sequencing reads against the reference sequence of each genome separately;
storing the alignment results on HDFS;
Extracting SNP mutation site information from the alignment data by parallel computation, comprising the following steps:
preparing a dbSNP database;
extracting the SNP mutation site information from the alignment data with the Samtools tools;
saving the resulting files to the corresponding result directory on HDFS.
In this embodiment, as shown in Fig. 1 and Fig. 2, the efficient method for processing sequencing data provided by this embodiment specifically comprises the following steps:
A Hadoop cluster is built. Hadoop can be deployed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode. Preferably, the present invention uses fully distributed mode; for the specific build process, refer to the tutorial on the official website: http://hadoop.apache.org/;
After the Hadoop cluster has been built, the cluster is started and data preparation begins: specifically, an index is built for the reference sequences, and the reference sequences and sequencing reads are uploaded to HDFS together;
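The upload step can be sketched as a small helper that assembles the HDFS shell commands. The paths and the name `hdfs_upload_commands` are hypothetical; the sketch uses only the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands and builds the argument lists without running them.

```python
from typing import List


def hdfs_upload_commands(local_files: List[str], hdfs_dir: str) -> List[List[str]]:
    """Build the commands that create an HDFS directory and upload files to it."""
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    for path in local_files:
        commands.append(["hdfs", "dfs", "-put", path, hdfs_dir])
    return commands


if __name__ == "__main__":
    # Hypothetical file names for the sequencing reads and the reference.
    for cmd in hdfs_upload_commands(["reads.fq", "ref.fa"], "/seq/input"):
        print(" ".join(cmd))
```

On a machine with a configured Hadoop client, each command could be executed with `subprocess.run(cmd, check=True)`.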
The sequencing reads are split, with the following specific steps:
The sequencing data is split using the MapReduce framework: it is partitioned into blocks according to the block size configured in Hadoop, and the blocks are sent to different Map tasks for processing;
For example, if the sequencing data is 2 GB in size and the Hadoop block size is set to 128 MB, the data is divided into 2*1024/128 = 16 blocks;
In each Map task, the ID of each read is used as the key, and the strand label, base sequence, and quality as the value;
After splitting is complete, the sort function of Reduce is used to assemble the sequences that share the same key into one reads block, which is stored in HDFS;
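The block count in the 2 GB example above is a ceiling division of the data size by the block size. A minimal sketch, assuming sizes are given in bytes and that a final partial block still counts as one block (the function name `num_blocks` is hypothetical):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def num_blocks(data_bytes: int, block_bytes: int = 128 * MIB) -> int:
    """Number of blocks needed to hold data_bytes, rounding up."""
    return -(-data_bytes // block_bytes)  # integer ceiling division


if __name__ == "__main__":
    print(num_blocks(2 * GIB))  # the example from the text
```

With 2 GB of data and 128 MB blocks this gives 16, matching the figure in the text; any data size that is not an exact multiple of the block size adds one more, partially filled block.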
After the data blocks have been distributed, each block resides on a different datanode, that is, on a different computer, and data quality control can now start in parallel, specifically:
Parallel computation is performed with the MapReduce framework; each slave node of the cluster simultaneously runs the FastUniq software to perform quality control on the sequencing reads assigned to it;
The data remaining after quality control are stored on HDFS, and sequence alignment then begins, with the following specific steps:
Parallel computation is performed with the MapReduce framework; the reference sequences are split according to genome and an index is built for each;
Each slave node of the cluster runs the Bowtie2 software, aligning the sequencing reads against the reference sequence of each genome separately;
After alignment is complete, the result files are stored on HDFS, and extraction of the SNP mutation site information finally begins, with the following specific steps:
Parallel computation is performed with the MapReduce framework, and a dbSNP database is prepared;
Each slave node of the cluster calls the Samtools tools to extract the SNP mutation site information from the alignment data;
The resulting files are saved to the corresponding result directory on HDFS;
In this embodiment, the prepared dbSNP database is stored in a MySQL database.
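A lookup against a relational dbSNP table could look like the following sketch. The embodiment names MySQL as the carrier; to keep the example self-contained, the standard-library `sqlite3` module is used here as a stand-in, and the table layout (`rs_id`, `chrom`, `pos`, `ref`, `alt`), the function names, and the sample identifier are all hypothetical.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

Snp = Tuple[str, int, str, str]  # (chromosome, position, ref base, alt base)


def build_dbsnp(rows: Iterable[Tuple[str, str, int, str, str]]) -> sqlite3.Connection:
    """Create an in-memory table of known SNPs with an index on the site."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbsnp (rs_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)")
    conn.executemany("INSERT INTO dbsnp VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_site ON dbsnp (chrom, pos)")
    return conn


def annotate(conn: sqlite3.Connection,
             snps: Iterable[Snp]) -> List[Tuple[Snp, Optional[str]]]:
    """Attach the known identifier to each SNP, or None if it is not in the table."""
    query = "SELECT rs_id FROM dbsnp WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?"
    result: List[Tuple[Snp, Optional[str]]] = []
    for snp in snps:
        row = conn.execute(query, snp).fetchone()
        result.append((snp, row[0] if row else None))
    return result


if __name__ == "__main__":
    db = build_dbsnp([("rs_example_1", "chr1", 100, "A", "G")])  # made-up entry
    print(annotate(db, [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]))
```

The same parameterized query would carry over to a MySQL client library with only the connection setup and placeholder style changed.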
In conclusion in the present embodiment, according to the sequencing data high-efficient treatment method of the present embodiment, the present embodiment provides Sequencing data high-efficient treatment method, to sequencing data carry out quality inspection, the data after quality inspection are compared with reference sequences, root The abrupt information of SNP is extracted according to comparison result, a whole set of flow chart of data processing uses Hadoop framework, first to needing that sequence is sequenced On row and transmission of reference sequences to HDFS, then sequencing sequence is divided accordingly, then the sequencing sequence after segmentation is sent out Send to each slave of cluster and handled, wherein processing includes mass filter to data after segmentation, and to filtered data with Reference sequences are compared, and the abrupt information of SNP is finally extracted according to comparison result, and this method uses Hadoop concurrent operation frames Frame carry out high-flux sequence data cloud computing analysis system exploitation, for medical scientific personnel provide one kind it is quick, cheap, High-flux sequence data analysis tool easy to use, greatly improves data processing speed.
The above, further embodiment only of the present invention, but scope of protection of the present invention is not limited thereto, and it is any Within the scope of the present disclosure, according to the technique and scheme of the present invention and its design adds those familiar with the art With equivalent substitution or change, protection scope of the present invention is belonged to.

Claims (7)

1. An efficient method for processing sequencing data, characterized in that it comprises the following steps:
preparing for parallel computation according to the high-throughput sequencing data;
preparing reference sequences for the sequencing data according to the high-throughput sequencing data;
filtering out data of unacceptable quality by parallel computation;
aligning the sequencing data against the reference sequences by parallel computation;
extracting SNP mutation site information from the alignment data by parallel computation.
2. The efficient method for processing sequencing data according to claim 1, characterized in that preparing for parallel computation according to the high-throughput sequencing data comprises the following steps:
building a Hadoop cluster;
uploading the sequencing reads and the reference sequences to HDFS;
splitting the sequencing reads, then sending the split reads to each slave node of the cluster for processing.
3. The efficient method for processing sequencing data according to claim 2, characterized in that splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing comprises the following steps:
partitioning the data into blocks according to the block size configured in Hadoop, and sending the blocks to different Map tasks for processing;
in each Map task, using the ID of each read as the key, and the strand label, base sequence, and quality as the value;
after splitting is complete, using the sort function of Reduce to assemble the sequences that share the same key into one reads block and storing it in HDFS.
4. The efficient method for processing sequencing data according to claim 1, characterized in that preparing reference sequences for the sequencing data according to the high-throughput sequencing data comprises building an index for the reference sequences and uploading it to HDFS together with the sequencing data.
5. The efficient method for processing sequencing data according to claim 1, characterized in that filtering out data of unacceptable quality by parallel computation comprises the following steps:
performing data quality control with the FastUniq software;
each slave node of the cluster simultaneously performing quality control on the sequencing reads assigned to it;
storing the sequencing reads obtained after quality control on HDFS.
6. The efficient method for processing sequencing data according to claim 1, characterized in that aligning the sequencing data against the reference sequences by parallel computation comprises the following steps:
using Bowtie2 as the sequence alignment software;
splitting the reference sequences according to genome and building an index for each;
when aligning with Bowtie2, aligning the sequencing reads against the reference sequence of each genome separately;
storing the alignment results on HDFS.
7. The efficient method for processing sequencing data according to claim 1, characterized in that extracting SNP mutation site information from the alignment data by parallel computation comprises the following steps:
preparing a dbSNP database;
extracting the SNP mutation site information from the alignment data with the Samtools tools;
saving the resulting files to the corresponding result directory on HDFS.
CN201810378325.1A 2018-04-25 2018-04-25 An efficient method for processing sequencing data Pending CN108763869A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810378325.1A CN108763869A (en) 2018-04-25 2018-04-25 An efficient method for processing sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810378325.1A CN108763869A (en) 2018-04-25 2018-04-25 An efficient method for processing sequencing data

Publications (1)

Publication Number Publication Date
CN108763869A true CN108763869A (en) 2018-11-06

Family

ID=64011694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810378325.1A Pending CN108763869A (en) 2018-04-25 2018-04-25 An efficient method for processing sequencing data

Country Status (1)

Country Link
CN (1) CN108763869A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109616156A (en) * 2018-12-03 2019-04-12 郑州云海信息技术有限公司 A kind of gene sequencing date storage method and device
CN110016498A (en) * 2019-04-24 2019-07-16 北京诺赛基因组研究中心有限公司 The method of single nucleotide polymorphism is determined in the sequencing of Sanger method
CN110070911A (en) * 2019-04-12 2019-07-30 内蒙古农业大学 A kind of parallel comparison method of gene order based on Hadoop

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407749A (en) * 2016-08-30 2017-02-15 上海华点云生物科技有限公司 Analysis method and analysis apparatus for searching chromosomal mutation site of sample
CN107391965A (en) * 2017-08-15 2017-11-24 上海派森诺生物科技股份有限公司 A kind of lung cancer somatic mutation determination method based on high throughput sequencing technologies
CN107563153A (en) * 2017-08-03 2018-01-09 华子昂 A kind of PacBio microarray dataset IT architectures based on Hadoop structures

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
黄芝准 等: "组学大数据环境下的基因变异信息并行处理与分析", 《北京生物医学工程》 *
黄芝准: "组学大数据环境下的基因信息并行处理与分析方法研究", 《中国优秀硕士论文全文数据库 基础科学辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109616156A (en) * 2018-12-03 2019-04-12 郑州云海信息技术有限公司 A kind of gene sequencing date storage method and device
CN110070911A (en) * 2019-04-12 2019-07-30 内蒙古农业大学 A kind of parallel comparison method of gene order based on Hadoop
CN110016498A (en) * 2019-04-24 2019-07-16 北京诺赛基因组研究中心有限公司 The method of single nucleotide polymorphism is determined in the sequencing of Sanger method
CN110016498B (en) * 2019-04-24 2020-05-08 北京诺赛基因组研究中心有限公司 Method for determining single nucleotide polymorphism in Sanger method sequencing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181106