CN108763869A - An efficient method for processing sequencing data - Google Patents
- Publication number
- CN108763869A (application number CN201810378325.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- sequencing
- sequence
- parallel computation
- reference sequences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses an efficient method for processing sequencing data, belonging to the field of biomedical technology. The method comprises the following steps: preparing for parallel computation according to high-throughput sequencing data; preparing a reference sequence for the sequencing data according to the high-throughput sequencing data; filtering out data of unacceptable quality by parallel computation; aligning the sequencing data to the reference sequence by parallel computation; and extracting SNP mutation site information from the alignment data by parallel computation. The invention performs quality control on the sequencing data, aligns the quality-checked data to the reference sequence, and extracts SNP mutation information from the alignment results. The entire data processing workflow uses the Hadoop framework: a cloud-computing analysis system for high-throughput sequencing data is developed on the Hadoop parallel computing framework, providing medical research with a fast, low-cost, easy-to-use tool for analyzing high-throughput sequencing data and greatly increasing data processing speed.
Description
Technical field
The present invention relates to a data processing method, in particular to an efficient method for processing sequencing data, and belongs to the field of biomedical technology.
Background art
The development of second-generation sequencing technology has brought revolutionary breakthroughs to the life sciences. It allows researchers to obtain genome sequence data quickly and easily, creating unprecedented opportunities for understanding the mechanisms of life and realizing precision medicine. However, how to analyze this massive volume of sequencing data quickly enough to serve clinical diagnosis and treatment has become an urgent problem for biological researchers.
Hadoop is an implementation of the MapReduce computing model written in Java and a top-level project of the open-source Apache Software Foundation. Besides the core MapReduce computing framework, it also includes a distributed file system, HDFS. The Hadoop framework has the following characteristics: a simple development process, high efficiency, horizontal scalability, fault tolerance, load balancing, and free and open-source availability.
Applying Hadoop to the biomedical field can provide life-science and medical researchers with a fast, low-cost, easy-to-use tool for analyzing high-throughput sequencing data.
Summary of the invention
The main object of the present invention is to provide an efficient sequencing data processing method that completes high-throughput sequencing data analysis quickly, cheaply and conveniently, overcoming the low efficiency and high computing cost of existing high-throughput sequencing data processing.
The object of the present invention can be achieved by the following technical solution:
An efficient method for processing sequencing data, comprising the following steps:
preparing for parallel computation according to high-throughput sequencing data;
preparing a reference sequence for the sequencing data according to the high-throughput sequencing data;
filtering out data of unacceptable quality by parallel computation;
aligning the sequencing data to the reference sequence by parallel computation;
extracting SNP mutation site information from the alignment data by parallel computation.
Preferably, preparing for parallel computation according to the high-throughput sequencing data comprises the following steps:
building a Hadoop cluster;
uploading the sequencing reads and the reference sequence to HDFS;
splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing.
Preferably, splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing comprises the following steps:
partitioning the data into blocks according to the block size configured in Hadoop and sending the blocks to different map tasks for processing;
in each map task, using the ID of each read as the key and the strand label, base sequence and quality string as the value;
after splitting, using the sort function of the reduce phase to assemble the sequences that share the same key into one block of reads, which is stored in HDFS.
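The key/value scheme described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code and is not taken from the patent: the function names, the assumption that mates carry a trailing "/1" or "/2" as their strand (chain) label, and the sample records are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

ReadValue = Tuple[str, str, str]  # (strand label, base sequence, quality string)


def map_fastq(lines: List[str]) -> Iterator[Tuple[str, ReadValue]]:
    """Emit (read ID, (strand label, bases, quality)) for each 4-line FASTQ record."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, _plus, quality = (s.strip() for s in lines[i:i + 4])
        name = header[1:].split()[0]           # drop '@' and any description text
        # Assumption: the strand label is a trailing "/1" or "/2" on the read name.
        read_id, _, strand = name.partition("/")
        yield read_id, (strand or "1", bases, quality)


def reduce_by_key(pairs: Iterable[Tuple[str, ReadValue]]) -> Dict[str, List[ReadValue]]:
    """Group values that share a key, ordered by key as a sort phase would leave them."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(grouped[key]) for key in sorted(grouped)}


fastq = [
    "@read7/2", "TTGA", "+", "IIHH",
    "@read3/1", "ACGT", "+", "IIII",
    "@read7/1", "GGCA", "+", "HHII",
]
blocks = reduce_by_key(map_fastq(fastq))
```

Here `blocks` maps each read ID to the records that share it, so both mates of `read7` end up in the same group even though they were not adjacent in the input.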
Preferably, preparing a reference sequence for the sequencing data according to the high-throughput sequencing data comprises building an index for the reference sequence and uploading it to HDFS together with the sequencing data.
Preferably, filtering out data of unacceptable quality by parallel computation comprises the following steps:
performing the data quality check with the FastUniq software;
having each slave node of the cluster simultaneously perform quality control on the sequencing reads assigned to it;
storing the sequencing reads that pass quality control on HDFS.
Preferably, aligning the sequencing data to the reference sequence by parallel computation comprises the following steps:
using Bowtie2 as the sequence alignment software;
splitting the reference sequence by genome and building an index for each part;
when aligning with Bowtie2, aligning the sequencing reads to the reference sequence of each genome separately;
storing the alignment results on HDFS.
Preferably, extracting SNP mutation site information from the alignment data by parallel computation comprises the following steps:
preparing a dbSNP database;
extracting the SNP mutation site information from the alignment data with Samtools;
saving the resulting files in the corresponding results directory on HDFS.
Advantageous effects of the invention: the efficient sequencing data processing method provided by the invention performs quality control on the sequencing data, aligns the quality-checked data to the reference sequence, and extracts SNP mutation information from the alignment results. The entire data processing workflow uses the Hadoop framework. The sequencing reads and the reference sequence are first uploaded to HDFS; the sequencing reads are then split, and the split reads are sent to each slave node of the cluster for processing, where processing comprises quality filtering of the split data and alignment of the filtered data to the reference sequence; finally, SNP mutation information is extracted from the alignment results. The method uses the Hadoop parallel computing framework to develop a cloud-computing analysis system for high-throughput sequencing data, providing medical researchers with a fast, low-cost, easy-to-use tool for analyzing high-throughput sequencing data and greatly increasing data processing speed.
Brief description of the drawings
Fig. 1 is a flow chart of a preferred embodiment of the efficient sequencing data processing method according to the invention.
Fig. 2 is a schematic diagram of the processing system of a preferred embodiment of the efficient sequencing data processing method according to the invention.
Detailed description
To make the technical solution of the invention clearer to those skilled in the art, the invention is described in further detail below with reference to embodiments and the accompanying drawings; embodiments of the invention are not limited thereto.
In this embodiment, the relevant terms are explained as follows:
Hadoop: a distributed parallel computing framework developed by the Apache Software Foundation.
HDFS (Hadoop Distributed File System): the distributed file system implemented by Hadoop.
MapReduce: a programming model for parallel computation, in which Map functions process large amounts of key/value data and Reduce functions merge the results of several Map functions.
DataNode: a worker node in the HDFS architecture, i.e. a worker process.
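To make the MapReduce definition concrete, the following is a minimal single-process sketch in Python of the map, group and reduce phases. It only imitates the programming model and does not use Hadoop; the function names and the choice of counting bases across reads are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict
from itertools import chain


def map_base_counts(read: str):
    """Map step: emit one (base, 1) key/value pair per base in a read."""
    return [(base, 1) for base in read]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sum(key, values):
    """Reduce step: merge all values that were emitted for one key."""
    return key, sum(values)


reads = ["ACGT", "AACC", "GGTA"]
mapped = chain.from_iterable(map_base_counts(r) for r in reads)
totals = dict(reduce_sum(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster each call to the map function could run on a different node, and the framework would perform the grouping step while moving data between nodes.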
In this embodiment, as shown in Fig. 1 and Fig. 2, the efficient sequencing data processing method comprises the following steps:
Preparing for parallel computation according to high-throughput sequencing data, comprising the following steps:
building a Hadoop cluster;
uploading the sequencing reads and the reference sequence to HDFS;
splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing, comprising the following steps:
partitioning the data into blocks according to the block size configured in Hadoop and sending the blocks to different map tasks for processing;
in each map task, using the ID of each read as the key and the strand label, base sequence and quality string as the value;
after splitting, using the sort function of the reduce phase to assemble the sequences that share the same key into one block of reads, which is stored in HDFS.
Preparing a reference sequence for the sequencing data according to the high-throughput sequencing data, by building an index for the reference sequence and uploading it to HDFS together with the sequencing data;
Filtering out data of unacceptable quality by parallel computation, comprising the following steps:
performing the data quality check with the FastUniq software;
having each slave node of the cluster simultaneously perform quality control on the sequencing reads assigned to it;
storing the sequencing reads that pass quality control on HDFS;
Aligning the sequencing data to the reference sequence by parallel computation, comprising the following steps:
using Bowtie2 as the sequence alignment software;
splitting the reference sequence by genome and building an index for each part;
when aligning with Bowtie2, aligning the sequencing reads to the reference sequence of each genome separately;
storing the alignment results on HDFS;
Extracting SNP mutation site information from the alignment data by parallel computation, comprising the following steps:
preparing a dbSNP database;
extracting the SNP mutation site information from the alignment data with Samtools;
saving the resulting files in the corresponding results directory on HDFS.
In this embodiment, as shown in Fig. 1 and Fig. 2, the efficient sequencing data processing method comprises the following steps:
Build a Hadoop cluster. Hadoop can be deployed in three modes: standalone mode, pseudo-distributed mode and fully distributed mode. Preferably, the invention uses fully distributed mode; for the specific setup procedure, refer to the tutorial on the official website: http://hadoop.apache.org/;
After the Hadoop cluster has been built, start the cluster and begin data preparation: build an index for the reference sequence, and upload the reference sequence and the sequencing reads to HDFS together;
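The data preparation step could be scripted roughly as follows. This is a sketch under stated assumptions: the patent does not give commands, so the file names, the index prefix and the HDFS directory are hypothetical, and the sketch only builds the command lines (`bowtie2-build` for indexing, `hdfs dfs -mkdir -p` and `hdfs dfs -put` for the upload) rather than running them.

```python
import shlex
from typing import List


def preparation_commands(reference: str, reads: str, index_prefix: str,
                         hdfs_dir: str) -> List[List[str]]:
    """Return the commands for indexing the reference and uploading the inputs to HDFS."""
    return [
        # Build a Bowtie2 index from the reference FASTA file.
        ["bowtie2-build", reference, index_prefix],
        # Create the target directory on HDFS (no error if it already exists).
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Upload the reference sequence and the sequencing reads together.
        ["hdfs", "dfs", "-put", reference, reads, hdfs_dir],
    ]


# Hypothetical paths, chosen only for illustration.
commands = preparation_commands("ref.fa", "sample.fastq", "ref_index", "/seq/input")
for cmd in commands:
    print(shlex.join(cmd))  # to execute, pass each list to subprocess.run(cmd, check=True)
```

The index files written by `bowtie2-build` could be uploaded with a further `hdfs dfs -put` call in the same way; they are left out here to keep the sketch short.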
Split the sequencing reads. The specific steps are:
split the sequencing data with the MapReduce framework, partitioning it into blocks according to the block size configured in Hadoop and sending the blocks to different map tasks for processing;
for example, if the sequencing data is 2 GB in size and the Hadoop block size is set to 128 MB, the data is divided into 2 × 1024 / 128 = 16 blocks;
in each map task, use the ID of each read as the key and the strand label, base sequence and quality string as the value;
after splitting, use the sort function of the reduce phase to assemble the sequences that share the same key into one block of reads, which is stored in HDFS;
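The block-count arithmetic in the 2 GB example above can be checked with a few lines of Python. The function name is illustrative, and the rounding up for input that is not an exact multiple of the block size is an added assumption about how a partial final block is counted.

```python
import math


def block_count(data_size_mb: int, block_size_mb: int = 128) -> int:
    """Number of blocks needed; a partial final block still counts as one block."""
    return math.ceil(data_size_mb / block_size_mb)


blocks = block_count(2 * 1024)       # 2 GB of reads at a 128 MB block size
uneven = block_count(2 * 1024 + 1)   # one extra megabyte needs one more block
```

With these inputs `blocks` is 16, matching the example, and `uneven` is 17.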
Once the data blocks have been distributed, each block resides on a different DataNode, that is, on a different computer, and the data quality check can begin in parallel. Specifically:
parallel computation is carried out with the MapReduce framework, and each slave node of the cluster simultaneously runs the FastUniq software on the sequencing reads assigned to it to perform quality control;
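The source names FastUniq for this step but does not state the filtering criteria. As a stand-in, and explicitly not FastUniq's own algorithm, the sketch below shows one common kind of per-block quality filter that a slave node could apply: reads are kept only if their mean Phred score reaches a threshold. The Phred+33 encoding, the threshold of 20 and all names are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (read ID, base sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Yield only the reads whose mean quality reaches the threshold."""
    for read_id, bases, quality in reads:
        if quality and mean_phred(quality) >= min_mean_quality:
            yield read_id, bases, quality


# One hypothetical block of reads as it might arrive at a slave node.
block = [
    ("read1", "ACGT", "IIII"),   # 'I' encodes Phred 40
    ("read2", "ACGT", "####"),   # '#' encodes Phred 2
    ("read3", "ACGT", "I#I5"),   # scores 40, 2, 40 and 20
]
kept = [read_id for read_id, _, _ in filter_reads(block)]
```

Because every block is filtered independently, the same function can run on all nodes at once, which is the property the parallel quality-control step relies on.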
The data that remain after quality control are stored on HDFS, and sequence alignment then begins. The specific steps are:
parallel computation is carried out with the MapReduce framework, and the reference sequence is split by genome and an index is built for each part;
each slave node of the cluster runs the Bowtie2 software, aligning the sequencing reads to the reference sequence of each genome separately;
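A slave node's alignment work could be driven by commands like those built below. This is a sketch, not the patent's implementation: the index names, the read file and the thread count are hypothetical, while `-p`, `-x`, `-U` and `-S` are standard Bowtie2 options for threads, index prefix, unpaired input and SAM output.

```python
from typing import Dict, List


def bowtie2_commands(reads_fastq: str, indexes: Dict[str, str],
                     threads: int = 4) -> Dict[str, List[str]]:
    """Build one Bowtie2 command per reference index, keyed by the reference part name."""
    commands = {}
    for name, index_prefix in indexes.items():
        commands[name] = [
            "bowtie2",
            "-p", str(threads),      # worker threads on this node
            "-x", index_prefix,      # index built beforehand with bowtie2-build
            "-U", reads_fastq,       # unpaired reads assigned to this node
            "-S", f"{name}.sam",     # SAM output, to be stored on HDFS afterwards
        ]
    return commands


# Hypothetical per-part indexes of a reference that was split before indexing.
indexes = {"chr1": "index/chr1", "chr2": "index/chr2"}
jobs = bowtie2_commands("block_0001.fastq", indexes)
```

Each command list can be passed to `subprocess.run`, and the resulting SAM files uploaded to HDFS once the alignments finish.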
After alignment, the result files are stored on HDFS, and extraction of SNP mutation site information finally begins. The specific steps are:
parallel computation is carried out with the MapReduce framework, and a dbSNP database is prepared;
each slave node of the cluster calls Samtools to extract the SNP mutation site information from the alignment data;
the resulting files are saved in the corresponding results directory on HDFS;
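The source does not say which Samtools commands are used. One plausible route is the text output of `samtools mpileup`, which lists the bases observed at each reference position. The sketch below reads lines in that format and reports positions where a non-reference base dominates. It is a deliberately simplified illustration, not the statistical calling that real variant callers perform; the depth and fraction thresholds and the sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Optional, Tuple

INDEL = re.compile(r"[+-](\d+)")


def count_bases(ref: str, bases: str) -> Counter:
    """Count the bases observed in the read-bases column of one mpileup line."""
    counts: Counter = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":                      # read start: skip '^' and the mapping-quality char
            i += 2
            continue
        indel = INDEL.match(bases, i) if ch in "+-" else None
        if indel:                          # skip "+2AT"-style indel descriptions
            i = indel.end() + int(indel.group(1))
            continue
        if ch in ".,":                     # match to the reference on either strand
            counts[ref.upper()] += 1
        elif ch.upper() in "ACGT":         # mismatch on the forward or reverse strand
            counts[ch.upper()] += 1
        i += 1                             # '$', '*', '<', '>' and 'N' are not counted
    return counts


def call_snp(line: str, min_depth: int = 4,
             min_fraction: float = 0.8) -> Optional[Tuple[str, int, str, str]]:
    """Return (chrom, pos, ref, alt) if a non-reference base dominates, else None."""
    chrom, pos, ref, _depth, bases = line.rstrip("\n").split("\t")[:5]
    counts = count_bases(ref, bases)
    total = sum(counts.values())
    if total == 0 or total < min_depth:
        return None
    alt, support = counts.most_common(1)[0]
    if alt != ref.upper() and support / total >= min_fraction:
        return chrom, int(pos), ref.upper(), alt
    return None


pileup = [
    "chr1\t100\tA\t5\t.,.,.\tIIIII",       # all reads match the reference
    "chr1\t101\tC\t5\tTTt^Itt$\tIIIII",    # every read shows T instead of C
    "chr1\t102\tG\t4\t.+2AT,.,\tIIII",     # an insertion, but no base substitution
    "chr1\t103\tT\t2\tAa\tII",             # too shallow to call
]
snps = [snp for snp in map(call_snp, pileup) if snp]
```

Only position 101 is reported here. Because each pileup line is handled on its own, lines can be processed by different nodes and the results concatenated.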
In this embodiment, the prepared dbSNP database is stored in a MySQL database.
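How extracted SNP sites might be checked against such a database can be sketched as follows. The source gives no schema, so the table layout, the column names and the rs identifiers below are invented for illustration, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dbsnp (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, rs_id TEXT, "
    "PRIMARY KEY (chrom, pos, ref, alt))"
)
conn.executemany(
    "INSERT INTO dbsnp VALUES (?, ?, ?, ?, ?)",
    [("chr1", 101, "C", "T", "rs0000001"),    # made-up identifiers, not real dbSNP records
     ("chr2", 500, "G", "A", "rs0000002")],
)


def annotate(connection, snps):
    """Attach the known rs ID to each called SNP, or None if it is not in the table."""
    query = "SELECT rs_id FROM dbsnp WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?"
    annotated = []
    for snp in snps:
        row = connection.execute(query, snp).fetchone()
        annotated.append((*snp, row[0] if row else None))
    return annotated


called = [("chr1", 101, "C", "T"), ("chr1", 250, "A", "G")]
result = annotate(conn, called)
```

The first site is found in the table and the second is not, so it would be treated as a candidate novel variant. With a MySQL driver the same queries would typically use `%s` placeholders instead of `?`.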
In conclusion in the present embodiment, according to the sequencing data high-efficient treatment method of the present embodiment, the present embodiment provides
Sequencing data high-efficient treatment method, to sequencing data carry out quality inspection, the data after quality inspection are compared with reference sequences, root
The abrupt information of SNP is extracted according to comparison result, a whole set of flow chart of data processing uses Hadoop framework, first to needing that sequence is sequenced
On row and transmission of reference sequences to HDFS, then sequencing sequence is divided accordingly, then the sequencing sequence after segmentation is sent out
Send to each slave of cluster and handled, wherein processing includes mass filter to data after segmentation, and to filtered data with
Reference sequences are compared, and the abrupt information of SNP is finally extracted according to comparison result, and this method uses Hadoop concurrent operation frames
Frame carry out high-flux sequence data cloud computing analysis system exploitation, for medical scientific personnel provide one kind it is quick, cheap,
High-flux sequence data analysis tool easy to use, greatly improves data processing speed.
The above, further embodiment only of the present invention, but scope of protection of the present invention is not limited thereto, and it is any
Within the scope of the present disclosure, according to the technique and scheme of the present invention and its design adds those familiar with the art
With equivalent substitution or change, protection scope of the present invention is belonged to.
Claims (7)
1. An efficient method for processing sequencing data, characterized by comprising the following steps:
preparing for parallel computation according to high-throughput sequencing data;
preparing a reference sequence for the sequencing data according to the high-throughput sequencing data;
filtering out data of unacceptable quality by parallel computation;
aligning the sequencing data to the reference sequence by parallel computation;
extracting SNP mutation site information from the alignment data by parallel computation.
2. The efficient method for processing sequencing data according to claim 1, characterized in that preparing for parallel computation according to the high-throughput sequencing data comprises the following steps:
building a Hadoop cluster;
uploading the sequencing reads and the reference sequence to HDFS;
splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing.
3. The efficient method for processing sequencing data according to claim 2, characterized in that splitting the sequencing reads and sending the split reads to each slave node of the cluster for processing comprises the following steps:
partitioning the data into blocks according to the block size configured in Hadoop and sending the blocks to different map tasks for processing;
in each map task, using the ID of each read as the key and the strand label, base sequence and quality string as the value;
after splitting, using the sort function of the reduce phase to assemble the sequences that share the same key into one block of reads, which is stored in HDFS.
4. The efficient method for processing sequencing data according to claim 1, characterized in that preparing a reference sequence for the sequencing data according to the high-throughput sequencing data comprises building an index for the reference sequence and uploading it to HDFS together with the sequencing data.
5. The efficient method for processing sequencing data according to claim 1, characterized in that filtering out data of unacceptable quality by parallel computation comprises the following steps:
performing the data quality check with the FastUniq software;
having each slave node of the cluster simultaneously perform quality control on the sequencing reads assigned to it;
storing the sequencing reads that pass quality control on HDFS.
6. The efficient method for processing sequencing data according to claim 1, characterized in that aligning the sequencing data to the reference sequence by parallel computation comprises the following steps:
using Bowtie2 as the sequence alignment software;
splitting the reference sequence by genome and building an index for each part;
when aligning with Bowtie2, aligning the sequencing reads to the reference sequence of each genome separately;
storing the alignment results on HDFS.
7. The efficient method for processing sequencing data according to claim 1, characterized in that extracting SNP mutation site information from the alignment data by parallel computation comprises the following steps:
preparing a dbSNP database;
extracting the SNP mutation site information from the alignment data with Samtools;
saving the resulting files in the corresponding results directory on HDFS.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810378325.1A CN108763869A (en) | 2018-04-25 | 2018-04-25 | A kind of sequencing data high-efficient treatment method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810378325.1A CN108763869A (en) | 2018-04-25 | 2018-04-25 | A kind of sequencing data high-efficient treatment method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108763869A true CN108763869A (en) | 2018-11-06 |
Family
ID=64011694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810378325.1A Pending CN108763869A (en) | 2018-04-25 | 2018-04-25 | A kind of sequencing data high-efficient treatment method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763869A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407749A (en) * | 2016-08-30 | 2017-02-15 | 上海华点云生物科技有限公司 | Analysis method and analysis apparatus for searching chromosomal mutation site of sample |
CN107391965A (en) * | 2017-08-15 | 2017-11-24 | 上海派森诺生物科技股份有限公司 | A kind of lung cancer somatic mutation determination method based on high throughput sequencing technologies |
CN107563153A (en) * | 2017-08-03 | 2018-01-09 | 华子昂 | A kind of PacBio microarray dataset IT architectures based on Hadoop structures |
Non-Patent Citations (2)
Title |
---|
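Huang Zhizhun et al.: "Parallel processing and analysis of gene variation information in an omics big data environment" (组学大数据环境下的基因变异信息并行处理与分析), Beijing Biomedical Engineering (《北京生物医学工程》) * |
Huang Zhizhun: "Research on methods for parallel processing and analysis of gene information in an omics big data environment" (组学大数据环境下的基因信息并行处理与分析方法研究), China Master's Theses Full-text Database, Basic Sciences (《中国优秀硕士论文全文数据库 基础科学辑》) * |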
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109616156A (en) * | 2018-12-03 | 2019-04-12 | 郑州云海信息技术有限公司 | A kind of gene sequencing date storage method and device |
CN110070911A (en) * | 2019-04-12 | 2019-07-30 | 内蒙古农业大学 | A kind of parallel comparison method of gene order based on Hadoop |
CN110016498A (en) * | 2019-04-24 | 2019-07-16 | 北京诺赛基因组研究中心有限公司 | The method of single nucleotide polymorphism is determined in the sequencing of Sanger method |
CN110016498B (en) * | 2019-04-24 | 2020-05-08 | 北京诺赛基因组研究中心有限公司 | Method for determining single nucleotide polymorphism in Sanger method sequencing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181106 |