CN107273663B - A method for computing and interpreting DNA methylation sequencing data
- Publication number: CN107273663B
- Application number: CN201710362178.4A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Abstract
The invention discloses a method for computing and interpreting DNA methylation sequencing data. Its implementation steps are: preprocessing the reference genome data and the raw sequencing sample data used for DNA methylation sequencing; having the CPU call a hardware-implemented aligner on an FPGA to align the preprocessed sequencing sample data against the reference genome; having the CPU call an identifier programmed on a GPU and a hardware-implemented deep learning model on the FPGA to perform methylation identification based on the alignment result; and visualizing the result data, with the CPU calling the hardware-implemented deep learning model on the FPGA to mine and analyze the methylation functions reflected in the result data, calling programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calling programs on a DSP to handle the related graphics, image and audio tasks. The invention is fast and real-time, accurate and in-depth, easy to understand, and varied in presentation.
Description
Technical field
The present invention relates to gene sequencing technology, and in particular to a method for computing and interpreting DNA methylation sequencing data.
Background art
In recent years, with the widespread use of next-generation sequencing (Next-Generation Sequencing, NGS), the cost of gene sequencing has fallen rapidly, and gene sequencing technology has found broader application in many fields such as biology, medicine, health, criminal investigation and agriculture. Among these applications, NGS-based DNA (deoxyribonucleic acid) methylation sequencing is a branch with considerable application value and has attracted wide attention.
Methylation refers to the process in which a methyl group is catalytically transferred from an active methyl compound (such as S-adenosylmethionine) to another compound. Methylation is one of the important research topics of epigenetics. The most common methylation modifications are DNA methylation and histone methylation. DNA methylation in vertebrates typically occurs at CpG sites, that is, cytosine-phosphate-guanine sites in the DNA sequence, where cytosine is converted to 5-methylcytosine under the catalysis of DNA methyltransferase. About 80%-90% of the CpG sites in human genes are methylated, 1%-2% of the human genome consists of CpG clusters, and CpG methylation is inversely related to transcriptional activity. DNA methylation can change chromatin structure, DNA conformation, DNA stability and the way DNA interacts with proteins, and can switch off the activity of certain genes, whereas demethylation induces the reactivation and expression of genes. For example, existing research shows that human DNA methylation is closely related to many conditions such as cancer, aging and senile dementia, and abnormal methylation is often a cause of disease. DNA methylation detection therefore has great application value in fields such as biological research, medical diagnosis and forensic biology.
In recent years, scientists have combined traditional DNA methylation analysis techniques with target genome capture and NGS high-throughput sequencing, and techniques for quantitatively measuring methylation in the genomes of humans and other species have entered practical use. The most commonly used at present is bisulfite sequencing (BS-Seq): genomic DNA is treated with bisulfite, which converts unmethylated cytosine to uracil while leaving methylated cytosine unchanged. BSP (bisulfite sequencing PCR) primers are then designed for the polymerase chain reaction (PCR), during which all uracil is converted to thymine. Finally, the PCR product is sequenced to determine whether each CpG site is methylated.
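The bisulfite conversion logic described above can be illustrated with a minimal Python sketch. The function names `bisulfite_convert` and `call_methylation` and the toy sequence are illustrative assumptions and do not come from the patent; strand handling and sequencing errors are ignored.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T (C -> U -> T); methylated C stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted_read):
    """Compare a converted read with the original reference, base by base.

    Returns {position: True/False} for every reference C that is read
    as C (methylated) or as T (unmethylated).
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
    return calls


reference = "ACGTCCG"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

Only the cytosine at position 1 survives conversion in this toy case, so it is the only site called as methylated.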
The data processing workflow for NGS-based DNA methylation sequencing comprises two major steps, data computation and data interpretation. The data computation step preprocesses the reference genome and performs computing tasks on the raw sequencing data such as artifact removal, alignment and deduplication, so that the results can be used during interpretation. The data interpretation step analyzes, reveals and explains the scientific meaning of the computed data in fields such as biology, medicine and health care.
At present, NGS-based DNA methylation sequencing faces two bottlenecks in application:
The first bottleneck is that sequencing data output capacity far exceeds sequencing data processing capacity. For example, Methy-Pipe, a commonly used software package for computing and interpreting NGS-based DNA methylation sequencing data, takes about 5 hours on a 12-core Intel Xeon processor to complete just one task of the whole workflow, alignment, for a typical single sample containing 300M short sequencing fragments (reads) with a read length of 75 base pairs (bp), whereas an Illumina HiSeq 4000 sequencer can output 200M reads with a read length of 300 bp within 5 hours. On the one hand, the raw data produced by sequencing grows 3 to 5 times per year, far outpacing Moore's law, and computing and interpreting sequencing data is both input/output-intensive and compute-intensive, so real-time, accurate computation, interpretation and transmission of sequencing data has become extremely difficult. On the other hand, typical methods for computing and interpreting sequencing data still rely mainly on high-performance central processing units (CPUs) running multithreaded software. Under the premise of guaranteed accuracy, however, the acceleration obtainable this way still cannot meet the challenge described above, so this approach is not sustainable.
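The gap between the two rates quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are illustrative.

```python
# Figures quoted in the text above.
aligned_reads, aligned_read_len_bp, align_hours = 300e6, 75, 5
produced_reads, produced_read_len_bp, produce_hours = 200e6, 300, 5

# Bases aligned per hour by the CPU-only software.
align_rate = aligned_reads * aligned_read_len_bp / align_hours
# Bases produced per hour by the sequencer.
produce_rate = produced_reads * produced_read_len_bp / produce_hours

ratio = produce_rate / align_rate
print(f"alignment:        {align_rate / 1e9:.1f} Gbp/h")
print(f"sequencer output: {produce_rate / 1e9:.1f} Gbp/h")
print(f"output exceeds alignment throughput by a factor of {ratio:.2f}")
```

On these figures the sequencer produces bases faster than the alignment task alone can consume them, before any other workflow step is counted.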
The second bottleneck is that the depth and breadth of sequencing data interpretation cannot meet the needs of researchers, while its readability cannot meet the needs of the general public. The typical method of interpreting sequencing data at present is based on a reference genome. However, the reference genomes in current use are themselves built from a limited number of samples; they neither represent the full diversity of the species nor are complete, which introduces bias into computation and interpretation. Interpretation also lacks broad, in-depth cross-analysis with other biological and medical information, so it is difficult to meet the needs of in-depth professional research. In addition, interpretation of sequencing data still largely stays within the professional domain and lacks readability for the lay public; that is, it lacks easy-to-understand, varied explanations of the direct biological meaning and indirect health implications of sequencing data.
At present, the processor types commonly used in information processing are the central processing unit (Central Processing Unit, CPU), the field-programmable gate array (Field Programmable Gate Array, FPGA), the graphics processing unit (Graphics Processing Unit, GPU) and the digital signal processor (Digital Signal Processor, DSP).
A high-performance CPU usually contains multiple processor cores and supports multithreading in hardware, but its design target is still general-purpose applications. Compared with specialized computation, general-purpose applications have less parallelism, need more complex control, and have lower performance targets. The on-chip hardware resources of a CPU are therefore devoted mainly to complex control rather than computation; a CPU contains no dedicated hardware for specific functions, and the degree of computational parallelism it supports is limited.
An FPGA is a semi-custom circuit. Its advantages are that FPGA-based system development has a short design cycle and low development cost, power consumption is low, and the configuration can be modified after manufacture, so design flexibility is high and design risk is low. Its disadvantage is that, for the same function, an FPGA is generally slower than an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC) and occupies more circuit area. As the technology evolves, FPGAs are moving toward higher density, larger capacity, lower power consumption and the integration of more hard intellectual property (Intellectual Property, IP) cores, so their disadvantages are shrinking and their advantages are growing. Compared with a CPU, an FPGA allows parallel computation to be implemented, modified and extended in a customized way using a hardware description language.
A GPU was originally a microprocessor dedicated to image processing, supporting basic graphics tasks such as texture mapping and polygon shading in hardware. Because graphics computation involves general mathematical operations such as matrix and vector operations, and the GPU has a highly parallel architecture, GPU computing has risen with the development of the related software and hardware: GPUs are no longer limited to graphics processing but are also used for parallel computation in linear algebra, signal processing, numerical simulation and similar fields, where they can deliver tens or even hundreds of times the performance of a CPU. Current GPUs have two problems, however: first, constrained by the hardware architecture of the GPU, many parallel algorithms cannot run efficiently on it; second, a GPU generates a great deal of heat in operation and its energy consumption is high.
A DSP is a microprocessor that uses digital methods to rapidly analyze, transform, filter, detect, modulate, demodulate and otherwise process signals. To this end, DSPs are specially optimized in their internal chip structure, for example with hardware implementations of high-speed, high-precision multiplication. With the arrival of the digital age, DSPs are widely used in smart devices, resource exploration, digital control, biomedicine, aerospace and other fields, and they feature low power consumption, high precision, and the ability to perform two-dimensional and multidimensional processing.
In summary, each of these four computing devices has its own strengths and its own limitations.
Given the two bottlenecks described above in the application of NGS-based DNA methylation sequencing, how to use these processors to compute and interpret massive sequencing data in a way that is fast and real-time, accurate and in-depth, easy to understand, and varied in presentation has become a key technical problem that urgently needs to be solved.
Summary of the invention
The technical problem to be solved by the invention: in view of the above problems in the prior art, to provide a method for computing and interpreting DNA methylation sequencing data that is fast and real-time, accurate and in-depth, easy to understand, and varied in presentation.
To solve the above technical problem, the invention adopts the following technical solution:
A method for computing and interpreting DNA methylation sequencing data, whose implementation steps include:
1) preprocessing the reference genome data and the raw sequencing sample data used for DNA methylation sequencing;
2) having the CPU call a hardware-implemented aligner on the FPGA to align the preprocessed sequencing sample data against the reference genome;
3) having the CPU call an identifier programmed on the GPU and a hardware-implemented deep learning model on the FPGA to perform methylation identification based on the alignment result;
4) visualizing the result data, having the CPU call the hardware-implemented deep learning model on the FPGA to mine and analyze the methylation functions reflected in the result data, with the CPU calling programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calling programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining.
Preferably, the detailed steps by which step 1) preprocesses the reference genome data include: performing an in silico (bioinformatic) conversion for methylation on the reference genome data; having the CPU call a hardware-implemented index generator on the FPGA to generate, for the converted reference genome data, an index for the subsequent alignment task; and outputting the converted reference genome data and its index.
Preferably, the detailed steps by which step 1) preprocesses the raw sequencing sample data include: performing data quality control on the raw sequencing sample data to obtain reliable sample data, where the data quality control includes trimming the raw DNA methylation sequencing sample data and removing adapter sequences and low-quality bases from the reads; performing an in silico conversion for methylation on the reliable sample data; and outputting the converted reliable DNA methylation sequencing sample data.
Preferably, in step 1) the preprocessing of the reference genome data and of the raw sequencing sample data used for DNA methylation sequencing is executed concurrently on the CPU in different threads.
Preferably, the detailed steps of step 2) include:
2.1) reading the in silico converted reference genome data and its index; reading the trimmed reliable DNA methylation sequencing sample data and the in silico converted reliable DNA methylation sequencing sample data;
2.2) according to the index of the converted reference genome data, having the CPU call the hardware-implemented aligner on the FPGA to exactly align the converted reliable sample data against the converted reference genome data, establishing the mapping between the converted reliable sample data and the converted reference genome data;
2.3) determining whether the DNA methylation sequencing sample data consists of paired-end reads; if so, going to step 2.4); otherwise the reads are single-end, going to step 2.5); ambiguous reads are discarded directly;
2.4) for paired-end reads, under the conditions that the number of mismatches is bounded and the distance between the paired reads is bounded, and according to the index of the converted reference genome data, having the CPU again call the hardware-implemented aligner on the FPGA to align the converted reliable sample data against the converted reference genome data, adding further mappings between the converted reliable sample data and the converted reference genome data; going to step 2.6);
2.5) for single-end reads, under the condition that the number of mismatches is bounded, and according to the index of the converted reference genome data, having the CPU again call the hardware-implemented aligner on the FPGA to align the converted reliable sample data against the converted reference genome data, adding further mappings between the converted reliable sample data and the converted reference genome data;
2.6) removing duplicate reads according to the alignment result;
2.7) generating basic statistics according to the alignment result, the basic statistics including at least one of alignment rate statistics and methylation density level statistics;
2.8) outputting the alignment result and the basic statistics.
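Steps 2.6) and 2.7) can be sketched in Python as follows. This is a CPU-only illustration under stated assumptions: the patent does not define what counts as a duplicate, so the sketch treats reads mapped to the same chromosome, start position and strand as duplicates, and the names `remove_duplicates` and `alignment_rate` are hypothetical.

```python
def remove_duplicates(alignments):
    """Keep the first read seen at each (chromosome, position, strand).

    `alignments` is a list of (read_id, chromosome, position, strand) tuples.
    Assumption: identical mapping coordinates mark a duplicate read.
    """
    seen = set()
    unique = []
    for read_id, chrom, pos, strand in alignments:
        key = (chrom, pos, strand)
        if key not in seen:
            seen.add(key)
            unique.append((read_id, chrom, pos, strand))
    return unique


def alignment_rate(n_aligned, n_total):
    """Fraction of input reads that were aligned (0.0 when there are no reads)."""
    return n_aligned / n_total if n_total else 0.0


alignments = [
    ("r1", "chr1", 100, "+"),
    ("r2", "chr1", 100, "+"),  # same coordinates as r1: treated as a duplicate
    ("r3", "chr1", 100, "-"),
    ("r4", "chr2", 5, "+"),
]
unique = remove_duplicates(alignments)
print([a[0] for a in unique])
print(alignment_rate(len(alignments), 5))
```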
Preferably, the detailed steps of step 3) include:
3.1) reading the in silico converted reference genome data and its index; reading the trimmed reliable DNA methylation sequencing sample data and the in silico converted reliable DNA methylation sequencing sample data; reading the alignment result; reading the basic statistics;
3.2) identifying each effective methylation site;
3.3) identifying the various specified special methylation regions;
3.4) having the CPU call the hardware-implemented deep learning model on the FPGA to execute ASMs identification in parallel;
3.5) outputting the methylation identification result.
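Step 3.2) can be illustrated by computing a per-site methylation level from aligned reads. The sketch below is a plain-Python assumption of how such a computation might look and is not the GPU or FPGA implementation described in the patent: the function name `methylation_levels`, the forward-strand-only handling, and the use of a minimum read depth to define an "effective" site are all illustrative.

```python
from collections import Counter


def methylation_levels(reference, aligned_reads, min_depth=1):
    """Per-cytosine methylation level = C reads / (C reads + T reads).

    `aligned_reads` is a list of (start, read) pairs, where `read` holds the
    original (unconverted) read bases placed at `start` on the forward strand.
    Sites covered by fewer than `min_depth` informative reads are dropped.
    """
    methylated = Counter()
    depth = Counter()
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != "C":
                continue
            if base == "C":          # cytosine survived conversion: methylated
                methylated[pos] += 1
                depth[pos] += 1
            elif base == "T":        # cytosine was converted: unmethylated
                depth[pos] += 1
    return {
        pos: methylated[pos] / depth[pos]
        for pos in sorted(depth)
        if depth[pos] >= min_depth
    }


levels = methylation_levels("ACGTCG", [(0, "ACGTTG"), (0, "ACGTCG")])
print(levels)
```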
Preferably, the identification of the various specified special methylation regions in step 3.3) comprises two subtasks executed concurrently. Subtask 1: identifying hypomethylated regions, that is, DNA regions with low methylation density and high gene expression, and hypermethylated regions, that is, DNA regions with high methylation density and low gene expression. Subtask 2: identifying, via the CPU, differentially methylated regions (DMRs), that is, regions whose methylation state differs among the genomes of multiple samples, with the CPU calling the identifier programmed on the GPU to execute DMR identification in parallel so as to identify DMRs between individuals; differentially methylated regions are regarded as functional regions that may take part in regulating gene transcription levels.
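The two subtasks can be sketched on fixed-size genomic windows as follows. This is a simplified illustration, not the patented GPU identifier: the window size, the thresholds `low`, `high` and `min_delta`, and all function names are assumptions, and gene expression, which the text also uses to define hypo- and hypermethylated regions, is not modeled.

```python
from collections import defaultdict


def window_means(site_levels, window):
    """Average per-site methylation levels within fixed-size windows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, level in site_levels.items():
        sums[pos // window] += level
        counts[pos // window] += 1
    return {w: sums[w] / counts[w] for w in sums}


def classify_windows(site_levels, window, low=0.2, high=0.8):
    """Subtask 1 (simplified): label windows by methylation density only."""
    labels = {}
    for w, mean in window_means(site_levels, window).items():
        if mean <= low:
            labels[w] = "hypo"
        elif mean >= high:
            labels[w] = "hyper"
    return labels


def differential_windows(levels_a, levels_b, window, min_delta=0.3):
    """Subtask 2 (simplified): windows whose mean level differs between samples."""
    means_a = window_means(levels_a, window)
    means_b = window_means(levels_b, window)
    shared = means_a.keys() & means_b.keys()
    return sorted(w for w in shared if abs(means_a[w] - means_b[w]) >= min_delta)


sample_a = {10: 0.9, 20: 0.9, 110: 0.1}
sample_b = {15: 0.1, 120: 0.1}
print(classify_windows(sample_a, window=100))
print(differential_windows(sample_a, sample_b, window=100))
```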
Preferably, the detailed steps of step 4) include:
4.1) reading the basic statistics and the methylation identification result;
4.2) having the CPU call the GPU and the DSP to visualize the basic statistics and the methylation identification result, with the CPU calling programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calling programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining;
4.3) having the CPU call the hardware-implemented deep learning model on the FPGA to execute methylation functional analysis and mining in parallel, with the CPU calling programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calling programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining;
4.4) outputting the analysis data and the in-depth interpretation report.
The method of the invention for computing and interpreting DNA methylation sequencing data has the following advantages:
1. For each time-consuming bottleneck in the workflow for computing and interpreting DNA methylation sequencing data, targeted parallel acceleration is applied based on the parallelism of the task's own algorithm or model and on the characteristics of the four processors (CPU, FPGA, GPU and DSP), which improves the real-time performance of computing and interpreting DNA methylation sequencing data.
2. For methylation identification and for methylation functional analysis and mining in the workflow, deep learning models are introduced based on the goals of the tasks themselves and on the characteristics of the four processors (CPU, FPGA, GPU and DSP), and the processing of deep learning source data is accelerated and enriched, which improves the depth and breadth of computing and interpreting DNA methylation sequencing data.
3. For data visualization in the workflow, the three processors CPU, GPU and DSP cooperate according to their characteristics to complete the visualization processing, which improves the real-time performance of visualizing DNA methylation sequencing data and enriches the variety of the visualizations.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall workflow for computing and interpreting DNA methylation sequencing data in an embodiment of the invention.
Fig. 2 is a schematic diagram of the preprocessing workflow in an embodiment of the invention.
Fig. 3 is a schematic diagram of the data alignment workflow in an embodiment of the invention.
Fig. 4 is a schematic diagram of the methylation identification workflow in an embodiment of the invention.
Fig. 5 is a schematic diagram of the workflow for visualizing the identification data and for methylation functional analysis and mining in an embodiment of the invention.
Detailed description of the embodiments
As shown in Fig. 1, the implementation steps of the method of this embodiment for computing and interpreting DNA methylation sequencing data include:
1) preprocessing the reference genome data and the raw sequencing sample data used for DNA methylation sequencing;
2) having the CPU call a hardware-implemented aligner on the FPGA to align the preprocessed sequencing sample data against the reference genome (alignment); this step uses two processors, the CPU and the FPGA;
3) having the CPU call an identifier programmed on the GPU and a hardware-implemented deep learning (Deep Learning, DL) model on the FPGA to perform methylation identification based on the alignment result; this step uses three processors, the CPU, the FPGA and the GPU;
4) visualizing the result data, having the CPU call the hardware-implemented deep learning model on the FPGA to mine and analyze the methylation functions reflected in the result data, with the CPU calling programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calling programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining; this step uses four processors, the CPU, the FPGA, the GPU and the DSP.
As shown in Fig. 1, in this embodiment steps 1) and 2) complete the computation tasks for the DNA methylation sequencing data, and steps 3) and 4) complete the interpretation tasks. In the detailed step descriptions below, the CPU is used by default unless stated otherwise.
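The assignment of processors to the four steps can be summarized as a small data structure. The following Python sketch only restates the mapping given in this description; the `Stage` class, the `PIPELINE` tuple and the helper `stages_using` are illustrative names, not part of the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    name: str
    phase: str              # "computation" or "interpretation"
    processors: frozenset   # processors the step relies on


PIPELINE = (
    Stage(1, "preprocessing", "computation", frozenset({"CPU", "FPGA"})),
    Stage(2, "alignment", "computation", frozenset({"CPU", "FPGA"})),
    Stage(3, "methylation identification", "interpretation",
          frozenset({"CPU", "FPGA", "GPU"})),
    Stage(4, "visualization and functional mining", "interpretation",
          frozenset({"CPU", "FPGA", "GPU", "DSP"})),
)


def stages_using(processor):
    """Names of the pipeline stages that rely on the given processor."""
    return [stage.name for stage in PIPELINE if processor in stage.processors]


print(stages_using("GPU"))
print(stages_using("DSP"))
```

The CPU appears in every stage because it drives flow control and invokes the other devices.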
Step 1) comprises two subtasks executed concurrently: preprocessing the reference genome and preprocessing the raw DNA methylation sequencing sample data. As shown in Fig. 2, in this embodiment the preprocessing of the reference genome data and of the raw sequencing sample data in step 1) is executed concurrently on the CPU in different threads (thread 1 and thread 2).
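The two-thread arrangement can be sketched with Python's standard `concurrent.futures` module. The two worker functions are placeholders that only apply the C-to-T conversion; their names and the name `preprocess_all` are assumptions made for illustration. Note that in standard CPython builds threads do not execute pure-Python code in parallel, so this shows the structure of the concurrency rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_reference(reference):
    # Placeholder for thread 1: in silico C -> T conversion of the reference.
    return reference.replace("C", "T")


def preprocess_reads(reads):
    # Placeholder for thread 2: in silico C -> T conversion of each read.
    return [read.replace("C", "T") for read in reads]


def preprocess_all(reference, reads):
    """Run both preprocessing subtasks concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        reference_future = pool.submit(preprocess_reference, reference)
        reads_future = pool.submit(preprocess_reads, reads)
        return reference_future.result(), reads_future.result()


converted_reference, converted_reads = preprocess_all("ACGC", ["CCA", "GGC"])
print(converted_reference, converted_reads)
```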
Referring to Fig. 2, the detailed steps by which step 1) preprocesses the reference genome data include: performing an in silico conversion for methylation on the reference genome data; having the CPU call the hardware-implemented index generator on the FPGA to generate, for the converted reference genome data, an index for the subsequent alignment task; and outputting the converted reference genome data and its index. This step uses two processors, the CPU and the FPGA. In the in silico conversion of the reference genome data, if BS-Seq is used, every C in the reference genome data that represents an unmethylated cytosine must be converted to a T representing thymine. If the DNA is double-stranded (Watson and Crick strands), both strands must be converted. When the index for the subsequent alignment task is generated for the converted reference genome data, the CPU is responsible for the flow control of index generation and the hardware-implemented index generator on the FPGA generates the index in parallel, with data and instructions exchanged between the CPU and the FPGA. When only the CPU is used, this step is one of the time-consuming bottlenecks of the whole workflow; adding the FPGA allows the compute-intensive tasks in it to be accelerated in parallel. A given reference genome is relatively stable over a period of time, so the index can be generated once and reused in similar applications; once the reference genome data is updated, however, a new index must be generated.
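A software-only sketch of the reference conversion and index generation follows. It rests on several assumptions: a plain reference sequence carries no methylation marks, so every C is converted; the Crick strand conversion is expressed as G-to-A on forward-strand coordinates, a convention used by some bisulfite aligners; and the index is a simple k-mer lookup table, since the patent does not specify the index structure built by the FPGA. All names are hypothetical.

```python
from collections import defaultdict


def convert_reference(reference):
    """Return in silico converted versions of a forward-strand reference.

    Watson strand: every C becomes T.
    Crick strand: C -> T on the reverse strand appears as G -> A when the
    sequence is written in forward-strand coordinates.
    """
    return {
        "watson_C2T": reference.replace("C", "T"),
        "crick_G2A": reference.replace("G", "A"),
    }


def build_kmer_index(sequence, k):
    """Map every k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return dict(index)


converted = convert_reference("ACGT")
index = build_kmer_index("ATAT", k=2)
print(converted)
print(index)
```

Because the index depends only on the converted reference, it can be built once and reloaded until the reference changes, as the text notes.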
Referring to Fig. 2, the detailed steps by which step 1) preprocesses the raw sequencing sample data include: performing data quality control on the raw sequencing sample data to obtain reliable sample data (clean data), where the data quality control includes trimming the raw DNA methylation sequencing sample data and removing the adapter sequences and low-quality bases from the reads; performing an in silico conversion for methylation on the reliable sample data; and outputting the converted reliable DNA methylation sequencing sample data. In the in silico conversion of the trimmed reliable sample data, if BS-Seq is used, every C representing cytosine in the reliable DNA methylation sequencing sample data must be converted to a T representing thymine.
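The quality control and conversion of a single read can be sketched as follows. The exact-match adapter search, the Phred+33 quality encoding, the quality threshold of 20, the toy adapter string and all function names are assumptions made for illustration; the patent does not specify them.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter sequence."""
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos], qual[:pos]
    return seq, qual


def trim_low_quality(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below `min_q`."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_read(seq, qual, adapter, min_q=20):
    """Return (trimmed sequence, trimmed qualities, C -> T converted sequence)."""
    seq, qual = trim_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality(seq, qual, min_q)
    return seq, qual, seq.replace("C", "T")


# Toy read: 6 genomic bases followed by a hypothetical adapter "AGATCGG".
# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
result = clean_read("ACGTCCAGATCGG", "IIII##IIIIIII", adapter="AGATCGG")
print(result)
```

Both the trimmed read and its converted form are kept, because the later steps read both.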
As shown in figure 3, the detailed step of step 2 includes:
2.1) reference genomic data and its index after reading above-mentioned raw letter conversion;DNA first after reading above-mentioned trimming
Reliable sample data is sequenced in the DNA methylation that base is sequenced after reliable sample data and raw letter conversion;
2.2) according to the index of the reference genomic data after above-mentioned raw letter conversion, call hardware on FPGA real by CPU
The reference after reliable sample data and above-mentioned raw letter conversion is sequenced in DNA methylation after above-mentioned raw letter conversion by existing comparative device 1
Genomic data carries out precise alignment, and reliable sample data and above-mentioned life is sequenced in the DNA methylation after establishing above-mentioned raw letter conversion
The mapping relations between reference genomic data after letter conversion;Hard-wired comparative device 1 on FPGA is called to carry out by CPU
When comparison, CPU is responsible for the Row control of data precise alignment, and the upper hard-wired comparative device 1 of FPGA is responsible for parallel execution of data
Precise alignment has data and instruction interaction between CPU and FPGA.When only using CPU, the step for be that entire DNA methylation is surveyed
Ordinal number is according to one of the time-consuming bottleneck interpreted in process is calculated, and the present embodiment is by being added hard-wired comparative device 1 on FPGA, energy
Computation-intensive task therein is completed in enough parallel acceleration.
2.3) judge whether DNA methylation sequencing sample data is both-end (paired-end) reads, if it is both-end
Reads is then jumped and is executed step 2.4);Otherwise it is single-ended (single-end) reads, jumps and execute step 2.5);It is indefinite
(ambiguous) reads is then directly removed;
2.4) For paired-end reads, subject to a bounded number of mismatches (for example, no more than 2) and a bounded distance between the two reads of a pair (for example, between 50 and 600 bases), the CPU again calls aligner 2, implemented in hardware on the FPGA. Using the index of the bioinformatically converted reference genome data, aligner 2 aligns the bioinformatically converted reliable DNA methylation sequencing sample data against the bioinformatically converted reference genome data, adding further mapping relations between the two; then jump to step 2.6). In the present embodiment, during this alignment the CPU is responsible for the flow control of the data alignment and aligner 2 on the FPGA is responsible for executing the data alignment in parallel; data and instructions are exchanged between the CPU and the FPGA. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data calculation and interpretation workflow; adding the FPGA allows the computation-intensive tasks in it to be accelerated in parallel;
2.5) For single-end reads, subject to a bounded number of mismatches (typically no more than 2), the CPU again calls aligner 2, implemented in hardware on the FPGA. Using the index of the bioinformatically converted reference genome data, aligner 2 aligns the bioinformatically converted reliable DNA methylation sequencing sample data against the bioinformatically converted reference genome data, adding further mapping relations between the two. During this alignment the CPU is responsible for the flow control of the data alignment and aligner 2 on the FPGA is responsible for executing the data alignment in parallel; data and instructions are exchanged between the CPU and the FPGA. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data calculation and interpretation workflow; adding the FPGA allows the computation-intensive tasks in it to be accelerated in parallel;
2.6) Remove duplicate reads according to the above alignment results;
2.7) Generate basic statistical information from the above alignment results; the basic statistical information includes at least one of alignment rate statistics and methylation density level statistics;
2.8) Output the above alignment results and basic statistical information.
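The filtering and bookkeeping in steps 2.3) to 2.7) can be illustrated with a small software-only sketch. It does not reproduce the FPGA aligner; it only shows how already-aligned reads might be screened. The `Alignment` record, the `ambiguous` flag, the position-based duplicate key, and the function names are all assumptions made for illustration, and the thresholds are simply the example values quoted in steps 2.4) and 2.5).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

MAX_MISMATCHES = 2          # example limit quoted in steps 2.4) and 2.5)
INSERT_RANGE = (50, 600)    # example paired-end distance range from step 2.4)


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one aligned read (not a format from the patent)."""
    read_id: str
    chrom: str
    pos: int
    strand: str
    mismatches: int
    paired: bool
    insert_size: Optional[int] = None   # only meaningful for paired-end reads
    ambiguous: bool = False             # assumed flag for reads to be discarded


def passes_filters(aln: Alignment) -> bool:
    """Steps 2.3)-2.5): drop ambiguous reads, then apply the per-type limits."""
    if aln.ambiguous or aln.mismatches > MAX_MISMATCHES:
        return False
    if aln.paired:
        lo, hi = INSERT_RANGE
        return aln.insert_size is not None and lo <= aln.insert_size <= hi
    return True


def remove_duplicates(alns: Iterable[Alignment]) -> List[Alignment]:
    """Step 2.6): keep the first read seen for each assumed duplicate key."""
    seen = set()
    kept = []
    for aln in alns:
        key = (aln.chrom, aln.pos, aln.strand, aln.insert_size)
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept


def alignment_rate(n_kept: int, n_total: int) -> float:
    """Step 2.7): fraction of input reads retained after alignment filtering."""
    return n_kept / n_total if n_total else 0.0


reads = [
    Alignment("r1", "chr1", 100, "+", 1, True, 300),
    Alignment("r2", "chr1", 100, "+", 0, True, 300),   # same position as r1
    Alignment("r3", "chr1", 500, "-", 3, False),       # too many mismatches
    Alignment("r4", "chr2", 40, "+", 0, True, 900),    # pair too far apart
    Alignment("r5", "chr2", 75, "-", 2, False),
]
kept = remove_duplicates(a for a in reads if passes_filters(a))
print([a.read_id for a in kept], alignment_rate(len(kept), len(reads)))
```

Treating reads with identical coordinates as duplicates is a common convention rather than something the text specifies; a production pipeline would normally also compare mate positions and library information.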
In the present embodiment, the methylation identification based on the alignment results in step 3) uses three kinds of processors: CPU, FPGA and GPU. As shown in Figure 4, the detailed steps of step 3) include:
3.1) Read the bioinformatically converted reference genome data and its index; read the trimmed reliable DNA methylation sequencing sample data and the bioinformatically converted reliable DNA methylation sequencing sample data; read the alignment result information; read the basic statistical result information;
3.2) Identify each valid methylation site. For example, DNA methylation mainly forms 5-methylcytosine (5-mC), along with small amounts of N6-methyladenine (N6-mA) and 7-methylguanine (7-mG). In eukaryotes, 5-mC mainly occurs in CpG sequences, CpXpG, CCA/TGG and GATC;
3.3) Identify the various specified special methylation regions;
3.4) The CPU calls the deep learning model implemented in hardware on the FPGA to execute the identification of ASMs (allele-specific methylated regions) in parallel. During ASM identification, the CPU is responsible for the flow control, the deep learning model implemented in hardware on the FPGA is responsible for executing the ASM identification in parallel, and data and instructions are exchanged between the CPU and the FPGA. Using a deep learning method for ASM identification supports statistical models based on big data and can achieve more accurate ASM classification and prediction. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data calculation and interpretation workflow; adding the FPGA, with the deep learning model implemented in hardware, allows the deep learning task to be accelerated in parallel;
3.5) Output the methylation identification result information.
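As a rough illustration of the site identification in step 3.2), the sketch below classifies a reference cytosine by its sequence context and estimates a per-site methylation level. It rests on assumptions not stated in the text: the data are treated as bisulfite-style, so that a C read at a reference cytosine counts as methylated and a T as unmethylated; only the CpG and CpXpG (written CHG here) contexts named above are distinguished, with everything else grouped as CHH or "other"; and the function names are hypothetical.

```python
from collections import Counter


def cytosine_context(ref: str, i: int) -> str:
    """Classify the forward-strand cytosine at ref[i] by the two bases after it."""
    ref = ref.upper()
    if ref[i] != "C":
        raise ValueError("position is not a cytosine")
    nxt = ref[i + 1:i + 3]
    if nxt[:1] == "G":
        return "CpG"
    # H stands for A, C or T; anything else (N, end of sequence) is left unclassified.
    if len(nxt) < 2 or nxt[0] not in "ACT":
        return "other"
    if nxt[1] == "G":
        return "CHG"
    if nxt[1] in "ACT":
        return "CHH"
    return "other"


def methylation_level(pileup: str) -> float:
    """Fraction of C calls among the C and T calls covering one reference cytosine."""
    counts = Counter(pileup.upper())
    covered = counts["C"] + counts["T"]
    return counts["C"] / covered if covered else 0.0


reference = "ACGTCAGCTT"
for index in (1, 4, 7):
    print(index, cytosine_context(reference, index))
print(methylation_level("CCTCT"))
```

A real caller would also handle the reverse strand and sequencing errors; this sketch only shows the context test and the ratio that a methylation density statistic could be built on.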
In the present embodiment, the identification of the various specified special methylation regions in step 3.3) comprises two subtasks executed concurrently. Subtask 1: identify hypo-methylated regions, that is, DNA regions with low methylation density and high gene expression, and hyper-methylated regions, that is, DNA regions with high methylation density and low gene expression. Subtask 2: identify differentially methylated regions (DMRs), that is, regions of the genome whose methylation state differs across multiple samples; the CPU calls the identifier programmed on the GPU to execute DMR identification in parallel, thereby identifying inter-individual DMRs (Inter-DMRs). Differentially methylated regions are regarded as functional regions that may take part in regulating gene transcription levels. In subtask 2, the CPU is responsible for the flow control of DMR identification, the identifier programmed on the GPU is responsible for executing DMR identification in parallel, and data and instructions are exchanged between the CPU and the GPU. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data calculation and interpretation workflow; adding the GPU allows the computation-intensive tasks in it to be accelerated in parallel;
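The two subtasks can be pictured with a deliberately simple threshold-based sketch. The text does not describe the algorithm used by the GPU identifier, so the approach below (runs of consecutive sites whose methylation level or between-sample difference crosses a cut-off) is a stand-in chosen for illustration; the cut-offs `0.2`, `0.8` and `0.3`, the minimum run length, and all function names are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Region = Tuple[int, int]  # half-open [start, end) range of site indices


def find_runs(values: Sequence[float], keep: Callable[[float], bool],
              min_sites: int = 3) -> List[Region]:
    """Return runs of at least min_sites consecutive sites for which keep() holds."""
    runs: List[Region] = []
    start = None
    for i, value in enumerate(values):
        if keep(value):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_sites:
                runs.append((start, i))
            start = None
    if start is not None and len(values) - start >= min_sites:
        runs.append((start, len(values)))
    return runs


def hypo_regions(levels: Sequence[float], max_level: float = 0.2) -> List[Region]:
    """Subtask 1 (part): runs of sites with low methylation level."""
    return find_runs(levels, lambda v: v <= max_level)


def hyper_regions(levels: Sequence[float], min_level: float = 0.8) -> List[Region]:
    """Subtask 1 (part): runs of sites with high methylation level."""
    return find_runs(levels, lambda v: v >= min_level)


def dmr_regions(sample_a: Sequence[float], sample_b: Sequence[float],
                min_delta: float = 0.3) -> List[Region]:
    """Subtask 2: runs of sites whose levels differ strongly between two samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same sites")
    diffs = [abs(a - b) for a, b in zip(sample_a, sample_b)]
    return find_runs(diffs, lambda d: d >= min_delta)


sample_a = [0.9, 0.1, 0.05, 0.15, 0.9, 0.85, 0.95, 0.5]
sample_b = [0.9, 0.8, 0.70, 0.90, 0.9, 0.85, 0.90, 0.5]
print(hypo_regions(sample_a), hyper_regions(sample_a), dmr_regions(sample_a, sample_b))
```

Because each site is tested independently before runs are merged, the per-site work is the kind of data-parallel computation that could be offloaded to a GPU, which is the role the text assigns to the identifier.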
As shown in Figure 5, the detailed steps of step 4) include:
4.1) Read the above basic statistical result information and methylation identification result information;
4.2) The CPU calls the GPU and the DSP to visualize the basic statistical result information and the methylation identification result information: the CPU calls programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calls programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining. Visualization presents the meaning of the data in a variety of scientific, intuitive and vivid ways, for example the distribution and proportion of the various methylation sites, or the distribution and proportion of the different methylation regions. In the present embodiment, during this visualization the CPU is responsible for the flow control of the visualization; the GPU is programmed to handle tasks such as video, animation and display, with data and instructions exchanged between the CPU and the GPU; the DSP is programmed to handle tasks such as graphics, images and audio, with data and instructions exchanged between the CPU and the DSP. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data calculation and interpretation workflow; adding the GPU and the DSP, which cooperate with the CPU, allows the multimedia processing tasks to be accelerated in parallel;
4.3) The CPU calls the deep learning model implemented in hardware on the FPGA to execute methylation function analysis and mining in parallel; the CPU also calls programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calls programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining;
Methylation function analysis and mining means that, based on the data analyzed above, the relevant methylation functions are analyzed in greater depth and the analysis is extended beyond existing knowledge to uncover unknown associations. For example, methylation is known to be associated with cancer; this step analyzes in further depth the role of the various methylation patterns in cancer and in the individual cancer subtypes, and explores whether associations exist between methylation and other diseases.
In the present embodiment, when the CPU calls the deep learning model implemented in hardware on the FPGA to execute methylation function analysis and mining in parallel, the CPU is responsible for the flow control of the analysis and mining, the deep learning model implemented in hardware on the FPGA is responsible for executing the analysis and mining in parallel, and data and instructions are exchanged between the CPU and the FPGA. Using a deep learning method for the analysis and mining supports statistical models based on big data and can achieve more accurate analysis and mining. The GPU is programmed to handle tasks such as the video, animation and display related to the analysis and mining, with data and instructions exchanged between the CPU and the GPU; the DSP is programmed to handle tasks such as the graphics, images and audio related to the analysis and mining, with data and instructions exchanged between the CPU and the DSP. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data calculation and interpretation workflow; adding the FPGA, GPU and DSP, which cooperate with the CPU, allows the deep learning and the related multimedia processing tasks to be accelerated in parallel;
4.4) Output the analysis data and the in-depth interpretation report.
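The division of labour described for step 4), with the CPU handling flow control while the FPGA, GPU and DSP work in parallel, can be mimicked in software as a dispatch pattern. The sketch below uses ordinary Python callables as stand-ins for the three devices; it does not call any real FPGA, GPU or DSP interface, and the backend names, the `run_step4` function and the result strings are purely illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Backend = Callable[[dict, dict], str]


def run_step4(stats: dict, calls: dict, backends: Dict[str, Backend]) -> Dict[str, str]:
    """CPU-side flow control: submit every backend task, then collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(fn, stats, calls) for name, fn in backends.items()}
        # result() blocks until the corresponding task has finished.
        return {name: future.result() for name, future in futures.items()}


# Stand-in tasks; each receives the statistics and the identification results.
backends: Dict[str, Backend] = {
    "fpga_function_mining": lambda s, c: f"analyzed {len(c)} region sets",
    "gpu_video_animation": lambda s, c: f"rendered {len(s)} statistics",
    "dsp_graphics_audio": lambda s, c: "exported figures",
}

stats = {"alignment_rate": 0.4, "methylation_density": 0.6}
calls = {"hypo": [(1, 4)], "hyper": [(4, 7)], "dmr": [(1, 4)]}
report = run_step4(stats, calls, backends)
print(report)
```

Submitting all tasks before waiting on any of them is what lets the stand-in backends overlap in time; with real accelerators the same structure would wrap whatever vendor-specific calls move data to and from each device.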
In summary, the DNA methylation sequencing data calculation and interpretation method of the present embodiment meets the requirements that sequencing data calculation and interpretation be fast and real-time, accurate and in-depth, easy to understand, and varied in form, and thereby supports the application of DNA methylation sequencing technology.
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions under the concept of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.
Claims (5)
1. A DNA methylation sequencing data calculation and interpretation method, characterized in that the implementation steps include:
1) preprocessing the reference genome data used for DNA methylation sequencing and the original sequencing sample data;
2) calling, by the CPU, the aligner implemented in hardware on the FPGA to align the preprocessed sequencing sample data against the reference genome;
3) calling, by the CPU, the identifier programmed on the GPU and the deep learning model implemented in hardware on the FPGA to perform methylation identification based on the alignment results;
4) visualizing the result data; calling, by the CPU, the deep learning model implemented in hardware on the FPGA to mine and analyze the methylation functions reflected in the result data; calling, by the CPU, programs on the GPU to handle the video, animation and display tasks related to the analysis and mining; and calling, by the CPU, programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining;
the detailed steps of preprocessing the reference genome data in step 1) include: performing bioinformatic conversion for methylation on the reference genome data; calling, by the CPU, the index generator implemented in hardware on the FPGA to generate, for the bioinformatically converted reference genome data, an index for the subsequent alignment task; and outputting the bioinformatically converted reference genome data and its index;
the detailed steps of preprocessing the original sequencing sample data in step 1) include: performing data quality control on the original sequencing sample data to obtain reliable sample data, the data quality control including trimming the original DNA methylation sequencing sample data to remove adapter sequences and low-quality bases from the reads; performing bioinformatic conversion for methylation on the reliable sample data; and outputting the bioinformatically converted reliable DNA methylation sequencing sample data;
the detailed steps of step 2) include:
2.1) reading the bioinformatically converted reference genome data and its index; reading the trimmed reliable DNA methylation sequencing sample data and the bioinformatically converted reliable DNA methylation sequencing sample data;
2.2) according to the index of the bioinformatically converted reference genome data, calling, by the CPU, the aligner implemented in hardware on the FPGA to perform exact alignment of the bioinformatically converted reliable DNA methylation sequencing sample data against the bioinformatically converted reference genome data, and establishing the mapping relations between the two;
2.3) determining whether the DNA methylation sequencing sample data consists of paired-end reads; if so, jumping to step 2.4); otherwise the reads are single-end, jumping to step 2.5); ambiguous reads are removed directly;
2.4) for paired-end reads, subject to a bounded number of mismatches and a bounded distance between the two reads of a pair, and according to the index of the bioinformatically converted reference genome data, calling again, by the CPU, the aligner implemented in hardware on the FPGA to align the bioinformatically converted reliable DNA methylation sequencing sample data against the bioinformatically converted reference genome data, adding further mapping relations between the two; jumping to step 2.6);
2.5) for single-end reads, subject to a bounded number of mismatches, and according to the index of the bioinformatically converted reference genome data, calling again, by the CPU, the aligner implemented in hardware on the FPGA to align the bioinformatically converted reliable DNA methylation sequencing sample data against the bioinformatically converted reference genome data, adding further mapping relations between the two;
2.6) removing duplicate reads according to the above alignment results;
2.7) generating basic statistical information from the above alignment results, the basic statistical information including at least one of alignment rate statistics and methylation density level statistics;
2.8) outputting the above alignment results and basic statistical information.
2. The DNA methylation sequencing data calculation and interpretation method according to claim 1, characterized in that, in step 1), the preprocessing of the reference genome data used for DNA methylation sequencing and the preprocessing of the original sequencing sample data are executed concurrently by different threads of the CPU.
3. The DNA methylation sequencing data calculation and interpretation method according to claim 1, characterized in that the detailed steps of step 3) include:
3.1) reading the bioinformatically converted reference genome data and its index; reading the trimmed reliable DNA methylation sequencing sample data and the bioinformatically converted reliable DNA methylation sequencing sample data; reading the alignment result information; reading the basic statistical result information;
3.2) identifying each valid methylation site;
3.3) identifying the various specified special methylation regions;
3.4) calling, by the CPU, the deep learning model implemented in hardware on the FPGA to execute ASM identification in parallel;
3.5) outputting the methylation identification result information.
4. The DNA methylation sequencing data calculation and interpretation method according to claim 3, characterized in that the identification of the various specified special methylation regions in step 3.3) comprises two subtasks executed concurrently: subtask 1: identifying hypo-methylated regions, being DNA regions with low methylation density and high gene expression, and hyper-methylated regions, being DNA regions with high methylation density and low gene expression; subtask 2: identifying differentially methylated regions, being regions of the genome whose methylation state differs across multiple samples, wherein the CPU calls the identifier programmed on the GPU to execute DMR identification in parallel so as to identify inter-individual DMRs, and wherein differentially methylated regions are regarded as functional regions that may take part in regulating gene transcription levels.
5. The DNA methylation sequencing data calculation and interpretation method according to claim 4, characterized in that the detailed steps of step 4) include:
4.1) reading the above basic statistical result information and methylation identification result information;
4.2) calling, by the CPU, the GPU and the DSP to visualize the basic statistical result information and the methylation identification result information, wherein the CPU calls programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calls programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining;
4.3) calling, by the CPU, the deep learning model implemented in hardware on the FPGA to execute methylation function analysis and mining in parallel, wherein the CPU calls programs on the GPU to handle the video, animation and display tasks related to the analysis and mining, and calls programs on the DSP to handle the graphics, image and audio tasks related to the analysis and mining;
4.4) outputting the analysis data and the in-depth interpretation report.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710362178.4A CN107273663B (en) | 2017-05-22 | 2017-05-22 | A kind of DNA methylation sequencing data calculating deciphering method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107273663A CN107273663A (en) | 2017-10-20 |
CN107273663B true CN107273663B (en) | 2018-12-11 |
Family
ID=60064456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710362178.4A Active CN107273663B (en) | 2017-05-22 | 2017-05-22 | A kind of DNA methylation sequencing data calculating deciphering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107273663B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019129200A1 (en) * | 2017-12-28 | 2019-07-04 | 安诺优达基因科技(北京)有限公司 | C-site extraction method and apparatus |
CN111627499B (en) * | 2020-05-27 | 2020-12-08 | 广州市基准医疗有限责任公司 | Methylation level vectorization representation and specific sequencing interval detection method and device |
CN114996763B (en) * | 2022-07-28 | 2022-11-15 | 北京锘崴信息科技有限公司 | Private data security analysis method and device based on trusted execution environment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0406769D0 (en) * | 2004-03-25 | 2004-04-28 | Global Genomics Ab | Methods and means for nucleic acid sequencing |
US6934597B1 (en) * | 2002-03-26 | 2005-08-23 | Lsi Logic Corporation | Integrated circuit having integrated programmable gate array and method of operating the same |
CN102776270A (en) * | 2011-05-12 | 2012-11-14 | 中国科学院上海生命科学研究院 | Method and device for detecting DNA methylation |
CN104762402A (en) * | 2015-04-21 | 2015-07-08 | 广州定康信息科技有限公司 | Method for rapidly detecting human genome single base mutation and micro-insertion deletion |
CN105046109A (en) * | 2015-06-26 | 2015-11-11 | 四川云合创智科技有限公司 | Acceleration platform used for biological information sequence analysis |
US9310432B2 (en) * | 2011-07-25 | 2016-04-12 | Cosmin Iorga | Method and system for measuring the impedance of the power distribution network in programmable logic device applications |
CN105483244A (en) * | 2015-12-28 | 2016-04-13 | 武汉菲沙基因信息有限公司 | Super-long genome-based variation detection algorithm and detection system |
CN106021993A (en) * | 2016-05-12 | 2016-10-12 | 北京百迈客云科技有限公司 | Tumor exome sequencing analysis system and method |
CN106295250A (en) * | 2016-07-28 | 2017-01-04 | 北京百迈客医学检验所有限公司 | Method and device is analyzed in the quick comparison of the short sequence of secondary order-checking |
CN106326184A (en) * | 2016-08-23 | 2017-01-11 | 成都卡莱博尔信息技术股份有限公司 | CPU (Central Processing Unit), GPU (Graphic Processing Unit) and DSP (Digital Signal Processor)-based heterogeneous computing framework |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650254B (en) * | 2016-12-16 | 2018-11-20 | 武汉菲沙基因信息有限公司 | A method of based on transcript profile sequencing data detection fusion gene |
Non-Patent Citations (3)
Title |
---|
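"Research on Resequencing of Next-Generation Sequencing Data Based on GPU and Compressed Indexes"; Ying Dequan (应德全); China Master's Theses Full-text Database, Basic Sciences; 20120315; p. A006-78 * |
"A Metagenomic Second-Generation Sequencing Simulation System Based on GPU Parallel Computing"; Xuan Liming (宣黎明); China Master's Theses Full-text Database, Basic Sciences; 20121015; p. A006-329 * |
"Research on Parallel Acceleration of High-Throughput Gene Sequence Alignment Based on Hash Indexes"; Wang Wendi (王文迪); Journal of Computer Research and Development; 20131231; Vol. 50 (No. 11); pp. 2463-3471 * |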
Also Published As
Publication number | Publication date |
---|---|
CN107273663A (en) | 2017-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||