CN107273663B - A method for computing and interpreting DNA methylation sequencing data - Google Patents

A method for computing and interpreting DNA methylation sequencing data

Info

Publication number
CN107273663B
Authority
CN
China
Prior art keywords
data
cpu
methylation
dna methylation
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710362178.4A
Other languages
Chinese (zh)
Other versions
CN107273663A (en)
Inventor
宋卓
刘蓬侠
李根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human And Future Biotechnology (Changsha) Co Ltd
Original Assignee
Human And Future Biotechnology (Changsha) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human And Future Biotechnology (Changsha) Co Ltd
Priority to CN201710362178.4A
Publication of CN107273663A
Application granted
Publication of CN107273663B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for computing and interpreting DNA methylation sequencing data. The implementation steps include: preprocessing the reference genome data and the raw sequencing sample data used for DNA methylation sequencing; having the CPU call a hardware-implemented aligner on an FPGA to align the preprocessed sequencing sample data against the reference genome; having the CPU call an identifier programmed on a GPU and a hardware-implemented deep learning model on the FPGA to perform methylation identification based on the alignment results; and visualizing the result data, with the CPU calling the hardware-implemented deep learning model on the FPGA to mine and analyze the methylation functions reflected by the result data, the CPU calling the GPU to process the video, animation, and display tasks related to analysis and mining, and the CPU calling a DSP to process the graphics, image, and audio tasks related to analysis and mining. The invention has the advantages of being fast and real-time, precise and in-depth, easy to understand, and diverse in form.

Description

A method for computing and interpreting DNA methylation sequencing data
Technical field
The present invention relates to gene sequencing technology, and in particular to a method for computing and interpreting DNA methylation sequencing data.
Background
In recent years, with the widespread use of next-generation sequencing (NGS) technology, the cost of gene sequencing has fallen rapidly, and gene sequencing has found broader application in biology, medicine, health care, criminal investigation, agriculture, and many other fields. Among these, NGS-based DNA (deoxyribonucleic acid) methylation sequencing is a branch with considerable practical value and has attracted wide attention.
Methylation refers to the process in which a methyl group is catalytically transferred from an active methyl compound (such as S-adenosylmethionine) to another compound. Methylation is one of the major research topics of epigenetics. The most common methylation modifications are DNA methylation and histone methylation. In vertebrates, DNA methylation typically occurs at CpG sites, that is, cytosine-phosphate-guanine sites in the DNA sequence, where cytosine is converted into 5-methylcytosine under the catalysis of DNA methyltransferase. About 80%-90% of the CpG sites in human genes are methylated, 1%-2% of the human genome consists of CpG clusters, and CpG methylation is inversely related to transcriptional activity. DNA methylation can change chromatin structure, DNA conformation, DNA stability, and the way DNA interacts with proteins, and can switch off the activity of certain genes, whereas demethylation induces the reactivation and expression of genes. For example, existing research shows that human DNA methylation is closely related to many diseases, such as cancer, aging, and senile dementia, and abnormal methylation is often a cause of disease. DNA methylation detection therefore has great application value in biological research, medical diagnosis, forensic biology, and other fields.
In recent years, scientists have combined traditional DNA methylation detection techniques with targeted genome capture and NGS high-throughput sequencing, and techniques for quantitatively measuring methylation in the genomes of humans and other species have entered practical use. The most widely used at present is bisulfite sequencing (BS-Seq): genomic DNA is treated with bisulfite, which converts unmethylated cytosine into uracil while leaving methylated cytosine unchanged. BSP (bisulfite sequencing PCR) primers are then designed for the polymerase chain reaction (PCR), during which uracil is entirely converted into thymine. Finally, sequencing the PCR product makes it possible to determine whether a CpG site is methylated.
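The effect of bisulfite treatment followed by PCR can be illustrated with a minimal Python sketch. The function name `bisulfite_pcr` and the set of methylated positions are hypothetical inputs chosen for illustration; the sketch models only the base changes described above, not the chemistry or the sequencing itself.

```python
def bisulfite_pcr(seq, methylated):
    """Simulate bisulfite treatment followed by PCR on one strand.

    seq: DNA sequence over A/C/G/T, positions are 0-based
    methylated: set of positions holding methylated cytosines (illustrative input)
    """
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            # unmethylated C becomes U under bisulfite, then T after PCR
            out.append("T")
        else:
            # methylated C and all other bases are left unchanged
            out.append(base)
    return "".join(out)


original = "ACGTCCGA"
converted = bisulfite_pcr(original, methylated={1})
print(converted)  # the C at position 1 survives; the Cs at 4 and 5 read as T
```

Comparing `converted` with `original` position by position is what later allows a surviving C to be read as methylated and a C that became T to be read as unmethylated.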
The data processing flow of NGS-based DNA methylation sequencing consists of two major steps, data computation and data interpretation. The data computation step preprocesses the reference genome and performs computing tasks on the raw sequencing data such as artifact removal, alignment, and deduplication, producing data for use during interpretation. The data interpretation step analyzes, reveals, and explains the scientific significance of the computed data in fields such as biology, medicine, and health care.
At present, the application of NGS-based DNA methylation sequencing faces two bottlenecks:
The first bottleneck is that the capacity to produce sequencing data far exceeds the capacity to process it. For example, Methy-Pipe, a commonly used software package for computing and interpreting NGS-based DNA methylation sequencing data, takes about 5 hours on a 12-core Intel Xeon processor to perform just one task of the whole pipeline, alignment, on a typical single sample containing 300M short sequencing fragments (reads) with a read length of 75 base pairs (bp), whereas an Illumina HiSeq 4000 sequencer can produce 200M reads with a read length of 300 bp within 5 hours. On the one hand, the raw data generated by sequencing grows by a factor of 3 to 5 every year, far outpacing Moore's Law, and computing and interpreting sequencing data is both input/output-intensive and compute-intensive, so real-time, accurate computation, interpretation, and transmission of sequencing data has become extremely difficult. On the other hand, typical methods for computing and interpreting sequencing data still rely mainly on high-performance central processing units (CPUs) running multithreaded software. With accuracy guaranteed, however, the acceleration obtainable this way still cannot meet the challenge described above, so this approach is not sustainable.
The second bottleneck is that the depth and breadth of sequencing data interpretation cannot meet the needs of researchers, while its readability cannot meet the needs of the general public. Typical interpretation methods are based on a single reference genome. However, the reference genomes in current use are themselves built from a limited number of samples; they neither adequately represent the diversity of the whole species nor are they complete, which introduces bias when the data are computed and interpreted. Interpretation also lacks broad, in-depth cross-analysis with other biological and medical information, making it hard to meet the needs of professional researchers. In addition, interpretation of sequencing data remains largely confined to the professional domain and lacks readability for a non-specialist audience; that is, it lacks easy-to-understand and varied presentations of the direct biological meaning and indirect health implications of the sequencing data.
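The throughput gap behind the first bottleneck can be checked with simple arithmetic on the figures quoted above. The short Python sketch below uses total bases as a rough proxy for workload, which is an assumption made here for illustration; alignment cost does not scale exactly with base count.

```python
# Figures quoted in the text for the first bottleneck.
aligned_reads, aligned_len = 300_000_000, 75      # sample aligned in about 5 hours
produced_reads, produced_len = 200_000_000, 300   # sequencer output within 5 hours

aligned_bases = aligned_reads * aligned_len       # bases aligned in about 5 hours
produced_bases = produced_reads * produced_len    # bases produced within 5 hours

ratio = produced_bases / aligned_bases
print(f"aligned:  {aligned_bases / 1e9:.1f} Gbp")
print(f"produced: {produced_bases / 1e9:.1f} Gbp")
print(f"production outpaces alignment by about {ratio:.2f}x")
```

Under this proxy, a single sequencer produces between two and three times as many bases as the quoted CPU configuration aligns in the same period, and alignment is only one task in the pipeline.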
The processor types commonly used in information processing today are the central processing unit (CPU), the field programmable gate array (FPGA), the graphics processing unit (GPU), and the digital signal processor (DSP). A high-performance CPU usually contains multiple processor cores and supports multithreading in hardware, but it is still designed for general-purpose applications, which, compared with specialized computation, offer less parallelism and require more complex control with lower performance targets. The on-chip hardware resources of a CPU are therefore devoted mainly to complex control rather than computation; a CPU includes no dedicated hardware for specific functions and supports only limited computational parallelism. An FPGA is a semi-custom circuit. Its advantages are that FPGA-based system development has a short design cycle and low development cost, that power consumption is low, and that the configuration can be modified after production, giving high design flexibility and low design risk. Its disadvantages are that, for the same function, an FPGA is generally slower than an application-specific integrated circuit (ASIC) and occupies more circuit area. As the technology evolves, FPGAs are moving toward higher density, larger capacity, lower power consumption, and the integration of more hard intellectual property (IP) cores, so their disadvantages are shrinking and their advantages are growing. Compared with a CPU, an FPGA allows parallel computation to be implemented, modified, and extended in a hardware description language. The GPU was originally a microprocessor dedicated to image processing, supporting basic graphics tasks such as texture mapping and polygon shading in hardware. Because graphics computation involves general mathematical operations such as matrix and vector arithmetic, and because the GPU has a highly parallel architecture, GPU computing has risen along with the relevant hardware and software: the GPU is no longer limited to graphics and is also used for parallel computation in linear algebra, signal processing, numerical simulation, and similar areas, where it can deliver tens or even hundreds of times the performance of a CPU. Current GPUs nevertheless have two problems: first, constrained by the GPU hardware architecture, many parallel algorithms cannot run efficiently on a GPU; second, a GPU generates a great deal of heat in operation and consumes considerable energy. A DSP is a microprocessor that uses digital methods to rapidly analyze, transform, filter, detect, modulate, and demodulate signals. To this end, the DSP is specially optimized in its internal chip structure, for example with hardware implementations of high-speed, high-precision multiplication. With the arrival of the digital age, DSPs are widely used in smart devices, resource exploration, digital control, biomedicine, aerospace, and other fields; they offer low power consumption and high precision and can perform two-dimensional and multidimensional processing. In summary, each of these four computing devices has its own strengths and its own limitations.
Given the two bottlenecks facing the application of NGS-based DNA methylation sequencing, how to use the above processors to compute and interpret massive sequencing data in a way that is fast and real-time, precise and in-depth, easy to understand, and diverse in form has become a key technical problem that urgently needs to be solved.
Summary of the invention
The technical problem to be solved by the present invention: in view of the above problems in the prior art, to provide a method for computing and interpreting DNA methylation sequencing data that is fast and real-time, precise and in-depth, easy to understand, and diverse in form.
To solve the above technical problem, the present invention adopts the following technical solution:
A method for computing and interpreting DNA methylation sequencing data, whose implementation steps include:
1) preprocessing the reference genome data and the raw sequencing sample data used for DNA methylation sequencing;
2) having the CPU call the hardware-implemented aligner on the FPGA to align the preprocessed sequencing sample data against the reference genome;
3) having the CPU call the identifier programmed on the GPU and the hardware-implemented deep learning model on the FPGA to perform methylation identification based on the alignment results;
4) visualizing the result data; having the CPU call the hardware-implemented deep learning model on the FPGA to mine and analyze the methylation functions reflected by the result data; having the CPU call the GPU to process the video, animation, and display tasks related to analysis and mining; and having the CPU call the DSP to process the graphics, image, and audio tasks related to analysis and mining.
Preferably, the detailed steps for preprocessing the reference genome data in step 1) include: performing an in silico conversion of the reference genome data for methylation; having the CPU call the hardware-implemented index generator on the FPGA to generate, for the converted reference genome data, the index used by the subsequent alignment task; and outputting the converted reference genome data and its index.
Preferably, the detailed steps for preprocessing the raw sequencing sample data in step 1) include: performing data quality control on the raw sequencing sample data to obtain reliable sample data, where the quality control includes trimming the raw DNA methylation sequencing sample data to remove adapter sequences and low-quality bases from the reads; performing an in silico conversion of the reliable sample data for methylation; and outputting the converted reliable DNA methylation sequencing sample data.
Preferably, in step 1) the preprocessing of the reference genome data and of the raw sequencing sample data used for DNA methylation sequencing is executed in parallel on the CPU in different threads.
Preferably, the detailed steps of step 2) include:
2.1) reading the in silico converted reference genome data and its index; reading the trimmed reliable DNA methylation sequencing sample data and the in silico converted reliable DNA methylation sequencing sample data;
2.2) according to the index of the converted reference genome data, having the CPU call the hardware-implemented aligner on the FPGA to align the converted reliable sample data exactly against the converted reference genome data, establishing the mapping between the converted reliable sample data and the converted reference genome data;
2.3) determining whether the DNA methylation sequencing sample data consists of paired-end reads; if so, jumping to step 2.4); otherwise the reads are single-end, and jumping to step 2.5); ambiguous reads are removed directly;
2.4) for paired-end reads, under the conditions that the number of mismatches is bounded and the distance between the two reads of a pair is bounded, and according to the index of the converted reference genome data, having the CPU again call the hardware-implemented aligner on the FPGA to align the converted reliable sample data against the converted reference genome data, adding to the mapping between the converted reliable sample data and the converted reference genome data; jumping to step 2.6);
2.5) for single-end reads, under the condition that the number of mismatches is bounded, and according to the index of the converted reference genome data, having the CPU again call the hardware-implemented aligner on the FPGA to align the converted reliable sample data against the converted reference genome data, adding to the mapping between the converted reliable sample data and the converted reference genome data;
2.6) removing duplicate reads according to the above alignment results;
2.7) generating basic statistics from the above alignment results, the basic statistics including at least one of alignment rate statistics and methylation density level statistics;
2.8) outputting the above alignment results and basic statistics.
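Steps 2.6) and 2.7) above can be sketched in Python as follows. The source does not define what makes two reads duplicates, so the sketch assumes that reads aligned to the same chromosome, position, and strand are duplicates; the `Alignment` record and both function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    read_id: str
    chrom: Optional[str]  # None if the read did not align
    pos: int              # 0-based leftmost position (-1 if unaligned)
    strand: str           # "+" or "-"


def remove_duplicates(alignments):
    """Keep the first aligned read seen at each (chrom, pos, strand)."""
    seen, kept = set(), []
    for aln in alignments:
        if aln.chrom is None:
            continue  # unaligned reads carry no position to compare
        key = (aln.chrom, aln.pos, aln.strand)
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept


def alignment_rate(alignments):
    """Fraction of all reads that aligned to the reference."""
    if not alignments:
        return 0.0
    aligned = sum(1 for a in alignments if a.chrom is not None)
    return aligned / len(alignments)


alns = [
    Alignment("r1", "chr1", 100, "+"),
    Alignment("r2", "chr1", 100, "+"),  # same position and strand as r1
    Alignment("r3", "chr1", 250, "-"),
    Alignment("r4", None, -1, "+"),     # unaligned
]
unique = remove_duplicates(alns)
rate = alignment_rate(alns)
print([a.read_id for a in unique], rate)
```

Keeping the first read at each position is one simple policy; keeping the highest-quality read is another common choice and would only change the body of `remove_duplicates`.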
Preferably, the detailed steps of step 3) include:
3.1) reading the in silico converted reference genome data and its index; reading the trimmed reliable DNA methylation sequencing sample data and the in silico converted reliable DNA methylation sequencing sample data; reading the above alignment results; reading the above basic statistics;
3.2) identifying each valid methylation site;
3.3) identifying the various specified special methylation regions;
3.4) having the CPU call the hardware-implemented deep learning model on the FPGA to perform ASM identification in parallel;
3.5) outputting the methylation identification results.
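The source does not spell out how step 3.2) identifies methylation sites. A minimal Python sketch of the conventional bisulfite logic, assuming a forward-strand read whose alignment offset is already known, compares the unconverted read with the unconverted reference at each CpG cytosine; the function name and inputs are hypothetical.

```python
def call_cpg_methylation(reference, read, offset):
    """Call methylation at CpG cytosines covered by one forward-strand read.

    reference: original (unconverted) reference sequence
    read: original (unconverted) bisulfite read on the same strand
    offset: 0-based reference position where the read aligns
    Returns {reference position: True if methylated, False if unmethylated}.
    """
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos:pos + 2] != "CG":
            continue            # only CpG cytosines are reported
        if base == "C":
            calls[pos] = True   # C survived conversion, so it was methylated
        elif base == "T":
            calls[pos] = False  # C was converted, so it was unmethylated
        # any other base is a mismatch and is left uncalled
    return calls


reference = "TTACGGACGTCA"
read = "ACGGATGT"  # aligns at reference position 2
calls = call_cpg_methylation(reference, read, offset=2)
level = sum(calls.values()) / len(calls)
print(calls, level)
```

This is why step 3.1) reads both the trimmed and the converted sample data: the converted reads locate the alignment, while the trimmed, unconverted reads still carry the C or T needed for the call.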
Preferably, the identification of the various specified special methylation regions in step 3.3) comprises two subtasks executed in parallel. Subtask 1: identifying hypomethylated regions, that is, DNA regions with low methylation density and high gene expression, and hypermethylated regions, that is, DNA regions with high methylation density and low gene expression. Subtask 2: having the CPU identify differentially methylated regions (DMRs), that is, regions of the genome whose methylation state differs across samples, and having the CPU call the identifier programmed on the GPU to perform DMR identification in parallel so as to identify DMRs between individuals; differentially methylated regions are regarded as functional regions that may participate in regulating gene transcription levels.
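The two subtasks can be illustrated with a minimal Python sketch that works on per-CpG methylation levels between 0 and 1. The thresholds `LOW`, `HIGH`, and `MIN_DIFF` are illustrative assumptions, not values from the source, and the sketch looks only at methylation density; it ignores the gene expression criterion and the GPU offloading described above.

```python
from statistics import mean

# Illustrative thresholds only; the source gives no numeric cut-offs.
LOW, HIGH, MIN_DIFF = 0.2, 0.8, 0.3


def classify_region(levels):
    """Label a region by the mean methylation level of its CpG sites."""
    density = mean(levels)
    if density <= LOW:
        return "hypomethylated"
    if density >= HIGH:
        return "hypermethylated"
    return "intermediate"


def is_dmr(levels_a, levels_b):
    """Flag a region whose mean methylation differs between two samples."""
    return abs(mean(levels_a) - mean(levels_b)) >= MIN_DIFF


sample_a = [0.05, 0.10, 0.15]  # one region measured in sample A
sample_b = [0.85, 0.90, 0.95]  # the same region measured in sample B
print(classify_region(sample_a), classify_region(sample_b), is_dmr(sample_a, sample_b))
```

A production DMR caller would normally add a statistical test across replicates; the fixed difference used here only shows the shape of the comparison.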
Preferably, the detailed steps of step 4) include:
4.1) reading the above basic statistics and methylation identification results;
4.2) having the CPU call the GPU and the DSP to visualize the basic statistics and the methylation identification results, with the CPU calling the GPU to process the video, animation, and display tasks related to analysis and mining and calling the DSP to process the graphics, image, and audio tasks related to analysis and mining;
4.3) having the CPU call the hardware-implemented deep learning model on the FPGA to perform methylation functional analysis and mining in parallel, with the CPU calling the GPU to process the video, animation, and display tasks related to analysis and mining and calling the DSP to process the graphics, image, and audio tasks related to analysis and mining;
4.4) outputting the analysis data and the in-depth interpretation report.
The method of the present invention for computing and interpreting DNA methylation sequencing data has the following advantages:
1. For each time-consuming bottleneck in the computation and interpretation pipeline, it applies targeted parallel acceleration based on the parallelism of the task's own algorithm or model and on the characteristics of the four processor types (CPU, FPGA, GPU, and DSP), improving the real-time performance of computing and interpreting DNA methylation sequencing data.
2. For methylation identification and for methylation functional analysis and mining in the pipeline, it introduces deep learning models based on the goals of the tasks themselves and on the characteristics of the four processor types (CPU, FPGA, GPU, and DSP), accelerating and enriching the deep learning processing of the source data and improving the depth and breadth of the computation and interpretation.
3. For data visualization in the pipeline, it combines the characteristics of three processor types (CPU, GPU, and DSP), which cooperate to complete the visualization processing, improving the real-time performance and enriching the diversity of DNA methylation sequencing data visualization.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall flow of computing and interpreting DNA methylation sequencing data in an embodiment of the present invention.
Fig. 2 is a schematic diagram of the preprocessing flow in an embodiment of the present invention.
Fig. 3 is a schematic diagram of the data alignment flow in an embodiment of the present invention.
Fig. 4 is a schematic diagram of the methylation identification flow in an embodiment of the present invention.
Fig. 5 is a schematic diagram of the flow for visualizing identification data and for methylation functional analysis and mining in an embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 1, the implementation steps of the method for computing and interpreting DNA methylation sequencing data in this embodiment include:
1) preprocessing the reference genome data and the raw sequencing sample data used for DNA methylation sequencing;
2) having the CPU call the hardware-implemented aligner on the FPGA to align the preprocessed sequencing sample data against the reference genome (alignment); this step uses two processor types, the CPU and the FPGA;
3) having the CPU call the identifier programmed on the GPU and the hardware-implemented deep learning (DL) model on the FPGA to perform methylation identification based on the alignment results; this step uses three processor types, the CPU, the FPGA, and the GPU;
4) visualizing the result data; having the CPU call the hardware-implemented deep learning model on the FPGA to mine and analyze the methylation functions reflected by the result data; having the CPU call the GPU to process the video, animation, and display tasks related to analysis and mining; and having the CPU call the DSP to process the graphics, image, and audio tasks related to analysis and mining. This step uses all four processor types: the CPU, the FPGA, the GPU, and the DSP.
As shown in Fig. 1, in this embodiment steps 1) and 2) carry out the computation tasks for the DNA methylation sequencing data, and steps 3) and 4) carry out the interpretation tasks. In the detailed step descriptions below, the CPU is used by default unless stated otherwise.
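The division of the four steps between the computation and interpretation phases, and the processors the description assigns to each step, can be summarized in a small Python table. The `PIPELINE` structure and the `processors_for` helper are hypothetical names used only for this summary; the CPU and FPGA entry for step 1) follows the description of reference preprocessing given for this embodiment.

```python
# (step, phase, processors named in the description)
PIPELINE = [
    ("1) preprocessing", "computation", {"CPU", "FPGA"}),
    ("2) alignment", "computation", {"CPU", "FPGA"}),
    ("3) methylation identification", "interpretation", {"CPU", "FPGA", "GPU"}),
    ("4) visualization and mining", "interpretation", {"CPU", "FPGA", "GPU", "DSP"}),
]


def processors_for(phase):
    """Union of the processors used by all steps of one phase."""
    used = set()
    for _, step_phase, procs in PIPELINE:
        if step_phase == phase:
            used |= procs
    return used


print(sorted(processors_for("computation")))
print(sorted(processors_for("interpretation")))
```

The CPU appears in every step because it handles flow control and calls the other devices, which matches the default stated above.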
Step 1) comprises two subtasks executed in parallel: preprocessing the reference genome and preprocessing the raw DNA methylation sequencing sample data. As shown in Fig. 2, in this embodiment the preprocessing of the reference genome data and of the raw sequencing sample data in step 1) is executed in parallel on the CPU in different threads (thread 1 and thread 2).
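The two-thread structure can be sketched in Python with the standard library. The two worker functions are placeholders and the file names are hypothetical; the sketch shows only how the two preprocessing subtasks are submitted and joined. In CPython, threads do not speed up CPU-bound pure-Python code, so a real implementation would rely on native code or separate processes for the heavy work.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_reference(path):
    # placeholder for in silico conversion and index generation
    return f"converted+indexed:{path}"


def preprocess_sample(path):
    # placeholder for quality control, trimming, and in silico conversion
    return f"trimmed+converted:{path}"


# Thread 1 handles the reference genome, thread 2 the sequencing sample.
with ThreadPoolExecutor(max_workers=2) as pool:
    ref_future = pool.submit(preprocess_reference, "reference.fa")
    sample_future = pool.submit(preprocess_sample, "sample.fastq")
    reference_out = ref_future.result()
    sample_out = sample_future.result()

print(reference_out, sample_out)
```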
Referring to Fig. 2, the detailed steps for preprocessing the reference genome data in step 1) include: performing an in silico conversion of the reference genome data for methylation; having the CPU call the hardware-implemented index generator on the FPGA to generate, for the converted reference genome data, the index used by the subsequent alignment task; and outputting the converted reference genome data and its index. This step uses two processor types, the CPU and the FPGA. In the in silico conversion of the reference genome data for methylation, if BS-Seq sequencing is used, every C in the reference genome data that represents an unmethylated cytosine must be converted to a T representing thymine. If the DNA is double-stranded (Watson and Crick strands), both strands must be converted. When the index for the subsequent alignment task is generated for the converted reference genome data, the CPU is responsible for the flow control of index generation, the hardware-implemented index generator on the FPGA generates the index in parallel, and data and instructions are exchanged between the CPU and the FPGA. With the CPU alone, this step is one of the time-consuming bottlenecks of the whole computation and interpretation pipeline; adding the FPGA accelerates the compute-intensive work in parallel. A given reference genome is relatively stable over a period of time, so the index can be generated once and reused across similar applications; once the reference genome data is updated, however, a new index must be generated.
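The in silico conversion of both strands can be sketched in Python as follows. A reference sequence carries no methylation information, so this sketch converts every C; the function names are hypothetical, and index generation, which the description assigns to the FPGA, is not modeled.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def convert_reference(seq):
    """Return the C->T converted Watson and Crick strands of a reference."""
    watson = seq.upper()
    crick = reverse_complement(watson)
    return watson.replace("C", "T"), crick.replace("C", "T")


watson_ct, crick_ct = convert_reference("ACGTCG")
print(watson_ct, crick_ct)
```

Each converted strand would then be indexed separately, so that a converted read can be matched against either strand.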
Referring to Fig. 2, the detailed steps for preprocessing the raw sequencing sample data in step 1) include: performing data quality control on the raw sequencing sample data to obtain reliable sample data (clean data), where the quality control includes trimming the raw DNA methylation sequencing sample data to remove adapter sequences and low-quality bases from the reads; performing an in silico conversion of the reliable sample data for methylation; and outputting the converted reliable DNA methylation sequencing sample data. In the in silico conversion of the trimmed reliable sample data for methylation, if BS-Seq sequencing is used, every C in the reliable sample data, representing cytosine, must be converted to a T representing thymine.
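Trimming and read conversion can be sketched in Python as follows. The adapter string, the Phred threshold `min_qual=20`, and the function names are illustrative assumptions; the source states only that adapter sequences and low-quality bases are removed before the conversion.

```python
def trim_read(seq, quals, adapter, min_qual=20):
    """Remove an adapter and low-quality 3' bases from one read.

    quals is a list of Phred scores, one per base; adapter and min_qual
    are illustrative values, not taken from the source.
    """
    cut = seq.find(adapter)
    if cut != -1:
        # drop the adapter and everything after it
        seq, quals = seq[:cut], quals[:cut]
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1  # walk back over low-quality bases at the 3' end
    return seq[:end], quals[:end]


def convert_read(seq):
    """In silico conversion of a read: every C becomes T."""
    return seq.replace("C", "T")


seq = "ACGTCCGAAGATCGG"
quals = [35] * 6 + [10, 8] + [30] * 7
trimmed, trimmed_quals = trim_read(seq, quals, adapter="AGATCGG")
converted = convert_read(trimmed)
print(trimmed, converted)
```

Both `trimmed` and `converted` are kept, since the alignment step reads the trimmed data as well as the converted data.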
As shown in figure 3, the detailed step of step 2 includes:
2.1) reference genomic data and its index after reading above-mentioned raw letter conversion;DNA first after reading above-mentioned trimming Reliable sample data is sequenced in the DNA methylation that base is sequenced after reliable sample data and raw letter conversion;
2.2) according to the index of the reference genomic data after above-mentioned raw letter conversion, call hardware on FPGA real by CPU The reference after reliable sample data and above-mentioned raw letter conversion is sequenced in DNA methylation after above-mentioned raw letter conversion by existing comparative device 1 Genomic data carries out precise alignment, and reliable sample data and above-mentioned life is sequenced in the DNA methylation after establishing above-mentioned raw letter conversion The mapping relations between reference genomic data after letter conversion;Hard-wired comparative device 1 on FPGA is called to carry out by CPU When comparison, CPU is responsible for the Row control of data precise alignment, and the upper hard-wired comparative device 1 of FPGA is responsible for parallel execution of data Precise alignment has data and instruction interaction between CPU and FPGA.When only using CPU, the step for be that entire DNA methylation is surveyed Ordinal number is according to one of the time-consuming bottleneck interpreted in process is calculated, and the present embodiment is by being added hard-wired comparative device 1 on FPGA, energy Computation-intensive task therein is completed in enough parallel acceleration.
2.3) judge whether DNA methylation sequencing sample data is both-end (paired-end) reads, if it is both-end Reads is then jumped and is executed step 2.4);Otherwise it is single-ended (single-end) reads, jumps and execute step 2.5);It is indefinite (ambiguous) reads is then directly removed;
2.4) for both-end reads, in mismatch (mismatches) number controlled (such as no more than 2) and both-end Under the conditions of reading between reads is away from controlled (such as between 50 to 600 bases), according to the reference after above-mentioned raw letter conversion The index of genomic data calls hard-wired comparative device 2 on FPGA that above-mentioned life is believed to the DNA after conversion again by CPU The reference genomic data that methylation is sequenced after reliable sample data and above-mentioned raw letter conversion is compared, and above-mentioned life is established in increase The mapping between the reference genomic data after reliable sample data and above-mentioned raw letter conversion is sequenced in DNA methylation after letter conversion Relationship;It jumps and executes step 2.6);In the present embodiment by the DNA methylation after above-mentioned raw letter conversion be sequenced reliable sample data and When reference genomic data after above-mentioned raw letter conversion is compared, CPU is responsible for the Row control of comparing, the upper hardware of FPGA The comparative device 2 of realization is responsible for parallel execution of data comparison, there is data and instruction interaction between CPU and FPGA.When only using CPU, The step for be that entire DNA methylation sequencing data calculates one of the time-consuming bottleneck interpreted in process, FPGA is added, can be parallel Accelerate to complete computation-intensive task therein;
2.5) For single-end reads, under the condition that the number of mismatches is bounded (typically no more than 2), the CPU again uses the index of the converted reference genome data and calls aligner 2, implemented in hardware on the FPGA, to align the converted, reliable DNA methylation sequencing sample data against the converted reference genome data, adding further mappings between the two. During this alignment the CPU is responsible for flow control of the alignment, while aligner 2 on the FPGA executes the alignment in parallel; data and instructions are exchanged between the CPU and the FPGA. With a CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data computation and interpretation workflow; adding the FPGA allows the compute-intensive part of the task to be completed with parallel acceleration;
2.6) Based on the above alignment results, remove duplicate reads;
2.7) Based on the above alignment results, generate basic statistics, which include at least one of alignment rate statistics and methylation density level statistics;
2.8) Output the above alignment results and basic statistics.
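Steps 2.2) to 2.7) can be illustrated in software with a minimal sketch. The patent does not disclose the internal design of the FPGA aligners, so everything below is an assumption made for illustration: the C-to-T conversion stands in for the bioinformatic conversion, a k-mer hash table stands in for the index, only the forward strand is handled, and all function names and the `k`, `min_span` and `max_span` parameters are hypothetical. The mismatch-tolerant pass relies on each read containing at least `max_mismatch + 1` disjoint seeds.

```python
from collections import defaultdict


def c_to_t(seq):
    """Three-base conversion: replace every C with T (forward strand only)."""
    return seq.upper().replace("C", "T")


def build_index(ref, k):
    """Convert the reference and index every k-mer by its start positions."""
    conv = c_to_t(ref)
    index = defaultdict(list)
    for i in range(len(conv) - k + 1):
        index[conv[i:i + k]].append(i)
    return conv, index


def align(read, conv_ref, index, k, max_mismatch):
    """Return sorted start positions with at most max_mismatch mismatches.

    Seeds are disjoint k-mers of the converted read. If the read holds at
    least max_mismatch + 1 seeds, one of them must match exactly.
    """
    conv_read = c_to_t(read)
    n = len(conv_read)
    hits = set()
    for off in range(0, n - k + 1, k):
        for pos in index.get(conv_read[off:off + k], ()):
            start = pos - off
            if 0 <= start <= len(conv_ref) - n:
                window = conv_ref[start:start + n]
                if sum(a != b for a, b in zip(conv_read, window)) <= max_mismatch:
                    hits.add(start)
    return sorted(hits)


def map_reads(reads, ref, k=4, max_mismatch=2):
    """Exact pass first (step 2.2), then a mismatch-tolerant pass (2.4/2.5)."""
    conv_ref, index = build_index(ref, k)
    mapping = {}
    for name, seq in reads.items():
        hits = align(seq, conv_ref, index, k, 0)
        if not hits:
            hits = align(seq, conv_ref, index, k, max_mismatch)
        if len(hits) == 1:  # zero hits: unmapped; several hits: ambiguous
            mapping[name] = hits[0]
    return mapping


def pair_is_consistent(start1, len1, start2, len2, min_span=50, max_span=600):
    """Keep a read pair only if its outer span lies in [min_span, max_span]."""
    span = max(start1 + len1, start2 + len2) - min(start1, start2)
    return min_span <= span <= max_span


def remove_duplicates(mapping, reads):
    """Keep one read per (start position, sequence) pair (step 2.6)."""
    seen, kept = set(), {}
    for name in sorted(mapping):
        key = (mapping[name], reads[name])
        if key not in seen:
            seen.add(key)
            kept[name] = mapping[name]
    return kept


def alignment_rate(mapping, reads):
    """Fraction of input reads that received a unique mapping (step 2.7)."""
    return len(mapping) / len(reads) if reads else 0.0
```

In this sketch the per-seed loop in `align` is the part that an accelerator could execute in parallel, while `map_reads` plays the flow-control role the description assigns to the CPU.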
In the present embodiment, step 3), methylation identification based on the alignment results, uses three kinds of processors: CPU, FPGA and GPU. As shown in Figure 4, the detailed steps of step 3) include:
3.1) Read the converted reference genome data and its index; read the trimmed, reliable DNA methylation sequencing sample data and the bioinformatically converted, reliable DNA methylation sequencing sample data; read the alignment result information; read the basic statistics;
3.2) Identify each valid methylation site. For example, DNA methylation mainly forms 5-methylcytosine (5-mC), together with small amounts of N6-methyladenine (N6-mA) and 7-methylguanine (7-mG). In eukaryotes, 5-mC mainly appears in CpG, CpXpG, CCA/TGG and GATC sequences;
3.3) Identify the various specified special methylation regions;
3.4) The CPU calls the deep learning model implemented in hardware on the FPGA to execute ASM (allele-specific methylated region) identification in parallel. During ASM identification the CPU is responsible for flow control, while the hardware-implemented deep learning model on the FPGA executes the identification in parallel; data and instructions are exchanged between the CPU and the FPGA. Using a deep learning method for ASM identification supports statistical models based on big data and can achieve more accurate ASM classification and prediction. With a CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data computation and interpretation workflow; adding the FPGA, with the deep learning model implemented in hardware, allows the deep learning task to be completed with parallel acceleration;
3.5) Output the methylation identification result information.
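The site identification of step 3.2) can be sketched as follows. This is only an illustration under stated assumptions, not the method disclosed in the patent: reads are assumed to be bisulfite-style (an unconverted C at a reference cytosine counts as methylated, a T as unmethylated), only the forward strand is considered, the context labels `CpG`, `CHG` and `CHH` are a common convention that the patent does not spell out, and the names `context`, `call_sites` and `min_depth` are hypothetical.

```python
from collections import defaultdict


def context(ref, pos):
    """Classify the forward-strand context of the cytosine at ref[pos]."""
    nxt = ref[pos + 1:pos + 3]
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CpG"
    if len(nxt) == 2 and nxt[1] == "G":
        return "CHG"
    if len(nxt) == 2:
        return "CHH"
    return "unknown"  # too close to the end of the reference


def call_sites(ref, alignments, min_depth=2):
    """Compute a methylation level for every sufficiently covered cytosine.

    alignments: iterable of (start, read_sequence) pairs, where the read
    keeps its original (unconverted) bases.
    Returns {position: (context, methylated_reads / covering_reads)}.
    """
    methylated = defaultdict(int)
    depth = defaultdict(int)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos < len(ref) and ref[pos] == "C" and base in "CT":
                depth[pos] += 1
                if base == "C":
                    methylated[pos] += 1
    sites = {}
    for pos in sorted(depth):
        if depth[pos] >= min_depth:
            sites[pos] = (context(ref, pos), methylated[pos] / depth[pos])
    return sites
```

The `min_depth` threshold is one simple way to interpret "valid" sites: positions covered by too few reads are not reported.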
In the present embodiment, the identification of the various specified special methylation regions in step 3.3) comprises 2 subtasks that are executed concurrently. Subtask ①: identify hypo-methylated regions (DNA regions with low methylation density and high gene expression) and hyper-methylated regions (DNA regions with high methylation density and low gene expression). Subtask ②: the CPU identifies differentially methylated regions (DMRs), that is, regions of the genome whose methylation state differs across multiple samples, and calls the identifier implemented by programming on the GPU to execute DMR identification in parallel, thereby identifying inter-individual DMRs (Inter-DMRs); differentially methylated regions are regarded as functional regions that may take part in regulating gene transcription levels. In subtask ②, the CPU is responsible for flow control of DMR identification, while the identifier programmed on the GPU executes the DMR identification in parallel; data and instructions are exchanged between the CPU and the GPU. With a CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data computation and interpretation workflow; adding the GPU allows the compute-intensive part of the task to be completed with parallel acceleration;
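The two subtasks of step 3.3) can be sketched with fixed-size windows. The patent does not state how regions are delimited or which thresholds are used, so the window size, the `low`, `high` and `min_diff` thresholds, and the function names below are assumptions made for illustration; gene expression is not modeled, and a plain Python loop stands in for the GPU-programmed identifier.

```python
def window_means(levels, window):
    """Average per-site methylation levels inside fixed-size windows.

    levels: {position: methylation level in [0, 1]}.
    Returns {window_start: mean level}, ordered by window start.
    """
    sums, counts = {}, {}
    for pos, level in levels.items():
        start = (pos // window) * window
        sums[start] = sums.get(start, 0.0) + level
        counts[start] = counts.get(start, 0) + 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


def classify_regions(levels, window=100, low=0.2, high=0.8):
    """Subtask 1: label windows as hypo- or hyper-methylated."""
    labels = {}
    for start, mean in window_means(levels, window).items():
        if mean <= low:
            labels[start] = "hypo"
        elif mean >= high:
            labels[start] = "hyper"
    return labels


def find_dmrs(levels_a, levels_b, window=100, min_diff=0.5):
    """Subtask 2: windows whose mean level differs between two samples."""
    means_a = window_means(levels_a, window)
    means_b = window_means(levels_b, window)
    shared = sorted(means_a.keys() & means_b.keys())
    return [w for w in shared if abs(means_a[w] - means_b[w]) >= min_diff]
```

Each window is processed independently of the others, which is the property that makes this kind of computation suitable for the parallel execution the description assigns to the GPU.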
As shown in Figure 5, the detailed steps of step 4) include:
4.1) Read the above basic statistics and methylation identification result information;
4.2) The CPU calls the GPU and the DSP to visualize the basic statistics and the methylation identification result information: the CPU calls the GPU, which is programmed to process the video, animation and display tasks related to analysis and mining, and the CPU calls the DSP, which is programmed to process the graphics, image and audio tasks related to analysis and mining. Visualization presents the meaning of the data in a variety of scientific, intuitive and vivid ways, for example the distribution and proportion of the various methylation sites, or the distribution and proportion of the different methylation regions. During visualization in the present embodiment, the CPU is responsible for flow control of the visualization; the GPU is programmed to process tasks such as video, animation and display, with data and instructions exchanged between the CPU and the GPU; the DSP is programmed to process tasks such as graphics, images and audio, with data and instructions exchanged between the CPU and the DSP. With a CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data computation and interpretation workflow; adding the GPU and the DSP, which cooperate with the CPU, allows the multimedia processing tasks to be completed with parallel acceleration;
4.3) The CPU calls the deep learning model implemented in hardware on the FPGA to execute methylation function analysis and mining in parallel; the CPU also calls the GPU, which is programmed to process the video, animation and display tasks related to analysis and mining, and calls the DSP, which is programmed to process the graphics, image and audio tasks related to analysis and mining;
Methylation function analysis and mining means analyzing the relevant methylation functions in greater depth on the basis of the data from the above analysis, and extending beyond existing knowledge to mine unknown associations. For example, methylation is known to be associated with cancer; this step analyzes in greater depth the role of the various methylation patterns in cancer and in the various cancer subtypes, and mines whether associations exist between methylation and other diseases, and so on.
When, in the present embodiment, the CPU calls the deep learning model implemented in hardware on the FPGA to execute methylation function analysis and mining in parallel, the CPU is responsible for flow control of the analysis and mining, while the hardware-implemented deep learning model on the FPGA executes the analysis and mining in parallel; data and instructions are exchanged between the CPU and the FPGA. Using a deep learning method for analysis and mining supports statistical models based on big data and can achieve more accurate analysis and mining. The GPU is programmed to process tasks such as video, animation and display related to analysis and mining, with data and instructions exchanged between the CPU and the GPU; the DSP is programmed to process tasks such as graphics, images and audio related to analysis and mining, with data and instructions exchanged between the CPU and the DSP. With a CPU alone, this step is one of the time-consuming bottlenecks of the whole DNA methylation sequencing data computation and interpretation workflow; adding the FPGA, GPU and DSP, which cooperate with the CPU, allows the deep learning and related multimedia processing tasks to be completed with parallel acceleration;
4.4) Output the analysis data and the in-depth interpretation report.
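As a small illustration of the kind of summary that step 4.2) would visualize, the following sketch computes the proportion of each methylation site type or region type and renders it as a plain-text bar chart. The label values, the function names and the text rendering are assumptions made for illustration; the patent leaves the actual visual forms to the GPU and DSP and does not specify them.

```python
from collections import Counter


def ratios(labels):
    """Return the proportion of each label, ordered by label name."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in sorted(counts)}


def text_bar_chart(ratio_by_label, width=20):
    """Render proportions as one text bar per label."""
    lines = []
    for label, ratio in ratio_by_label.items():
        bar = "#" * round(ratio * width)
        lines.append(f"{label:<6}{bar} {ratio:.0%}")
    return "\n".join(lines)
```

For example, `text_bar_chart(ratios(["CpG", "CpG", "CHG", "CHH"]))` yields three bars, one per site context, whose lengths are proportional to the share of each context.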
In conclusion, the DNA methylation sequencing data computation and interpretation method of the present embodiment can meet the requirements that sequencing data computation and interpretation be fast and real-time, accurate and thorough, easy to understand, and varied in form, thereby supporting the application of DNA methylation sequencing technology.
The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions that fall under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (5)

1. A DNA methylation sequencing data computation and interpretation method, characterized in that the implementation steps include:
1) preprocessing the reference genome data and the original sequencing sample data used for DNA methylation sequencing;
2) the CPU calling an aligner implemented in hardware on the FPGA to align the preprocessed sequencing sample data against the reference genome;
3) the CPU calling an identifier implemented by programming on the GPU and a deep learning model implemented in hardware on the FPGA to perform methylation identification based on the alignment results;
4) visualizing the result data; the CPU calling the deep learning model implemented in hardware on the FPGA to mine and analyze the methylation function reflected by the result data; the CPU calling the GPU, which is programmed to process the video, animation and display tasks related to analysis and mining; and the CPU calling the DSP, which is programmed to process the graphics, image and audio tasks related to analysis and mining;
The detailed steps of preprocessing the reference genome data in step 1) include: performing a methylation-oriented bioinformatic conversion on the reference genome data; the CPU calling an index generator implemented in hardware on the FPGA to generate, for the converted reference genome data, an index used by the subsequent alignment task; and outputting the converted reference genome data and its index;
The detailed steps of preprocessing the original sequencing sample data in step 1) include: performing data quality control on the original sequencing sample data to obtain reliable sample data, the data quality control including trimming the original DNA methylation sequencing sample data to remove adapter sequences and low-quality bases from the reads; performing a methylation-oriented bioinformatic conversion on the reliable sample data; and outputting the converted, reliable DNA methylation sequencing sample data;
The detailed steps of step 2) include:
2.1) reading the converted reference genome data and its index; reading the trimmed, reliable DNA methylation sequencing sample data and the bioinformatically converted, reliable DNA methylation sequencing sample data;
2.2) using the index of the converted reference genome data, the CPU calling the aligner implemented in hardware on the FPGA to exactly align the converted, reliable DNA methylation sequencing sample data against the converted reference genome data, establishing the mapping between the two;
2.3) determining whether the DNA methylation sequencing sample data consists of paired-end reads; if so, jumping to step 2.4); otherwise the reads are single-end, jumping to step 2.5); ambiguous reads being removed directly;
2.4) for paired-end reads, under the conditions that the number of mismatches is bounded and the distance between the two reads of a pair is bounded, using the index of the converted reference genome data, the CPU again calling the aligner implemented in hardware on the FPGA to align the converted, reliable DNA methylation sequencing sample data against the converted reference genome data, adding further mappings between the two; jumping to step 2.6);
2.5) for single-end reads, under the condition that the number of mismatches is bounded, using the index of the converted reference genome data, the CPU again calling the aligner implemented in hardware on the FPGA to align the converted, reliable DNA methylation sequencing sample data against the converted reference genome data, adding further mappings between the two;
2.6) based on the above alignment results, removing duplicate reads;
2.7) based on the above alignment results, generating basic statistics, the basic statistics including at least one of alignment rate statistics and methylation density level statistics;
2.8) outputting the above alignment results and basic statistics.
2. The DNA methylation sequencing data computation and interpretation method according to claim 1, characterized in that, in step 1), the preprocessing of the reference genome data and of the original sequencing sample data used for DNA methylation sequencing is executed concurrently by different threads of the CPU.
3. The DNA methylation sequencing data computation and interpretation method according to claim 1, characterized in that the detailed steps of step 3) include:
3.1) reading the converted reference genome data and its index; reading the trimmed, reliable DNA methylation sequencing sample data and the bioinformatically converted, reliable DNA methylation sequencing sample data; reading the alignment result information; reading the basic statistics;
3.2) identifying each valid methylation site;
3.3) identifying the various specified special methylation regions;
3.4) the CPU calling the deep learning model implemented in hardware on the FPGA to execute ASM identification in parallel;
3.5) outputting the methylation identification result information.
4. The DNA methylation sequencing data computation and interpretation method according to claim 3, characterized in that the identification of the various specified special methylation regions in step 3.3) comprises 2 subtasks that are executed concurrently: subtask ①: identifying hypo-methylated regions, being DNA regions with low methylation density and high gene expression, and hyper-methylated regions, being DNA regions with high methylation density and low gene expression; subtask ②: the CPU identifying differentially methylated regions, being regions of the genome whose methylation state differs across multiple samples, and calling the identifier implemented by programming on the GPU to execute DMR identification in parallel so as to identify inter-individual DMRs, wherein differentially methylated regions are regarded as functional regions that may take part in regulating gene transcription levels.
5. The DNA methylation sequencing data computation and interpretation method according to claim 4, characterized in that the detailed steps of step 4) include:
4.1) reading the above basic statistics and methylation identification result information;
4.2) the CPU calling the GPU and the DSP to visualize the basic statistics and the methylation identification result information, the CPU calling the GPU, which is programmed to process the video, animation and display tasks related to analysis and mining, and the CPU calling the DSP, which is programmed to process the graphics, image and audio tasks related to analysis and mining;
4.3) the CPU calling the deep learning model implemented in hardware on the FPGA to execute methylation function analysis and mining in parallel; the CPU also calling the GPU, which is programmed to process the video, animation and display tasks related to analysis and mining, and calling the DSP, which is programmed to process the graphics, image and audio tasks related to analysis and mining;
4.4) outputting the analysis data and the in-depth interpretation report.
CN201710362178.4A 2017-05-22 2017-05-22 DNA methylation sequencing data computation and interpretation method Active CN107273663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710362178.4A CN107273663B (en) 2017-05-22 2017-05-22 A kind of DNA methylation sequencing data calculating deciphering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710362178.4A CN107273663B (en) 2017-05-22 2017-05-22 A kind of DNA methylation sequencing data calculating deciphering method

Publications (2)

Publication Number Publication Date
CN107273663A CN107273663A (en) 2017-10-20
CN107273663B CN107273663B (en) 2018-12-11

Family

ID=60064456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710362178.4A Active CN107273663B (en) 2017-05-22 2017-05-22 A kind of DNA methylation sequencing data calculating deciphering method

Country Status (1)

Country Link
CN (1) CN107273663B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019129200A1 (en) * 2017-12-28 2019-07-04 安诺优达基因科技(北京)有限公司 C-site extraction method and apparatus
CN111627499B (en) * 2020-05-27 2020-12-08 广州市基准医疗有限责任公司 Methylation level vectorization representation and specific sequencing interval detection method and device
CN114996763B (en) * 2022-07-28 2022-11-15 北京锘崴信息科技有限公司 Private data security analysis method and device based on trusted execution environment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0406769D0 (en) * 2004-03-25 2004-04-28 Global Genomics Ab Methods and means for nucleic acid sequencing
US6934597B1 (en) * 2002-03-26 2005-08-23 Lsi Logic Corporation Integrated circuit having integrated programmable gate array and method of operating the same
CN102776270A (en) * 2011-05-12 2012-11-14 中国科学院上海生命科学研究院 Method and device for detecting DNA methylation
CN104762402A (en) * 2015-04-21 2015-07-08 广州定康信息科技有限公司 Method for rapidly detecting human genome single base mutation and micro-insertion deletion
CN105046109A (en) * 2015-06-26 2015-11-11 四川云合创智科技有限公司 Acceleration platform used for biological information sequence analysis
US9310432B2 (en) * 2011-07-25 2016-04-12 Cosmin Iorga Method and system for measuring the impedance of the power distribution network in programmable logic device applications
CN105483244A (en) * 2015-12-28 2016-04-13 武汉菲沙基因信息有限公司 Super-long genome-based variation detection algorithm and detection system
CN106021993A (en) * 2016-05-12 2016-10-12 北京百迈客云科技有限公司 Tumor exome sequencing analysis system and method
CN106295250A (en) * 2016-07-28 2017-01-04 北京百迈客医学检验所有限公司 Method and device for rapid alignment analysis of second-generation sequencing short sequences
CN106326184A (en) * 2016-08-23 2017-01-11 成都卡莱博尔信息技术股份有限公司 CPU (Central Processing Unit), GPU (Graphic Processing Unit) and DSP (Digital Signal Processor)-based heterogeneous computing framework

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650254B (en) * 2016-12-16 2018-11-20 武汉菲沙基因信息有限公司 A method for detecting fusion genes based on transcriptome sequencing data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
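"Research on Resequencing of Next-Generation Sequencing Data Based on GPU and Compressed Index"; 应德全; China Master's Theses Full-text Database, Basic Sciences; 2012-03-15; pp. A006-78; *
"A Simulation System for Second-Generation Metagenome Sequencing Based on GPU Parallel Computing"; 宣黎明; China Master's Theses Full-text Database, Basic Sciences; 2012-10-15; pp. A006-329 *
"Research on Parallel Acceleration of High-Throughput Gene Sequence Alignment Based on Hash Index"; 王文迪; Journal of Computer Research and Development; 2013-12-31; Vol. 50, No. 11; pp. 2463-3471; *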

Also Published As

Publication number Publication date
CN107273663A (en) 2017-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant