CN112102885B - Method, apparatus and storage medium for determining methylation level of DNA sample - Google Patents

Publication number
CN112102885B
Authority
CN
China
Prior art keywords
bases
predetermined
site
methylation
extraction information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011207667.0A
Other languages
Chinese (zh)
Other versions
CN112102885A (en)
Inventor
石教春
燕嘉慧
姚继成
王维锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Origimed Technology Shanghai Co ltd
Original Assignee
Origimed Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Origimed Technology Shanghai Co ltd filed Critical Origimed Technology Shanghai Co ltd
Priority to CN202011207667.0A priority Critical patent/CN112102885B/en
Publication of CN112102885A publication Critical patent/CN112102885A/en
Application granted granted Critical
Publication of CN112102885B publication Critical patent/CN112102885B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present disclosure relates to a method, computing device, and storage medium for determining a methylation level of a DNA sample. The method comprises: aligning the filtered sequencing data to a reference genome of the DNA sample and to a quality control sequence genome, respectively, to generate first alignment result data and second alignment result data; generating first extraction information based on reads at predetermined sites for which the predetermined identifier meets a first predetermined threshold, and generating second extraction information based on reads at predetermined sites for which the predetermined identifier meets a second predetermined threshold; counting bases in the first extraction information and in the second extraction information, respectively, at each predetermined site of the positive strand and the negative strand of the reference genome, to calculate the methylation level of each predetermined site of the reference genome; and determining a conversion efficiency and/or an error rate based on the second alignment result data, for use in determining the methylation level of the DNA sample. The method can calculate the methylation level of the DNA sample quickly and accurately.

Description

Method, apparatus and storage medium for determining methylation level of DNA sample
Technical Field
The present disclosure relates generally to bioinformatics processing, and in particular, to methods, computing devices, and computer storage media for determining the methylation level of a DNA sample.
Background
DNA methylation is one of the earliest discovered and most thoroughly studied epigenetic regulatory mechanisms. Epigenetics refers to heritable changes in the expression level and function of a gene that occur without any change to the DNA sequence of the gene. DNA methylation generally refers to a chemical modification in which a DNA methyltransferase covalently attaches a methyl group to the fifth carbon atom of the cytosine C in a genomic CpG dinucleotide (i.e., the dinucleotide structure cytosine C, phosphate p, guanine G). Cytosine C is a six-membered ring containing two N atoms and four C atoms, numbered 1 to 6 in a fixed order; the fifth carbon atom is the C atom at position 5.
The human genome contains approximately 1% methylated cytosine, which makes cytosine methylation the most abundant and widespread form of DNA modification. The methylation state of CpG sites in a gene region influences the transcriptional activity of the gene and thereby regulates its expression. Tumor cells generally show DNA methylation levels that differ from those of normal cells, characterized mainly by a decrease in the overall methylation level and an increase in local methylation levels. In tumor cells, proto-oncogenes (i.e., intracellular genes associated with cell proliferation) are activated in a hypomethylated state, and tumor suppressor genes (a class of genes present in normal cells that inhibit cell growth and have a potential cancer-suppressing effect) are silenced in a hypermethylated state, resulting in the hyperproliferation of tumor cells. Accurate measurement and calculation of DNA methylation levels is therefore of great importance.
Currently, the most common library construction technique for detecting the methylation level of a sample is bisulfite conversion. The sample DNA is treated with bisulfite, which converts all unmethylated cytosine C to uracil U while methylated cytosine C remains unchanged. After PCR amplification, uracil U is replaced by thymine T and methylated cytosine C is still unchanged, so that unmethylated and methylated cytosine C in the original DNA can be distinguished and the methylation level of cytosine C at each CpG site can be counted.
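The effect of bisulfite conversion followed by PCR can be illustrated with a minimal in-silico sketch. The function name, the 0-based position convention, and the example sequence are illustrative assumptions and are not part of the disclosure; the sketch models a single strand only.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated C is read as T after conversion and amplification;
    methylated C (given as 0-based positions) is read as C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")  # C -> U by bisulfite, U -> T by PCR
        else:
            converted.append(base)
    return "".join(converted)


# Hypothetical example: only the C at position 1 is methylated.
print(bisulfite_convert("ACGTCGCC", methylated_positions=[1]))  # ACGTTGTT
```

Comparing the converted read with the original sequence shows which cytosines were protected: a C that is still read as C was methylated, while a C read as T was not.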
A traditional scheme for determining the methylation level of a DNA sample is, for example, pyromethylation sequencing (pyrosequencing-based methylation analysis), which is an enzyme-cascade chemiluminescence reaction carried out in a single reaction system and catalyzed by four enzymes (DNA polymerase, ATP sulfurylase, luciferase and apyrase). It can accurately measure the methylation level of each site and evaluate the overall methylation level of a region, and it can quickly detect the methylation frequency of CpG sites and detect methylation sites in a sample both qualitatively and quantitatively. However, pyromethylation sequencing and its downstream signal processing have several deficiencies. First, the test period is long: the technique normally counts the bases at all sites of the reference genome in the alignment result one by one in order to calculate the methylation level, so data statistics and calculation take considerable time; for example, at least one week from the start of the experiment is needed to obtain the calculated methylation level. Second, stability is not ideal: results for the same sample may differ, even greatly, between companies, and results for the same sample from the same company may differ over time. Third, the technique is limited in sequence length and requires a large amount of DNA: the sequence to be detected is generally a few dozen bases long and no more than one hundred bases.
If a longer sequence is to be detected, multiple experiments or parallel experiments are required (i.e., several experiments are run on the same sample at the same time under different conditions, each group covering a different sequence position, so that the longer sequence is detected as a whole). As for the amount of DNA, an experiment generally requires at least 1 µg of DNA fragments.
In summary, in the conventional scheme for determining the methylation level of a DNA sample, the determination period is long, the stability of the result is poor, the length is limited, and the amount of DNA used is large.
Disclosure of Invention
The present disclosure provides a method, a computing device, and a computer storage medium for determining a methylation level of a DNA sample, which enables a rapid and accurate calculation of the methylation level of the DNA sample.
According to a first aspect of the present disclosure, a method of determining the methylation level of a DNA sample is provided. The method comprises: filtering received sequencing data of the DNA sample so as to retain sequencing data that meets a predetermined condition; aligning the retained sequencing data to a reference genome corresponding to the DNA sample and to a genome of a quality control sequence, respectively, so as to generate first alignment result data aligned to the positive strand and the negative strand of the reference genome and second alignment result data aligned to the positive strand and the negative strand of the genome of the quality control sequence; extracting, from the first alignment result data, reads at predetermined sites for which a predetermined identifier meets a first predetermined threshold, so as to generate first extraction information; extracting, from the first alignment result data, reads at predetermined sites for which the predetermined identifier meets a second predetermined threshold, so as to generate second extraction information; counting the number of bases in the first extraction information and the number of bases in the second extraction information, respectively, at each predetermined site of the positive strand and the negative strand of the reference genome, the predetermined sites belonging to a predetermined set of sites; calculating a methylation level for each predetermined site of the reference genome based on the number of bases in the first extraction information and the number of bases in the second extraction information; and determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second alignment result data, for use in determining the methylation level of the DNA sample.
According to a second aspect of the present disclosure, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more computer programs to cause the computing device to perform the method of the first aspect of the disclosure.
According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.
In some embodiments, the method of determining the methylation level of a DNA sample further comprises: based on the second alignment result data, counting the numbers of the four bases A, C, G and T at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence, so as to calculate the methylation level of each predetermined site associated with the genome of the quality control sequence.
In some embodiments, counting the numbers of the four bases A, C, G and T at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence comprises: extracting, from the second alignment result data, reads at predetermined sites for which the predetermined identifier meets the first predetermined threshold, so as to generate third extraction information; extracting, from the second alignment result data, reads at predetermined sites for which the predetermined identifier meets the second predetermined threshold, so as to generate fourth extraction information; and counting the numbers of the four bases A, C, G and T in the third extraction information and in the fourth extraction information, respectively, at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence, for calculating the methylation level of each predetermined site associated with the genome of the quality control sequence.
In some embodiments, separately counting the number of bases in the first extraction information and the number of bases in the second extraction information comprises: counting the numbers of the four bases A, C, G and T in the first extraction information; and counting the numbers of the four bases A, C, G and T in the second extraction information.
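The per-site base counting described above can be sketched as follows. This is a minimal illustration under stated assumptions: each extracted read is represented as a hypothetical `(start, sequence)` pair with a 0-based start and an ungapped alignment, and `count_bases_at_sites` is an illustrative name, not a function of any tool mentioned in the disclosure.

```python
from collections import Counter


def count_bases_at_sites(reads, sites):
    """Count A, C, G, T at each predetermined site.

    reads: list of (start, sequence) pairs, 0-based, ungapped (assumption).
    sites: iterable of 0-based reference positions to inspect.
    Only the listed sites are visited, not every reference position.
    """
    counts = {site: Counter({base: 0 for base in "ACGT"}) for site in sites}
    for start, seq in reads:
        end = start + len(seq)
        for site in counts:
            if start <= site < end:
                base = seq[site - start].upper()
                if base in "ACGT":  # ignore N and other ambiguity codes
                    counts[site][base] += 1
    return counts


# Hypothetical reads standing in for one piece of extraction information.
reads = [(0, "ACGT"), (2, "TTAA"), (1, "CNG")]
result = count_bases_at_sites(reads, sites=[2, 3])
print(dict(result[2]))  # {'A': 0, 'C': 0, 'G': 1, 'T': 1}
print(dict(result[3]))  # {'A': 0, 'C': 0, 'G': 1, 'T': 2}
```

Running the same function once per piece of extraction information yields the separate base counts that the methylation calculation consumes.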
In some embodiments, calculating the methylation level for each predetermined site of the reference genome based on the number of bases in the first extraction information and the number of bases in the second extraction information comprises: calculating the total number of C and T bases at the corresponding site in the second extraction information; calculating the total number of G and A bases at the corresponding site in the first extraction information; and calculating the methylation level of the corresponding site based on the number of C bases at the corresponding site in the second extraction information, the number of G bases at the corresponding site in the first extraction information, the total number of C and T bases at the corresponding site in the second extraction information, and the total number of G and A bases at the corresponding site in the first extraction information, so as to obtain the methylation level of each predetermined site of the reference genome.
In some embodiments, calculating the methylation level of each predetermined site associated with the genome of the quality control sequence comprises: calculating the total number of C and T bases at the corresponding site in the fourth extraction information; calculating the total number of G and A bases at the corresponding site in the third extraction information; and calculating the methylation level of each predetermined site associated with the genome of the quality control sequence based on the number of C bases at the corresponding site in the fourth extraction information, the number of G bases at the corresponding site in the third extraction information, the total number of C and T bases at the corresponding site in the fourth extraction information, and the total number of G and A bases at the corresponding site in the third extraction information.
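The passage above names the four quantities that enter the calculation but does not spell out how they are combined. One plausible reading, shown here purely as an assumption, pools both strands: unconverted bases (C from the second extraction information, G from the first) divided by all informative bases (C plus T, and G plus A). The function and parameter names are hypothetical.

```python
def methylation_level(c_second, t_second, g_first, a_first):
    """Assumed per-site formula: (C + G) / ((C + T) + (G + A)).

    c_second, t_second: C and T counts at the site in the second
        extraction information.
    g_first, a_first: G and A counts at the site in the first
        extraction information.
    Returns None when the site has no informative coverage.
    """
    total = (c_second + t_second) + (g_first + a_first)
    if total == 0:
        return None
    return (c_second + g_first) / total


# Hypothetical counts: 8 C and 2 T on one strand, 6 G and 4 A on the other.
print(methylation_level(8, 2, 6, 4))  # 0.7
```

The same function would apply unchanged to the third and fourth extraction information when evaluating sites of the quality control sequence.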
In some embodiments, determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second alignment result data comprises: calculating the sum of the numbers of C bases at all unmethylated sites in the fourth extraction information; calculating the sum of the numbers of G bases at all unmethylated sites in the third extraction information; calculating the sum of the numbers of C and T bases at all unmethylated sites in the fourth extraction information; calculating the sum of the numbers of G and A bases at all unmethylated sites in the third extraction information; and calculating the conversion efficiency with respect to methylation of the DNA sample based on the sum of the numbers of C bases at the unmethylated sites, the sum of the numbers of G bases at the unmethylated sites, the sum of the numbers of C and T bases at the unmethylated sites, and the sum of the numbers of G and A bases at the unmethylated sites.
In some embodiments, determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second alignment result data comprises: calculating the sum of the numbers of C bases at all methylated sites in the fourth extraction information; calculating the sum of the numbers of G bases at all methylated sites in the third extraction information; calculating the sum of the numbers of C and T bases at all methylated sites in the fourth extraction information; calculating the sum of the numbers of G and A bases at all methylated sites in the third extraction information; and calculating the error rate with respect to methylation of the DNA sample based on the sum of the numbers of C bases at the methylated sites, the sum of the numbers of G bases at the methylated sites, the sum of the numbers of C and T bases at the methylated sites, and the sum of the numbers of G and A bases at the methylated sites.
In some embodiments, filtering the sequencing data so as to retain sequencing data that meets a predetermined condition comprises: removing sequencing data that contains the linker (adapter) sequence; filtering out sequencing data whose sequencing quality value is below a predetermined quality threshold; and filtering out sequencing data whose read length is below a predetermined read length value, so that the remaining sequencing data serves as the sequencing data that meets the predetermined condition.
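The two quality control metrics can be sketched from the four sums listed above. The exact formulas are not given in this passage, so the sketch assumes that the unconverted fraction is (sum of C + sum of G) / (sum of C and T + sum of G and A), that conversion efficiency is one minus this fraction over the known unmethylated sites, and that the error rate is one minus this fraction over the known methylated sites. The dictionary layout and all names are hypothetical.

```python
def unconverted_fraction(site_counts):
    """Fraction of informative bases still read as C (or G on the other strand).

    site_counts: list of dicts with keys "c", "t" (fourth extraction
    information) and "g", "a" (third extraction information).
    """
    retained = sum(s["c"] + s["g"] for s in site_counts)
    total = sum(s["c"] + s["t"] + s["g"] + s["a"] for s in site_counts)
    if total == 0:
        return None
    return retained / total


def conversion_efficiency(unmethylated_sites):
    """Assumed: share of unmethylated C that was converted."""
    fraction = unconverted_fraction(unmethylated_sites)
    return None if fraction is None else 1.0 - fraction


def error_rate(methylated_sites):
    """Assumed: share of methylated C that was wrongly converted."""
    fraction = unconverted_fraction(methylated_sites)
    return None if fraction is None else 1.0 - fraction


# Hypothetical counts for sites of a quality control sequence.
unmethylated = [{"c": 1, "t": 99, "g": 1, "a": 99},
                {"c": 0, "t": 50, "g": 2, "a": 48}]
methylated = [{"c": 98, "t": 2, "g": 99, "a": 1}]
print(round(conversion_efficiency(unmethylated), 4))  # 0.9867
print(round(error_rate(methylated), 4))               # 0.015
```

A high conversion efficiency together with a low error rate indicates that the per-site methylation levels computed for the sample can be trusted.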
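The three filtering criteria can be sketched as a single pass over the reads. This is a minimal illustration, not the filtering tool used in the disclosure: reads are modeled as hypothetical `(sequence, quality_scores)` pairs, the quality criterion is assumed to be the mean Phred score, and the adapter string and thresholds are placeholder values.

```python
def filter_reads(reads, adapter, min_mean_quality, min_length):
    """Keep reads that pass all three predetermined conditions.

    reads: list of (sequence, quality_scores) pairs, where
        quality_scores is a list of integer Phred values (assumption).
    A read is dropped if it contains the adapter sequence, is shorter
    than min_length, or has a mean quality below min_mean_quality.
    """
    kept = []
    for seq, quals in reads:
        if adapter in seq:
            continue  # read contains the linker (adapter) sequence
        if len(seq) < min_length or not quals:
            continue  # read too short, or no quality values to assess
        if sum(quals) / len(quals) < min_mean_quality:
            continue  # mean sequencing quality too low
        kept.append((seq, quals))
    return kept


# Hypothetical reads; only the first passes all three checks.
reads = [
    ("ACGTACGTAC", [30] * 10),
    ("ACGTAGATCGGAAGAGC", [35] * 17),  # contains the adapter
    ("ACGT", [40] * 4),                # too short
    ("ACGTACGTAC", [10] * 10),         # low quality
]
passed = filter_reads(reads, adapter="AGATCGGAAGAGC",
                      min_mean_quality=20, min_length=8)
print(len(passed))  # 1
```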
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
Fig. 1 shows a schematic diagram of a system for a method of determining a methylation level of a DNA sample according to an embodiment of the present disclosure.
Fig. 2 shows a flow diagram of a method for determining a methylation level of a DNA sample according to an embodiment of the present disclosure.
Fig. 3 shows a flow chart of a method for calculating the methylation level of each predetermined site associated with the genome of a quality control sequence according to an embodiment of the present disclosure.
Fig. 4 shows a flow diagram of a method for determining an error rate with respect to methylation of a DNA sample according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram for presenting images indicative of the distribution of methylation levels at each predetermined site, according to an embodiment of the present disclosure.
Fig. 6 schematically shows a block diagram of an electronic device suitable for implementing an embodiment of the disclosure.
Like or corresponding reference characters designate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.
The term "methylation level" as used herein means the proportion of methylated cytosine C among all cytosine C. The term "conversion efficiency" means the proportion of unmethylated cytosine C that is converted to uracil U after bisulfite treatment. The term "error rate" means the proportion of methylated cytosine C that is converted to uracil U after bisulfite treatment, a conversion that should not normally occur. The term "quality control sequence" refers to a segment of exogenous DNA that is added to the DNA library during methylation library construction and that comprises a plurality of known fully methylated CpG sites and fully unmethylated cytosine C sites. The conversion efficiency of the sample as a whole can be evaluated from the conversion of unmethylated cytosine C, and the error rate of the sample as a whole can be evaluated from the conversion of methylated cytosine C. Commonly used quality control sequences include the lambda DNA sequence and the pUC19c sequence. The term "reads" is a generic term for sequencing results; each record of sequencing data is one read. In addition, guanine, cytosine, adenine and thymine are all of the mononucleotide type.
As mentioned above, in the conventional scheme for determining the methylation level of a DNA sample, for example, the number of bases of all the sites of the reference genome in the alignment result needs to be counted one by one for calculating the methylation level, so that it takes much data counting and calculating time. Therefore, the traditional scheme for determining the methylation level of the DNA sample has the defects of long determination period, poor stability of the result and the like.
To address, at least in part, one or more of the above issues and other potential issues, an example embodiment of the present disclosure proposes a scheme for determining a methylation level of a DNA sample. The scheme comprises: filtering received sequencing data of the DNA sample so as to retain sequencing data that meets a predetermined condition; aligning the retained sequencing data to a reference genome corresponding to the DNA sample and to a genome of a quality control sequence, respectively, so as to generate first alignment result data aligned to the positive strand and the negative strand of the reference genome and second alignment result data aligned to the positive strand and the negative strand of the genome of the quality control sequence; extracting, from the first alignment result data, reads at predetermined sites for which a predetermined identifier meets a first predetermined threshold, so as to generate first extraction information; extracting, from the first alignment result data, reads at predetermined sites for which the predetermined identifier meets a second predetermined threshold, so as to generate second extraction information; counting the number of bases in the first extraction information and the number of bases in the second extraction information, respectively, at each predetermined site of the positive strand and the negative strand of the reference genome, the predetermined sites belonging to a predetermined set of sites; calculating a methylation level for each predetermined site of the reference genome based on the number of bases in the first extraction information and the number of bases in the second extraction information; and determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second alignment result data, for use in determining the methylation level of the DNA sample.
In the above scheme, the filtered sequencing data that meets the predetermined condition is aligned to the reference genome corresponding to the DNA sample and to the quality control sequence genome, respectively, so as to generate first alignment result data and second alignment result data, and the conversion efficiency and the error rate of the quality control sequence are calculated based on base counts of the extracted reads at predetermined sites meeting the first predetermined threshold in the first alignment result data and base counts of the extracted reads at predetermined sites meeting the second predetermined threshold in the second alignment result data, so as to determine the methylation level of the DNA sample. The scheme does not need to perform base statistics and methylation level calculation on every site of the reference genome one by one; it only needs to perform read extraction, base statistics and methylation level calculation on the predetermined sites in the predetermined set of sites, so the period for determining the methylation level of a DNA sample can be effectively shortened. In addition, confirming the methylation calculation result of the DNA sample against the conversion efficiency or error rate calculated from the base statistics of the second alignment result data improves the stability and accuracy of that result. Moreover, by extracting reads at predetermined sites for which the predetermined identifier meets the first predetermined threshold and the second predetermined threshold, the scheme extracts reads at predetermined sites whose positive-strand and negative-strand alignment results are accurate, which improves the accuracy of the methylation level calculation. The scheme can therefore calculate the methylation level of each site in the DNA sample quickly and accurately.
Fig. 1 shows a schematic diagram of a system 100 for a method of determining a methylation level of a DNA sample according to an embodiment of the present disclosure. As shown in fig. 1, system 100 includes, for example, a computing device 110, a messaging server 150, a sequencing unit 160, and a network 170. The computing device 110 may interact with the messaging server 150 and the sequencing unit 160 in a wired or wireless manner via the network 170.
The computing device 110 is used to calculate a methylation level for each predetermined site of the reference genome and to determine at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample, for use in determining the methylation level of the DNA sample. In some embodiments, the computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. The computing device 110 is, for example, based on a Linux operating system with a shell programming language deployed thereon. In some embodiments, the computing device 110 is configured, for example and without limitation, with Bismark software, BSMAP software, and MethylExtract software. In addition, one or more virtual machines may also be running on each computing device. The computing device 110 includes, for example, a data receiving unit 112, a filtering unit 114, a reference genome unit 120, and a quality control sequence genome unit 130. The data receiving unit 112, the filtering unit 114, the reference genome unit 120, and the quality control sequence genome unit 130 may be configured on one or more computing devices 110.
The reference genome unit 120 is used to assess the methylation level of each predetermined site of the reference genome. The reference genome unit 120 includes, for example: a reference genome alignment unit 116, a first extraction information generation unit 118, a second extraction information generation unit 122, a reference genome base statistics unit 124, and a reference genome methylation level calculation unit 126.
The quality control sequence genome unit 130 is used to evaluate the conversion efficiency and the error rate with respect to methylation of the DNA sample. The quality control sequence genome unit 130 includes, for example: a quality control sequence genome alignment unit 132, a third extraction information generation unit 134, a fourth extraction information generation unit 136, a quality control sequence genome base statistics unit 138, a quality control sequence genome methylation level calculation unit 140, and a conversion efficiency and error rate determination unit 142.
With respect to the data receiving unit 112, it is used to receive sequencing data on a DNA sample from a sequencing unit 160 (e.g., a sequencer) or a messaging server 150.
The filtering unit 114 is used to filter the sequencing data of the DNA sample received by the data receiving unit 112 so as to retain the sequencing data that meets a predetermined condition.
The reference genome alignment unit 116 is used to align the sequencing data retained by the filtering unit 114 to the reference genome corresponding to the DNA sample so as to generate first alignment result data aligned to the positive strand and the negative strand of the reference genome.
The first extraction information generation unit 118 is used to extract, from the first alignment result data, reads at predetermined sites for which the predetermined identifier meets a first predetermined threshold, so as to generate first extraction information.
The second extraction information generation unit 122 is used to extract, from the first alignment result data, reads at predetermined sites for which the predetermined identifier meets a second predetermined threshold, so as to generate second extraction information.
The reference genome base statistics unit 124 is used to count the number of bases in the first extraction information and the number of bases in the second extraction information, respectively, at each predetermined site of the positive strand and the negative strand of the reference genome.
The reference genome methylation level calculation unit 126 is used to calculate the methylation level of each predetermined site of the reference genome based on the number of bases in the first extraction information and the number of bases in the second extraction information.
The quality control sequence genome alignment unit 132 is used to align the retained sequencing data to the genome of the quality control sequence so as to generate second alignment result data aligned to the positive strand and the negative strand of the genome of the quality control sequence.
The third extraction information generation unit 134 is used to extract, from the second alignment result data, reads at predetermined sites for which the predetermined identifier meets the first predetermined threshold, so as to generate third extraction information.
The fourth extraction information generation unit 136 is used to extract, from the second alignment result data, reads at predetermined sites for which the predetermined identifier meets the second predetermined threshold, so as to generate fourth extraction information.
The quality control sequence genome base statistics unit 138 is used to count the numbers of the four bases A, C, G and T in the third extraction information and in the fourth extraction information, respectively, at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence.
The quality control sequence genome methylation level calculation unit 140 calculates the methylation level of each predetermined site associated with the genome of the quality control sequence. For example, the quality control sequence genome methylation level calculation unit 140 calculates the total number of C and T bases of the corresponding site in the fourth extraction information; calculates the total number of G and A bases of the corresponding site in the third extraction information; and calculates the methylation level of each predetermined site associated with the genome of the quality control sequence based on the number of C bases of the corresponding site in the fourth extraction information, the number of G bases of the corresponding site in the third extraction information, the total number of C and T bases of the corresponding site in the fourth extraction information, and the total number of G and A bases of the corresponding site in the third extraction information.
The conversion efficiency and error rate determining unit 142 determines, based on the second alignment result data, at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample, for use in determining the methylation level of the DNA sample. For example, the conversion efficiency and error rate determining unit 142 calculates the sum of the numbers of C bases of all unmethylated sites in the fourth extraction information; calculates the sum of the numbers of G bases of all unmethylated sites in the third extraction information; calculates the sum of the numbers of C and T bases of all unmethylated sites in the fourth extraction information; calculates the sum of the numbers of G and A bases of all unmethylated sites in the third extraction information; and calculates the conversion efficiency with respect to methylation of the DNA sample based on the sum of the numbers of C bases of the unmethylated sites, the sum of the numbers of G bases of the unmethylated sites, the sum of the numbers of C and T bases of the unmethylated sites, and the sum of the numbers of G and A bases of the unmethylated sites.
For another example, the conversion efficiency and error rate determining unit 142 calculates the sum of the numbers of C bases of all methylated sites in the fourth extraction information; calculates the sum of the numbers of G bases of all methylated sites in the third extraction information; calculates the sum of the numbers of C and T bases of all methylated sites in the fourth extraction information; calculates the sum of the numbers of G and A bases of all methylated sites in the third extraction information; and calculates an error rate with respect to methylation of the DNA sample based on the sum of the numbers of C bases of the methylated sites, the sum of the numbers of G bases of the methylated sites, the sum of the numbers of C and T bases of the methylated sites, and the sum of the numbers of G and A bases of the methylated sites.
The bioinformatics server 150 manages a bioinformatics database and provides bioinformatics data. The computing device 110 may obtain sequencing data about the DNA sample via the network 170 and the bioinformatics server 150. In some embodiments, the bioinformatics server 150 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, and general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device.
The sequencing unit 160 generates sequencing data for a DNA sample. The sequencing unit 160 uses, for example, an NGS-based microbial sequencing technology: after DNA extraction, a library is built and sequenced on a high-throughput sequencer to obtain raw reads data, which is usually stored as a fastq file that contains, for example, base quality information. The sequencing data about the DNA sample produced by the sequencing unit 160 can be sent to the computing device 110 via the network 170, so that the computing device 110 can calculate the methylation level of each predetermined site of the reference genome based on the sequencing data and determine the conversion efficiency and the error rate with respect to methylation of the DNA sample.
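By way of illustration only, the fastq storage format mentioned above can be read with a few lines of Python. The sketch below assumes the common four-line record layout and a Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are hypothetical and are not part of the disclosure.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (header, bases, '+', qualities).

    The Phred+33 offset is an assumption; other encodings need another offset.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, [ord(ch) - offset for ch in qual])


if __name__ == "__main__":
    text = "@read1\nACGT\n+\nII5?\n"
    for record in parse_fastq(text.splitlines()):
        print(record.name, record.sequence, record.qualities)
```

Each yielded record carries the read name, the base string, and the decoded per-base quality values that later filtering steps can inspect.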
A method for determining a methylation level of a DNA sample according to an embodiment of the present disclosure will be described below in conjunction with fig. 2. Fig. 2 shows a flow diagram of a method 200 for determining a methylation level of a DNA sample according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 600 depicted in fig. 6, or at the computing device 110 depicted in fig. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 202, the computing device 110 may filter the received sequencing data for the DNA sample so as to retain the sequencing data that meets a predetermined condition. For example, after the computing device 110 obtains the raw sequencing data for the DNA sample, AdapterRemoval software is used for data quality control to filter out non-conforming and low-quality sequencing data.
The received sequencing data on the DNA sample is, for example, sequencing result data for the DNA sample obtained via a sequencer.
In some embodiments, the computing device 110 can remove sequencing data that includes an adapter sequence from the sequencing data; filter out sequencing data having a sequencing quality value below a predetermined quality threshold; and filter out sequencing data whose read length is below a predetermined read length value, so that the remaining sequencing data is used as the sequencing data meeting the predetermined condition.
Ways to remove sequencing data comprising adapter sequences include, for example: the computing device 110 may remove sequencing data that includes an adapter sequence from the sequencing data based on the relevant parameters (e.g., "--adapter1 AGATCGGAAGAGC --adapter2 AGATCGGAAGAGC").
Ways to filter by sequencing quality include, for example: the computing device 110 may use "--trimqualities --minquality 30" to remove sequencing data with a sequencing quality value below 30.
Ways to filter by read length include, for example: the computing device 110 first determines that the predetermined read length value is 150; the computing device 110 may then use "--minlength 150" to remove sequencing data with a read length below 150. In some embodiments, the predetermined read length value may be set to the default value of the software.
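The three filtering criteria above can be sketched in a few lines of Python. This is an illustrative simplification and not the algorithm of AdapterRemoval: it assumes that a read is discarded if it contains the adapter sequence, if its mean quality is below the threshold, or if it is shorter than the minimum length. The use of the mean quality and the names `passes_filters` and `filter_reads` are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, List[int]]  # (name, bases, Phred qualities)

ADAPTER = "AGATCGGAAGAGC"  # adapter sequence quoted in the text
MIN_QUALITY = 30           # quality threshold quoted in the text
MIN_LENGTH = 150           # read length threshold quoted in the text


def passes_filters(read: Read, adapter: str = ADAPTER,
                   min_quality: float = MIN_QUALITY,
                   min_length: int = MIN_LENGTH) -> bool:
    """Return True if the read meets all three predetermined conditions."""
    _, bases, quals = read
    if adapter in bases:            # read still contains the adapter sequence
        return False
    if len(bases) < min_length:     # read is too short
        return False
    if not quals or sum(quals) / len(quals) < min_quality:
        return False                # mean quality too low (assumed criterion)
    return True


def filter_reads(reads: Iterable[Read], **thresholds) -> List[Read]:
    """Keep only the reads that pass every filter."""
    return [read for read in reads if passes_filters(read, **thresholds)]


if __name__ == "__main__":
    demo = [("ok", "ACGTACGT", [35] * 8), ("lowq", "ACGTACGT", [10] * 8)]
    print([name for name, _, _ in filter_reads(demo, min_length=8)])
```

The thresholds are keyword arguments so that the short demonstration reads can be filtered with a smaller minimum length than the 150 bases used for PE150 data.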
At step 204, the computing device 110 aligns the remaining sequencing data with the reference genome corresponding to the DNA sample and with the genome of the quality control sequence, respectively, so as to generate first alignment result data aligned to the positive strand and the negative strand of the reference genome, and second alignment result data aligned to the positive strand and the negative strand of the genome of the quality control sequence. The first alignment result data is, for example, an alignment result bam file.
The quality control sequence is a segment of exogenous DNA added to the DNA library during methylation library construction, which contains a plurality of known fully methylated CpG sites and fully unmethylated cytosine (C) sites. The conversion efficiency of the sample as a whole can be evaluated from the conversion of the unmethylated cytosines, and the error rate of the sample as a whole can be evaluated from the conversion of the methylated cytosines. In a methylation experiment, the quality control sequence is added to the sample, and the conversion of the unmethylated cytosines in the pre-synthesized quality control sequence is used to check whether the quality of the preceding methylation experiment and of its sequencing data meets the predetermined requirements.
Regarding the manner of alignment, for example, the computing device 110 uses the data alignment module of the bismark software to align the sequencing data with the reference genome corresponding to the DNA sample and with the genome corresponding to the quality control sequence. The alignment parameters are set, for example, as follows: the alignment algorithm is set with "--path_to_bowtie2" (i.e., the alignment algorithm of bowtie2 is used), and the sequencing direction is set with "--non_directional". The sequencing direction is set in this way mainly because the PE150 sequencing strategy (i.e., paired-end 150 sequencing, with a read length of 150 bases) is paired-end and non-directional. The other software parameters may each be set to default values, for example.
At step 206, the computing device 110 extracts, from the first alignment result data, the reads at predetermined sites whose flag meets a first predetermined threshold, so as to generate first extraction information.
A predetermined site is, for example, a predetermined drug-related methylation site, i.e., a site having clinical significance or significance for drug use. All methylation-related sites that have clinical or drug-use significance can be collected from public information or a bioinformatics database and grouped into a predetermined site set. Each site in the predetermined site set is a predetermined site. By extracting, from the first alignment result data, only the reads at the predetermined sites whose flag meets the first predetermined threshold, rather than the reads at all sites of the entire reference genome, the efficiency of data processing can be greatly improved while the accuracy of the methylation calculation is preserved.
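As a minimal sketch of how restricting the work to a predetermined site set might look, the Python example below indexes the sites by chromosome and uses a binary search to test whether an aligned read covers at least one of them. The coordinate convention (1-based, inclusive) and the names `build_site_index` and `covers_predetermined_site` are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Tuple


def build_site_index(sites: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group predetermined sites by chromosome and sort their positions."""
    index: Dict[str, List[int]] = {}
    for chrom, pos in sites:
        index.setdefault(chrom, []).append(pos)
    for positions in index.values():
        positions.sort()
    return index


def covers_predetermined_site(index: Dict[str, List[int]], chrom: str,
                              start: int, end: int) -> bool:
    """True if the interval [start, end] contains at least one indexed site."""
    positions = index.get(chrom, [])
    i = bisect_left(positions, start)  # first site at or after the read start
    return i < len(positions) and positions[i] <= end


if __name__ == "__main__":
    site_index = build_site_index([("chr1", 100), ("chr1", 500), ("chr2", 42)])
    print(covers_predetermined_site(site_index, "chr1", 90, 110))
    print(covers_predetermined_site(site_index, "chr1", 101, 499))
```

Reads that cover no predetermined site can be skipped before any base counting, which is where the efficiency gain described above would come from.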
The first predetermined threshold is, for example, 83 or 163. The first predetermined threshold indicates that a read is accurately aligned to one of the positive strand and the negative strand of the reference genome.
For example, the computing device 110 extracts, from the alignment result bam file, the sequencing data aligned to the positive strand and the negative strand of the reference genome corresponding to the DNA sample; then, the reads whose flag is 83 or 163 in the alignment result bam file are extracted into the first extraction information (the flag is a numeric code that describes how a read was aligned, and it is usually located in the second column of the alignment result bam file). The first extraction information is, for example, an fc.bam file.
At step 208, the computing device 110 extracts, from the first alignment result data, the reads at the predetermined sites whose flag meets a second predetermined threshold, so as to generate second extraction information.
The second predetermined threshold is, for example, 99 or 147. The value 99 or 147 indicates that a read is accurately aligned to the other of the positive strand and the negative strand of the reference genome. By extracting the reads at the predetermined sites whose flag meets the first predetermined threshold or the second predetermined threshold, the method extracts only reads that are accurately aligned to the positive strand or the negative strand of the reference genome, which improves the accuracy of the subsequent methylation level calculation.
For example, the computing device 110 extracts, from the alignment result bam file, the sequencing data aligned to the positive strand and the negative strand of the reference genome corresponding to the DNA sample; then, the reads whose flag is 99 or 147 in the alignment result bam file are extracted into the second extraction information. The second extraction information is, for example, an fw.bam file.
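The flag-based extraction in steps 206 and 208 can be illustrated with the following Python sketch. It assumes the alignments are available as SAM-formatted text lines (for example, a bam file converted to text), where the flag is the second tab-separated column; the names `describe_flag` and `split_by_flag` are hypothetical. The bit meanings follow the SAM format: 83 and 163 mark the two mates of a properly paired fragment whose first read is on the reverse strand, and 99 and 147 mark the two mates of a fragment whose first read is on the forward strand.

```python
from typing import Iterable, List, Tuple

FC_FLAGS = {83, 163}   # first predetermined threshold  -> first extraction information
FW_FLAGS = {99, 147}   # second predetermined threshold -> second extraction information

# SAM flag bits that make up the four values above.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def describe_flag(flag: int) -> List[str]:
    """List the names of the bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def split_by_flag(sam_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split SAM records into (fc, fw) groups according to their flag."""
    fc, fw = [], []
    for line in sam_lines:
        if line.startswith("@"):      # header line, not an alignment
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if flag in FC_FLAGS:
            fc.append(line)
        elif flag in FW_FLAGS:
            fw.append(line)
    return fc, fw


if __name__ == "__main__":
    for value in (83, 163, 99, 147):
        print(value, describe_flag(value))
```

Reads with any other flag value, such as unpaired or improperly paired reads, fall into neither group and are ignored.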
At step 210, the computing device 110 counts, for each predetermined site of the positive strand and the negative strand of the reference genome, the number of bases in the first extraction information and the number of bases in the second extraction information, respectively. In some embodiments, the computing device 110 may perform this counting for all of the predetermined sites, or, in order to further improve the efficiency of data processing, for only some of the predetermined sites.
The method for counting the number of bases in the first extraction information and the number of bases in the second extraction information includes, for example: counting the numbers of the four bases A, C, G, and T in the first extraction information; and counting the numbers of the four bases A, C, G, and T in the second extraction information. For example, the computing device 110 uses IGVTools software to count the numbers of A, C, G, and T bases at each position for the two files extracted in steps 206 and 208 (e.g., the fc.bam file and the fw.bam file). The relevant parameters of the IGVTools software are: "count", which selects the counting function; and "-w 1", which sets the window to a single base so that the statistics are produced site by site. The other parameters of the IGVTools software may all be set to default values.
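The per-site base counting described above can be approximated by the following Python sketch. It is not the IGVTools implementation; it assumes ungapped alignments on a single reference sequence, each given as a 1-based start position and the aligned read bases, and the name `count_bases_at_sites` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

BASES = frozenset("ACGT")


def count_bases_at_sites(reads: Iterable[Tuple[int, str]],
                         sites: Set[int]) -> Dict[int, Counter]:
    """Count A, C, G, and T at each predetermined site.

    reads: (start, sequence) pairs; start is 1-based and alignments are
           assumed to contain no insertions, deletions, or clipping.
    sites: 1-based reference positions of the predetermined sites.
    """
    counts: Dict[int, Counter] = {pos: Counter() for pos in sites}
    for start, seq in reads:
        end = start + len(seq) - 1
        for pos in sites:
            if start <= pos <= end:
                base = seq[pos - start].upper()
                if base in BASES:          # skip N and other ambiguity codes
                    counts[pos][base] += 1
    return counts


if __name__ == "__main__":
    demo_reads = [(10, "ACGT"), (11, "TGTA"), (12, "CC")]
    print(count_bases_at_sites(demo_reads, {12, 13}))
```

Running the same function once on the reads of the first extraction information and once on the reads of the second extraction information yields the two sets of counts used in the later steps.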
At step 212, the computing device 110 calculates the methylation level for each predetermined site of the reference genome based on the number of bases in the first extracted information and the number of bases in the second extracted information.
Methods of calculating the methylation level for each predetermined site of the reference genome include, for example: the computing device 110 calculates the total number of C and T bases of the corresponding site in the second extraction information; calculates the total number of G and A bases of the corresponding site in the first extraction information; and calculates the methylation level of the corresponding site based on the number of C bases of the corresponding site in the second extraction information, the number of G bases of the corresponding site in the first extraction information, the total number of C and T bases of the corresponding site in the second extraction information, and the total number of G and A bases of the corresponding site in the first extraction information, so as to obtain the methylation level of each predetermined site with respect to the reference genome. The manner in which the methylation level of each predetermined site of the reference genome is calculated is described below in conjunction with formula (1).
$$M = \frac{C_{fw} + G_{fc}}{CT_{fw} + GA_{fc}} \tag{1}$$
In the above formula (1), $C_{fw}$ represents the number of C bases of the corresponding site in the second extraction information (e.g., the fw.bam file). $G_{fc}$ represents the number of G bases of the corresponding site in the first extraction information (e.g., the fc.bam file). $CT_{fw}$ represents the total number of C and T bases of the corresponding site in the second extraction information (e.g., the fw.bam file). $GA_{fc}$ represents the total number of G and A bases of the corresponding site in the first extraction information (e.g., the fc.bam file). $M$ represents the methylation level of each predetermined site of the reference genome. The corresponding site is each of the predetermined sites.
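As a worked illustration of formula (1), the Python sketch below assumes that the methylation level of a site is the ratio of the unconverted bases (C in the second extraction information plus G in the first extraction information) to the total of the C and T bases and the G and A bases at that site. The function name `methylation_level` and the choice to return `None` for an uncovered site are assumptions made for this example.

```python
from typing import Optional


def methylation_level(c_fw: int, g_fc: int, ct_fw: int, ga_fc: int) -> Optional[float]:
    """Methylation level of one site under the assumed ratio form.

    c_fw:  number of C bases at the site in the second extraction information
    g_fc:  number of G bases at the site in the first extraction information
    ct_fw: total number of C and T bases at the site in the second extraction information
    ga_fc: total number of G and A bases at the site in the first extraction information
    """
    total = ct_fw + ga_fc
    if total == 0:
        return None  # site not covered; level is undefined
    return (c_fw + g_fc) / total


if __name__ == "__main__":
    # 30 C out of 40 C/T reads and 10 G out of 10 G/A reads.
    print(methylation_level(30, 10, 40, 10))
```

In this example, 40 of the 50 informative bases are unconverted, so the site would be reported as 80% methylated.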
At step 214, the computing device 110 determines at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second alignment result data for determining the methylation level of the DNA sample. For example, if the computing device 110 determines that the conversion efficiency with respect to methylation of the DNA sample reaches 99%, the quality of the methylation experiment is deemed satisfactory and the calculated methylation level of the DNA sample is accurate.
Methods for determining the conversion efficiency with respect to methylation of a DNA sample include, for example: the computing device 110 calculates the sum of the numbers of C bases of all unmethylated sites in the fourth extraction information; calculates the sum of the numbers of G bases of all unmethylated sites in the third extraction information; calculates the sum of the numbers of C and T bases of all unmethylated sites in the fourth extraction information; calculates the sum of the numbers of G and A bases of all unmethylated sites in the third extraction information; and calculates the conversion efficiency with respect to methylation of the DNA sample based on the sum of the numbers of C bases of the unmethylated sites, the sum of the numbers of G bases of the unmethylated sites, the sum of the numbers of C and T bases of the unmethylated sites, and the sum of the numbers of G and A bases of the unmethylated sites. The manner of calculating the conversion efficiency with respect to methylation of a DNA sample is described below in conjunction with formula (2).
$$E_{conv} = 1 - \frac{\sum_{i=1}^{n} C_{fw,i} + \sum_{i=1}^{n} G_{fc,i}}{\sum_{i=1}^{n} CT_{fw,i} + \sum_{i=1}^{n} GA_{fc,i}} \tag{2}$$
In the above formula (2), $\sum_{i=1}^{n} C_{fw,i}$ represents the sum of the numbers of C bases of all (e.g., n) unmethylated sites in the fourth extraction information (e.g., the fw.bam file for the quality control sequence). $\sum_{i=1}^{n} G_{fc,i}$ represents the sum of the numbers of G bases of all (e.g., n) unmethylated sites in the third extraction information (e.g., the fc.bam file for the quality control sequence). $\sum_{i=1}^{n} CT_{fw,i}$ represents the sum of the numbers of C and T bases of all (e.g., n) unmethylated sites in the fourth extraction information (e.g., the fw.bam file for the quality control sequence). $\sum_{i=1}^{n} GA_{fc,i}$ represents the sum of the numbers of G and A bases of all (e.g., n) unmethylated sites in the third extraction information (e.g., the fc.bam file for the quality control sequence). n represents the number of all unmethylated sites of the quality control sequence. $E_{conv}$ represents the conversion efficiency with respect to methylation of the DNA sample, i.e., the proportion of bases at which unmethylated cytosine C is converted to uracil U after bisulfite treatment.
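The conversion efficiency calculation can be sketched in Python as follows. The sketch assumes that formula (2) takes the form of one minus the unconverted fraction at the unmethylated sites of the quality control sequence, which matches the definition of conversion efficiency as the proportion of unmethylated C converted to U. The names `conversion_efficiency` and `passes_quality_control` are hypothetical, and the 99% default threshold is taken from the example given for step 214.

```python
from typing import Iterable, Tuple

# One tuple per unmethylated quality-control site:
# (C count in fourth info, G count in third info,
#  C+T total in fourth info, G+A total in third info)
SiteCounts = Tuple[int, int, int, int]


def conversion_efficiency(sites: Iterable[SiteCounts]) -> float:
    """Proportion of unmethylated C converted, summed over all unmethylated sites."""
    sum_c = sum_g = sum_ct = sum_ga = 0
    for c_fw, g_fc, ct_fw, ga_fc in sites:
        sum_c += c_fw
        sum_g += g_fc
        sum_ct += ct_fw
        sum_ga += ga_fc
    total = sum_ct + sum_ga
    if total == 0:
        raise ValueError("no coverage at the unmethylated quality-control sites")
    return 1.0 - (sum_c + sum_g) / total


def passes_quality_control(efficiency: float, threshold: float = 0.99) -> bool:
    """Check the efficiency against the example 99% requirement."""
    return efficiency >= threshold


if __name__ == "__main__":
    demo = [(1, 0, 100, 100), (0, 1, 100, 100)]
    eff = conversion_efficiency(demo)
    print(round(eff, 4), passes_quality_control(eff))
```

Summing the counts over all sites before dividing weights each site by its coverage, so a poorly covered site cannot dominate the estimate.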
Regarding the method for determining the error rate related to the methylation of the DNA sample, the following description will be provided with reference to FIG. 4, and the details will not be repeated here.
In the above scheme, the filtered sequencing data meeting the predetermined condition is aligned to the reference genome corresponding to the DNA sample and to the genome of the quality control sequence, respectively, so as to generate first alignment result data and second alignment result data; the methylation level of the DNA sample is then determined based on the base counts of the reads extracted from the first alignment result data at the predetermined sites whose flags meet the predetermined thresholds, together with the conversion efficiency and the error rate of the quality control sequence calculated from the base counts of the reads extracted from the second alignment result data. The method therefore does not need to perform base counting and methylation level calculation on all sites of the reference genome one by one; it only needs to perform read extraction, base counting, and methylation level calculation on the predetermined sites in the predetermined site set, so that the time needed to determine the methylation level of a DNA sample can be effectively shortened. In addition, the stability and accuracy of the methylation calculation result of the DNA sample can be improved by confirming that result against the conversion efficiency or the error rate calculated from the base counts of the second alignment result data. Moreover, because only the reads at the predetermined sites whose flags meet the first predetermined threshold or the second predetermined threshold are extracted, only reads that are accurately aligned to the positive strand or the negative strand are used, which improves the accuracy of the methylation level calculation. Therefore, the method can quickly and accurately calculate the methylation level of each site in the DNA sample.
In addition, experimental data show that the process from obtaining the sample tissue to finally obtaining the methylation calculation result of the DNA sample can be completed in three days on average. Moreover, the method for determining the methylation level of the DNA sample has good stability and high repeatability: in biological replicate experiments on a plurality of samples, the methylation level of each site is essentially consistent, with no difference of more than 1%. The results are also consistent when the same sample is tested repeatedly at different times. In addition, the method of the present disclosure has no length limitation: at most it can be used for sequencing and analyzing the whole genome, and it can also be used to examine a particular sequence according to actual requirements, so it is highly flexible. Furthermore, the present disclosure requires only a small amount of DNA; 50 ng of DNA fragments can meet the experimental requirement, so the present disclosure is well suited to cases in which tissue sampling is difficult or the sample is very precious. In addition, the method of the present disclosure evaluates the conversion efficiency and the error rate by adding the quality control sequence, which further ensures the accuracy of the methylation calculation result of the DNA sample.
In some embodiments, the method 200 further comprises presenting an image indicative of the distribution of methylation levels at each of the predetermined sites. For example, the computing device 110 renders and presents an image indicating the distribution of the methylation level at each of the predetermined sites, based on the position information of each predetermined site on the genome and the calculated methylation level of each predetermined site. For example, fig. 5 shows a diagram 500 presenting the methylation level distribution at each predetermined site in accordance with an embodiment of the present disclosure. In fig. 5, each peak represents a predetermined site. The abscissa of fig. 5 represents the position of the predetermined site on the genome, the ordinate represents the calculated methylation level of the site, and the value at the top of each peak is the ordinate value corresponding to that predetermined site.
In some embodiments, the method 200 further includes a method 300 for calculating the methylation level of each site associated with the genome of the quality control sequence. For example, the computing device 110 counts, based on the second alignment result data, the numbers of the four bases A, C, G, and T at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence, so as to calculate the methylation level of each site associated with the genome of the quality control sequence. The method 300 for calculating the methylation level of each site associated with the genome of the quality control sequence is described in detail below with reference to fig. 3.
Fig. 3 shows a flow diagram of a method 300 for calculating the methylation level of each predetermined site associated with the genome of the quality control sequence, according to an embodiment of the disclosure. It should be understood that the method 300 may be performed, for example, at the electronic device 600 depicted in fig. 6, or at the computing device 110 depicted in fig. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 302, the computing device 110 extracts, from the second alignment result data, the reads at the predetermined sites whose flag meets the first predetermined threshold, so as to generate third extraction information. The first predetermined threshold is, for example, 83 or 163. The first predetermined threshold indicates that a read is accurately aligned to one of the positive strand and the negative strand of the genome of the quality control sequence. At step 304, the computing device 110 extracts, from the second alignment result data, the reads at the predetermined sites whose flag meets the second predetermined threshold, so as to generate fourth extraction information. The second predetermined threshold is, for example, 99 or 147. The value 99 or 147 indicates that a read is accurately aligned to the other of the positive strand and the negative strand of the genome of the quality control sequence.
At step 306, the computing device 110 counts, for each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence, the numbers of the four bases A, C, G, and T in the third extraction information and the numbers of the four bases A, C, G, and T in the fourth extraction information, respectively.
The numbers of the four bases A, C, G, and T in the third extraction information and in the fourth extraction information are used to calculate the methylation level of each predetermined site associated with the genome of the quality control sequence. The specific calculation method includes, for example, the following steps 308 to 312.
At step 308, the computing device 110 calculates the total number of C and T bases of the corresponding site in the fourth extraction information.
At step 310, the computing device 110 calculates the total number of G and A bases of the corresponding site in the third extraction information.
At step 312, the computing device 110 calculates the methylation level of each predetermined site associated with the genome of the quality control sequence based on the number of C bases of the corresponding site in the fourth extraction information, the number of G bases of the corresponding site in the third extraction information, the total number of C and T bases of the corresponding site in the fourth extraction information, and the total number of G and A bases of the corresponding site in the third extraction information.
In some embodiments, the computing device 110 can calculate the methylation level of each site associated with the genome of the quality control sequence based on equation (1) above.
Fig. 4 shows a flow diagram of a method 400 for determining an error rate with respect to methylation of a DNA sample according to an embodiment of the present disclosure. It should be understood that the method 400 may be performed, for example, at the electronic device 600 depicted in fig. 6, or at the computing device 110 depicted in fig. 1. It should be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 402, the computing device 110 calculates the sum of the number of C bases of all methylated sites in the fourth extraction information.
At step 404, the computing device 110 calculates the sum of the number of G bases of all methylated sites in the third extracted information.
At step 406, the computing device 110 calculates the sum of the numbers of C and T bases of all methylated sites in the fourth extraction information.
At step 408, the computing device 110 calculates the sum of the numbers of G and A bases of all methylated sites in the third extraction information.
At step 410, the computing device 110 calculates an error rate with respect to methylation of the DNA sample based on the sum of the numbers of C bases of the methylated sites, the sum of the numbers of G bases of the methylated sites, the sum of the numbers of C and T bases of the methylated sites, and the sum of the numbers of G and A bases of the methylated sites. The manner in which the error rate with respect to methylation of the DNA sample is calculated is described below in conjunction with formula (3).
$$E_{err} = 1 - \frac{\sum_{j=1}^{m} C_{fw,j} + \sum_{j=1}^{m} G_{fc,j}}{\sum_{j=1}^{m} CT_{fw,j} + \sum_{j=1}^{m} GA_{fc,j}} \tag{3}$$
In the above formula (3), $\sum_{j=1}^{m} C_{fw,j}$ represents the sum of the numbers of C bases of all (e.g., m) methylated sites in the fourth extraction information (e.g., the fw.bam file for the quality control sequence). $\sum_{j=1}^{m} G_{fc,j}$ represents the sum of the numbers of G bases of all (e.g., m) methylated sites in the third extraction information (e.g., the fc.bam file for the quality control sequence). $\sum_{j=1}^{m} CT_{fw,j}$ represents the sum of the numbers of C and T bases of all (e.g., m) methylated sites in the fourth extraction information (e.g., the fw.bam file for the quality control sequence). $\sum_{j=1}^{m} GA_{fc,j}$ represents the sum of the numbers of G and A bases of all (e.g., m) methylated sites in the third extraction information (e.g., the fc.bam file for the quality control sequence). m represents the number of all methylated sites of the quality control sequence. $E_{err}$ represents the error rate with respect to methylation of the DNA sample, i.e., the proportion of bases at which methylated cytosine C is converted to uracil U after bisulfite treatment.
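Steps 402 to 410 can be sketched in Python starting from per-site base counts. The sketch assumes that formula (3) takes the form of one minus the unconverted fraction at the methylated sites of the quality control sequence, in line with the definition of the error rate as the proportion of methylated C converted to U. The name `error_rate` and the dictionary layout of the counts are assumptions made for illustration.

```python
from typing import Dict, Iterable

BaseCounts = Dict[int, Dict[str, int]]  # site position -> {"A": n, "C": n, "G": n, "T": n}


def error_rate(fw_counts: BaseCounts, fc_counts: BaseCounts,
               methylated_sites: Iterable[int]) -> float:
    """Proportion of methylated C that was (wrongly) converted.

    fw_counts: base counts from the fourth extraction information
    fc_counts: base counts from the third extraction information
    """
    sites = list(methylated_sites)
    # Step 402: sum of C bases over all methylated sites (fourth information).
    sum_c = sum(fw_counts.get(p, {}).get("C", 0) for p in sites)
    # Step 404: sum of G bases over all methylated sites (third information).
    sum_g = sum(fc_counts.get(p, {}).get("G", 0) for p in sites)
    # Step 406: sum of C and T bases (fourth information).
    sum_ct = sum_c + sum(fw_counts.get(p, {}).get("T", 0) for p in sites)
    # Step 408: sum of G and A bases (third information).
    sum_ga = sum_g + sum(fc_counts.get(p, {}).get("A", 0) for p in sites)
    # Step 410: error rate under the assumed form of formula (3).
    total = sum_ct + sum_ga
    if total == 0:
        raise ValueError("no coverage at the methylated quality-control sites")
    return 1.0 - (sum_c + sum_g) / total


if __name__ == "__main__":
    fw = {5: {"C": 98, "T": 2}, 9: {"C": 99, "T": 1}}
    fc = {5: {"G": 97, "A": 3}, 9: {"G": 100, "A": 0}}
    print(round(error_rate(fw, fc, [5, 9]), 4))
```

In the demonstration data, 394 of 400 informative bases at the methylated sites remain unconverted, giving an error rate of 1.5%.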
FIG. 6 schematically illustrates a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure. The device 600 may be a device for implementing the methods 200 to 400 shown in fig. 2 to 4. As shown in fig. 6, device 600 includes a Central Processing Unit (CPU) 601 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 602 or loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the device 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The central processing unit 601 performs the respective methods and processes described above, for example, the methods 200 to 400. For example, in some embodiments, the methods 200 to 400 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the CPU 601, one or more of the operations of the methods 200 to 400 described above may be performed. Alternatively, in other embodiments, the CPU 601 may be configured by any other suitable means (e.g., by way of firmware) to perform one or more of the acts of the methods 200 to 400.
It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above are only alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (12)

1. A method for determining the methylation level of a DNA sample, comprising:
filtering the received sequencing data for the DNA sample so as to leave sequencing data that meets a predetermined condition;
comparing the remaining sequencing data with a reference genome corresponding to the DNA sample and with a genome of a quality control sequence, respectively, so as to generate first comparison result data for the positive strand and the negative strand of the reference genome and second comparison result data for the positive strand and the negative strand of the genome of the quality control sequence;
extracting, from the first comparison result data, reads whose identifier at a predetermined location meets a first predetermined threshold for indicating the accuracy status of the comparison result with one of the positive strand and the negative strand of the reference genome, so as to generate first extraction information;
extracting, from the first comparison result data, reads whose identifier at the predetermined location meets a second predetermined threshold for indicating the accuracy status of the comparison result with the other of the positive strand and the negative strand of the reference genome, so as to generate second extraction information;
counting the number of bases in the first extraction information and the number of bases in the second extraction information respectively based on each of the predetermined sites of the positive strand and the negative strand of the reference genome, the predetermined sites belonging to a predetermined set of sites;
calculating a methylation level for each predetermined site of the reference genome based on the number of bases in the first extraction information and the number of bases in the second extraction information; and
determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second comparison result data, for use in determining a methylation level of the DNA sample.
2. The method of claim 1, further comprising:
counting the number of each of the four bases A, C, G, and T at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence based on the second comparison result data, so as to calculate the methylation level of each predetermined site associated with the genome of the quality control sequence.
3. The method of claim 2, wherein counting the number of each of the four bases A, C, G, and T at each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence comprises:
extracting reads of a predetermined site with a predetermined identifier meeting the first predetermined threshold value in the second comparison result data so as to generate third extraction information;
extracting reads of the predetermined locus with a predetermined identifier meeting the second predetermined threshold in the second comparison result data so as to generate fourth extraction information; and
separately counting the number of each of the four bases A, C, G, and T in the third extraction information and in the fourth extraction information based on each predetermined site of the positive strand and the negative strand of the genome of the quality control sequence, so as to calculate the methylation level of each predetermined site associated with the genome of the quality control sequence.
4. The method of claim 1, wherein separately counting the number of bases in the first extraction information and the number of bases in the second extraction information comprises:
counting the number of each of the four bases A, C, G, and T in the first extraction information; and
counting the number of each of the four bases A, C, G, and T in the second extraction information.
5. The method of claim 4, wherein calculating the methylation level for each predetermined site of the reference genome based on the number of bases in the first extraction information and the number of bases in the second extraction information comprises:
calculating the total number of C and T bases at the corresponding site in the second extraction information;
calculating the total number of G and A bases at the corresponding site in the first extraction information; and
calculating the methylation level of the corresponding site based on the number of C bases at the corresponding site in the second extraction information, the number of G bases at the corresponding site in the first extraction information, the total number of C and T bases at the corresponding site in the second extraction information, and the total number of G and A bases at the corresponding site in the first extraction information, so as to obtain the methylation level of each predetermined site of the reference genome.
6. The method of claim 3, wherein calculating the methylation level of each predetermined site associated with the genome of the quality control sequence comprises:
calculating the total number of C and T bases at the corresponding site in the fourth extraction information;
calculating the total number of G and A bases at the corresponding site in the third extraction information; and
calculating the methylation level of each predetermined site associated with the genome of the quality control sequence based on the number of C bases at the corresponding site in the fourth extraction information, the number of G bases at the corresponding site in the third extraction information, the total number of C and T bases at the corresponding site in the fourth extraction information, and the total number of G and A bases at the corresponding site in the third extraction information.
7. The method of claim 3, wherein determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second comparison result data comprises:
calculating the sum of the numbers of C bases at all unmethylated sites in the fourth extraction information;
calculating the sum of the numbers of G bases at all unmethylated sites in the third extraction information;
calculating the sum of the numbers of C and T bases at all unmethylated sites in the fourth extraction information;
calculating the sum of the numbers of G and A bases at all unmethylated sites in the third extraction information; and
calculating the conversion efficiency with respect to methylation of the DNA sample based on the sum of the numbers of C bases at the unmethylated sites, the sum of the numbers of G bases at the unmethylated sites, the sum of the numbers of C and T bases at the unmethylated sites, and the sum of the numbers of G and A bases at the unmethylated sites.
8. The method of claim 3, wherein determining at least one of a conversion efficiency and an error rate with respect to methylation of the DNA sample based on the second comparison result data comprises:
calculating the sum of the numbers of C bases at all methylated sites in the fourth extraction information;
calculating the sum of the numbers of G bases at all methylated sites in the third extraction information;
calculating the sum of the numbers of C and T bases at all methylated sites in the fourth extraction information;
calculating the sum of the numbers of G and A bases at all methylated sites in the third extraction information; and
calculating the error rate with respect to methylation of the DNA sample based on the sum of the numbers of C bases at the methylated sites, the sum of the numbers of G bases at the methylated sites, the sum of the numbers of C and T bases at the methylated sites, and the sum of the numbers of G and A bases at the methylated sites.
9. The method of claim 1, wherein filtering the sequencing data so as to leave sequencing data that meets a predetermined condition comprises:
removing sequencing data comprising linker sequences in the sequencing data;
filtering out sequencing data having a sequencing quality value below a predetermined quality threshold; and
filtering out sequencing data whose sequence length is below a predetermined sequence length value, so as to use the remaining sequencing data as the sequencing data that meets the predetermined condition.
10. The method of claim 1, further comprising:
drawing, for presentation, an image indicating the distribution of the methylation level of each predetermined site based on the position information of each predetermined site on the genome and the calculated methylation level of each predetermined site.
11. A computing device, comprising:
at least one processing unit;
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, wherein the instructions, when executed by the at least one processing unit, cause the computing device to perform the steps of the method of any of claims 1 to 10.
12. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-10.
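The calculations recited in claims 5 and 7 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the claims name the four inputs but not how they are combined, so the ratio (C + G) / ((C + T) + (G + A)) for the methylation level, and one minus the same pooled ratio over unmethylated control sites for the conversion efficiency, are assumptions. The function names and the use of `collections.Counter` for per-site base counts are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def methylation_level(first: Counter, second: Counter) -> float:
    """Assumed per-site methylation level.

    first:  A/C/G/T counts at the site from the first extraction information
            (G and A are used).
    second: A/C/G/T counts at the site from the second extraction information
            (C and T are used).
    """
    methylated = second["C"] + first["G"]
    informative = second["C"] + second["T"] + first["G"] + first["A"]
    if informative == 0:
        return math.nan  # no coverage at this site
    return methylated / informative


def conversion_efficiency(sites: Iterable[Tuple[Counter, Counter]]) -> float:
    """Assumed conversion efficiency over unmethylated control sites.

    Each element is a (third, fourth) pair of per-site base counts taken from
    the third and fourth extraction information, respectively.
    """
    unconverted = 0
    informative = 0
    for third, fourth in sites:
        unconverted += fourth["C"] + third["G"]
        informative += fourth["C"] + fourth["T"] + third["G"] + third["A"]
    if informative == 0:
        raise ValueError("no informative reads at unmethylated sites")
    return 1.0 - unconverted / informative


# Made-up counts for one site of the reference genome.
site_first = Counter({"G": 30, "A": 10})
site_second = Counter({"C": 45, "T": 15})
level = methylation_level(site_first, site_second)

# Made-up counts for two unmethylated sites of the quality control sequence.
control_sites = [
    (Counter({"G": 1, "A": 99}), Counter({"C": 2, "T": 98})),
    (Counter({"G": 0, "A": 50}), Counter({"C": 1, "T": 49})),
]
efficiency = conversion_efficiency(control_sites)
print(f"methylation level: {level:.2f}, conversion efficiency: {efficiency:.4f}")
```

Pooling counts across all control sites before dividing, rather than averaging per-site ratios, follows the claim wording, which sums the counts over all sites first.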
CN202011207667.0A 2020-11-03 2020-11-03 Method, apparatus and storage medium for determining methylation level of DNA sample Active CN112102885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011207667.0A CN112102885B (en) 2020-11-03 2020-11-03 Method, apparatus and storage medium for determining methylation level of DNA sample

Publications (2)

Publication Number Publication Date
CN112102885A CN112102885A (en) 2020-12-18
CN112102885B true CN112102885B (en) 2021-02-12

Family

ID=73784532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011207667.0A Active CN112102885B (en) 2020-11-03 2020-11-03 Method, apparatus and storage medium for determining methylation level of DNA sample

Country Status (1)

Country Link
CN (1) CN112102885B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105132409B (en) * 2015-09-02 2018-02-06 中国科学院北京基因组研究所 The library constructing method of RNA 5mC bisulfites sequencing and its application
CN110117652A (en) * 2018-02-05 2019-08-13 箐健科技(天津)有限公司 Hepatocarcinoma early diagnosis method
CN109385464A (en) * 2018-07-27 2019-02-26 中山大学附属第六医院 A kind of DNA methylation detection kit and method
SG11202103486YA (en) * 2018-10-31 2021-05-28 Guardant Health Inc Methods, compositions and systems for calibrating epigenetic partitioning assays
CN111778316B (en) * 2020-06-09 2022-12-27 山东师范大学 Fluorescent probe based on oxidative damage basic group, kit and method for directly detecting DNA methylation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant