CN112397150B - ctDNA methylation level prediction device and method based on target region capture sequencing - Google Patents

ctDNA methylation level prediction device and method based on target region capture sequencing

Info

Publication number
CN112397150B
Authority
CN
China
Prior art keywords
file
reads
methylation level
filtering
bam
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110072090.5A
Other languages
Chinese (zh)
Other versions
CN112397150A (en
Inventor
韩天澄
宋小凤
于佳宁
洪媛媛
裴志华
陈维之
何骥
杜波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Zhenhe Biotechnology Co.,Ltd.
Zhenhe (Beijing) Biotechnology Co.,Ltd.
Original Assignee
Wuxi Zhenhe Biotechnology Co ltd
Zhenhe Beijing Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Zhenhe Biotechnology Co ltd and Zhenhe Beijing Biotechnology Co ltd
Priority to CN202110072090.5A (patent CN112397150B)
Publication of CN112397150A
Application granted
Publication of CN112397150B
Priority to EP21920475.7A (patent EP4268231A1)
Priority to PCT/CN2021/091761 (patent WO2022156089A1)
Priority to US17/490,549 (patent US20220228209A1)

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention provides a ctDNA methylation level prediction device and method based on target region capture sequencing. The device comprises: a FASTQ file processing module for acquiring the FASTQ file from capture sequencing of a ctDNA sample to be detected and processing it to obtain a filtered FASTQ file; a to-be-detected sample comparison module for aligning the sequences in the filtered FASTQ file to a reference genome and removing duplicates to obtain a corresponding Bam file; a reads level filtering module for filtering the reads in the Bam file one by one according to a preset C-T conversion rate to obtain a filtered Bam file; and a methylation level prediction module for further filtering the Bam file according to the target region Bed file and a preset number of CpG sites covered by each read, and predicting the methylation level of the CpG sites from the remaining reads.

Description

ctDNA methylation level prediction device and method based on target region capture sequencing
Technical Field
The invention relates to the technical field of biomedicine, in particular to a ctDNA methylation level prediction device and method.
Background
Circulating tumor DNA (ctDNA) is a small fragment of DNA derived from tumor cell apoptosis and necrosis, and is released from tumor cells to peripheral blood circulation to form endogenous single-stranded or double-stranded DNA carrying molecular mutation information consistent with primary tumor tissue. Therefore, the ctDNA sample detection can be used as a substitute sample for clinical tissue sample gene detection.
Studies have shown that epigenetic changes are among the most common molecular changes in tumor formation. DNA methylation is a widely studied epigenetic modification that plays an important role in regulating gene expression. Generally, DNA methylation refers to the addition of a methyl group to the carbon-5 position of cytosine by DNA methyltransferase (DNMT), forming 5-methylcytosine (5mC). Research shows that DNA methylation is involved in cellular activities such as cell differentiation and tissue-specific gene expression, and that abnormal DNA methylation can cause diseases such as dysplasia and tumors. DNA methylation is therefore of great significance both to individual development and to the mechanisms of tumor occurrence and progression.
With the continuous development of next-generation sequencing technology, its application in the diagnosis of human genetic diseases and cancer has become increasingly common, and methylation sequencing of ctDNA has become an important means of studying the mechanisms of tumor occurrence and progression. However, the human reference genome is about 3 Gb in size, so whole-genome methylation sequencing is costly and produces a large volume of data. Target region capture sequencing has therefore become a preferred method in scientific research.
The conventional quality control process for DNA methylation capture data is generally as follows: the FASTQ-format data are aligned to the human reference genome; high-quality, uniquely aligned reads are retained and duplicate reads are removed; the base content ratio, capture efficiency and sequencing depth of the retained reads are evaluated to obtain a Bam file of the ctDNA sample to be detected; finally, the Bam file is analyzed with third-party software to obtain the methylation level of the sample at CpG sites (cytosine-phosphate-guanine sites, i.e. sites in a DNA sequence where a cytosine is immediately followed by a guanine), and these methylation levels are used directly in subsequent research and analysis.
Target region DNA methylation capture sequencing requires bisulfite treatment, which converts unmethylated cytosine (C) to uracil (U); the uracil is then read as thymine (T) after PCR (polymerase chain reaction, a technique for amplifying specific DNA fragments). Methylated cytosine is not altered in this process. Incomplete conversion of unmethylated cytosine can occur during this treatment, which biases the predicted methylation level of the ctDNA sample to be detected. Moreover, because the ctDNA content is very low, the measured methylation level of a ctDNA sample is especially sensitive to the C-T conversion rate, which further affects the accuracy of the detection result.
Disclosure of Invention
Aiming at the problems, the invention provides a ctDNA methylation level prediction device and method based on target region capture sequencing, which effectively overcome the defects of low accuracy, large data quality deviation and the like in the conventional ctDNA methylation level prediction.
The technical scheme provided by the invention is as follows:
in one aspect, the present invention provides a ctDNA methylation level prediction device based on target region capture sequencing, comprising:
the FASTQ file processing module is used for acquiring a FASTQ file for capturing and sequencing a ctDNA sample to be detected, and performing preprocessing operation on the FASTQ file to obtain a filtered FASTQ file;
the comparison module of the sample to be tested is used for comparing the gene sequence in the FASTQ file obtained by the FASTQ file processing module with the reference genome and removing duplication to obtain a corresponding Bam file;
the reads level filtering module is used for filtering the reads in the Bam file generated by the to-be-detected sample comparison module one by one according to the preset C-T conversion rate to obtain a filtered Bam file;
and the methylation level prediction module is used for further filtering the Bam file output by the reads level filtering module according to the Bed file of the target area and the preset number of covered CpG sites in each read, and predicting the methylation level of the CpG sites according to the residual reads.
In this embodiment, FASTQ is a common file format for high-throughput sequencing data. Reads are the genome or transcriptome sequence fragments detected by a sequencer. Methylated C bases are divided into three types according to sequence context, namely CpG, CHG and CHH, where H represents any base other than G, i.e. A, C or T. In the CpG context, the base immediately downstream of the methylated C is G; in the CHG context, the two bases downstream of the methylated C are H and then G; in the CHH context, both bases downstream of the methylated C are H. CHG and CHH are collectively referred to as the non-CpG context. The Bam file stores the result of aligning the sequencing reads back to the reference genome. The C-T conversion rate is the proportion of C bases at non-CpG sites of the original sequence that are converted to T bases.
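The three contexts defined above can be illustrated with a minimal Python sketch. The function name `cytosine_context` and the choice to return `None` when fewer than two downstream bases are available (or when they contain N) are assumptions made for illustration; they are not part of the patent text.

```python
from typing import Optional


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Classify the cytosine at index i of a top-strand sequence.

    Returns "CpG", "CHG" or "CHH", or None when seq[i] is not C or when
    the downstream bases needed for the decision are missing or ambiguous.
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None
    downstream = seq[i + 1 : i + 3]
    # CpG: the base immediately downstream of C is G.
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    # CHG and CHH both need two known downstream bases (assumption: skip otherwise).
    if len(downstream) < 2 or "N" in downstream:
        return None
    # H is any base other than G; the first downstream base is already known to be H.
    return "CHG" if downstream[1] == "G" else "CHH"
```

For example, in the sequence `CCG` the first C is in a CHG context and the second C is in a CpG context.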
Further preferably, in the FASTQ file processing module, the preprocessing operation performed on the acquired FASTQ file includes: removing adapters and low-quality reads; and/or,
in the comparison module of the sample to be tested, the gene sequence in the FASTQ file obtained by the FASTQ file processing module is respectively compared with the human reference genome and the internal reference lambda DNA reference genome and is subjected to de-duplication, and a Bam file of the human reference genome, a comparison report before de-duplication and a comparison report after de-duplication, and a Bam file of the internal reference lambda DNA reference genome, a comparison report before de-duplication and a comparison report after de-duplication are generated.
Further preferably, the reads level filtering module includes:
a methylation number counting unit for reading the reads in the Bam file generated by the to-be-detected sample comparison module line by line and counting the numbers of methylated and unmethylated bases in the non-CpG context;
a C-T conversion rate calculation unit for calculating the C-T conversion rate of each read from the number of methylated non-CpG context bases and the total number of non-CpG context bases (methylated plus unmethylated);
and a first filtering unit for filtering out reads in the Bam file whose C-T conversion rate is lower than the preset C-T conversion rate, to obtain a filtered Bam file.
Further preferably, the methylation level prediction module comprises:
the second filtering unit is used for filtering out, according to the target region Bed file, the known SNP sites in the dbSNP database and the SNP sites generated by specific variation causes, to obtain the CpG sites of the ctDNA sample to be detected; and is further used for filtering the Bam file output by the reads level filtering module according to the CpG sites thus obtained and the preset number of covered CpG sites in each read;
and the methylation level calculation unit is used for calculating the methylation level of the CpG sites according to the residual reads of the Bam file after the filtering of the second filtering unit.
In another aspect, the present invention provides a ctDNA methylation level prediction method based on target region capture sequencing, comprising:
acquiring a FASTQ file for capturing and sequencing a ctDNA sample to be detected, and carrying out pretreatment operation on the FASTQ file to obtain a filtered FASTQ file;
comparing the gene sequence in the obtained FASTQ file with a reference genome and removing duplication to obtain a corresponding Bam file;
filtering reads in the generated Bam file one by one according to a preset C-T conversion rate to obtain a filtered Bam file;
and further filtering the filtered Bam file according to the Bed file of the target area and the preset number of covered CpG sites in each read, and predicting the methylation level of the CpG sites according to the residual reads.
Further preferably, the obtaining of the FASTQ file for capturing and sequencing the ctDNA sample to be tested, and performing a preprocessing operation on the FASTQ file to obtain a filtered FASTQ file, includes: removing adapters and low-quality reads from the acquired FASTQ file; and/or,
comparing the gene sequence in the obtained FASTQ file with a reference genome and de-duplicating the gene sequence to obtain a corresponding Bam file, wherein the file comprises: and respectively comparing the gene sequences in the FASTQ file obtained by the FASTQ file processing module with a human reference genome and an internal reference lambda DNA reference genome and removing the duplication to generate a Bam file of the human reference genome, an alignment report before duplication removal and an alignment report after duplication removal, and an internal reference lambda DNA reference genome Bam file, an alignment report before duplication removal and an alignment report after duplication removal.
Further preferably, the filtering reads in the generated Bam file one by one according to a preset C-T conversion rate to obtain a filtered Bam file includes:
reading reads in the Bam file line by line, and counting the number of methylated and unmethylated bases under a non-CpG context mode;
calculating the C-T conversion rate of each read according to the number of methylated non-CpG context bases and the total number of non-CpG context bases (methylated plus unmethylated);
and filtering reads with the C-T conversion rate smaller than the preset C-T conversion rate in the Bam file to obtain a filtered Bam file.
Further preferably, the further filtering the filtered Bam file according to the target region Bed file and the preset number of covered CpG sites in each reads, and predicting the methylation level of CpG sites according to the remaining reads includes:
filtering the known SNP sites in the dbSNP database and the SNP sites generated due to specific variation reasons according to the Bed file of the target region to obtain CpG sites of the ctDNA sample to be detected;
further filtering the Bam file according to the CpG sites obtained by filtering and the preset number of covered CpG sites in each read;
the methylation level of CpG sites was calculated from the remaining reads of the filtered Bam file.
In another aspect, the present invention provides a terminal device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the above ctDNA methylation level prediction method based on target region capture sequencing.
In another aspect, the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program is configured to, when executed by a processor, implement any of the above-mentioned steps of the ctDNA methylation level prediction method based on target region capture sequencing.
The ctDNA methylation level prediction device and method based on target region capture sequencing provided by the invention can at least bring the following beneficial effects:
1. on the basis of the traditional methylation data quality detection process, the influence of the C-T conversion rate on the subsequent prediction methylation level is considered, and the filtered methylation data is ensured to have higher reliability by using strict screening standards. Specifically, the C-T conversion rate of each reads is counted in consideration of the particularity of the ctDNA sample to be detected, and reads noise generated due to low C-T conversion rate is filtered, so that the reliability of methylation data is greatly improved, and a foundation is laid for the subsequent methylation level prediction.
2. Based on a common CpG locus methylation level prediction method, a stricter methylation level prediction standard is adopted, so that the methylation level prediction is more accurate. Specifically, the methylation state of reads covering CpG sites is considered, and reads with low reliability are filtered out, so that the methylation level prediction is more accurate, and a reliable data basis is provided for scientific research.
Drawings
The foregoing features, technical features, advantages and embodiments are further described in the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic diagram of the ctDNA methylation level prediction device based on target region capture sequencing according to the present invention;
FIG. 2 is a schematic flow chart of the ctDNA methylation level prediction method based on target region capture sequencing according to the present invention;
FIG. 3 is a schematic structural diagram of a terminal device in the present invention.
Reference numerals:
100: ctDNA methylation level prediction device; 110: FASTQ file processing module; 120: to-be-detected sample comparison module; 130: reads level filtering module; 140: methylation level prediction module.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
In the prior art, quality control of target region methylation capture data focuses mainly on the alignment rate to the reference genome, the base distribution ratio, the base content ratio, the capture efficiency and the sequencing depth, and does not consider the C-T conversion rate of the ctDNA sample to be detected. Because of the library construction method used in target region methylation sequencing and the particular nature of ctDNA samples, unmethylated cytosine may be incompletely converted, which introduces a large bias into the predicted methylation level of the ctDNA sample. In addition, existing software for predicting the methylation level of CpG sites varies in quality. Most algorithms simply divide the number of methylated reads by the total number of methylated and unmethylated reads, without considering the number and state of the CpG sites contained in each read. This also biases the predicted methylation level of ctDNA samples, so the accuracy and reliability of the results cannot be guaranteed and data interpretation is affected. On this basis, the invention provides a new ctDNA methylation level prediction device and method based on target region capture sequencing, which improves the accuracy of methylation level prediction and provides a reliable data basis for scientific research.
A first embodiment of the present invention, as shown in FIG. 1, is a ctDNA methylation level prediction device 100 based on target region capture sequencing, comprising: a FASTQ file processing module 110 for acquiring the FASTQ file from capture sequencing of a ctDNA sample to be detected and preprocessing it to obtain a filtered FASTQ file; a to-be-detected sample comparison module 120 for aligning the sequences in the FASTQ file obtained by the FASTQ file processing module 110 to a reference genome and removing duplicates to obtain a corresponding Bam file; a reads level filtering module 130 for filtering the reads in the Bam file generated by the to-be-detected sample comparison module 120 one by one according to a preset C-T conversion rate to obtain a filtered Bam file; and a methylation level prediction module 140 for further filtering the Bam file output by the reads level filtering module 130 according to the target region Bed file and the preset number of covered CpG sites in each read, and predicting the methylation level of the CpG sites from the remaining reads.
In the ctDNA methylation level prediction device 100, the FASTQ file processing module 110 first removes adapters and low-quality reads from the acquired FASTQ file to obtain FASTQ-format data free of adapters and low-quality bases. The to-be-detected sample comparison module 120 then aligns the sequences in the FASTQ file obtained by the FASTQ file processing module 110 to the reference genome and removes duplicates, retaining high-quality, non-duplicated reads to obtain a Bam file of the ctDNA sample to be detected. Next, the reads level filtering module 130 evaluates the non-CpG context C-T conversion rate of the reads in the Bam file and filters out unqualified reads to obtain a Bam file that can be used for subsequent analysis. Finally, the methylation level prediction module 140 filters and analyzes the Bam file to obtain accurate methylation level data for the CpG sites of the ctDNA sample to be detected.
After the FASTQ file processing module 110 obtains the FASTQ file from capture sequencing of the ctDNA sample to be detected, adapters and low-quality reads are removed with the adapter-trimming software Trimmomatic to obtain a filtered FASTQ file, and the data volume, base quality distribution and base content ratio of the sample are analyzed statistically with FastQC (quality control software for high-throughput sequencing data). Specifically, after the adapter sequence is trimmed, bases with a base quality below 20 at the beginning and end of the remaining sequence are trimmed; a sliding window of size 5 is then moved from the 5' end of the read and the average base quality in the window is calculated; if the average base quality in a window is below 20, the window is trimmed; and the number of bases remaining after trimming is required to exceed 75.
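The quality trimming rules described above can be sketched in plain Python as follows. This is an illustrative approximation, not Trimmomatic itself: the function name `trim_read` is hypothetical, adapter removal is assumed to have been done already, and the sketch assumes that a failing window truncates the read from the start of that window onward.

```python
def trim_read(seq, quals, min_q=20, window=5, min_len=75):
    """Quality-trim one read; return (seq, quals) or None if the read is discarded.

    seq is the base string and quals the list of per-base Phred scores.
    """
    # Step 1: trim bases below min_q from the beginning and end of the read.
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # Step 2: slide a window from the 5' end; cut at the first window whose
    # average quality is below min_q (assumption: everything after is dropped).
    cut = len(quals)
    for i in range(len(quals) - window + 1):
        if sum(quals[i : i + window]) / window < min_q:
            cut = i
            break
    seq, quals = seq[:cut], quals[:cut]

    # Step 3: the remaining length must exceed min_len, as stated in the text.
    if len(seq) <= min_len:
        return None
    return seq, quals
```

With the default parameters, a 100-base read whose last 30 bases have quality 10 is discarded, because only 70 bases remain after end trimming.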
The to-be-detected sample comparison module 120 uses the alignment tool Bismark (software that locates sequencing reads in a reference sequence and outputs the result in Bam format) to align the sequences in the FASTQ file obtained by the FASTQ file processing module 110 to the human reference genome and to the internal reference lambda DNA reference genome respectively, and to remove duplicates. This generates, for the human reference genome, a Bam file, an alignment report before deduplication and an alignment report after deduplication, and the same three outputs for the internal reference lambda DNA reference genome. The aligned Bam files are sorted and duplicates are marked using SAMtools and Picard. The inputs to this process are the raw data path of the ctDNA sample to be detected and the name of the ctDNA sample to be detected.
The inputs to the reads level filtering module 130 are the path of the deduplicated Bam file of the ctDNA sample to be detected aligned to the reference genome, and the minimum required non-CpG context C-T conversion rate. In the filtering process, the methylation number counting unit first reads the reads in the Bam file generated by the to-be-detected sample comparison module 120 line by line and, based on the base actually observed at each site whose original sequence is a C base in each read, counts the numbers of methylated and unmethylated bases in the non-CpG context. The C-T conversion rate calculation unit then calculates the C-T conversion rate of each read from the number of methylated non-CpG context bases and the total number of non-CpG context bases (the sum of the methylated and unmethylated counts). Finally, the first filtering unit filters out the reads in the Bam file whose C-T conversion rate is lower than the preset C-T conversion rate, i.e. reads that do not meet the minimum required non-CpG context C-T conversion rate, and outputs the filtered Bam file, the C-T conversion rate of the filtered ctDNA sample to be detected, and the reads data volume of the filtered ctDNA sample to be detected.
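A minimal sketch of this read-level filter is given below. It works on Bismark-style methylation call strings (the XM tag, in which `X`/`H` mark methylated CHG/CHH cytosines and `x`/`h` mark unmethylated ones). Several points are assumptions made for illustration rather than statements of the patented implementation: the function names, the formula rate = unmethylated / (methylated + unmethylated), and the `keep_uninformative` option for reads that contain no non-CpG cytosines.

```python
def non_cpg_conversion_rate(xm):
    """Per-read C-T conversion rate from a Bismark-style methylation call string.

    Returns None when the read contains no non-CpG cytosine calls.
    """
    methylated = xm.count("X") + xm.count("H")    # unconverted non-CpG C
    unmethylated = xm.count("x") + xm.count("h")  # converted non-CpG C
    total = methylated + unmethylated
    if total == 0:
        return None
    # Assumed formula: fraction of non-CpG cytosines that were converted.
    return unmethylated / total


def filter_reads(reads, min_rate, keep_uninformative=True):
    """Keep (name, xm) pairs whose conversion rate is at least min_rate.

    keep_uninformative (hypothetical option) controls reads with no
    non-CpG cytosines, a case the source text does not address.
    """
    kept = []
    for name, xm in reads:
        rate = non_cpg_conversion_rate(xm)
        if rate is None:
            if keep_uninformative:
                kept.append((name, xm))
        elif rate >= min_rate:
            kept.append((name, xm))
    return kept
```

In a real pipeline the call strings would be read from the Bam records; here they are passed in as plain strings so the sketch stays free of external dependencies.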
The inputs to the methylation level prediction module 140 are the path of the Bam file obtained after filtering by the reads level filtering module 130, the target region Bed file, and the minimum number of CpG sites that each read is required to cover. In the prediction process, the second filtering unit first uses BisSNP (software for analyzing methylation data, which can be used to identify methylation sites and predict methylation levels) to filter out, according to the target region Bed file, the known SNP sites in the dbSNP database and the SNP sites generated by specific variation causes (such as structural variation and chromosome copy number variation), to obtain the CpG sites of the ctDNA sample to be detected. The Bam file output by the reads level filtering module 130 is then further filtered according to the CpG sites thus obtained and the preset number of covered CpG sites in each read (i.e. the minimum number of CpG sites that each read is required to cover), and reads that do not meet this minimum requirement are filtered out. Finally, the methylation level calculation unit calculates the methylation level of the CpG sites from the reads remaining in the Bam file after filtering by the second filtering unit: the methylation level of each CpG site is the number of methylated reads that cover the site and meet the minimum requirement, divided by the number of all reads that cover the site and meet the minimum requirement.
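The per-site calculation described above can be sketched as follows. The data layout is an assumption chosen for illustration: each read is represented as a dictionary mapping a CpG site `(chromosome, position)` to `True` (methylated) or `False` (unmethylated), and `valid_cpgs` stands for the set of CpG sites left after SNP filtering. The function name is hypothetical.

```python
from collections import defaultdict


def cpg_methylation_levels(reads, valid_cpgs, min_cpgs_per_read):
    """Return {site: (methylation_level, depth)} from reads that pass the filter.

    A read is used only if it covers at least min_cpgs_per_read valid CpG sites.
    """
    methylated = defaultdict(int)
    covered = defaultdict(int)
    for calls in reads:
        # Ignore calls at sites removed by SNP filtering.
        usable = {site: m for site, m in calls.items() if site in valid_cpgs}
        if len(usable) < min_cpgs_per_read:
            continue  # read does not meet the minimum CpG coverage requirement
        for site, is_methylated in usable.items():
            covered[site] += 1
            methylated[site] += int(is_methylated)
    # Level = methylated qualifying reads / all qualifying reads at the site.
    return {site: (methylated[site] / covered[site], covered[site]) for site in covered}
```

Sites covered only by reads that fail the filter do not appear in the result, which avoids reporting a level with zero depth.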
Meanwhile, the Bam file filtered by the reads level filtering module 130 is processed with BEDTools (a toolset for genome arithmetic) in combination with the Bed file to obtain the capture efficiency of the ctDNA sample to be detected; and the filtered Bam file is processed with SAMtools (a tool for processing Bam/Sam files) to obtain the sequencing depth of the ctDNA sample to be detected at each site of the target region and to compute statistics such as the average sequencing depth of the sample.
In practical applications, the FASTQ file processing module 110, the to-be-detected sample comparison module 120, the reads level filtering module 130 and the methylation level prediction module 140 may be run separately, i.e. as independent modules, or may be integrated so that the whole process is completed automatically. In the automated methylation data quality control and methylation level prediction process, the inputs provided at one time are the FASTQ file from methylation target capture sequencing of the ctDNA sample to be detected and the target region Bed file (containing three columns: chromosome, start and end). The output files are: a statistical table for the ctDNA sample to be detected (including raw base data volume, raw reads data volume, filtered base data volume, filtered reads data volume, data volume and proportion of reads aligned to the reference genome, data volume after deduplication, total C base content, methylated C base content, unmethylated C base content, methylated C base content in CpG and non-CpG contexts, unmethylated C base content in CpG and non-CpG contexts, C base content before reads level filtering, C-T conversion rate of the sample before reads level filtering, C-T conversion rate of the internal reference lambda DNA, C-T conversion rate of the sample after reads level filtering, data volume after reads level filtering, number of bases in the target region, data volume and proportion captured in the target region, number and proportion of target region bases at different sequencing depths, and average sequencing depth); and the methylation level of the CpG sites in the target region of the ctDNA sample to be detected (five fields: chromosome, start, end, methylation level and sequencing depth).
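The three-column Bed input and the five-field CpG output described above can be illustrated with a short sketch. The function names, the tab-separated output layout and the four-decimal formatting of the methylation level are assumptions for illustration; the patent specifies only which fields are present.

```python
def parse_bed(lines):
    """Parse Bed lines into (chromosome, start, end) tuples.

    Only the first three whitespace-separated columns are used; blank lines
    and comment, track and browser lines are skipped.
    """
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def format_cpg_row(chrom, start, end, level, depth):
    """Format one output row: chromosome, start, end, methylation level, depth."""
    return f"{chrom}\t{start}\t{end}\t{level:.4f}\t{depth}"
```

For instance, `format_cpg_row("chr1", 100, 101, 0.5, 40)` yields a tab-separated row with the five fields in the order listed in the text.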
Correspondingly, the invention also provides a ctDNA methylation level prediction method based on target region capture sequencing, which is applied to the ctDNA methylation level prediction device, as shown in fig. 2, and the ctDNA methylation level prediction method comprises the following steps: s10, acquiring a FASTQ file for capturing and sequencing a ctDNA sample to be detected, and preprocessing the FASTQ file to obtain a filtered FASTQ file; s20, comparing the gene sequence in the obtained FASTQ file with a reference genome and removing duplication to obtain a corresponding Bam file; s30, filtering reads in the generated Bam file one by one according to a preset C-T conversion rate to obtain a filtered Bam file; s40, further filtering the filtered Bam file according to the Bed file of the target area and the preset number of covered CpG sites in each reads, and predicting the methylation level of the CpG sites according to the residual reads.
Specifically, step S20 includes: aligning the sequences in the FASTQ file obtained by the FASTQ file processing module to the human reference genome and to the internal reference lambda DNA reference genome respectively and removing duplicates, to generate a Bam file, an alignment report before deduplication and an alignment report after deduplication for the human reference genome, and a Bam file, an alignment report before deduplication and an alignment report after deduplication for the internal reference lambda DNA reference genome. Step S30 includes: reading the reads in the Bam file line by line and counting the numbers of methylated and unmethylated bases in the non-CpG context; calculating the C-T conversion rate of each read according to the number of methylated non-CpG context bases and the total number of non-CpG context bases (methylated plus unmethylated); and filtering out the reads in the Bam file whose C-T conversion rate is lower than the preset C-T conversion rate to obtain a filtered Bam file. Step S40 includes: filtering out, according to the target region Bed file, the known SNP sites in the dbSNP database and the SNP sites generated by specific variation causes, to obtain the CpG sites of the ctDNA sample to be detected; further filtering the Bam file according to the CpG sites thus obtained and the preset number of covered CpG sites in each read; and calculating the methylation level of the CpG sites from the reads remaining in the filtered Bam file.
The ctDNA methylation level prediction method based on target region capture sequencing and its beneficial effects are illustrated below by an example:
1. sample preparation
ctDNA samples from 6 tumor patients were selected for library construction, target region capture and sequencing, with 2 replicates per patient. The operations were as follows:
1.1 treating plasma
1.1.1 After thawing, 15 μL of proteinase K (20 mg/mL) and 50 μL of sodium dodecyl sulfate (SDS) solution (20%) were added per 1 mL of sample. If the plasma volume was less than 4 mL, it was made up with phosphate-buffered saline (PBS). The tube was inverted to mix, incubated at 60 ℃ for 20 min, and cooled on ice for 5 min.
1.1.2 Reagents were added to the deep-well plates; the reagents and the corresponding volumes for each plate are shown in Table 1:
Table 1: List of reagents added to the deep-well plates
[Table 1: image not reproduced]
1.1.3 operating KingFisher FLEX magnetic bead extractor
Before the program was run, a clean magnetic head sleeve was placed at the designated position and the check program was run to test whether the magnetic head sleeve would fall off. After the deep-well plates had been loaded, the START key on the automatic extractor was pressed, and the magnetic head sleeve and the corresponding deep-well plates were placed in sequence as prompted on the display. The START key was pressed again, and the automatic extractor began to run.
1.1.4 Aspiration of the DNA sample:
After the automatic extractor finished running, the No. 7 deep-well plate was taken out first, and then the STOP key was pressed. The DNA sample was transferred with a pipette into the correspondingly labeled centrifuge tube.
1.2 cfDNA library construction
1.2.1 preparation of internal reference
Lambda DNA was added to a 50 μL shearing tube and fragmented with an M220 shearing instrument; the fragmented internal reference DNA was diluted and added to the sample during library construction. Lambda DNA serves as a reference for determining the conversion efficiency of the sample.
1.2.2 preparation of DNA samples
The cfDNA extracted from the plasma of each of the 6 tumor patients was divided into 2 aliquots at a total input of 10 ng, and the fragmented internal reference was added for library construction; the cfDNA samples themselves were not fragmented. Sample operation information is shown in Table 2.
Table 2: Sample operation information list
[Table 2: image not reproduced]
1.3 library preparation procedure:
1.3.1 Conversion of DNA with the EZ DNA Methylation-Lightning™ Kit (manufactured by Zymo Research)
1.3.1.1 The starting sample volume was 20 μL; samples of less than 20 μL were made up with water.
1.3.1.2 130 μL of Lightning Conversion Reagent from the kit was added to the DNA sample, mixed by shaking, briefly centrifuged, placed on a PCR instrument, and reacted under the conditions shown in Table 3.
Table 3: Conditions of PCR reaction
[Table 3: image not reproduced]
1.3.1.3 600 μL of M-Binding Buffer from the kit was added to a Zymo-Spin™ IC Column from the kit; the product of the above reaction was then added to the Zymo-Spin™ IC Column containing the M-Binding Buffer, mixed thoroughly by pipetting, and left to stand for 2 min. The column was centrifuged at 12000 rpm for 1 min.
1.3.1.4 The liquid in the collection tube was added back to the adsorption column, left to stand for 2 min, centrifuged at 12000 rpm for 1 min, and the waste liquid was discarded.
1.3.1.5 100 μL of M-Wash Buffer from the kit was added, the column was centrifuged at 12000 rpm for 1 min, and the waste liquid was discarded.
1.3.1.6 200 μL of L-Desulphonation Buffer from the kit was added, the column was incubated at room temperature (20-30 ℃) for 15-20 min and centrifuged at 12000 rpm for 1 min, and the waste liquid was discarded.
1.3.1.7 200 μL of M-Wash Buffer from the kit was added, the column was centrifuged at 12000 rpm for 1 min, and the waste liquid was discarded; this wash was repeated twice.
1.3.1.8 The adsorption column was returned to the collection tube and centrifuged at 12000 rpm for 2 min, and the waste liquid was discarded. The lid of the adsorption column was opened and the column was left at room temperature for 2-5 min to thoroughly dry the residual wash buffer in the adsorption material.
1.3.1.9 The adsorption column was transferred to a clean centrifuge tube, 20 μL of elution buffer TE was added dropwise to the center of the adsorption membrane without touching it, and the column was left at room temperature for 2-5 min and centrifuged at 12000 rpm for 1 min.
1.3.1.10 The liquid in the collection tube was added back to the adsorption column, left at room temperature for 2-5 min, and centrifuged at 12000 rpm for 1 min. The tube containing the converted DNA was stored at -20 ℃ (converted DNA should be used as soon as possible).
1.3.2 DNA pretreatment
1.3.2.1 PCR instrument was preheated in advance at 95 ℃ and the hot lid temperature was 105 ℃.
1.3.2.2 The converted, fragmented DNA was placed in a 0.2 mL PCR tube, and low-EDTA TE buffer (Low EDTA TE) was added to bring the total volume to 15 μL.
1.3.2.3 put the PCR tube into the PCR instrument, incubate at 95 ℃ for 2min, immediately put on ice, and stand for 2 min.
1.3.3 Addition of the T7 adapter
1.3.3.1 The PCR instrument was preheated to 37 ℃ in advance, with the heated lid at 105 ℃.
1.3.3.2 The reaction system was prepared according to Table 4; the reagents in the table were from the ACCEL-NGS METHYL-SEQ DNA LIBRARY KIT (produced by Swift Biosciences).
Table 4: List of reagents
[Table 4: image not reproduced]
1.3.3.3 25 μL of the reaction mixture was added to the PCR tube containing the pretreated DNA sample on ice, mixed by pipetting, and briefly centrifuged.
1.3.3.4 The PCR tube was placed in the PCR instrument and the reaction was carried out under the conditions shown in Table 5.
Table 5: Reaction conditions
[Table 5: image not reproduced]
1.3.4 Second strand synthesis reaction
1.3.4.1 The PCR instrument was preheated to 98 ℃ in advance, with the heated lid at 105 ℃.
1.3.4.2 Reagents were prepared according to Table 6; the reagents were from the ACCEL-NGS METHYL-SEQ DNA LIBRARY KIT (produced by Swift Biosciences).
Table 6: List of reagents
[Table 6: image not reproduced]
1.3.4.3 The reagent mixture prepared according to Table 6 was added to the reaction system from the previous step, mixed by pipetting, and briefly centrifuged.
1.3.4.4 The PCR tube was placed in the PCR instrument and the second strand synthesis reaction was carried out under the conditions shown in Table 7.
Table 7: Reaction conditions for second strand synthesis
[Table 7: image not reproduced]
1.3.4.5 The purification magnetic beads were taken out of 4 ℃ storage in advance and equilibrated at room temperature for half an hour.
1.3.4.6 After the reaction in the previous step was completed, 101 μL of magnetic beads was added to the product and mixed by pipetting.
1.3.4.7 The tube was left at room temperature for 5 min and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.3.4.8 200 μL of 80% ethanol was added, incubated for 30 sec, and discarded. The 80% ethanol must be freshly prepared. The 200 μL 80% ethanol wash step was repeated once.
1.3.4.9 Residual ethanol at the bottom of the centrifuge tube was removed with a 10 μL pipette tip, and the beads were dried at room temperature until the ethanol had completely evaporated.
1.3.4.10 The tube was removed from the magnetic rack, 16 μL of ultrapure water was added, and the mixture was shaken to mix and incubated at room temperature for 2 min.
1.3.4.11 After brief centrifugation, the tube was placed on a magnetic rack until the liquid was clear, and 15 μL of the sample was transferred to a new centrifuge tube.
1.3.5 Addition of the T5 adapter
1.3.5.1 Reagents were prepared according to Table 8; the reagents were from the ACCEL-NGS METHYL-SEQ DNA LIBRARY KIT (produced by Swift Biosciences). 15 μL of the reaction mixture was added to the sample from the previous step, mixed by pipetting, and briefly centrifuged.
Table 8: List of reagents
[Table 8: image not reproduced]
1.3.5.2 The PCR tubes were placed in the PCR instrument and the reaction was carried out under the conditions shown in Table 9.
Table 9: Conditions of PCR reaction
[Table 9: image not reproduced]
1.3.5.3 The purification magnetic beads were taken out of 4 ℃ storage in advance and equilibrated at room temperature for half an hour.
1.3.5.4 After the ligation reaction was completed, 36 μL of magnetic beads was added and mixed by pipetting.
1.3.5.5 The tube was left at room temperature for 5 min and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.3.5.6 200 μL of 80% ethanol was added, incubated for 30 sec, and discarded. The 80% ethanol must be freshly prepared. The 200 μL 80% ethanol wash step was repeated once.
1.3.5.7 Residual ethanol at the bottom of the centrifuge tube was removed with a 10 μL pipette tip, and the beads were dried at room temperature until the ethanol had completely evaporated.
1.3.5.8 The tube was removed from the magnetic rack, 20 μL of ultrapure water was added, and the mixture was shaken to mix and incubated at room temperature for 2 min.
1.3.5.9 After brief centrifugation, the tube was placed on a magnetic rack until the liquid was clear, and 20 μL of the sample was transferred to a new centrifuge tube.
1.3.6 amplification
1.3.6.1 Reagents were prepared according to Table 10; the reagents in the table were from the ACCEL-NGS METHYL-SEQ DNA LIBRARY KIT (produced by Swift Biosciences). 30 μL of the reaction mixture was added to the sample from the previous step, mixed by pipetting, and briefly centrifuged.
Table 10: List of reagents
[Table 10: image not reproduced]
1.3.6.2 The PCR tubes were placed in the PCR instrument and the PCR reaction was carried out under the conditions shown in Table 11.
Table 11: Conditions of PCR reaction
[Table 11: image not reproduced]
1.3.6.3 The purification magnetic beads were taken out of 4 ℃ storage in advance and equilibrated at room temperature for half an hour.
1.3.6.4 After the reaction was completed, 60 μL of magnetic beads was added and mixed by pipetting.
1.3.6.5 The tube was left at room temperature for 5 min and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.3.6.6 200 μL of 80% ethanol was added, incubated for 30 sec, and discarded. The 80% ethanol must be freshly prepared. The 200 μL 80% ethanol wash step was repeated once.
1.3.6.7 Residual ethanol at the bottom of the centrifuge tube was removed with a 10 μL pipette tip, and the beads were dried at room temperature until the ethanol had completely evaporated.
1.3.6.8 The tube was removed from the magnetic rack, 50 μL of ultrapure water was added, and the mixture was shaken to mix and incubated at room temperature for 2 min.
1.3.6.9 After brief centrifugation, the tube was placed on a magnetic rack until the liquid was clear, and 50 μL of the sample was transferred to a new centrifuge tube.
1.4 library Capture
1.4.1 Pooling libraries:
The total library input per capture was 1 μg. The hybridization reagent was added to the system, mixed by shaking, and briefly centrifuged.
1.4.2 The EP tube was sealed with sealing film, placed in a vacuum centrifugal concentrator, and evaporated to dryness (60 ℃, about 20 min to 1 hr). The tube should be checked regularly to see whether it has evaporated to dryness.
1.4.3 DNA denaturation:
After the samples had evaporated completely to dryness, 7.5 μL of 2× Hybridization Buffer (vial 5) and 3 μL of Hybridization Component A (vial 6) were added to each capture, mixed well by shaking, briefly centrifuged, and denatured at 95 ℃ for 10 min. Both reagents in this step were from the SeqCap Hyb and Wash Kit (manufactured by Roche).
1.4.4 library hybridization to probes:
1.4.4.1 The probe was taken out and briefly centrifuged.
1.4.4.2 After brief centrifugation, the denatured DNA (kept at 95 ℃ throughout) was quickly transferred to the PCR tube containing the probe, mixed by shaking, and briefly centrifuged.
1.4.4.3 The tube was placed in a PCR instrument and hybridized at 47 ℃.
1.4.5 preparation of purification reagents
1.4.5.1 The purification reagents required for capture were prepared as shown in Table 12; buffers were prepared according to the table based on the number of captures. The reagents in the table were from the SeqCap Hyb and Wash Kit (manufactured by Roche).
Table 12: Preparation of the purification reagents required for capture
[Table 12: image not reproduced]
1.4.5.2 Incubation of Capture Beads and Wash Buffer working solutions:
Capture Beads were equilibrated at room temperature for 30 min before use.
Wash Buffer working solutions were incubated at 47 ℃ for 2 hr before use.
1.4.6 post-hybridization purification
1.4.6.1 100 μL of capture beads was dispensed for each capture and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.6.2 200 μL of 1× Bead Wash Buffer (vial 7) was added, mixed by shaking, and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded; this was repeated twice. 100 μL of 1× Bead Wash Buffer (vial 7) was then added, mixed by shaking, and placed on the magnetic rack until the liquid was clear, and the supernatant was discarded completely. Bead pretreatment was then complete and the next step was performed immediately.
1.4.6.3 The overnight hybridization mixture was transferred onto the washed magnetic beads and pipetted up and down ten times. The tube was placed in a PCR instrument and incubated at 47 ℃ for 45 min (PCR heated lid set to 57 ℃), with shaking once every 15 min to keep the beads in suspension. The 1× Bead Wash Buffer (vial 7) was from the SeqCap Hyb and Wash Kit (manufactured by Roche).
1.4.7 Washing with the SeqCap Hyb and Wash Kit (manufactured by Roche)
1.4.7.1 After incubation was completed, 100 μL of 1× Wash Buffer I (vial 1) preheated to 47 ℃ was added to each tube and mixed by shaking. The tube was placed on a magnetic rack until the liquid was clear, and the supernatant was discarded. The reagents used in all steps through 1.4.7.6 were from the SeqCap Hyb and Wash Kit (manufactured by Roche).
1.4.7.2 1× Stringent Wash Buffer (vial 4) preheated to 47 ℃ was added and mixed by pipetting ten times. The tube was incubated at 47 ℃ for 5 min and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.7.3 1× Stringent Wash Buffer (vial 4) preheated to 47 ℃ was added again and mixed by pipetting ten times. The tube was incubated at 47 ℃ for 5 min and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.7.4 1× Wash Buffer I (vial 1) at room temperature was added, and the tube was shaken for 2 min, briefly centrifuged, and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.7.5 1× Wash Buffer II (vial 2) at room temperature was added, and the tube was shaken for 1 min, briefly centrifuged, and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.7.6 1× Wash Buffer III (vial 3) at room temperature was added, and the tube was shaken for 30 sec, briefly centrifuged, and placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.7.7 36 μL of ultrapure water was added to the centrifuge tube for elution and mixed by shaking, and the next amplification step was carried out.
1.4.8 PCR reaction
1.4.8.1 The mixed solution was prepared according to Table 13 based on the number of captures, and mixed by shaking. The reagents in the table were all from the SeqCap Hyb and Wash Kit (manufactured by Roche).
Table 13: List of reagents for the mixed solution
[Table 13: image not reproduced]
1.4.8.2 After brief centrifugation, the mixed solution was dispensed into PCR tubes at 30 μL per tube. Each captured sample was divided into two tubes for PCR amplification, with 20 μL of sample per tube.
1.4.8.3 The samples were transferred into the PCR reaction mixture, mixed by shaking, and briefly centrifuged.
1.4.8.4 The tubes were placed on a PCR instrument and the PCR reaction was carried out under the conditions shown in Table 14.
Table 14: Conditions of PCR reaction
[Table 14: image not reproduced]
1.4.9 purification after amplification
1.4.9.1 The purification magnetic beads were taken out and equilibrated at room temperature for 30 min.
1.4.9.2 The purification magnetic beads were placed in a 1.5 mL centrifuge tube, 100 μL of the amplified captured DNA library was added, and the mixture was mixed well by shaking and incubated at room temperature for 15 min.
1.4.9.3 The tube was placed on a magnetic rack until the liquid was clear, and the supernatant was discarded.
1.4.9.4 200 μL of 80% ethanol was added, incubated for 30 sec, and discarded. The 80% ethanol must be freshly prepared. The 200 μL 80% ethanol wash step was repeated once.
1.4.9.5 Residual ethanol at the bottom of the centrifuge tube was removed with a 10 μL pipette tip, and the beads were dried at room temperature until the ethanol had completely evaporated.
1.4.9.6 The tube was removed from the magnetic rack, 120 μL of ultrapure water was added, and the mixture was shaken to mix and incubated at room temperature for 2 min.
1.4.9.7 After brief centrifugation, the tube was placed on a magnetic rack until the liquid was clear, and the captured sample was transferred to a new centrifuge tube.
1.5 library pooling and sequencing
The mass of library to pool for each capture was calculated according to its share of the data volume, and the different captures were mixed into one sample in those proportions. A PhiX library was then added to form the sample for loading, and sequencing was performed. PhiX is a phage library that mitigates base imbalance and can also serve as a reference for evaluating sequencing quality.
2. Processing the raw FASTQ files into input files usable by the modules and software
After the sequencing data were obtained, they were first processed from FASTQ files into Bam files. The software and steps were as follows:
2.1 Adapter removal
Trimmomatic-0.36 was called to process each pair of FASTQ files as paired reads, removing adapters and low-quality bases and generating adapter-trimmed FASTQ files. Specifically, after the adapter sequences were clipped, bases with a base quality below 20 at the start and end of the remaining sequence were removed; a sliding window of size 5 was then moved from the 5' end of each read and the average base quality within the window was calculated, and the read was cut at a window whose average base quality was below 20; finally, the number of bases remaining after trimming was required to exceed 75.
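The quality trimming rule described above can be sketched in a few lines of Python. This is a simplified illustration of the stated rule (end clipping below quality 20, a window of 5 with a mean quality threshold of 20, and more than 75 bases remaining), not Trimmomatic's own implementation; the function name and the representation of a read as a list of integer Phred scores are assumptions, and adapter clipping is taken to have been done beforehand.

```python
def trim_read(quals, min_qual=20, window=5, min_len=75):
    """Return (start, end) of the retained slice, or None if the read is dropped.

    quals is a list of per-base Phred quality scores for one read.
    """
    start, end = 0, len(quals)

    # Clip low-quality bases from the start and the end of the read.
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1

    # Slide a fixed-size window from the 5' end; cut the read at the first
    # window whose mean quality falls below the threshold.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_qual:
            end = i
            break

    # Following the wording of the text, the remaining length must exceed min_len.
    if end - start <= min_len:
        return None
    return start, end
```

For a read with two poor bases at each end and a run of low-quality bases after position 82, the sketch keeps only the high-quality stretch in front of the first failing window.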
2.2 alignment
Bismark-v0.19.0 was called to align the adapter-trimmed FASTQ files as paired reads to the hg19 human reference genome sequence and to the lambda DNA reference genome sequence, generating initial Bam files and alignment reports.
2.3 Deduplication
The deduplication module of Bismark-v0.19.0 was called to deduplicate the initial Bam file, generating a deduplicated Bam file and a deduplication report.
2.4 Sorting and tagging
The sort module of SAMtools-1.3 was called to sort the deduplicated Bam file, generating a sorted Bam file; the AddOrReplaceReadGroups module of Picard-2.1.0 (a toolkit for processing high-throughput sequencing data, which can handle alignment result files such as SAM/Bam) was then called to tag the sorted Bam file with read groups.
2.5 Filtering
The clipOverlap module of BamUtil-1.0.14 was called to process the tagged Bam file: for paired reads with overlapping bases, the CIGAR values of the reads aligned to the negative strand of the reference sequence were converted so that the overlap was clipped, generating a Bam file with overlapping sequences removed. SAMtools-1.3 view was then called to filter this Bam file by mapping quality (which quantifies the likelihood that a read is aligned to the wrong position; the higher the value, the lower that likelihood), requiring a mapping quality above 20, to generate the final Bam file. The CIGAR value records the alignment details of each read in the Bam file.
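To make the mapping quality threshold concrete: the SAM format defines MAPQ as -10·log10 of the probability that the mapping position is wrong, so a value of 20 corresponds to an error probability of 0.01. The Python sketch below shows this conversion and a simple filter over SAM text lines. The function names are hypothetical, and the strict "greater than 20" comparison follows the wording of the text rather than the behaviour of any particular tool option.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality into a mis-mapping probability."""
    return 10 ** (-mapq / 10)


def filter_by_mapq(sam_lines, min_mapq=20):
    """Keep header lines and alignments whose MAPQ exceeds min_mapq.

    sam_lines is an iterable of tab-separated SAM text lines; MAPQ is the
    fifth column of each alignment line.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # header line, always kept
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) > min_mapq:
            kept.append(line)
    return kept
```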
2.6 building an index
The index module of SAMtools-1.3 was called to index the final Bam file, generating the bai file paired with it.
2.7 Data statistics
FASTQC-0.11.3 was called to count the base data volume, read data volume, base distribution and the like of the FASTQ files before and after adapter removal; the total C base content, the methylated C base content, the unmethylated C base content, the methylated C base content in the CpG and non-CpG contexts, and the unmethylated C base content in the CpG and non-CpG contexts were taken from the human reference genome alignment report generated during alignment; the intersect module of Bedtools-v2.26.0 was called to count the number of bases within the target region in the final Bam file, and the amount and proportion of data captured by the target region; and SAMtools-1.3 was called to compute, for the final Bam file, the sequencing depth, the average sequencing depth, and the number and proportion of target-region bases covered at different sequencing depths.
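The depth statistics listed last can be summarised from a per-base depth table. The following Python sketch assumes the depths of the target-region bases have already been extracted into a list of integers (for example from a per-position depth report); the function name, the output layout and the depth thresholds are hypothetical and chosen only for illustration.

```python
def depth_summary(depths, thresholds=(1, 10, 100)):
    """Summarise per-base sequencing depths over the target region.

    Returns the mean depth and, for each threshold, the number and the
    proportion of target bases covered at that depth or more.
    """
    n = len(depths)
    mean_depth = sum(depths) / n if n else 0.0
    at_least = {}
    for t in thresholds:
        count = sum(1 for d in depths if d >= t)
        at_least[t] = (count, count / n if n else 0.0)
    return {"mean_depth": mean_depth, "at_least": at_least}
```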
3. Direct identification of CpG methylation levels in ctDNA samples by the conventional method
The final Bam file was processed with the BisSNP software: first, the BisulfiteCountCovariates and BisulfiteTableRecalibration modules of BisSNP-0.82.2 were called to recalibrate base quality, generating a recalibration csv file and a recalibrated Bam file; then, the BisulfiteGenotyper module and the target region Bed file were used to identify the SNP sites and CpG sites of the sample to be tested, generating raw VCF files for the SNPs and the CpGs; finally, the VCFpostprocess module was called to filter the CpG sites according to the generated VCF files, giving the final CpG sites and their methylation levels.
4. Identification of CpG methylation levels in ctDNA samples using the method of the invention
The final Bam file was used as the input file, and a minimum C-T conversion rate in the non-CpG context was set as the requirement. The read-level filtering module of the invention was called to read the Bam file line by line, judge whether the non-CpG context of each read met the minimum C-T conversion rate, and retain the reads that met the requirement, generating the filtered Bam file. Then, with the filtered Bam file and the CpG sites identified by the BisSNP-0.82.2 software as input files, and requiring each read to contain at least 3 CpG sites, the methylation level prediction module was called to remove the reads in the Bam file that did not meet this requirement and then calculate the methylation level of each CpG site.
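The second stage described above can be sketched as follows. The sketch assumes that each read has already been reduced to a mapping from genomic CpG position to a methylated/unmethylated call, and that the methylation level of a site is the fraction of retained reads in which it is methylated; the document does not spell out the formula, so both the data layout and this definition are assumptions, as are the function and parameter names.

```python
from collections import defaultdict


def cpg_methylation_levels(reads, cpg_sites, min_cpgs_per_read=3):
    """Estimate per-site methylation levels from reads that pass the CpG filter.

    reads: iterable of dicts {cpg_position: True if methylated else False}.
    cpg_sites: set of CpG positions retained after SNP/target-region filtering.
    Reads covering fewer than min_cpgs_per_read retained sites are discarded.
    """
    methylated = defaultdict(int)
    total = defaultdict(int)
    for calls in reads:
        covered = {pos: m for pos, m in calls.items() if pos in cpg_sites}
        if len(covered) < min_cpgs_per_read:
            continue  # read covers too few CpG sites
        for pos, is_methylated in covered.items():
            total[pos] += 1
            if is_methylated:
                methylated[pos] += 1
    return {pos: methylated[pos] / total[pos] for pos in total}
```

Requiring several CpG sites per read means that each retained read contributes calls to several sites at once, so short or marginal alignments do not influence any site.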
5. Comparison of methylation level prediction by the traditional method and the method of the invention
Methylation levels were predicted with the different methods, and the correlation between the methylation levels of the 6 pairs of replicate samples was compared. The results were as follows:
5.1 For the different methods, the consistency between replicate samples of the predicted methylation levels at CpG sites whose methylation level is below 1 in both replicates is shown in Table 15. In Table 15, the Sample column lists the paired replicate samples used to calculate the correlation; the non-C-T-BisSNP column gives the correlation coefficient when no C-T conversion rate filtering is applied and the methylation level is calculated with the BisSNP-0.82.2 software, i.e. the traditional method; the C-T-BisSNP column gives the correlation coefficient when the methylation level is calculated with the BisSNP-0.82.2 software after C-T conversion rate filtering; and the C-T-estimate column gives the correlation coefficient of the method of the invention.
Table 15: List of correlation coefficients (all sites) for methylation level prediction results of each replicate sample under different methods
[Table 15: image not reproduced]
5.2 For the different methods, the consistency between replicate samples of the predicted methylation levels at CpG sites whose methylation level is below 0.02 in both replicates is shown in Table 16. In Table 16, the Sample column lists the paired replicate samples used to calculate the correlation; the non-C-T-BisSNP column gives the correlation coefficient when no C-T conversion rate filtering is applied and the methylation level is calculated with the BisSNP-0.82.2 software, i.e. the traditional method; the C-T-BisSNP column gives the correlation coefficient when the methylation level is calculated with the BisSNP-0.82.2 software after C-T conversion rate filtering; and the C-T-estimate column gives the correlation coefficient of the method of the invention.
Table 16: List of correlation coefficients (low methylation level sites) for methylation level prediction results of each replicate sample under different methods
[Table 16: image not reproduced]
As can be seen from the tables, compared with the non-C-T-BisSNP and C-T-BisSNP methods, the read-level filtering module and the methylation level prediction module added in the invention improve the correlation of low methylation levels between replicate samples, and are therefore more suitable for predicting the methylation level of ctDNA.
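The replicate comparison can be reproduced in outline with the Python sketch below, which restricts the calculation to CpG sites whose methylation level is below a cut-off in both replicates (1 for all sites, 0.02 for low-methylation sites). The document does not state which correlation coefficient was used; the Pearson coefficient is assumed here, and the function name and the dictionary input format are hypothetical.

```python
import math


def replicate_correlation(levels_a, levels_b, max_level=None):
    """Pearson correlation of methylation levels between two replicates.

    levels_a, levels_b: dicts {cpg_site: methylation_level}.
    max_level: if given, only sites below this level in BOTH replicates are used.
    Returns None when fewer than two sites remain or a replicate has no variance.
    """
    shared = sorted(set(levels_a) & set(levels_b))
    pairs = [(levels_a[s], levels_b[s]) for s in shared]
    if max_level is not None:
        pairs = [(a, b) for a, b in pairs if a < max_level and b < max_level]
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)
```

Calling the function once with max_level=1 and once with max_level=0.02 would correspond to the two site selections used for Table 15 and Table 16.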
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of program modules is illustrated, and in practical applications, the above-described distribution of functions may be performed by different program modules, that is, the internal structure of the apparatus may be divided into different program units or modules to perform all or part of the above-described functions. Each program module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one processing unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software program unit. In addition, the specific names of the program modules are only used for distinguishing the program modules from one another, and are not used for limiting the protection scope of the application.
Fig. 3 is a schematic structural diagram of a terminal device provided in an embodiment of the present invention. As shown, the terminal device 200 includes: a processor 220, a memory 210, and a computer program 211 stored in the memory 210 and executable on the processor 220, for example a program for ctDNA methylation level prediction based on target region capture sequencing. When executing the computer program 211, the processor 220 implements the steps of the embodiments of the ctDNA methylation level prediction method based on target region capture sequencing described above, or implements the functions of the modules of the embodiments of the ctDNA methylation level prediction device based on target region capture sequencing described above.
The terminal device 200 may be a notebook, a palm computer, a tablet computer, a mobile phone, or the like. Terminal device 200 may include, but is not limited to, processor 220, memory 210. Those skilled in the art will appreciate that fig. 3 is merely an example of terminal device 200, does not constitute a limitation of terminal device 200, and may include more or fewer components than shown, or some components may be combined, or different components, such as: terminal device 200 may also include input-output devices, display devices, network access devices, buses, and the like.
The Processor 220 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor 220 may be a microprocessor or the processor may be any conventional processor or the like.
The memory 210 may be an internal storage unit of the terminal device 200, such as a hard disk or a memory of the terminal device 200. The memory 210 may also be an external storage device of the terminal device 200, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the terminal device 200. Further, the memory 210 may include both an internal storage unit and an external storage device of the terminal device 200. The memory 210 is used to store the computer program 211 and other programs and data required by the terminal device 200. The memory 210 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by sending instructions to relevant hardware through the computer program 211, where the computer program 211 may be stored in a computer readable storage medium, and when the computer program 211 is executed by the processor 220, the steps of the method embodiments may be implemented. Wherein the computer program 211 comprises: computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying the code of computer program 211, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the content of the computer readable storage medium can be increased or decreased according to the requirements of the legislation and patent practice in the jurisdiction, for example: in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above embodiments may be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention; for those skilled in the art, numerous modifications and refinements can be made without departing from the principle of the present invention, and such modifications and refinements shall be considered to fall within the scope of protection of the present invention.

Claims (10)

1. A ctDNA methylation level prediction device based on target region capture sequencing, comprising:
a FASTQ file processing module, configured to acquire a FASTQ file from capture sequencing of a ctDNA sample to be tested and to preprocess the FASTQ file to obtain a filtered FASTQ file;
a to-be-tested sample alignment module, configured to align the gene sequences in the FASTQ file obtained by the FASTQ file processing module to a reference genome and remove duplicates to obtain a corresponding Bam file;
a reads level filtering module, configured to filter, one by one, the reads in the Bam file generated by the to-be-tested sample alignment module according to a preset C-T conversion rate to obtain a filtered Bam file; and
a methylation level prediction module, configured to further filter the Bam file output by the reads level filtering module according to a target region Bed file and a preset number of CpG sites covered by each read, and to predict the methylation level of CpG sites from the remaining reads.
2. The ctDNA methylation level prediction device according to claim 1, wherein
in the FASTQ file processing module, the preprocessing operation performed on the acquired FASTQ file includes removing adapters and low-quality reads; and/or
in the to-be-tested sample alignment module, the gene sequences in the FASTQ file obtained by the FASTQ file processing module are aligned to a human reference genome and to an internal reference lambda DNA reference genome, respectively, and de-duplicated, generating a Bam file, a pre-deduplication alignment report, and a post-deduplication alignment report for the human reference genome, and a Bam file, a pre-deduplication alignment report, and a post-deduplication alignment report for the internal reference lambda DNA reference genome.
3. The ctDNA methylation level prediction device according to claim 1 or 2, wherein the reads level filtering module comprises:
a methylation counting unit, configured to read the reads in the Bam file generated by the to-be-tested sample alignment module line by line and to count the numbers of methylated and unmethylated bases in non-CpG contexts;
a C-T conversion rate calculation unit, configured to calculate the C-T conversion rate of each read according to the sum of the number of methylated non-CpG-context bases and the number of unmethylated non-CpG-context bases; and
a first filtering unit, configured to filter out reads in the Bam file whose C-T conversion rate is lower than the preset C-T conversion rate, to obtain the filtered Bam file.
4. The ctDNA methylation level prediction device according to claim 1 or 2, wherein the methylation level prediction module comprises:
a second filtering unit, configured to filter out, according to the target region Bed file, known SNP sites in the dbSNP database and SNP sites arising from specific variation causes, to obtain the CpG sites of the ctDNA sample to be tested, and further configured to filter the Bam file output by the reads level filtering module according to the CpG sites so obtained and the preset number of CpG sites covered by each read, the specific variation causes comprising structural variation or chromosomal copy number variation; and
a methylation level calculation unit, configured to calculate the methylation level of the CpG sites from the reads remaining in the Bam file after filtering by the second filtering unit.
5. A ctDNA methylation level prediction method based on target region capture sequencing, comprising the following steps:
acquiring a FASTQ file from capture sequencing of a ctDNA sample to be tested, and preprocessing the FASTQ file to obtain a filtered FASTQ file;
aligning the gene sequences in the obtained FASTQ file to a reference genome and removing duplicates to obtain a corresponding Bam file;
filtering the reads in the generated Bam file one by one according to a preset C-T conversion rate to obtain a filtered Bam file; and
further filtering the filtered Bam file according to a target region Bed file and a preset number of CpG sites covered by each read, and predicting the methylation level of CpG sites from the remaining reads.
6. The ctDNA methylation level prediction method of claim 5, wherein
acquiring the FASTQ file from capture sequencing of the ctDNA sample to be tested and preprocessing the FASTQ file to obtain the filtered FASTQ file comprises: removing adapters and low-quality reads from the acquired FASTQ file; and/or
aligning the gene sequences in the obtained FASTQ file to the reference genome and removing duplicates to obtain the corresponding Bam file comprises: aligning the gene sequences in the obtained FASTQ file to a human reference genome and to an internal reference lambda DNA reference genome, respectively, and removing duplicates, to generate a Bam file, a pre-deduplication alignment report, and a post-deduplication alignment report for the human reference genome, and a Bam file, a pre-deduplication alignment report, and a post-deduplication alignment report for the internal reference lambda DNA reference genome.
7. The ctDNA methylation level prediction method of claim 5 or 6, wherein filtering the reads in the generated Bam file one by one according to the preset C-T conversion rate to obtain the filtered Bam file comprises:
reading the reads in the Bam file line by line, and counting the numbers of methylated and unmethylated bases in non-CpG contexts;
calculating the C-T conversion rate of each read according to the sum of the number of methylated non-CpG-context bases and the number of unmethylated non-CpG-context bases; and
filtering out reads in the Bam file whose C-T conversion rate is lower than the preset C-T conversion rate, to obtain the filtered Bam file.
8. The ctDNA methylation level prediction method of claim 5 or 6, wherein
further filtering the filtered Bam file according to the target region Bed file and the preset number of CpG sites covered by each read, and predicting the methylation level of CpG sites from the remaining reads, comprises:
filtering out, according to the target region Bed file, known SNP sites in the dbSNP database and SNP sites arising from specific variation causes, to obtain the CpG sites of the ctDNA sample to be tested, the specific variation causes comprising structural variation or chromosomal copy number variation;
further filtering the Bam file according to the CpG sites so obtained and the preset number of CpG sites covered by each read; and
calculating the methylation level of the CpG sites from the remaining reads of the filtered Bam file.
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the ctDNA methylation level prediction method based on target region capture sequencing of any one of claims 5-8.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the ctDNA methylation level prediction method based on target region capture sequencing of any one of claims 5-8.
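
Illustrative note (not part of the claims): the read-level filtering and site-level calculation recited in claims 3, 4, 7 and 8 can be sketched roughly as below. This is a minimal sketch under several assumptions that the claims do not state: each read is represented by a start coordinate and a Bismark-style methylation call string (Z/z for methylated/unmethylated CpG, X/x and H/h for the non-CpG contexts); the C-T conversion rate is taken as the unmethylated fraction of non-CpG cytosines; reads with no non-CpG cytosines are kept; alignments are ungapped; and the thresholds 0.95 and 3 are hypothetical placeholders for the "preset" values. Filtering against the target region Bed file and SNP sites is omitted.

```python
from collections import defaultdict

# Assumed input: Bismark-style call strings, where
# Z/z = methylated/unmethylated CpG, X/x = CHG, H/h = CHH, '.' = non-cytosine.

def conversion_rate(xm):
    """Assumed definition: unmethylated non-CpG calls / all non-CpG calls."""
    methylated = sum(ch in "XH" for ch in xm)
    unmethylated = sum(ch in "xh" for ch in xm)
    total = methylated + unmethylated
    if total == 0:
        return None  # no non-CpG cytosine in this read, rate undefined
    return unmethylated / total

def keep_read(xm, min_conversion=0.95, min_cpgs=3):
    """Apply both read-level filters; thresholds are hypothetical."""
    rate = conversion_rate(xm)
    if rate is not None and rate < min_conversion:
        return False  # read looks incompletely converted
    # Require a minimum number of CpG sites covered by the read.
    return sum(ch in "zZ" for ch in xm) >= min_cpgs

def site_methylation(reads, min_conversion=0.95, min_cpgs=3):
    """Per-CpG methylation level = methylated calls / total calls,
    computed only from reads that pass keep_read()."""
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated, total]
    for start, xm in reads:
        if not keep_read(xm, min_conversion, min_cpgs):
            continue
        for offset, ch in enumerate(xm):
            if ch in "zZ":
                counts[start + offset][1] += 1
                if ch == "Z":
                    counts[start + offset][0] += 1
    return {pos: m / t for pos, (m, t) in sorted(counts.items())}

# Toy reads as (start position, call string); values are made up.
reads = [
    (100, "Z..h.Z..h.z.h"),  # fully converted, 3 CpGs: kept
    (100, "Z..H.Z..H.Z.h"),  # conversion rate 1/3: dropped
    (100, "z..h.Z..h.z.h"),  # fully converted, 3 CpGs: kept
    (105, "Z..h"),           # only 1 CpG: dropped
]
levels = site_methylation(reads)
print(levels)
```

In a real pipeline the call strings would come from the aligner's output rather than from literals, and the position bookkeeping would have to account for insertions, deletions, and strand.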
CN202110072090.5A 2021-01-20 2021-01-20 ctDNA methylation level prediction device and method based on target region capture sequencing Active CN112397150B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202110072090.5A CN112397150B (en) 2021-01-20 2021-01-20 ctDNA methylation level prediction device and method based on target region capture sequencing
EP21920475.7A EP4268231A1 (en) 2021-01-20 2021-04-30 Dna methylation sequencing analysis methods
PCT/CN2021/091761 WO2022156089A1 (en) 2021-01-20 2021-04-30 Dna methylation sequencing analysis methods
US17/490,549 US20220228209A1 (en) 2021-01-20 2021-09-30 Dna methylation sequencing analysis methods

Publications (2)

Publication Number Publication Date
CN112397150A (en) 2021-02-23
CN112397150B (en) 2021-04-20

Family

ID=74625183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110072090.5A Active CN112397150B (en) 2021-01-20 2021-01-20 ctDNA methylation level prediction device and method based on target region capture sequencing

Country Status (1)

Country Link
CN (1) CN112397150B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022156089A1 (en) * 2021-01-20 2022-07-28 Genecast Biotechnology Co., Ltd Dna methylation sequencing analysis methods
CN115910197B (en) * 2021-12-29 2024-03-22 上海智峪生物科技有限公司 Gene sequence processing method, device, storage medium and electronic equipment
CN117157714A (en) * 2022-03-31 2023-12-01 京东方科技集团股份有限公司 Method, device, equipment and medium for processing genome methylation sequencing data
CN115064211B (en) * 2022-08-15 2023-01-24 臻和(北京)生物科技有限公司 ctDNA prediction method and device based on whole genome methylation sequencing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319817A (en) * 2018-01-15 2018-07-24 臻和(北京)科技有限公司 The processing method and processing device of Circulating tumor DNA repetitive sequence
CN108319813A (en) * 2017-11-30 2018-07-24 臻和(北京)科技有限公司 Circulating tumor DNA copies the detection method and device of number variation
CA3076894A1 (en) * 2017-09-25 2019-03-28 Memorial Sloan Kettering Cancer Center Tumor mutational load and checkpoint immunotherapy
WO2020165361A1 (en) * 2019-02-14 2020-08-20 Vib Vzw Retrotransposon biomarkers

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11021703B2 (en) * 2012-02-16 2021-06-01 Cornell University Methods and kit for characterizing the modified base status of a transcriptome
US10513739B2 (en) * 2017-03-02 2019-12-24 Youhealth Oncotech, Limited Methylation markers for diagnosing hepatocellular carcinoma and lung cancer
US20200402613A1 (en) * 2018-03-06 2020-12-24 Cancer Research Technology Limited Improvements in variant detection
WO2019178563A1 (en) * 2018-03-15 2019-09-19 The Board Of Trustees Of Leland Stanford Junior University Methods using nucleic acid signals for revealing biological attributes
CN109887548B (en) * 2019-01-18 2022-11-08 臻悦生物科技江苏有限公司 ctDNA ratio detection method and detection device based on capture sequencing
CN112176419B (en) * 2019-10-16 2022-03-22 中国医学科学院肿瘤医院 Method for detecting variation and methylation of tumor specific genes in ctDNA

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
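Urine as an Alternative to Blood for Cancer Liquid Biopsy and Precision Medicine; Adam Zhang et al.; 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2019-01-24; full text *
Dissecting the genetic regulation of complex plant traits based on chromosome 3D structure and association analysis; Pei Zhihua; China Master's Theses Full-text Database, Basic Sciences; 2016-02-15 (No. 2); full text *
Detection of circulating tumor DNA: from digital to sequencing; Fan Zhaoxuan et al.; Progress in Chemistry; 2019-10-24; full text *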

Also Published As

Publication number Publication date
CN112397150A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112397150B (en) ctDNA methylation level prediction device and method based on target region capture sequencing
CN112029861B (en) Tumor mutation load detection device and method based on capture sequencing technology
CN108753967B (en) Gene set for liver cancer detection and panel detection design method thereof
CN112397151B (en) Methylation marker screening and evaluating method and device based on target capture sequencing
CN112735531B (en) Methylation analysis method and device of circulating cell-free nucleosome active region, terminal equipment and storage medium
CN110211633B (en) Detection method for MGMT gene promoter methylation, processing method for sequencing data and processing device
CN108229103B (en) Method and device for processing circulating tumor DNA repetitive sequence
Johnson et al. Single nucleotide analysis of cytosine methylation by whole‐genome shotgun bisulfite sequencing
CN115064211B (en) ctDNA prediction method and device based on whole genome methylation sequencing
CN111647648A (en) Gene panel for detecting breast cancer gene mutation and detection method and application thereof
CN102061337B (en) Method and system for detecting tissue-specific differentially methylated region (tDMR)
WO2020224159A1 (en) Next generation sequencing-based panel for detecting glioma, detection kit, detection method, and application thereof
CN111755072B (en) Method and device for simultaneously detecting methylation level, genome variation and insertion fragment
CN108595918B (en) Method and device for processing circulating tumor DNA repetitive sequence
CN107893116A (en) For detecting primer pair combination, kit and the method for building library of gene mutation
CN112941180A (en) Group of lung cancer DNA methylation molecular markers and application thereof in preparation of lung cancer early diagnosis kit
CN105132407A (en) Method for low-frequency mutant-enriched sequencing of DNA of exfoliative cells
CN112029842A (en) Kit and method for ABO blood type genotyping based on high-throughput sequencing
CN110106063B (en) System for detecting 1p/19q combined deletion of glioma based on second-generation sequencing
CN108319817B (en) Method and device for processing circulating tumor DNA repetitive sequence
CN111850116A (en) Gene mutation site group of NK/T cell lymphoma, targeted sequencing kit and application
CN112259165B (en) Method and system for detecting microsatellite instability state
CN110993025B (en) Method and device for quantifying fetal concentration and method and device for genotyping fetus
CN108570496A (en) A kind of molecular diagnosis method and kit of constitutional bone disease
CN109439741B (en) Gene probe composition for detecting idiopathic epilepsy, kit and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100191 903, 9 / F, healthsmart Valley Building, 35 Huayuan North Road, Haidian District, Beijing

Patentee after: Zhenhe (Beijing) Biotechnology Co.,Ltd.

Patentee after: Wuxi Zhenhe Biotechnology Co.,Ltd.

Address before: 100191 903, 9 / F, healthsmart Valley Building, 35 Huayuan North Road, Haidian District, Beijing

Patentee before: Zhenhe (Beijing) Biotechnology Co.,Ltd.

Patentee before: Wuxi Zhenhe Biotechnology Co.,Ltd.