WO2023142625A1 - Method for filtering methylation sequencing data and application - Google Patents
- Publication number
- WO2023142625A1 (application PCT/CN2022/132767, CN2022132767W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- methylation
- cancer
- haplotypes
- samples
- haplotype
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Definitions
- the invention belongs to the field of biotechnology, and in particular relates to a methylation sequencing data filtering method and application.
- DNA methylation is an important epigenetic modification, and it is related to many biological regulation pathways.
- the present invention uses bisulfite sequencing (BS-seq, methseq).
- unmethylated C bases will be converted into U bases, and the conversion efficiency can reach about 99%.
- methylation modification can be accurately measured at the single-base level by high-throughput sequencing methods.
- the U base is converted into T, while the methylated C remains unchanged, so the methylation modification at specific base positions can be obtained from the base information at each position of the sequence.
- the computer software determines the bases and sequences measured by the sequencer, obtains sequence fragments (read/fragment), and outputs FASTQ format data to record the read sequence and sequencing quality score.
- errors that may occur in bisulfite sequencing data include two situations: (1) after the unmethylated site is processed, the C base is not successfully changed into a T base; (2) the C base of a methylated site, which should remain unchanged, has changed.
- the invention provides a method for filtering methylation sequencing data, which is used to improve the conversion efficiency of converting unmethylated C bases into U bases when DNA is treated with bisulfite.
- the invention provides the application of the methylation sequencing data filtering method in cancer detection, to create a cancer prediction model with high sensitivity and good specificity.
- a device for filtering methylation sequencing data comprising:
- a device for selecting a DNA methylation haplotype capable of judging whether a subject has cancer or of typing a cancer patient, which includes:
- a method of selecting a DNA methylation haplotype capable of judging whether the subject has cancer or of typing a cancer patient includes the following steps:
- the Fisher test is performed on the methylation haplotypes that have completed the methylation sequencing data filtering step, and cancer-cell-specific differentially methylated haplotypes of the different cancer types are selected that meet the following conditions: the pairwise differences between cancer cells of different types and normal white blood cells are significant (p less than 0.05); and, comparing methylation haplotypes in samples with a cancer cell DNA concentration of 1, the ratio of methylation levels between cancer samples of different types is greater than 2 or less than 1/2, and the ratio of methylation levels between cancer samples of each type and normal white blood cell samples is also greater than 2 or less than 1/2;
- white blood cell-specific differentially methylated haplotypes are selected that meet the following conditions: the fold difference between white blood cells and cancer samples of each type is greater than 3, and the haplotype frequency in cancer samples is not less than 0.0001;
- the differences in methylation level between cancer tissue samples of different types and normal urine samples, and between cancer samples of different types, are tested; haplotypes with a significant difference (p less than 0.05) in all three t-tests are selected; the haplotype category is assigned according to whether the logarithm of the methylation level difference, calculated from the average methylation levels, is positive or negative; on this basis, judgment is made separately in tissue samples and urine samples, and the sensitivity and specificity of the selected differentially methylated haplotypes in distinguishing cancer samples from normal human samples, and samples of different cancer types from one another, are calculated.
- a method for filtering methylation sequencing data, which includes the following steps:
- a method for selecting a DNA methylation haplotype capable of judging whether a subject has cancer or of typing a cancer patient, comprising the following steps:
- the Fisher test is performed on the methylation haplotypes that have completed the methylation sequencing data filtering step, and cancer-cell-specific differentially methylated haplotypes of the different cancer types are selected that meet the following conditions: the pairwise differences between cancer cells of different types and normal white blood cells are significant (p less than 0.05); and, comparing methylation haplotypes in samples with a cancer cell DNA concentration of 1, the ratio of methylation levels between cancer samples of different types is greater than 2 or less than 1/2, and the ratio of methylation levels between cancer samples of each type and normal white blood cell samples is also greater than 2 or less than 1/2;
- white blood cell-specific differentially methylated haplotypes are selected that meet the following conditions: the fold difference between white blood cells and cancer samples of each type is greater than 3, and the haplotype frequency in cancer samples is not less than 0.0001;
- the differences in methylation level between cancer tissue samples of different types and normal urine samples, and between cancer samples of different types, are tested; haplotypes with a significant difference (p less than 0.05) in all three t-tests are selected; the haplotype category is assigned according to whether the logarithm of the methylation level difference, calculated from the average methylation levels, is positive or negative; on this basis, judgment is made separately in tissue samples and urine samples, and the sensitivity and specificity of the selected differentially methylated haplotypes in distinguishing cancer samples from normal human samples, and samples of different cancer types from one another, are calculated.
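The white blood cell-specific condition, the sign-of-logarithm category and the sensitivity/specificity calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the fold difference is taken as the white blood cell level divided by the cancer level, and the logarithm is taken as log2 of the ratio of mean methylation levels. The text does not fix any of these details.

```python
import math
from typing import Sequence, Tuple


def is_wbc_specific(level_wbc: float, cancer_levels: Sequence[float],
                    cancer_freqs: Sequence[float],
                    fold: float = 3.0, min_freq: float = 1e-4) -> bool:
    """White-blood-cell-specific check (assumed direction: WBC / cancer > fold)."""
    fold_ok = all(c > 0 and level_wbc / c > fold for c in cancer_levels)
    freq_ok = all(f >= min_freq for f in cancer_freqs)
    return fold_ok and freq_ok


def haplotype_category(mean_case: float, mean_control: float) -> int:
    """Return +1, -1 or 0 from the sign of log2(mean_case / mean_control)."""
    if mean_case <= 0 or mean_control <= 0:
        raise ValueError("mean methylation levels must be positive")
    log_ratio = math.log2(mean_case / mean_control)
    return (log_ratio > 0) - (log_ratio < 0)


def sensitivity_specificity(predicted: Sequence[bool],
                            actual: Sequence[bool]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)
```

The thresholds 3 and 0.0001 are the values stated in the text; everything else is a placeholder that would need to be matched to the actual data layout.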
- the cancer types are muscle-invasive bladder urothelial carcinoma and non-muscle-invasive bladder urothelial carcinoma.
- with the methylation sequencing data filtering method provided by the present invention, the measured efficiency of converting unmethylated C bases into U bases in bisulfite-treated DNA increases from 98.81% to 99.44%.
- the application of the data filtering method of the present application can accurately evaluate the methylation level of the sequence and obtain methylation haplotypes with high quality, accuracy and reliability.
- Figure 1 is a schematic diagram of the methylation sequencing and analysis process.
- Figure 2 is a comparison chart of the improvement effect of the methylation conversion rate.
- Figure 3 is a schematic flow chart of extracting methylation haplotypes.
- Figure 4 is a schematic flow chart of linear model prediction on gradient biological standards.
- Figure 5 shows the linear model predictions used to determine the minimum detection limit of the MIBC model, the NMIBC model and the white blood cell-specific model.
- Figure 6 is a schematic diagram of the real sample model prediction process.
- Figure 7 shows the prediction results of the real sample model.
- Embodiments of the present invention will be described below, but the present invention is not limited thereto.
- the present invention is not limited to the configurations described below, and various changes can be made within the scope of the claims for protection of the invention.
- Embodiments and examples obtained by appropriately combining the technical means disclosed in different embodiments and examples are also included within the technical scope of the invention.
- the terms "a", "an" and "the" may refer to "one", and may also refer to "one or more", "at least one" and "one or more than one".
- the term "about" means that a value includes the standard deviation of error of the apparatus or method used to determine the value.
- cfDNA refers to cell-free DNA (cell-free circulating DNA), which is partly degraded endogenous DNA that has been released from cells and is present outside cells in human circulating blood, urine and other body fluids.
- the length of cfDNA is mostly below 200bp.
- gDNA refers to "genomic DNA", which is the total DNA content of an organism in a haploid state.
- bisulfite sequencing is a technique for detecting methylation across the chromosomal DNA of a cell or tissue. It is based on the fact that bisulfite deaminates unmethylated cytosine (C) in DNA and converts it into uracil (U), while methylated cytosine remains unchanged. When the desired fragment is amplified by PCR, all uracil (U) is converted into thymine (T), and the PCR product is then subjected to high-throughput sequencing. By analyzing the methylation differences between samples, the effect of DNA methylation level on the regulation of gene expression can be studied.
- C: unmethylated cytosine
- U: uracil
- T: thymine
- the present invention uses the EZ DNA Methylation-Gold bisulfite conversion kit from Zymo Research Company as a reagent for detecting DNA methylation status, and performs bisulfite conversion on sample DNA.
- it can be implemented on a high-throughput sequencing platform, for example the Miseq, Hiseq, Nextseq and Novaseq platforms of Illumina, the Ion Torrent or Ion Proton platforms of Life Technologies, the PacBio RS II system of Pacific Biosciences, or the GridION X5 platform of Oxford Nanopore Technologies.
- the present invention performs high-throughput sequencing based on the Novaseq platform of Illumina Company.
- DNA methylation refers to the process of transferring a methyl group to a specific base. In mammals, methylation occurs predominantly at the 5-carbon position of cytosine residues. In the human genome, a large amount of DNA methylation occurs on cytosine in CpG dinucleotides, where C is cytosine, G is guanine, and p is a phosphate group. DNA methylation can also occur on cytosine in nucleotide sequences such as CHG and CHH, where H is adenine, cytosine or thymine. DNA methylation can also occur on bases other than cytosine, for example as N6-methyladenine. In addition, DNA methylation can also take the form of 5-hydroxymethylcytosine.
- DNA methylation modification can be expressed as the methylation modification of a single base, as the average value of the methylation modifications of multiple bases in a continuous region, or as a methylation haplotype.
- the degree of DNA methylation modification can be described by methylation haplotype abundance.
- a methylation haplotype refers to a combination of methylation modifications at two or more consecutive C base sites on a single DNA fragment. Since DNA methylation modification is correlated along the same chromosome (DNA strand), the methylation states of multiple adjacent C base sites may be related to each other and form a certain fixed pattern. A DNA methylation haplotype is therefore a linkage pattern of DNA methylation modifications on the same chromosome. Compared with the statistical information of a single methylation site, the information provided by the methylation haplotype is more accurate and stable, and it is a more effective indicator for early detection of cancer and continuous monitoring of progression.
- a differentially methylated haplotype refers to a methylation haplotype whose abundance differs significantly between samples.
- "conversion rate" and "conversion frequency" are used interchangeably and refer to the proportion of C bases at non-CpG sites in the positive strand of DNA that are converted to T bases during bisulfite sequencing, and the proportion of G bases at non-CpG sites in the negative strand of DNA that are converted to A bases.
- sequence inserted during primer synthesis refers to the target fragment sequence amplified by PCR primers.
- sequencing depth refers to the number of DNA fragments that completely cover a certain methylated haplotype genomic region obtained during high-throughput sequencing.
- the methylation frequency of the C base refers to the number of DNA molecules modified by methylation at the C base site divided by the total number of DNA molecules containing the C base site.
- "the methylation frequency of base C is empty" means that an error occurred in the process of calculating or aligning sequences, so that the methylation frequency of base C cannot be obtained.
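As a small illustration of these two definitions, the sketch below computes the methylation frequency of a C site from molecule counts. The function name is hypothetical, and representing an "empty" frequency as `None` when no molecule covers the site is an assumption made for the example, not something the text specifies.

```python
from typing import Optional


def methylation_frequency(methylated: int, total: int) -> Optional[float]:
    """Methylated molecules divided by all molecules covering the C site.

    Returns None (an "empty" frequency) when no molecule covers the site,
    which is an assumed stand-in for a calculation or alignment failure.
    """
    if total <= 0:
        return None
    if not 0 <= methylated <= total:
        raise ValueError("methylated count must lie between 0 and total")
    return methylated / total
```

For example, 3 methylated molecules out of 12 covering molecules give a frequency of 0.25.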
- sequences regarded as errors include, but are not limited to, sequences with amplification errors, sequencing errors, alignment errors, etc.
- "cancer cell DNA concentration" and "tumor DNA content" are used interchangeably and refer to the ratio of cancer cell DNA to the sum of cancer cell DNA and normal human white blood cell DNA.
- concentration of cancer cell DNA of the present invention can be 100%, 20%, 10%, 5%, 2%, 1%, 0.5% and/or 0%.
- the filtering of the methylation sequencing data of the present invention includes the following steps: i) intercepting the sequenced sequence within the insert region defined during primer synthesis; ii) filtering the sequencing fragments according to the following conditions: a. removing fragments whose C and G base conversion frequencies are both less than 0.9, or less than 0.8, or less than 0.7, or less than 0.6, or less than 0.5; b. removing fragments whose C and G base conversion frequencies are both greater than 0.5, or greater than 0.6, or greater than 0.7, or greater than 0.8, or greater than 0.9; c. removing sequences that are regarded as errors; d.
- the method for filtering methylation sequencing data of the present invention includes the following steps: i) intercepting the sequenced sequence within the insert region defined during primer synthesis; ii) filtering the sequencing fragments according to the following conditions: a. removing fragments whose C and G base conversion frequencies are both less than 0.9; b. removing fragments whose C and G base conversion frequencies are both greater than 0.5; c. removing sequences that are regarded as errors; d. retaining sequencing fragments with a C base conversion rate less than 0.2 and a G base conversion rate greater than 0.8; e. retaining sequencing fragments with a G base conversion rate less than 0.2 and a C base conversion rate greater than 0.8; f. obtaining methylation haplotypes; iii) extracting haplotypes with a sequencing depth greater than 10; iv) screening out haplotypes whose C base methylation frequency is empty.
- the filtering method of the methylation sequencing data of the present invention includes the following steps: i) intercepting the sequenced sequence within the insert region defined during primer synthesis; ii) filtering the sequencing fragments according to the following conditions: a. removing fragments whose C and G base conversion frequencies are both less than 0.5; b. removing fragments whose C and G base conversion frequencies are both greater than 0.9; c. removing sequences that are regarded as errors; d. retaining sequencing fragments with a C base conversion rate less than 0.01 and a G base conversion rate greater than 0.99; e. retaining sequencing fragments with a G base conversion rate less than 0.01 and a C base conversion rate greater than 0.99; f. obtaining methylation haplotypes; iii) extracting haplotypes with a sequencing depth greater than 10; iv) screening out haplotypes whose C base methylation frequency is empty.
- in the method for filtering methylation sequencing data of the present invention, the methylation haplotype can be obtained by reading and analyzing the methylation dependence between adjacent CpGs with the biscuit sliding window method.
- the Fisher test is performed on the methylation haplotypes that have completed the filtering of the methylation sequencing data, and the present invention selects cancer-cell-specific differentially methylated haplotypes of the different cancer types that meet the following conditions: the pairwise differences between cancer cells of different types and normal leukocytes are significant (p less than 0.05); and, comparing methylation haplotypes in samples with a cancer cell DNA concentration of 1, the ratio of methylation levels between cancer samples of different types is greater than 2, or greater than 5, or greater than 10, or less than 1/2, or less than 1/5, or less than 1/10, and the ratios of methylation levels between cancer samples of each type and normal white blood cell samples are likewise all greater than 2, or greater than 5, or greater than 10, or less than 1/2, or less than 1/5, or less than 1/10.
- the Fisher test is performed on the methylation haplotypes that have completed the filtering of the methylation sequencing data, and the present invention selects cancer-cell-specific differentially methylated haplotypes of the different cancer types that meet the following conditions: the pairwise differences between cancer cells of different types and normal leukocytes are significant (p less than 0.05); and, comparing methylation haplotypes in samples with a cancer cell DNA concentration of 1, the ratio of methylation levels between cancer samples of different types is greater than 2 or less than 1/2.
- the ratio of methylation levels between cancer samples of each type and normal white blood cell samples is required to be greater than 2 or less than 1/2.
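The selection conditions above can be illustrated with a short sketch. It assumes that the three pairwise p-values (cancer type A vs. white blood cells, cancer type B vs. white blood cells, and A vs. B) have already been obtained from the Fisher test, and that methylation levels are given as non-negative numbers. The function names and the handling of a zero denominator are hypothetical choices made for the example.

```python
def fold_differs(x: float, y: float, fold: float = 2.0) -> bool:
    """True if x / y is greater than fold or less than 1 / fold."""
    if y == 0:
        # Assumed convention: any positive level against zero counts as a
        # large fold change; zero against zero does not.
        return x > 0
    ratio = x / y
    return ratio > fold or ratio < 1.0 / fold


def is_cancer_specific(p_a_vs_wbc: float, p_b_vs_wbc: float, p_a_vs_b: float,
                       level_a: float, level_b: float, level_wbc: float,
                       alpha: float = 0.05, fold: float = 2.0) -> bool:
    """Apply the significance and fold-change conditions to one haplotype."""
    significant = all(p < alpha for p in (p_a_vs_wbc, p_b_vs_wbc, p_a_vs_b))
    return (significant
            and fold_differs(level_a, level_b, fold)
            and fold_differs(level_a, level_wbc, fold)
            and fold_differs(level_b, level_wbc, fold))
```

Passing `fold=5.0` or `fold=10.0` corresponds to the stricter alternatives listed in the text.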
- the gDNA extraction kit was purchased from Meiji Company (D6315-03) or QIAGEN Company (69506), and was performed according to the kit instructions.
- the cfDNA extraction kit was purchased from Zymo (D3061) or Thermo Fisher (A29319), and was performed according to the kit instructions.
- the DNA bisulfite conversion kit was purchased from Zymo, and the conversion product was eluted with NF-H2O according to the instructions of the kit.
- PCR reaction program: 95°C for 10 min; 25 × (95°C for 30 s, 65°C for 30 s, 54°C for 2 min, 65°C for 30 s, 72°C for 30 s); 72°C for 10 min; 16°C for storage.
- the amplified product was enriched and recovered according to the instructions, and dissolved in 23 μL of sterilized water.
- Reagents for index PCR were purchased from KAPA Biosystems (KK2602).
- Index PCR reaction system, reagent and volume (μL): recovered product from the previous step, 23; 2× PCR mix, 25; i5 index, 1; i7 index, 1; total, 50.
- PCR reaction program: 98°C for 45 s; 13 × (98°C for 15 s, 60°C for 30 s, 72°C for 30 s); 72°C for 5 min; 4°C for storage.
- the amplified product was enriched and recovered according to the instructions, and dissolved in 35 μL of sterilized water. Double-stranded DNA was quantified using dsDNA Qubit.
- after the library concentration was determined by fluorescent quantitative PCR and found to be qualified, the Illumina Novaseq platform was used for PE150 sequencing, and the data volume of each library was 5-8M reads.
- the DNA fragments are first treated with bisulfite, after which the unmethylated C bases are converted into T bases, while the methylated C bases remain unchanged.
- the errors that may occur here include two situations: (1) after the unmethylated site is processed, the C base is not successfully changed into a T base; (2) the C base of a methylated site, which should have remained unchanged, has changed. According to conventional biological knowledge, the vast majority of C base methylation occurs at CpG sites, and the probability of methylation at non-CpG sites is extremely low (less than 0.5%).
- the conversion efficiency of methylation can be obtained by counting the conversion ratio of C bases at non-CpG sites to T bases.
- the conversion efficiency of the bisulfite treatment method is very high and can reach levels above 99%.
- the inventors unexpectedly found that the conversion rates of non-CpG sites on the same sequenced fragment were always correlated with each other. Since bisulfite conversion requires the double strands to be unwound into a single-stranded structure, this phenomenon is not difficult to understand. It can be considered that the higher the C base conversion rate at the non-CpG sites of a sequencing fragment, the higher the accuracy of the methylation treatment. If any C base at a non-CpG site has not been converted, there is a high probability that other C bases have not been converted either, and the reliability of this type of sequence is low. Filtering out this type of sequence can therefore further improve the accuracy and confidence of methylation analysis. First, the CpG sites are removed from the corresponding reference sequence so that the remaining sequence contains only non-CpG base sites; then the conversion rate of these sites is calculated; finally, the data are filtered according to the quality control conditions to obtain the required data.
- the present invention calculates the conversion rate as follows: the number of unconverted bases on the sequencing fragments (reads) is counted against the reference sequence. For positive-strand fragments, the C bases are counted; by the principle of complementary pairing of the DNA double strand, for negative-strand fragments the G bases on the reads are counted (see Figure 1 for the principle). The counted C and G bases are each divided by the total number of bases at their respective positions to obtain the unconverted rate of each of the two base types, and the corresponding conversion rate is equal to 1 - unconverted rate.
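The calculation described above can be sketched as follows. This is a minimal illustration, assuming an ungapped read already aligned base-for-base to its reference segment. The function name is hypothetical, and the handling of a C at the last position or a G at the first position (skipped because the CpG context is unknown) is a choice made for the example.

```python
from typing import Optional, Tuple


def conversion_rates(ref: str, read: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (C conversion rate, G conversion rate) at non-CpG sites.

    A reference C followed by G, or a reference G preceded by C, belongs to a
    CpG site and is ignored. A rate is None when no usable site exists.
    """
    ref, read = ref.upper(), read.upper()
    if len(ref) != len(read):
        raise ValueError("read must be aligned base-for-base to the reference")
    n = len(ref)
    c_total = c_unconverted = g_total = g_unconverted = 0
    for i, base in enumerate(ref):
        if base == "C" and i + 1 < n and ref[i + 1] != "G":
            c_total += 1
            c_unconverted += int(read[i] == "C")  # C still read as C
        elif base == "G" and i > 0 and ref[i - 1] != "C":
            g_total += 1
            g_unconverted += int(read[i] == "G")  # G still read as G
    c_rate = None if c_total == 0 else 1 - c_unconverted / c_total
    g_rate = None if g_total == 0 else 1 - g_unconverted / g_total
    return c_rate, g_rate
```

With the reference `ACCTGACGTCA` and the fully converted positive-strand read `ATTTGACGTTA`, all three non-CpG C sites read as T and the single non-CpG G site still reads as G, so the function returns a C conversion rate of 1.0 and a G conversion rate of 0.0.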
- the present invention filters the sequenced fragments according to the following quality control conditions: (1) remove the fragments whose C and G base conversion frequencies are both <0.5, optionally <0.6, <0.7, <0.8 or <0.9; (2) remove the fragments whose C and G base conversion frequencies are both >0.9, optionally >0.8, >0.7, >0.6 or >0.5; the sequencing fragments removed under (1) and (2) are regarded as wrong sequences, including but not limited to amplification errors, sequencing errors, alignment errors, etc.; (3) keep the sequencing fragments with a low C base conversion rate and a high G base conversion rate, optionally with a C base conversion rate <0.2, <0.1 or <0.01 and a G base conversion rate >0.99, >0.9 or >0.8; (4) keep the sequencing fragments with a low G base conversion rate and a high C base conversion rate, optionally with a G base conversion rate <0.2, <0.1 or <0.01 and a C base conversion rate >0.99, >0.9 or >0.8.
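A rough sketch of these quality control conditions is given below. It takes the C and G conversion rates of one fragment as inputs. The function name and parameter names are hypothetical, and the default thresholds are just one of the combinations listed in the text.

```python
from typing import Optional


def keep_fragment(c_rate: Optional[float], g_rate: Optional[float],
                  both_low: float = 0.5, both_high: float = 0.9,
                  low: float = 0.2, high: float = 0.8) -> bool:
    """Decide whether a sequencing fragment passes the conversion-rate filter."""
    if c_rate is None or g_rate is None:
        return False  # assumed: fragments without a usable rate are dropped
    # Conditions (1) and (2): both rates low or both rates high are errors.
    if c_rate < both_low and g_rate < both_low:
        return False
    if c_rate > both_high and g_rate > both_high:
        return False
    # Conditions (3) and (4): exactly one base type is (almost) fully converted.
    c_low_g_high = c_rate < low and g_rate > high
    g_low_c_high = g_rate < low and c_rate > high
    return c_low_g_high or g_low_c_high
```

With these defaults the two removal checks are already implied by the retention rule; they are kept as separate steps only to mirror the order of the conditions in the text.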
- the present invention uses the biscuit sliding-window method to read the filtered data and analyze the methylation dependence between adjacent CpGs, obtains the methylation haplotypes, stores them in epiread format, and then calculates the haplotype methylation levels used for detection.
- the 183 target samples were analyzed; comparing the statistics before and after filtering showed that the proportion of low-quality DNA fragments screened out was 0.8%-3%, with an average of 2.05%.
- the conversion efficiency of non-CpG sites increased from 98.81% to 99.44%, which proved that the screened out sequences carried about 63% unconverted C bases.
- the present invention effectively improves the accuracy rate by filtering out such incompletely converted DNA fragments.
- the sequencing reads covering the insert region defined at primer synthesis are extracted; this part is the true and valid sequence that falls exactly on the designed primer region and has passed the depth limit.
- the haplotypes with a depth of N>10 are extracted, and the haplotypes whose C base frequency (freqC) is empty are screened out, because these are sequences with alignment errors; finally, the haplotypes are distinguished by a naming scheme that combines the primer region with the methylation freqC. These are the high-quality, high-accuracy, high-reliability haplotype results obtained from the final screening. The process is shown in Figure 3.
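The depth and freqC screening can be sketched as below. Several details are assumptions made for illustration and are not stated in the source: each haplotype is represented as a dict with hypothetical keys `region`, `pattern`, and `depth`; the pattern is a string with `C` for a methylated CpG call and `T` for an unmethylated one; freqC is taken to be the fraction of `C` calls; and the name is formed as `region_freqC`.

```python
def freq_c(pattern):
    """Fraction of methylated (C) calls in a haplotype pattern, or None if empty."""
    calls = [base for base in pattern.upper() if base in "CT"]
    if not calls:
        return None          # no usable CpG call: treated as an empty freqC
    return calls.count("C") / len(calls)


def select_haplotypes(records, min_depth=10):
    """Keep haplotypes with depth > min_depth and a non-empty freqC.

    Returns a dict mapping the haplotype name (region + freqC) to its depth.
    """
    selected = {}
    for rec in records:
        if rec["depth"] <= min_depth:
            continue         # depth must be strictly greater than 10
        fc = freq_c(rec["pattern"])
        if fc is None:
            continue         # screened out as a likely alignment error
        name = f"{rec['region']}_{fc:.2f}"
        selected[name] = selected.get(name, 0) + rec["depth"]
    return selected


haplotypes = select_haplotypes([
    {"region": "amp01", "pattern": "CCTC", "depth": 25},
    {"region": "amp01", "pattern": "TTTT", "depth": 8},    # too shallow
    {"region": "amp02", "pattern": "", "depth": 30},       # empty freqC
    {"region": "amp02", "pattern": "CCCC", "depth": 10},   # depth not > 10
])
```

Only the first record survives both checks in this toy input, under the name `amp01_0.75`.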
- bladder urothelial cancer cell line RT4 (non-muscle-invasive, NMIBC type)
- bladder urothelial cancer cell line 5637 (muscle-invasive, MIBC type)
- normal human leukocytes, used for testing with biological standards to determine the lowest detection limit of this application for bladder urothelial cancer and its types.
- Bladder urothelial cancer cell lines and normal human leukocytes were purchased from the Cell Bank of the Chinese Academy of Sciences.
- the extraction kit was purchased from QIAGEN (69506), and extraction was carried out according to the instructions of the kit.
- each group in the present application comprised 8 standard-sample gradients in total (bladder urothelial carcinoma DNA content of 100%, 20%, 10%, 5%, 2%, 1%, 0.5%, and 0%).
- haplotype extraction follows Example 1. Each haplotype was tested statistically using the Fisher test and was required to show a significant difference (P<0.05) between muscle-invasive and non-muscle-invasive bladder urothelial carcinoma and normal white blood cells. The methylation haplotypes were then compared in pure tissue samples (i.e. samples whose tumor DNA content is 1): the ratio of the methylation levels of muscle-invasive and non-muscle-invasive bladder urothelial carcinoma samples is required to be greater than 2 or less than 1/2 (including but not limited to >5, >10, or <1/5, <1/10), and the ratio of the methylation levels of the two types of bladder urothelial carcinoma samples to normal white blood cell samples is likewise required to be greater than 2 or less than 1/2 (including but not limited to >5, >10, or <1/5, <1/10). Based on this standard, a total of 60 cancer-specific haplotypes were selected (32 muscle-invasive and 28 non-muscle-invasive). In addition, the difference between leukocytes and the two types of cancer samples is required to be greater than 3 and the haplotype frequency of cancer samples not less than 0.0001, which yielded 31 leukocyte-specific haplotypes.
- the process is shown in Figure 4.
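The cancer-specific selection rule (Fisher test plus fold-change criteria) can be sketched as follows. This is an illustrative reading, not the applicant's code: each sample is summarized as a hypothetical `(haplotype_reads, total_reads)` pair, the "methylation level" is taken to be the haplotype read fraction, the Fisher test is applied to each cancer type against leukocytes, and a zero level on exactly one side is treated as passing the fold criterion. The two-sided exact test is implemented directly so the example needs only the standard library.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # sum the probabilities of all tables at least as extreme as the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def fold_differs(level_a, level_b, fold=2.0):
    """True if level_a / level_b is > fold or < 1 / fold."""
    if level_a == 0 and level_b == 0:
        return False
    if level_a == 0 or level_b == 0:
        return True          # assumption: an infinite ratio passes
    ratio = level_a / level_b
    return ratio > fold or ratio < 1 / fold


def is_cancer_specific(mibc, nmibc, wbc, alpha=0.05, fold=2.0):
    """Apply the P < alpha and fold-change criteria to one haplotype."""
    def level(sample):
        return sample[0] / sample[1]

    def p_value(x, y):
        return fisher_exact_two_sided(x[0], x[1] - x[0], y[0], y[1] - y[0])

    return (p_value(mibc, wbc) < alpha
            and p_value(nmibc, wbc) < alpha
            and fold_differs(level(mibc), level(nmibc), fold)
            and fold_differs(level(mibc), level(wbc), fold)
            and fold_differs(level(nmibc), level(wbc), fold))


selected = is_cancer_specific(mibc=(30, 100), nmibc=(10, 100), wbc=(1, 100))
```

Raising `fold` to 5 or 10 corresponds to the stricter optional ratios mentioned in the text.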
- the data analysis method of this application is applied to the detection of bladder urothelial carcinoma.
- muscle-invasive and non-muscle-invasive cancers were distinguished in real tissue samples. A t-test model was used to test the differences between each of the two types of cancer tissue samples and normal urine samples, and the difference in methylation level between the two types of cancer; haplotypes with a significant difference (p<0.05) in all three tests were selected, and the haplotypes were classified according to whether the logarithm of the average frequency difference is positive or negative.
- haplotypes were selected, falling into four types: 10 with a lower frequency in muscle-invasive than in non-muscle-invasive samples (m_n_negative), 20 with a higher frequency (m_n_positive), 8 with a higher frequency in cancer samples than in normal urine (u_negative), and 4 with a lower frequency than in normal urine (u_positive). The model was trained on tissue samples and used for prediction in tissue and urine samples respectively. The process is shown in Figure 6.
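The sign-based categorization can be illustrated with a short sketch. It assumes that "the logarithm of the average frequency difference" means the log of the ratio of the two group means, which the source does not spell out; the function name and the pseudocount are hypothetical, and the p-value filtering by t-test is assumed to have been done beforehand.

```python
from math import log2
from statistics import mean


def classify_by_log_ratio(freqs_a, freqs_b, negative_label, positive_label,
                          pseudocount=1e-6):
    """Label a haplotype by the sign of log2(mean(freqs_a) / mean(freqs_b)).

    The pseudocount (an assumption) avoids taking the log of zero.
    Returns None when the two means are identical.
    """
    log_ratio = log2((mean(freqs_a) + pseudocount) / (mean(freqs_b) + pseudocount))
    if log_ratio > 0:
        return positive_label
    if log_ratio < 0:
        return negative_label
    return None


# Haplotype frequencies per sample: muscle-invasive vs non-muscle-invasive
label = classify_by_log_ratio(
    freqs_a=[0.01, 0.02],      # muscle-invasive samples
    freqs_b=[0.05, 0.06],      # non-muscle-invasive samples
    negative_label="m_n_negative",
    positive_label="m_n_positive",
)
```

Here the haplotype is less frequent in the muscle-invasive group, so it falls into the `m_n_negative` category.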
- the experiment included 142 samples in total, and the specific samples were the same as in Example 1.
- the experimental content includes: 1 DNA extraction; 2 DNA bisulfite conversion; 3 multiplex PCR amplification and product recovery; 4 index PCR and product recovery; 5 high-throughput sequencing; and 6 data analysis.
- the method includes the following steps: i) extracting the sequencing reads covering the insert region defined at primer synthesis; ii) filtering the sequencing fragments according to the following conditions to obtain the methylation haplotypes of the eligible sequencing fragments: 1 remove the fragments whose C and G base conversion frequencies are both <0.5, optionally <0.6, <0.7, <0.8, <0.9; 2 remove the fragments whose C and G base conversion frequencies are both >0.9, optionally >0.8, >0.7, >0.6, >0.5; 3 remove sequences that are considered errors, including but not limited to amplification errors, sequencing errors, alignment errors, etc.; 4 retain the sequencing fragments with a low conversion rate of C base and a high conversion rate of G base, optionally a conversion rate of C base <0.2, <0.1, <0.01 and a conversion rate of G base >0.99, >0.9, >0.8; 5 the sequencing fragments with a low conversion rate of G base and high
- the experiment included a total of 142 samples and biological standards, the specific samples were the same as in Example 1, and the biological standards were the same as in Example 2.
- the experimental content includes: 1 DNA extraction; 2 DNA bisulfite conversion; 3 gradient incorporation of biological standards; 4 multiplex PCR amplification and product recovery; 5 index PCR and product recovery; 6 high-throughput sequencing; and 7 data analysis.
- the method generally includes three major steps:
- i) Methylation sequencing data filtering step: a. extract the sequencing reads covering the insert region defined at primer synthesis; b. filter the sequencing fragments according to the following conditions to obtain the methylation haplotypes of the sequencing fragments that meet the conditions: remove fragments whose C and G base conversion frequencies are both <0.5, optionally <0.6, <0.7, <0.8, <0.9; remove fragments whose C and G base conversion frequencies are both >0.9, optionally >0.8, >0.7, >0.6, >0.5; remove sequences that are considered errors, including but not limited to amplification errors, sequencing errors, alignment errors, etc.; retain sequencing fragments with a low conversion rate of C base and a high conversion rate of G base, optionally a conversion rate of C base <0.2, <0.1, <0.01 and a conversion rate of G base >0.99, >0.9, >0.8; retain sequencing fragments with a low conversion rate of G base and a high conversion rate of C base, optionally a conversion rate of G base <0.2, <0.1, <
- ii) Selection step of differential methylation haplotypes: a. perform the Fisher test on the methylation haplotypes obtained from the screening, and select cancer-cell-specific methylation differential haplotypes of the different types that meet the following conditions: the difference between muscle-invasive and non-muscle-invasive bladder urothelial carcinoma cells and normal leukocytes is significant (p less than 0.05); the methylation haplotypes are compared in samples whose cancer cell DNA concentration is 1, and the ratio of the methylation levels of the different types of cancer samples is required to be greater than 2 or less than 1/2 (including but not limited to >5, >10, or <1/5, <1/10); the muscle-invasive and non-muscle-invasive bladder urothelial carcinoma samples and the normal white blood cell samples are also required to have a methylation level ratio greater than 2 or less than 1/2 (including but not limited to >5, >10, or <1/5, <1/10); select leukocyte-specific methylation differential haplotypes that meet the following conditions: the difference
- iii) Methylation differential haplotype training step: a. For the selected methylation differential haplotypes, test the differences in methylation level between muscle-invasive and non-muscle-invasive bladder urothelial cancer tissue samples and normal urine samples, and between the two types of cancer samples; select the haplotypes with a significant difference (p less than 0.05) in all three t-test results, and distinguish haplotype categories by whether the logarithm of the difference in average methylation level is positive or negative. Apply these categories separately to tissue samples and urine samples, and calculate the discrimination, sensitivity, and specificity of the selected methylation differential haplotypes between bladder urothelial carcinoma samples and normal human samples, and between muscle-invasive bladder urothelial carcinoma samples and non-muscle-invasive bladder urothelial carcinoma samples.
- step i) the methylation haplotype obtained through screening
- step ii) the selected methylation differential haplotype
- step ii) the lowest detection limit obtained for each model
- step iii) the methylation difference haplotype and its category after the methylation level test
- step iii) the discrimination (AUC), sensitivity, and specificity with which the tested methylation differential haplotypes classify the samples.
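The discrimination (AUC), sensitivity, and specificity referred to above can be computed as in the following sketch. It is a generic illustration rather than the applicant's evaluation code: the scores are hypothetical per-sample model outputs, the AUC uses the rank-based (Mann-Whitney) definition, and the decision threshold is an arbitrary assumption.

```python
def auc(pos_scores, neg_scores):
    """Probability that a positive sample scores higher than a negative one.

    Ties count as one half (Mann-Whitney formulation of the ROC AUC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    true_pos = sum(1 for s in pos_scores if s >= threshold)
    true_neg = sum(1 for s in neg_scores if s < threshold)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


cancer_scores = [0.9, 0.8, 0.4]     # hypothetical scores for cancer urine samples
normal_scores = [0.5, 0.3, 0.2]     # hypothetical scores for normal urine samples

area = auc(cancer_scores, normal_scores)
sens, spec = sensitivity_specificity(cancer_scores, normal_scores, threshold=0.45)
```

With these toy scores, 8 of the 9 cancer/normal pairs are ranked correctly, and the threshold of 0.45 classifies two of three samples correctly in each group.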
Abstract
The present invention relates to the technical field of biology, and in particular to a methylation sequencing data filtering method and an application thereof. With the methylation sequencing data filtering method provided by the present invention, the conversion efficiency of unmethylated C bases to U bases in bisulfite-treated DNA can be improved from 98.81% to 99.44%. On this basis, by applying the data filtering method provided by the present application, the methylation level of a sequence can be accurately evaluated, and methylation haplotypes of high quality, high accuracy, and high confidence can be obtained. Experimental data show that, using the methylation haplotypes obtained by means of the methylation sequencing data filtering method provided by the present invention, urine samples from patients with urothelial carcinoma can be well distinguished from those of normal individuals (AUC = 0.94), and urine samples from patients with muscle-invasive urothelial carcinoma can be well distinguished from those of patients with non-muscle-invasive urothelial carcinoma (AUC = 0.967).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210101807.9A CN114496096A (zh) | 2022-01-27 | 2022-01-27 | Methylation sequencing data filtering method and application |
CN202210101807.9 | 2022-01-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023142625A1 (fr) | 2023-08-03 |
Family
ID=81477021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/132767 WO2023142625A1 (fr) | Methylation sequencing data filtering method and application | 2022-01-27 | 2022-11-18 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114496096A (fr) |
WO (1) | WO2023142625A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114496096A (zh) * | 2022-01-27 | 2022-05-13 | 安康优乐复生科技有限责任公司 | Methylation sequencing data filtering method and application |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017035821A1 (fr) * | 2015-09-02 | 2017-03-09 | 中国科学院北京基因组研究所 | Method for constructing a bisulfite sequencing library for RNA 5mC and application thereof |
WO2021088653A1 (fr) * | 2019-11-08 | 2021-05-14 | 中国科学院北京基因组研究所(国家生物信息中心) | Method and device for classifying urinary sediment genomic DNA, and use of urinary sediment genomic DNA |
WO2021130356A1 (fr) * | 2019-12-24 | 2021-07-01 | Vib Vzw | Disease detection in liquid biopsies |
CN114496096A (zh) * | 2022-01-27 | 2022-05-13 | 安康优乐复生科技有限责任公司 | Methylation sequencing data filtering method and application |
-
2022
- 2022-01-27 CN CN202210101807.9A patent/CN114496096A/zh active Pending
- 2022-11-18 WO PCT/CN2022/132767 patent/WO2023142625A1/fr unknown
Also Published As
Publication number | Publication date |
---|---|
CN114496096A (zh) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107771221B (zh) | Mutation detection for cancer screening and fetal analysis | |
EP3658684B1 (fr) | Enhancement of cancer screening using cell-free viral nucleic acids | |
CN113539355B (zh) | System for predicting the tissue-specific origin of cfDNA and assessing the probability of related diseases, and application thereof | |
US20210065842A1 (en) | Systems and methods for determining tumor fraction | |
US20200340064A1 (en) | Systems and methods for tumor fraction estimation from small variants | |
EP3973080A1 (fr) | Systems and methods for determining whether a subject has a cancer condition using transfer learning | |
JP2023516633A (ja) | Systems and methods for calling variants using methylation sequencing data | |
WO2023142625A1 (fr) | Methylation sequencing data filtering method and application | |
CN116804218A (zh) | Methylation markers for detecting benign and malignant lung nodules and application thereof | |
CN110373458A (zh) | Kit and analysis system for thalassemia detection | |
CN114703284A (zh) | Method for quantitative detection of blood cell-free DNA methylation and application thereof | |
CN117441027A (zh) | Heatrich-BS: heat enrichment of CpG-rich regions for bisulfite sequencing | |
US12043873B2 (en) | Molecule counting of methylated cell-free DNA for treatment monitoring | |
US20210254141A1 (en) | Method of and apparatus for analyzing tumor subclones | |
CN118749032A (zh) | Molecular analysis using long cell-free DNA molecules for disease classification | |
WO2022152784A1 (fr) | Methods for determining cancer type | |
CN117789976A (zh) | Application of detecting chromosomal segment information in predicting extrahepatic metastasis in patients with primary liver cancer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22923419 Country of ref document: EP Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 2022923419 Country of ref document: EP Effective date: 20240827 |