CN118119718A - Model for predicting pregnancy tumor tissue sources by utilizing plasma free DNA and construction method thereof - Google Patents

Info

Publication number
CN118119718A
Authority
CN
China
Prior art keywords
tumor
genes
sequencing depth
sample
copy number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280070284.4A
Other languages
Chinese (zh)
Inventor
鞠佳
李佳
金鑫
章文蔚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A model for predicting the tissue of origin of pregnancy-associated tumors using plasma cell-free DNA (cfDNA), and a method for constructing it. The construction method comprises the following steps: 1) obtaining the expression levels of a plurality of genes for each of a plurality of tumor tissues; 2) for samples from pregnant women with pregnancy-associated tumors of known type, acquiring cfDNA sequencing data of each sample; 3) for each sample, calculating the sequencing depth of the plurality of genes, where the sequencing depth of a gene is the sequencing depth upstream and downstream of its transcription start site, transcription termination site and/or genomic open region; 4) for each sample, obtaining a correlation coefficient based on the sequencing depth of the plurality of genes and the expression levels of the plurality of genes for the respective tumor type; 5) for each sample, establishing a tumor copy number variation spectrum feature; 6) combining the correlation coefficient and the tumor copy number variation spectrum feature of each sample into a feature set, and training the prediction model with the feature sets of a plurality of samples as training samples to obtain a trained prediction model.

Description

Model for predicting pregnancy tumor tissue sources by utilizing plasma free DNA and construction method thereof
Technical Field
The invention belongs to the technical field of biology, and particularly provides a model for predicting a pregnancy tumor tissue source by utilizing plasma free DNA and a construction method thereof.
Background
Noninvasive prenatal testing (NIPT) is a noninvasive prenatal screening technique that detects abnormal fetal chromosome numbers by high-throughput sequencing of cell-free DNA (cfDNA) fragments in maternal peripheral blood. Compared with traditional screening methods, NIPT is safe and convenient, and it detects Down syndrome (T21), Edwards syndrome (T18) and Patau syndrome (T13) with high sensitivity and specificity [8]. NIPT entered commercial use in 2011 [9]; to date, more than ten million NIPT tests have been performed worldwide, 70% of them in China [10].
Tumors occurring together with pregnancy are rare, with an incidence of 0.07% to 0.1% [1,2]. The more common tumor types are lymphoma, breast cancer, ovarian cancer, melanoma, leukemia and colorectal cancer [2]. Because tumors are insidious and latent, and because they arise during pregnancy, their common symptoms are masked by the physiological changes of pregnancy. The particularities of pregnancy also make it difficult for the physician to investigate the cause of these symptoms promptly, given the inherent risks that supplementary examinations, such as those involving ionizing radiation, pose to the fetus. In addition, pregnancy interferes with the sensitivity and specificity of diagnostic methods, so that even when symptoms are properly investigated, the rate of misdiagnosis or missed diagnosis remains high [1].
The peripheral blood of tumor patients contains tumor-derived circulating cell-free DNA (ctDNA), which accounts for only a small fraction of total cfDNA. ctDNA carries tumor-related molecular characteristics, can be used for liquid biopsy, and has important clinical value. During pregnancy, most cfDNA in maternal peripheral blood originates from the maternal hematopoietic system, with a small additional amount released into the maternal circulation by apoptotic placental trophoblast cells [3]. If the mother has a tumor, part of the cfDNA is derived from apoptotic tumor cells. Several papers have reported that tumors are one of the causes of NIPT test failure and false positives [4], and have described cases in which tumors were discovered incidentally in NIPT samples showing multiple chromosomal aneuploidies [5,6]. Bianchi et al. published a retrospective analysis in JAMA [5] of samples with abnormal NIPT results (one or more chromosomal aneuploidies), examined the proportion of tumors across the different abnormality types, and concluded that the proportion was highest in samples with two or more chromosomal aneuploidies (18%; 95% CI, 7.5%-33.5%). In 2019, Ji et al. [7] developed an analysis algorithm based on copy number variation (CNV) to detect maternal tumors in samples with abnormal NIPT results (multiple chromosomal aneuploidies), with a sensitivity of 83%, a specificity of 85%, and a positive predictive value (PPV) of 75% if tumor marker information was combined. That method provides a preliminary solution to tumor prediction but cannot identify the specific tumor type.
In view of the above, there is currently no method for effectively predicting the tissue of origin of common pregnancy-associated tumors from low-depth NIPT sequencing data of maternal plasma.
References:
1. Andersson, T.M., et al., Cancer during pregnancy and the postpartum period: A population-based study. Cancer, 2015. 121(12): p. 2072-7.
2. Pavlidis, N.A., Coexistence of pregnancy and malignancy. Oncologist, 2002. 7(4): p. 279-87.
3. Taglauer, E.S., L. Wilkins-Haug, and D.W. Bianchi, Review: cell-free fetal DNA in the maternal circulation as an indication of placental health and disease. Placenta, 2014. 35 Suppl: p. S64-8.
4. Bianchi, D.W. and R.W.K. Chiu, Sequencing of Circulating Cell-free DNA during Pregnancy. N Engl J Med, 2018. 379(5): p. 464-473.
5. Bianchi, D.W., et al., Noninvasive Prenatal Testing and Incidental Detection of Occult Maternal Malignancies. JAMA, 2015. 314(2): p. 162-9.
6. Amant, F., et al., Presymptomatic Identification of Cancers in Pregnant Women During Noninvasive Prenatal Testing. JAMA Oncol, 2015. 1(6): p. 814-9.
7. Ji, X., et al., Identifying occult maternal malignancies from 1.93 million pregnant women undergoing noninvasive prenatal screening tests. Genet Med, 2019.
8. Benn, P., H. Cuckle, and E. Pergament, Non-invasive prenatal testing for aneuploidy: current status and future prospects. Ultrasound Obstet Gynecol, 2013. 42(1): p. 15-33.
9. Agarwal, A., et al., Commercial landscape of noninvasive prenatal testing in the United States. Prenat Diagn, 2013. 33(6): p. 521-31.
10. Liu, S., et al., Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. Cell, 2018. 175(2): p. 347-359 e14.
Disclosure of Invention
The invention aims to predict pregnancy-associated liver cancer, breast cancer and lymphoma by using the nucleosome-related footprint reflected in cfDNA in combination with tumor tissue expression data from the TCGA reference database, thereby tracing the tumor to its tissue of origin.
Accordingly, in a first aspect, the present invention provides a method of constructing a model for predicting the origin of tumor tissue during pregnancy, comprising:
1) Obtaining the expression levels of a plurality of genes for each of a plurality of tumor tissues;
2) For samples from pregnant women with pregnancy-associated tumors of known type, acquiring cfDNA sequencing data of each sample;
3) Calculating, for each sample, the sequencing depth of the plurality of genes, the sequencing depth of the genes being the sequencing depth before and after the transcription start site, transcription stop site and/or genomic open region;
4) Obtaining, for each sample, a correlation coefficient based on the sequencing depth of the plurality of genes and the expression level of the plurality of genes for the respective tumor type;
5) For each sample, establishing a tumor copy number variation spectrum feature;
6) Combining the correlation coefficient and the tumor copy number variation spectrum feature of each sample into a feature set, and training the prediction model with the feature sets of a plurality of samples as training samples to obtain a trained prediction model.
In one embodiment, the tumor tissue is breast cancer tissue, liver cancer tissue, and/or lymphoma.
In one embodiment, in 1), the expression level of the plurality of genes is an average of a plurality of samples for each tumor tissue, e.g., from a database, such as a TCGA database.
In one embodiment, in 3), the sequencing depth is a normalized relative sequencing depth.
In one embodiment, in 3), the sequencing depth of a gene is the sequencing depth within 100 bp, 400 bp, 600 bp, 1 kb or the like upstream and downstream of the transcription start site, transcription termination site and/or genomic open region.
In one embodiment, in 4), the correlation coefficient is a normalized correlation coefficient.
In one embodiment, in 5), the tumor copy number variation spectral signature is expressed as P-MAD:
Wherein $C_i$ is the median of the log2-transformed copy number ratio within the $i$-th CNV segment, $2^{C_i}$ is the copy number of the $i$-th CNV segment, $l_i$ is the length of the $i$-th CNV segment, and $L$ is the whole-genome length.
In one embodiment, in 6), logistic regression or random forest training is performed.
In a second aspect, the invention provides a predictive model obtained using the method of the first aspect of the invention.
In a third aspect, the present invention provides a method of predicting the origin of neoplastic tissue during pregnancy comprising:
1) Obtaining the expression levels of a plurality of genes for each of a plurality of tumor tissues;
2) Acquiring cfDNA sequencing data for a pregnant woman sample to be tested;
3) Calculating the sequencing depth of the plurality of genes, wherein the sequencing depth of the genes is the sequencing depth before and after a transcription start site, a transcription termination site and/or a genome open region;
4) Obtaining a correlation coefficient based on the sequencing depth of the plurality of genes and the expression levels of the plurality of genes for the plurality of tumor types;
5) Establishing tumor copy number variation spectrum characteristics;
6) Combining the correlation coefficient and the tumor copy number variation spectrum feature, and inputting the combination as input data into the prediction model according to the second aspect of the invention for prediction.
In one embodiment, the tumor tissue is the same tumor tissue used to construct the predictive model.
In one embodiment, in 1), the plurality of genes is the same as the genes used to construct the predictive model for each tumor tissue.
In one embodiment, in 3), the sequencing depth is a normalized relative sequencing depth.
In one embodiment, in 3), the calculation of the sequencing depth of the gene is the same as when constructing the predictive model.
In one embodiment, in 4), the correlation coefficient is a normalized correlation coefficient.
In one embodiment, in 5), the tumor copy number variation spectral signature is expressed as P-MAD:
Wherein $C_i$ is the median of the log2-transformed copy number ratio within the $i$-th CNV segment, $2^{C_i}$ is the copy number of the $i$-th CNV segment, $l_i$ is the length of the $i$-th CNV segment, and $L$ is the whole-genome length.
In a fourth aspect, the present invention provides a system for constructing a model for predicting the origin of tumor tissue during pregnancy, the system being configured for carrying out the method of the first aspect of the invention.
In a fifth aspect, the present invention provides a system for predicting the origin of tumoral tissue during pregnancy, said system being configured for carrying out the method of the third aspect of the invention.
For the first time, this scheme successfully uses cfDNA data from maternal plasma obtained in a single sampling: nucleosome-related features are added on top of conventional NIPT prediction and combined with CNV or the number of chromosomal aneuploidies to predict the tumor type. This markedly increases the clinical value of NIPT data: in addition to determining whether the fetal genome is abnormal, the maternal condition is assessed at the same time and the likely tumor tissue type is reported, providing support for clinical diagnosis.
The invention develops a biomarker for predicting the type of pregnancy-associated tumors and is well suited to the diagnosis of tumors occurring with pregnancy.
Drawings
FIG. 1 illustrates a model evaluation graph according to one embodiment of the invention.
Detailed Description
The invention uses sequencing to analyze the distribution of cfDNA sequencing fragments across different chromosome segments and the distribution characteristics of different tumor types, and analyzes the genome coverage of tumor samples in specific chromosomal intervals, such as the transcription start site (TSS) and the transcription termination site (transcript end site, TES). Using the expression values of each cancer type in a public database (the TCGA tumor database) as the expression reference, the invention estimates the tumor tissue type by correlating nucleosome position and coverage near the TSS and TES with the gene expression levels of the different cancer types.
According to the invention, the tumor tissue source can be predicted at the same time as the prediction model is constructed, so the training samples and the sample to be tested are processed in parallel. The method of predicting the source of tumor tissue during pregnancy using plasma free DNA can therefore be expressed as:
1) Obtaining the expression levels of a plurality of genes for each of a plurality of tumor tissues;
2) For training samples from pregnant women with tumors of known type and for the pregnant woman sample to be tested, acquiring cfDNA sequencing data of each sample;
3) Calculating, for each sample, the sequencing depth of the plurality of genes, the sequencing depth of the genes being the sequencing depth before and after the transcription start site, transcription stop site and/or genomic open region;
4) Obtaining, for each sample, a correlation coefficient based on the sequencing depth of the plurality of genes and the expression levels of the plurality of genes for the plurality of tumor types;
5) For each sample, establishing a tumor copy number variation spectrum feature;
6) Combining the correlation coefficient and the tumor copy number variation spectrum feature of each training sample into a feature set, and training the prediction model with the feature sets of a plurality of samples as training samples to obtain a trained prediction model;
7) For the pregnant woman sample to be tested, combining its correlation coefficient and tumor copy number variation spectrum feature and inputting them as input data into the trained prediction model for prediction.
In an exemplary embodiment of the present invention, the establishment of the predictive model for predicting the tumor tissue origin during pregnancy using plasma free DNA according to the present invention comprises the following specific steps:
1. Preliminary data processing: All samples were derived from long-term clinical follow-up; each subject was confirmed to have a tumor concurrent with pregnancy, and the tumor type was determined clinically at diagnosis. For all samples used for model training, prediction and validation, the raw off-machine cfDNA sequencing data (FASTQ format) are quality-controlled and then aligned to the human reference genome (hg19 or hg38 may be selected) using alignment software (e.g., the samse mode of BWA).
2. Calculation of the per-sample sequencing coverage (RC) value: For each sample, the sequencing depth is calculated in the region near the transcription start site (TSS) of each gene in the whole genome, where the region near the TSS is defined as 1 kb upstream and downstream of the TSS. The TSS may be replaced with the transcription termination site (transcript end site, TES) or a genomic open region (NDR). The RC value is calculated differently for single-end and paired-end sequencing. For single-end sequencing, each read is extended from its aligned start position, in the direction of sequencing, to 167 bp, the theoretical length of a cfDNA fragment. For paired-end sequencing, only fragments with an insert length between 120 bp and 300 bp whose read1 and read2 both align to the same chromosome are counted.
After the positions of the sequenced fragments on the genome are located from the alignment file, the cumulative sequencing depth in the region near the TSS of each gene is calculated. Only the central 61 bp of each fragment is counted. The counts are normalized to the total number of aligned reads (expressed as relative cumulative depth per 1 million reads) to give the relative sequencing depth (relative coverage, RC), which removes differences caused by differing read numbers between samples. For a given sample, this yields an RC array (RC1, RC2, …, RCg) of the cumulative relative sequencing depths of all g genes. Genes may be screened in the expectation of improving prediction. The gene expression levels in the TCGA database serve as reference values, and the statistics used for model prediction are obtained by comparison against them.
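The RC calculation described above can be sketched roughly as follows for the single-end case. This is a minimal illustration, not the inventors' implementation: the function names, the input layout (tuples of chromosome, start and strand), the coordinate convention and the brute-force loop over genes are all assumptions made for readability; only the 167 bp extension, the central 61 bp window, the 1 kb flank and the per-million normalization come from the text.

```python
from collections import defaultdict

FRAGMENT_LEN = 167  # theoretical cfDNA fragment length (from the text)
CENTER_LEN = 61     # central window counted per fragment (from the text)
FLANK = 1000        # 1 kb upstream and downstream of the TSS (from the text)


def fragment_center(start, strand):
    """Half-open [lo, hi) central 61 bp of a read extended to 167 bp.

    Assumed convention: `start` is the leftmost aligned base for '+' reads
    and the rightmost aligned base for '-' reads (0-based).
    """
    frag_lo = start if strand == "+" else start - FRAGMENT_LEN + 1
    mid = frag_lo + FRAGMENT_LEN // 2
    lo = mid - CENTER_LEN // 2
    return lo, lo + CENTER_LEN


def relative_coverage(reads, tss_by_gene):
    """Cumulative depth near each TSS, scaled to depth per 1 million reads.

    reads: iterable of (chrom, start, strand)
    tss_by_gene: dict mapping gene -> (chrom, tss_position)
    """
    reads = list(reads)
    depth = defaultdict(int)
    for chrom, start, strand in reads:
        lo, hi = fragment_center(start, strand)
        # Brute-force scan over genes; a real pipeline would use an index.
        for gene, (g_chrom, tss) in tss_by_gene.items():
            if chrom != g_chrom:
                continue
            overlap = min(hi, tss + FLANK + 1) - max(lo, tss - FLANK)
            if overlap > 0:
                depth[gene] += overlap  # bases of the window covered
    scale = 1_000_000 / len(reads) if reads else 0.0
    return {gene: depth[gene] * scale for gene in tss_by_gene}
```

The returned dictionary plays the role of the RC array (RC1, RC2, …, RCg), keyed by gene name rather than by position.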
3. Correlation with expression in the TCGA database and calculation of the vector-normalized correlation coefficient: The expression matrix of each specific cancer tissue (e.g., BRCA breast cancer, DLBC lymphoma or LIHC liver cancer) is selected from the TCGA database, and for each cancer type the mean expression of each gene across all samples of that cancer type is calculated. This gives, for each cancer type j, an array Gj = (G1, G2, …, Gg) of the mean expression values of the genes, where j ∈ {BRCA, DLBC, LIHC}.
For sample i, the correlation coefficient cor_ij between the RC array and the expression array Gj of each cancer type is calculated. For example, for the pre-classified tumor types breast cancer (BRCA), lymphoma (DLBC) and liver cancer (LIHC), this gives the correlation coefficient array cor_i = (cor_iBRCA, cor_iDLBC, cor_iLIHC).
The correlation array cor_i of sample i is then vector-normalized (vector normalization); for example, the following formula may be used for cor_iBRCA. This finally yields the VNPC array containing the vector-normalized correlation coefficient values.
Testing is performed by the leave-one-out method: in each round one sample is held out for testing and all other samples are used as the training set. If the total number of samples is m, a matrix VNPC_ij is obtained, where j ∈ {BRCA, DLBC, LIHC} and i ∈ {1, 2, …, m}.
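As a rough illustration of step 3, the sketch below correlates one sample's RC array with the mean-expression array of each cancer type and then vector-normalizes the result. Two points are assumptions, since the text does not fix them: the correlation is taken to be the Pearson coefficient, and "vector normalization" is taken to mean division by the Euclidean (L2) norm of the correlation array. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vector_normalize(cor):
    """Divide each correlation by the L2 norm of the whole array (assumed)."""
    norm = math.sqrt(sum(v * v for v in cor.values()))
    return {k: v / norm for k, v in cor.items()}


def vnpc(rc, expression_by_cancer):
    """rc: RC values per gene; expression_by_cancer: cancer -> mean expression
    per gene, in the same gene order as rc."""
    cor = {c: pearson(rc, g) for c, g in expression_by_cancer.items()}
    return vector_normalize(cor)
```

Under this reading, each sample contributes one row of three values (BRCA, DLBC, LIHC) whose squares sum to 1, which keeps samples with uniformly weak correlations comparable to samples with uniformly strong ones.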
4. Tumor Copy Number Variation (CNV) spectral features:
The CNV spectral feature takes into account both the CNV length and the degree of copy number variation, and P-MAD (proportion of median absolute deviation from copy number neutrality) is defined as the metric. For prediction, training samples and test samples are not distinguished and the features of all samples are computed identically: both the CNV feature and the correlation matrix described above are calculated for every sample. P-MAD is a single value per sample.
Wherein $C_i$ is the median of the log2-transformed copy number ratio (median log2-transformed copy number ratio) within the $i$-th CNV segment, $2^{C_i}$ is the copy number of the $i$-th CNV segment, $l_i$ is the length of the $i$-th CNV segment, and $L$ is the whole-genome length.
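The P-MAD expression itself is not reproduced in this text. Purely as an illustration, the sketch below assumes one reading that is consistent with the variable definitions above and with the name of the metric: a length-weighted absolute deviation of each segment's copy number ratio from neutrality, sum over i of |2^(C_i) - 1| * l_i / L. This form, the function name p_mad and the input layout are assumptions, not the formula stated in the patent.

```python
def p_mad(segments, genome_length):
    """Assumed P-MAD: sum_i |2**C_i - 1| * l_i / L.

    segments: iterable of (C_i, l_i), where C_i is the median log2 copy
    number ratio of the i-th CNV segment and l_i is its length in bp.
    genome_length: L, the whole-genome length in bp.
    """
    return sum(abs(2.0 ** c - 1.0) * l for c, l in segments) / genome_length
```

Under this assumed form, copy-neutral segments (C_i = 0) contribute nothing, and both gains and losses raise the score in proportion to the fraction of the genome they span.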
5. Establishing a prediction model:
The tumor copy number variation (CNV) spectral feature P-MAD of each sample is combined column-wise with the matrix VNPC to form the feature set of a training sample, so that a plurality of samples form a two-dimensional matrix. A logistic regression, random forest or other prediction model can then be trained using statistical software such as R, and the result is saved as the prediction model for use in the final prediction step.
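The column-wise combination and the leave-one-out evaluation can be sketched as below. To stay dependency-free, the sketch swaps in a nearest-centroid classifier as a stand-in; it is not the random forest or logistic regression named in the text, and the function names and data layout are hypothetical.

```python
import math

CANCERS = ("BRCA", "DLBC", "LIHC")


def build_features(vnpc_row, p_mad_value):
    """One feature row: the three VNPC values followed by P-MAD."""
    return [vnpc_row[c] for c in CANCERS] + [p_mad_value]


def fit_centroids(X, y):
    """Stand-in model (assumption): the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, v in enumerate(row):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(model, row):
    """Return the class whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda lab: math.dist(model[lab], row))


def leave_one_out(X, y):
    """Hold out each sample once, train on the rest, predict the held-out one."""
    preds = []
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(model, X[i]))
    return preds
```

Replacing `fit_centroids` and `predict` with a random forest or logistic regression from a statistics package leaves the feature layout and the leave-one-out loop unchanged.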
In the present invention, the method of constructing a model for predicting the source of tumor tissue during pregnancy of the present invention may be presented in a systematic manner, which is configured to implement the method of constructing a model for predicting the source of tumor tissue during pregnancy of the present invention. The system for predicting the pregnancy tumor tissue source can be used for constructing a model for predicting the pregnancy tumor tissue source. For example, a system for constructing a model for predicting the source of tumor tissue during pregnancy may comprise a data acquisition unit, a calculation unit and a predictive model training unit. The data acquisition unit is used for respectively acquiring the expression levels of a plurality of genes of a plurality of tumor tissues, and acquiring cfDNA sequencing data of each sample for the samples of pregnant women with pregnancy combined with tumors of known tumor types. The calculation unit is used for calculating the sequencing depth of the genes for each sample, wherein the sequencing depth of the genes is the sequencing depth before and after a transcription start site, a transcription termination site and/or a genome open region, and the correlation coefficient is obtained based on the sequencing depth of the genes and the expression level of the genes of the corresponding tumor types, so that the tumor copy number variation spectrum characteristic is established. The prediction model training unit is used for combining the correlation coefficient of each sample and the tumor copy number variation spectrum characteristics, and the characteristic set of the plurality of samples is used as a training sample to train the prediction model to obtain a trained prediction model.
In the present invention, the method of predicting the source of pregnancy tumor tissue of the present invention may be presented in a systematic manner, which is configured to implement the method of predicting the source of pregnancy tumor tissue of the present invention. For example, a system for predicting a source of tumor tissue during pregnancy may include a data acquisition unit, a calculation unit, and a prediction unit. The data acquisition unit is used for respectively acquiring the expression levels of a plurality of genes of a plurality of tumor tissues, and acquiring cfDNA sequencing data for a pregnant woman sample to be tested. The calculation unit is used for calculating the sequencing depth of the genes, wherein the sequencing depth of the genes is the sequencing depth before and after a transcription start site, a transcription termination site and/or a genome open region, and the correlation coefficient is obtained based on the sequencing depth of the genes and the expression levels of the genes of the tumor types, so that the tumor copy number variation spectrum characteristic is established. The prediction unit is used for combining the correlation coefficient and the tumor copy number variation spectrum characteristic and inputting the combined correlation coefficient and the tumor copy number variation spectrum characteristic as input data into the prediction model for prediction.
The invention is illustrated by the following examples.
Examples
1) Data collection:
The gene expression levels of tumor tissues of breast cancer, liver cancer and lymphoma were obtained from the TCGA database, as shown in table 1.
TABLE 1 expression level (FPKM) sample information of samples of respective disease types in TCGA database
Here, all available genes were selected, and the gene expression levels in the TCGA database were used as reference values. A final set of 20532 genes was used in the subsequent steps.
In this example, the input data were cfDNA sequencing data from three types of confirmed pregnancy-associated tumor (liver cancer, breast cancer and lymphoma) identified on follow-up of abnormal NIPT results (multiple chromosomal aneuploidies). There were 12 liver cancer patients, 15 breast cancer patients and 9 lymphoma patients. The cfDNA sequencing data were obtained by single-end cfDNA sequencing.
For each sample, the sequencing depth of the plurality of genes was calculated, the sequencing depth of a gene being the sequencing depth upstream and downstream of the transcription start site, transcription termination site and/or genomic open region. Here the sequencing depth was calculated in the region near the TSS of each gene in the whole genome (1 kb upstream and downstream of the TSS).
For single-end sequencing, each read was extended from its aligned start position, in the direction of sequencing, to 167 bp, the theoretical length of a cfDNA fragment.
After the positions of the sequenced fragments on the genome were located from the alignment file, the cumulative sequencing depth in the region near the TSS of each gene was calculated. Only the central 61 bp of each fragment was counted. The counts were normalized to the total number of aligned reads (expressed as relative cumulative depth per 1 million reads) to give the relative sequencing depth (relative coverage, RC), which removes differences caused by differing read numbers between samples. For each sample, this yields an RC array (RC1, RC2, …, RCg) of the cumulative relative sequencing depths of all genes.
2) The P-MAD value and the vector-normalized correlation coefficient values (VNPC) of each sample were calculated according to the above method to obtain the feature set of each sample.
3) Evaluation of classification performance by the leave-one-out method:
A random forest algorithm was used with parameters ntree = 500 and mtry = 2. In each round the data of one patient were held out as the test sample and the remaining data were used as training samples; the experiment was repeated 36 times, once per patient. The resulting model evaluation chart, showing the overall ROC curve, is given in FIG. 1. The overall predictions of the model are shown in the confusion matrix of Table 2, and the accuracy evaluation of the model is shown in Table 3. According to Table 2, 10 of the 15 breast cancer samples, 7 of the 12 liver cancer samples and 3 of the 9 lymphoma samples were predicted correctly. According to Table 3, breast cancer was predicted with a sensitivity of 66.67% and a specificity of 61.9%, liver cancer with a sensitivity of 58.33% and a specificity of 79.17%, and lymphoma with a sensitivity of 33.33% and a specificity of 88.89%.
Table 2: Confusion matrix results (rows: predicted class; columns: actual class)
                           Breast cancer   Liver cancer   Lymphoma
Predicted breast cancer    10              4              4
Predicted liver cancer     3               7              2
Predicted lymphoma         2               1              3
Total                      15              12             9
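The per-class sensitivity and specificity quoted in the text follow directly from the confusion matrix in Table 2 under a one-versus-rest reading. The short sketch below recomputes them; the variable and function names are illustrative only.

```python
LABELS = ["breast cancer", "liver cancer", "lymphoma"]

# Rows are predicted classes, columns are actual classes (values from Table 2).
CONFUSION = [
    [10, 4, 4],
    [3, 7, 2],
    [2, 1, 3],
]


def sensitivity_specificity(confusion, k):
    """One-versus-rest sensitivity and specificity for class index k."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(row[k] for row in confusion) - tp  # actual k, predicted other
    fp = sum(confusion[k]) - tp                 # predicted k, actual other
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


for k, label in enumerate(LABELS):
    sens, spec = sensitivity_specificity(CONFUSION, k)
    print(f"{label}: sensitivity {sens:.2%}, specificity {spec:.2%}")
```

For breast cancer, for example, 10 of 15 actual cases are recovered (sensitivity 66.67%), and 13 of the 21 non-breast-cancer samples are not called breast cancer (specificity about 61.9%).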
TABLE 3 predictive performance of models
The inventors tested with 100bp, 400bp or 600bp upstream and downstream of TSS, and tested TES or NDR instead of TSS, and obtained experimental results similar to 1kb upstream and downstream of TSS.
By adding tests on other tumor types, the model and method can be further extended into a multi-cancer prediction model and applied in areas such as tumor recurrence detection.

Claims (16)

  1. A method of constructing a model for predicting tumor tissue origin during pregnancy, comprising:
    1) Obtaining the expression levels of a plurality of genes for each of a plurality of tumor tissues;
    2) For samples from pregnant women with pregnancy-associated tumors of known type, acquiring cfDNA sequencing data of each sample;
    3) Calculating, for each sample, the sequencing depth of the plurality of genes, the sequencing depth of the genes being the sequencing depth before and after the transcription start site, transcription stop site and/or genomic open region;
    4) Obtaining, for each sample, a correlation coefficient based on the sequencing depth of the plurality of genes and the expression level of the plurality of genes for the respective tumor type;
    5) For each sample, establishing a tumor copy number variation spectrum feature;
    6) And combining the correlation coefficient of each sample with the tumor copy number variation spectrum characteristics, and training the prediction model by taking the characteristic sets of a plurality of samples as training samples to obtain a trained prediction model.
  2. The method of claim 1, wherein the tumor tissue is breast cancer tissue, liver cancer tissue, and/or lymphoma.
  3. The method according to claim 1 or 2, wherein in 1), the expression level of each of the plurality of genes is the average over a plurality of samples of each tumor tissue, for example from a database such as the TCGA database.
  4. The method according to any one of claims 1-3, wherein in 3), the sequencing depth is a normalized relative sequencing depth.
  5. The method according to any one of claims 1-4, wherein in 3), the sequencing depth of the gene is the sequencing depth within 100 bp, 400 bp, 600 bp, 1 kb, etc. upstream and downstream of the transcription start site, transcription termination site and/or genomic open region.
  6. The method according to any one of claims 1-5, wherein in 4), the correlation coefficient is a normalized correlation coefficient.
  7. The method according to any one of claims 1-6, wherein in 5), the tumor copy number variation spectrum feature is expressed as P-MAD:
    wherein $C_i$ is the median of the base-2 log-transformed copy number ratio in the $i$-th CNV segment interval, $2^{C_i}$ is the copy number of the $i$-th CNV segment, $l_i$ is the length of the $i$-th CNV segment interval, and $L$ is the full genome length.
  8. The method according to any one of claims 1-7, wherein in 6), the training is performed by logistic regression or random forest.
  9. A predictive model obtained by the method according to any one of claims 1-8.
  10. A method of predicting tumor tissue origin during pregnancy comprising:
    1) Obtaining the expression levels of a plurality of genes in each of a plurality of tumor tissues;
    2) Acquiring cfDNA sequencing data of a sample from a pregnant woman to be tested;
    3) Calculating the sequencing depth of the plurality of genes, the sequencing depth of a gene being the sequencing depth upstream and downstream of the transcription start site, transcription termination site and/or genomic open region;
    4) Obtaining a correlation coefficient based on the sequencing depth of the plurality of genes and the expression levels of the plurality of genes for the plurality of tumor types;
    5) Establishing a tumor copy number variation spectrum feature;
    6) Combining the correlation coefficient with the tumor copy number variation spectrum feature, and inputting the combination as input data into the prediction model according to claim 9 for prediction.
  11. The method of claim 10, wherein the tumor tissue is the same tumor tissue used to construct the predictive model of claim 9.
  12. The method according to claim 10 or 11, wherein in 1) the plurality of genes are the same as the genes used for constructing the predictive model according to claim 9 for each tumor tissue.
  13. The method according to any one of claims 10-12, wherein in 3), the sequencing depth is a normalized relative sequencing depth.
  14. The method according to any one of claims 10-13, wherein in 3), the sequencing depth of the gene is calculated in the same way as when constructing the prediction model according to claim 9.
  15. The method according to any one of claims 10-14, wherein in 4), the correlation coefficient is a normalized correlation coefficient.
  16. The method according to any one of claims 10-15, wherein in 5), the tumor copy number variation spectrum feature is expressed as P-MAD:
    wherein $C_i$ is the median of the base-2 log-transformed copy number ratio in the $i$-th CNV segment interval, $2^{C_i}$ is the copy number of the $i$-th CNV segment, $l_i$ is the length of the $i$-th CNV segment interval, and $L$ is the full genome length.
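The feature construction recited in steps 4) to 6) of claims 1 and 10 can be sketched as follows. This is a hedged illustration under stated assumptions: the correlation is taken to be Pearson's coefficient, "normalized" is interpreted as scaling the vector of per-tumor coefficients to unit Euclidean length, and the P-MAD value is passed in as a precomputed number. All function names and toy values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_features(depth: Sequence[float],
                         expression_by_tumor: Dict[str, Sequence[float]]) -> List[float]:
    """One coefficient per tumor type, scaled to unit length (assumed meaning of 'normalized')."""
    raw = [pearson(depth, expr) for expr in expression_by_tumor.values()]
    norm = math.sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw] if norm else raw


def feature_vector(depth: Sequence[float],
                   expression_by_tumor: Dict[str, Sequence[float]],
                   cnv_feature: float) -> List[float]:
    """Concatenate the correlation features with a precomputed CNV spectrum feature."""
    return correlation_features(depth, expression_by_tumor) + [cnv_feature]


# Synthetic toy values for four genes (hypothetical, for illustration only).
depth = [1.0, 2.0, 3.0, 4.0]
expression_by_tumor = {
    "breast cancer": [4.0, 3.0, 2.0, 1.0],
    "liver cancer": [1.0, 2.0, 3.0, 4.0],
    "lymphoma": [1.0, 3.0, 2.0, 4.0],
}
features = feature_vector(depth, expression_by_tumor, cnv_feature=0.12)
print(features)
```

Each sample then contributes one such vector, and the vectors of all training samples form the input to the classifier.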
CN202280070284.4A 2022-01-28 2022-05-18 Model for predicting pregnancy tumor tissue sources by utilizing plasma free DNA and construction method thereof Pending CN118119718A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN202210108327 2022-01-28
CN2022101083275 2022-01-28
PCT/CN2022/093568 WO2023142311A1 (en) 2022-01-28 2022-05-18 Model for predicting tumor tissue source during pregnancy by utilizing plasma free dna and construction method of model

Publications (1)

Publication Number Publication Date
CN118119718A CN118119718A (en) 2024-05-31

Family

ID=87470266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280070284.4A Pending CN118119718A (en) 2022-01-28 2022-05-18 Model for predicting pregnancy tumor tissue sources by utilizing plasma free DNA and construction method thereof

Country Status (2)

Country Link
CN (1) CN118119718A (en)
WO (1) WO2023142311A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105349678A (en) * 2015-12-03 2016-02-24 上海美吉生物医药科技有限公司 Detection method of chromosome copy number variation
US20190287645A1 (en) * 2016-07-06 2019-09-19 Guardant Health, Inc. Methods for fragmentome profiling of cell-free nucleic acids
CN110272985B (en) * 2019-06-26 2021-08-17 广州市雄基生物信息技术有限公司 Tumor screening kit based on peripheral blood plasma free DNA high-throughput sequencing technology, system and method thereof
WO2021262770A1 (en) * 2020-06-22 2021-12-30 Children's Hospital Medical Center De novo characterization of cell-free dna fragmentation hotspots in healthy and early-stage cancers
CN113539355B (en) * 2021-07-15 2022-11-25 云康信息科技(上海)有限公司 Tissue-specific source for predicting cfDNA (deoxyribonucleic acid), related disease probability evaluation system and application

Also Published As

Publication number Publication date
WO2023142311A1 (en) 2023-08-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination