CN115612743B - HPV integration gene combination and application thereof in prediction of cervical cancer recurrence and metastasis - Google Patents

Publication number
CN115612743B
Authority
CN
China
Prior art keywords
model
metastasis
cervical cancer
risk
relapse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211597867.0A
Other languages
Chinese (zh)
Other versions
CN115612743A (en)
Inventor
沈捷
王伟伟
张福泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Original Assignee
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Application filed by Peking Union Medical College Hospital, Chinese Academy of Medical Sciences
Priority to CN202211597867.0A
Publication of CN115612743A
Application granted
Publication of CN115612743B
Legal status: Active

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20: Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/118: Prognosis of disease development

Abstract

The invention discloses an HPV integration gene combination and its application in predicting cervical cancer recurrence and metastasis. The invention identifies a gene set closely related to cervical cancer recurrence or metastasis and thereby improves the reliability of assessing recurrence or metastasis risk. Because the role of HPV in patients with advanced cervical cancer and in patients who relapse or develop metastasis after treatment is poorly understood, the invention is of significance for guiding clinical treatment and for the prevention and treatment of cervical cancer.

Description

HPV integration gene combination and application thereof in prediction of cervical cancer recurrence and metastasis
Technical Field
The invention relates to a composition, a reagent, a method and a system for disease diagnosis, in particular to a primer or probe composition, a reagent, a method and a system for predicting recurrence and metastasis risks of cervical cancer.
Background
Cervical cancer is the fourth most common cancer in women worldwide, and most cervical cancers occur in association with human papillomavirus (HPV) infection. However, the role of HPV in patients with locally advanced cervical cancer and in patients with recurrent or metastatic cervical cancer after treatment is poorly understood. Current research data show that, after treatment, about 31% (29% to 38%) of cervical cancer patients have uncontrolled tumors or relapse, and the 5-year survival rate of patients with locally recurrent cervical cancer is less than 40%. Accurately predicting a patient's response to chemoradiotherapy in advance and intervening actively could therefore improve the prognosis of patients who relapse, and is an important direction for future research. Likewise, finding potential individualized sensitization or targeted therapies for chemoradiotherapy-resistant patients is clinically important for achieving precise, individualized diagnosis and treatment. If HPV integration sites at key positions in the genomes of patients with advanced, recurrent or metastatic cervical cancer can be found, they may become a key breakthrough point for treating these patients.
Although HPV integration sites in a patient's genome can now be detected, effectively predicting the risk of cervical cancer recurrence and metastasis remains a significant challenge.
Disclosure of Invention
Based on nanopore sequencing, the invention performs fusion gene detection on samples from high-risk HPV-infected patients with clinically confirmed metastasis or recurrence and, after learning and training with several different models, obtains a combination of loci that can be used for clinical assessment of cervical cancer recurrence or metastasis risk. Specifically, the present invention includes the following.
In a first aspect of the present invention, there is provided a primer or probe composition for predicting the risk of recurrence or metastasis of cervical cancer, consisting of a plurality of primers or probes capable of detecting the insertion or integration of an HPV-derived gene fragment at an insertion or integration site in target genes of a subject, the target genes consisting of the 338 genes shown in Table 4.
In a second aspect of the invention, there is provided a kit for assessing the risk of recurrence or metastasis of cervical cancer comprising a primer or probe composition according to the first aspect.
In a third aspect of the invention, there is provided a system or device for assessing the risk of recurrence or metastasis of cervical cancer, comprising:
a data acquisition unit for acquiring insertion site information in a biological sample of a subject, wherein the insertion site information refers to information on insertion or integration of HPV-derived gene fragments into a target gene of the subject, the target gene consisting of 338 genes shown in table 4;
a data processing unit for inputting the data from the data acquisition unit into a prediction model and performing data processing, wherein the prediction model is selected from at least one of logistic regression, random forest and gradient boosting decision tree, or a combination thereof;
an output unit for outputting a result assessing the subject as being at high risk of relapse or metastasis, or at low risk of relapse or metastasis.
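The three units above can be pictured as a small pipeline. The following is a minimal sketch only: the `Sample` type, the function names, the three placeholder gene names and the stand-in scoring rule are all hypothetical and are not taken from the source, which uses a trained logistic regression, random forest or gradient boosting decision tree over the 338 genes of Table 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the 338-gene panel of Table 4 (placeholder names).
PANEL_GENES: List[str] = ["GENE_A", "GENE_B", "GENE_C"]


@dataclass
class Sample:
    sample_id: str
    # Normalized read support for HPV insertion, keyed by gene name.
    insertion_support: Dict[str, float]


def acquire_features(sample: Sample, panel: List[str]) -> List[float]:
    """Data acquisition unit: one feature per panel gene, 0.0 if no insertion was seen."""
    return [sample.insertion_support.get(gene, 0.0) for gene in panel]


def process(features: List[float], model: Callable[[List[float]], float]) -> float:
    """Data processing unit: pass the feature vector to a prediction model."""
    return model(features)


def report(score: float, cutoff: float = 0.5) -> str:
    """Output unit: map a model score to a risk label (the cutoff is an assumption)."""
    return "high risk" if score >= cutoff else "low risk"


def toy_model(features: List[float]) -> float:
    """Placeholder model (an assumption): fraction of panel genes with any insertion."""
    return sum(1 for value in features if value > 0) / len(features)


sample = Sample("S1", {"GENE_A": 12.0, "GENE_C": 3.5, "OTHER": 9.0})
label = report(process(acquire_features(sample, PANEL_GENES), toy_model))
```

Genes outside the panel (here `OTHER`) are ignored by the acquisition step, which mirrors restricting the input to the target gene set.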
In certain embodiments, the system for evaluating the risk of recurrence or metastasis of cervical cancer according to the present invention further comprises a nanopore sequencer in communication with the data acquisition unit, for passing DNA fragments of 1-5 kb in length, derived from a biological sample collected from a cervical cancer subject after treatment, through a chip nanopore located near an electrode, and detecting the current passing through the nanopore via the electrode.
In certain embodiments, the system for assessing the risk of recurrence or metastasis of cervical cancer according to the present invention, wherein the construction of the predictive model comprises:
grouping the processed raw data into an initial relapse-free group, an initial progression group and a relapse/metastasis group;
combining the data in the initial relapse-free group and the initial progression group, randomly dividing the combined data into a training set and a test set, and using the data in the relapse/metastasis group as a validation group to verify the accuracy of model prediction;
the independent variable of the model is the normalized number of reads supporting HPV insertion in each gene, the dependent variable is the group to which the sample belongs, and models are built using K-nearest neighbor, logistic regression, random forest, naive Bayes, gradient boosting decision tree and XGBoost, respectively.
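The grouping and splitting steps above can be sketched as follows. This is an illustration under stated assumptions: the function name `split_cohort`, the 0/1 label encoding, the 30% test fraction, the fixed seed and the toy patient identifiers are all hypothetical; the source does not state the split ratio.

```python
import random
from typing import Dict, List, Tuple

Labeled = Tuple[str, int]  # (patient id, label): 1 = relapse or metastasis, 0 = none


def split_cohort(groups: Dict[str, List[str]], test_fraction: float = 0.3,
                 seed: int = 0) -> Tuple[List[Labeled], List[Labeled], List[Labeled]]:
    """Pool the two initial-treatment groups, split them at random, hold out the third."""
    pooled = [(pid, 0) for pid in groups["relapse_free"]]
    pooled += [(pid, 1) for pid in groups["progression"]]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pooled)
    n_test = max(1, round(len(pooled) * test_fraction))
    test, train = pooled[:n_test], pooled[n_test:]
    # The relapse/metastasis group never enters training; it is used only for validation.
    validation = [(pid, 1) for pid in groups["relapse_metastasis"]]
    return train, test, validation


# Toy identifiers only; group sizes here are not the patent's.
toy_groups = {
    "relapse_free": [f"RF{i}" for i in range(7)],
    "progression": [f"PG{i}" for i in range(3)],
    "relapse_metastasis": [f"RM{i}" for i in range(4)],
}
train, test, validation = split_cohort(toy_groups)
```

Keeping the relapse/metastasis group out of the training pool means the validation result is not influenced by samples the model has already seen.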
In certain embodiments, the system for assessing the risk of recurrence or metastasis of cervical cancer according to the present invention, wherein the independent variables are ranked by the importance coefficients given by each model, the independent variables whose importance coefficient exceeds a threshold in the corresponding model are selected as that model's candidate variable combination, and the union of the 6 candidate combinations given by the 6 different models is used for prediction.
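The selection rule above, a per-model threshold followed by a union, can be sketched as below. The model names, gene names, coefficients and thresholds are invented for illustration, and comparing the absolute value of each coefficient against the threshold is an assumption (signed coefficients arise, for example, in logistic regression).

```python
from typing import Dict, Set


def select_candidates(importances: Dict[str, float], threshold: float) -> Set[str]:
    """Genes whose importance coefficient (absolute value, an assumption) exceeds the threshold."""
    return {gene for gene, coef in importances.items() if abs(coef) > threshold}


def union_of_candidates(per_model: Dict[str, Dict[str, float]],
                        thresholds: Dict[str, float]) -> Set[str]:
    """Union of the candidate gene sets selected by each model."""
    selected: Set[str] = set()
    for model_name, importances in per_model.items():
        selected |= select_candidates(importances, thresholds[model_name])
    return selected


# Toy coefficients for three of the six models (values are made up).
per_model = {
    "logistic_regression": {"GENE_A": 0.8, "GENE_B": -0.6, "GENE_C": 0.05},
    "random_forest": {"GENE_A": 0.40, "GENE_B": 0.02, "GENE_D": 0.30},
    "xgboost": {"GENE_C": 0.01, "GENE_E": 0.25},
}
thresholds = {"logistic_regression": 0.5, "random_forest": 0.1, "xgboost": 0.1}
panel = sorted(union_of_candidates(per_model, thresholds))
```

A gene needs to pass the threshold in only one model to enter the union, so the resulting panel is at least as large as any single model's candidate set.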
In certain embodiments, the system for assessing the risk of recurrence or metastasis of cervical cancer according to the present invention further comprises a step of constructing different predictive models using the obtained combination of independent variables for prediction, and a step of determining a final predictive model according to the prediction performance.
In certain embodiments, the system for assessing risk of cervical cancer recurrence or metastasis according to the invention, wherein the different predictive models are selected from logistic regression, random forest and gradient boosting decision trees.
In a fourth aspect of the present invention, there is provided a computer storage medium having a computer program stored therein, the computer program, when executed by a computer, implementing the steps of: obtaining insertion site information in a biological sample of a subject, wherein the insertion site information refers to information on the insertion or integration of an HPV-derived gene fragment into target genes of the subject, the target genes consisting of the 338 genes shown in Table 4; inputting the obtained information into a prediction model and performing data processing; and outputting a result assessing the subject as being at high risk of relapse or metastasis, or at low risk of relapse or metastasis.
The primer or probe composition of the invention is designed and developed for genes closely related to cervical cancer recurrence or metastasis. These genes include a number of genes identified for the first time, which improves the reliability of assessing the risk of cervical cancer recurrence or metastasis. Because the role of HPV in patients with advanced cervical cancer and in patients who relapse or develop metastasis after treatment is poorly understood, the invention is of significance for guiding the clinical treatment of cervical cancer.
Detailed Description
Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention but rather as a more detailed description of certain aspects, features and embodiments of the invention.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that the upper and lower limits of the range, and each intervening value therebetween, is specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference herein for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control.
Conventional methods for detecting HPV insertion sites in the genome have many disadvantages. For example, HPV detection based on hybrid capture can detect known high-risk HPV but cannot determine the specific HPV type. The more widely used real-time fluorescent quantitative PCR methods can identify specific HPV types but cannot determine whether HPV is fused with the human genome. HPV detection based on NGS with probe capture can detect both the HPV type and the fusion state, but its turnaround time is long and, because of the limited NGS read length, its fusion detection rate near repetitive regions of the human genome is low.
Nanopore sequencing is a third-generation sequencing technology. Unlike existing HPV detection methods, it can detect both known high-risk HPV and specific HPV types and, more importantly, produces longer reads, from 1 kb up to 1 Mb, so that fusions near repetitive regions of the human genome can be detected. On this basis, probes targeting the whole genomes of high-risk HPV types were designed, and many novel integration sites and fusion genes were discovered, greatly expanding the set of fusion genes.
In addition, by combining different machine learning models, the invention derives from a set of thousands of fusion genes a small panel capable of predicting cervical cancer recurrence or metastasis. The panel consists of 338 genes and shows markedly improved sensitivity, specificity, accuracy, precision and AUC for predicting cervical cancer recurrence or metastasis.
In the present invention, the insertion site is generally located within or in the vicinity of a specific gene in the genome of the subject, particularly in the 5' regulatory region of the gene, such as the promoter region. Moreover, the finding that HPV DNA fragments inserted into or near these genes are more likely to lead to recurrence or metastasis of cervical cancer is derived from real clinical cases, each with complete long-term follow-up data. The genes closely related to the recurrence or progression of cervical cancer are preferably identified based on follow-up results.
In the present invention, the subject is a human, preferably a patient with a confirmed cervical disease. The biological sample of the present invention is a tissue or a cell derived from the cervix, or a processed product thereof. The processed product includes a tissue or cell homogenate, lysate or extract, particularly a DNA extract.
The kit of the present invention comprises primers and/or probes for prognosis of cervical cancer or for predicting the risk of recurrence or metastasis of cervical cancer. These primers and/or probes can complementarily bind to sequences flanking each insertion site in the reference set of insertion sites, or the probes can complementarily bind to sequences comprising the reference insertion sites.
In addition to the primer or probe sets described above, the kits of the invention may also include a notice in a form prescribed by a governmental agency regulating the manufacture, use or sale of diagnostic kits. In addition, the kits of the invention may be provided with detailed instructions for use, storage, and troubleshooting. The kit may optionally also be provided in a suitable device, preferably for robotic handling in a high throughput setting.
In certain embodiments, the components (e.g., probe set) of the kits of the invention can be provided as a dry powder. When the reagents and/or components are provided as a dry powder, the powder can be reconstituted by the addition of a suitable solvent. It is contemplated that the solvent may also be disposed in another container. The container will typically comprise at least one vial, test tube, flask, bottle, syringe, and/or other container means, optionally in which the solvent is placed in equal portions. The kit may further comprise means for a second container comprising a sterile, pharmaceutically acceptable buffer and/or other solvent.
In certain embodiments, the components of the kits of the present invention may be provided in the form of a solution, such as an aqueous solution. When the components are present in aqueous solution, their concentrations or contents can readily be determined by the person skilled in the art according to the various requirements. For example, for storage the probe may be provided at a higher concentration, and for use it may be brought to the working concentration by, for example, diluting the higher-concentration solution.
The kit of the present invention may further comprise other reagents or ingredients, for example the DNA polymerase, the various dNTPs, and ions such as Mg²⁺ required for carrying out PCR. These additional reagents or components are known to those skilled in the art and can readily be found in publications such as Molecular Cloning: A Laboratory Manual, fourth edition (Cold Spring Harbor).
Where more than one component is present in a kit, the kit will also typically comprise a second, third or other additional container into which additional components may be separately placed. In addition, combinations of various components may be included in the container.
Kits of the invention may also include components that preserve or maintain DNA, such as agents that protect against nucleic acid degradation. Such components may be, for example, nuclease-free, or may provide protection against RNase. Any of the compositions or reagents described herein can be a component of a kit.
In the present invention, the specific sequences of the primer or probe composition are not particularly limited. Based on the genes in the disclosed list, those skilled in the art can obtain the primer or probe sequences of the first aspect of the present invention using public gene databases, HPV sequences, or conventional primer or probe design software, and can synthesize the corresponding reagents containing the primers or probes.
System or device
Those skilled in the art will appreciate that the various exemplary embodiments of the invention described herein may be implemented in software, or in software combined with hardware as necessary. Therefore, embodiments according to the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium or a non-transitory computer-readable storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the prediction method according to the present invention.
In an exemplary embodiment, the program product of the present invention can employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Correspondingly, based on the same inventive concept, the invention also provides the electronic equipment. In an exemplary embodiment, the electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting different system components (including the memory and the processor).
The memory stores program code executable by the processor to cause the processor to perform the method of the invention, wherein the processor comprises at least the data acquisition unit and the data processing unit of the invention (a unit is sometimes also referred to as a "module"). The memory may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) and/or a cache memory unit, and may further include a read only memory unit (ROM).
The memory of the present invention may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices.
Such communication may be through an input/output (I/O) interface. Also, the electronic device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through a network adapter. The network adapter communicates with other modules of the electronic device over the bus. It should be appreciated that although not shown herein, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Prediction method
According to the invention, the sequencing reads are filtered and normalized to obtain 2579 genes related to a high risk of recurrence or metastasis of cervical cancer. The invention then constructs a prediction model by dividing the original data into a training set and a test set. In the analysis, six algorithms are first used: K-nearest neighbor (KNN), logistic regression (LR), random forest (RF), naive Bayes (MNB), gradient boosting decision tree (GBDT) and XGBoost. A classification model is trained on the training set with each algorithm; during training, cross-validation is used for parameter tuning of KNN and LR, random search for RF, and grid search for GBDT and XGBoost. In the six resulting models, the genes containing HPV integration serve as features, and each gene is assigned a coefficient. The coefficients are ranked and the higher-ranked genes are pooled across models; the resulting 338 genes with higher importance coefficients in the six models are used as features for second-stage model screening.
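As an illustration of parameter tuning by cross-validation, the sketch below selects the neighbor count `k` for a KNN classifier using k-fold cross-validation in plain Python. It is not the source's implementation: the function names, the stride-based fold assignment, the candidate values and the one-dimensional toy data are all assumptions chosen to keep the example small.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature vector, class label)


def knn_predict(train: List[Row], x: Sequence[float], k: int) -> int:
    """Majority label among the k training rows closest to x (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(data: List[Row], k: int, n_folds: int) -> float:
    """Mean accuracy over n_folds folds; row i is held out in fold i % n_folds."""
    correct = 0
    for fold in range(n_folds):
        held_out = data[fold::n_folds]
        training = [row for i, row in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(training, x, k) == y for x, y in held_out)
    return correct / len(data)


def tune_k(data: List[Row], candidates: List[int], n_folds: int = 4) -> int:
    """Return the candidate k with the best cross-validated accuracy (first one on ties)."""
    return max(candidates, key=lambda k: cv_accuracy(data, k, n_folds))


# Toy data: class 0 clusters near 0, class 1 clusters near 10.
data: List[Row] = [
    ([0.0], 0), ([10.0], 1), ([1.0], 0), ([11.0], 1),
    ([0.5], 0), ([10.5], 1), ([1.5], 0), ([11.5], 1),
]
best_k = tune_k(data, [1, 3, 5])
```

Random search and grid search follow the same pattern; they differ only in how the list of candidate parameter settings is generated.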
The inventors further found that, with the 338 genes as features, using logistic regression, random forest and gradient boosting decision tree models as the final prediction models markedly improves prediction performance, with accuracies of 0.778, 0.944 and 0.833, respectively.
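For reference, the performance measures named in this description (accuracy, sensitivity, specificity, precision) follow from a confusion matrix as sketched below. The function name and the example counts are hypothetical. As an observation rather than something the source states, the three reported accuracies equal 14/18, 17/18 and 15/18 to three decimals, which would be consistent with evaluation on a set of 18 samples.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard measures from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall on the high-risk class
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fp=1, tn=7, fn=2)

# Assumption: 18 evaluated samples; correct counts chosen to match the reported values.
accuracies = [round(correct / 18, 3) for correct in (14, 17, 15)]
```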
In actual clinical prediction, the 338 target genes of the present invention may be used as follows: when an HPV integration site is found in one or more of these genes, the target genes with HPV integration are input into at least one of the prediction models of the present invention, for example one model or several models, and the outputs of the models are then combined to determine whether the patient is at high risk of recurrence or metastasis of cervical cancer.
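The source does not specify how the outputs of several models are combined. One simple possibility, shown here purely as an assumption, is a majority vote over per-model probabilities; the function name, the 0.5 cutoff and the example values are hypothetical.

```python
from typing import Dict


def combine_predictions(probabilities: Dict[str, float], cutoff: float = 0.5) -> str:
    """Majority vote: high risk if more than half of the models score at or above the cutoff."""
    votes = sum(1 for p in probabilities.values() if p >= cutoff)
    return "high risk" if votes * 2 > len(probabilities) else "low risk"


# Made-up per-model probabilities of relapse or metastasis for one patient.
example = {"logistic_regression": 0.62, "random_forest": 0.48, "gbdt": 0.71}
decision = combine_predictions(example)
```

Averaging the probabilities before applying the cutoff would be another reasonable choice; with an even number of models, this voting rule resolves ties as low risk.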
Examples
1. Clinical data collection
Sample data were obtained from 58 patients admitted to Peking Union Medical College Hospital, Chinese Academy of Medical Sciences. According to their clinical condition at the time of admission, patients were divided into an initial-treatment relapse-free group, an initial-treatment progression group and a relapse/metastasis group. The initial treatment group comprises 40 patients who received their first clinical diagnosis of cervical cancer at the time of treatment. The relapse/metastasis group comprises 18 patients who had previously been treated or cured for cervical cancer and had clinically confirmed relapse or metastasis on admission to our hospital. Diagnoses for the initial treatment group are shown in Table 1, and diagnoses on admission for the relapse/metastasis group are shown in Table 2 below. Patients in the initial treatment group were followed up regularly for subsequent treatment and recurrence: those with no recurrence or metastasis found over at least one year were assigned to the initial-treatment relapse-free group, and those with recurrence or metastasis within one year were assigned to the initial-treatment progression group. Blood was collected from all patients, and HPV insertion into the genome was analyzed by nanopore sequencing.
TABLE 1 Clinical information of the initial treatment group
[Table 1 appears as images in the original publication and is not reproduced here.]
TABLE 2 Clinical information of the relapse/metastasis group
[Table 2 appears as images in the original publication and is not reproduced here.]
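The grouping rule described in the clinical data collection can be written as a small function. This is a sketch under assumptions: the function and parameter names are hypothetical, time is measured in months, and the "undetermined" outcome for patients with less than one year of follow-up is an addition not discussed in the source.

```python
from typing import Optional


def assign_group(recurrent_at_admission: bool, followup_months: float,
                 months_to_event: Optional[float] = None) -> str:
    """Assign a patient to one of the three study groups.

    months_to_event is the time from initial treatment to recurrence or
    metastasis, or None if no such event was observed during follow-up.
    """
    if recurrent_at_admission:
        return "relapse/metastasis"
    if months_to_event is not None and months_to_event <= 12:
        return "initial-treatment progression"
    if followup_months >= 12:
        # No recurrence or metastasis found over at least one year of follow-up.
        return "initial-treatment relapse-free"
    return "undetermined"  # assumption: too little follow-up to classify


groups = [
    assign_group(True, 0),
    assign_group(False, 24, months_to_event=8),
    assign_group(False, 18),
    assign_group(False, 6),
]
```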
2. Methods
1. Experimental materials
Liquid-based thin-layer cytology test (TCT) samples from clinical patients.
2. Main sequencing platform and reagents
Sequencing platform: multi-model nanopore sequencer
Main reagent: PCR Barcoding Kit (SQK-PBK004)
3. Probe design and synthesis
A mixed probe set was designed and synthesized against the genomes of the 18 HPV types (HPV 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 26, 53, 66, 73 and 82) explicitly listed in the CFDA guidance "Guiding Principles for Technical Review of HPV Nucleic Acid Detection and Genotyping Reagents".
4. Experimental methods
4.1 Genomic DNA extraction
DNA was extracted from the TCT samples of 59 patients with clinically confirmed high-risk HPV infection, following the protocol of the micro-sample genomic DNA extraction kit from Tiangen Biotech.
4.2 Fragmentation of genomic DNA
The genomic DNA was sheared with a Covaris sonicator to a main band of 1-5 kb.
4.3 Library construction
Libraries were constructed using a barcoded nanopore library construction kit.
4.4 HPV genomic DNA capture
HPV genomic DNA in the test sample was captured with the probes for the 18 high-risk HPV types, following the xGen Lockdown Reagents (IDT) protocol. The hybridization temperature and time were optimized to obtain longer HPV-containing sequence fragments and a higher proportion of viral sequence.
4.5 Sequencing
Sequencing was performed on the nanopore platform, followed by statistical analysis of fusion genes.
3. Results and analysis
The sequencing data were grouped into the initial relapse-free group, the initial progression group and the relapse/metastasis group. Genes with a read support number of less than 5 were filtered out (rm5) and the data were normalized; 2579 relevant genes remained after this filtering. When genes with a read support number of 1 were filtered out instead (rm1), 6986 genes remained.
Compared with the read support numbers before normalization, the rm1 read support numbers after normalization are markedly higher. The minimum, the maximum, the total and the average support numbers all increase, most notably in the relapse/metastasis group samples.
Comparing the read support numbers obtained under the different filtering criteria, the rm5 read support numbers are markedly lower than those of rm1. Taking the read data of specific genes in the raw data into account, the two differ by a factor of about 10.
Under the rm1 filtering condition, the normalized read support numbers of the samples in all groups increase markedly, and the ranking of the samples by total read support number also changes markedly. Under the rm5 filtering condition, the normalized read support numbers of the samples in all groups increase only slightly, without marked change, while the ranking of the samples by total read support number also changes markedly.
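The filtering and normalization steps described above can be illustrated with a minimal sketch. The data layout, the function names and the gene names are hypothetical, and the per-sample rescaling used here is only an assumption, since the text does not state the normalization formula.

```python
from typing import Dict

# Hypothetical layout: counts[sample][gene] = number of reads supporting
# an HPV insertion in that gene for that sample.
Counts = Dict[str, Dict[str, float]]


def filter_by_read_support(counts: Counts, min_reads: int) -> Counts:
    """Drop every (sample, gene) entry supported by fewer than min_reads reads."""
    return {
        sample: {g: n for g, n in genes.items() if n >= min_reads}
        for sample, genes in counts.items()
    }


def genes_remaining(counts: Counts) -> set:
    """Genes that still have read support in at least one sample."""
    return {g for genes in counts.values() for g in genes}


def normalize(counts: Counts, scale: float = 1_000_000) -> Counts:
    """Rescale each sample so that its read support numbers sum to `scale`.

    Assumption: the source does not name the normalization formula.
    """
    out: Counts = {}
    for sample, genes in counts.items():
        total = sum(genes.values())
        out[sample] = (
            {g: n * scale / total for g, n in genes.items()} if total else {}
        )
    return out


# Toy data, not taken from the patent.
raw = {"s1": {"GENE_A": 10, "GENE_B": 1, "GENE_C": 4},
       "s2": {"GENE_A": 6, "GENE_C": 5}}
rm5 = filter_by_read_support(raw, min_reads=5)  # threshold of 5 reads
norm5 = normalize(rm5, scale=100)
```

With a stricter threshold fewer genes survive, which is consistent with the smaller gene count reported for rm5 than for rm1.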
The data of the "initial relapse-free group" and the "initial progression group" in the raw data were used for the training set and the test set: the samples were randomly divided into a training set and a test set by proportion 7. The "relapse/metastasis group" in the raw data was used as a validation set containing only positive samples, to assess the prediction accuracy of the model. The independent variables of the model are the normalized read support numbers of HPV insertion in each gene, and the dependent variable is the specific grouping information.
All factors were modeled as numerical variables. Because the read support number of each inserted gene carries a different weight in the actual prediction model, no categorical variable conversion or correlation analysis was performed. That is, genes with read support were not recoded as 1; modeling and prediction analysis were carried out directly on the actual read support numbers.
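The data partition described above can be sketched as follows. This is a minimal illustration, not the inventors' code: the gene names and labels are made up, and `train_fraction=0.7` is an assumption about what the proportion "7" in the text means. The group sizes (40 initial-treatment patients, 18 relapse/metastasis patients) are those given in the description.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, label)


def to_vector(per_gene: Dict[str, float], gene_order: List[str]) -> List[float]:
    """Keep the numeric read support per gene; absent genes become 0.0.

    Values are deliberately not recoded to 0/1.
    """
    return [float(per_gene.get(g, 0.0)) for g in gene_order]


def split_train_test(samples: List[Sample], train_fraction: float,
                     seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle reproducibly, then cut into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical features and labels (0 = relapse-free, 1 = progression).
genes = ["GENE_A", "GENE_B"]
initial = [(to_vector({"GENE_A": float(i)}, genes), i % 2) for i in range(40)]
train, test = split_train_test(initial, train_fraction=0.7)

# The relapse/metastasis group is held out as a positive-only validation set.
positive_validation = [(to_vector({"GENE_B": 3.5}, genes), 1) for _ in range(18)]
```

Fixing the seed keeps the random split reproducible across repeated modeling rounds.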
Six algorithms were used: K-Nearest Neighbor (KNN), Logistic Regression (LR), Random Forest (RF), Naive Bayes (MNB), Gradient Boosting Decision Tree (GBDT) and XGBoost. Classification models were trained on the training set to find the most suitable one, and the hyperparameters of each model were tuned using the corresponding tuning method.
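As an illustration of hyperparameter tuning for one of the six algorithms, the sketch below implements a small KNN classifier and picks k by leave-one-out accuracy on the training data. The tuning method is an assumption, since the text does not say which method was used for each model, and the toy data are hypothetical. In practice an established machine-learning library would normally be used instead of hand-written code.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_X: List[List[float]], train_y: List[int],
                x: Sequence[float], k: int) -> int:
    """Majority vote among the k training points closest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X: List[List[float]], y: List[int], k: int) -> float:
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


def tune_k(X: List[List[float]], y: List[int],
           candidates: Sequence[int] = (1, 3, 5)) -> int:
    """Return the candidate k with the best leave-one-out accuracy."""
    return max(candidates, key=lambda k: loo_accuracy(X, y, k))


# Toy training data: two well-separated clusters.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = [0, 0, 0, 1, 1, 1]
best_k = tune_k(X, y)
```

On this toy set k=5 fails under leave-one-out because each held-out point is outvoted by the other cluster, which shows why the neighborhood size has to be tuned to the sample size.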
The models were evaluated on the test set; their prediction performance on the test set is shown in Table 3. Each of the six models (KNN, LR, RF, MNB, GBDT and XGBoost) was also evaluated by accuracy on the positive validation set.
Table 3. Model prediction performance
[Table 3 is provided as an image in the original publication.]
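The two kinds of evaluation mentioned above, performance on the test set and accuracy on the positive-only validation set, can be sketched as follows. The function names and the numbers are illustrative only and are not the values from Table 3. Note that on a set containing only positive samples accuracy equals sensitivity, and AUC cannot be computed because it needs both classes.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both positive and negative samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test-set scores (hypothetical).
test_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Toy positive-only validation set: every true label is 1.
validation_accuracy = accuracy([1, 1, 1, 1], [1, 0, 1, 1])
```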
After the preceding rounds of modeling analysis, the genes with higher importance coefficients were selected from each of the algorithm models and pooled, giving the 338 genes shown in Table 4. The insertion data for these 338 genes were then extracted from the raw data and used as the input for the present round of modeling analysis.
Table 4. List of the 338 genes
[Table 4 is provided as images in the original publication.]
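The gene selection step can be sketched as a union of per-model candidate sets, following the description in claim 6: within each model, genes whose importance coefficient exceeds a threshold are kept, and the union over the models forms the final combination. The model names, thresholds, importance values and gene names below are hypothetical; the actual 338 genes are those listed in Table 4.

```python
from typing import Dict, Set

Importances = Dict[str, float]  # gene -> importance coefficient in one model


def select_by_importance(importances: Importances, threshold: float) -> Set[str]:
    """Genes whose importance coefficient exceeds the model's threshold."""
    return {g for g, w in importances.items() if w > threshold}


def union_of_candidates(per_model: Dict[str, Importances],
                        thresholds: Dict[str, float]) -> Set[str]:
    """Union of the candidate gene sets selected from each model."""
    selected: Set[str] = set()
    for model, importances in per_model.items():
        selected |= select_by_importance(importances, thresholds[model])
    return selected


def extract_genes(counts: Dict[str, Dict[str, float]],
                  genes: Set[str]) -> Dict[str, Dict[str, float]]:
    """Restrict each sample to the selected genes; missing genes become 0.0."""
    return {sample: {g: per_gene.get(g, 0.0) for g in sorted(genes)}
            for sample, per_gene in counts.items()}


# Toy importance coefficients for two of the models (hypothetical values).
per_model = {"RF": {"GENE_A": 0.5, "GENE_B": 0.01},
             "LR": {"GENE_B": 0.02, "GENE_C": 0.9}}
thresholds = {"RF": 0.1, "LR": 0.1}
selected = union_of_candidates(per_model, thresholds)
subset = extract_genes({"s1": {"GENE_A": 2.0, "GENE_B": 7.0}}, selected)
```

Taking the union rather than the intersection keeps any gene that at least one model found informative, at the cost of a larger feature set.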
The subsequent steps follow the earlier rounds of analysis: the data of the initial relapse-free group and the initial progression group in the raw data were used as the training set and the test set, divided by proportion 7. The independent variables of the model are the normalized read support numbers of HPV insertion in each gene, and the dependent variable is the specific grouping information.
All factors were modeled as numerical variables, and modeling and prediction analysis were carried out on the actual read support numbers. Classification models were trained on the training set with the different algorithms in order to find the most suitable one, that is, a model that predicts well on both the training set and the test set. The hyperparameters of the models were tuned, and the evaluation parameters, that is, the performance of each classification model on the test set, are shown in Table 5 below. In addition, the six models were further validated on the positive validation set, with the accuracy rates shown in Table 5.
Table 5. Model prediction performance (338 genes)
[Table 5 is provided as an image in the original publication.]
Taken together, the six models performed comparably with the selected 338 genes and with the unselected gene set. The AUC of the GBDT model on the test set even improved, and the accuracy of the LR and GBDT models on the positive validation set improved slightly, so the 338-gene combination can be used in place of the full gene set for subsequent prediction. Considering the performance of all six models, the MNB, XGBoost and LR models perform best at predicting relapse after initial treatment, while RF, GBDT and XGBoost are more effective at predicting the relapse/metastasis group.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Many modifications and variations may be made to the exemplary embodiments of the present description without departing from the scope or spirit of the present invention. The scope of the claims is to be accorded the broadest interpretation so as to encompass all modifications and equivalent structures and functions.

Claims (14)

1. A composition for predicting the risk of recurrence or metastasis of cervical cancer, consisting of a plurality of primers or probes capable of detecting the insertion or integration sites at which HPV-derived gene fragments are inserted or integrated into target genes of a subject, wherein the target genes are obtained by multiple rounds of modeling analysis with KNN, LR, RF, MNB, GBDT and XGBoost models, extracted from those models and pooled, and consist of the 338 genes shown in Table 4.
2. A kit for assessing the risk of recurrence or metastasis of cervical cancer comprising the composition of claim 1.
3. A device for assessing the risk of recurrence or metastasis of cervical cancer, comprising:
a data acquisition unit for acquiring insertion site information in a biological sample of a subject, wherein the insertion site information refers to information on the insertion or integration of HPV-derived gene fragments into target genes of the subject, the target genes being obtained by multiple rounds of modeling analysis with KNN, LR, RF, MNB, GBDT and XGBoost models, extracted from those models and pooled, and consisting of the 338 genes shown in Table 4;
a data processing unit for inputting the data from the data acquisition unit into a prediction model and performing data processing, wherein the prediction model is selected from at least one of logistic regression, random forest and gradient boosting decision tree, or a combination thereof;
an output unit for outputting a result assessing the subject as being at high risk of relapse or metastasis, or at low risk of relapse or metastasis.
4. The apparatus according to claim 3, further comprising a nanopore sequencer in communication with the data acquisition unit, for passing DNA fragments of 1-5 kb in length, from a biological sample collected from a subject with cervical cancer after treatment, through a nanopore on a chip located in the vicinity of an electrode, and detecting the current passing through the nanopore via the electrode.
5. The apparatus of claim 3, wherein the construction of the predictive model comprises:
grouping the processed raw data into an initial relapse-free group, an initial progression group and a relapse/metastasis group;
combining the data of the initial relapse-free group and the initial progression group and randomly dividing them into a training set and a test set, and using the data of the relapse/metastasis group as a validation set to verify the accuracy of model prediction;
wherein the independent variables of the model are the normalized read support numbers of HPV insertion in each gene, the dependent variable is the specific grouping information, and models are built using K-nearest neighbor, logistic regression, random forest, naive Bayes, gradient boosting decision tree and XGBoost, respectively.
6. The apparatus according to claim 5, wherein the independent variables are ranked according to their importance coefficients given by each model, the independent variables whose importance coefficients are greater than a threshold in the corresponding model are selected as candidate independent variable combinations, and the 6 different models have 6 candidate independent variable combinations, and the union of the 6 candidate independent variable combinations is used as the independent variable combination for prediction.
7. The apparatus according to claim 6, further comprising a step of constructing different predictive models using the obtained combination of independent variables for prediction, and a step of determining a final predictive model according to the predictive performance.
8. The apparatus for assessing the risk of cervical cancer recurrence or metastasis according to claim 7, wherein said different predictive models are selected from the group consisting of logistic regression, random forest and gradient boosting decision trees.
9. A device for assessing the risk of a first treatment recurrence of cervical cancer, comprising:
a data acquisition unit for acquiring insertion site information in a biological sample of a subject, wherein the insertion site information refers to information on the insertion or integration of HPV-derived gene fragments into target genes of the subject, the target genes being obtained by multiple rounds of modeling analysis with KNN, LR, RF, MNB, GBDT and XGBoost models, extracted from those models and pooled, and consisting of the 338 genes shown in Table 4;
a data processing unit for inputting the data from the data acquisition unit into a prediction model and performing data processing, wherein the prediction model is selected from an MNB or XGBoost model;
an output unit for outputting a result assessing the subject as being at high risk of relapse or metastasis, or at low risk of relapse or metastasis.
10. The apparatus of claim 9, further comprising a nanopore sequencer in communication with the data acquisition unit, for passing DNA fragments of 1-5 kb in length, from a biological sample collected from a subject with cervical cancer after treatment, through a nanopore on a chip located in the vicinity of an electrode, and detecting the current passing through the nanopore via the electrode.
11. The apparatus for assessing the risk of cervical cancer recurrence or metastasis according to claim 9, wherein construction of said predictive model comprises:
grouping the processed raw data into an initial relapse-free group, an initial progression group and a relapse/metastasis group;
combining the data of the initial relapse-free group and the initial progression group and randomly dividing them into a training set and a test set, and using the data of the relapse/metastasis group as a validation set to verify the accuracy of model prediction;
wherein the independent variables of the model are the normalized read support numbers of HPV insertion in each gene, the dependent variable is the specific grouping information, and models are built using K-nearest neighbor, logistic regression, random forest, naive Bayes, gradient boosting decision tree and XGBoost, respectively.
12. The apparatus according to claim 9, wherein the independent variables are ranked according to their importance coefficients given by each model, the independent variables whose importance coefficients are greater than a threshold in the corresponding model are selected as candidate independent variable combinations, and the 6 different models have 6 candidate independent variable combinations, and the union of the 6 candidate independent variable combinations is used as the independent variable combination for prediction.
13. The apparatus for assessing the risk of recurrence or metastasis of cervical cancer according to claim 9, further comprising the step of further constructing different predictive models using the obtained combination of independent variables for prediction, and the step of determining a final predictive model based on the predictive performance.
14. A computer storage medium having a computer program stored therein which, when executed by a computer, performs the steps of: obtaining insertion site information in a biological sample of a subject, wherein the insertion site information refers to information on the insertion or integration of HPV-derived gene fragments into target genes of the subject, the target genes being obtained by multiple rounds of modeling analysis with KNN, LR, RF, MNB, GBDT and XGBoost models, extracted from those models and pooled, and consisting of the 338 genes shown in Table 4; inputting the obtained information into a prediction model and performing data processing; and outputting a result assessing the subject as being at high risk of relapse or metastasis, or at low risk of relapse or metastasis.
CN202211597867.0A 2022-12-14 2022-12-14 HPV integration gene combination and application thereof in prediction of cervical cancer recurrence and metastasis Active CN115612743B (en)


Publications (2)

Publication Number Publication Date
CN115612743A CN115612743A (en) 2023-01-17
CN115612743B true CN115612743B (en) 2023-03-21

Family

ID=84880288


Country Status (1)

Country Link
CN (1) CN115612743B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant