CN115881296B - Thyroid papillary carcinoma (PTC) risk auxiliary layering system - Google Patents


Info

Publication number
CN115881296B
Authority
CN
China
Prior art keywords
model
risk
ptc
data
layering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310090839.8A
Other languages
Chinese (zh)
Other versions
CN115881296A (en
Inventor
罗定存
郭天南
吴凡
孙耀庭
李远慧
张煜
时晶晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou First Peoples Hospital
Original Assignee
Hangzhou First Peoples Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou First Peoples Hospital filed Critical Hangzhou First Peoples Hospital
Priority to CN202310090839.8A priority Critical patent/CN115881296B/en
Publication of CN115881296A publication Critical patent/CN115881296A/en
Application granted granted Critical
Publication of CN115881296B publication Critical patent/CN115881296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a thyroid papillary carcinoma (PTC) risk auxiliary layering system comprising: a data acquisition module; a data preprocessing module; a model building, training and verifying module; a layering module; and a database platform. The invention screens characteristic proteins and characteristic clinical variables through machine learning and constructs a risk prediction model, thereby realizing the value of proteomics for PTC risk stratification and providing a decision basis for personalized treatment plans for PTC patients.

Description

Thyroid papillary carcinoma (PTC) risk auxiliary layering system
Technical Field
The invention relates to a system and method for auxiliary stratification of tumors into low risk or medium-high risk, in particular to an artificial intelligence-based papillary thyroid carcinoma (PTC) risk auxiliary layering system and method. The invention belongs to the technical field of medical artificial intelligence-assisted stratification.
Background
Thyroid cancer is one of the most common endocrine malignancies worldwide, and its incidence has risen rapidly in recent years. Thyroid cancer includes 4 histological types, of which papillary thyroid cancer (PTC) accounts for 85%. Although PTC generally has a good prognosis, with a 10-year survival rate above 90%, some tumors show highly invasive biological behavior early in the disease course, and 30-80% of patients have lymph node metastasis at initial diagnosis. Capsular invasion is found in about 30-50% of cases at primary surgery, and even after complete resection the postoperative recurrence rate remains 10-25%. Accurately assessing PTC lesions and their risk at an early stage, and tailoring a personalized and standardized surgical plan accordingly, is therefore the basis of precise diagnosis and treatment.
At present, doctors treating tumors rely mostly on their own experience and knowledge, with reference to existing medical guidelines, which leads to problems such as low evaluation accuracy, a heavy workload and the limits of individual experience. Ultrasound, CT and MRI are currently the main imaging means for assessing PTC lesions and cervical lymph nodes, but the existing literature does not show that they assess lesion risk with high efficacy. Studies exploring PTC risk stratification from genomic changes are relatively limited. Genomics and transcriptomics have revealed somatic alterations in different thyroid cancers, such as BRAF V600E, RAS, TP53, TERT and RET/PTC gene fusions. Among these, the pathogenic role of BRAF V600E has received the most attention: related data indicate that 85% of PTC carry the BRAF V600E mutation, but the BRAF V600E mutation as a single index has only weak predictive power for PTC prognosis.
Unlike nucleic acids in genomics, proteins are directly involved in all life processes and determine cell and organ phenotypes. As the most direct products of biological activity, proteins can show stable expression across different life stages and different cellular states. Proteomics has shown a notable ability to predict prognosis in tumors such as lung cancer, oral squamous cell carcinoma, gastric cancer, esophageal cancer and colorectal cancer. Proteomic studies of thyroid cancer have also appeared in large numbers. Sofiadis et al. (SOFIADIS A, DINETS A, ORRE L M, et al. Proteomic study of thyroid tumors reveals frequent up-regulation of the Ca2+-binding protein S100A6 in papillary thyroid carcinoma [J]. Thyroid, 2010, 20(10): 1067-76) found that the Ca2+-binding protein S100A6 is frequently upregulated in PTC and may serve as a potential adjunct marker for distinguishing follicular thyroid tumors from PTC. Quantitative proteomic analysis of sporadic medullary thyroid cancer (ZHAN S, LI J, WANG T, et al. Quantitative Proteomics Analysis of Sporadic Medullary Thyroid Cancer Reveals FN1 as a Potential Novel Candidate Prognostic Biomarker [J]. The Oncologist, 2018, 23(12): 1415-25) also suggested that fibronectin 1 (FN1) could be a novel candidate prognostic biomarker.
Traditional proteomic sample preparation techniques, such as those based on in-solution digestion or two-dimensional gel electrophoresis (2D PAGE), are time consuming and inefficient. Pressure cycling technology (PCT) is an important sample-processing technology for building proteomic databases from formalin-fixed, paraffin-embedded (FFPE) specimens. It applies rapidly alternating hydrostatic pressure between ambient pressure (14.7 psi) and high pressure (up to 45,000 psi), with a rise time of 3 seconds and a fall time of milliseconds. The method is simple, rapid and high-throughput, and it effectively promotes the de-crosslinking, extraction and trypsin digestion of proteins in FFPE sections, thereby increasing the number of proteins identified from FFPE samples.
In addition to PCT-assisted sample processing, accurate quantitative mass spectrometry (MS) analysis is required to identify disease-related changes in protein expression levels. Data-independent acquisition (DIA) is a proteomic technique that fragments all ions within a selected mass-to-charge (m/z) range and acquires their secondary mass spectra. DIA is an alternative to data-dependent acquisition (DDA). Its greatest advantage over DDA is that very low-abundance proteins in complex samples can be measured efficiently and complete data can be obtained, giving deep coverage and accurate quantification of proteins and considerably improving the reliability, accuracy and reproducibility of quantitative analysis. Combining DIA-MS with data processing based on deep neural networks further improves reproducibility, the number of identifications and quantitative accuracy.
With the development of networks and technology, large data sets have become available and artificial intelligence is now applied in many fields. How to use artificial intelligence to mine and process such data for the benefit of human life has become a goal pursued across disciplines. In medicine, artificial intelligence can convert medical records into medical knowledge, which helps in understanding the occurrence and development of diseases; collecting biological information such as gene sequences and DNA sequence data helps in understanding how the expression of each type of biological information differs in disease. Zhu et al. used machine learning to compare exosome metabolome patterns in patients with and without recurrence and demonstrated a marker panel consisting of 3'-UMP, palmitoleic acid, palmitaldehyde and isobutyl decanoate for predicting recurrence of esophageal squamous cell carcinoma. Previous studies have thus shown that combining artificial intelligence with basic disciplines can be applied effectively to problems that are difficult to overcome in clinical treatment.
Patent US6005256A describes in detail a device and method for simultaneous detection of multiple fluorescently labeled markers in a body sample, including devices and methods for identifying cancer cells, but it does not address specific applications in cancer assessment or mention that appropriate marker combinations may yield higher specificity.
Patent CN110129442A discloses the use of a reagent for detecting the level of a gene and its expression products in the preparation of a product for diagnosing thyroid cancer, characterized in that the gene is selected from LRRN4CL or ZNF883. The product comprises a chip, a kit or a nucleic acid membrane strip. The gene chip comprises an oligonucleotide probe against the LRRN4CL or ZNF883 gene for detecting its transcription level, and the protein chip comprises a specific binding agent for the LRRN4CL or ZNF883 protein; the kit comprises a gene detection kit and a protein detection kit, wherein the gene detection kit comprises a reagent or chip for detecting the transcription level of the LRRN4CL or ZNF883 gene, and the protein detection kit comprises a reagent or chip for detecting the expression level of the LRRN4CL or ZNF883 protein.
Patent CN107563383A discloses an auxiliary prediction system for stratifying the risk of benign and malignant lesions, comprising steps such as nodule detection, doctor guidance, semantic annotation generation, sample generation and online learning. Doctor guidance is required: starting from points annotated by the doctor, the system automatically calculates the outline of a lung nodule in the region of interest and the prediction probability of each pixel within the outline, and generates lung nodule annotation samples with accurate semantic outlines for a self-learning system to learn from.
Currently, commonly used thyroid cancer prognosis evaluation systems include the AJCC/UICC TNM staging system, the AMES, AGES and MACIS scoring systems, and the ATA risk stratification guidelines. The AJCC/UICC system is a staging system based on pTNM parameters and age, and is applicable to all tumor types, including thyroid cancer. AMES, AGES and MACIS mainly consider the presence of lymph node or distant metastasis, patient age and tumor extent. In addition, the ATA risk stratification guidelines propose a 3-level stratification system for predicting the risk of thyroid cancer recurrence. These prognostic evaluation systems are all based on postoperative clinicopathological results, and no evaluation system is currently available that can assess PTC risk before surgery.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a thyroid papillary carcinoma (PTC) risk auxiliary layering system that improves the confidence of clinicians and assists clinical management decisions for papillary thyroid carcinoma.
One aspect of the present invention is directed to a Papillary Thyroid Carcinoma (PTC) risk auxiliary stratification system, comprising:
a data acquisition module: used for acquiring existing data, including proteomics data, clinical indicators, blood immune indicators and BRAF V600E mutation labels, verifying and evaluating data integrity and data quality, desensitizing the data, and sending the data to the data preprocessing module;
a data preprocessing module: used for performing missing value imputation and normalization on the acquired existing data;
a model building, training and verifying module: feature screening is performed on the whole-proteome detection data from the data preprocessing module, and a gradient boosting tree model, namely extreme gradient boosting (XGBoost), relates the input features to PTC risk stratification as a supervised learning problem; the discovery set is used for training and parameter tuning and is divided into a training set and a test set; the model is built on the training set and internally validated on the test set, the model with the best AUC on the test set is selected and saved, and it is then validated on two independent validation sets, with model performance evaluated by the area under the receiver operating characteristic curve (AUC); protein features selected by the XGBoost algorithm combined with 10-fold cross-validation lead to a final feature set of 6 characteristic proteins, 5 clinical features, 5 blood immune indicators and the BRAF V600E mutation; a papillary thyroid carcinoma (PTC) risk stratification model is established, a model threshold is determined, and lesions are predictively stratified by risk level according to this value;
a layering module: used for analyzing the test sample, stratifying the tumor as a low-risk lesion or a medium-high-risk lesion, and feeding the result back to the doctor's terminal device and the database platform;
a database platform: connected with the above modules and used for storing and retrieving the model building data, continuously collecting newly added data and providing an online model updating function.
The assessment criteria are based on the 2015 ATA management guidelines: samples without tumor invasion or lymph node metastasis are considered low-risk PTC, while samples with tumor invasion or lymph node metastasis are considered medium- or high-risk PTC.
Preferably, the model threshold is 0.6645.
Preferably, the 6 characteristic proteins are: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5.
Preferably, the 5 clinical features are: capsular invasion, extraglandular invasion, tumor diameter, multifocality and age.
Preferably, the 5 blood immune indicators are: platelet count (PLT), neutrophil count (N), lymphocyte count (L), monocyte count (M), and lymphocyte-to-monocyte ratio (LMR).
Preferably, the sources of the existing data include: case data for hospitals.
Preferably, the supervised learning is a process of adjusting the parameters of the classifier to achieve the required performance using a set of samples of known classes.
Preferably, the sample inclusion criteria are: (1) primary surgery with lymph node dissection; (2) no history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classical PTC; (4) a postoperative pathology report containing complete information for patient risk stratification. The exclusion criteria are: (1) history of neck trauma; (2) concurrent or previous other cancers; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of complete, usable postoperative pathology.
A further aspect of the invention aims to provide a method of using a thyroid papillary carcinoma (PTC) risk auxiliary layering system for non-diagnostic purposes, characterized in that stratification results are obtained using the auxiliary layering system described above.
A final aspect of the invention aims to provide the use of a Papillary Thyroid Cancer (PTC) risk-assisted stratification system as described above for preparing a device for predictive stratification of the risk of papillary thyroid cancer in a patient.
The following is an explanation of some terms involved in the present invention:
abbreviation/symbol illustration
Figure SMS_1
Drawings
FIG. 1 is a PTC sample PCT-DIA workflow diagram;
FIG. 2 is a flow chart of feature selection, construction of a machine learning model, and model verification;
FIG. 3 shows the predictive performance of clinicians assisted by the machine learning model versus clinicians alone in distinguishing low-risk from medium- and high-risk PTC;
FIG. 4 is a schematic diagram of proteins with significant expression differences between the high-risk and low-risk groups (panel A), the high-risk and medium-risk groups (panel B), and the medium-risk and low-risk groups (panel C). Each point represents one protein, and points outside the dashed lines are considered statistically different. Red dots in the upper right quadrant are upregulated proteins, and blue dots in the upper left quadrant are downregulated proteins;
FIG. 5 is a summary of the SHAP analysis showing the mean absolute SHAP values of the 17 most important features of the model. Features are listed from top to bottom in order of importance: the higher a feature appears, the better it separates the samples and the greater its influence on the output;
FIG. 6 shows the expression levels of the 6 proteins selected by machine learning in the medium-high-risk and low-risk groups. The upper and lower edges of each box are the quartiles of the protein expression level, and the middle line represents the median.
Detailed Description
1. Subjects and methods
(I) Study subjects
1. General data
The PTC samples included in the study were divided into a discovery set and independent validation sets. The independent validation sets comprise a retrospective validation set and a prospective validation set. PTC paraffin section specimens and clinical information from the first people's hospital in Hangzhou and the Shandong Yu Ding Hospital in Hangzhou attached to Zhejiang university medical school were included retrospectively, covering June 2013 to November 2020. A total of 283 cases were initially included; 9 were removed because the sections were too small or too few, leaving 274 cases (191 in the training set and 83 in the test set). In addition, 166 PTC paraffin section specimens with clinical information from the same two units, from January 2016 to December 2021, were collected as the retrospective validation set, and 118 needle biopsy samples from the two units, from January 2020 to December 2021, were collected as the prospective validation set.
Sample inclusion criteria: (1) primary surgery with lymph node dissection; (2) no history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classical PTC; (4) a postoperative pathology report containing complete information for patient risk stratification. Exclusion criteria: (1) history of neck trauma; (2) concurrent or previous other cancers; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of complete, usable postoperative pathology.
2. Clinical pathology information
Patient clinical information and tumor characteristics were extracted from the electronic medical record, including: (1) preoperative clinical information: patient sex, age, presence or absence of Hashimoto's thyroiditis, maximum tumor diameter, whether the tumor is multifocal, and whether the tumor invades the capsule or extends beyond the gland; (2) blood immune indicators: platelet count (PLT), neutrophil count (N), lymphocyte count (L) and monocyte count (M), from which the platelet-to-lymphocyte ratio (PLR), neutrophil-to-lymphocyte ratio (NLR), lymphocyte-to-monocyte ratio (LMR) and systemic immune-inflammation index (SII) were calculated; (3) BRAF mutation status.
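The derived blood indicators can be computed directly from the four raw counts. The following is a minimal sketch rather than the authors' code: the class and function names are hypothetical, the units are assumed, and the SII formula (platelets × neutrophils / lymphocytes) is the commonly used definition, which the text itself does not spell out.

```python
from dataclasses import dataclass


@dataclass
class BloodCounts:
    """Raw counts from a routine blood test (units assumed to be 10^9 cells/L)."""
    platelets: float    # PLT in the text
    neutrophils: float  # N in the text
    lymphocytes: float  # L in the text
    monocytes: float    # M in the text


def immune_indices(c: BloodCounts) -> dict:
    """Return PLR, NLR, LMR and SII computed from the raw counts."""
    return {
        "PLR": c.platelets / c.lymphocytes,
        "NLR": c.neutrophils / c.lymphocytes,
        "LMR": c.lymphocytes / c.monocytes,
        # Assumed (commonly used) definition of the systemic immune-inflammation index.
        "SII": c.platelets * c.neutrophils / c.lymphocytes,
    }


# Illustrative values only, not patient data.
example = immune_indices(
    BloodCounts(platelets=200.0, neutrophils=4.0, lymphocytes=2.0, monocytes=0.5)
)
```

Of these derived values, only LMR is among the 5 blood immune indicators retained in the final model.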
The included PTC patients were staged according to the 8th edition of the TNM staging system published by the American Joint Committee on Cancer (AJCC). PTC tumor risk follows the postoperative recurrence risk stratification of the latest (2015) ATA guidelines.
ATA recurrence risk stratification system
Figure SMS_2
(II) preparation of tissue samples
Each FFPE thyroid cancer tissue block was serially cut on a microtome into 4 paraffin sections of 10 μm thickness, which were mounted on glass slides. Each tissue sample was examined and prepared by two experienced pathologists: the lesion region on the 10 μm paraffin section was marked under the microscope with reference to the postoperative hematoxylin-eosin stained diagnostic sections, and tissue cores were then taken. Needle biopsy samples were obtained by fine-needle puncture of thyroid nodules before or during surgery and, after confirmation by a pathologist, were stored in a refrigerator at -80 °C.
(III) batch design
In the discovery set, 24 biological replicates and 27 technical replicates were randomly drawn from the 274 paraffin samples, and all runs were randomly allocated to 21 batches to minimize batch effects between experimental batches. The 166 samples of the retrospective validation set plus 13 technical replicates were divided into 12 batches, and the 118 samples of the prospective validation set plus 8 technical replicates were divided into 9 batches. Each batch included 15 thyroid samples, one mouse liver sample, and pooled high-, medium- and low-risk thyroid samples as quality controls. In the analysis of the discovery set, the histopathological diagnosis of each paraffin section was unambiguous, which allowed a data separation model to be built.
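Random allocation of runs to batches can be sketched as a shuffle followed by round-robin assignment. This is an illustration only: the function name, the sample identifiers and the fixed seed are assumptions, and the procedure actually used for the study is not described in the text.

```python
import random


def allocate_batches(run_ids, n_batches, seed=0):
    """Shuffle the runs, then deal them out round-robin so batch sizes differ by at most 1."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    shuffled = list(run_ids)
    rng.shuffle(shuffled)
    batches = [[] for _ in range(n_batches)]
    for i, run_id in enumerate(shuffled):
        batches[i % n_batches].append(run_id)
    return batches


# Discovery set: 274 samples + 24 biological + 27 technical replicates = 325 runs.
runs = (
    [f"S{i:03d}" for i in range(274)]
    + [f"BR{i:02d}" for i in range(24)]
    + [f"TR{i:02d}" for i in range(27)]
)
batches = allocate_batches(runs, n_batches=21)
```

With 325 runs in 21 batches, this scheme yields batches of 15 or 16 runs, before the quality-control samples are added.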
(IV) dewaxing, rehydration and hydrolysis of FFPE tissue
For each PTC case in the discovery set, a total of 4 FFPE biological replicate tissue cores were prepared. Samples were dewaxed in heptane and then rehydrated at room temperature in 100% ethanol (Sigma), 90% ethanol and 75% ethanol, in that order. They were then washed with 100 mM Tris-HCl (pH 10, Sigma), which also established the alkaline conditions for hydrolysis. Hydrolysis was carried out at 95 °C and 600 rpm for 30 min, after which the sample was rapidly cooled to 4 °C.
(V) Tissue lysis, protein extraction and protein digestion
To the dewaxed sample were added 6 M urea, 2 M thiourea, 10 mM tris(2-carboxyethyl)phosphine hydrochloride (TCEP) and 40 mM iodoacetamide (IAA), followed by 90 cycles of pressure cycling technology (PCT), each cycle consisting of 25 s at 45,000 psi and 10 s at ambient pressure, at 30 °C. After a brief spin, samples were incubated in the dark for 30 min, and Lys-C was then added at a ratio of 40:1 (protein to Lys-C). PCT-assisted Lys-C digestion used the following settings: 45 cycles, each of 50 s at 20,000 psi and 10 s at ambient pressure, at 30 °C. Finally, trypsin was added at 40:1 (protein to trypsin), and PCT-assisted digestion used the following settings: 90 cycles, each of 50 s at 20,000 psi and 10 s at ambient pressure, at 30 °C. Digestion was stopped by adding 10% TFA to adjust the pH to 2-3, and samples were centrifuged at 12,000 g for 5 min. After reconstitution, the peptide concentration was measured and adjusted to 0.2 μg/μL.
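The three PCT programs can be written as a small configuration table, which also gives the approximate run time of each step. This sketch assumes that one cycle is a high-pressure hold followed by an ambient-pressure hold and ignores pressure rise and fall times; the dictionary keys and the function name are hypothetical.

```python
# PCT settings as stated in the text; all steps run at 30 degrees C.
PCT_STEPS = {
    "lysis":   {"cycles": 90, "psi": 45000, "high_s": 25, "ambient_s": 10},
    "lys_c":   {"cycles": 45, "psi": 20000, "high_s": 50, "ambient_s": 10},
    "trypsin": {"cycles": 90, "psi": 20000, "high_s": 50, "ambient_s": 10},
}


def step_minutes(step):
    """Approximate duration of one PCT step in minutes (holds only)."""
    return step["cycles"] * (step["high_s"] + step["ambient_s"]) / 60


durations = {name: step_minutes(cfg) for name, cfg in PCT_STEPS.items()}
```

Under these assumptions the lysis, Lys-C and trypsin steps take about 52.5, 45 and 90 minutes respectively.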
(VI) Proteomics data acquisition and analysis
Cleaned peptides were separated on a nanoLC-MS/MS system equipped with a 15 cm × 75 μm ID chromatography column, using a 45 min, 3-25% linear gradient (buffer A: 2% acetonitrile, 0.1% formic acid; buffer B: 98% acetonitrile, 0.1% formic acid) at a flow rate of 300 nL/min. The eluent was introduced into a Q Exactive HF mass spectrometer, and data were acquired in DIA mode. The full MS scan covered 390-1010 m/z in the Orbitrap at a resolution of 60,000 (m/z 200), with an AGC target of 3E6 charges and a maximum injection time of 100 ms. Each full MS scan was followed by 24 MS/MS scans, each at a resolution of 30,000 (m/z 200), with an AGC target of 1E6 charges, a normalized collision energy of 27%, the default charge state set to 2, and the maximum injection time set to auto. The cycle of 24 MS/MS scans used isolation windows of 3 widths, with window centers (m/z) at: 410, 430, 450, 470, 490, 510, 530, 550, 570, 590, 610, 630, 650, 670, 690, 710, 730, 770, 790, 820, 860, 910, 970. One full acquisition cycle of MS and MS/MS scans takes approximately 3 seconds and is repeated throughout the LC/MS analysis. The acquired data were searched against the thyroid peptide spectral library using DIA-NN (1.7.15).
(VII) Quality control
(VIII) Construction of a prediction model based on XGBoost
The invention constructs a prediction model based on XGBoost that classifies any given sample, described by its proteome data and clinical features, as either medium-high risk or low risk with the best achievable accuracy. This involves 4 stages: data preprocessing, feature selection, machine learning model construction and model verification.
The 4 stages are detailed below (see FIG. 3A).
Stage 1: data preprocessing
Data from the 2 sources were organized into 3 groups: a discovery set, used to develop the model, a retrospective validation set and a prospective validation set. Of the 274 samples in the discovery cohort, 191 were assigned to the training set for model building and 83 to the test set for optimizing the parameters, so that the chosen parameters give the best AUC on the test set. The trained model was then applied to the retrospective and prospective validation cohorts for external validation, to demonstrate its generalization ability.
The preprocessing comprises two steps: (1) missing value imputation and (2) normalization. Missing values are an inevitable feature of protein intensity data. Since most missing values occur when the protein abundance is below the detection threshold, imputation is done by filling all missing values with 0.8 × Dmin, where Dmin is the minimum of all feature values in the discovery set (Dmin = 13 in this work). After the imputation step, the mean and variance of each feature are estimated from the discovery set, and that feature is normalized for every sample using these discovery-set statistics.
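The two preprocessing steps can be sketched in plain Python. The normalization formula itself is not reproduced in this text, so the sketch assumes a standard z-score, z = (x - mean) / std, with statistics taken from the discovery set; the function names and the use of the population standard deviation are likewise assumptions.

```python
import math


def impute(matrix, d_min, factor=0.8):
    """Replace missing values (None) with factor * Dmin, i.e. 0.8 * 13 = 10.4 in this work."""
    fill = factor * d_min
    return [[fill if v is None else v for v in row] for row in matrix]


def fit_scaler(discovery):
    """Estimate the per-feature mean and standard deviation from the discovery set only."""
    n = len(discovery)
    columns = list(zip(*discovery))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n)
        for col, m in zip(columns, means)
    ]
    return means, stds


def transform(matrix, means, stds):
    """Apply the assumed z-score; constant features map to 0."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Toy discovery set: 2 samples x 2 proteins, one missing value.
discovery = impute([[15.0, None], [17.0, 14.0]], d_min=13)
means, stds = fit_scaler(discovery)
scaled = transform(discovery, means, stds)
```

The same `means` and `stds` would then be reused to transform the validation sets, so that no information from those sets enters the scaling.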
Stage 2: feature selection
Feature selection is required for two reasons: (1) in whole-proteome detection, most detected proteins have little relevance to the problem, and feeding too many proteins into machine learning reduces the generalization ability of the model and causes overfitting, so such proteins are removed from the feature matrix; (2) in clinical practice, the number of proteins should be as small as possible, and the optimal combination should be selected to achieve the most effective discrimination. Feature selection is completed in two steps. The first step is feature screening. From the original protein profile, candidate proteins are those differentially expressed between the high-, medium- and low-risk strata of the data set, together with proteins linked to the thyroid or thyroid cancer in the published literature. Proteins not detected in any of the 274 cases in the data set of the present invention are excluded. Further, a protein is deleted if its missing rate exceeds 45%. If the absolute value of the Pearson correlation between a pair of proteins is less than 0.1, they are deleted. In the second step, combinatorial optimization is performed to select the best combination of 10 proteins from the screened proteins. Although no algorithm can efficiently guarantee a globally optimal solution, machine learning algorithms are used here to find the best protein combination.
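The missing-rate rule from the screening step can be sketched as follows. Missing values are represented as `None`, the function name is hypothetical, and only the 45% rule is shown; proteins never detected have a missing rate of 100% and are therefore removed by the same check.

```python
def filter_by_missing_rate(matrix, names, max_missing=0.45):
    """Keep proteins whose fraction of missing values does not exceed max_missing."""
    n_samples = len(matrix)
    kept = []
    for j, name in enumerate(names):
        n_missing = sum(1 for row in matrix if row[j] is None)
        if n_missing / n_samples <= max_missing:
            kept.append(name)
    return kept


# Toy matrix: 4 samples x 3 proteins (rows are samples). Values are illustrative.
toy = [
    [14.2, None, 15.0],
    [13.8, 16.1, None],
    [14.9, None, 15.3],
    [15.1, 16.4, 15.8],
]
kept = filter_by_missing_rate(toy, ["A", "B", "C"])  # B is missing in 50% of samples
```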
The evolution operations (crossover, mutation and selection operations) are used to generate new protein feature combinations from existing protein feature combinations. In each iteration, the algorithm eliminates the low fitness combinations and generates new combinations based on the remaining high fitness combinations.
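A search of this kind can be sketched as a small genetic algorithm over fixed-size feature combinations. Everything specific here is an assumption made for illustration: the population size, number of generations, mutation rate and the toy fitness function are hypothetical, and in practice the fitness would be a model performance score such as a cross-validated AUC.

```python
import random


def genetic_select(features, k, fitness, generations=30, pop_size=20, seed=0):
    """Evolve k-feature combinations: keep the fitter half, refill by crossover and mutation."""
    rng = random.Random(seed)

    def crossover(a, b):
        # The child draws k features from the union of both parents.
        return frozenset(rng.sample(sorted(a | b), k))

    def mutate(combo, rate=0.3):
        # With probability `rate`, swap one feature for one not yet in the combination.
        out = set(combo)
        unused = [f for f in features if f not in out]
        if unused and rng.random() < rate:
            out.remove(rng.choice(sorted(out)))
            out.add(rng.choice(unused))
        return frozenset(out)

    population = [frozenset(rng.sample(features, k)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # selection: drop low-fitness combinations
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            children.append(mutate(crossover(parent_a, parent_b)))
        population = survivors + children
    return max(population, key=fitness)


# Toy problem: fitness counts how many features of a hidden target set are recovered.
FEATURES = [f"P{i}" for i in range(20)]
TARGET = frozenset({"P1", "P5", "P9"})
best = genetic_select(FEATURES, k=3, fitness=lambda combo: len(combo & TARGET))
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population never decreases from one iteration to the next.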
Stage 3: model training
The invention designs a model based on gradient boosting trees, namely extreme gradient boosting (XGBoost), which relates the input features to PTC risk stratification as a supervised learning problem. Supervised learning is the process of adjusting the parameters of a classifier to achieve a desired performance using a set of samples of known classes. A total of 15 candidate clinical indicators related to preoperative PTC risk stratification were incorporated into the machine learning model of the present invention. The discovery set is used for training and parameter tuning and is divided into a training set and a test set. The model is built on the training set and validated on the test set, the model with the best AUC on the test set is selected and saved, and it is then validated on the two independent validation sets, with model performance evaluated by the area under the receiver operating characteristic curve (AUC). This yields the stratification result.
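The selection rule, keeping the candidate model with the best test-set AUC, can be sketched without any machine learning library. The AUC is computed here through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one); the candidate names and scores are made up for illustration, and label 1 is assumed to mean medium-high risk.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best(candidates, test_labels):
    """Return the name of the candidate whose test-set scores give the highest AUC."""
    return max(candidates, key=lambda name: roc_auc(test_labels, candidates[name]))


# Hypothetical test-set scores from two parameter settings.
test_labels = [0, 0, 1, 1]
candidates = {
    "params_a": [0.10, 0.40, 0.35, 0.80],
    "params_b": [0.20, 0.30, 0.60, 0.90],
}
chosen = select_best(candidates, test_labels)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a test set of 83 samples.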
Stage 4: model verification
The unknown sample is fed into the established model. Using the feature combination selected by the model, the clinical and proteomic information of the unknown sample is entered, and the model returns the likely risk stratification of the patient. The flow chart of feature selection, machine learning model construction and model verification is shown in FIG. 2.
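Applying the model threshold of 0.6645 stated earlier in this document to a model output can be sketched as follows. The direction of the comparison is an assumption: the text does not say whether a score at or above the threshold maps to medium-high risk, nor that the score is a probability, and the function name is hypothetical.

```python
MODEL_THRESHOLD = 0.6645  # preferred model threshold given in the disclosure


def stratify(score, threshold=MODEL_THRESHOLD):
    """Map a model score to a risk stratum (assumes higher scores mean higher risk)."""
    return "medium-high risk" if score >= threshold else "low risk"


labels = [stratify(s) for s in (0.12, 0.6645, 0.91)]
```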
(IX) Statistical analysis
Statistical analysis was performed using R software (version 3.5.1) with its heatmap and plotting functions. The CV is calculated as the ratio of the standard deviation to the mean. P values for the expression of the protein combination features were calculated by one-way analysis of variance. Volcano plots were used to identify differentially expressed proteins, with the following screening conditions: 1) unpaired two-sided Welch's t-test p < 0.05; 2) fold-change > 1.2 or fold-change < -1.2. Biological insights were analyzed using Ingenuity Pathway Analysis (IPA). The correlation between replicate data was evaluated using the Pearson correlation coefficient. Average algorithm performance was evaluated using the AUC. A SHAP summary plot illustrates the importance distribution of the individual variables.
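The differential-protein screen combines a Welch test with a fold-change cutoff. The sketch below is in Python rather than R and is illustrative only: it computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom with the standard library, leaves the p value to a statistics package, and takes the signed fold-change convention literally as written. Function names are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two groups."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def is_differential(p_value, fold_change, p_cut=0.05, fc_cut=1.2):
    """Screening conditions from the text: p < 0.05 and signed fold change beyond +/-1.2."""
    return p_value < p_cut and (fold_change > fc_cut or fold_change < -fc_cut)


t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
flags = [
    is_differential(0.01, 1.5),    # significant and upregulated
    is_differential(0.01, 1.1),    # significant but change too small
    is_differential(0.20, -2.0),   # large change but not significant
    is_differential(0.03, -1.3),   # significant and downregulated
]
```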
2. Results
(I) General clinical characteristics
The study included a total of 558 PTC samples. The average patient age was 45.69 years; 397 patients were female and 161 were male, a female-to-male ratio of 2.47:1. The average tumor diameter was 13.01 mm. Ultrasound indicated capsular invasion in 244 cases (43.73%) and extraglandular invasion in 179 cases (32.08%); the finding corresponding to the remaining figures, 103 cases (18.46%), is not named.
Clinical pathology data for 558 PTC patients are shown in table 1.
TABLE 1 clinical pathology data for 558 PTC patients
Figure SMS_3
(II) construction of a proteomics database
This study constructed a thyroid proteomic database by PCT-MS to support the identification and quantification of proteins in papillary thyroid carcinoma by DIA-MS. After filtering out contaminant and duplicate proteins, the invention identified and quantified 5774, 5025 and 6301 proteins in the three raw data sets, namely the discovery set, the retrospective validation set and the prospective validation set, respectively. Built from the three raw data sets, the thyroid database contains 121960 peptides and 9941 protein groups. The resulting DIA data sets validated this database, which was then applied to proteomic stratification into high, medium and low risk.
(III) quality control
To ensure that the proteomic data subsequently incorporated into the model are stable and reliable, the invention applies strict quality control.
1. Pooled sample analysis: the overall coefficients of variation (CV) of the pooled samples for the discovery-set paraffin-embedded samples, the retrospective paraffin-embedded samples and the prospective needle biopsy samples were 0.019, 0.014 and 0.011, respectively, indicating low variation (see fig. 4A for details).
2. Technical replicate analysis: 27, 13 and 8 samples were randomly selected from the three sets, respectively, and mass spectrometry was repeated once for each. The two runs were compared using the Pearson correlation coefficient; all correlation coefficients were above 0.95 (the closer the absolute value is to 1, the stronger the correlation; see fig. 4B).
3. Biological replicate analysis: one sample was randomly selected from each batch of the discovery set, giving 22 samples in total that were analyzed by mass spectrometry as biological replicates. The correlation coefficients were above 0.9, suggesting strong correlation and no significant batch effect in the discovery set (see fig. 4C for details). This means that variation arising from biology and from the instrument is negligible.
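The two quality-control quantities used in the checks above, the coefficient of variation and the Pearson correlation between replicate runs, can be sketched as follows. This is an illustration under stated assumptions: the function names and the intensity values are invented for the example, and the sample (n - 1) standard deviation is assumed because the text does not say which form was used.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return stdev(values) / mean(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up intensities for one protein across three pooled-sample injections.
pooled = [100.0, 102.0, 98.0]
pooled_cv = coefficient_of_variation(pooled)

# Made-up intensities of four proteins in an original run and its repeat.
run_1 = [1.0, 2.0, 3.0, 4.0]
run_2 = [2.1, 3.9, 6.2, 8.0]
replicate_r = pearson_r(run_1, run_2)
passes_technical_qc = replicate_r > 0.95  # threshold quoted in the text
```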
(IV) Differential proteins between risk groups
After quality control, a complete and stable proteome was obtained, and comparing expression levels among the high-, medium- and low-risk groups revealed differentially expressed proteins. Further analysis of the three risk groups is shown in the volcano plots of fig. 5. Panel A: comparing high-risk with low-risk samples, overall protein expression was similar, with 97 proteins up-regulated and 71 down-regulated. Panel B: comparing high-risk with medium-risk samples, overall protein levels were similar, with 44 proteins up-regulated and 44 down-regulated. Panel C: comparing medium-risk with low-risk samples, overall protein levels were similar, with 49 proteins up-regulated and 26 down-regulated.
(V) Machine learning framework and prediction model
To distinguish PTC of different risk strata by a proteomic approach, the invention establishes an artificial neural network algorithm based on a feature selection process. The study design and workflow are shown in fig. 2. Modeling comprises four parts: data preprocessing, feature selection, deep learning model training, and prediction. As described previously, the machine learning model is built by cross training and validation on the discovery set. Using protein features selected by the XGBoost algorithm combined with 10-fold cross-validation, the invention ultimately determined 6 feature proteins and 11 clinical features as the features for building the model.
Machine learning models were generated that predict the risk level of PTC samples using proteomic data together with the detailed clinical and genetic data described above. Using our training set, Student's t test and fold change (FC) values were calculated for each pair of risk levels and each protein feature to determine the protein features that best distinguish PTC samples of different risk levels. Proteins were selected with P values below 0.05 and |log2(FC)| > 0.25. Proteins with a missing rate of 0.5 or more were then eliminated. Based on these criteria we selected 6 proteins. These protein features were normalized to between 0 and 1, and their missing values were set to 0. We used the same features and performed the same normalization on both test sets. Features not detected in a test data set were set to the 0 vector.
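The screening and normalization steps in the paragraph above can be sketched as below. This is a hedged illustration: the function names, protein names and numbers are hypothetical; the per-protein P value, log2 fold change and missing rate are assumed to have been computed already; and reusing the training-set minimum and maximum on the test sets, with clipping to [0, 1], is our reading of "the same normalization", not something the text states.

```python
def select_proteins(stats, p_max=0.05, min_abs_log2fc=0.25, max_missing=0.5):
    """Keep proteins that pass all three screening criteria.

    stats maps protein name -> (p_value, log2_fold_change, missing_rate).
    """
    return [
        name
        for name, (p, log2fc, missing) in stats.items()
        if p < p_max and abs(log2fc) > min_abs_log2fc and missing < max_missing
    ]


def fit_min_max(train_column):
    """Minimum and maximum of the observed (non-missing) training values."""
    observed = [v for v in train_column if v is not None]
    return min(observed), max(observed)


def apply_min_max(column, lo, hi):
    """Scale to [0, 1] with the given bounds; missing values become 0."""
    span = hi - lo
    scaled = []
    for v in column:
        if v is None or span == 0:
            scaled.append(0.0)
        else:
            # Clipping out-of-range test values is an assumption.
            scaled.append(min(1.0, max(0.0, (v - lo) / span)))
    return scaled


# Hypothetical proteins: (P value, log2 fold change, missing rate).
example_stats = {
    "PROT_A": (0.010, 0.40, 0.10),   # passes all three criteria
    "PROT_B": (0.200, 0.90, 0.00),   # fails the P value criterion
    "PROT_C": (0.030, -0.10, 0.20),  # fails the fold change criterion
    "PROT_D": (0.001, -0.60, 0.70),  # fails the missing rate criterion
}
selected = select_proteins(example_stats)

lo, hi = fit_min_max([2.0, None, 6.0])
scaled_test = apply_min_max([2.0, 4.0, None, 6.0, 8.0], lo, hi)
```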
Before training the model, 274 PTC samples of the discovery set were separated into a training set (n=191) and an internal validation set (n=83). The training set is then used to develop a model, and the internal validation set is used to validate and optimize the performance of the model. We have devised a machine learning architecture that includes feature selection and risk level classification.
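The 191/83 split of the 274 discovery-set samples can be reproduced in outline as follows. The text does not say how the split was drawn, so the seeded random shuffle without stratification used here is purely an assumption, and `split_discovery_set` is a hypothetical helper name.

```python
import random


def split_discovery_set(sample_ids, n_train=191, seed=0):
    """Shuffle the sample IDs reproducibly and cut them into two parts."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # local generator; global state untouched
    return ids[:n_train], ids[n_train:]


# Integer IDs 0..273 stand in for the 274 discovery-set samples.
train_ids, internal_validation_ids = split_discovery_set(range(274))
```

In practice a split stratified by risk level would keep the class proportions similar in both parts; whether the inventors did this is not stated.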
The core of the algorithm is a two-step feature selection cascade that allows selection of protein features alongside the other multi-omics features. First, parameters and features are optimized with a grid search algorithm: for each set of parameters, an XGBoost model is built on the protein matrix and all protein features are ranked by importance. Using this ranking we selected the top 6 protein features, a number set with reference to the average number of features in the clinical (7), immunological (8) and genetic (1) groups. We then combined all 22 features and built another XGBoost model with the same parameters to obtain the feature importances. Finally, the top k features giving the best area under the curve (AUC) on the validation set were selected. This pipeline yields the following algorithm.
Algorithm 1
Input: protein matrix P; other feature matrix Q; grid space G
Best_AUC = 0
For grid in G:
    # Step 1: rank proteins by importance and keep the top 6
    Model1 = XGBoost(P, grid)
    Importance1 = sort(Model1.importance)
    P_selected = P[:, Importance1[:6]]
    # Step 2: rank the combined feature set
    Multi_omics_features = [Q, P_selected]
    Model2 = XGBoost(Multi_omics_features, grid)
    Importance2 = sort(Model2.importance)
    # Try the top 1, 2, ..., all features and score each on the validation set
    For num in range(1, num of Multi_omics_features + 1):
        Final_features = Multi_omics_features[:, Importance2[:num]]
        Model3 = XGBoost(Final_features, tree_num=20)
        Pred_score = Model3(validation_data[:, Final_features])
        Temp_AUC = AUC(label, Pred_score)
        If Temp_AUC > Best_AUC:
            Best_AUC = Temp_AUC
            Best_parameter = [grid, num]
            Best_features = Final_features
return Best_parameter, Best_features
Output: Model_parameter = Best_parameter, Model_features = Best_features
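The control flow of Algorithm 1 can be exercised without any machine learning library by passing the two model-dependent steps in as functions. In the sketch below, `rank_features` stands in for "train XGBoost and sort features by importance" and `validation_auc` stands in for "train Model3 on the chosen features and score it on the validation set"; both are hypothetical callables, the toy importances and AUC formula are invented, and nothing here calls the real xgboost package. The sketch also records the running best AUC so that later candidates must beat it.

```python
def cascade_select(protein_names, other_names, grid_space,
                   rank_features, validation_auc, n_protein=6):
    """Two-step feature selection cascade over a parameter grid."""
    best_auc, best_params, best_features = 0.0, None, []
    for params in grid_space:
        # Step 1: keep the top-ranked proteins for this parameter set.
        top_proteins = rank_features(list(protein_names), params)[:n_protein]
        # Step 2: rank the combined feature set with the same parameters.
        ranked = rank_features(list(other_names) + top_proteins, params)
        # Step 3: evaluate the top 1, 2, ..., all features.
        for k in range(1, len(ranked) + 1):
            candidate = ranked[:k]
            score = validation_auc(candidate, params)
            if score > best_auc:
                best_auc = score
                best_params = (params, k)
                best_features = candidate
    return best_params, best_features, best_auc


# Toy stand-ins (assumptions): fixed importances and a synthetic AUC that
# rewards two informative features and penalizes feature count.
toy_importance = {"p1": 0.9, "p2": 0.1, "p3": 0.5, "age": 0.7, "size": 0.3}
informative = {"p1", "age"}


def toy_rank(names, params):
    return sorted(names, key=lambda n: -toy_importance[n])


def toy_auc(features, params):
    hits = len(informative & set(features))
    return 0.5 + 0.2 * hits - 0.01 * len(features) + params["bonus"]


toy_grid = [{"bonus": 0.0}, {"bonus": 0.05}]
params_out, features_out, auc_out = cascade_select(
    ["p1", "p2", "p3"], ["age", "size"], toy_grid, toy_rank, toy_auc, n_protein=2
)
```

Keeping the model behind two callables is a design choice for testability; with a real gradient boosting library the callables would wrap model fitting, importance extraction and validation scoring.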
We generated four models using an increasing number of feature dimensions. Model A uses a single dimension of features, the preoperative clinical data (n=7). Model B adds a second dimension, the immunological indexes (n=8), to Model A. Model C adds the genetic feature to Model B. Model D adds the proteomic features (n=6) to Model C, giving the final model based on four feature dimensions. We then compared the predictive performance of the 4 models.
Clinically, it is most useful to distinguish low-risk cases from medium/high-risk cases effectively, so the present invention uses this distinction as the final endpoint for further analysis. The predictive performance of the different feature combinations was compared (see Table 2 for details).
1. In the initial model construction, model A, built solely from clinical data (capsular invasion, extraglandular invasion, tumor diameter, multifocality and age), did not perform well on the validation sets: its AUCs in the discovery, retrospective validation and prospective validation sets were 0.86, 0.68 and 0.71, respectively.
2. Next, the immune indexes were added to give model B (clinical + immune). The AUC improved in the discovery set (0.94) but not in the validation sets (0.65 retrospective, 0.71 prospective).
3. In model C, built by combining clinical, immune and BRAF features, the AUCs were 0.93, 0.68 and 0.73, a slight gain over model B in the two validation sets.
4. Finally, the invention added the protein indexes and took model D (clinical + immune + BRAF + protein) as the prediction model; its AUCs in the three sets reached 0.92, 0.79 and 0.80, respectively.
Table 2. AUC of each combined model in the discovery set, retrospective validation set and prospective validation set
Features included                             Discovery set   Retrospective validation set   Prospective validation set
Model A: clinical                             0.86            0.68                           0.71
Model B: clinical + immune                    0.94            0.65                           0.71
Model C: clinical + immune + BRAF             0.93            0.68                           0.73
Model D: clinical + immune + BRAF + protein   0.92            0.79                           0.80
(VI) Final feature selection and auxiliary risk prediction
The feature indexes and results output by machine learning and model screening are as follows:
Based on the four feature dimensions of clinical + immune + BRAF + protein, the XGBoost model was constructed and validated, and feature selection output the following indexes: clinical factors (capsular invasion, extraglandular invasion, tumor diameter, multifocality and age), blood immune indicators (preoperative monocyte count (M), preoperative neutrophil count (N), preoperative lymphocyte-to-monocyte ratio (LMR), preoperative platelet count (Plt) and preoperative lymphocyte count (L)), proteomics (DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5) and genomics (BRAF gene mutation).
To show the impact of each feature in the model, the invention constructs a visual interpretation of the model using the Shapley additive explanations (SHAP) algorithm. For each predicted sample the model generates a predicted value, and the SHAP value is the contribution assigned to each feature in that sample; fig. 6 ranks the features by importance. The greatest advantage of SHAP values is that they reflect the influence of the features in each individual sample and also show whether that influence is positive or negative.
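To make the idea of a per-sample, signed feature contribution concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating every feature ordering. It is an illustration only: the toy model and its coefficients are invented, "absent" features are represented by a baseline value (an assumption), and practical SHAP implementations for tree models use much faster algorithms than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(predict, sample, baseline):
    """Exact Shapley values of each feature for one sample.

    predict takes a dict of feature values; features not yet "added" in an
    ordering keep their baseline value.
    """
    names = list(sample)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = sample[name]
            updated = predict(current)
            totals[name] += updated - previous  # marginal contribution
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(x):
    # Invented linear score: feature "a" raises risk, feature "b" lowers it.
    return 0.2 + 0.5 * x["a"] - 0.3 * x["b"]


phi = shapley_values(toy_score, sample={"a": 1.0, "b": 1.0},
                     baseline={"a": 0.0, "b": 0.0})
```

The contributions sum to the difference between the sample's score and the baseline score, which is the additivity property that lets a summary plot show both the size and the sign of each feature's effect.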
The XGBoost model described above outputs a score for each patient tested: if the score is less than 0.6645, the machine learning model classifies the case as low-risk PTC; if the score is greater than 0.6645, it classifies the case as medium/high-risk PTC. In parallel, 9 clinicians of different seniority each performed a preliminary independent analysis. The clinicians' results were compared with the results of the XGBoost model to assist the physician in reaching the final analysis result.
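The cut-off rule above amounts to a single comparison. In the sketch below the function name and the returned strings are hypothetical; the text does not say how a score exactly equal to 0.6645 is handled, so that case is returned as "unspecified" rather than guessed.

```python
RISK_THRESHOLD = 0.6645  # cut-off quoted in the text


def stratify(score, threshold=RISK_THRESHOLD):
    """Map a model score to a risk stratum using the quoted cut-off."""
    if score < threshold:
        return "low risk"
    if score > threshold:
        return "medium/high risk"
    return "unspecified"  # equality is not covered by the source text


example_strata = [stratify(s) for s in (0.31, 0.6645, 0.92)]
```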
To evaluate the machine learning model, the present invention compares preoperative risk stratification by the XGBoost model built from the four feature dimensions with that of clinicians of different seniority levels (fig. 4A-B). Clinicians' predictive performance improved gradually with seniority. In the retrospective test set, the accuracy of XGBoost (0.737 [95% CI 0.668-0.802]) was significantly higher than that of senior clinicians alone (0.598 [95% CI 0.584-0.606], P < 0.0001). In the prospective test set, the accuracy of XGBoost (0.736 [95% CI 0.717-0.740]) was also significantly higher than that of senior clinicians alone (0.579 [95% CI 0.573-0.593], P < 0.0001). These results indicate that the XGBoost model predicts better than clinicians alone.
The present invention further examines the predictive ability of clinicians assisted by machine learning. The results show that, with the aid of the machine learning model, clinicians' predictive performance increases gradually with seniority. In the retrospective test set, the accuracy of the XGBoost model was 0.737 (95% CI 0.668-0.802), whereas the prediction accuracy of senior clinicians assisted by machine learning rose to 0.844 (95% CI 0.829-0.844) (P < 0.0001; fig. 4A). In the prospective test set, the accuracy of the XGBoost model was 0.736 (95% CI 0.717-0.740), and the prediction accuracy of senior clinicians assisted by machine learning rose to 0.835 (95% CI 0.822-0.842) (P < 0.0001; fig. 4B). These results indicate that clinicians assisted by machine learning predict better than the machine learning model alone.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (7)

1. A papillary thyroid carcinoma (PTC) risk auxiliary stratification system, comprising:
a data acquisition module: used for collecting existing data, including proteomics, clinical indexes, blood immune indexes and BRAF V600E mutation labels; after data integrity and data quality have been verified and evaluated, the data are de-identified and sent to the data preprocessing module;
a data preprocessing module: performing missing value imputation and normalization on the collected existing data;
a model building, training and verification module: performing feature screening on the whole-proteome detection data obtained from the data preprocessing module, and relating the input features to PTC risk stratification using a gradient boosted tree model, namely extreme gradient boosting (XGBoost), to solve a supervised learning problem; the discovery set is used for training and parameter tuning, and is divided into a training subset and a test subset; the model is built on the training subset and internally verified on the test subset, the model with the best AUC on the test subset is selected and saved, verification is then carried out on two independent validation sets, and the performance of the model is evaluated by the area under the receiver operating characteristic curve (AUC); based on protein features selected by the XGBoost algorithm combined with 10-fold cross-validation, 6 feature proteins, 5 clinical features, 5 blood immune indexes and the BRAF V600E mutation feature are determined as the features for building the model; a papillary thyroid carcinoma (PTC) risk stratification model is established, a model threshold is determined, and lesions of each risk level are predictively stratified according to that value; the 6 feature proteins are: DPP7, PDLIM3, COL12A1, CTSL, TUBB2A and ITGB5;
a stratification module: analyzing the test sample, stratifying the tumor as a low-risk lesion or a medium/high-risk lesion, and feeding the result back to the doctor's terminal device and the database platform;
a database platform: connected to each of the above modules and used for storing and retrieving the model building data, continuously collecting newly added data, and providing an online model update function.
2. The auxiliary stratification system according to claim 1, wherein the 5 clinical features are: capsular invasion, extraglandular invasion, tumor diameter, multifocality and age.
3. The auxiliary stratification system according to claim 1, wherein the 5 blood immune indexes are: platelet count (Plt), neutrophil count (N), lymphocyte count (L), monocyte count (M), and lymphocyte-to-monocyte ratio (LMR).
4. The auxiliary stratification system according to claim 1, wherein the source of the existing data comprises: hospital case data.
5. The auxiliary stratification system according to claim 1, wherein the supervised learning is a process of adjusting the parameters of a classifier to achieve a desired performance using a set of samples of known classes.
6. The auxiliary stratification system according to claim 1, wherein the sample inclusion criteria are: (1) primary surgery with lymph node dissection; (2) no previous history of chemotherapy or radiotherapy; (3) postoperative pathological diagnosis of classical PTC; (4) postoperative pathological diagnosis containing complete information on patient risk stratification; and the exclusion criteria are: (1) history of neck trauma; (2) concurrent or previous other cancers; (3) postoperative pathological diagnosis of another PTC subtype or another pathological type; (4) lack of complete postoperative pathology data.
7. A non-diagnostic method of using a papillary thyroid carcinoma (PTC) risk auxiliary stratification system, characterized in that stratification results are obtained using the auxiliary stratification system according to any one of claims 1-6.
CN202310090839.8A 2023-02-09 2023-02-09 Thyroid papillary carcinoma (PTC) risk auxiliary layering system Active CN115881296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310090839.8A CN115881296B (en) 2023-02-09 2023-02-09 Thyroid papillary carcinoma (PTC) risk auxiliary layering system

Publications (2)

Publication Number Publication Date
CN115881296A CN115881296A (en) 2023-03-31
CN115881296B true CN115881296B (en) 2023-05-26

Family

ID=85760938

Country Status (1)

Country Link
CN (1) CN115881296B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant