CN115128285B - Kit and system for identifying and evaluating thyroid follicular tumor by protein combination - Google Patents

Publication number
CN115128285B
Authority
CN
China
Prior art keywords
protein
follicular
proteins
model
combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211046085.8A
Other languages
Chinese (zh)
Other versions
CN115128285A (en)
Inventor
郭天南
孙耀庭
王赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Westlake University
Original Assignee
Westlake University

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848 Methods of protein analysis involving mass spectrometry
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407 Specifically defined cancers
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present invention relates to a kit comprising a combination of proteins, to the use of the protein combination in preparing a kit for the differential evaluation of thyroid follicular tumors, and to a system for the differential evaluation of thyroid follicular tumors comprising a substance for detecting the relative expression levels of the protein combination, a data processing device, and an output device. From TMT-labeled proteomic data of adult thyroid follicular adenoma and follicular carcinoma samples, a high-confidence candidate pool of 123 proteins was identified, and a 25-protein combination was selected from it using an extreme gradient boosting (XGBoost) model. Using the quantitative values of the protein combination as input, the extreme gradient boosting model can distinguish benign from malignant thyroid follicular tumors with an AUC above 0.9 and an accuracy above 85%, thereby assisting clinicians in making clinical decisions.

Description

Kit and system for identifying and evaluating thyroid follicular tumor by protein combination
Technical Field
The invention relates to the field of medical diagnosis, and in particular provides an auxiliary means for the differential evaluation of thyroid follicular tumors based on proteins and machine learning.
Background
The incidence of thyroid nodules and thyroid cancer has continued to rise over the last two decades. Although ultrasound examination and ultrasound-guided fine-needle aspiration help to distinguish benign from malignant nodules, about 10-30% of thyroid nodules still cannot be classified by cytopathology and require diagnostic surgery. The surgical specimen is then carefully examined by a pathologist, and a definitive diagnosis is made from the histopathological findings. Such patients often undergo unnecessary surgery, because many of these nodules are benign but cannot be identified as such before the operation. The most ambiguous diagnoses occur in follicular tumors, which account for approximately 30-50% of preoperatively indeterminate nodules.
Follicular tumors are tumors derived from follicular cells and consist of microfollicular structures. Follicular adenoma, a common benign tumor of the thyroid gland, is a firm or rubbery, uniformly round or oval tumor surrounded by a thin fibrous capsule. The incidence of thyroid follicular tumors is 3-4.3% in autopsy series. Follicular carcinoma, by contrast, is more cellular and has a thick, irregular capsule, often with areas of necrosis and more frequent mitoses. Follicular carcinoma differs from follicular adenoma by invasion through the full thickness of the capsule, vascular invasion, extrathyroidal extension, lymph node metastasis, or distant metastasis. Vascular invasion is the most reliable sign of malignancy. Distant metastasis occurs in 10-15% of all follicular carcinoma patients and recurrence in 11-39%. Patients with invasive follicular carcinoma have a 10-year disease-specific mortality of 15-28%. The ratio of follicular adenoma to follicular carcinoma in surgical specimens is about 5:1.
Follicular adenoma, the benign follicular tumor, cannot be distinguished before surgery from follicular carcinoma, its malignant counterpart, because capsular invasion cannot be assessed from cytological, ultrasonographic, or clinical features. The only way to distinguish the two is to perform diagnostic surgery and then determine whether the tumor is benign or malignant. Even so, follicular carcinoma and follicular adenoma are often difficult to tell apart on postoperative paraffin sections: the microscopic features of follicular carcinoma closely resemble those of follicular adenoma, so the diagnosis rests on examining serial sections under the microscope for tumor invasion. Sometimes even serial sections are inconclusive, and only an equivocal diagnosis can be given. Two further problems cannot be ignored. First, the pathologist may not be able to see the entire capsule. Second, some follicular adenomas may progress to follicular carcinoma, but at the early stage of the disease at which surgery is performed the tumor has not yet broken through the capsule. A definition based on capsular invasion alone is therefore neither reliable nor accurate, and other means are needed to assist in the differential diagnosis.
Several nucleic acid assays based on next-generation sequencing have been developed and have had some success in diagnosing indeterminate thyroid nodules. However, there has been no report of distinguishing follicular adenoma from follicular carcinoma on the basis of genomic or transcriptomic features. RAS mutations and PAX8/PPARγ rearrangements are the most common alterations in follicular tumors, but these alterations are detectable in both benign and malignant follicular tumors, so the two cannot be distinguished by genetic testing.
Proteins lie at the most downstream end of the central dogma of molecular biology and are the direct executors and embodiment of life activities. In the clinical diagnosis of disease they play important roles, for example as biomarkers and drug targets. Proteomics is the discipline of quantitatively analyzing the proteins detected in biological samples. Multi-omics studies that integrate proteomics provide validation and interpretation closer to the phenotype than the genome alone, and provide more accurate and reliable information for early cancer detection, benign-versus-malignant diagnosis, subtyping, personalized medication, monitoring of therapeutic response, and prognosis, making precision medicine more precise.
In general, proteome data can be acquired by two approaches: conventional label-free quantification and label-based quantification. Label-based quantitative proteomics can analyze 6-16 samples simultaneously in a single run, so its throughput is higher than that of label-free quantification. It can also quantify protein expression in depth, typically detecting close to or more than ten thousand proteins. The most commonly used labeling method is the Tandem Mass Tag (TMT) method. For samples that are biologically very similar, deeper protein coverage makes it more likely that potential biomarkers will be found.
Disclosure of Invention
In the present application, the inventors measure protein expression in samples by TMT (Tandem Mass Tag) isotope-labeling quantitative tandem mass spectrometry and combine it with machine learning to accurately evaluate and distinguish follicular carcinoma from follicular adenoma at the protein expression level. By analyzing proteomic data from thyroid follicular tumors, the invention identifies a new combination of 25 proteins. Based on these 25 proteins and an extreme gradient boosting model, thyroid follicular tumors can be differentiated at the protein level, which can assist clinicians in decision-making and, to some extent, alleviate the clinical problems of overdiagnosis and inaccurate evaluation of thyroid follicular tumors.
The invention is obtained by the following steps:
1. Data generation method
First, tissue samples of thyroid follicular adenoma and follicular carcinoma are obtained, proteins are extracted from the tissue by pressure cycling technology, and the proteins are enzymatically digested into peptides. The peptides from different samples are then labeled with TMT reagents, and the labeled peptides are fractionated by high-pH liquid chromatography; mass spectrometry data are acquired for each fraction in a 60-minute data-dependent acquisition run. Finally, the raw mass spectrometry files are searched against a database and quantified using Proteome Discoverer software.
2. Data preprocessing method
For the protein matrix generated by the database search software, proteins with a missing rate exceeding 60% are first removed; missing values are then imputed using the robust sequential imputation method in the R package NAguideR; finally, batch correction is applied to the whole dataset using the ComBat algorithm.
3. Preselection of the protein feature combination
First, the preprocessed protein matrix is analyzed to determine the proteins differentially expressed between follicular carcinoma and follicular adenoma, which provides an initial filter of candidate protein features. The candidates are then filtered further by three methods, namely analysis of variance, the Kruskal-Wallis test, and information gain, to determine the preliminary protein combination.
4. Classification model construction and final feature combination determination
First, the hyperparameters of the eXtreme Gradient Boosting (XGBoost) algorithm are optimized by random search with five-fold cross-validation. Then, starting from the preliminary protein combination, a finer feature selection is carried out: the proteins are ranked by importance over multiple model training runs, and the optimal number of proteins and the protein combination are determined from the cross-validation performance of the models. Once the proteins are determined, hyperparameter tuning and model training are performed again. The final extreme gradient boosting model evaluates the degree of malignancy of a thyroid follicular tumor and outputs a score between 0 and 1, where a higher score indicates a higher degree of malignancy; the final model can then be applied to new datasets.
Thus, in one aspect, the present invention provides the use of a combination of proteins in the preparation of a kit for the differential evaluation of thyroid follicular tumors, the combination of proteins consisting of: Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2, O75382_TRIM3, Q92959_SLCO2A1, and Q8N3R9_MPP5, wherein the kit comprises a reagent for detecting the relative expression levels of the protein combination.
In one embodiment, the relative expression levels of the protein combination are detected by mass spectrometry.
In another embodiment, the relative expression levels of the protein combination are detected by the tandem mass tag (TMT) labeling technique.
In yet another embodiment, the evaluation comprises inputting the data obtained by detecting the relative expression levels of the protein combination with the tandem mass tag labeling technique into an extreme gradient boosting model, which outputs a score between 0 and 1; the higher the score, the higher the degree of malignancy, and the cutoff value is 0.5.
In another aspect, the present invention provides a kit comprising, but not limited to, heavy isotope-labeled peptides corresponding to the proteins to be detected, the proteins to be detected being the combination of proteins consisting of: Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2, O75382_TRIM3, Q92959_SLCO2A1, and Q8N3R9_MPP5.
In yet another aspect, the present invention provides a method for constructing a model for the differential evaluation of thyroid follicular tumors, comprising: training a machine learning model using the relative expression levels of a combination of proteins in thyroid follicular adenoma and follicular carcinoma as training samples to obtain the model, wherein the combination of proteins consists of: Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2, O75382_TRIM3, Q92959_SLCO2A1, and Q8N3R9_MPP5.
In one embodiment, the model is constructed with a gradient boosting algorithm.
In another aspect, the invention provides a system for the differential evaluation of thyroid follicular tumors, comprising a substance for detecting the relative expression levels of a combination of proteins, a data processing device, and an output device, wherein the combination of proteins consists of: Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2, O75382_TRIM3, Q92959_SLCO2A1, and Q8N3R9_MPP5.
In one embodiment, the data processing device comprises a differential evaluation module comprising an extreme gradient boosting model.
In yet another embodiment, the relative expression data of the protein combination are input into the extreme gradient boosting model for processing, and the output device outputs a score between 0 and 1; the higher the score, the higher the degree of malignancy of the thyroid follicular tumor, and the cutoff value is 0.5.
The invention provides a new combination of 25 proteins (Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2, O75382_TRIM3, Q92959_SLCO2A1, and Q8N3R9_MPP5, where the characters before the "_" are the UniProt accession ID of the protein and the characters after the "_" are the gene name). Using the proteomic data of this protein combination together with an extreme gradient boosting model, the degree of malignancy of a thyroid follicular tumor can be evaluated with an AUC above 0.9 and an accuracy above 85%, thereby assisting clinicians in diagnosis, treatment, and surgical decisions.
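The naming convention described above (UniProt accession before the underscore, gene name after it) can be illustrated with a short Python sketch. The function name `split_identifier` and the lookup dictionary are illustrative helpers, not part of the patent; the identifiers are the 25 listed in the text.

```python
# The 25 identifiers as listed in the text: "<UniProt accession>_<gene name>".
PANEL = [
    "Q8TF72_SHROOM3", "Q86UX2_ITIH5", "Q8NBF6_AVL9", "Q8N6Y0_USHBP1",
    "Q96RR4_CAMKK2", "Q92828_CORO2A", "Q96K21_ZFYVE19", "Q96FN5_KIF12",
    "Q9H223_EHD4", "Q9HCD6_TANC2", "Q8IYT2_CMTR2", "P14649_MYL6B",
    "Q9UNA1_ARHGAP26", "P02765_AHSG", "Q86YB7_ECHDC2", "Q9UBM7_DHCR7",
    "Q04941_PLP2", "P07202_TPO", "Q687X5_STEAP4", "O60706_ABCC9",
    "O95429_BAG4", "Q9Y487_ATP6V0A2", "O75382_TRIM3", "Q92959_SLCO2A1",
    "Q8N3R9_MPP5",
]


def split_identifier(identifier):
    """Split 'ACCESSION_GENE' into an (accession, gene) pair."""
    accession, gene = identifier.split("_", 1)
    return accession, gene


# Hypothetical lookup table from accession to gene name.
accession_to_gene = dict(split_identifier(p) for p in PANEL)
print(len(accession_to_gene), accession_to_gene["P07202"])
```

Keeping the accession as the key is a design choice of this sketch: accessions are the more stable identifier when a protein matrix is exported from search software.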
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following embodiments are merely illustrative of the technical solutions of the present invention, and should not be used to limit the scope of the present invention.
Unless otherwise specifically indicated or limited, the technical means used in the embodiments of the present application are all conventional technical means well known to those skilled in the art, and the materials and/or devices, apparatuses, instruments, reagents, consumables and the like used in the embodiments of the present application are all commercially available.
1. Data generation method
First, tissue samples of thyroid follicular adenoma and follicular carcinoma are obtained, proteins are extracted from the tissue by pressure cycling technology, and the proteins are enzymatically digested into peptides. The peptides from different samples are then labeled with TMT reagents, and the labeled peptides are fractionated by high-pH liquid chromatography; mass spectrometry data are acquired for each fraction in a 60-minute data-dependent acquisition run. Finally, the raw mass spectrometry files are searched against a database and quantified using Proteome Discoverer software.
2. Data preprocessing method
For the protein matrix generated by the database search, missing values are first assessed: the missing rate of each protein is calculated, a missing-rate threshold is determined, and proteins with a high missing rate are removed so that the overall missing rate of the matrix is below 10%. Missing values are then imputed using the robust sequential imputation method in the R package NAguideR. Finally, batch correction is performed with the ComBat method. Any non-positive value introduced into the protein matrix by imputation or correction is replaced with 0.5 times the minimum positive expression value of the corresponding protein.
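A minimal Python sketch of two of the preprocessing steps described above: missing-rate filtering and replacement of non-positive values. It assumes a toy matrix stored as a dictionary from protein to per-sample values with `None` marking a missing value; the function names and the toy data are hypothetical. Imputation (NAguideR) and batch correction (ComBat) are R-side steps and are not reproduced here.

```python
def missing_rate(values):
    """Fraction of samples in which the protein was not quantified."""
    return sum(v is None for v in values) / len(values)


def filter_by_missing_rate(matrix, max_rate=0.6):
    """Drop proteins whose missing rate exceeds max_rate."""
    return {p: v for p, v in matrix.items() if missing_rate(v) <= max_rate}


def overall_missing_rate(matrix):
    """Missing fraction over all cells of the matrix."""
    total = sum(len(v) for v in matrix.values())
    missing = sum(x is None for v in matrix.values() for x in v)
    return missing / total


def replace_non_positive(values):
    """Replace values <= 0 with 0.5 x the smallest positive value of the same protein."""
    positives = [v for v in values if v is not None and v > 0]
    if not positives:
        return list(values)
    floor = 0.5 * min(positives)
    return [floor if (v is not None and v <= 0) else v for v in values]


# Toy data (hypothetical): three proteins measured in four samples.
toy = {
    "P1": [1.0, None, 2.0, 4.0],
    "P2": [None, None, None, 3.0],  # 75% missing, removed at the 60% threshold
    "P3": [2.0, -0.5, 0.0, 8.0],    # non-positive values as after correction
}
kept = filter_by_missing_rate(toy, max_rate=0.6)
print(sorted(kept), overall_missing_rate(kept))
print(replace_non_positive(toy["P3"]))
```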
3. Protein feature preselection
First, the imputed and corrected protein matrix is analyzed to determine the proteins differentially expressed between follicular carcinoma and follicular adenoma, which filters the candidate protein features. The fold change is the mean protein expression in follicular carcinoma divided by the mean protein expression in follicular adenoma. The criteria for a differential protein are a fold change greater than 1.2 and an adjusted P value from Student's t-test below 0.05. The candidates are then filtered further according to three criteria: 1) the analysis-of-variance P value is greater than or equal to 0.001; 2) the Kruskal-Wallis test P value is greater than or equal to 0.001; 3) the feature has no information gain. A feature protein that meets any one of these criteria is removed, and the remaining feature proteins form the preliminary protein combination. After further screening, this combination can be used to classify follicular carcinoma versus follicular adenoma.
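The differential-protein criterion can be sketched as follows. This is an illustration under stated assumptions: expression values are on a linear (not log) scale, the adjusted P value is supplied by an upstream test that is not implemented here, and the reciprocal of the fold change is also checked so that proteins lower in carcinoma are captured (the examples in this document apply the cutoff to the fold change "or the reciprocal thereof"). The function names are hypothetical.

```python
from statistics import mean


def fold_change(carcinoma_values, adenoma_values):
    """Mean expression in follicular carcinoma divided by mean in follicular adenoma."""
    return mean(carcinoma_values) / mean(adenoma_values)


def is_differential(fc, adjusted_p, fc_cutoff=1.2, p_cutoff=0.05):
    """True if the fold change (or its reciprocal) exceeds fc_cutoff
    and the adjusted P value is below p_cutoff."""
    changed = fc > fc_cutoff or (1.0 / fc) > fc_cutoff
    return changed and adjusted_p < p_cutoff


# Toy values (hypothetical), linear scale.
fc_up = fold_change([3.0, 3.0], [2.0, 2.0])
print(fc_up, is_differential(fc_up, adjusted_p=0.01))
```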
4. Classification model construction and final feature combination determination
First, the hyperparameters of the extreme gradient boosting model are optimized by random search (100 draws from the parameter space) with five-fold cross-validation. Then, to refine the model further, a finer feature selection is carried out on the preliminary protein combination in two steps. Step one: 100 models are trained; in each training run the feature importance is calculated from the Gini coefficient and the top 50 proteins are selected. The results of the 100 runs are then pooled, and proteins selected at least 30 times are retained. The model is trained a further 100 times with these proteins, the feature importances are ranked, and the results of the 100 runs are averaged to give the ranking of the retained proteins. Step two: 100 cross-validations are performed with the top 5, top 10, top 15, … protein features, in increments of 5, and the optimal number of variables is determined from the mean AUC; based on this preliminary result, the procedure is repeated in increments of 1 to determine the final number of proteins and the protein combination. Once the proteins are determined, hyperparameter tuning and model training are performed again. The final extreme gradient boosting model evaluates the degree of malignancy of a thyroid follicular tumor and outputs a score between 0 and 1, where a higher score indicates a higher degree of malignancy; the final model can then be applied to new datasets.
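The random search with five-fold cross-validation described above can be sketched in plain Python. This is only a skeleton: the search space, the parameter names, and `toy_score` are hypothetical stand-ins, and in a real pipeline `score_fold` would train an XGBoost model on the training fold and return its AUC on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(params, n_samples, score_fold, k=5, seed=0):
    """Average score_fold(params, train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fold(params, train_idx, test_idx))
    return sum(scores) / k


def random_search(param_space, n_samples, score_fold, n_iter=100, seed=0):
    """Draw n_iter random parameter sets and keep the best cross-validated one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = cross_validate(params, n_samples, score_fold, seed=seed)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the tuned values of the patent are not reproduced here.
PARAM_SPACE = {"max_depth": [2, 3, 4, 5, 6], "learning_rate": [0.01, 0.05, 0.1, 0.3]}


def toy_score(params, train_idx, test_idx):
    """Stand-in for 'train on train_idx, return AUC on test_idx'."""
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(PARAM_SPACE, n_samples=50, score_fold=toy_score)
print(best_params, best_score)
```

Passing the scorer in as a callable keeps the search loop independent of any particular model library.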
Examples
Example 1-sample inclusion.
The thyroid tissue used in this example was collected between 2010 and 2020 at thirteen clinical hospitals in China and Singapore, with ethical approval from these hospitals and from the present research institution. A total of 645 samples were initially collected, comprising 341 follicular adenomas and 303 follicular carcinomas, plus 1 sample that was excluded from the study because it could not be matched. For each sample, the H&E section corresponding to the paraffin block was re-reviewed by expert pathologists to confirm that tumor tissue made up more than 70% of the area, and one section 5-10 μm thick was taken for proteomic analysis.
Example 2-proteomics data acquisition and pre-processing.
Paraffin sections were dewaxed and rehydrated by washing sequentially in 100% heptane, 100% ethanol, 90% ethanol, and 75% ethanol for 5 minutes each. Tris base solution (pH 10) was added to the dewaxed sample, which was incubated at 95 °C for 30 minutes. Urea, thiourea, a reducing agent, and an alkylating reagent were then added, and the sample was lysed in a pressure cycling system alternating between high and ambient pressure: 50 seconds at 45,000 p.s.i. followed by 10 seconds at ambient pressure, for 90 cycles. After lysis, the proteins were digested with trypsin and LysC, and the resulting peptides were desalted on C18. The clean peptides were then labeled with 16-plex TMTpro reagents. The labeled samples were fractionated by high-pH liquid chromatography over a 60-minute gradient into 30 fractions, and each fraction was analyzed by high-resolution mass spectrometry in data-dependent acquisition mode. The raw data were searched and quantified with two software tools, FragPipe and Proteome Discoverer (PD), which identified 11,533 proteins (FragPipe) and 10,336 proteins (PD) at a false discovery rate below 0.01. To ensure data reliability, only the 10,032 proteins identified by both tools were retained, and the protein matrix output by PD was used in subsequent analyses. Next, 2,236 proteins (22.3%) with a missing rate above 60% were filtered out, bringing the overall missing rate of the protein matrix below 10%. Missing values were then imputed with the robust sequential imputation method in the R package NAguideR, and batch correction was performed with the ComBat method. Non-positive values in the protein matrix were replaced with 0.5 times the minimum positive expression value of the corresponding protein.
Example 3 protein feature preselection.
To find molecular differences between follicular adenoma and follicular carcinoma, the proteomic expression profiles of the two were compared. Using an adjusted P value from Student's t-test below 0.05 and a fold change (or its reciprocal) greater than 1.2 as screening criteria, a total of 178 differential proteins were obtained. These candidates were filtered further by three criteria (analysis-of-variance P value ≥ 0.001, Kruskal-Wallis test P value ≥ 0.001, and no information gain); any protein meeting any one criterion was removed from the pool of 178 differential candidates. In total 55 proteins were filtered out, and the remaining 123 proteins form the preliminary protein combination shown in Table 1. This candidate pool is closely related to thyroid follicular tumors; it was identified for the first time by this method and has not been reported previously.
Table 1: The 123 proteins (UniProt accession IDs) obtained by preselection
O00339 P05546 P29373 Q15742 Q8N6Y0 Q9H223
O14524 P06727 P29762 Q1HG43 Q8NBF6 Q9H788
O14727 P07202 P29966 Q53RD9 Q8NFP9 Q9HCD6
O15037 P07858 P30291 Q5S007 Q8TDX6 Q9NXH8
O15460 P08697 P36551 Q5TF21 Q8TF72 Q9P219
O43148 P11388 P42574 Q5VSL9 Q8WXA9 Q9P258
O60303 P12429 P46013 Q687X5 Q92828 Q9P2K5
O60502 P14649 P47736 Q6NV74 Q92959 Q9UBM7
O60706 P15090 P52926 Q6UX53 Q93099 Q9UIJ5
O75096 P16104 P57729 Q6ZS11 Q96FN5 Q9UKS7
O75382 P16401 P61077 Q6ZS30 Q96GM8 Q9ULC0
O95210 P16402 P61916 Q7Z7B0 Q96K21 Q9ULH0
O95372 P16403 P61925 Q86SF2 Q96RR4 Q9UNA1
O95429 P16671 P81877 Q86U70 Q99470 Q9Y487
P00740 P16949 P85037 Q86UX2 Q9BQB6 Q9Y4H2
P01019 P17096 Q00613 Q86XX4 Q9BRL6 Q9Y4P1
P01266 P17535 Q04941 Q86YB7 Q9BWG4 Q9Y646
P02765 P20962 Q07352 Q8IWS0 Q9BX97 Q9Y6M1
P02766 P22223 Q13454 Q8IYT2 Q9BY12
P02774 P22748 Q14195 Q8N3R9 Q9C0H9
P04275 P25311 Q14376 Q8N6N7 Q9H1E3
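The three-criterion filter that reduces the 178 differential proteins to the 123 in Table 1 can be sketched in Python. The statistics themselves (ANOVA, Kruskal-Wallis, information gain) are assumed to be computed upstream; "no information gain" is read here as a gain of zero or less, and the protein names and values below are hypothetical.

```python
def passes_filters(anova_p, kruskal_p, information_gain, p_cutoff=0.001):
    """Keep a protein only if none of the three removal criteria applies:
    ANOVA P >= cutoff, Kruskal-Wallis P >= cutoff, or no information gain."""
    return anova_p < p_cutoff and kruskal_p < p_cutoff and information_gain > 0


# Hypothetical per-protein statistics: (ANOVA P, Kruskal-Wallis P, information gain).
stats = {
    "protA": (0.0001, 0.0002, 0.05),  # kept
    "protB": (0.0100, 0.0002, 0.05),  # removed: ANOVA P >= 0.001
    "protC": (0.0001, 0.0020, 0.05),  # removed: Kruskal-Wallis P >= 0.001
    "protD": (0.0001, 0.0002, 0.00),  # removed: no information gain
}
kept_proteins = [name for name, s in stats.items() if passes_filters(*s)]
print(kept_proteins)

# Bookkeeping from the text: 178 candidates minus 55 removed leaves 123.
print(178 - 55)
```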
Example 4 final protein combination determination.
After hyperparameter tuning, the number of feature proteins was evaluated more precisely. First, 100 models were trained; in each run the feature importance was calculated and the top 50 proteins were selected. The results of the 100 runs were then pooled, and the 58 proteins selected at least 30 times were retained. The model was then trained a further 100 times with these 58 proteins, the feature importances were ranked, and the results of the 100 runs were averaged to give the ranking of the 58 proteins. Comparing model performance for different numbers of protein features yielded the 25 selected proteins (Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2, O75382_TRIM3, Q92959_SLCO2A1, and Q8N3R9_MPP5), which gave the best cross-validation performance. Of these 25 proteins, 7 have been reported to be associated with thyroid cancer or thyroid function; of these, only ITIH5 has been reported to be associated with thyroid follicular tumors, and the remaining 24 proteins are reported here for the first time as associated with thyroid follicular tumors. The proteins are listed in Table 2, ranked by their power to classify follicular carcinoma versus follicular adenoma.
Table 2: further summary of 25 proteins
[Table 2 is provided as an image in the original publication (930798DEST_PATH_IMAGE001).]
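The selection-frequency step of Example 4 (keep proteins that rank in the top 50 in at least 30 of 100 training runs, then rank the survivors by mean importance) can be sketched as below. The importance values would come from repeated model training, which is not shown; the toy runs, feature names, and small thresholds are hypothetical, and `candidate_sizes` only illustrates the coarse grid of feature counts (top 5, top 10, ...).

```python
from collections import Counter


def stable_features(importance_runs, top_k=50, min_count=30):
    """Features that appear in the top_k of at least min_count runs."""
    counts = Counter()
    for importances in importance_runs:
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:top_k])
    return {f for f, c in counts.items() if c >= min_count}


def mean_rank_order(importance_runs, features):
    """Order features by their mean importance across runs, highest first."""
    n = len(importance_runs)
    means = {f: sum(run[f] for run in importance_runs) / n for f in features}
    return sorted(features, key=means.get, reverse=True)


def candidate_sizes(n_features, step=5):
    """Feature counts to evaluate: step, 2*step, ... up to n_features."""
    return list(range(step, n_features + 1, step))


# Toy example (hypothetical): 3 runs, keep features in the top 2 at least twice.
runs = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.4, "b": 0.1, "c": 0.5},
    {"a": 0.6, "b": 0.3, "c": 0.1},
]
kept_features = stable_features(runs, top_k=2, min_count=2)
print(mean_rank_order(runs, kept_features))
print(candidate_sizes(58))
```

With the 58 retained proteins of the example, the coarse grid runs from 5 to 55; a finer pass in steps of 1 around the best coarse value would then follow.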
Example 5-evaluation model construction and testing.
Based on the above results, the extreme gradient boosting model was combined with the 25 final feature proteins to evaluate follicular adenoma and follicular carcinoma.
To construct the final model, the hyperparameters were re-tuned by five-fold cross-validation; the settings are detailed in Table 3. The model was then trained and tested on the corresponding datasets, with the results shown in Table 4. On the internal validation set the model achieved an AUC of 0.951 (0.944-0.959), an accuracy of 0.872 (0.859-0.892), a sensitivity of 0.875 (0.836-0.889), a specificity of 0.871 (0.866-0.910), a positive predictive value (PPV) of 0.856 (0.849-0.894), and a negative predictive value (NPV) of 0.887 (0.860-0.899). On the independent test set it achieved an AUC of 0.904 (0.852-0.956), an accuracy of 0.859 (0.789-0.908), a sensitivity of 0.877 (0.772-0.938), a specificity of 0.843 (0.738-0.911), a PPV of 0.838 (0.731-0.908), and an NPV of 0.881 (0.778-0.940). Sensitivity was higher than specificity, suggesting that the model is stronger at screening for follicular carcinoma; NPV was higher than PPV, suggesting that the model has strong rule-out capability. The reliability of the predictions on the independent test set fluctuated within about ±5%.
Table 3: Parameter settings of the extreme gradient boosting model after tuning
[Table 3 appears as an image in the original publication.]
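The tuning procedure (random search scored by five-fold cross-validation) can be illustrated with the standard-library sketch below. It is an assumption-laden outline rather than the actual tuning code: `fit_and_score` is a hypothetical callback that would train the gradient boosting model on the training folds and score it on the held-out fold, and the parameter names and candidate values in `space` are placeholders, not the tuned settings of Table 3.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def random_search_cv(param_space, fit_and_score, n_samples, n_iter=20, k=5,
                     seed=0):
    """Sample n_iter parameter sets; keep the one with the best mean CV score.

    fit_and_score(params, train_idx, valid_idx) -> float is a hypothetical
    callback: train on train_idx, return a score (e.g. AUC) on valid_idx.
    """
    rng = random.Random(seed)
    folds = k_fold_indices(n_samples, k, seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one random candidate from the search space.
        params = {name: rng.choice(values)
                  for name, values in param_space.items()}
        scores = []
        for i, valid_idx in enumerate(folds):
            # Every fold except fold i forms the training set.
            train_idx = [j for m, fold in enumerate(folds) if m != i
                         for j in fold]
            scores.append(fit_and_score(params, train_idx, valid_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, valid_idx):
    # Synthetic score that peaks at max_depth=4, learning_rate=0.1.
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


# Placeholder search space; not the values reported in Table 3.
space = {"max_depth": [2, 3, 4, 5, 6], "learning_rate": [0.01, 0.05, 0.1, 0.3]}
best_params, best_score = random_search_cv(space, toy_score, n_samples=103)
print(best_params, best_score)
```

Random search evaluates only a sample of the grid, so the returned setting is the best of the candidates tried, not necessarily the global optimum.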
Table 4: Model prediction performance
[Table 4 appears as an image in the original publication.]
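As a cross-check of how the reported metrics relate to one another, the sketch below computes them from confusion-matrix counts. The counts are an inference, not data from the patent: they are one split of the 135 independent test samples (65 follicular carcinomas, 70 follicular adenomas) that happens to reproduce the reported point estimates when rounded to three decimals.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    Positive class = follicular carcinoma, negative class = follicular adenoma.
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # detected carcinomas / all carcinomas
        "specificity": tn / (tn + fp),   # excluded adenomas / all adenomas
        "ppv": tp / (tp + fp),           # true carcinomas among positive calls
        "npv": tn / (tn + fn),           # true adenomas among negative calls
    }


# Inferred counts (an assumption; the source does not state them).
metrics = diagnostic_metrics(tp=57, fn=8, tn=59, fp=11)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the function returns an accuracy of 0.859, a sensitivity of 0.877, a specificity of 0.843, a PPV of 0.838, and an NPV of 0.881, matching the independent test set values quoted in the text.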
Example 6: Evaluation of thyroid follicular tumor malignancy in thyroid tissue to be tested from a subject.
The subject's thyroid tissue sample is prepared using a pressure cycling system and quantified by TMT (tandem mass tag) labeling, and the protein quantification data are acquired by high-performance liquid chromatography coupled with mass spectrometry. The mass spectrometry data are input into the final extreme gradient boosting model of this application, which outputs a score between 0 and 1; the higher the score, the higher the degree of malignancy. After the model was constructed, we tested it on an independent test cohort of 135 samples. Follicular carcinoma samples had a median score of 0.82, a mean score of 0.75, a first-quartile score of 0.63, and a third-quartile score of 0.95; follicular adenoma samples had a median score of 0.27, a mean score of 0.30, a first-quartile score of 0.11, and a third-quartile score of 0.43. These data reflect the accuracy of the model score. In practical applications, the expression of the 25 proteins in a sample is measured, the quantitative results are used as model inputs, the model outputs a malignancy score under the above parameters, and a sample with a score greater than 0.5 is judged to be follicular carcinoma.
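The decision rule in this example can be written as a short wrapper. This is only a sketch under stated assumptions: `model` is a hypothetical callable standing in for the trained extreme gradient boosting model, `required_proteins` would be the 25-protein panel (only three identifiers are used here for brevity), the quantification values are made up, and labelling scores of 0.5 or below as follicular adenoma is an assumption, since the text only defines the carcinoma call.

```python
def assess_sample(protein_levels, model, required_proteins, threshold=0.5):
    """Score one sample and apply the 'score > 0.5' rule from the text.

    protein_levels: {protein_id: quantified expression value}
    model: hypothetical callable mapping a feature list to a score in [0, 1]
    """
    missing = [p for p in required_proteins if p not in protein_levels]
    if missing:
        raise ValueError(f"missing protein measurements: {missing}")

    # Keep the feature order fixed so it matches the order used in training.
    features = [protein_levels[p] for p in required_proteins]
    score = model(features)
    if not 0.0 <= score <= 1.0:
        raise ValueError("model score must lie between 0 and 1")

    # Assumption: anything not called carcinoma is reported as adenoma.
    label = "follicular carcinoma" if score > threshold else "follicular adenoma"
    return score, label


# Illustrative subset of the panel and made-up quantification values.
panel = ["Q8TF72_SHROOM3", "Q86UX2_ITIH5", "P07202_TPO"]
sample = {"Q8TF72_SHROOM3": 1.8, "Q86UX2_ITIH5": 0.4, "P07202_TPO": 0.9}

result = assess_sample(sample, lambda feats: 0.82, panel)
print(result)
```

Because the rule is a strict inequality, a score of exactly 0.5 is not called follicular carcinoma in this sketch.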
While the present invention has been described in detail hereinabove with respect to specific embodiments thereof, it will be apparent to those skilled in the art that modifications and improvements can be made based on the disclosure. Therefore, it is intended that all such modifications and improvements be included within the scope of the invention without departing from the spirit thereof.

Claims (1)

1. A method for constructing a model for identifying and evaluating thyroid follicular tumors, comprising the following steps:
a) Data generation, comprising obtaining the relative expression levels of proteins in thyroid tissue from thyroid follicular adenoma and follicular carcinoma,
b) Data preprocessing, comprising imputing missing values using the robust sequential imputation method in the R software package NAguideR, and performing batch correction on the data as a whole using the ComBat algorithm,
c) Preselection of the protein feature combination, comprising determining the differential proteins between follicular carcinoma and follicular adenoma by analyzing the preprocessed protein matrix, and further filtering the data by three methods: analysis of variance, the Kruskal-Wallis test, and the information gain method,
d) Model construction, comprising optimizing a gradient boosting model algorithm through random search and five-fold cross-validation, calculating feature importance according to the Gini coefficient, training the model in combination with a forward feature selection method, ranking the importance of the proteins through repeated model training, and obtaining, according to the cross-validation performance of the model, a protein combination for identifying and evaluating thyroid follicular tumors, wherein the protein combination comprises: Q8TF72_SHROOM3, Q86UX2_ITIH5, Q8NBF6_AVL9, Q8N6Y0_USHBP1, Q96RR4_CAMKK2, Q92828_CORO2A, Q96K21_ZFYVE19, Q96FN5_KIF12, Q9H223_EHD4, Q9HCD6_TANC2, Q8IYT2_CMTR2, P14649_MYL6B, Q9UNA1_ARHGAP26, P02765_AHSG, Q86YB7_ECHDC2, Q9UBM7_DHCR7, Q04941_PLP2, P07202_TPO, Q687X5_STEAP4, O60706_ABCC9, O95429_BAG4, Q9Y487_ATP6V0A2 [three further identifiers are illegible in the source text].
CN202211046085.8A 2022-08-30 2022-08-30 Kit and system for identifying and evaluating thyroid follicular tumor by protein combination Active CN115128285B (en)


Publications (2)

Publication Number Publication Date
CN115128285A (en) 2022-09-30
CN115128285B (en) 2023-01-06


