WO2023179263A1 - System, model and kit for assessing the malignancy degree or probability of thyroid nodules
- Publication number
- WO2023179263A1 (PCT/CN2023/076918)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- model
- sample
- proteins
- establishing
Classifications
- G01N30/02: Column chromatography
- G01N30/72: Mass spectrometers (detectors specially adapted for column chromatography)
- G01N30/8675: Signal analysis; evaluation, i.e. decoding of the signal into analytical information
- G01N33/6848: Methods of protein analysis involving mass spectrometry
- G06N20/00: Machine learning
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Definitions
- the invention relates to the field of medical detection, and specifically to systems, models and kits for assessing the malignancy or probability of thyroid nodules.
- Thyroid nodules are a common clinical condition. According to autopsy reports, the incidence of thyroid nodules in the general population is about 50% to 60%, and they are more common in women. The vast majority of patients with thyroid nodules have no clinical symptoms, and the nodules are often discovered through physical examination. Among thyroid nodules discovered through pathological examination, only 5% to 15% are malignant nodules, that is, thyroid cancer.
- The evaluation of thyroid nodules recommended by current clinical guidelines is mainly based on the following three points: first, high-resolution ultrasound examination; second, blood biochemical indicators; and third, fine needle aspiration biopsy (FNAB or FNA).
- The concordance rate of FNA results usually depends on the skills and experience of the puncture operator and the cytopathologist, and 15% to 30% of thyroid nodules still cannot be clearly evaluated by FNA and cytopathology.
- For these indeterminate nodules, the mainstream view is to perform total thyroidectomy or near hemisection.
- However, postoperative pathology confirms most of these nodules to be benign, which leads to overdiagnosis and overtreatment.
- Protein is the executor of life activities and the ultimate expression of the life phenotype. Quantitative proteomics research can explain, at the proteome level, the causes and patterns of the occurrence and development of certain biological phenomena, which is of great significance to the life sciences and to the diagnosis and treatment of human diseases. Quantitative proteomic studies of tumor tissues and non-tumor tissues make it possible to find one or more tumor-specific proteins as disease markers, which can be used for early diagnosis, confirmation and classification of tumors.
- the present invention relates to a new detection method - a method for evaluating the malignancy of thyroid nodules based on targeted detection of proteins (polypeptides) and machine learning.
- the present invention provides a non-diagnostic method for assessing the malignancy or probability of malignancy of a subject's thyroid nodules based on targeted detection of proteins or polypeptides and machine learning, including:
- step c) detecting proteomic data of the target protein or polypeptide in the FNA sample obtained in step b), wherein the target protein or polypeptide includes a protein or polypeptide selected from Table 1, and the proteomic data is obtained by high performance liquid chromatography and mass spectrometry methods; the proteomic data includes MRM ion transitions and peak areas;
- the mass spectrometry method involved in the present invention is completed using mass spectrometry multiple reaction monitoring (Multiple Reaction Monitoring, MRM) technology.
- the mass spectrometry multiple reaction monitoring technology, that is, mass spectrometry MRM technology, is a targeted method based on known or assumed information.
- the key to MRM technology is to first detect specific precursor ions, then subject only the selected precursor ions to collision-induced fragmentation, and finally, excluding interference from other product ions, collect mass spectrometry signals only for the selected specific product ions. Since the triple quadrupole system (TQS) is the most sensitive mass spectrometry system for single mass-to-charge ratio scanning, it is the most suitable mass spectrometry instrument for MRM analysis.
- the MRM technology can selectively detect specific precursor ions and product ions in the first quadrupole (Q1) and third quadrupole (Q3) of the triple quadrupole, eliminating interference at both the precursor ion and product ion levels and enhancing detection specificity. Therefore, the present invention also relates to the parent and daughter ion pairs of the target protein or polypeptide.
- the peak area involved in the present invention refers to the chromatographic peak area of a parent-daughter ion pair.
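To make the notion of a chromatographic peak area concrete, the following is a minimal sketch that integrates the intensity trace of one ion pair over time with the trapezoidal rule. The function name `peak_area` and the choice of plain trapezoidal integration without baseline subtraction are assumptions for illustration; the source does not state how the instrument software integrates peaks.

```python
def peak_area(times_min, intensities):
    """Trapezoidal area under an intensity-vs-time trace.

    Illustrative only: no baseline correction or peak-boundary
    detection is performed (both are assumptions left out here).
    """
    if len(times_min) != len(intensities):
        raise ValueError("times and intensities must have the same length")
    area = 0.0
    for i in range(1, len(times_min)):
        dt = times_min[i] - times_min[i - 1]
        # Average of the two neighbouring intensities times the time step.
        area += 0.5 * (intensities[i] + intensities[i - 1]) * dt
    return area


# A triangular peak: base of 2 minutes, apex intensity of 10.
triangle_area = peak_area([0.0, 1.0, 2.0], [0.0, 10.0, 0.0])
```

The triangular trace is a made-up input chosen so the result can be checked by hand (half of base times height).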
- the analysis of step d) in the evaluation method of this embodiment includes establishing an AI model, and establishing the AI model includes dividing the retrospective data set into a training set, a validation set and an independent test set, wherein, for each unit that provides samples, if the number of sample delivery batches of the unit is M ≥ 2, one batch is randomly selected from the M batches and assigned to the independent test set, and the data of the remaining M-1 batches is divided into the training set and the validation set;
- establishing the AI model further includes dividing the data divided into the training set and the validation set into 70% of the training set and 30% of the validation set according to the time sequence of mass spectrometry generation;
- establishing the AI model further includes using a prospective data set as a second independent test set, and the sample batch and mass spectrometry time of the prospective data set are strictly independent of the retrospective data set.
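The partitioning rule described above can be sketched as follows. This is an illustrative reading, not the applicant's implementation: the record keys `unit`, `batch` and `acquired`, the function name `split_retrospective`, and the decisions to place the earliest-acquired data in the training set and to keep single-batch units entirely in the training/validation pool are all assumptions.

```python
import random
from collections import defaultdict


def split_retrospective(samples, seed=0, train_percent=70):
    """Split records into (train, validation, independent_test).

    samples: list of dicts with hypothetical keys
      'unit'     - the site that provided the sample
      'batch'    - the delivery batch within that unit
      'acquired' - a sortable mass spectrometry acquisition time
    """
    rng = random.Random(seed)
    batches_by_unit = defaultdict(set)
    for s in samples:
        batches_by_unit[s["unit"]].add(s["batch"])

    # For each unit with M >= 2 batches, hold out one randomly chosen batch.
    held_out = {
        unit: rng.choice(sorted(batches))
        for unit, batches in batches_by_unit.items()
        if len(batches) >= 2
    }

    test = [s for s in samples if held_out.get(s["unit"]) == s["batch"]]
    rest = [s for s in samples if held_out.get(s["unit"]) != s["batch"]]

    # Order the remaining data by acquisition time; the earliest 70% go to
    # training and the latest 30% to validation (direction is an assumption).
    rest.sort(key=lambda s: s["acquired"])
    cut = len(rest) * train_percent // 100
    return rest[:cut], rest[cut:], test
```

Splitting by time order rather than at random keeps later acquisitions out of the training set, which is one plausible motivation for the rule; the prospective data set would then be evaluated separately as a second independent test set.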
- the retrospective data set is a collection of low-quality data obtained by reviewing clinical cases in a retrospective study.
- the prospective data set is a collection of high-quality data from clinical cases collected in prospective studies.
- establishing the AI model in the evaluation method of this embodiment also includes calculating, for the three noise proteins HBB, THYG and H4 in the sample, the ratio of each individual protein's peak area to the sum of the total protein peak areas, and the ratio of the sum of the peak areas of the three proteins to the sum of the total protein peak areas.
- if the ratio of a single protein's peak area is >70%, or the ratio of the sum of the peak areas of the three proteins is >95%, the sample is determined to be an unqualified sample;
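The sample quality check based on the three noise proteins can be sketched as below. The input format (a dict mapping protein name to its summed peak area), the function name `is_unqualified`, and the handling of a sample with no signal are assumptions; the 70% and 95% limits are the ones stated in the text.

```python
NOISE_PROTEINS = ("HBB", "THYG", "H4")


def is_unqualified(protein_peak_areas, single_limit=0.70, combined_limit=0.95):
    """Return True if the sample fails the noise-protein check.

    protein_peak_areas: dict of protein name -> peak area (hypothetical
    input format; how peptide areas are rolled up to proteins is not
    specified here).
    """
    total = sum(protein_peak_areas.values())
    if total <= 0:
        # Assumption: a sample with no measurable signal is also rejected.
        return True
    ratios = [protein_peak_areas.get(p, 0.0) / total for p in NOISE_PROTEINS]
    # Reject if any single noise protein exceeds 70% of the total area,
    # or if the three together exceed 95%.
    return any(r > single_limit for r in ratios) or sum(ratios) > combined_limit
```

For example, a sample in which HBB alone accounts for 80% of the total peak area would be rejected by the first condition, whatever the other two proteins contribute.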
- establishing the AI model also includes removing samples containing extremely high-abundance target proteins or polypeptides, which include VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- the mass spectrometry method in the evaluation method of this embodiment includes collecting data from the protein or polypeptide eluted from the chromatography column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode.
- the Schedule window is 2.5 minutes.
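As a rough illustration of how a scheduled acquisition limits monitoring to a retention-time window, the sketch below selects the transitions that are active at a given time. It assumes the 2.5-minute window is centred on each transition's expected retention time; the `Transition` class, its field names and the numeric values in the example are hypothetical placeholders, not entries from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    peptide: str
    q1_mz: float               # parent (precursor) ion m/z
    q3_mz: float               # daughter (product) ion m/z
    retention_time_min: float  # expected retention time
    collision_energy: float    # optimized CE


def active_transitions(transitions, t_min, window_min=2.5):
    """Transitions monitored at time t_min.

    Assumes the schedule window is centred on the expected retention
    time, i.e. a transition is monitored from RT - 1.25 to RT + 1.25 min.
    """
    half = window_min / 2
    return [tr for tr in transitions
            if abs(tr.retention_time_min - t_min) <= half]


# Placeholder values for illustration only.
example = [
    Transition("PEPTIDE_A", 500.0, 600.0, 5.0, 20.0),
    Transition("PEPTIDE_B", 510.0, 610.0, 10.0, 22.0),
]
monitored_at_6_min = active_transitions(example, 6.0)
```

Restricting each ion pair to its own window reduces the number of transitions monitored concurrently, which is the usual reason for scheduling in a short gradient.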
- the present invention provides the use of a target protein or polypeptide as a detection target in the preparation of a kit for evaluating the malignancy degree or malignancy probability of a subject's thyroid nodules based on targeted detection of the protein or polypeptide and machine learning, wherein the kit includes a tool for detecting the target protein or polypeptide, and the target protein or polypeptide includes a protein or polypeptide selected from Table 1.
- the use of this embodiment involves an assessment method that includes:
- the target protein or polypeptide includes the protein or polypeptide selected from Table 1.
- the proteomic data is obtained by high performance liquid chromatography and mass spectrometry, and includes parent and daughter ion pairs, retention time, collision voltage (CE) and peak area;
- step d) analyzing the proteomic data obtained in step c), the analysis comprising inputting the proteomic data into an AI model;
- the retention time involved in the present invention refers to the time for the peak of the peptide to emerge after passing through the chromatographic column.
- the collision voltage involved in the present invention refers to the voltage when parent ions are fragmented in the mass spectrometry collision chamber.
- the analysis of step d) of the evaluation in the use of this embodiment comprises establishing an AI model, and establishing the AI model comprises dividing the retrospective data set into a training set, a validation set and an independent test set, wherein, for each unit that provides samples, if the number of sample delivery batches of the unit is M ≥ 2, one batch is randomly selected from the M batches and assigned to the independent test set, and the data of the remaining M-1 batches is divided into the training set and the validation set;
- establishing the AI model further includes dividing the data divided into the training set and the validation set into 70% of the training set and 30% of the validation set according to the time sequence of mass spectrometry generation;
- establishing the AI model further includes using a prospective data set as a second independent test set, and the sample batch and mass spectrometry time of the prospective data set are strictly independent of the retrospective data set.
- the establishment of the AI model in the use of this embodiment also includes calculating, for the three noise proteins HBB, THYG and H4 in the sample, the ratio of each individual protein's peak area to the sum of the total protein peak areas, and the ratio of the sum of the three protein peak areas to the sum of the total protein peak areas.
- if the ratio of a single protein's peak area is >70%, or the ratio of the sum of the three protein peak areas is >95%, the sample is determined to be an unqualified sample;
- establishing the AI model further includes removing samples containing extremely high abundance target proteins or polypeptides, wherein the extremely high abundance target proteins or polypeptides include VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- the evaluation in the use of this embodiment involves a mass spectrometry method that includes data acquisition of proteins or polypeptides eluted from a chromatography column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode.
- the Schedule window is 2.5 minutes.
- the present invention also provides a system for evaluating the malignancy degree or malignancy probability of a subject's thyroid nodules based on targeted detection of proteins or polypeptides and machine learning, the system comprising:
- a sample pre-processing device which uses pressure cycling technology (PCT) to pre-process the FNA sample;
- a detection device that detects proteomic data of a target protein or polypeptide in the obtained sample, wherein the target protein or polypeptide includes a protein or polypeptide selected from Table 1, and the proteomic data is obtained by high-performance liquid chromatography and mass spectrometry methods and includes parent and daughter ion pairs, retention time, collision voltage (CE) and peak area;
- an analysis device that analyzes the obtained proteomic data, the analysis comprising inputting the proteomic data into an AI model;
- the analysis of iv) of this embodiment includes establishing an AI model, and establishing the AI model includes dividing the retrospective data set into a training set, a validation set and an independent test set, wherein, for each unit providing samples, if the number of sample delivery batches of the unit is M ≥ 2, one batch is randomly selected from the M batches and assigned to the independent test set, and the data of the remaining M-1 batches is divided into the training set and the validation set;
- establishing the AI model further includes dividing the data divided into the training set and the validation set into 70% of the training set and 30% of the validation set according to the time sequence of mass spectrometry generation;
- establishing the AI model further includes using a prospective data set as a second independent test set, and the sample batch and mass spectrometry time of the prospective data set are strictly independent of the retrospective data set.
- the establishment of the AI model involved in this embodiment also includes calculating, for the three noise proteins HBB, THYG and H4 in the sample, the ratio of each individual protein's peak area to the sum of the total protein peak areas, and the ratio of the sum of the three protein peak areas to the sum of the total protein peak areas.
- if the ratio of a single protein's peak area is >70%, or the ratio of the sum of the three protein peak areas is >95%, the sample is determined to be an unqualified sample;
- establishing the AI model further includes removing samples containing extremely high-abundance target proteins or polypeptides, including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- the mass spectrometry method involved in this embodiment includes collecting data from the protein or polypeptide eluted from the chromatography column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode.
- the Schedule window is 2.5 minutes.
- the present invention also provides an evaluation model for evaluating the malignancy degree or malignancy probability of a subject's thyroid nodules, wherein the proteomic data of the target protein or polypeptide in fine needle aspiration biopsy samples (FNA samples) from subjects with thyroid nodules of different malignancy degrees is used as training data to train a machine learning model, thereby obtaining the evaluation model.
- the target protein or polypeptide includes the protein or polypeptide selected from Table 1.
- the proteomic data includes parent and daughter ion transitions, retention time, collision voltage (CE), and peak area.
- the evaluation involved in the evaluation model of this embodiment includes establishing an AI model, and establishing the AI model includes dividing the retrospective data set into a training set, a validation set and an independent test set, wherein, for each unit that provides samples, if the number of sample delivery batches of the unit is M ≥ 2, one batch is randomly selected from the M batches and assigned to the independent test set, and the data of the remaining M-1 batches is divided into the training set and the validation set;
- establishing the AI model further includes dividing the data divided into the training set and the validation set into 70% of the training set and 30% of the validation set according to the time sequence of mass spectrometry generation;
- establishing the AI model further includes using a prospective data set as a second independent test set, and the sample batch and mass spectrometry time of the prospective data set are strictly independent of the retrospective data set.
- the establishment of the AI model involved in the evaluation model of this embodiment also includes calculating, for the three noise proteins HBB, THYG and H4 in the sample, the ratio of each individual protein's peak area to the sum of the total protein peak areas, and the ratio of the sum of the peak areas of the three proteins to the sum of the total protein peak areas.
- if the ratio of a single protein's peak area is >70%, or the ratio of the sum of the peak areas of the three proteins is >95%, the sample is determined to be an unqualified sample;
- establishing the AI model further includes removing samples containing extremely high-abundance target proteins or polypeptides, including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- the present invention uses high-performance liquid chromatography and mass spectrometry to detect the proteomic data of the target protein or polypeptide in the sample. After the peptide peak area information of the mass spectrometry data is processed and modeled by AI, the final evaluation result (malignancy probability) is obtained, which can provide a clinical reference for the malignancy of thyroid nodules. In particular, for thyroid nodules that cannot be identified clinically, a second evaluation result (malignancy probability) can also be provided for doctors' reference.
- Figure 1 shows the AI flow chart of the present invention
- Figure 2 shows a schematic diagram of the present invention establishing a training data set and a test set
- Figure 3 shows the results of the first comparative experiment in one embodiment of the present invention
- Figure 4 shows the results of a second comparative experiment in an embodiment of the present invention
- Figure 5 shows the ROC chart predicted by the model of the present invention.
- reagents, instruments, devices, etc. used in the present invention are all commercially available products.
- Example 1 Establishment of a clinical multicenter prospective cohort.
- Example 2 Pressure circulation system assisted FNA sample processing method.
- FNA puncture samples were obtained by ultrasound-guided or intraoperative repeated aspiration puncture using a 19-27G syringe needle.
- the puncture sample was first lysed in 0.5 mL of red blood cell lysis solution at 4°C. After 5 minutes of reaction, it was centrifuged at 300 × g for 10 minutes. After centrifugation, the solution was discarded and the remaining cells were retained.
- PCT is an emerging semi-automated sample preparation technology for tissue lysis and protein and peptide extraction. By cycling between ultrahigh pressure (up to 45 kpsi) and standard atmospheric pressure in a small-volume (150 µL) container, it promotes the dissolution of tissues and cells and accelerates protein hydrolysis and enzymatic digestion.
- the main feature of PCT is the semi-automatic processing of micro-volume samples (about 0.1 mg of tissue / more than a thousand cells), which ensures the stability and reproducibility of the sample preparation process. It has been widely used in many biological fields.
- the PCT sample preparation system is a complete workflow based on pressure cycling technology, consisting of Barocycler2320EXT equipment (which can process 16 samples at the same time) and consumables such as MicroTube, MicroPestle and MicroCaps. When applied to proteomics, peptides ready for mass spectrometry analysis can be extracted from tissue within 4-5 hours.
- the thyroid puncture sample, after removal of red blood cells, is reacted in PCT tubes with lysis solution (6M urea, 2M thiourea), reducing agent (tris(2-carboxyethyl)phosphine, TCEP) and alkylating reagent (iodoacetamide, IAA).
- the instrument parameters for this reaction are 90 cycles, each consisting of 30 seconds at 45,000 psi followed by 10 seconds of off-time. After the reaction is completed, 0.75 to 1.5 µg of LysC and 2.5 to 5 µg of trypsin are added, and the digestion is accelerated in the PCT instrument.
- the digestion conditions are 120 cycles, each consisting of 50 s at 20,000 psi followed by 10 s of off-time. After digestion, the peptides are desalted on a C18 column. Finally, the peptides are cleaned and dried for subsequent analysis.
- Example 3 Candidate protein selection.
- candidate peptides and corresponding parent and child ions that are beneficial for determining benign and malignant thyroid nodules are screened out.
- the initial candidate pool covered a total of 212 proteins.
- Example 4 Detection of target proteins (peptides) using targeted proteomics methods.
- This embodiment involves targeted proteome detection of polypeptides, which is divided into liquid phase method optimization and mass spectrometry parameter optimization. Through optimization, rapid detection can be completed within 10-25 minutes.
- Liquid phase method optimization: high-performance liquid chromatography on a C18 column (polar end-capped, length 100 mm, particle size 1.9 µm), with gradient elution using mobile phase A (water containing 0.1% (v/v) formic acid) and mobile phase B (acetonitrile containing 0.1% (v/v) formic acid) at a flow rate of 0.2 mL/min: 0-1 min, 3% B; 1-20 min, 3% B to 40% B; 20-20.1 min, 40% B to 80% B; 20.1-22 min, 80% B; 22.1-25 min, 3% B.
- The column oven temperature is 50°C.
- Mass spectrometry parameter optimization: eluting peptides are first acquired on a triple quadrupole mass spectrometer in MRM mode in positive ion mode to determine retention times. Once the retention times are known, the collision energy (CE) of each MRM ion pair is optimized by ramping, and the retention times and optimized CE values are combined to generate the Scheduled MRM™ acquisition method (Schedule window of 2.5 minutes). The acquired data comprise the precursor-product ion pairs, retention times and optimized collision voltage (CE). The results are shown in Table 1.
- the inventor also synthesized peptides containing stable isotope labels, mixed them and incorporated them into the samples to perform MRM acquisition.
- the purpose of introducing isotope-labeled peptides in the present invention is to confirm the target peptide and eliminate false positive signals.
- Example 5 Mass spectrometry data processing and AI modeling.
- By processing the peptide peak-area information in the mass spectrometry data and applying AI modeling, this embodiment obtains the final assessment result (malignancy probability), which provides a clinical reference for the degree of malignancy of thyroid nodules.
- For thyroid nodules that existing clinical methods cannot identify, it also provides a second assessment result (malignancy probability) for the physician's reference.
- For the different peptide combinations suggested in this embodiment, the AI algorithm of the present invention can provide both results.
- This embodiment divides the retrospective data set into three parts: 1. a training set, 2. a validation set, and 3. a different-batch independent test set. The specific process is shown in Figure 2.
- A different-batch independent test set is first split off according to hospital and sample-submission information: for each hospital, if the number of submission batches M is ≥2, the data of one randomly selected batch are assigned to the independent test set (to show that the AI model of the present invention can overcome batch effects and performs well across sample batches), and the data of the remaining M-1 batches are assigned to the training and validation sets.
- The remaining data are divided, in the chronological order in which the mass spectra were generated, into approximately 70% training set and 30% validation set, so as to train a model that is insensitive to acquisition time.
- the difference between different test sets can be seen from Figure 2.
- The blue data are further divided by time into a training set and an internal test set (during training, five sets of models and parameters are determined as described in 5.5.i and are then tested on the internal test set and on the two independent test sets). T0, T1 and T2 are the times at which mass spectrometry was performed, and modeling took place at T1; data before T1 are therefore retrospective and data after T1 are prospective.
- During classification, the extremely high-abundance target proteins and their peptides are removed (these peptides are not suitable for inclusion in the model and are used for quality control).
- A total of 11 peptides need to be removed: VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- Either normalizing the data (dividing by the median) or normalizing the peptides (z-score) achieves the desired effect.
- Normalizing the peptides means z-scoring the quantity of each peptide (feature) on the training set and recording each peptide's mean and standard deviation.
- When testing new data, the z-score is applied to each peptide of the new data.
- The benefits of this operation are: 1) because the training sets differ slightly while validation 1 differs completely, different parameters and models are obtained and training works well; 2) because of these differences, the five models are to some degree independent of one another; 3) because of this independence, the five models are complementary when fused, so the fused result can be very good. Note that the model of the present invention can be extended to other models, including but not limited to logistic regression, decision trees, random forests, SVMs and neural networks.
- ii. Search the parameters of each model separately by grid search or a genetic algorithm: for the parameters at each grid point, first fit the training set from i with those parameters and rank the features by importance; then, keeping those parameters, add features to the model in descending order of importance and refit.
- The evaluation function is the sum of the AUC on validation 1 and the AUC on validation 2, and both AUCs must be no lower than 0.9.
- A single model may use no more than 10 features in total, to facilitate productization of the final kit.
- The parameters and corresponding features at which the evaluation function reaches its maximum are the finally selected parameters and features.
- This embodiment thus obtains the model that performs best on the training set while retaining some generalization ability (both AUCs above 0.9).
- Model fusion (optional): Since the robustness and stability of a single model are limited, this invention fuses the results of five XGBoost models.
- Five models are trained; any single model with its peptide combination can be packaged into a test kit, or all five models with their peptide combinations can be packaged into a test kit.
- vi. The results of iii or iv are converted to a binary prediction using a threshold: values above the threshold are predicted as 1 (malignant) and values below it as 0 (benign).
- The threshold is defined as (P1/S1 + P2/S2)/2, where P1 and P2 are the numbers of positive samples in the 70% and 30% data sets, respectively, and S1 and S2 are the total numbers of samples in the 70% and 30% data sets, respectively.
- the ROC chart predicted by the model of the present invention is shown in Figure 5, and the model AUC is 0.90.
- The inventors also compared the method of the present invention, in terms of sensitivity, specificity, etc., with the methods in two references (Patel et al., Performance of a Genomic Sequencing Classifier for the Preoperative Diagnosis of Cytologically Indeterminate Thyroid Nodules, JAMA Surg. 2018;153(9):817-824, and Livhits et al., Effectiveness of Molecular Testing Techniques for Diagnosis of Indeterminate Thyroid Nodules: A Randomized Clinical Trial, JAMA Oncol. 2021 Jan 1;7(1):70-77). The comparison results are shown in Table 7.
- a GSC is the abbreviation of Genomic Sequencing Classifier.
- In order to reduce clinical misjudgment of the degree or probability of malignancy of thyroid nodules, that is, to reduce false-positive calls, the inventors preferentially selected a model with higher specificity.
Abstract
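A system, model and kit for assessing the degree or probability of malignancy of thyroid nodules. The assessment method processes fine needle aspiration biopsy samples using pressure cycling technology, acquires proteomic data for target proteins or polypeptides in the resulting samples by high-performance liquid chromatography and mass spectrometry, and obtains the final assessment result, namely the malignancy probability, by processing the peptide peak-area information in the mass spectrometry data and applying AI modeling. The assessment result provides a clinical reference for the degree of malignancy of thyroid nodules; for thyroid nodules that cannot be identified by existing clinical methods, it also provides a second assessment result, namely the malignancy probability, for the physician's reference.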
Description
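This application claims priority to Chinese patent application No. 202210281265.8, filed with the China Patent Office on March 22, 2022 and entitled "System, model and kit for assessing the degree or probability of malignancy of thyroid nodules", the entire contents of which are incorporated herein by reference.
The present invention relates to the field of medical testing, and in particular to a system, model and kit for assessing the degree or probability of malignancy of thyroid nodules.
Thyroid nodules are a common clinical condition. According to autopsy reports, the incidence of thyroid nodules in the general population is about 50% to 60%, and they occur more often in women. The vast majority of patients with thyroid nodules have no clinical symptoms, and the nodules are usually found during a physical examination or by self-palpation. Among thyroid nodules found by pathological examination, only 5% to 15% are malignant, that is, thyroid cancer.
Current clinical guidelines recommend assessing thyroid nodules mainly on three bases: high-resolution ultrasound, blood biochemical markers, and fine needle aspiration biopsy (FNAB or FNA). Of these three examinations, FNA is considered the most sensitive, most economical and reliable test in the clinical management of patients with suspicious thyroid nodules. However, the concordance rate of FNA results generally depends on the skill and experience of the operator performing the puncture and of the cytopathologist, and 15% to 30% of thyroid nodules still cannot be clearly assessed by FNA and cytopathology. The mainstream approach to indeterminate thyroid nodules is total or near-half resection of the thyroid. However, most such nodules are confirmed to be benign by postoperative pathology, which clearly leads to overdiagnosis and overtreatment.
Current clinical diagnostic criteria and treatment regimens therefore offer no benefit to patients with asymptomatic thyroid nodules. Patients pay high surgical costs, must take replacement hormones for life after thyroidectomy, and even bear risks that surgery may bring, such as thyroid storm and postoperative recurrence.
In recent years, with the development of molecular techniques, molecular diagnostic methods based on thyroid tissue DNA and RNA have emerged to improve the accuracy of diagnosing indeterminate thyroid nodules. In the United States, two gene-based tests for classifying such nodules have been brought into clinical use: Afirma and ThyroSeq. Although both have a high negative predictive value (NPV), their positive predictive value (PPV) is low. In other words, these two methods classify only some benign nodules well and cannot accurately determine whether a nodule is suspected to be malignant, so they do not significantly reduce possible overtreatment.
Proteins are the executors of life activities and the ultimate embodiment of the phenotype. Quantitative proteomics can explain, at the proteome level, the causes and patterns of the occurrence and development of a biological phenomenon, and is of great significance to the life sciences and to the diagnosis and treatment of human disease. Quantitative proteomic studies of tumor and non-tumor tissue may identify one or more tumor-specific proteins as disease markers that can be used for the early diagnosis, confirmation and subtyping of tumors.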
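Summary of the Invention
The present invention relates to a novel detection method: a method for assessing the degree of malignancy of thyroid nodules based on targeted detection of proteins (polypeptides) and machine learning.
In one aspect, the present invention provides a non-diagnostic method for assessing the degree or probability of malignancy of a subject's thyroid nodule based on targeted detection of proteins or polypeptides and machine learning, comprising:
a) providing a fine needle aspiration biopsy sample from the subject, referred to as the FNA sample;
b) pretreating the FNA sample using pressure cycling technology (PCT);
c) detecting proteomic data for target proteins or polypeptides in the FNA sample obtained in step b), wherein the target proteins or polypeptides include proteins or polypeptides selected from Table 1, and the proteomic data are obtained by high-performance liquid chromatography and mass spectrometry; the proteomic data include MRM ion pairs and peak areas;
d) analyzing the proteomic data obtained, the analysis comprising inputting the proteomic data into an AI model; and
e) outputting a result, wherein a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge.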
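The mass spectrometry method of the present invention is carried out using multiple reaction monitoring (MRM), a technique that acquires mass spectrometry signals in a targeted manner based on known or assumed information. The key to MRM is first to detect a specific precursor ion, then to subject only the selected precursor ion to collision-induced fragmentation, and finally to exclude interference from other product ions and acquire signals only for the selected specific product ions. Because a triple quadrupole system (TQS) is the most sensitive mass spectrometry system for scanning a single mass-to-charge ratio, it is the instrument best suited to MRM analysis.
MRM selects and detects a specific precursor ion in the first quadrupole (Q1) and a specific product ion in the third quadrupole (Q3) of the triple quadrupole, excluding interference at both the precursor and product ion levels and increasing detection specificity. The present invention therefore also relates to precursor-product ion pairs of the target proteins or polypeptides.
The peak area referred to in the present invention is the chromatographic peak area of a precursor-product ion pair.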
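In one embodiment, the analysis in step d) of the assessment method comprises building an AI model, which comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets;
optionally, building the AI model further comprises dividing the data assigned to the training and validation sets, in the chronological order in which the mass spectra were generated, into a 70% training set and a 30% validation set;
further optionally, building the AI model further comprises using a prospective data set as a second independent test set, the sample batches and mass spectrometry times of the prospective data set being strictly independent of the retrospective data set.
Here, the retrospective data set is a collection of low-quality data obtained by reviewing clinical cases in a retrospective study. The prospective data set is a collection of high-quality data from clinical cases collected in a prospective study.
In another embodiment, building the AI model in the assessment method further comprises calculating, for a sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%;
optionally, building the AI model further comprises removing samples containing extremely high-abundance target proteins or polypeptides, the extremely high-abundance target proteins or polypeptides including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
In yet another embodiment, the mass spectrometry method in the assessment method comprises acquiring data for the proteins or polypeptides eluting from the chromatographic column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode. Optionally, the Schedule window is 2.5 minutes.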
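In another aspect, the present invention provides the use of target proteins or polypeptides as detection targets in the preparation of a kit, the kit being for assessing the degree or probability of malignancy of a subject's thyroid nodule based on targeted detection of proteins or polypeptides and machine learning, wherein the kit comprises means for detecting the target proteins or polypeptides, and the target proteins or polypeptides include proteins or polypeptides selected from Table 1.
In one embodiment, the assessment method involved in the use comprises:
a) providing a fine needle aspiration biopsy sample from the subject, referred to as the FNA sample;
b) pretreating the FNA sample using pressure cycling technology (PCT);
c) detecting proteomic data for target proteins or polypeptides in the FNA sample obtained in step b), the target proteins or polypeptides including proteins or polypeptides selected from Table 1, the proteomic data being obtained by high-performance liquid chromatography and mass spectrometry and including precursor-product ion pairs, retention time, collision voltage (CE) and peak area;
d) analyzing the proteomic data obtained in step c), the analysis comprising inputting the proteomic data into an AI model; and
e) outputting a result, wherein a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge.
The retention time referred to in the present invention is the time at which a peptide's peak appears after the peptide passes through the chromatographic column.
The collision voltage referred to in the present invention is the voltage at which the precursor ion fragments in the collision cell of the mass spectrometer.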
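In another embodiment, the analysis in step d) of the assessment involved in the use comprises building an AI model, which comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets;
optionally, building the AI model further comprises dividing the data assigned to the training and validation sets, in the chronological order in which the mass spectra were generated, into a 70% training set and a 30% validation set;
further optionally, building the AI model further comprises using a prospective data set as a second independent test set, the sample batches and mass spectrometry times of the prospective data set being strictly independent of the retrospective data set.
In yet another embodiment, building the AI model in the use further comprises calculating, for a sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%;
optionally, building the AI model further comprises removing samples containing extremely high-abundance target proteins or polypeptides, the extremely high-abundance target proteins or polypeptides including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
In another embodiment, the mass spectrometry method involved in the assessment of the use comprises acquiring data for the proteins or polypeptides eluting from the chromatographic column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode. Optionally, the Schedule window is 2.5 minutes.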
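In yet another aspect, the present invention also provides a system for assessing the degree or probability of malignancy of a subject's thyroid nodule based on targeted detection of proteins or polypeptides and machine learning, the system comprising:
i) a collection device, which collects a fine needle aspiration biopsy sample from the subject, referred to as the FNA sample;
ii) a sample pretreatment device, which pretreats the FNA sample using pressure cycling technology (PCT);
iii) a detection device, which detects proteomic data for target proteins or polypeptides in the resulting sample, wherein the target proteins or polypeptides include proteins or polypeptides selected from Table 1, and the proteomic data are obtained by high-performance liquid chromatography and mass spectrometry and include precursor-product ion pairs, retention time, collision voltage (CE) and peak area;
iv) an analysis device, which analyzes the proteomic data obtained, the analysis comprising inputting the proteomic data into an AI model; and
v) an output device, which outputs a result, wherein a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge.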
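In one embodiment, the analysis in iv) comprises building an AI model, which comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets;
optionally, building the AI model further comprises dividing the data assigned to the training and validation sets, in the chronological order in which the mass spectra were generated, into a 70% training set and a 30% validation set;
further optionally, building the AI model further comprises using a prospective data set as a second independent test set, the sample batches and mass spectrometry times of the prospective data set being strictly independent of the retrospective data set.
In yet another embodiment, building the AI model further comprises calculating, for a sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%;
optionally, building the AI model further comprises removing samples containing extremely high-abundance target proteins or polypeptides, the extremely high-abundance target proteins or polypeptides including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
In another embodiment, the mass spectrometry method comprises acquiring data for the proteins or polypeptides eluting from the chromatographic column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode. Optionally, the Schedule window is 2.5 minutes.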
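In another aspect, the present invention also provides an assessment model for assessing the degree or probability of malignancy of a subject's thyroid nodule, wherein the assessment model is obtained by training a machine learning model using, as training data, proteomic data for target proteins or polypeptides from fine needle aspiration biopsy samples (FNA samples) of subjects whose thyroid nodules have different degrees of malignancy, the target proteins or polypeptides including proteins or polypeptides selected from Table 1; a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge, and the proteomic data include precursor-product ion pairs, retention time, collision voltage (CE) and peak area.
In one embodiment, the assessment involved in the assessment model comprises building an AI model, which comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets;
optionally, building the AI model further comprises dividing the data assigned to the training and validation sets, in the chronological order in which the mass spectra were generated, into a 70% training set and a 30% validation set;
further optionally, building the AI model further comprises using a prospective data set as a second independent test set, the sample batches and mass spectrometry times of the prospective data set being strictly independent of the retrospective data set.
In another embodiment, building the AI model involved in the assessment further comprises calculating, for a sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%;
optionally, building the AI model further comprises removing samples containing extremely high-abundance target proteins or polypeptides, the extremely high-abundance target proteins or polypeptides including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
The present invention acquires proteomic data for target proteins or polypeptides in the resulting samples by high-performance liquid chromatography and mass spectrometry, and obtains the final assessment result (malignancy probability) by processing the peptide peak-area information in the mass spectrometry data and applying AI modeling. This provides a clinical reference for the degree of malignancy of thyroid nodules; for thyroid nodules that cannot be identified by existing clinical methods, it also provides a second assessment result (malignancy probability) for the physician's reference.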
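Figure 1 shows the AI flowchart of the present invention;
Figure 2 is a schematic diagram of how the training data set and test sets of the present invention are built;
Figure 3 shows the results of the first comparative experiment in an example of the present invention;
Figure 4 shows the results of the second comparative experiment in an example of the present invention;
Figure 5 shows the ROC curve of the predictions of the model of the present invention.
Specific embodiments of the present invention are illustrated below by way of examples; it should be understood, however, that the present invention is not limited to them.
Unless expressly stated otherwise, the reagents, instruments, devices, etc. used in the present invention are all commercially available products.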
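Examples
Example 1: Establishing a clinical multicenter prospective cohort.
First, a nationwide multicenter clinical trial was set up for sample collection.
Inclusion criteria:
(1) age ≥18 and ≤70 years;
(2) treatment-naive patients with thyroid nodules who have not received drug therapy;
(3) thyroid nodule ≥5 mm, thyroid fine needle aspiration, Bethesda III/IV;
(4) total or partial thyroidectomy performed, with a histology report for the nodule corresponding to the cytopathology puncture;
(5) voluntary participation in the study after informed consent.
Exclusion criteria:
(1) patients who did not undergo surgery;
(2) insufficient sample quantity;
A total of 3120 samples were collected in this study; after excluding samples that did not meet the criteria, 2450 samples remained for sample pretreatment and data acquisition.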
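Example 2: FNA sample processing assisted by a pressure cycling system.
FNA puncture samples were obtained under ultrasound guidance, or intraoperatively, by repeated aspiration with a 19-27G syringe needle. The puncture sample was first lysed in 0.5 mL of red blood cell lysis solution at 4°C; after 5 min of reaction it was centrifuged at 300 g for 10 min. After centrifugation, the solution was discarded and the remaining cells were retained.
The sample was then pretreated using PCT.
PCT is an emerging semi-automated sample preparation technology for tissue lysis and for protein and peptide extraction. In a small-volume (150 µL) container, it cycles between ultrahigh pressure (up to 45 kpsi) and standard atmospheric pressure to promote the dissolution of tissues and cells and to accelerate protein hydrolysis and enzymatic digestion. The main feature of PCT is the semi-automated processing of trace samples (about 0.1 mg of tissue, or just over a thousand cells), which ensures the stability and reproducibility of sample preparation; it is widely used in many fields of biology.
As an example, the PCT sample preparation system is a complete workflow based on pressure cycling technology, consisting of the Barocycler2320EXT instrument (which can process 16 samples at a time) and consumables such as MicroTube, MicroPestle and MicroCaps. When applied to proteomics, it can extract peptides ready for mass spectrometry analysis from tissue within 4-5 hours.
In this example, the thyroid puncture sample, after removal of red blood cells, was mixed with lysis solution (6 M urea, 2 M thiourea), the reducing agent tris(2-carboxyethyl)phosphine (TCEP) and the alkylating reagent iodoacetamide (IAA) and reacted in PCT tubes, with the instrument set to 90 cycles, each consisting of 30 s at 45,000 psi followed by 10 s off-time. After the reaction, 0.75-1.5 µg of LysC and 2.5-5 µg of trypsin were added and the digestion was accelerated in the PCT instrument under the following conditions: 120 cycles, each consisting of 50 s at 20,000 psi followed by 10 s off-time. After digestion, the peptides were desalted on a C18 column. Finally, the clean, dried peptides were used for subsequent analysis.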
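As a quick sanity check on the cycle settings above, the pressure-cycling time of each program can be worked out directly. The sketch below is only arithmetic on the stated parameters; the function name is hypothetical, and it ignores pressure ramp time and sample handling, which the text does not quantify.

```python
def pct_program_minutes(cycles: int, on_seconds: float, off_seconds: float) -> float:
    """Total duration in minutes of one pressure-cycling program."""
    return cycles * (on_seconds + off_seconds) / 60.0

# Lysis/reduction/alkylation step: 90 cycles of 30 s at 45,000 psi + 10 s off-time.
lysis_min = pct_program_minutes(90, 30, 10)
# Digestion step: 120 cycles of 50 s at 20,000 psi + 10 s off-time.
digestion_min = pct_program_minutes(120, 50, 10)

print(lysis_min, digestion_min, lysis_min + digestion_min)
```

Under these assumptions the two programs take 60 and 120 minutes, about 3 hours of pressure cycling in total, which fits within the 4-5 hour preparation time quoted for the workflow.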
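Example 3: Selection of candidate proteins.
This example screens for candidate peptides, and their corresponding precursor and product ions, that are useful for distinguishing benign from malignant thyroid nodules.
i) the 14-protein panel and the 20-protein panel found in earlier studies;
ii) 49 proteins selected by a model in earlier studies for diagnosing follicular carcinoma versus follicular adenoma;
iii) proteins differentially expressed between follicular carcinoma and follicular adenoma, obtained from data in earlier studies;
iv) 47 proteins used in clinical immunohistochemical staining;
v) 76 proteins reported in the literature to be associated with thyroid cancer;
After merging the proteins obtained from these sources, the inventors' initial candidate pool covered 212 proteins in total.
Next, the inventors selected 121 proteins and 537 precursor-product ion pairs as the ion-pair database for subsequent model building (Table 1).
Table 1. Candidate proteins and corresponding precursor-product ion pairs (columns 1-3):
Table 1, continued (column 1 and columns 4-6)
Table 1, continued (column 1 and columns 7-8)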
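Example 4: Detection of target proteins (peptides) by targeted proteomics.
This example concerns targeted proteomic detection of polypeptides and is divided into liquid chromatography method optimization and mass spectrometry parameter optimization. After optimization, rapid detection can be completed within 10-25 minutes.
Liquid chromatography method optimization: high-performance liquid chromatography on a C18 column (polar end-capped, length 100 mm, particle size 1.9 µm), with gradient elution using mobile phase A (water containing 0.1% (v/v) formic acid) and mobile phase B (acetonitrile containing 0.1% (v/v) formic acid) at a flow rate of 0.2 mL/min: 0-1 min, 3% B; 1-20 min, 3% B to 40% B; 20-20.1 min, 40% B to 80% B; 20.1-22 min, 80% B; 22.1-25 min, 3% B. The column oven temperature is 50°C.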
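The gradient program above can be written as a table of breakpoints and interpolated to find the mobile-phase composition at any point in the run. This is a minimal sketch: the breakpoints are transcribed from the text, but treating the 22-22.1 min return to 3% B as a linear step is an assumption (the source does not describe that interval), and `percent_b` is a hypothetical helper name.

```python
# (time in minutes, % mobile phase B) breakpoints from the gradient description.
# Assumption: the composition returns from 80% B to 3% B linearly over 22-22.1 min.
GRADIENT = [(0.0, 3.0), (1.0, 3.0), (20.0, 40.0), (20.1, 80.0),
            (22.0, 80.0), (22.1, 3.0), (25.0, 3.0)]

def percent_b(t_min: float) -> float:
    """Linearly interpolate the percentage of mobile phase B at time t_min."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time is outside the 0-25 min run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole run")

print(percent_b(10.5))  # midpoint of the 3% -> 40% B ramp
```

Mobile phase A is simply the remainder, 100 minus the returned value.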
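Mass spectrometry parameter optimization: eluting peptides are first acquired on a triple quadrupole mass spectrometer using MRM mode in positive ion mode to determine retention times. Once the retention times are known, the collision energy (CE) of each MRM ion pair is optimized by ramping, and the retention times and optimized CE values are combined to generate the Scheduled MRM™ acquisition method (Schedule window of 2.5 minutes). The acquired data comprise the precursor-product ion pairs, retention times and optimized collision voltage (CE); the results are shown in Table 1.
The inventors also synthesized peptides carrying stable isotope labels, mixed them and spiked them into the samples for MRM acquisition. The purpose of introducing isotope-labeled peptides in the present invention is to confirm the target peptides and exclude false-positive signals.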
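The idea behind a scheduled acquisition is that each transition is monitored only around its expected retention time rather than for the whole run. The sketch below illustrates that logic only; it is not vendor software. It assumes the 2.5-minute window is the total width centred on the expected retention time, and the peptide labels and retention times are invented for illustration.

```python
SCHEDULE_WINDOW_MIN = 2.5  # total window width, as stated in the text

def active_transitions(retention_times, t_min, window=SCHEDULE_WINDOW_MIN):
    """Return the labels of the transitions monitored at run time t_min.

    retention_times maps a transition label to its expected retention time (min).
    Assumption: the window is centred on the expected retention time.
    """
    half = window / 2.0
    return sorted(name for name, rt in retention_times.items()
                  if abs(t_min - rt) <= half)

# Hypothetical retention times, for illustration only.
example_rts = {"pepA": 5.0, "pepB": 6.0, "pepC": 12.0}
print(active_transitions(example_rts, 5.5))
```

Because only a few transitions are active at any moment, each one can be given more dwell time, which is one reason a panel of several hundred ion pairs can fit into a short run.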
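Example 5: Mass spectrometry data processing and AI modeling.
5.1 Overview of the principle
By processing the peptide peak-area information in the mass spectrometry data and applying AI modeling, this example obtains the final assessment result (malignancy probability), which provides a clinical reference for the degree of malignancy of thyroid nodules; for thyroid nodules that cannot be identified by existing clinical methods, it also provides a second assessment result (malignancy probability) for the physician's reference. For the different peptide combinations suggested in this example, the AI algorithm of the present invention can provide both results.
The AI flowchart of this example is shown in Figure 1.
5.2 Building the training data set and test sets
To verify the effectiveness, stability and generalization of the AI model of the present invention, this example divides the retrospective data set into three parts: 1. a training set, 2. a validation set, and 3. a different-batch independent test set. The specific process is shown in Figure 2.
First, a different-batch independent test set is split off from the existing samples according to hospital and sample-submission information: for each hospital, if the number of submission batches M is ≥2, the data of one randomly selected batch are assigned to the independent test set (to show that the AI model of the present invention can overcome batch effects and performs well across sample batches), and the data of the remaining M-1 batches are assigned to the training and validation sets.
The remaining data are divided, in the chronological order in which the mass spectra were generated, into approximately 70% training set and 30% validation set, so as to train a model that is insensitive to acquisition time.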
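The two-stage split described above (one randomly held-out batch per hospital, then a time-ordered 70/30 split of the rest) can be sketched as follows. The record layout with `hospital`, `batch` and `ms_time` keys and the function name are assumptions made for illustration; the source does not specify a data schema or how the 70% boundary is rounded.

```python
import random
from collections import defaultdict

def split_retrospective(samples, train_fraction=0.7, seed=0):
    """Split samples into (train, validation, independent_test).

    samples: list of dicts with 'hospital', 'batch' and 'ms_time' keys
    (a hypothetical schema chosen for this sketch).
    """
    rng = random.Random(seed)
    batches_by_hospital = defaultdict(set)
    for s in samples:
        batches_by_hospital[s["hospital"]].add(s["batch"])
    # For hospitals with M >= 2 batches, hold out one randomly chosen batch.
    held_out = {h: rng.choice(sorted(b))
                for h, b in batches_by_hospital.items() if len(b) >= 2}

    def is_held_out(s):
        return s["hospital"] in held_out and held_out[s["hospital"]] == s["batch"]

    independent = [s for s in samples if is_held_out(s)]
    # Order the remaining samples by acquisition time, then cut at ~70%.
    remaining = sorted((s for s in samples if not is_held_out(s)),
                       key=lambda s: s["ms_time"])
    cut = int(train_fraction * len(remaining))
    return remaining[:cut], remaining[cut:], independent
```

Splitting by time rather than at random means every validation sample was acquired after every training sample, which is what lets the validation score say something about robustness to acquisition time.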
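To further verify the generalization of the AI model of the present invention, a batch of prospectively collected samples is used as an independent test set. The sample batches and mass spectrometry times of these samples are strictly independent, which shows that the AI model of the present invention also performs well on a prospective data set.
Figure 2 shows how the test sets differ. The blue data are further divided by time into a training set and an internal test set (during training, five sets of models and parameters are determined as described in 5.5.i and are then tested on the internal test set and on the two independent test sets). T0, T1 and T2 are the times at which mass spectrometry was performed, and modeling took place at T1; data before T1 are therefore retrospective and data after T1 are prospective.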
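5.3 Data cleaning
i. Calculate, for each sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins. When the proportion for a single protein is >70%, or the proportion for the three proteins combined is >95%, the sample is deemed unqualified; this greatly improves the model's classification performance. (ROC curves were compared at quality-control thresholds of 70%, 80%, 90% and 100%. Two comparative experiments were performed; the first is shown in Figure 3 and the second in Figure 4. The first experiment used the same training and test data; one arm applied 70% quality control to all data and removed the high-abundance proteins, and the other applied no processing. The quality-controlled arm clearly tested better. In the second experiment, a model trained with quality control fixed at 70% was tested on another data set under 70%, 80%, 90% and 100% quality control, giving results of 0.91, 0.9, 0.87 and 0.82, respectively);
ii. During classification, remove the extremely high-abundance target proteins and their peptides (these peptides are not suitable for inclusion in the model and serve as quality-control peptides. A total of 11 peptides need to be removed: VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR. If they are not removed, the model results are affected by them; in particular, their concentrations have different effects in different sample batches, so that the AUC on the internal test set falls from 1 to 0.99 and the AUC on the different-batch independent test set falls from 0.923 to 0.845).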
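The two cleaning steps above can be sketched as a sample-level filter and a feature-level filter. The thresholds and the peptide list come from the text; the function names and the use of plain dictionaries (protein or peptide name mapped to summed peak area) are assumptions made for illustration.

```python
NOISE_PROTEINS = ("HBB", "THYG", "H4")

# The 11 quality-control peptides listed in 5.3.ii.
QC_PEPTIDES = frozenset({
    "VNVDEVGGEALGR", "EFTPPVQAAYQK", "LALQFTTNPK", "LAAQSTLSFYQR",
    "LEDIPVASLPDLHDIER", "FLQGDHFGTSPR", "QVDQFLGVPYAAPPLAERR",
    "GGADVASIHLLTAR", "RISGLIYEETR", "ISGLIYEETR", "VFLENVIR",
})

def sample_passes_qc(protein_areas, single_limit=0.70, combined_limit=0.95):
    """Step i: reject a sample dominated by the noise proteins.

    protein_areas maps protein name -> summed peak area for one sample.
    """
    total = sum(protein_areas.values())
    if total <= 0:
        return False  # assumption: an empty or all-zero sample is unusable
    ratios = [protein_areas.get(p, 0.0) / total for p in NOISE_PROTEINS]
    return max(ratios) <= single_limit and sum(ratios) <= combined_limit

def drop_qc_peptides(peptide_areas):
    """Step ii: remove the high-abundance QC peptides from the feature set."""
    return {pep: area for pep, area in peptide_areas.items()
            if pep not in QC_PEPTIDES}
```

Note the different scopes: the first function discards whole samples, while the second discards features from every sample that remains.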
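5.4 Data preprocessing
Either normalizing the data (dividing by the median) or normalizing the peptides (z-score) achieves the desired effect. Normalizing the peptides means z-scoring the quantity of each peptide (feature) on the training set and recording each peptide's mean and standard deviation; when new data are tested, the z-score is applied to each peptide of the new data.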
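The key point of the z-score variant is that the mean and standard deviation are estimated once on the training set and then reused unchanged for new data. A minimal sketch follows; the function names are hypothetical, and the choice of the population standard deviation and the fallback for constant features are assumptions, since the source does not specify either.

```python
from statistics import mean, pstdev

def fit_zscore(train_rows):
    """Learn per-peptide (mean, std) from the training set.

    train_rows: list of dicts mapping peptide -> quantity.
    Assumption: population standard deviation; a zero std is replaced by 1.0
    so that constant features map to 0 instead of dividing by zero.
    """
    params = {}
    for pep in train_rows[0]:
        values = [row[pep] for row in train_rows]
        sd = pstdev(values)
        params[pep] = (mean(values), sd if sd > 0 else 1.0)
    return params

def apply_zscore(row, params):
    """Transform one new sample with the recorded training statistics."""
    return {pep: (row[pep] - mu) / sd for pep, (mu, sd) in params.items()}

params = fit_zscore([{"pep1": 1.0}, {"pep1": 3.0}])
print(apply_zscore({"pep1": 5.0}, params))
```

Recomputing the statistics on the test data instead would leak information about the test distribution into the features, so the recorded training values are used as-is.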
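5.5 Model training
i. Divide the two classes of data in the training set proportionally into five parts, each containing 20% of the positive samples and 20% of the negative samples. Each time, four parts are combined as training data and an XGBoost model is used for AI modeling, with validation on the remaining part (validation 1) and on the internal test set mentioned above (validation 2). This gives five different training sets and five XGBoost models, which increases model diversity in preparation for the later model fusion. The benefits of this operation are: 1) because the training sets differ slightly while validation 1 differs completely, different parameters and models are obtained and training works well; 2) because of these differences, the five models are to some degree independent of one another; 3) because of this independence, the five models are complementary when fused, so the fused result can be very good. Note that the model of the present invention can be extended to other models, including but not limited to logistic regression, decision trees, random forests, SVMs and neural networks.
ii. Search the parameters of each model separately by grid search or a genetic algorithm: for the parameters at each grid point, first fit the training set from i with those parameters and rank the features by importance; then, keeping those parameters, add features to the model in descending order of importance and refit. The evaluation function is the sum of the AUC on validation 1 and the AUC on validation 2; both AUCs must be no lower than 0.9, and a single model may use no more than 10 features in total, to facilitate productization of the final kit. The parameters and corresponding features at which the evaluation function reaches its maximum are the finally selected parameters and features. This example thus obtains the model that performs best on the training set while retaining some generalization ability (both AUCs above 0.9).
iii. Optionally, partition the training set differently to generate more combinations of models and features.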
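The model-agnostic parts of steps i and ii can be sketched without any machine learning library: a stratified five-way split, a rank-based AUC, and the constrained evaluation function. Model fitting itself (XGBoost in the text) is deliberately left out, and all function names are hypothetical. Returning `None` for a candidate that violates a constraint is one possible reading of how the AUC and feature-count limits are enforced.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Return k lists of sample indices, each holding about 1/k of the
    positive and 1/k of the negative samples."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def auc(labels, scores):
    """ROC AUC as the probability that a random positive sample scores
    higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def evaluation(auc_val1, auc_val2, n_features, min_auc=0.9, max_features=10):
    """Sum of the two validation AUCs, or None if a constraint is violated."""
    if auc_val1 < min_auc or auc_val2 < min_auc or n_features > max_features:
        return None
    return auc_val1 + auc_val2
```

In a full search, each fold in turn would serve as validation 1 while the other four form the training data, and the candidate with the highest non-`None` evaluation would be kept for that fold's model.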
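5.6 Model testing/prediction
The models trained as above are used to test/predict new data.
i. Process the new mass spectrometry data obtained from the puncture sample as described above.
ii. Use the MRMTransitionGroupPicker or MRMMapper (OpenMS) algorithm to pick all peaks of the target precursor ions in the spectra, and use the mProphet algorithm for quality control of the data (false discovery rate estimation) to obtain accurate qualitative and quantitative results; alternatively, use Skyline software for identification and quantification. This step converts mass spectrometry data into peptide quantification data.
iii. Normalize the data: retrieve the previously recorded mean and variance and apply the z-score transform.
iv. Test on the five XGBoost models produced from the five training sets to obtain a predicted value (a probability between 0 and 1).
v. Model fusion (optional): because the robustness and stability of a single model are limited, the present invention fuses the results of the five XGBoost models as pred = (pred1 + pred2 + pred3 + pred4 + pred5)/5. This example trains five models; any single model with its peptide combination can be packaged into a kit, or all five models with their peptide combinations can be packaged into a kit.
vi. The result of iii or iv is converted to a binary prediction using a threshold: values above the threshold are predicted as 1 (malignant) and values below it as 0 (benign). The threshold is defined as (P1/S1 + P2/S2)/2, where P1 and P2 are the numbers of positive samples in the 70% and 30% data sets, respectively, and S1 and S2 are the total numbers of samples in the 70% and 30% data sets, respectively.
vii. Presentation of results: overall results (sensitivity, specificity and AUC); results for category III/IV nodules, which are difficult to distinguish clinically; generalization 1 (two validation sets and two independent test sets); generalization 2 (multicenter data on the two independent test sets). The results are shown in Tables 2 and 3:
Table 2. Overall results (columns 1-3)
Table 2, continued (columns 4-6)
Table 2, continued (columns 7-8)
Table 3. Multicenter results
Note: because the second independent test set contains few benign test samples, its results are not very stable; all other tests achieve the expected performance.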
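The fusion and thresholding steps of section 5.6 (items v and vi) reduce to a few lines of arithmetic, sketched below. The function names and all numeric inputs in the example are hypothetical; the source does not say how a value exactly equal to the threshold is handled, so this sketch assigns it to the benign class.

```python
def fuse(predictions):
    """Item v: average the malignancy probabilities of the individual models."""
    return sum(predictions) / len(predictions)

def threshold(p1, s1, p2, s2):
    """Item vi: (P1/S1 + P2/S2)/2, the mean positive rate of the 70% and
    30% data sets."""
    return (p1 / s1 + p2 / s2) / 2

def classify(probability, cutoff):
    """1 = malignant, 0 = benign. Assumption: a tie is called benign."""
    return 1 if probability > cutoff else 0

# Hypothetical numbers, for illustration only.
fused = fuse([0.9, 0.8, 0.7, 0.6, 0.5])
cutoff = threshold(p1=70, s1=140, p2=30, s2=60)
print(fused, cutoff, classify(fused, cutoff))
```

Tying the cutoff to the observed positive rate, rather than fixing it at 0.5, adapts the decision boundary to the class balance of the training and validation data.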
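Example 6: Comparison of the present invention with the prior art.
6.1 Comparison with assessments by clinical cytopathologists
Bethesda category III/IV thyroid nodules are those that clinical cytopathologists cannot assess; it cannot be known with certainty whether they are benign or malignant (category III does not mean benign, and category IV does not mean malignant). The malignancy risk and clinical management for each thyroid TBSRTC diagnostic category are shown in Table 4.
Table 4. Malignancy risk and clinical management for each thyroid TBSRTC diagnostic category
The method of this study has an assessment accuracy of 77% for category III/IV nodules, with a model AUC of 0.90.
Because category III/IV data are scarce, the inventors pooled the data from the internal test set and the two independent test sets for prediction and present the results (5 benign, 21 malignant; see Tables 5 and 6):
Table 5. Physician predictions
Table 6. Model predictions
The ROC curve of the predictions of the model of the present invention is shown in Figure 5; the model AUC is 0.90.
6.2 Comparison with prior-art methods
The inventors also compared the method of the present invention, in terms of sensitivity, specificity, etc., with the methods in two references (Patel et al., Performance of a Genomic Sequencing Classifier for the Preoperative Diagnosis of Cytologically Indeterminate Thyroid Nodules, JAMA Surg. 2018;153(9):817-824, and Livhits et al., Effectiveness of Molecular Testing Techniques for Diagnosis of Indeterminate Thyroid Nodules: A Randomized Clinical Trial, JAMA Oncol. 2021 Jan 1;7(1):70-77). The comparison results are shown in Table 7.
Table 7. Comparison of the method of the present invention with the methods in the two references (columns 1-3)
a GSC is the abbreviation of Genomic Sequencing Classifier;
b Only samples with a definite surgical result were included;
Table 7, continued (columns 4-6)
Table 7, continued (columns 7-8)
The results show that the specificity and positive predictive value of the method of the present invention are significantly higher than those of the methods used in the references.
In order to reduce clinical misjudgment of the degree or probability of malignancy of thyroid nodules, that is, to reduce false-positive calls, the inventors preferentially selected a model with higher specificity.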
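The metrics compared in this example (sensitivity, specificity, PPV, NPV) all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the patent's tables, whose contents are not reproduced here.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary diagnostic metrics, with malignant as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),   # malignant nodules correctly flagged
        "specificity": tn / (tn + fp),   # benign nodules correctly cleared
        "ppv": tp / (tp + fp),           # flagged nodules that are malignant
        "npv": tn / (tn + fn),           # cleared nodules that are benign
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

Both specificity and PPV improve as false positives fall, which is why a preference for fewer false-positive calls shows up in those two figures.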
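The above description of the examples is intended only to aid understanding of the method of the present invention and its core idea. It should be noted that a person of ordinary skill in the art may make various improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims. Various modifications to these examples will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Accordingly, the present invention is not to be limited to the examples shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.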
Claims (20)
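- A system, characterized in that it is for assessing the degree or probability of malignancy of a subject's thyroid nodule based on targeted detection of proteins or polypeptides and machine learning, the system comprising: i) a collection device for collecting a fine needle aspiration biopsy sample from the subject, referred to as the FNA sample; ii) a sample pretreatment device for pretreating the FNA sample using pressure cycling technology; iii) a detection device for detecting proteomic data for target proteins or polypeptides in the FNA sample, the amino acid sequences of the target proteins or polypeptides being as shown in SEQ ID NO: 1 to SEQ ID NO: 179, and the proteomic data being obtained by high-performance liquid chromatography and mass spectrometry; iv) an analysis device for analyzing the proteomic data, the analysis comprising inputting the proteomic data into an AI model, the proteomic data including precursor-product ion pairs, retention time, collision voltage and peak area; and v) an output device for outputting a result, wherein a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge.
- The system according to claim 1, characterized in that the analysis in iv) comprises building an AI model, which comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets.
- The system according to claim 2, characterized in that building the AI model further comprises dividing the data assigned to the training and validation sets, in the chronological order in which the mass spectra were generated, into a 70% training set and a 30% validation set.
- The system according to claim 3, characterized in that building the AI model further comprises using a prospective data set as a second independent test set, the sample batches and mass spectrometry times of the prospective data set being strictly independent of the retrospective data set.
- The system according to any one of claims 2 to 4, characterized in that building the AI model further comprises calculating, for an FNA sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%.
- The system according to claim 2, characterized in that building the AI model further comprises removing samples containing extremely high-abundance target proteins or polypeptides, the extremely high-abundance target proteins or polypeptides including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- The system according to claim 1 or 2, characterized in that the mass spectrometry method comprises acquiring data for the proteins or polypeptides eluting from the chromatographic column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode.
- The system according to claim 7, characterized in that the Schedule window is 2.5 minutes.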
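- An assessment model for assessing the degree or probability of malignancy of a subject's thyroid nodule, characterized in that the assessment model is obtained by training a machine learning model using, as training data, proteomic data for target proteins or polypeptides from FNA samples obtained by fine needle aspiration biopsy of subjects whose thyroid nodules have different degrees of malignancy, the amino acid sequences of the target proteins or polypeptides being as shown in SEQ ID NO: 1 to SEQ ID NO: 179, and wherein a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge.
- The assessment model according to claim 9, characterized in that the assessment method comprises building an AI model, and the method of building the AI model comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets.
- The assessment model according to claim 9 or 10, characterized in that building the AI model further comprises calculating, for a sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%.
- Use of the system according to any one of claims 1 to 7 or the assessment model according to any one of claims 8 to 11 in the preparation of a device for assessing the degree or probability of malignancy of a subject's thyroid nodule based on targeted detection of proteins or polypeptides and machine learning.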
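- Use of target proteins or polypeptides as detection targets in the preparation of a kit, characterized in that the kit is for assessing the degree or probability of malignancy of a subject's thyroid nodule based on targeted detection of proteins or polypeptides and machine learning, wherein the kit comprises means for detecting the target proteins or polypeptides, and wherein the amino acid sequences of the target proteins or polypeptides are as shown in SEQ ID NO: 1 to SEQ ID NO: 179.
- The use according to claim 13, characterized in that the assessment method comprises: a) providing a fine needle aspiration biopsy sample from the subject, referred to as the FNA sample; b) pretreating the FNA sample using pressure cycling technology; c) detecting proteomic data for target proteins or polypeptides in the FNA sample, the amino acid sequences of the target proteins or polypeptides being as shown in SEQ ID NO: 1 to SEQ ID NO: 179, and the proteomic data being obtained by high-performance liquid chromatography and mass spectrometry; d) analyzing the proteomic data, wherein the analysis comprises inputting the proteomic data into an AI model; and e) outputting a result, wherein a malignancy probability is provided for thyroid nodules that are clinically indeterminate or difficult to judge.
- The use according to claim 14, characterized in that the analysis in step d) comprises building an AI model, which comprises dividing a retrospective data set into a training set, a validation set and an independent test set, wherein for each unit providing samples, if the number of sample submission batches M of that unit is ≥2, the data of one batch randomly selected from the M batches are assigned to the independent test set, and the data of the remaining M-1 batches are assigned to the training and validation sets.
- The use according to claim 15, characterized in that building the AI model further comprises dividing the data assigned to the training and validation sets, in the chronological order in which the mass spectra were generated, into a 70% training set and a 30% validation set.
- The use according to claim 15, characterized in that building the AI model further comprises removing samples containing extremely high-abundance target proteins or polypeptides, the extremely high-abundance target proteins or polypeptides including VNVDEVGGEALGR, EFTPPVQAAYQK, LALQFTTNPK, LAAQSTLSFYQR, LEDIPVASLPDLHDIER, FLQGDHFGTSPR, QVDQFLGVPYAAPPLAERR, GGADVASIHLLTAR, RISGLIYEETR, ISGLIYEETR and VFLENVIR.
- The use according to claim 15, characterized in that building the AI model further comprises calculating, for a sample, the proportion of the peak area of each of the three noise proteins HBB, THYG and H4 in the summed peak area of all proteins, and the proportion of the summed peak area of these three proteins in the summed peak area of all proteins, wherein the sample is deemed unqualified when the proportion for any single protein is >70% or the proportion for the three proteins combined is >95%.
- The use according to claim 14, characterized in that the mass spectrometry method comprises acquiring data for the proteins or polypeptides eluting from the chromatographic column on a triple quadrupole mass spectrometer using Scheduled MRM™ mode in positive ion mode.
- The use according to claim 19, characterized in that the Schedule window is 2.5 minutes.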
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210281265.8 | 2022-03-22 | ||
CN202210281265.8A CN114414704B (zh) | 2022-03-22 | 2022-03-22 | System, model and kit for assessing the degree or probability of malignancy of thyroid nodules |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023179263A1 true WO2023179263A1 (zh) | 2023-09-28 |
Family
ID=81263218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/076918 WO2023179263A1 (zh) | 2022-03-22 | 2023-02-17 | System, model and kit for assessing the degree or probability of malignancy of thyroid nodules |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114414704B (zh) |
WO (1) | WO2023179263A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
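CN114414704B (zh) * | 2022-03-22 | 2022-08-12 | 西湖欧米(杭州)生物科技有限公司 | System, model and kit for assessing the degree or probability of malignancy of thyroid nodules |
CN115128285B (zh) * | 2022-08-30 | 2023-01-06 | 西湖大学 | Kit and system for differential assessment of thyroid follicular neoplasms using a protein combination |
CN115436640B (zh) * | 2022-11-07 | 2023-04-18 | 西湖欧米(杭州)生物科技有限公司 | Surrogate matrix suitable for polypeptides that can be used to assess the degree or probability of malignancy of thyroid nodules |
CN116609451A (zh) * | 2023-04-19 | 2023-08-18 | 西湖欧米(杭州)生物科技有限公司 | Quality control material for quality control of the mass spectrometric detection process for thyroid nodules |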
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009111881A1 (en) * | 2008-03-13 | 2009-09-17 | British Columbia Cancer Agency Branch | Biomarkers for diagnosis of differentiated thyroid cancer |
US20120142030A1 (en) * | 2007-04-14 | 2012-06-07 | The Regents of the University of Colorado, Body Corporate | Biomarkers for Follicular Thyroid Carcinoma and Methods of Use |
WO2016201555A1 (en) * | 2015-06-13 | 2016-12-22 | Walfish Paul G | Methods and compositions for the diagnosis of a thyroid condition |
CN108896682A (zh) * | 2018-07-18 | 2018-11-27 | 杭州汇健科技有限公司 | Rapid mass spectrometric analysis and spectrum discrimination method for peptide fingerprints |
CN111243042A (zh) * | 2020-02-28 | 2020-06-05 | 浙江德尚韵兴医疗科技有限公司 | Deep learning-based method for visualizing benign and malignant features of thyroid nodules in ultrasound |
CN111292801A (zh) * | 2020-01-21 | 2020-06-16 | 西湖大学 | Method for assessing thyroid nodules by protein mass spectrometry combined with deep learning |
CN112684048A (zh) * | 2020-12-22 | 2021-04-20 | 中山大学附属第一医院 | Biomarker and kit for preoperative differentiation of benign and malignant thyroid nodules, and use thereof |
CN113514530A (zh) * | 2020-12-23 | 2021-10-19 | 岛津企业管理(中国)有限公司 | Thyroid malignant tumor diagnosis system based on an open ion source |
CN114414704A (zh) * | 2022-03-22 | 2022-04-29 | 西湖欧米(杭州)生物科技有限公司 | System, model and kit for assessing the degree or probability of malignancy of thyroid nodules |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4742029B2 (ja) * | 2003-04-01 | 2011-08-10 | アクティブクス バイオサイエンシズ、インコーポレイテッド | アシル−リン酸及びホスホン酸プローブ並びにその合成方法並びにプロテオーム分析における使用 |
WO2010021822A2 (en) * | 2008-07-30 | 2010-02-25 | The Regents Of The University Of California | Discovery of candidate biomarkers of in vivo apoptosis by global profiling of caspase cleavage sites |
US20170121055A1 (en) * | 2015-10-28 | 2017-05-04 | Snyder Industries, Inc. | Pallet for supporting and stacking rolls of material |
- 2022-03-22: CN application CN202210281265.8A, patent CN114414704B (zh), status Active
- 2023-02-17: WO application PCT/CN2023/076918, publication WO2023179263A1 (zh), status unknown
Also Published As
Publication number | Publication date |
---|---|
CN114414704B (zh) | 2022-08-12 |
CN114414704A (zh) | 2022-04-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23773514 Country of ref document: EP Kind code of ref document: A1 |