WO2024072802A1 - Methods and systems for classification of a condition using mass spectrometry data - Google Patents
- Publication number
- WO2024072802A1 (PCT/US2023/033724)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- machine learning
- learning model
- mass spectra
- condition
- transformer
- Prior art date
Classifications
- G16B40/10 — Signal processing, e.g. from mass spectrometry [MS] or from PCR
- G01N30/7233 — Mass spectrometers interfaced to liquid or supercritical fluid chromatograph
- G01N30/8631 — Detection of slopes or peaks; baseline correction: Peaks
- G01N30/8644 — Data segmentation, e.g. time windows
- G01N30/88 — Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86
- G01N33/68 — Chemical analysis of biological material involving proteins, peptides or amino acids
- G01N33/6848 — Methods of protein analysis involving mass spectrometry
- G06N20/00 — Machine learning
- G06N20/10 — Machine learning using kernel methods, e.g. support vector machines [SVM]
- G06N3/02 — Neural networks
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G16H50/20 — ICT specially adapted for computer-aided medical diagnosis, e.g. based on medical expert systems
- G01N2030/8813 — Integrated analysis: analysis specially adapted for biological materials
- G01N27/623 — Ion mobility spectrometry combined with mass spectrometry
- G01N2800/60 — Complex ways of combining multiple protein biomarkers for diagnosis
- G01N2800/7028 — Detection or diagnosis of diseases: Cancer
- H01J49/0036 — Step by step routines describing the handling of the data generated during a measurement
- H01J49/004 — Combinations of spectrometers, tandem spectrometers, e.g. MS/MS, MSn
Definitions
- Mass spectrometry is an analytical technique that measures the mass-to-charge ratio (m/z) of molecules in a sample, providing accurate and specific measurements of molecules even at trace levels. In biological and clinical studies, mass spectrometry is often coupled with liquid chromatography (LC), which provides additional information on molecules based on retention time and can improve signal-to-noise ratios and reduce matrix effects observed by the mass spectrometer. Improvements in mass spectrometers, such as high-resolution instruments, together with faster and more efficient chromatographic methods, have greatly expanded the wealth of information that can be gained through mass spectrometry.
- LC liquid chromatography
- the method comprises obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer, wherein the raw mass spectra comprise ion m/z values and intensities, wherein an experimental m/Δm resolving power of the mass spectrometer is about 500-2,000,000 at m/z 200.
- the method comprises providing a machine learning model comprising one or more transformers that are trained on a raw mass spectra training dataset for characterization of the condition of the subject.
- raw mass spectra are converted to preprocessed mass spectra by an automated algorithm.
- the automated algorithm comprises a de-isotoping, a de-charging, or a de-adducting algorithm.
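The de-isotoping step above can be illustrated with a minimal sketch; the function name, peak representation, and tolerance below are hypothetical, and this is not the patent's algorithm — merely one common way to collapse isotope envelopes spaced by roughly 1.00335/z Da onto their monoisotopic peak:

```python
# Minimal de-isotoping sketch (illustrative only, not the disclosed algorithm).
NEUTRON_SPACING = 1.00335  # approximate 13C-12C mass difference in Da

def deisotope(peaks, charge=1, tol=0.01):
    """peaks: list of (mz, intensity) sorted by m/z.
    Returns monoisotopic peaks with isotope intensities folded in."""
    spacing = NEUTRON_SPACING / charge
    out = []  # entries: (monoisotopic m/z, summed intensity, isotopes seen)
    for mz, inten in peaks:
        if out and abs(mz - out[-1][0] - spacing * out[-1][2]) < tol:
            # this peak sits one isotope step beyond the last envelope: fold it in
            prev_mz, prev_int, n = out[-1]
            out[-1] = (prev_mz, prev_int + inten, n + 1)
        else:
            out.append((mz, inten, 1))  # start a new monoisotopic peak
    return [(mz, inten) for mz, inten, _ in out]

# Three isotope peaks collapse to one; the unrelated peak at 520 survives.
result = deisotope([(500.0, 100.0), (501.003, 60.0), (502.007, 20.0), (520.0, 50.0)])
```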
- the method comprises using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition marker or condition state in the subject.
- In some embodiments, the method comprises providing the information to a user via a graphical user interface.
- the experimental m/Δm resolving power is about 500-1,000,000 at m/z 200. In some embodiments, the experimental m/Δm resolving power is about 500-30,000 at m/z 200. In some embodiments, the experimental m/Δm resolving power is about 500-5,000 at m/z 200.
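Resolving power as used here is the ratio m/Δm, where Δm is conventionally the peak full width at half maximum (FWHM). A minimal illustration, with hypothetical example values not taken from the disclosure:

```python
# Resolving power R = m / Δm, with Δm the peak FWHM (illustrative values).
def resolving_power(mz, fwhm):
    return mz / fwhm

# A peak at m/z 200 with an FWHM of 0.004 Da gives R ≈ 50,000,
# which falls within the claimed 500-2,000,000 range at m/z 200.
r = resolving_power(200.0, 0.004)
```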
- the condition comprises a disease. In some embodiments, the condition comprises an age state of the subject. In some embodiments, the condition comprises a progression-free survival of the subject.
- the machine learning model comprises a plurality of transformers.
- the plurality of transformers are arranged in a hierarchy comprising a first and second transformer arranged in a hierarchy such that an output of the first transformer is used as an input of the second transformer.
- the one or more raw mass spectra are tokenized prior to submission to the one or more transformers.
- the one or more transformers are arranged in a hierarchy with a linear classifier and a random forest aggregator.
- the machine learning model further comprises a linear classifier. In some embodiments, the machine learning model further comprises a neural radiance field. In some embodiments, the machine learning model further comprises a multi-layer neural network. In some embodiments, the machine learning model further comprises a decision tree. In some embodiments, the machine learning model further comprises a support vector machine.
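The described arrangement — a first transformer summarizing each MS/MS isolation window, a second transformer summarizing a whole sample, and a linear classifier on top — can be sketched as below. To keep the example self-contained, the transformers are replaced by simple mean-pooling stand-ins; all function names, dimensions, and weights are hypothetical:

```python
# Sketch of the two-level hierarchy with a linear head (mean pooling stands
# in for the transformers; this is an illustration, not the patented model).

def encode_window(tokens):
    """Stand-in for the first transformer: summarize the tokens of one
    MS/MS isolation window into a fixed-length vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]

def encode_sample(window_vectors):
    """Stand-in for the second transformer: summarize all window-level
    vectors of one sample into a single sample-level vector."""
    dim = len(window_vectors[0])
    return [sum(v[i] for v in window_vectors) / len(window_vectors) for i in range(dim)]

def linear_classifier(vec, weights, bias=0.0):
    """Final linear stage: a positive score indicates the condition."""
    return sum(w * x for w, x in zip(weights, vec)) + bias

# One sample: two isolation windows, each holding tokenized (m/z, intensity) pairs.
windows = [[(100.0, 1.0), (101.0, 3.0)], [(250.0, 2.0), (251.0, 4.0)]]
sample_vec = encode_sample([encode_window(w) for w in windows])
score = linear_classifier(sample_vec, weights=[0.01, -0.1])
```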
- the one or more raw mass spectra comprise MS/MS spectra. In some embodiments, the one or more raw mass spectra comprise MSn spectra. In some embodiments, the MS/MS or MSn spectra are acquired in a data-independent manner.
- the machine learning model is trained with at least 10,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 50,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 100,000 individual mass spectra per day.
- the method comprises obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer.
- the method comprises providing a machine learning model comprising a plurality of transformers that are arranged in a hierarchy and trained on a raw mass spectra training dataset for characterization of the condition.
- the method comprises using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition or condition state in the subject.
- the hierarchy comprises a first and second transformer in a hierarchy such that an output of the first transformer is used as an input of the second transformer.
- the hierarchy further comprises a linear classifier, the linear classifier being arranged in the hierarchy such that an output of the second transformer is used as an input of the linear classifier.
- the hierarchy further comprises a neural radiance field.
- the neural radiance field is arranged in the hierarchy such that an output of the second transformer is used as an input of the neural radiance field.
- a neural radiance field replaces one or more of the transformers described herein.
- the hierarchy further comprises a multi-layer neural network.
- the multi-layer neural network is arranged in the hierarchy such that an output of the second transformer is used as an input of the multi-layer neural network.
- the multi-layer neural network replaces one or more of the transformers described herein.
- the hierarchy further comprises a decision tree, the decision tree being arranged in the hierarchy such that an output of the second transformer is used as an input of the decision tree.
- the hierarchy further comprises a support vector machine, the support vector machine being arranged in the hierarchy such that an output of the second transformer is used as an input of the support vector machine.
- the first transformer classifies tokenized data based on an MS/MS isolation window. In some embodiments, the classification performed by the first transformer is a summarization of tokenized data from the same MS/MS isolation window.
- the second transformer classifies a vector output of the first transformer based upon a sample identity.
- the classification performed by the second transformer is a summarization of data comprising samples obtained from the same subject.
- the sample identity comprises an identity of the subject from which the sample was obtained.
- the linear classifier classifies the disease or disease state based on the vector output from the second transformer.
- the raw mass spectra comprise MS/MS spectra that are acquired in a data-independent manner.
- the machine learning model is trained with at least 10,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 50,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 100,000 individual mass spectra per day.
- the condition comprises a disease. In some embodiments, the condition comprises an age state of the subject. In some embodiments, the condition comprises a progression-free survival of the subject.
- the method comprises obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer.
- the method comprises providing the machine learning model that is trained on a raw mass spectra training dataset for characterization of the condition, wherein the machine learning model is trained at a rate of at least 10,000 individual raw mass spectra from the training dataset per day.
- the method comprises using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition in the subject.
- the rate is at least 50,000 individual raw mass spectra from the training set per day. In some embodiments, the rate is at least 100,000 individual raw mass spectra from the training set per day.
- the machine learning model further comprises a linear classifier.
- the one or more raw mass spectra comprise MS/MS spectra.
- the machine learning model comprises a plurality of transformers.
- the plurality of transformers are arranged in a hierarchy comprising a first transformer and a second transformer, arranged such that an output of the first transformer is used as an input of the second transformer.
- the one or more raw mass spectra are tokenized prior to submission to the one or more transformers.
- the condition comprises a disease.
- the condition comprises an age state of the subject.
- the condition comprises a progression-free survival of the subject.
- the one or more raw mass spectra are tokenized by an MS/MS isolation window and a plurality of m/z values corresponding to detected ions of each of the one or more raw mass spectra.
- the one or more raw mass spectra are tokenized such that m/z values with the same unit mass are binned together.
- tokenized data comprises multiple entries for the same unit mass.
- the multiple entries correspond to separate peaks having the same nominal mass.
- the one or more raw mass spectra are tokenized using large bins (e.g. bins spanning about 1, 0.7, 0.5, or 0.3 mass units).
- the one or more raw mass spectra are tokenized using small bins (e.g. bins spanning about 0.1, 0.01, 0.001, or fewer mass units). In some embodiments, the one or more raw mass spectra are tokenized using uniform bins. In some embodiments, the one or more raw mass spectra are tokenized using non-uniform bins.
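As a hedged illustration of the tokenization described above, the sketch below bins each peak's m/z value into a unit-mass integer token, matching the FIG. 1B example in which peaks at 103.009, 231.068, and 378.136 become tokens [231, 103, 378]. The `tokenize_spectrum` name, the descending-intensity ordering, and the example intensities are assumptions for illustration; the disclosure does not fix an ordering rule.

```python
def tokenize_spectrum(peaks, bin_width=1.0):
    """Convert (m/z, intensity) pairs into integer tokens by binning
    each m/z into a bin of the given width (unit mass by default).
    Peaks are ordered by descending intensity, an assumed convention."""
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [int(mz // bin_width) for mz, _ in ordered]

# Peaks from the FIG. 1B example, paired with hypothetical intensities
peaks = [(103.009, 500.0), (231.068, 900.0), (378.136, 120.0)]
tokens = tokenize_spectrum(peaks)  # yields [231, 103, 378]
```

Passing a smaller `bin_width` (e.g. 0.01) would realize the "small bins" variant, at the cost of a larger token vocabulary.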
- the machine learning model is trained using self-supervised learning.
- measuring the sample by mass spectrometry comprises separating components of the sample using liquid chromatography coupled to a mass spectrometer.
- a gradient method of the liquid chromatography runs over a period of at least 15 minutes (e.g. about 15, 30, 60, 90, or 180 minutes).
- a gradient method of the liquid chromatography runs over a period of about 5 to 10 minutes (e.g. about 5, 7, or 10 minutes).
- the information includes presence or absence of the at least one disease or disease state in the subject.
- the at least one disease or disease state comprises cancer.
- the cancer comprises pancreatic cancer or ovarian cancer.
- the cancer comprises breast cancer.
- the cancer comprises prostate cancer.
- the cancer comprises lung cancer.
- the cancer comprises gallbladder cancer.
- the condition comprises a plurality of disease states.
- the condition is a disease state, and the disease state comprises a responsiveness of a disease to a therapeutic intervention.
- the therapeutic intervention is an immunotherapy (e.g. a CAR-T therapy).
- the information comprises a probability or likelihood of the subject having the at least one disease or disease state. In some embodiments, the information comprises an indication of disease state or disease severity. In some embodiments, the information comprises an indication of disease classification. In some embodiments, the at least one disease or disease state is a cancer and the indication of the disease classification comprises an identification of a cell line genotype or cell line phenotype of the cancer.
- the information is associated with at least one of a proteomic, a lipidomic, or a metabolomic profile of the sample obtained from the subject.
- the machine learning model outputs the information without requiring prior domain knowledge relating to at least one of the proteomic, lipidomic, or metabolomic profile.
- an accuracy of the information is at least 70%.
- an accuracy of the information is at least 80%.
- an accuracy of the information is at least 90%.
- an accuracy of the information is at least 95%.
- an accuracy of the information is at least 99%.
- training the machine learning model to determine a presence or absence of the one or more disease conditions requires no more than about 500 experimental data points. In some embodiments, no more than about 200 experimental data points are required to train the machine learning model. In some embodiments, no more than about 100 experimental data points are required to train the machine learning model.
- an accuracy of the determination is at least about 70%.
- the proteomic profile comprises one or more post-translational modifications (PTMs).
- the post-translational modifications comprise one or more phosphorylation, acetylation, ubiquitination, glycosylation, or combination of two or more thereof.
- training the machine learning model comprises randomly masking about 1-25% (e.g. 1%, 5%, 10%, 15%, 20%, or 25%) of the training set and adding about 1-10% (e.g. about 1%, 2%, 3%, 4%, 5%, or 10%) noise as a means of self-supervised learning.
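The self-supervised corruption described above (randomly masking a fraction of the training set and adding a few percent of noise) might be sketched as below. The function name, the mask token value of 0, and the use of multiplicative noise on intensities are illustrative assumptions, not the disclosed implementation.

```python
import random

def mask_and_noise(tokens, intensities, mask_frac=0.15, noise_frac=0.05,
                   mask_token=0, seed=0):
    """Self-supervised corruption sketch: randomly replace a fraction of
    tokens with a mask token and perturb intensities with a few percent
    of multiplicative noise; the model is then trained to recover the
    original tokens from the corrupted input."""
    rng = random.Random(seed)
    masked = [mask_token if rng.random() < mask_frac else t for t in tokens]
    noisy = [x * (1.0 + rng.uniform(-noise_frac, noise_frac))
             for x in intensities]
    return masked, noisy
```

The masking rate and noise level correspond to the about 1-25% and about 1-10% ranges recited above.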
- measuring the sample by mass spectrometry comprises separating ions by ion mobility (e.g. by High Field Asymmetric Waveform Ion Mobility Spectrometry (FAIMS) or Drift-tube Ion Mobility Spectrometry) prior to or during acquisition of mass spectra.
- a mean average percent error of the information is less than about 30% (e.g. less than 30%, 20%, 15%, 10%, 5%, 3%, 2%, or 1%).
- adjacent m/z values are not treated as continuous values during the analysis.
- the information comprises identification of one or more signals that are determinative of the presence or absence of a particular condition. In some embodiments, the information comprises identification of one or more signals that are indicative of or correlated with a particular state of a particular condition. In some embodiments, the information is used for biomarker discovery.
- the method is capable of being trained at a rate of at least 10 training samples per day (e.g. at least 10, 15, 50, 100, 300, 500, or 700 samples per day) when trained using a single GPU or CPU that is no faster, in terms of maximum single-precision floating point operations per second, than an NVIDIA RTX A6000 GPU equipped with 48 GB of RAM.
- non-transitory computer-readable storage media comprising instructions that, when executed by a processor, cause the processor to perform methods described herein.
- systems configured for characterizing a condition of a subject, the systems comprising: a computer comprising a memory operably coupled to at least one processor; and a module executing in the memory of the computer, the module comprising program code enabled upon execution by the at least one processor of the computer to perform methods described herein.
- Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto.
- the computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
- FIG. 1A illustrates an exemplary machine learning architecture for classification of one or more conditions of a subject.
- Input data from a mass spectrometer is provided to a machine learning model, which comprises a hierarchical transformer arrangement.
- Raw input data is processed by Transformer L1, which provides its output as input to Transformer L2.
- L2 output can be further processed by additional steps (shown as optional hierarchy layers) which provide their output as input to a final classifier, or L2 can directly output to the input of the final classifier.
- the classifier then outputs classification information about the one or more conditions of the subject.
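The data flow of FIG. 1A can be sketched structurally as follows, with simple pooling functions standing in for the learned Transformer L1 and L2 layers. The two-feature summary, the function names, and the weights are illustrative assumptions only; they show the hierarchy of window-level then sample-level summarization, not the disclosed model.

```python
def l1_summarize(window_tokens):
    """Stand-in for Transformer L1: summarize the tokens of one MS/MS
    isolation window into a fixed-size vector (here a mean/count pair
    instead of a learned embedding)."""
    n = len(window_tokens)
    return [sum(window_tokens) / n, float(n)]

def l2_summarize(window_vectors):
    """Stand-in for Transformer L2: pool the per-window vectors of one
    sample into a single sample-level vector (element-wise mean)."""
    return [sum(dim) / len(window_vectors) for dim in zip(*window_vectors)]

def linear_classify(sample_vector, weights, bias=0.0):
    """Final linear classifier on the L2 output: weighted sum, then a
    threshold to produce a binary condition label."""
    score = bias + sum(w * x for w, x in zip(weights, sample_vector))
    return 1 if score > 0 else 0
```

The optional hierarchy layers of FIG. 1A would slot in between `l2_summarize` and `linear_classify`.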
- FIG. 1B illustrates an example tokenization of a spectrum.
- a spectrum having 3 peaks with m/z values 103.009, 231.068, and 378.136 is converted to a sequence of tokens [231, 103, 378].
- FIG. 2 illustrates an example of a machine learning model utilizing hierarchical transformers for classification of the condition of a subject (in this example, identification of disease).
- FIG. 3 illustrates self-supervised training of the level 1 transformer used in the example machine learning model shown in Fig. 2.
- FIG. 4 illustrates self-supervised training of the level 2 transformer used in the example machine learning model shown in Fig. 2.
- FIG. 5 illustrates classification of a condition of a subject from the L2 output.
- FIG. 6 illustrates the level 1 encoder peak prediction accuracy and loss progression as training continues.
- FIG. 7 illustrates the level 1 encoder adjacent spectrum prediction loss and accuracy progression as training continues.
- FIG. 8 illustrates the level 2 encoder spectrum prediction accuracy and loss progression as training continues.
- FIG. 9 illustrates the level 2 encoder inter/intra person prediction loss and accuracy progression as training continues, for test and validation sets.
- FIG. 10 illustrates that the example training and validation sets produced similar accuracy.
- FIG. 11 illustrates example output from an example implementation of the hierarchical transformer scheme shown in FIG. 2.
- FIG. 12 illustrates an example of inspecting weights of the top-level linear model of the hierarchical transformer scheme of the example of FIG. 2. Absolute values of the weights indicate the importance of the input feature, i.e., the level 2 output. A per-isolation-window breakdown reveals which isolation window is more important, providing identification of specific condition markers (e.g. biomarkers).
- FIG. 13 illustrates inspecting the score, i.e., the product of the level 2 output and the weight, in the hierarchical transformer scheme of the example of FIG. 2. The scores are summed together to obtain the final classification verdict. By breaking the scores down by window, the window that contributed most to the final score can be identified.
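The score inspection of FIG. 13 can be sketched in a few lines: each window's score is the product of its level 2 output and the linear-model weight, the scores are summed for the final verdict, and the largest-magnitude score identifies the most influential window. The window labels and values below are hypothetical.

```python
def per_window_scores(l2_outputs, weights):
    """For each isolation window, score = level-2 output * linear
    weight; the final verdict is the sum, and the window with the
    largest absolute score is the biggest contributor."""
    scores = {w: l2_outputs[w] * weights[w] for w in l2_outputs}
    total = sum(scores.values())
    top_window = max(scores, key=lambda w: abs(scores[w]))
    return scores, total, top_window
```

This kind of breakdown is what allows the per-window biomarker attribution described for FIG. 12 and FIG. 13.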
- FIG. 14 illustrates inspection of the attention of the level 2 transformer of the hierarchical transformer scheme of the example of FIG. 2. Given a specific window, the attention score of the level 2 transformer can be inspected to check which regions the model determines to be more important in terms of retention time. The X axis is indicative of time.
- FIG. 15 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
- FIG. 16 illustrates an alternate exemplary machine learning architecture for classification of one or more conditions of a subject.
- Input data from a mass spectrometer is provided to a machine learning model, which comprises a hierarchical transformer arrangement.
- Raw input data is processed by Transformer L1, which provides its output as input to a linear classifier (or to additional processing steps (shown as optional hierarchy layers) which feed into the linear classifier), whose output is aggregated by a random forest model that outputs classification information about the one or more conditions of the subject.
- FIG. 17 illustrates a more detailed implementation of the example machine learning model depicted in FIG. 16.
- FIG. 18 illustrates conversion of a spectrum into a sequence useful for training example models described herein.
- FIG. 19 illustrates a conceptual analogy between sentence and spectrum pre-training of models described herein.
- FIG. 20A illustrates example results from a test case of an exemplary machine learning model described herein for Protein P01861.
- FIG. 20B illustrates example results from a test case of an exemplary machine learning model described herein for Protein P08519.
- FIG. 21 illustrates example results from a test case of an exemplary machine learning model described herein for Protein P01861.
- FIG. 22 illustrates the accuracy of an exemplary machine learning model described herein in various test cases.
- FIG. 23 illustrates the accuracy of an exemplary machine learning model described herein in alternate test cases described herein.
- ranges include the range endpoints. Additionally, every subrange and value within the range is present as if explicitly written out.
- the term “about” or “approximately” may mean within an acceptable error range for the particular value, which will depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” may mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, “about” may mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value may be assumed.
- Metabolomics, lipidomics, and/or proteomics can provide key insights into the health and functionality of a biological system. These tools can provide information useful for assessing the health status of human or animal subjects, as select metabolites, lipids, and proteins serve as biomarkers for various states of disease, malnutrition, or cellular dysfunction. For example, conditions such as diabetes mellitus, metabolic syndrome, renal failure, and hepatic failure present with biomarkers recognizable in blood or urine. Other cellular dysfunctions, such as various cancers, provide biomarker signatures that enable early detection of disease or monitoring of disease progression. Thus, analysis of biomarkers is of key utility for the fields of medical and veterinary science.
- One aspect of the present disclosure provides a method comprising: applying mass spectrometry (MS) to a sample obtained from a subject and using a trained machine learning model to determine information about one or more conditions of the sample.
- non-transitory computer-readable storage medium comprising a set of instructions for executing a method described herein.
- the machine learning model is selected from logistic regression, AdaBoost classifier, extra trees classifier, extreme gradient boosting, Gaussian process classifier, gradient boosting classifier, K-nearest neighbor, light gradient boosting, linear discriminant analysis, multi-level perceptron, naive Bayes, quadratic discriminant analysis, random forest classifier, ridge classifier, SVM (linear and radial kernels), fully connected neural network, or a deep neural network.
- One aspect of the present disclosure provides a system for classification of a condition of a subject based on a sample obtained from the subject comprising: a computing unit operably coupled to a mass spec (MS) machine.
- a sample obtained from a subject can be a cell, a tissue, a urine, a fecal matter, a blood, a blood plasma, a mucus, a saliva, a blood serum, a cerebrospinal fluid, or a cyst fluid.
- Chromatography generally comprises a laboratory technique for the separation of a mixture into its components.
- a mixture can be dissolved into a mobile phase, which can be carried through a system, such as a column, comprising a fixed stationary phase.
- the components within the mobile phase may have different affinities to the stationary phase, resulting in different retention times depending on these affinities. As a result, separation of components in the mixture is achieved.
- the separated components from chromatography may be analyzed using a mass spectrometer (MS).
- the LC output may be passed to an MS either directly or indirectly.
- Mass spectrometric analysis generally refers to measuring the mass-to-charge ratio of ions (e.g., m/z), resulting in a mass spectrum.
- the mass spectrum comprises a plot of intensity as a function of mass-to-charge ratio.
- the mass spectrum may be used to determine elemental or isotopic signatures in a sample, as well as the masses of the components (e.g., particles or molecules) in the mixture. This may be used to determine a chemical identity or structure of the components in the mixture.
- one or more acquisition parameters are programmed in the MS.
- the one or more acquisition parameters comprises, for example, the one or more mass acquisition windows, one or more acquisition times for the one or more mass acquisition windows, one or more resolutions for the one or more mass acquisition windows, one or more gain settings for the one or more acquisition windows, one or more ionization polarity settings for the one or more mass acquisition windows, one or more mass resolutions for the one or more mass acquisition windows, or any combination thereof.
- the MS is a high-resolution mass spectrometer. In some cases, the MS is a low-resolution mass spectrometer.
- the high-resolution mass spectrometer has a mass accuracy of less than or equal to 75 ppm, less than or equal to 30 ppm, less than or equal to 15 ppm, less than or equal to 10 ppm, or less than or equal to 5 ppm.
- the output signal from the MS can comprise an intensity value, a mass-to-charge ratio, or a combination thereof.
- the output signal from the MS comprises raw, unprocessed MS data.
- the output signal comprises a first signal indicating an intensity value or a mass-to-charge ratio of one or more analytes.
- the output signal comprises a second signal indicating an intensity value or a mass-to-charge ratio of one or more calibrators.
- the output signal comprises the first signal and the second signal.
- the output signal comprises the peak signal intensity obtained for an exact isotopic mass for each of the one or more analytes or one or more calibrators of known molecular weight.
- the output signal comprises combined signals corresponding to one or more mass adducts for the one or more analytes.
- the output signal for the one or more analytes is obtained by calculating the sum of the adduct signals for 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 analyte adducts.
- the analyte adducts correspond to the proton, sodium, potassium, calcium, magnesium, ammonium, nitrate, sulfate, phosphate, acetate, citrate, or formate adducts.
- the MS is a tandem MS (MS/MS).
- in MS/MS mode, a tandem MS can be operated such that ions passing through a first mass spec are activated, and the m/z of the activated ions is measured after a fixed amount of time.
- the second MS produces a mass spectrum comprising the activated ions and any fragments thereof produced during or after the ion activation.
- Isolation windows can be selected to determine which ions are subjected to activation and subsequent analysis.
- the isolation windows are fixed by the operator.
- the isolation windows can be adjusted during the course of data acquisition, for example to activate the most or least abundant ions in a spectrum for subsequent analysis of fragmentation.
- the LC-MS method provided herein is optimized for performance on a subset of cellular analytes. In some cases, the LC-MS methods provided herein ionize in both positive and negative modes. In some cases, the LC-MS method provided herein ionizes analytes as molecular ions. In some cases, an ion mobility separation is performed prior to, or during, mass spectrometry analysis.
- the output signal from the MS may be processed by a signal processing module.
- the input to the signal processing module can comprise an input signal comprising an intensity value, a mass-to-charge ratio, timing information, or a combination thereof from the MS.
- the input to the signal processing module comprises raw or unprocessed MS data.
- the input is an mzML file comprising the raw, unprocessed MS data.
- the input comprises preprocessed MS data.
- Preprocessing MS data may comprise data cleaning, data transformation, data reduction, or any combination thereof.
- data cleaning comprises cleaning missing data (e.g., fill in or ignore missing values), noisy data (e.g., binning, regression, clustering, etc.), or a combination thereof.
- data transformation comprises standardization, normalization, attribute selection, discretization, hierarchy generation, or any combination thereof.
- data reduction comprises data aggregation, attribute subset selection, numerosity reduction, dimensionality reduction, or any combination thereof.
- the MS data is preprocessed prior to the signal processing module. In some cases, the MS data is preprocessed in the signal processing module.
- the signal processing module can comprise a machine learning model.
- the machine learning model can be trained on MS data.
- the machine learning model may be a trained machine learning algorithm.
- the trained machine learning model may be used to determine information about a condition of a sample obtained from a subject.
- a machine learning model can comprise a supervised, semi-supervised, unsupervised, or self-supervised machine learning model.
- the one or more ML approaches perform classification or clustering of the MS data.
- the machine learning approach comprises a classical machine learning method, such as, but not limited to, support vector machine (SVM) (e.g., one-class SVM, linear or radial kernels, etc.), K-nearest neighbor (KNN), isolation forest, random forest, logistic regression, AdaBoost classifier, extra trees classifier, extreme gradient boosting, gaussian process classifier, gradient boosting classifier, light gradient boosting, linear discriminant analysis, naive Bayes, quadratic discriminant analysis, ridge classifier, or any combination thereof.
- the machine learning approach comprises a deep learning method (e.g., deep neural network (DNN)), such as, but not limited to, a fully-connected network, convolutional neural network (CNN) (e.g., one-class CNN), recurrent neural network (RNN), transformer, graph neural network (GNN), convolutional graph neural network (CGNN), multi-level perceptron (MLP), or any combination thereof.
- a classical ML method comprises one or more algorithms that learns from existing observations (i.e., known features) to predict outputs.
- the one or more algorithms perform clustering of data.
- the classical ML algorithms for clustering comprise K-means clustering, mean-shift clustering, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering (e.g., using Gaussian mixture models (GMM)), agglomerative hierarchical clustering, or any combination thereof.
- the one or more algorithms perform classification of data.
- the classical ML algorithms for classification comprise logistic regression, naive Bayes, KNN, random forest, isolation forest, decision trees, gradient boosting, support vector machine (SVM), or any combination thereof.
- the SVM comprises a one-class SVM or a multi-class SVM.
- the deep learning method comprises one or more algorithms that learns by extracting new features to predict outputs.
- the deep learning method comprises one or more layers.
- the deep learning method comprises a neural network (e.g., DNN comprising more than one layer).
- the output from a given node is passed on as input to another node.
- the nodes in the network generally comprise input units in an input layer, hidden units in one or more hidden layers, output units in an output layer, or a combination thereof.
- an input node is connected to one or more hidden units.
- one or more hidden units is connected to an output unit.
- the nodes can generally take in input through the input units and generate an output from the output units using an activation function.
- the input or output comprises a tensor, a matrix, a vector, an array, or a scalar.
- the activation function is a Rectified Linear Unit (ReLU) activation function, Gaussian Error Linear Unit (GeLU), a sigmoid activation function, a hyperbolic tangent activation function, or a Softmax activation function.
- the connections between nodes further comprise weights for adjusting input data to a given node (i.e., to activate input data or deactivate input data).
- the weights are learned by the neural network.
- the neural network is trained to learn weights using gradient-based optimizations.
- the gradient-based optimization comprises one or more loss functions.
- the gradient-based optimization is gradient descent, conjugate gradient descent, stochastic gradient descent, or any variation thereof (e.g., adaptive moment estimation (Adam)).
- the gradient in the gradient-based optimization is computed using backpropagation.
- the nodes are organized into graphs to generate a network (e.g., graph neural networks).
- the nodes are organized into one or more layers to generate a network (e.g., feed forward neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), etc.).
- the CNN comprises a one-class CNN or a multi-class CNN.
- the neural network comprises one or more recurrent layers.
- the one or more recurrent layers are one or more long short-term memory (LSTM) layers or gated recurrent units (GRUs).
- the one or more recurrent layers perform sequential data classification and clustering in which the data ordering is considered (e.g., time series data).
- future predictions are made by the one or more recurrent layers according to the sequence of past events.
- the recurrent layer retains important information, while selectively removing what is not essential to the classification.
- the neural network comprises one or more convolutional layers.
- the input and the output are a tensor representing variables or attributes in a data set (e.g., features), which may be referred to as a feature map (or activation map).
- the one or more convolutional layers are referred to as a feature extraction phase.
- the convolutions are one dimensional (1D) convolutions, two dimensional (2D) convolutions, three dimensional (3D) convolutions, or any combination thereof.
- the convolutions are 1D transpose convolutions, 2D transpose convolutions, 3D transpose convolutions, or any combination thereof.
- the layers in a neural network can further comprise one or more pooling layers before or after a convolutional layer.
- the one or more pooling layers reduces the dimensionality of a feature map using filters that summarize regions of a matrix. In some embodiments, this downsamples the number of outputs, and thus reduces the parameters and computational resources needed for the neural network.
- the one or more pooling layers comprises max pooling, min pooling, average pooling, global pooling, norm pooling, or a combination thereof.
- max pooling reduces the dimensionality of the data by taking only the maximum values in the region of the matrix. In some embodiments, this helps capture the most significant one or more features.
- the one or more pooling layers is one dimensional (1D), two dimensional (2D), three dimensional (3D), or any combination thereof.
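The max pooling described above can be illustrated with a minimal 1D sketch; the function name and the window and stride defaults are illustrative choices, not values from the disclosure.

```python
def max_pool_1d(values, window=2, stride=2):
    """1D max pooling: keep only the maximum in each region of the
    input, reducing dimensionality while preserving the strongest
    activations."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, stride)]

max_pool_1d([1, 5, 2, 4, 3, 0])  # yields [5, 4, 3]
```

Min, average, or global pooling differ only in the reduction applied to each region.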
- the neural network can further comprise one or more flattening layers, which can flatten the input to be passed on to the next layer.
- the flattened inputs can be used to output a classification of an object.
- the classification comprises a binary classification or multi-class classification of visual data (e.g., images, videos, etc.) or non-visual data (e.g., measurements, audio, text, etc.).
- the classification comprises binary classification of an image (e.g., cat or dog).
- the classification comprises multi-class classification of a text (e.g., identifying hand-written digits). In some embodiments, the classification comprises binary classification of a measurement. In some examples, the binary classification of a measurement comprises a classification of a system's performance using the physical measurements described herein (e.g., normal or abnormal, normal or anomalous).
- the neural networks can further comprise one or more dropout layers.
- the dropout layers are used during training of the neural network (e.g., to perform binary or multi-class classifications).
- the one or more dropout layers randomly set some weights to 0 (e.g., about 10%, 20%, 30%, 40%, 50%, 60%, 70%, or 80% of weights).
- setting some weights to 0 also sets the corresponding elements in the feature map to 0.
- the one or more dropout layers can be used to prevent the neural network from overfitting.
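A minimal sketch of dropout follows, assuming the common "inverted dropout" formulation in which elements of the feature map are zeroed and the survivors rescaled by 1/(1-rate) so that inference needs no change. The function name and seeding are illustrative; the disclosure specifies only the zeroing fractions.

```python
import random

def dropout(activations, rate=0.5, training=True, seed=None):
    """Inverted dropout sketch: during training, randomly zero a
    fraction of activations and scale survivors by 1/(1-rate); at
    inference the layer is a no-op."""
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Setting `training=False` shows why dropout adds no cost at prediction time.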
- the neural network can further comprise one or more dense layers, which comprises a fully connected network.
- information is passed through a fully connected network to generate a predicted classification of an object.
- the error associated with the predicted classification of the object is also calculated.
- the error is backpropagated to improve the prediction.
- the one or more dense layers comprises a Softmax activation function.
- the Softmax activation function converts a vector of numbers to a vector of probabilities. In some embodiments, these probabilities are subsequently used in classifications, such as classifications of a type or class of a molecule (e.g., calibrator or analyte) as described herein.
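The Softmax conversion of a vector of numbers to a vector of probabilities can be written in a few lines; this is a generic sketch of the standard function, not the disclosed implementation, with the maximum subtracted first for numerical stability.

```python
import math

def softmax(logits):
    """Convert a vector of numbers into a vector of probabilities that
    sum to 1; subtracting the max first avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting probabilities are what a final dense layer would report as per-class classification confidence.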
- the model comprises multi-modality models.
- multimodality models can be extremely powerful. Different modalities provide supportive, complementary, or even completely orthogonal signals to the model.
- Multi-modality models allow the model to be used for a variety of downstream tasks that might benefit from some or all of the input modalities.
- Intermediate features and terminal embeddings from each model are fused.
- the fused representation is then used to train subsequent models for various tasks including regression, classification, generation and dimensionality reduction.
- the entire network and sub-models can be fine-tuned for specific tasks, or the sub-models can be frozen and only the heads trained and/or fine-tuned.
- the modularity offers the flexibility of interchanging a sub-model with higher-performing models as they become available or are designed.
- Sub-models can take any form, such as, but not limited to, CNN, Transformer, MLP, etc. Each module can then be used to generate embeddings for new, unseen data that can then be used for downstream tasks.
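The fusion of intermediate features and terminal embeddings described above might, under the simplest assumption of late fusion by concatenation, look like the sketch below. Concatenation is one assumed fusion choice among many; the disclosure leaves the fusion method open, and the function name is hypothetical.

```python
def fuse_embeddings(*embeddings):
    """Late-fusion sketch: concatenate per-modality embeddings into a
    single representation that downstream heads (regression,
    classification, generation, dimensionality reduction) consume."""
    fused = []
    for e in embeddings:
        fused.extend(e)
    return fused
```

Because each modality contributes a contiguous slice of the fused vector, a sub-model can be swapped out without disturbing the others, which is the modularity benefit noted above.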
- the training data may be designed based on one or more considerations.
- Considerations may comprise, by way of non-limiting example, effective LC separation of the broadest range of analytes, instrumental conditions for collective sensitivity of all analytes (ionization mode, RT, extracted ion chromatogram for each analyte), inherent range (high and low) of instrument detection (for each analyte), resolving power of the mass spectrometer, length of time between injections (acquisition and column equilibration), stability and reproducibility over long acquisition times, MS/MS parameters (e.g. isolation windows for data independent analysis (DIA)), and/or use of spiked-in non-endogenous QC analytes to demarcate between sample issues and instrument issues.
- training data may comprise raw spectra comprising data on a plurality of samples collected from populations of subjects with one or more known conditions.
- the instruments can comprise two or more different mass spectrometer types (e.g. ion trap, orbitrap, FT-ICR, time-of-flight (ToF), or quadrupole time-of-flight (QTOF) mass spectrometers).
- the instruments can comprise two or more different mass spectrometers of the same type. Inclusion of the one or more design considerations in building the training set can produce a model which is capable of accurately classifying a sample obtained from a subject having an unknown condition based on analysis of MS data obtained from the sample.
- a run list of samples is provided by a user interface, for example to facilitate construction of the training set using an MS or LC-MS equipped with an autosampler.
- the user interface comprises information such as sample plate positions, blank positions, number of drawers, number of slots per drawer, columns to run, blank plate number of wells, number of injections, plates between calibration curves, maximum blank well reuse, injection volume, blank frequency, etc.
- the mass accuracy is less than or equal to 75 ppm, less than or equal to 30 ppm, less than or equal to 15 ppm, less than or equal to 10 ppm, or less than or equal to 5 ppm.
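By way of non-limiting example, the mass accuracy thresholds above can be checked by computing a parts-per-million (ppm) error between an observed and a theoretical m/z value (illustrative sketch; the example values are assumptions):

```python
def mass_error_ppm(observed_mz, theoretical_mz):
    """Parts-per-million mass error between an observed and theoretical m/z."""
    return abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6

# An observed peak at m/z 500.0025 against a theoretical 500.0000
# corresponds to a 5 ppm error, within the 5 ppm threshold above.
err = mass_error_ppm(500.0025, 500.0000)
```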
- methods described herein do not require exact mass (e.g. data from a low resolution mass spectrometer such as a conventional ion trap may be used) in order to provide classification of a condition of a subject based on analysis of a sample obtained from the subject.
- training data can be used to train multi-modal foundation models.
- the foundation models can be trained using metadata inputs comprising MS1 and/or MS2 spectra.
- the underlying architecture is modality agnostic.
- a modality-agnostic foundational model may be trained to understand mass spectra indifferently to whether the spectra are acquired in MS1, MS2, Multiple Reaction Monitoring (MRM), Data-Independent Acquisition (DIA), Data-Dependent Acquisition (DDA), MSn mode, or combinations thereof.
- a multi-modal or mode-agnostic model described herein can translate from one modality to another based at least in part on data describing a joint space between two or more modalities.
- training a model described herein using inputs from a plurality of different modalities reduces or eliminates the need for labeling of training and/or sample data.
- training using a combination of MS1 and MS2 data can reduce or eliminate the need for labeled datasets for a particular downstream application (such as biomarker discovery and/or disease classification).
- use of a multi-modal training regime can significantly reduce a number of empirical data points needed to make a disease classification or discover a biomarker. Such embodiments are particularly advantageous for classification of rare or complex conditions where the availability of controlled empirical data is limited and/or nonexistent.
- multi-modal models allow utilization of mass spectrometry measurements of less than 150 clinical samples (e.g. as few as 10 to 20 samples) to provide accurate characterization of a disease or condition.
- foundational multi-modal models can be fine-tuned using a small number of data points, for example, to train for specific characterizations such as identification of gene labels and/or metabolites.
- multi-modal models are trained using m/z peaks with raw intensity values from a plurality of mass spectrometer operating modes to form a vocabulary of the model.
- continuous values can be converted to discrete inputs, representing intensity, m/z, and/or mode of acquisition.
- chromatographic information can be included to further refine the models, or intentionally excluded to produce a model which is LC agnostic.
- training data comprises millions of paired m/z, intensity data points.
- the precision of data points is compressed by breaking discrete values into the first three and last three digits to reduce the dimensionality of the training set (e.g. from millions of data points to about 1000 different vectors).
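By way of a hedged, non-limiting illustration, the first-three/last-three digit compression might be sketched as follows. The zero-padding and fixed six-digit width below are assumptions, as the exact splitting rule is not specified above:

```python
def split_digits(value, width=6):
    """Illustrative sketch: represent an integer-coded data point by its
    first three and last three digits, so that millions of distinct values
    map onto two small vocabularies of at most 1000 tokens each.

    The zero-padding to a fixed width is an assumption for illustration.
    """
    s = str(value).zfill(width)[:width]
    return s[:3], s[3:]

head, tail = split_digits(123456)
```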
- FIG. 15 shows a computer system 1501 that is programmed or otherwise configured to characterize a condition of a subject using mass spectrometry data obtained by analyzing a sample collected from the subject.
- the computer system 1501 can regulate various aspects of the machine-learning based methods of the present disclosure, such as, for example, providing a model which is capable of providing output information indicative of at least one condition marker or condition state in the subject.
- the computer system 1501 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 1501 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1505, which can be a single core or multi core processor, or a plurality of processors for parallel processing.
- the computer system 1501 also includes memory or memory location 1510 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1515 (e.g., hard disk), communication interface 1520 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1525, such as cache, other memory, data storage and/or electronic display adapters.
- the memory 1510, storage unit 1515, interface 1520 and peripheral devices 1525 are in communication with the CPU 1505 through a communication bus (solid lines), such as a motherboard.
- the storage unit 1515 can be a data storage unit (or data repository) for storing data.
- the computer system 1501 can be operatively coupled to a computer network (“network”) 1530 with the aid of the communication interface 1520.
- the network 1530 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 1530 in some cases is a telecommunication and/or data network.
- the network 1530 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 1530, in some cases with the aid of the computer system 1501, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1501 to behave as a client or a server.
- the CPU 1505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 1510.
- the instructions can be directed to the CPU 1505, which can subsequently program or otherwise configure the CPU 1505 to implement methods of the present disclosure. Examples of operations performed by the CPU 1505 can include fetch, decode, execute, and writeback.
- the CPU 1505 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1501 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
- the storage unit 1515 can store files, such as drivers, libraries and saved programs.
- the storage unit 1515 can store user data, e.g., user preferences and user programs.
- the computer system 1501 in some cases can include one or more additional data storage units that are external to the computer system 1501, such as located on a remote server that is in communication with the computer system 1501 through an intranet or the Internet.
- the computer system 1501 can communicate with one or more remote computer systems through the network 1530.
- the computer system 1501 can communicate with a remote computer system of a user.
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 1501 via the network 1530.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1501, such as, for example, on the memory 1510 or electronic storage unit 1515.
- the machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1505. In some cases, the code can be retrieved from the storage unit 1515 and stored on the memory 1510 for ready access by the processor 1505. In some situations, the electronic storage unit 1515 can be precluded, and machine-executable instructions are stored on memory 1510.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- a machine readable medium, such as computer-executable code, may take many forms, including but not limited to a tangible storage medium, a carrier wave medium, or a physical transmission medium.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 1501 can include or be in communication with an electronic display 1535 that comprises a user interface (UI) 1540 for providing, for example, information concerning a condition of a sample obtained from a subject.
- Examples of UI’s include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 1505.
- the algorithm can, for example, be configured to perform any of the methods described herein.
- Example 1: Direct Disease Classification from Raw Mass-Spectrometry Data using Self-Supervised Deep Learning
- Deep learning has made great strides in many areas, but in proteomics, adoption has been limited to a small number of applications, such as prediction of chromatographic retention time and product ion intensities for specific ions.
- An untapped potential of deep learning was demonstrated by building a model that classifies patients into groups of cancer patients and groups of normal subjects by directly analyzing data-independent acquisition (DIA) data, without needing any prior proteomics knowledge.
- Transformer encoders were used for encoding DIA data. To facilitate processing of a large data set, the encoders were laid out in a hierarchy according to the arrangements described in FIGs. 1A and 2. The level-1 transformer encoded each MS/MS spectrum, and the level-2 transformer encoded a sequence of level-1 outputs. Both encoders were trained in a self-supervised fashion, learning the distribution itself without externally added labels. In the self-supervised training, novel optimization objectives were added on top of the typical objective of predicting hidden input. After training each level in sequence, the top-level classifier was fine-tuned along with the level-2 transformer. The labels used in the final fine-tuning step are the only external information injected into the model.
- a machine-learning model termed Spectrum is All you Need (SAN) was designed which aims to analyze MS data using deep learning, with minimal domain knowledge.
- a transformer, which is widely used in natural language processing (NLP), computer vision, speech processing, as well as in bioinformatics, was used as the main engine of the architecture. The use of a transformer resulted in several interesting design decisions:
- Tokenization: each spectrum was converted to a sequence of tokens, similar to a sentence being converted to tokens in NLP.
- Self-supervised training: Transformers ordinarily require a lot of examples to train properly. Due to the wealth of data available by mass spectrometric analysis of samples, particularly when paired with additional separations (e.g. chromatography, ion mobility, etc.), even a single sample comprises a very large data set. Accordingly, self-supervised training was used to decrease the number of experimental data points required. Several self-supervised objectives were devised and used to reduce the total number of training points required to build an accurate model.
- Tokenization: A unique tokenization procedure was utilized. Mass spectra generally include a set of peaks, where each peak has a mass-to-charge ratio (m/z) and an intensity. Each tandem mass (MS/MS) spectrum was converted to a sequence of tokens by first sorting all the peaks in decreasing order of intensity (i.e. most intense peaks first), and then converting peaks to token ids by rounding the m/z value to the closest integer after multiplying by a scaling constant (0.9995 by default). Using this tokenization scheme, peaks whose m/z values were close together were assigned the same token id, indicating that high resolution isn't necessarily needed to provide accurate classification. Each token then became a categorical variable which does not carry any explicit information about the m/z value. All relationships among tokens were then discovered by the transformer from scratch by looking at the data.
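The tokenization just described can be sketched as follows (illustrative Python; nearest-integer rounding and the example peak values are assumptions for illustration):

```python
def spectrum_to_tokens(peaks, scale=0.9995):
    """Convert one MS/MS spectrum to a token sequence: sort peaks by
    decreasing intensity, then round each scaled m/z to the nearest
    integer to obtain the token id.

    peaks: iterable of (mz, intensity) pairs.
    """
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [round(mz * scale) for mz, _ in ordered]

# Two peaks whose m/z values are close together (500.2 and 500.6)
# collapse onto the same token id after scaling and rounding.
tokens = spectrum_to_tokens([(500.2, 10.0), (500.6, 80.0), (244.1, 55.0)])
```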
- A transformer's memory usage increases as O(n²), where n is the length of the input sequence.
- up to a certain input length, a transformer model is trainable with a local GPU, but beyond that it is not practical or feasible.
- the SAN implementation structures input data into three levels of a hierarchy (see FIG. 2).
- Level 1: a transformer learned to encode each individual MS/MS spectrum.
- Level 2: another transformer learned to encode a sequence of spectra, i.e. a sequence of L1 outputs.
- Level 2.5: a simple linear classifier learned to classify disease status (as cancerous or disease-free) from a sequence of L2 outputs.
- a pancreatic cancer dataset was used containing 118 raw mass spectrometry files collected from the same number of samples.
- the gradient length of the LC-MS used to collect the files was 180 min, resulting in approximately 231K spectra across 70 isolation windows.
- Raw files were converted to mzML using msconvert. For the conversion, CWT peak picking was selected. m/z and intensity values were written as single-precision floats, and zero samples (zero-intensity peaks) were removed.
- the full conversion was performed using the following msconvert settings: --64 --mz32 --inten32 --filter "peakPicking cwt" --filter "zeroSamples removeExtra"
- the dataset was stratified-split into an 80% train, 10% validation, and 10% test set using the healthy vs. cancer label with the sklearn library.
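The stratified split can be illustrated as follows. The work above used the sklearn library; this pure-Python sketch shows the underlying idea of splitting each class proportionally into train, validation, and test sets:

```python
import random

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Minimal stratified 80/10/10 split sketch. Returns index lists for
    train / validation / test with each class split proportionally."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n = len(idxs)
        n_train = int(n * fractions[0])
        n_val = int(n * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test

# 118 samples, as in the dataset above; the 60/58 class split is illustrative.
labels = ["healthy"] * 60 + ["cancer"] * 58
tr, va, te = stratified_split(labels)
```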
- a tokenization process converted mzML files to Python pickle files.
- the pickle file contained a list, whose elements are dicts.
- each dict had one key, 'token', whose value contained a list of numpy arrays.
- each numpy array held the token ids of one spectrum.
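By way of non-limiting illustration, the nested container just described can be sketched with hypothetical values (plain lists stand in for numpy arrays; the token ids are illustrative assumptions):

```python
import pickle

# A list over isolation windows; each element is a dict with key 'token'
# whose value is a list of per-spectrum token-id sequences.
data = [
    {"token": [[500, 244, 500], [388, 129]]},   # isolation window 0
    {"token": [[612, 377]]},                    # isolation window 1
]

# Indexing follows data[window_idx]['token'][spectrum_idx][peak_idx]
first_peak_token = data[0]["token"][0][0]

# Round-tripping through pickle preserves the structure.
restored = pickle.loads(pickle.dumps(data))
```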
- the pickle object was referenced by data[window_idx]['token'][spectrum_idx][peak_idx]. The following steps were then performed:
- a transformer encoder was trained with two tasks: masked token prediction and adjacency prediction. The two cross-entropy losses were added with equal weights. Adjacency was labeled into three classes: unrelated, adjacent in the horizontal (time) axis, and adjacent in the vertical (isolation window) axis.
- The target adjacency label was sampled with equal weights. Using the target label, two spectra were sampled from the dataset. One summary token and the two spectra were concatenated, and 10% of the tokens from each spectrum were masked. The token type id was set to 0 for the summary token and the first spectrum; the second spectrum had a token type id of 1. Token embedding was followed by layer norm and dropout. The transformer block was configured to have 6 layers, 512 hidden dimensions, 3*512 intermediate dimensions, 8 attention heads, and absolute position encoding.
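The input assembly just described can be sketched as follows (illustrative Python; the special-token names and example token ids are assumptions, and the transformer itself is omitted):

```python
import random

SUMMARY, MASK = "[SUM]", "[MASK]"

def build_level1_input(spec_a, spec_b, mask_frac=0.10, seed=0):
    """Sketch of the level-1 input assembly: one summary token and two
    tokenized spectra are concatenated, about 10% of each spectrum's tokens
    are masked, and token-type ids mark the summary token plus the first
    spectrum as 0 and the second spectrum as 1."""
    rng = random.Random(seed)

    def mask(spec):
        out = list(spec)
        n_mask = max(1, int(len(out) * mask_frac))
        for i in rng.sample(range(len(out)), n_mask):
            out[i] = MASK
        return out

    tokens = [SUMMARY] + mask(spec_a) + mask(spec_b)
    type_ids = [0] * (1 + len(spec_a)) + [1] * len(spec_b)
    return tokens, type_ids

tokens, type_ids = build_level1_input([500, 244, 500], [388, 129])
```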
- The token prediction head was a dot product with the token embedding table, followed by a bias layer. A linear head was used for adjacency prediction.
- Level 1: After level 1 training was finished, the level 1 transformer was fixed and the dataset was encoded using the transformer. As level 1 is trained using two spectra, two spectra are fed to the encoder and the output of the adjacency prediction token is taken. This reduced the sequence length of an isolation window from 2.2K to 1.1K. After pre-encoding, two spectra were represented by one 512-dimensional vector.
- Level 2 was constructed similarly to level 1. The core difference is that the input elements were already high-dimensional vectors, so token embedding was not needed. Masked token prediction in level 1 was replaced with masked vector prediction, and adjacency prediction was replaced with inter-intra person prediction.
- Masked vector prediction: The masked input vector was replaced by a learnable vector. The output of the encoder was dot-producted with the original contents of the vector, and cross-entropy loss was used. In level 1, the dot product was taken across the token space, i.e. ~2000 categories in total. In level 2, the dot product was taken across masked inputs, including the other masked inputs in the batch. As the batch size gets larger, the difficulty of the task increases, as does the quality of the training signal.
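The dot-product scoring underlying masked vector prediction can be sketched as follows (illustrative; the toy 2-dimensional vectors are assumptions, and the encoder itself is omitted):

```python
def masked_vector_logits(encoder_output, candidates):
    """Sketch of level-2 masked-vector prediction scoring: the encoder
    output at a masked position is dot-producted against candidate vectors
    (the true masked input plus other masked inputs in the batch); a
    cross-entropy loss over these logits trains the model to pick out the
    original vector."""
    return [sum(a * b for a, b in zip(encoder_output, c)) for c in candidates]

# Toy example: the encoder output is closest to the true masked vector,
# so the first logit is the largest.
true_vec = [1.0, 0.0]
in_batch = [[0.0, 1.0], [0.5, 0.5]]
logits = masked_vector_logits([0.9, 0.1], [true_vec] + in_batch)
```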
- Inter-intra person prediction: Given two sequences of vectors from two isolation windows, the model was asked to guess whether the two sequences stem from one person or two persons.
- Target inter-intra person prediction label was sampled randomly with equal weights.
- Model parameters were extracted from a model trained to classify whether or not a subject has cancer. Results are shown in FIGs. 12-14. Extraction of the model parameters allows for identification of biomarkers which are indicative of either a healthy subject or a subject with cancer.
- the number of examples in a typical clinical study of a condition of a subject is on the order of patient samples times isolation windows, which is less than 10K sample points for a typical dataset that can be used to train the second level of the hierarchy. Additionally, the level-1 model is frozen after pre-training, is not updated during the fine-tuning stage, and does not receive or utilize any label information.
- the transformer encoder learns spectrum-level information, allowing for direct fine-tuning, which means that label information is injected into the model.
- the use of a random forest as an information aggregator facilitates back-tracing of results. For example, inspecting feature importance of the model can reveal which spectrum is important and contributes to the final output the most.
- a random forest model treats each score as a unique feature. If one sample has retention time offset relative to other samples, spectra and their scores will propagate as an offset in the feature space of the random forest. A simple retention time alignment step was added to cancel out retention drift as much as possible, as illustrated in the dataset section below.
- Example 4: Some implementations of the Example scheme included a transformer model that takes multiple spectra as input, which can absorb some of the remaining offset in the input, as described in the model section below.
- Dataset preparation: Raw data files were centroided and deisotoped.
- a trim step was extended to compensate for retention time offset by calculating an offset for each sample, which is added to the trimming range.
- a simple data-driven method was used as follows: Each DIA run was converted to a 3D matrix whose axes were isolation window index, retention time index, and binned m/z index. The value was the log of the peak intensity, or 0 if there was no peak at the given index.
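The 3D-matrix conversion just described can be sketched as follows (illustrative Python; the bin width, m/z origin, and example peak are assumptions for illustration):

```python
import math

def dia_run_to_matrix(peaks, n_windows, n_rt, n_mz_bins,
                      mz_min=100.0, bin_width=1.0):
    """Sketch of the data-driven representation above: a DIA run becomes a
    3D array indexed by (isolation window, retention-time index, binned
    m/z), holding the log peak intensity, or 0 where no peak falls.

    peaks: iterable of (window_idx, rt_idx, mz, intensity) tuples.
    """
    mat = [[[0.0] * n_mz_bins for _ in range(n_rt)] for _ in range(n_windows)]
    for w, t, mz, inten in peaks:
        b = int((mz - mz_min) / bin_width)
        if 0 <= b < n_mz_bins:
            mat[w][t][b] = math.log(inten)
    return mat

# A single peak at m/z 150.4 with intensity e lands in bin 50 with log value 1.
mat = dia_run_to_matrix([(0, 2, 150.4, math.e)],
                        n_windows=2, n_rt=5, n_mz_bins=200)
```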
- the basic principle of converting a spectrum into a sequence was performed substantially as in Example 3. Each peak was sorted by intensity rank order and mapped to a fixed-size m/z bin. An example of this is illustrated in FIG. 18.
- a classification token is prepended. The token encourages the model to summarize the information of the whole spectrum, and its encoded output is fed to a linear classifier.
- the Example model was capable of handling more than one spectrum at a time, which can extract more information from multiple time-adjacent spectra. Multiple sequences from multiple spectra are concatenated to form one sequence, with an END token inserted between them.
- the number of spectra to combine is a hyper-parameter. Different values have been tried and evaluated, ranging from 1 to 6.
- in the pre-training step, 15% of tokens are randomly hidden, similar to masked language model training.
- the diagram illustrated in FIG. 19 shows a conceptual analogy between sentence and spectrum pre-training. Putting the spectrum example above in sequence form, [CLASS, C, E, MASK, E, END], the model is trained to predict the MASK token, with the expected golden answer being B.
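The masking step can be sketched as follows (illustrative Python; the rule of never masking the special CLASS/END tokens is an assumption, and the example sequence follows the one above):

```python
import random

CLASS, END, MASK = "[CLASS]", "[END]", "[MASK]"

def mask_sequence(tokens, frac=0.15, seed=0):
    """Pre-training sketch: hide ~15% of the content tokens (the special
    CLASS/END tokens are assumed to be exempt) and record the hidden
    originals, which the model is then trained to predict."""
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t not in (CLASS, END)]
    n_mask = max(1, int(len(candidates) * frac))
    targets = {}
    for i in rng.sample(candidates, n_mask):
        targets[i] = out[i]
        out[i] = MASK
    return out, targets

# The sequence from the example above, before masking: [CLASS, C, E, B, E, END]
masked, targets = mask_sequence([CLASS, "C", "E", "B", "E", END])
```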
- the Example model learns the general distribution of the input data during the pre-training step. In the fine-tuning step, the model learns how the label is related to the input and is trained to predict the label given the input. To do that, the output of the CLASS token is fed to a linear classifier that predicts the label.
- the example model is trained with more diverse spectra. However, only a small portion of spectra, those that come from a specific isolation window and retention time, have mutual information with the label. Other spectra can be low-quality examples, and feeding them might have a negative impact.
- the example model is fine-tuned with spectra from a specific isolation window and retention time. This prevents high-fidelity information examples from being mixed with low-fidelity information examples. However, the number of training examples used can be significantly smaller.
- Plasma vs. Serum: Compared to a plasma blood sample, serum goes through extra processing; fibrinogen protein is known to be filtered out. A SAN model as described above was trained with a dataset of mixed plasma and serum samples. A label was assigned to indicate whether each sample is plasma or serum.
- the accuracy and AUC for the test split was around 0.99.
- FIG. 20A shows the feature importance of the random forest classifier.
- the random forest model treats each isolation window index and spectrum (retention time) index as a separate input feature. Among thousands of such features, the model finds important ones that are helpful to predict the label. Important features were compared against a list of peptides in fibrinogen alpha and gamma protein. The list was produced by the DIA-NN tool and has isolation window and retention time information.
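The back-tracing of important features to (isolation window, retention time) coordinates can be sketched as follows (illustrative; the window-major feature layout and the example importance values are assumptions):

```python
def top_features(importances, n_rt, k=3):
    """Sketch of back-tracing: each random-forest feature index is assumed
    to encode one (isolation window, retention-time) position in
    window-major order. The top-k importances are mapped back to those
    coordinates for comparison against, e.g., a DIA-NN peptide list."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return [(i // n_rt, i % n_rt, importances[i]) for i in ranked[:k]]

# 2 isolation windows x 4 retention-time indices -> 8 features.
imps = [0.01, 0.02, 0.40, 0.01, 0.30, 0.05, 0.01, 0.20]
best = top_features(imps, n_rt=4, k=3)
```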
- FIG. 20A shows that the SAN feature list overlaps with the peptide list from the DIA-NN tool.
- FIG. 20B shows that a few top features overlap with DIA-NN peptide list.
- FIG. 21 shows that a few top features overlap with DIA-NN peptide list.
Abstract
Described herein are methods and systems for characterizing one or more conditions of a subject based on analysis of biological samples obtained from the subject by mass spectrometry.
Description
METHODS AND SYSTEMS FOR CLASSIFICATION OF A CONDITION USING
MASS SPECTROMETRY DATA
CROSS-REFERENCE
[0001] This application claims the benefit of U.S. Provisional Application No. 63/410,054, filed September 26, 2022, and U.S. Provisional Application No. 63/531,910, filed August 10, 2023, each of which is incorporated herein by reference in its entirety.
BACKGROUND
[0002] Mass spectrometry (MS) is an analytical technique that measures the mass-to-charge ratio (m/z) of molecules in a sample, providing accurate and specific measurements of molecules even at trace levels. Mass spectrometry is often coupled with liquid chromatography (LC) in biological and clinical studies, which provides additional information on molecules based on retention time and can improve signal-to-noise ratios and reduce matrix effects observed by the mass spectrometer. Improvements in mass spectrometers, such as high-resolution instruments, and faster and more efficient chromatographic methods have greatly expanded the wealth of information that can be gained through mass spectrometry. Despite the advances made over the past few decades, however, much of the information that can be gained goes unutilized due to the challenging complexity of interpreting mass spectra, particularly when the mass spectrometer is utilizing chromatography. Accordingly, improved methods of analyzing the wealth of data available from mass spectrometry of biological samples are needed.
SUMMARY
[0003] In one aspect, described herein are methods of characterizing a condition of a subject using mass spectrometry data. In some embodiments, the method comprises obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer, wherein the raw mass spectra comprise ion m/z values and intensities, wherein an experimental m/Δm resolving power of the mass spectrometer is about 500-2,000,000 at m/z 200. In some embodiments, the method comprises providing a machine learning model comprising one or more transformers that are trained on a raw mass spectra training dataset for characterization of the condition of the subject. In some embodiments, raw mass spectra are converted to preprocessed mass spectra by an automated algorithm. In some embodiments, the automated algorithm comprises a de-isotoping, a de-charging, or a de-adducting algorithm. In some embodiments, the method comprises using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition marker or condition state in the subject.
[0004] In some embodiments, the method comprises providing the information to a user via a graphical user interface. In some embodiments, the experimental m/Δm resolving power is about 500-1,000,000 at m/z 200. In some embodiments, the experimental m/Δm resolving power is about 500-30,000 at m/z 200. In some embodiments, the experimental m/Δm resolving power is about 500-5,000 at m/z 200.
[0005] In some embodiments, the condition comprises a disease. In some embodiments, the condition comprises an age state of the subject. In some embodiments, the condition comprises a progression-free survival of the subject.
[0006] In some embodiments, the machine learning model comprises a plurality of transformers. In some embodiments, the plurality of transformers are arranged in a hierarchy comprising a first and second transformer arranged in a hierarchy such that an output of the first transformer is used as an input of the second transformer. In some embodiments, the one or more raw mass spectra are tokenized prior to submission to the one or more transformers. In some embodiments, the one or more transformers are arranged in a hierarchy with a linear classifier and a random forest aggregator.
[0007] In some embodiments, the machine learning model further comprises a linear classifier. In some embodiments, the machine learning model further comprises a neural radiance field. In some embodiments, the machine learning model further comprises a multi-layer neural network. In some embodiments, the machine learning model further comprises a decision tree. In some embodiments, the machine learning model further comprises a support vector machine.
[0008] In some embodiments, the one or more raw mass spectra comprise MS/MS spectra. In some embodiments, the one or more raw mass spectra comprise MSn spectra. In some embodiments, the MS/MS or MSn spectra are acquired in a data independent manner.
[0009] In some embodiments, the machine learning model is trained with at least 10,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 50,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 100,000 individual mass spectra per day.
[0010] In another aspect, described herein are machine-learning based methods of characterizing a condition of a subject. In some embodiments, the method comprises obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer. In some embodiments, the method comprises providing a machine learning model comprising a plurality of transformers that are arranged in a hierarchy and trained on a raw mass spectra training dataset for characterization of the condition. In some embodiments, the method comprises using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition or condition state in the subject.
[0011] In some embodiments, the hierarchy comprises a first and a second transformer such that an output of the first transformer is used as an input of the second transformer. In some embodiments, the hierarchy further comprises a linear classifier, the linear classifier being arranged in the hierarchy such that an output of the second transformer is used as an input of the linear classifier.
[0012] In some embodiments, the hierarchy further comprises a neural radiance field. In some embodiments, the neural radiance field is arranged in the hierarchy such that an output of the second transformer is used as an input of the neural radiance field. In some embodiments, a neural radiance field replaces one or more of the transformers described herein.
[0013] In some embodiments, the hierarchy further comprises a multi-layer neural network. In some embodiments, the multi-layer neural network is arranged in the hierarchy such that an output of the second transformer is used as an input of the multi-layer neural network. In some embodiments, the multi-layer neural network replaces one or more of the transformers described herein.
[0014] In some embodiments, the hierarchy further comprises a decision tree, the decision tree being arranged in the hierarchy such that an output of the second transformer is used as an input of the decision tree. In some embodiments, the hierarchy further comprises a support vector machine, the support vector machine being arranged in the hierarchy such that an output of the second transformer is used as an input of the support vector machine. In some embodiments, the first transformer classifies tokenized data based on an MS/MS isolation window. In some embodiments, the classification performed by the first transformer is a summarization of tokenized data from the same MS/MS isolation window.
[0015] In some embodiments, the second transformer classifies a vector output of the first transformer based upon a sample identity. In some embodiments, the classification performed by the second transformer is a summarization of data comprising samples obtained from the same subject. In some embodiments, the sample identity comprises an identity of the subject from which the sample was obtained.
[0016] In some embodiments, the linear classifier classifies the disease or disease state based on the vector output from the second transformer.
[0017] In some embodiments, the raw mass spectra comprise MS/MS spectra that are acquired in a data independent manner. In some embodiments, the machine learning model is trained with at least 10,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 50,000 individual mass spectra per day. In some embodiments, the machine learning model is trained with at least 100,000 individual mass spectra per day.
[0018] In some embodiments, the condition comprises a disease. In some embodiments, the condition comprises an age state of the subject. In some embodiments, the condition comprises a progression-free survival of the subject.
[0019] In another aspect, described herein are methods of characterizing a condition of a subject using a high throughput trained machine learning model. In some embodiments, the method comprises obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer. In some embodiments, the method comprises providing the machine learning model that is trained on a raw mass spectra training dataset for characterization of the condition, wherein the machine learning model is trained at a rate of at least 10,000 individual raw mass spectra from the training dataset per day. In some embodiments, the method comprises using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition in the subject.
[0020] In some embodiments, the rate is at least 50,000 individual raw mass spectra from the training set per day. In some embodiments, the rate is at least 100,000 individual raw mass spectra from the training set per day.
[0021] In some embodiments, the machine learning model further comprises a linear classifier. In some embodiments, the one or more raw mass spectra comprise MS/MS spectra. In some embodiments, the machine learning model comprises a plurality of transformers. In some embodiments, the plurality of transformers are arranged in a hierarchy comprising a first and a second transformer, such that an output of the first transformer is used as an input of the second transformer. In some embodiments, the one or more raw mass spectra are tokenized prior to submission to the one or more transformers. In some embodiments, the condition comprises a disease. In some embodiments, the condition comprises an age state of the subject. In some embodiments, the condition comprises a progression-free survival of the subject.
[0022] In some embodiments, the one or more raw mass spectra are tokenized by an MS/MS isolation window and a plurality of m/z values corresponding to detected ions of each of the one or more raw mass spectra. In some embodiments, the one or more raw mass spectra are tokenized such that m/z values with the same unit mass are binned together. In some embodiments, tokenized data comprises multiple entries for the same unit mass. In some embodiments, the multiple entries correspond to separate peaks having the same nominal mass.
[0023] In some embodiments, the one or more raw mass spectra are tokenized using large bins (e.g. bins spanning about 1, 0.7, 0.5, or 0.3 mass units). In some embodiments, the one or more raw mass spectra are tokenized using small bins (e.g. bins spanning about 0.1, 0.01, 0.001 or less mass units). In some embodiments, the one or more raw mass spectra are tokenized using uniform bins. In some embodiments, the one or more raw mass spectra are tokenized using non-uniform bins.
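As one illustration of unit-mass binning, the sketch below converts peaks to integer tokens. The helper `tokenize_spectrum` is hypothetical; ordering tokens by descending intensity is consistent with the token sequence shown in FIG. 1B but is otherwise an assumption, as are the example intensities.

```python
def tokenize_spectrum(peaks, bin_width=1.0):
    """peaks: list of (m/z, intensity) pairs.
    Returns integer bin tokens ordered by descending intensity (assumed ordering)."""
    ordered = sorted(peaks, key=lambda p: -p[1])
    return [int(mz // bin_width) for mz, _ in ordered]

# Spectrum from FIG. 1B, with intensities assumed for illustration
spectrum = [(103.009, 500.0), (231.068, 900.0), (378.136, 300.0)]
print(tokenize_spectrum(spectrum))  # [231, 103, 378]
```

Smaller `bin_width` values (e.g. 0.1 or 0.01 mass units) yield finer-grained tokens at the cost of a larger vocabulary; non-uniform bins would replace the floor division with a lookup against precomputed bin edges.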
[0024] In some embodiments, the machine learning model is trained using self-supervised learning. In some embodiments, measuring the sample by mass spectrometry comprises separating components of the sample using liquid chromatography coupled to a mass spectrometer. In some embodiments, a gradient method of the liquid chromatography runs over a period of at least 15 minutes (e.g. about 15, 30, 60, 90, or 180 minutes). In some embodiments, a gradient method of the liquid chromatography runs over a period of about 5 to 10 minutes (e.g. about 5, 7, or 10 minutes).
[0025] In some embodiments, the information includes presence or absence of the at least one disease or disease state in the subject. In some embodiments, the at least one disease or disease state comprises cancer. In some embodiments, the cancer comprises pancreatic cancer or ovarian cancer. In some embodiments, the cancer comprises breast cancer. In some embodiments, the cancer comprises prostate cancer. In some embodiments, the cancer comprises lung cancer. In some embodiments, the cancer comprises gallbladder cancer. In some embodiments, the condition comprises a plurality of disease states. In some embodiments, the condition is a disease state, and the disease state comprises a responsiveness of a disease to a therapeutic intervention. In some embodiments, the therapeutic intervention is an immunotherapy (e.g. a CAR-T therapy).
[0026] In some embodiments, the information comprises a probability or likelihood of the subject having the at least one disease or disease state. In some embodiments, the information comprises an indication of disease state or disease severity. In some embodiments, the information comprises an indication of disease classification. In some embodiments, the at least one disease or disease state is a cancer and the indication of the disease classification comprises an identification of a cell line genotype or cell line phenotype of the cancer.
[0027] In some embodiments, the information is associated with at least one of a proteomic, a lipidomic, or a metabolomic profile of the sample obtained from the subject. In some embodiments, the machine learning model outputs the information without requiring prior domain knowledge relating to at least one of the proteomic, lipidomic, or metabolomic profile. In some embodiments, an accuracy of the information is at least 70%. In some embodiments, an accuracy of the information is at least 80%. In some embodiments, an accuracy of the information is at least 90%. In some embodiments, an accuracy of the information is at least 95%. In some embodiments, an accuracy of the information is at least 99%.
[0028] In some embodiments, training the machine learning model to determine a presence or absence of the one or more disease conditions requires no more than about 500 experimental
data points. In some embodiments, no more than about 200 experimental data points are required to train the machine learning model. In some embodiments, no more than about 100 experimental data points are required to train the machine learning model.
[0029] In some embodiments, an accuracy of the determination is at least about 70%. In some embodiments, the proteomic profile comprises one or more post-translational modifications (PTMs). In some embodiments, the post-translational modifications comprise one or more phosphorylation, acetylation, ubiquitination, glycosylation, or combination of two or more thereof.
[0030] In some embodiments, training the machine learning model comprises randomly masking about 1-25% (e.g. 1%, 5%, 10%, 15%, 20%, or 25%) of the training set and adding about 1-10% (e.g. about 1%, 2%, 3%, 4%, 5%, or 10%) noise as a means of self-supervised learning.
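The masking-plus-noise corruption described in [0030] can be sketched as follows. This is an illustrative assumption of one way to implement it (the mask token id, vocabulary size, and the order of masking before noising are all choices made for the sketch, not taken from the disclosure).

```python
import numpy as np

rng = np.random.default_rng(42)

def mask_and_noise(tokens, mask_frac=0.15, noise_frac=0.05, mask_id=0, vocab=1000):
    """Randomly mask ~mask_frac of tokens and corrupt ~noise_frac with random tokens,
    as a self-supervised training objective (illustrative sketch)."""
    tokens = np.asarray(tokens).copy()
    n = len(tokens)
    masked = rng.choice(n, size=max(1, int(n * mask_frac)), replace=False)
    tokens[masked] = mask_id                     # positions the model must reconstruct
    noisy = rng.choice(n, size=max(1, int(n * noise_frac)), replace=False)
    tokens[noisy] = rng.integers(1, vocab, size=len(noisy))  # random-token noise
    return tokens, masked

seq = list(range(100, 140))            # 40 toy spectrum tokens
corrupted, masked_idx = mask_and_noise(seq)
print(len(masked_idx))                 # 6 positions masked (15% of 40)
```

During training, the model would be asked to predict the original tokens at the masked positions, which is the self-supervised objective the paragraph describes.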
[0031] In some embodiments, measuring the sample by mass spectrometry comprises separating ions by ion mobility (e.g. by High Field Asymmetric Waveform Ion Mobility Spectrometry (FAIMS) or Drift-tube Ion Mobility Spectrometry) prior to or during acquisition of mass spectra. In some embodiments, a mean average percent error of the information is less than about 30% (e.g. less than 30%, 20%, 15%, 10%, 5%, 3%, 2%, or 1%).
[0032] In some embodiments, adjacent m/z values are not treated as continuous values during the analysis. In some embodiments, the information comprises identification of one or more signals which are determinative of the presence or absence of a particular condition. In some embodiments, the information comprises identification of one or more signals which are indicative of or correlated with a particular state of a particular condition. In some embodiments, the information is used for biomarker discovery.
[0033] In some embodiments, the machine learning model is capable of being trained at a rate of at least 10 training samples per day (e.g. at least 10, 15, 50, 100, 300, 500, or 700 samples per day) when trained using a single GPU or CPU which is no faster, in terms of maximum single precision floating point operations per second, than an NVIDIA RTX A6000 GPU equipped with 48 GB of RAM.
[0034] In another aspect, described herein are non-transitory computer-readable storage media comprising instructions that, when executed by a processor, cause the processor to perform methods described herein.
[0035] In another aspect, described herein are systems configured for characterizing a condition of a subject, the systems comprising: a computer comprising a memory operably coupled to at least one processor; and a module executing in the memory of the computer, the module
comprising program code enabled upon execution by the at least one processor of the computer to perform methods described herein.
[0036] Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
[0037] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
INCORPORATION BY REFERENCE
[0038] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] The novel features of the present disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the present disclosure are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
[0040] FIG. 1A illustrates an exemplary machine learning architecture for classification of one or more conditions of a subject. Input data from a mass spectrometer is provided to a machine learning model, which comprises a hierarchical transformer arrangement. Raw input data is processed by Transformer L1, which provides its output as input to Transformer L2. L2 output can be further processed by additional steps (shown as optional hierarchy layers) which provide their output as input to a final classifier, or L2 can directly output to the input of the final classifier. The classifier then outputs classification information about the one or more conditions of the subject.
[0041] FIG. 1B illustrates an example tokenization of a spectrum. A spectrum having 3 peaks with m/z values 103.009, 231.068, and 378.136 is converted to a sequence of tokens [231, 103, 378].
[0042] FIG. 2 illustrates an example of a machine learning model utilizing hierarchical transformers for classification of the condition of a subject (in this example, identification of disease).
[0043] FIG. 3 illustrates self-supervised training of the level 1 transformer used in the example machine learning model shown in FIG. 2.
[0044] FIG. 4 illustrates self-supervised training of the level 2 transformer used in the example machine learning model shown in FIG. 2.
[0045] FIG. 5 illustrates classification of a condition of a subject from the L2 output.
[0046] FIG. 6 illustrates the level 1 encoder peak prediction accuracy and loss progression as training continues.
[0047] FIG. 7 illustrates the level 1 encoder adjacent spectrum prediction loss and accuracy progression as training continues.
[0048] FIG. 8 illustrates the level 2 encoder spectrum prediction accuracy and loss progression as training continues.
[0049] FIG. 9 illustrates the level 2 encoder inter/intra person prediction loss and accuracy progression as training continues, for test and validation sets.
[0050] FIG. 10 illustrates that the example training and validation sets produced similar accuracy.
[0051] FIG. 11 illustrates example output from an example implementation of the hierarchical transformer scheme shown in FIG. 2.
[0052] FIG. 12 illustrates an example of inspecting weights of the top-level linear model of the hierarchical transformer scheme of the example of FIG. 2. Absolute values of the weights indicate the importance of the input feature, i.e., the level 2 output. A per-isolation-window breakdown reveals which isolation window is more important, providing identification of specific condition markers (e.g. biomarkers).
[0053] FIG. 13 illustrates inspecting the score, i.e., the product of the level 2 output and the weight, of the hierarchical transformer scheme of the example of FIG. 2. The scores are summed together to get the final classification verdict. By breaking the scores down by window, the window which contributed most to the final score can be identified.
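The score inspection described for FIG. 13 amounts to elementwise products summed into a verdict. The numbers below are purely hypothetical, chosen only to show the mechanics of attributing the final score to individual isolation windows.

```python
import numpy as np

# Hypothetical per-window level 2 outputs and linear-classifier weights
l2_output = np.array([0.8, -0.1, 1.2, 0.05])
weights   = np.array([1.5,  0.2, 0.9, -0.3])

scores = l2_output * weights        # per-window contribution to the verdict
total = scores.sum()                # final classification score
top_window = int(np.argmax(np.abs(scores)))  # window contributing most

print(top_window, round(float(total), 3))  # 0 2.245
```

With these toy values, window 0 contributes the largest score magnitude, so it would be flagged as the most informative isolation window for the classification.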
[0054] FIG. 14 illustrates inspection of the attention of the level 2 transformer of the hierarchical transformer scheme of the example of FIG. 2. Given a specific window, the attention score of the level 2 transformer can be inspected to check which regions the model determines to be more important in terms of retention time. The X axis is indicative of time.
[0055] FIG. 15 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
[0056] FIG. 16 illustrates an alternate exemplary machine learning architecture for classification of one or more conditions of a subject. Input data from a mass spectrometer is provided to a machine learning model, which comprises a hierarchical transformer arrangement. Raw input data is processed by Transformer L1, which provides its output as input to a linear classifier (or to additional processing steps, shown as optional hierarchy layers, which feed the linear classifier). The linear classifier outputs are aggregated by a random forest model, which outputs classification information about the one or more conditions of the subject.
[0057] FIG. 17 illustrates a more detailed implementation of the example machine learning model depicted in FIG. 16.
[0058] FIG. 18 illustrates conversion of a spectrum into a sequence useful for training example models described herein.
[0059] FIG. 19 illustrates a conceptual analogy between sentence and spectrum pre-training of models described herein.
[0060] FIG. 20A illustrates example results from a test case of an exemplary machine learning model described herein for Protein P01861.
[0061] FIG. 20B illustrates example results from a test case of an exemplary machine learning model described herein for Protein P08519.
[0062] FIG. 21 illustrates example results from a test case of an exemplary machine learning model described herein for Protein P01861.
[0063] FIG. 22 illustrates the accuracy of an exemplary machine learning model described herein in various test cases.
[0064] FIG. 23 illustrates the accuracy of an exemplary machine learning model described herein in alternate test cases described herein.
DETAILED DESCRIPTION
[0065] While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
[0066] Whenever the term “at least,” “greater than,” or “greater than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “at least,” “greater than” or “greater than or equal to” applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.
[0067] Whenever the term “no more than,” “less than,” or “less than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “no more than,” “less than,” or “less than or equal to” applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.
[0068] Certain inventive embodiments herein contemplate numerical ranges. When ranges are present, the ranges include the range endpoints. Additionally, every subrange and value within the range is present as if explicitly written out. The term “about” or “approximately” may mean within an acceptable error range for the particular value, which will depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” may mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, “about” may mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Where particular values are described in the application and claims, unless otherwise stated, the term “about,” meaning within an acceptable error range for the particular value, may be assumed.
[0069] As used herein, the terms “MS”, “mass spec” and “mass spectrometer” are used interchangeably to refer to a device which separates ions in time, space, or both based on a mass to charge ratio (m/z) of the ions.
[0070] Recognized herein is the need for systems and methods for characterizing a condition of a subject using machine learning techniques coupled with mass spectrometry data of a sample obtained from the subject.
[0071] Metabolomics, lipidomics, and/or proteomics can provide key insights into the health and functionality of a biological system. These tools can provide information useful for assessing the health status of human or animal subjects, as select metabolites, lipids, and proteins serve as biomarkers for various states of disease, malnutrition, or cellular dysfunction. For example, conditions such as diabetes mellitus, metabolic syndrome, renal failure, and hepatic failure present with biomarkers recognizable in blood or urine. Other cellular dysfunctions, such as various cancers, provide biomarker signatures that enable early detection of disease or monitoring of disease progression. Thus, analysis of biomarkers is of key utility for the fields of medical and veterinary science.
[0072] Analysis of biological samples by mass spectrometry provides access to the wealth of information provided by metabolomics, lipidomics, and proteomics. Experiments in biological mass spectrometry can start with a neutral liquid sample and end with the detection of a charged gas phase ion.
[0073] One aspect of the present disclosure provides a method comprising: applying mass spectrometry (MS) to a sample and using a trained machine learning model to determine information about one or more conditions of a sample obtained from a subject.
[0074] In another aspect, described herein are non-transitory computer-readable storage media comprising a set of instructions for executing a method described herein. In some embodiments, the machine learning model is selected from logistic regression, AdaBoost classifier, extra trees classifier, extreme gradient boosting, Gaussian process classifier, gradient boosting classifier, K-nearest neighbor, light gradient boosting, linear discriminant analysis, multi-layer perceptron, naive Bayes, quadratic discriminant analysis, random forest classifier, ridge classifier, SVM (linear and radial kernels), fully connected neural network, or a deep neural network.
[0075] One aspect of the present disclosure provides a system for classification of a condition of a subject based on a sample obtained from the subject comprising: a computing unit operably coupled to a mass spec (MS) machine.
[0076] In some embodiments a sample obtained from a subject can be a cell, a tissue, a urine, a fecal matter, a blood, a blood plasma, a mucus, a saliva, a blood serum, a cerebrospinal fluid, or a cyst fluid.
[0077] Samples may be analyzed using chromatography. In some cases, the chromatography comprises liquid chromatography (LC). Chromatography generally comprises a laboratory technique for the separation of a mixture into its components. A mixture can be dissolved into a mobile phase, which can be carried through a system, such as a column, comprising a fixed stationary phase. The components within the mobile phase may have different affinities to the stationary phase, resulting in different retention times depending on these affinities. As a result, separation of components in the mixture is achieved.
[0078] The separated components from chromatography may be analyzed using a mass spectrometer (MS). The LC output may be passed to an MS either directly or indirectly. Mass spectrometric analysis generally refers to measuring the mass-to-charge ratio of ions (e.g., m/z), resulting in a mass spectrum. The mass spectrum comprises a plot of intensity as a function of mass-to-charge ratio. The mass spectrum may be used to determine elemental or isotopic signatures in a sample, as well as the masses of the components (e.g., particles or molecules) in the mixture. This may be used to determine a chemical identity or structure of the components in the mixture.
[0079] In some cases, one or more acquisition parameters is programmed in the MS. In some instances, the one or more acquisition parameters comprises, for example, the one or more mass acquisition windows, one or more acquisition times for the one or more mass acquisition windows, one or more resolutions for the one or more mass acquisition windows, one or more gain settings for the one or more acquisition windows, one or more ionization polarity settings for the one or more mass acquisition windows, one or more mass resolutions for the one or more mass acquisition windows, or any combination thereof. In some cases, the MS is a high-resolution mass spectrometer. In some cases, the MS is a low-resolution mass spectrometer. In some instances, the high-resolution mass spectrometer has a mass accuracy of less than or equal to 75 ppm, less than or equal to 30 ppm, less than or equal to 15 ppm, less than or equal to 10 ppm, or less than or equal to 5 ppm.
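Mass accuracy in parts per million relates an observed m/z to the exact theoretical value. A minimal sketch of the standard calculation:

```python
def ppm_error(observed_mz, exact_mz):
    """Mass accuracy in parts per million (ppm)."""
    return (observed_mz - exact_mz) / exact_mz * 1e6

# A 0.002 Da error at m/z 200 corresponds to 10 ppm
print(round(ppm_error(200.002, 200.000), 3))  # 10.0
```

A 5 ppm instrument would therefore be expected to report m/z 200 within about ±0.001 Da of the true value.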
[0080] The output signal from the MS can comprise an intensity value, a mass-to-charge ratio, or a combination thereof. In some cases, the output signal from the MS comprises raw, unprocessed MS data. In some cases, the output signal comprises a first signal indicating an intensity value or a mass-to-charge ratio of one or more analytes. In some cases, the output signal comprises a second signal indicating an intensity value or a mass-to-charge ratio of one or more calibrators. In some cases, the output signal comprises the first signal and the second signal. In some instances, the output signal comprises the peak signal intensity obtained for an exact isotopic mass for each of the one or more analytes or one or more calibrators of known molecular weight. In some instances, the output signal comprises combined signals corresponding to one or more mass adducts for the one or more analytes. In some examples, the output signal for the one or more analytes is obtained by calculating the sum of the adduct signals for 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 analyte adducts. In some cases, the analyte adducts correspond to the proton, sodium, potassium, calcium, magnesium, ammonium, nitrate, sulfate, phosphate, acetate, citrate, or formate adducts.
[0081] In some embodiments, the MS is a tandem MS (MS/MS). In MS/MS mode, a tandem MS can be operated such that ions passing through a first mass analyzer are activated, and the m/z values of the activated ions are measured after a fixed amount of time. The second MS produces a mass spectrum comprising the activated ions and any fragments thereof produced during or after the ion activation. Isolation windows can be selected to determine which ions are subjected to activation and subsequent analysis. In a data independent acquisition mode, the isolation windows are fixed by the operator. In a data dependent acquisition mode, the isolation windows can be adjusted during the course of data acquisition, for example to activate the most or least abundant ions in a spectrum for subsequent analysis of fragmentation.
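In data independent acquisition, the fixed, operator-defined isolation windows could be laid out as in the sketch below. The m/z range and window width are illustrative assumptions, not values from the disclosure.

```python
def dia_windows(mz_start=400.0, mz_end=1000.0, width=25.0):
    """Fixed isolation windows for data-independent acquisition (illustrative)."""
    edges = []
    lo = mz_start
    while lo < mz_end:
        edges.append((lo, min(lo + width, mz_end)))
        lo += width
    return edges

windows = dia_windows()
print(len(windows), windows[0], windows[-1])  # 24 (400.0, 425.0) (975.0, 1000.0)
```

In a data dependent scheme, by contrast, the window centers would be recomputed per cycle from the precursor intensities observed in the preceding survey scan rather than precomputed as here.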
[0082] In some cases, the LC-MS method provided herein is optimized for performance on a subset of cellular analytes. In some cases, the LC-MS methods provided herein ionize in both positive and negative modes. In some cases, the LC-MS method provided herein ionizes analytes as molecular ions. In some cases, an ion mobility separation is performed prior to, or during, mass spectrometry analysis.
[0083] The output signal from the MS (e.g., mass spectrum comprising intensity value, mass-to- charge ratio, and/or timing information; or tandem mass spectra) may be processed by a signal processing module. The input to the signal processing module can comprise an input signal comprising an intensity value, a mass-to-charge ratio, timing information, or a combination thereof from the MS.
[0084] In some cases, the input to the signal processing module comprises raw or unprocessed MS data. In some cases, the input is an mzML file comprising the raw, unprocessed MS data. In some cases, the input comprises preprocessed MS data. Preprocessing MS data may comprise data cleaning, data transformation, data reduction, or any combination thereof. In some cases, data cleaning comprises cleaning missing data (e.g., fill in or ignore missing values), noisy data (e.g., binning, regression, clustering, etc.), or a combination thereof. In some cases, data transformation comprises standardization, normalization, attribute selection, discretization, hierarchy generation, or any combination thereof. In some cases, data reduction comprises data aggregation, attribute subset selection, numerosity reduction, dimensionality reduction, or any combination thereof. In some cases, the MS data is preprocessed prior to the signal processing module. In some cases, the MS data is preprocessed in the signal processing module. The signal processing module can comprise a machine learning model. The machine learning model can be trained on MS data. The machine learning model may be a trained machine learning algorithm. The trained machine learning model may be used to determine information about a condition of a sample obtained from a subject.
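As a minimal sketch of the cleaning and transformation steps named above, the helper below imputes missing intensities with zero and normalizes to total signal. These are just two of the many options the paragraph lists (binning, standardization, dimensionality reduction, etc. would be drop-in alternatives), and the function name is hypothetical.

```python
import numpy as np

def preprocess(intensities):
    """Illustrative cleaning + transformation for an intensity vector:
    impute missing values with zero, then normalize to total signal."""
    x = np.asarray(intensities, dtype=float)
    x = np.where(np.isnan(x), 0.0, x)   # data cleaning: fill missing values
    return x / x.sum()                  # data transformation: normalization

spec = [100.0, float("nan"), 300.0]
print(preprocess(spec))
```

A real pipeline would typically chain several such steps and may also apply data reduction (e.g. selecting a subset of m/z features) before the machine learning model sees the data.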
[0085] A machine learning model can comprise a supervised, semi-supervised, unsupervised, or self-supervised machine learning model. In some cases, the one or more ML approaches perform classification or clustering of the MS data. In some examples, the machine learning approach comprises a classical machine learning method, such as, but not limited to, support vector machine (SVM) (e.g., one-class SVM, linear or radial kernels, etc.), K-nearest neighbor (KNN), isolation forest, random forest, logistic regression, AdaBoost classifier, extra trees classifier, extreme gradient boosting, gaussian process classifier, gradient boosting classifier, light gradient boosting, linear discriminant analysis, naive Bayes, quadratic discriminant analysis, ridge classifier, or any combination thereof. In some examples, the machine learning approach comprises a deep leaning method (e.g., deep neural network (DNN)), such as, but not limited to
a fully-connected network, convolutional neural network (CNN) (e.g., one-class CNN), recurrent neural network (RNN), transformer, graph neural network (GNN), convolutional graph neural network (CGNN), multi-layer perceptron (MLP), or any combination thereof.
[0086] In some embodiments, a classical ML method comprises one or more algorithms that learn from existing observations (i.e., known features) to predict outputs. In some embodiments, the one or more algorithms perform clustering of data. In some examples, the classical ML algorithms for clustering comprise K-means clustering, mean-shift clustering, density-based spatial clustering of applications with noise (DBSCAN), expectation-maximization (EM) clustering (e.g., using Gaussian mixture models (GMM)), agglomerative hierarchical clustering, or any combination thereof. In some embodiments, the one or more algorithms perform classification of data. In some examples, the classical ML algorithms for classification comprise logistic regression, naive Bayes, KNN, random forest, isolation forest, decision trees, gradient boosting, support vector machine (SVM), or any combination thereof. In some examples, the SVM comprises a one-class SVM or a multi-class SVM.
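One classical classification method from the list above, K-nearest neighbor (KNN), can be sketched in a few lines. The toy feature vectors and labels below are invented solely for illustration and do not represent real spectral data:

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy 2-D feature vectors (e.g., summarized spectral features) with labels.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = ["normal", "normal", "disease", "disease"]
pred = knn_predict(train_X, train_y, (0.85, 0.85))   # "disease"
```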
[0087] In some embodiments, the deep learning method comprises one or more algorithms that learn by extracting new features to predict outputs. In some embodiments, the deep learning method comprises one or more layers. In some embodiments, the deep learning method comprises a neural network (e.g., DNN comprising more than one layer). In some embodiments, the output from a given node is passed on as input to another node. The nodes in the network generally comprise input units in an input layer, hidden units in one or more hidden layers, output units in an output layer, or a combination thereof. In some embodiments, an input node is connected to one or more hidden units. In some embodiments, one or more hidden units is connected to an output unit. The nodes can generally take in input through the input units and generate an output from the output units using an activation function. In some embodiments, the input or output comprises a tensor, a matrix, a vector, an array, or a scalar. In some embodiments, the activation function is a Rectified Linear Unit (ReLU) activation function, Gaussian Error Linear Unit (GeLU), a sigmoid activation function, a hyperbolic tangent activation function, or a Softmax activation function.
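Several of the activation functions named above have standard closed forms; a minimal sketch using the scalar definitions follows:

```python
import math

def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: 1 / (1 + e^-x), mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, mapping any real number into (-1, 1)."""
    return math.tanh(x)
```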
[0088] The connections between nodes further comprise weights for adjusting input data to a given node (i.e., to activate input data or deactivate input data). In some embodiments, the weights are learned by the neural network. In some embodiments, the neural network is trained to learn weights using gradient-based optimizations. In some embodiments, the gradient-based optimization comprises one or more loss functions. In some embodiments, the gradient-based optimization is gradient descent, conjugate gradient descent, stochastic gradient descent, or any variation thereof (e.g., adaptive moment estimation (Adam)). In some further embodiments, the
gradient in the gradient-based optimization is computed using backpropagation. In some embodiments, the nodes are organized into graphs to generate a network (e.g., graph neural networks). In some embodiments, the nodes are organized into one or more layers to generate a network (e.g., feed forward neural networks, convolutional neural networks (CNNs), recurrent neural networks (RNNs), etc.). In some embodiments, the CNN comprises a one-class CNN or a multi-class CNN.
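The gradient-based optimization described above can be illustrated with a minimal sketch: plain gradient descent on a one-parameter quadratic loss. The loss function and learning rate are invented for illustration only:

```python
# Gradient descent on the loss L(w) = (w - 3)^2, whose gradient is
# dL/dw = 2 * (w - 3). Each step moves w against the gradient.
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w -= lr * grad
# w converges toward the minimizer w = 3
```

Stochastic gradient descent and Adam follow the same update pattern, but estimate the gradient from mini-batches and adapt the step size per parameter.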
[0089] In some embodiments, the neural network comprises one or more recurrent layers. In some embodiments, the one or more recurrent layers are one or more long short-term memory (LSTM) layers or gated recurrent units (GRUs). In some embodiments, the one or more recurrent layers perform sequential data classification and clustering in which the data ordering is considered (e.g., time series data). In such embodiments, future predictions are made by the one or more recurrent layers according to the sequence of past events. In some embodiments, the recurrent layer retains important information, while selectively removing what is not essential to the classification.
[0090] In some embodiments, the neural network comprises one or more convolutional layers. In some embodiments, the input and the output are a tensor representing variables or attributes in a data set (e.g., features), which may be referred to as a feature map (or activation map). In such embodiments, the one or more convolutional layers are referred to as a feature extraction phase. In some embodiments, the convolutions are one dimensional (1D) convolutions, two dimensional (2D) convolutions, three dimensional (3D) convolutions, or any combination thereof. In further embodiments, the convolutions are 1D transpose convolutions, 2D transpose convolutions, 3D transpose convolutions, or any combination thereof.
[0091] The layers in a neural network can further comprise one or more pooling layers before or after a convolutional layer. In some embodiments, the one or more pooling layers reduces the dimensionality of a feature map using filters that summarize regions of a matrix. In some embodiments, this downsamples the number of outputs, and thus reduces the parameters and computational resources needed for the neural network. In some embodiments, the one or more pooling layers comprises max pooling, min pooling, average pooling, global pooling, norm pooling, or a combination thereof. In some embodiments, max pooling reduces the dimensionality of the data by taking only the maximum values in the region of the matrix. In some embodiments, this helps capture the most significant one or more features. In some embodiments, the one or more pooling layers is one dimensional (1D), two dimensional (2D), three dimensional (3D), or any combination thereof.
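The max pooling operation described above can be sketched directly on a small feature map. The 4x4 matrix and 2x2 pool size are arbitrary illustrative choices:

```python
def max_pool_2d(m, size=2):
    """Max pooling: each size x size region of the matrix is summarized
    by its maximum value (stride equal to the pool size)."""
    rows, cols = len(m), len(m[0])
    return [
        [max(m[r + dr][c + dc] for dr in range(size) for dc in range(size))
         for c in range(0, cols, size)]
        for r in range(0, rows, size)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
pooled = max_pool_2d(fmap)   # [[4, 2], [2, 8]]
```

Note how the 4x4 input is reduced to 2x2: each output value summarizes one region, which is the downsampling effect described above.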
[0092] The neural network can further comprise one or more flattening layers, which can flatten the input to be passed on to the next layer. In some embodiments, an input (e.g., feature
map) is flattened by reducing the input to a one-dimensional array. In some embodiments, the flattened inputs can be used to output a classification of an object. In some embodiments, the classification comprises a binary classification or multi-class classification of visual data (e.g., images, videos, etc.) or non-visual data (e.g., measurements, audio, text, etc.). In some embodiments, the classification comprises binary classification of an image (e.g., cat or dog). In some embodiments, the classification comprises multi-class classification of a text (e.g., identifying hand-written digits). In some embodiments, the classification comprises binary classification of a measurement. In some examples, the binary classification of a measurement comprises a classification of a system’s performance using the physical measurements described herein (e.g., normal or abnormal, normal or anomalous).
[0093] The neural networks can further comprise one or more dropout layers. In some embodiments, the dropout layers are used during training of the neural network (e.g., to perform binary or multi-class classifications). In some embodiments, the one or more dropout layers randomly set some weights to 0 (e.g., about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% of weights). In some embodiments, setting some weights to 0 also sets the corresponding elements in the feature map to 0. In some embodiments, the one or more dropout layers can be used to prevent the neural network from overfitting.
[0094] The neural network can further comprise one or more dense layers, which comprises a fully connected network. In some embodiments, information is passed through a fully connected network to generate a predicted classification of an object. In some embodiments, the error associated with the predicted classification of the object is also calculated. In some embodiments, the error is backpropagated to improve the prediction. In some embodiments, the one or more dense layers comprises a Softmax activation function. In some embodiments, the Softmax activation function converts a vector of numbers to a vector of probabilities. In some embodiments, these probabilities are subsequently used in classifications, such as classifications of a type or class of a molecule (e.g., calibrator or analyte) as described herein.
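The Softmax conversion of a vector of numbers to a vector of probabilities, as described above, has a standard form; a minimal sketch (with the common max-subtraction trick for numerical stability) follows. The logit values are illustrative:

```python
import math

def softmax(logits):
    """Convert a vector of numbers to a vector of probabilities
    that are non-negative and sum to 1."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # largest logit gets the largest probability
```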
[0095] In some cases, the model comprises multi-modality models. In some cases, multi-modality models can be extremely powerful. Different modalities provide supportive, complementary, or even completely orthogonal signals to the model. Multi-modality models allow the model to be used for a variety of downstream tasks that might benefit from some or all of the input modalities. Intermediate features and terminal embeddings from each model are fused. The fused representation is then used to train subsequent models for various tasks including regression, classification, generation, and dimensionality reduction. The entire network and sub-models can be fine-tuned for specific tasks, or the sub-models can be frozen and only the heads trained and/or fine-tuned. The modularity offers the flexibility of interchanging a sub-model with higher-performing models as they become available or are designed. Sub-models can take any form, such as, but not limited to, CNN, Transformer, MLP, etc. Each module can then be used to generate embeddings for new unseen data that can then be used for downstream tasks. [0096] The training data may be designed based on one or more considerations. Considerations may comprise, by way of non-limiting example, effective LC separation of the broadest range of analytes, instrumental conditions for collective sensitivity of all analytes (ionization mode, RT, extracted ion chromatogram for each analyte), inherent range (high and low) of instrument detection (for each analyte), resolving power of the mass spectrometer, length of time between injections (acquisition and column equilibration), stability and reproducibility over long acquisition times, MS/MS parameters (e.g., isolation windows for data-independent acquisition (DIA)), and/or use of spiked-in non-endogenous QC analytes to demarcate between sample issues and instrument issues.
[0097] For example, training data may comprise raw spectra comprising data on a plurality of samples collected from populations of subjects with one or more known conditions. The instruments can comprise two or more different mass spectrometer types (e.g. ion trap, orbitrap, FT-ICR, time-of-flight (ToF), or QQQ-time-of-flight (QTOF) mass spectrometers). The instruments can comprise two or more different mass spectrometers of the same type. Inclusion of the one or more design considerations in building the training set can produce a model which is capable of accurately classifying a sample obtained from a subject having an unknown condition based on analysis of MS data obtained from the sample.
[0098] In some cases, a run list of samples is provided by a user interface, for example to facilitate construction of the training set using an MS or LC-MS equipped with an autosampler. In some instances, the user interface comprises information such as sample plate positions, blank positions, number of drawers, number of slots per drawer, columns to run, blank plate number of wells, number of injections, plates between calibration curves, maximum blank well reuse, injection volume, blank frequency, etc.
[0099] In some embodiments, the mass accuracy is less than or equal to 75 ppm, less than or equal to 30 ppm, less than or equal to 15 ppm, less than or equal to 10 ppm, or less than or equal to 5 ppm.
[0100] In some embodiments, methods described herein do not require exact mass (e.g. data from a low resolution mass spectrometer such as a conventional ion trap may be used) in order to provide classification of a condition of a subject based on analysis of a sample obtained from the subject.
[0101] In some cases, training data can comprise multi-modal foundation models. The foundation models can be trained using metadata inputs comprising MS1 and/or MS2 spectra.
In some instances, the underlying architecture is modality-agnostic. For example, a modality-agnostic foundational model may be trained to understand mass spectra indifferently to whether the spectra are acquired in MS1, MS2, Multiple Reaction Monitoring (MRM), Data-independent Acquisition (DIA), Data-dependent Acquisition (DDA), MSn mode, or combinations thereof. [0102] In some embodiments, a multi-modal model or mode-agnostic model described herein can translate from one modality to another based at least in part on data describing a joint space between two or more modalities.
[0103] In some embodiments, training a model described herein using inputs from a plurality of different modalities reduces or eliminates the need for labeling of training and/or sample data. For example, training using a combination of MS1 and MS2 data can reduce or eliminate the need for labeled datasets for a particular downstream application (such as biomarker discovery and/or disease classification). In some embodiments, use of a multi-modal training regime can significantly reduce the number of empirical data points needed to make a disease classification or discover a biomarker. Such embodiments are particularly advantageous for classification of rare or complex conditions where the availability of controlled empirical data is limited and/or nonexistent.
[0104] In some embodiments, multi-modal models allow utilization of mass spectrometry measurements of less than 150 clinical samples (e.g. as few as 10 to 20 samples) to provide accurate characterization of a disease or condition.
[0105] In some embodiments, foundational multi-modal models can be fine-tuned using a small number of data points, for example, to train for specific characterizations such as identification of gene labels and/or metabolites.
[0106] In some embodiments, multi-modal models are trained using m/z peaks with raw intensity values from a plurality of mass spectrometer operating modes to form a vocabulary of the model. In some cases, continuous values can be converted to discrete inputs, representing intensity, m/z, and/or mode of acquisition. In some cases, chromatographic information can be included to further refine the models, or intentionally excluded to produce a model which is LC agnostic.
[0107] In some embodiments, training data comprises millions of paired m/z, intensity data points. In certain embodiments, the precision of data points is compressed by breaking discrete values into the first three and last three digits to reduce the dimensionality of the training set (e.g., from millions of data points to about 1000 different vectors).
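The digit-splitting compression described above can be read in more than one way; one plausible, purely illustrative interpretation splits a zero-padded six-digit value into two three-digit tokens, so any value maps into a vocabulary of at most 1000 distinct tokens per half. The function name and padding choice are assumptions:

```python
def split_digits(value):
    """Split an integer value into its first-three and last-three digits,
    zero-padding to six digits. One possible reading of the compression
    described in the text; illustrative only."""
    s = f"{value:06d}"
    return s[:3], s[3:]

head, tail = split_digits(123456)   # ("123", "456")
```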
Computer systems
[0108] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 15 shows a computer system 1501 that is programmed or otherwise configured to characterize a condition of a subject using mass spectrometry data obtained by analyzing a sample collected from the subject. The computer system 1501 can regulate various aspects of the machine-learning based methods of the present disclosure, such as, for example, providing a model which is capable of providing output information indicative of at least one condition marker or condition state in the subject. The computer system 1501 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
[0109] The computer system 1501 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1505, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1501 also includes memory or memory location 1510 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1515 (e.g., hard disk), communication interface 1520 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1525, such as cache, other memory, data storage and/or electronic display adapters. The memory 1510, storage unit 1515, interface 1520 and peripheral devices 1525 are in communication with the CPU 1505 through a communication bus (solid lines), such as a motherboard. The storage unit 1515 can be a data storage unit (or data repository) for storing data. The computer system 1501 can be operatively coupled to a computer network (“network”) 1530 with the aid of the communication interface 1520. The network 1530 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 1530 in some cases is a telecommunication and/or data network. The network 1530 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 1530, in some cases with the aid of the computer system 1501, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1501 to behave as a client or a server.
[0110] The CPU 1505 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1510. The instructions can be directed to the CPU 1505, which can subsequently program or otherwise configure the CPU 1505 to implement methods of the present disclosure. Examples of operations performed by the CPU 1505 can include fetch, decode, execute, and writeback.
[0111] The CPU 1505 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1501 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
[0112] The storage unit 1515 can store files, such as drivers, libraries and saved programs. The storage unit 1515 can store user data, e.g., user preferences and user programs. The computer system 1501 in some cases can include one or more additional data storage units that are external to the computer system 1501, such as located on a remote server that is in communication with the computer system 1501 through an intranet or the Internet.
[0113] The computer system 1501 can communicate with one or more remote computer systems through the network 1530. For instance, the computer system 1501 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1501 via the network 1530. [0114] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1501, such as, for example, on the memory 1510 or electronic storage unit 1515. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1505. In some cases, the code can be retrieved from the storage unit 1515 and stored on the memory 1510 for ready access by the processor 1505. In some situations, the electronic storage unit 1515 can be precluded, and machine-executable instructions are stored on memory 1510.
[0115] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
[0116] Aspects of the systems and methods provided herein, such as the computer system 1501, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software
programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
[0117] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[0118] The computer system 1501 can include or be in communication with an electronic display 1535 that comprises a user interface (UI) 1540 for providing, for example, information concerning a condition of a sample obtained from a subject. Examples of UI’s include, without limitation, a graphical user interface (GUI) and web-based user interface.
[0119] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the
central processing unit 1505. The algorithm can, for example, be configured to perform any of the methods described herein.
Examples
Example 1: Direct Disease Classification from Raw Mass-Spectrometry data using Self- Supervised Deep Learning
[0120] Deep learning has made great strides in many areas, but in proteomics, the adaptation has been limited to a small number of applications, such as prediction of chromatographic retention time and product ion intensities for specific ions. An untapped potential of deep learning was demonstrated by building a model that classifies patients into groups of cancer patients and groups of normal subjects by directly analyzing data-independent acquisition (DIA) data, without needing any prior proteomics knowledge.
[0121] Transformer encoders were used for encoding DIA data. To facilitate the processing of a large data set, the encoders were laid out in a hierarchy according to the arrangements described in FIGs. 1A and 2. The level-1 transformer encoded each MS/MS spectrum, and the level-2 transformer encoded a sequence of level-1 outputs. Both encoders were trained in a self-supervised fashion - learning the distribution itself without externally added labels. In the self-supervised training, novel optimization objectives were added on top of the typical objective of predicting hidden input. After training each level in sequence, the top-level classifier was fine-tuned along with the level-2 transformer. The labels used in the final fine-tuning step are the only external information injected into the model.
[0122] This method was tested on two DIA datasets - ovarian cancer with 157 samples and pancreatic cancer with 118 samples, each of which contained ~50% healthy samples as control. The datasets were split into 80% training, 10% evaluation, and 10% test sets. On this split, both models achieved an area under the curve (AUC) of 1.0 in classifying the cancer and normal samples.
[0123] The results demonstrate that deep learning architectures and training regimes are beneficial to mass spectrometry-based proteomics and/or for classification of conditions of a subject (such as cancer). Minimal domain knowledge was used to develop the model, which is capable of accurately distinguishing samples obtained from subjects with cancer from cancer-free subjects, demonstrating that the method can be extended to other mass spectrometry-based omics technologies (e.g., metabolomics and lipidomics) and their integrations. This work paves the way toward making sense of underutilized information, such as post-translational modifications and discovery of new biomarkers.
Example 2 - Spectrum is All you Need (SAN)
[0124] A machine-learning model, termed Spectrum is All you Need (SAN), was designed to analyze MS data using deep learning with minimal domain knowledge. A transformer, which is widely used in natural language processing (NLP), computer vision, speech processing, as well as in bioinformatics, was used as the main engine of the architecture. The use of the transformer resulted in several interesting design decisions:
[0125] Tokenization: each spectrum was converted to a sequence of tokens, similar to a sentence being converted to tokens in NLP.
[0126] Hierarchy: Transformers conventionally struggle with long data sequences. Accordingly, input data was split and fed to a hierarchy of models to reduce the size of the data sequence each transformer was required to handle (see FIG. 1A).
[0127] Self-supervised training: Transformers ordinarily require many examples to train properly. Due to the wealth of data available from mass spectrometric analysis of samples, particularly when paired with additional separations (e.g., chromatography, ion mobility, etc.), even a single sample comprises a very large data set. Accordingly, self-supervised training was used to decrease the number of experimental data points required. Several self-supervised objectives were devised and used to reduce the total number of training points required to build an accurate model.
[0128] Tokenization: A unique tokenization procedure was utilized. Mass spectra generally include a set of peaks where each peak has a mass-to-charge ratio (m/z) and an intensity. Each tandem mass (MS/MS) spectrum was converted to a sequence of tokens by first sorting all the peaks in decreasing order of their intensities (e.g., most intense peaks first), and then converting peaks to token ids by rounding the m/z value to the nearest integer after multiplying by a scaling constant (0.9995 by default). Using this particular tokenization scheme, peaks whose m/z values were close together were assigned the same token id, indicating that high resolution is not necessarily needed to provide accurate classification. Each token then became a categorical variable which does not carry any explicit information about the m/z value. All relationships among tokens were then discovered by the transformer from scratch by looking at the data.
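The core of this tokenization scheme (intensity-descending sort, scale by 0.9995, round to the nearest integer) can be sketched as follows. The toy peak list is invented; note how the two peaks near m/z 500 collapse to the same token id:

```python
def tokenize_spectrum(peaks, scale=0.9995):
    """Convert (m/z, intensity) peaks to a sequence of token ids:
    sort by decreasing intensity, then round the scaled m/z values."""
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [round(mz * scale) for mz, _ in ordered]

peaks = [(500.30, 10.0), (500.55, 80.0), (1200.10, 40.0)]
tokens = tokenize_spectrum(peaks)   # [500, 1199, 500]
```

The first and last tokens are both 500 even though the underlying m/z values differ, illustrating why exact mass is not required by this scheme.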
[0129] A transformer's memory usage increases with O(n²), where n is the length of the input sequence. When n is smaller than a few hundred, a transformer model is trainable with a local GPU, but beyond that it is not practical or feasible. To circumvent this issue while exploiting the capabilities of the transformer, the SAN implementation structures input data into three levels of a hierarchy (see FIG. 2). In level 1 (L1), a transformer learned to encode each individual MS/MS spectrum. In level 2, another transformer learned to encode a sequence of spectra, i.e., a
sequence of L1 outputs. In level 2.5, a simple linear classifier learned to classify disease status (as cancerous or disease-free) from a sequence of L2 outputs.
Self-supervised Training
[0130] To make up for a low example count, self-supervised training was extensively used. In L1, the model was asked to guess masked tokens, similar to masked language modeling in NLP. In level 2 (L2), the model was asked to predict masked inputs.
[0131] The algorithm components described above focus on learning and encoding the input sequence position-by-position, and thus the output at a specific position is specific to that position. This limits the model's ability to understand a broad context about the input. To encourage the model to learn higher-level features, a secondary objective was used. Similar to the next-sentence-prediction task in NLP, where a model is asked to predict whether two sentences are related or not, the L1 model was asked to predict whether two spectra are adjacent. In L2, the model was asked to predict whether two sequences of spectra are from the same sample.
Implementation details
[0132] A pancreatic cancer dataset was used containing 118 raw mass spectrometry files collected from the same number of samples. The gradient length of the LC-MS used to collect the files was 180 min, resulting in approximately 231K spectra across 70 isolation windows. Raw files were converted to mzML using msconvert. For the conversion, CWT peak picking was selected. m/z and intensity were written as single-precision floats. Also, zero samples (zero-intensity peaks) were removed. The full conversion was performed using the following msconvert settings: --64 --mz32 --inten32 --filter "peakPicking cwt" --filter "zeroSamples removeExtra". The dataset was stratified-split into 80% train, 10% validation, and 10% test sets using the healthy vs. cancer label with the sklearn library.
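The stratified 80/10/10 split described above can be sketched in pure Python for illustration (the actual work used the sklearn library). The seed, function name, and per-label rounding behavior are assumptions of this sketch:

```python
import random

def stratified_split(labels, fracs=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/val/test while preserving the
    proportion of each label in every split."""
    rng = random.Random(seed)
    splits = ([], [], [])
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n = len(idxs)
        a = int(n * fracs[0])
        b = a + int(n * fracs[1])
        splits[0].extend(idxs[:a])
        splits[1].extend(idxs[a:b])
        splits[2].extend(idxs[b:])
    return splits

labels = ["healthy"] * 60 + ["cancer"] * 58   # 118 samples, as in the example
train, val, test = stratified_split(labels)
```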
Tokenization
[0133] A tokenization process converted mzML files to Python pickle files. The pickle file contained a list, whose elements were dicts. Each dict had one key, 'token', whose value contained a list of numpy arrays. The numpy array had the token id of each spectrum. The pickle object was referenced by data[window_idx]['token'][spectrum_idx][peak_idx]. The following steps were then performed:
Search for MS level 2 spectra.
Sort peaks in intensity-descending order. Take the top 150 peaks and discard the rest. Multiply the m/z value by 0.9995. Clip the m/z value to [100, 1800]. Round the m/z value to the nearest integer.
Add an offset of -90 (10 reserved tokens minus 100 for the minimum m/z value).
Trim to remove the head and tail of the experiment (1/6 each, 1/3 total), reducing the number of spectra per isolation window from 3.3K to 2.2K.
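The per-spectrum steps above can be sketched as follows. The constants mirror the listed values; the interpretation of the offset (10 reserved token ids minus the minimum m/z of 100) is a reconstruction from the text, and the toy peaks are illustrative:

```python
def spectrum_to_tokens(peaks, top_n=150, scale=0.9995,
                       mz_min=100, mz_max=1800, n_reserved=10):
    """Convert MS2 (m/z, intensity) peaks to token ids per the listed
    steps: sort by intensity, keep the top peaks, scale, clip, round,
    then shift past the reserved token ids (net offset of -90)."""
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    tokens = []
    for mz, _ in ordered:
        v = mz * scale
        v = min(max(v, mz_min), mz_max)          # clip to [100, 1800]
        tokens.append(round(v) - mz_min + n_reserved)
    return tokens

tokens = spectrum_to_tokens([(500.3, 5.0), (101.0, 50.0), (2000.0, 7.0)])
# most intense peak first; out-of-range m/z 2000 is clipped to 1800
```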
Encoder Level 1
[0134] A transformer encoder was trained with two tasks - masked token prediction and adjacency prediction. The two cross-entropy losses were added with equal weights. Adjacency was labeled into three classes - unrelated, adjacent in the horizontal (time) axis, and adjacent in the vertical (isolation window) axis.
[0135] The target adjacency label was sampled with equal weights. Using the target label, two spectra were sampled from the dataset. One summary token and the two spectra were concatenated. 10% of the tokens from each spectrum were masked. The token type id was set to 0 for the summary token and the first spectrum; the second spectrum had a token type id of 1. Token embedding was followed by layer norm and dropout. The transformer block was configured to have 6 layers, 512 hidden dimensions, 3*512 intermediate dimensions, 8 attention heads, and absolute position encoding.
The token prediction head was a dot product with the token embedding table, followed by a bias layer. A linear head was used for adjacency prediction.
[0136] Training was performed using AdamW with a learning rate of 2e-4 and weight decay of 4e-5; Beta1=0.9, Beta2=0.95. Embedding, bias, and layer norm parameters were excluded from weight decay. The learning rate was cosine-decayed with alpha=0.1; Batch = 512. A g5x instance with 4 GPUs was used; Epoch steps = (106 * 70 * 2200 // BATCH_L1 // 10); Target epochs = 100. Training of the encoder took 1-2 days.
[0137] Pre-encoding of Level 1: After level 1 training was finished, the level 1 transformer was fixed and the dataset was encoded using the transformer. As level 1 was trained using two spectra, two spectra were fed to the encoder and the output of the adjacency prediction token was taken. This reduced the sequence length of an isolation window from 2.2K to 1.1K. After pre-encoding, two spectra were represented by one 512-dimensional vector.
Encoder Level 2
[0138] Level 2 was constructed similarly to level 1. The core difference is that the input elements were already high-dimensional vectors, so token embedding was not needed. Masked token prediction in level 1 was replaced with masked vector prediction. Adjacency prediction was replaced with inter-intra person prediction.
[0139] Masked vector prediction: The masked input vector was replaced by a learnable vector. The output of the encoder was dot-producted with the original contents of the vector, and cross-entropy loss was used. In level 1, the dot product was taken across the token space, i.e. ~2000 total categories. In level 2, the dot product was taken across the masked inputs, including the ones in the batch. As the batch size gets larger, the difficulty of the task increases, as does the quality of the training signal.
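The level-2 in-batch objective can be sketched as follows; a minimal numpy illustration (function and variable names are assumptions) in which each masked position must pick out its own original vector among all masked originals in the batch - which is also why a larger batch makes the task harder:

```python
import numpy as np

def masked_vector_loss(predicted, originals):
    """Cross-entropy over in-batch candidates: for masked position i, the
    'class' is which original vector (row j of `originals`) it was, so the
    encoder output predicted[i] should score highest against originals[i]."""
    logits = predicted @ originals.T                 # pairwise dot products, (n, n)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(predicted)
    # the correct "category" for row i is column i
    return -log_probs[np.arange(n), np.arange(n)].mean()
```

A model that reconstructs each masked vector exactly gets near-zero loss; a model whose predictions are shuffled relative to the originals is heavily penalized.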
[0140] Inter-intra person prediction: Given two sequences of vectors from two isolation windows, the model was asked to predict whether the two sequences stem from one person or from two persons.
[0141] The target inter-intra person prediction label was sampled randomly with equal weights. Two isolation windows were selected, and a random jitter amount (~3% of the sequence length) was added as an offset to read from. 10% of the input was masked.
[0142] Every three input elements were grouped together and passed through a fully connected layer to produce one 512-dimensional vector. This reduced the sequence length of one isolation window from 1.1K to ~350. The transformer block was similar to level 1, except the number of layers was increased from 6 to 9.
[0143] Training required about 2 days on a local machine using the following settings: Batch = 24; Epoch steps = (10 * 70 * 116) // BATCH_L2; Target epochs = 100.
Classifier (Level 2.5)
[0144] The level 2 outputs, from all 70 isolation windows, were fed to a linear classifier. The level 2 encoder and linear classifier were trained together.
[0145] Dataset: Only two examples could fit in 48 GB of memory at one time. One positive and one negative example were randomly sampled and bundled together as one mini-batch.
[0146] Training settings for the linear classifier were: LR = 2e-4, WD = 1e-3; Batch = 2; Epoch steps = 1 * 94 // BATCH_FINE_TUNE_AGGREGATE; Target epochs = 10.
[0147] Evaluation was performed on the validation split and the test split.
Example 3 Discovery of new condition markers
[0148] Model parameters were extracted from a model trained to classify whether or not a subject has cancer. Results are shown in FIGs. 12-14. Extraction of the model parameters allows for identification of biomarkers which are indicative of either a healthy subject or a subject with cancer.
Example 4 Protein Discovery and Disease Classification using Alternate Hierarchical Transformers
[0149] An alternate hierarchical implementation of the tokenization, hierarchy, and self-supervised training methods described herein was developed as follows for DIA-mode experiments:
[0150] Since one typical DIA-mode experiment contains tens of thousands of spectra or more, including such data in a transformer model is challenging. The architecture of Examples 2 and 3 addresses this concern by breaking down the data into two levels - spectrum level (1st) and isolation window level (2nd) - so that the data size becomes manageable by the transformer model.
[0151] The number of examples in a typical clinical study of a condition of a subject is on the order of patient samples times isolation windows, which is less than 10K sample points for a typical dataset that can be used to train the second level of the hierarchy. Additionally, the level-1 model is frozen after pre-training, is not updated during the fine-tuning stage, and does not receive or utilize any label information.
[0152] To increase the number of examples available in certain applications to train such a two-level model, and to increase the fidelity of both levels of the model, an alternate hierarchy was developed. In the alternate hierarchy, the level-2 model and subsequent linear classifier are replaced with a direct linear classifier, which produces a score directly. The score indicates how likely the given spectrum belongs to a sample of label 0 or 1. The hierarchy is trained with the 10K+ spectra, and their scores, from a given sample. A random forest model aggregates the scores from the linear classifier and outputs the final score. A high-level overview of this workflow is presented in FIG. 16, and a more detailed diagram of the Example scheme is presented in FIG. 17.
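The score-aggregation stage might be sketched as below, with synthetic per-spectrum scores standing in for the linear classifier's outputs (the sample counts, informative feature positions, and score generator are illustrative, not from the Example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_spectra = 40, 100            # toy sizes (assumption)
labels = np.repeat([0, 1], n_samples // 2)

# Stand-in per-spectrum scores: one score per (isolation window, retention
# time) position for each sample; only a few positions carry label signal.
scores = rng.normal(size=(n_samples, n_spectra))
scores[labels == 1, 5] += 3.0
scores[labels == 1, 42] += 3.0

# The random forest treats each spectrum position as a unique feature and
# aggregates all per-spectrum scores into a final per-sample prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(scores, labels)

# Feature importances enable back-tracing: which spectra drove the decision.
top_features = np.argsort(forest.feature_importances_)[::-1][:2]
```

In this toy setup the two planted positions dominate the importance ranking, mirroring how the Example back-traces important spectra.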
[0153] The transformer encoder learns spectrum-level information, allowing for direct fine-tuning, which means that label information is injected into the model. The use of a random forest as the information aggregator facilitates back-tracing of results. For example, inspecting the feature importance of the model can reveal which spectrum is important and contributes most to the final output.
[0154] Also, given an important spectrum, it is possible to inspect an attention matrix of the transformer model to see which peaks contributed more to the score.
[0155] Among the 10K+ spectra in a sample, only a small fraction of spectra are important or have mutual information with the label. Which spectra are important is not known beforehand, so training uses all spectra. Examples with low quality or low signal-to-noise ratio can have a negative impact on training. Two different fine-tuning schemes implemented in this example are designed to alleviate this problem, as described below in the fine-tuning section.
[0156] The random forest model treats each score as a unique feature. If one sample has a retention time offset relative to other samples, spectra and their scores will propagate as an offset in the feature space of the random forest. A simple retention time alignment step was added to cancel out retention drift as much as possible, as illustrated in the dataset section below.
[0157] Some implementations of the Example scheme included a transformer model that takes multiple spectra as input, which can absorb some of the remaining offset in the input, as described in the model section below.
Example 4 Implementation Details:
[0158] Dataset preparation: Raw data files were centroided and deisotoped.
[0159] Tokenization was performed substantially as described in Example 2.
[0160] A trim step was extended to compensate for retention time offset by calculating an offset for each sample, which is added to the trimming range.
Offset Calculation:
[0161] A simple data-driven method was used as follows: Each DIA run was converted to a 3D matrix whose axes were isolation window index, retention time index, and binned m/z index. The value was the log of the peak intensity, or 0 if there was no peak at the given index.
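The offset search applied to these matrices can be sketched in simplified one-dimensional form; the Example works on full 3D matrices and iterates the adjustment over all runs until the offsets converge, whereas this illustration (function name and array sizes are assumptions) aligns a single retention-time profile against one reference:

```python
import numpy as np

def best_offset(a, b, max_shift=5):
    """Return the retention-time shift of run `a` (in bins) that maximizes
    similarity to run `b`; similarity is the dot product of the overlap."""
    best, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            score = np.sum(a[s:] * b[:len(b) - s])   # a lags b by s bins
        else:
            score = np.sum(a[:len(a) + s] * b[-s:])  # a leads b by -s bins
        if score > best_score:
            best, best_score = s, score
    return best
```

For two runs whose peak patterns differ only by a constant retention-time shift, the returned offset recovers that shift exactly.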
[0162] For each matrix, the retention time was adjusted with an offset to maximize the sum of similarity to the other matrices. This was performed in a loop until the offset values converged.
Model:
[0163] The basic principle of converting a spectrum into a sequence was performed substantially as in Example 3. Each peak was sorted by intensity rank order and mapped to a fixed-size m/z bin. An example of this is illustrated in FIG. 18. In addition to the converted sequence, a classification token is prepended. The token encourages the model to summarize the information of the whole spectrum. Its encoded output is fed to a linear classifier.
[0164] The Example model was capable of handling more than one spectrum at a time, which can extract more information from time-adjacent, multiple spectra. Multiple sequences from multiple spectra are concatenated to form one sequence, with an END token inserted between them.
[0165] For example, given two spectra that are identical to the diagram in FIG. 18, the converted sequence will be: [CLASS, C, E, B, E, END, C, E, B, E, END].
[0166] The number of spectra to combine is a hyper-parameter. Different values have been tried and evaluated, ranging from 1 to 6.
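The concatenation rule above can be sketched directly (token names follow FIG. 18; the helper name is illustrative):

```python
CLASS, END = "CLASS", "END"   # special tokens

def build_sequence(spectra):
    """Concatenate token sequences from multiple time-adjacent spectra into
    one model input: a CLASS token first, then each spectrum followed by END."""
    seq = [CLASS]
    for tokens in spectra:
        seq.extend(tokens)
        seq.append(END)
    return seq

spectrum = ["C", "E", "B", "E"]               # peaks mapped to m/z-bin tokens
seq = build_sequence([spectrum, spectrum])
# → [CLASS, C, E, B, E, END, C, E, B, E, END]
```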
Pre-training:
[0167] In the pre-training step, 15% of tokens are randomly hidden, similar to masked language model training. The diagram illustrated in FIG. 19 shows a conceptual analogy between sentence and spectrum pre-training. Putting the spectrum example above in sequence form, [CLASS, C, E, MASK, E, END], the model is trained to predict the MASK token, with the expected answer being B.
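The masking step can be sketched as follows; a minimal illustration (the helper name, seed, and the choice to leave special tokens unmasked are assumptions):

```python
import random

def mask_tokens(seq, mask_rate=0.15, special=("CLASS", "END"), seed=0):
    """Randomly hide `mask_rate` of the non-special tokens; return the masked
    sequence and the positions/values the model must reconstruct."""
    rng = random.Random(seed)
    masked, targets = list(seq), {}
    candidates = [i for i, t in enumerate(seq) if t not in special]
    n_mask = max(1, round(mask_rate * len(candidates)))
    for i in rng.sample(candidates, n_mask):
        targets[i] = masked[i]     # golden answer for this position
        masked[i] = "MASK"
    return masked, targets

masked, targets = mask_tokens(["CLASS", "C", "E", "B", "E", "END"])
```

The model is then trained to predict each entry in `targets` from the masked sequence, exactly as in the [CLASS, C, E, MASK, E, END] example.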
Fine-tuning:
[0168] The Example model learns the general distribution of the input data during the pre-training step. In the fine-tuning step, the model learns how the label is related to the input and is trained to predict the label given the input. To do that, the output of the CLASS token is fed to a linear classifier that predicts the label.
Global scheme:
[0169] A straightforward application of fine-tuning is:
Initialize the model from the pre-trained model.
Sample training example spectra evenly from the dataset. Train the model with each example and its label.
Hyper-local scheme:
[0170] Instead of having one model trained with all spectra in the dataset, the hyper-local scheme builds and fine-tunes a model separately for each isolation window and retention time index:
Loop over each isolation window and retention time index:
Initialize the model from the pre-trained model.
Pick example spectra from the specific index.
Train the model with each example and its label.
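The hyper-local loop above can be sketched as follows, with a stub model standing in for the pre-trained transformer (all names and the dataset layout are assumptions):

```python
import copy

class StubModel:
    """Stand-in for the pre-trained transformer; counts fine-tuning updates."""
    def __init__(self):
        self.steps = 0
    def train_step(self, spectrum, label):
        self.steps += 1

def hyper_local_finetune(pretrained, dataset, n_windows, n_rt):
    """One fine-tuned model per (isolation window, retention time) index.
    `dataset[(w, r)]` yields (spectrum, label) pairs for that index."""
    models = {}
    for w in range(n_windows):
        for r in range(n_rt):
            model = copy.deepcopy(pretrained)      # initialize from pre-trained weights
            for spectrum, label in dataset.get((w, r), []):
                model.train_step(spectrum, label)  # fine-tune on this index only
            models[(w, r)] = model
    return models

models = hyper_local_finetune(StubModel(), {(0, 0): [("spec", 1)]}, n_windows=2, n_rt=3)
```

Each index gets its own copy of the pre-trained weights, so low-quality spectra from other indices never mix into its fine-tuning set.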
Global vs. Hyper-local:
[0171] In the global scheme, the example model is trained with more diverse spectra. However, only a small portion of the spectra that come from a specific isolation window and retention time have mutual information with the label. Other spectra can be low-quality examples, and feeding them might have a negative impact.
[0172] In the hyper-local scheme, the example model is fine-tuned with spectra from a specific isolation window and retention time. This prevents high-fidelity examples from being mixed with low-fidelity examples. However, the number of training examples used can be significantly smaller.
[0173] Both schemes were evaluated in models as described in Example 4. The hyper-local scheme performed slightly better and is discussed in detail in the results section below.
Results:
[0174] Due to the nature of deep learning, it is hard to interpret and understand the output of a model. Even if evaluation metrics are good, it is not always obvious why a model made a particular decision or which features it was based on. To thoroughly evaluate the example models in a controlled fashion, a problem with known biomarker proteins was selected as a test case, allowing the important spectra found by the model to be compared against known proteins and their spectra. The results of these test cases are described below:
[0175] Plasma vs. Serum: Compared to a plasma blood sample, serum goes through extra processing; the fibrinogen protein is known to be filtered out. A SAN model as described above was trained with a dataset of mixed plasma and serum samples. A label was assigned to indicate whether each sample is plasma or serum.
[0176] The accuracy and AUC for the test split was around 0.99.
[0177] FIG. 20A shows the feature importance of the random forest classifier. The random forest model treats each isolation window index and spectrum (retention time) index as a separate input feature. Among thousands of such features, the model finds important ones that are helpful to predict the label. Important features were compared against a list of peptides in fibrinogen alpha and gamma proteins. The list was produced by the DIA-NN tool and has isolation window and retention time information. FIG. 20A shows that the SAN feature list overlaps with the peptide list from the DIA-NN tool.
[0178] A more challenging test case was created using quantification results from DIA-NN. A single protein was handpicked and a label was assigned based on the quantity of the protein - label 1 for high quantity and 0 for low quantity.
[0179] Different model variants achieved around 0.95 AUC and 0.9 accuracy, as illustrated in FIG. 22. FIG. 20B shows that a few top features overlap with the DIA-NN peptide list.
[0180] Another protein, P08519, was chosen, and the same procedure as above was repeated.
[0181] Different model variants achieved around 0.95 AUC and 0.9 accuracy. The variance among models was higher than for protein P01861, as shown in FIG. 23.
[0182] FIG. 21 shows that a few top features overlap with the DIA-NN peptide list.
[0183] While preferred embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the disclosure be limited by the specific examples provided within the specification. While the disclosure has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the present disclosure.
Furthermore, it shall be understood that all aspects of the present disclosure are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the present disclosure described herein may be employed in practicing the present disclosure. It is therefore contemplated that the present disclosure shall also cover any such alternatives, modifications, variations, or equivalents. It is intended that the following claims define the scope of the present disclosure and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims
1. A method of characterizing a condition of a subject using mass spectrometry data, the method comprising: obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer, wherein the raw mass spectra comprise ion m/z values and intensities, wherein an experimental m/Δm resolving power of the mass spectrometer is about 500- 2,000,000 at m/z 200; providing a machine learning model comprising one or more transformers that are trained on a raw mass spectra training dataset for characterization of the condition of the subject; and using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition marker or condition state in the subject.
2. The method of claim 1, further comprising providing the information to a user via a graphical user interface.
3. The method of claim 1, wherein the experimental m/Δm resolving power is about 500- 1,000,000 at m/z 200.
4. The method of any of claims 1-3, wherein the experimental m/Δm resolving power is about 500-30,000 at m/z 200.
5. The method of any of claims 1-4, wherein the experimental m/Δm resolving power is about 500-5,000 at m/z 200.
6. The method of any of the preceding claims, wherein the condition comprises a disease.
7. The method of any of the preceding claims, wherein the condition comprises an age state of the subject.
8. The method of any of the preceding claims, wherein the condition comprises a progression-free survival of the subject.
9. The method of any one of the preceding claims, wherein the machine learning model further comprises a linear classifier.
10. The method of any one of the preceding claims, wherein the machine learning model further comprises a neural radiance field.
11. The method of any one of the preceding claims, wherein the machine learning model further comprises a multi-layer neural network.
12. The method of any one of the preceding claims, wherein the machine learning model further comprises a decision tree.
13. The method of any one of the preceding claims, wherein the machine learning model further comprises a support vector machine.
14. The method of any one of the preceding claims, wherein the one or more raw mass spectra comprise MS/MS spectra.
15. The method of claim 14, wherein the MS/MS spectra are acquired in a data independent manner.
16. The method of any of the preceding claims, wherein the machine learning model comprises a plurality of transformers.
17. The method of claim 16, wherein the plurality of transformers are arranged in a hierarchy comprising a first and second transformer arranged in a hierarchy such that an output of the first transformer is used as an input of the second transformer.
18. The method of any one of the preceding claims, wherein the one or more raw mass spectra are tokenized prior to submission to the one or more transformers.
19. The method of any one of the preceding claims, wherein the machine learning model is trained with at least 10,000 individual mass spectra per day.
20. The method of any of the preceding claims, wherein the machine learning model is trained with at least 50,000 individual mass spectra per day.
21. The method of any of the preceding claims, wherein the machine learning model is trained with at least 100,000 individual mass spectra per day.
22. A machine-learning based method of characterizing a condition of a subject, the method comprising: obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer; providing a machine learning model comprising a plurality of transformers that are arranged in a hierarchy and trained on a raw mass spectra training dataset for characterization of the condition; and using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition or condition state in the subject.
23. The method of claim 22, wherein the hierarchy comprises a first and second transformer in a hierarchy such that an output of the first transformer is used as an input of the second transformer.
24. The method of claim 23, further wherein the hierarchy further comprises a linear classifier, the linear classifier being arranged in the hierarchy such that an output of the second transformer is used as an input of the linear classifier.
25. The method of claim 23, further wherein the hierarchy further comprises a neural radiance field, the neural radiance field being arranged in the hierarchy such that an output of the second transformer is used as an input of the neural radiance field.
26. The method of claim 23, further wherein the hierarchy further comprises a multi-layer neural network, the multi-layer neural network being arranged in the hierarchy such that an output of the second transformer is used as an input of the multi-layer neural network.
27. The method of claim 23, further wherein the hierarchy further comprises a decision tree, the decision tree being arranged in the hierarchy such that an output of the second transformer is used as an input of the decision tree.
28. The method of claim 23, further wherein the hierarchy further comprises a support vector machine, the support vector machine being arranged in the hierarchy such that an output of the second transformer is used as an input of the support vector machine.
29. The method of any one of claims 24-28, wherein the first transformer classifies tokenized data based on an MS/MS isolation window.
30. The method of claim 29, wherein the second transformer classifies a vector output of the first transformer based upon a sample identity.
31. The method of claim 30, wherein the linear classifier classifies the disease or disease state based on the vector output from the second transformer.
32. The method of any one of the preceding claims, wherein the raw mass spectra comprise MS/MS spectra that are acquired in a data independent manner.
33. The method of any one of the preceding claims, wherein the machine learning model is trained with at least 10,000 individual mass spectra per day.
34. The method of any of the preceding claims, wherein the machine learning model is trained with at least 50,000 individual mass spectra per day.
35. The method of any of the preceding claims, wherein the machine learning model is trained with at least 100,000 individual mass spectra per day.
36. The method of any of the preceding claims, wherein the condition comprises a disease.
37. The method of any of the preceding claims, wherein the condition comprises an age state of the subject.
38. The method of any of the preceding claims, wherein the condition comprises a progression-free survival of the subject.
39. A method of characterizing a condition of a subject using a high throughput trained machine learning model, the method comprising: obtaining one or more raw mass spectra of a sample from the subject collected using a mass spectrometer;
providing the machine learning model that is trained on a raw mass spectra training dataset for characterization of the condition, wherein the machine learning model is trained at a rate of at least 10,000 individual raw mass spectra from the training dataset per day; and using the machine learning model to analyze the one or more raw mass spectra of the sample and output information indicative of at least one condition in the subject.
40. The method of claim 39, wherein the rate is at least 50,000 individual raw mass spectra from the training set per day.
41. The method of claim 39, wherein the rate is at least 100,000 individual raw mass spectra from the training set per day.
42. The method of any one of the preceding claims, wherein the machine learning model further comprises a linear classifier.
43. The method of any one of the preceding claims, wherein the one or more raw mass spectra comprise MS/MS spectra.
44. The method of any of the preceding claims, wherein the machine learning model comprises a plurality of transformers.
45. The method of claim 44, wherein the plurality of transformers are arranged in a hierarchy comprising a first and second transformer in a hierarchy arranged such that an output of the first transformer is used as an input of the second transformer.
46. The method of any one of the preceding claims, wherein the one or more raw mass spectra are tokenized prior to submission to the one or more transformers.
47. The method of any of the preceding claims, wherein the condition comprises a disease.
48. The method of any of the preceding claims, wherein the condition comprises an age state of the subject.
49. The method of any of the preceding claims, wherein the condition comprises a progression-free survival of the subject.
50. The method of any one of the preceding claims, wherein the one or more raw mass spectra are tokenized by an MS/MS isolation window and a plurality of m/z values corresponding to detected ions of each of the one or more raw mass spectra.
51. The method of claim 50, wherein the one or more raw mass spectra are tokenized such that m/z values with the same unit mass are binned together.
52. The method of claim 50, wherein the one or more raw mass spectra are tokenized using large bins (e.g. bins spanning about 1, 0.7, 0.5, or 0.3 mass units).
53. The method of claim 50, wherein the one or more raw mass spectra are tokenized using small bins (e.g. bins spanning about 0.1, 0.01, 0.001 or less mass units).
54. The method of claim 50, wherein the one or more raw mass spectra are tokenized using uniform bins.
55. The method of claim 50, wherein the one or more raw mass spectra are tokenized using non-uniform bins.
56. The method of any of the preceding claims, wherein the machine learning model is trained using self-supervised learning.
57. The method of any one of the preceding claims, wherein measuring the sample by mass spectrometry comprises separating components of the sample using liquid chromatography coupled to a mass spectrometer.
58. The method of claim 57, wherein a gradient method of the liquid chromatography runs over a period of at least 15 minutes (e.g. about 15, 30, 60, 90, or 180 minutes).
59. The method of claim 57, wherein a gradient method of the liquid chromatography runs over a period of about 5 to 10 minutes (e.g. about 5, 7, or 10 minutes).
60. The method of any of the preceding claims, wherein the information includes presence or absence of the at least one disease or disease state in the subject.
61. The method of any of the preceding claims, wherein the at least one disease or disease state comprises cancer.
62. The method of claim 61, wherein the cancer comprises pancreatic cancer or ovarian cancer.
63. The method of claim 61, wherein the cancer comprises breast cancer.
64. The method of claim 61, wherein the cancer comprises prostate cancer.
65. The method of claim 61, wherein the cancer comprises lung cancer.
66. The method of claim 61, wherein the cancer comprises gallbladder cancer.
67. The method of any of the preceding claims, wherein the condition comprises a plurality of disease states.
68. The method of any of the preceding claims, wherein the condition is a disease state, and the disease state comprises a responsiveness of a disease to a therapeutic intervention.
69. The method of claim 68, wherein the therapeutic intervention is an immunotherapy (e.g. a CAR-T therapy).
70. The method of any of the preceding claims, wherein the information comprises a probability or likelihood of the subject having the at least one disease or disease state.
71. The method of any of the preceding claims, wherein the information comprises an indication of disease state or disease severity.
72. The method of any of the preceding claims, wherein the information comprises an indication of disease classification.
73. The method of claim 72, wherein the at least one disease or disease state is a cancer and the indication of the disease classification comprises an identification of a cell line genotype or cell line phenotype of the cancer.
74. The method of any one of the preceding claims, wherein the information is associated with at least one of a proteomic, a lipidomic, or a metabolomic profile of the sample obtained from the subject.
75. The method of claim 74, wherein the machine learning model outputs the information without requiring prior domain knowledge relating to at least one of the proteomic, lipidomic, or metabolomic profile.
76. The method of any one of the preceding claims, wherein an accuracy of the information is at least 70%.
77. The method of any one of the preceding claims, wherein an accuracy of the information is at least 80%.
78. The method of any one of the preceding claims, wherein an accuracy of the information is at least 90%.
79. The method of any one of the preceding claims, wherein an accuracy of the information is at least 95%.
80. The method of any one of the preceding claims, wherein an accuracy of the information is at least 99%.
81. The method of any one of the preceding claims, wherein training the machine learning model to determine a presence or absence of the one or more disease conditions requires no more than about 500 experimental data points.
82. The method of claim 81, wherein no more than about 200 experimental data points are required to train the machine learning model.
83. The method of claim 81, wherein no more than about 100 experimental data points are required to train the machine learning model.
84. The method of claim 81, wherein an accuracy of the determination is at least about 70%.
85. The method of claim 74, wherein the proteomic profile comprises one or more post-translational modifications (PTMs).
86. The method of claim 85, wherein the post-translational modifications comprise one or more of phosphorylation, acetylation, ubiquitination, glycosylation, or a combination of two or more thereof.
87. The method of any one of the preceding claims, wherein training the machine learning model comprises randomly masking about 1-25% (e.g. 1%, 5%, 10%, 15%, 20%, or 25%) of the training set and adding about 1-10% (e.g. about 1%, 2%, 3%, 4%, 5%, or 10%) noise as a means of self-supervised learning.
88. The method of any one of the preceding claims, wherein measuring the sample by mass spectrometry comprises separating ions by ion mobility (e.g. by High Field Asymmetric Waveform Ion Mobility Spectrometry (FAIMS) or Drift-tube Ion Mobility Spectrometry) prior to or during acquisition of mass spectra.
89. The method of any one of the preceding claims, wherein a mean average percent error of the information is less than about 30% (e.g. less than 30%, 20%, 15%, 10%, 5%, 3%, 2%, or 1%).
90. The method of any of the preceding claims, wherein adjacent m/z values are not treated as continuous values during the analysis.
91. The method of any of the preceding claims, wherein the information comprises identification of one or more signals which are determinative of the presence or absence of a particular condition.
92. The method of any of the preceding claims, wherein the information comprises identification of one or more signals which are indicative of or correlated with a particular state of a particular condition.
93. The method of any of the preceding claims, wherein the information is used for biomarker discovery.
94. The method of any of the preceding claims, wherein the raw mass spectra are converted to preprocessed mass spectra by an automated algorithm comprising a deisotoping, a de-charging, or a de-adducting algorithm.
95. The method of any of the preceding claims, wherein the method is capable of being trained at a rate of at least 10 training samples per day (e.g. at least 10, 15, 50, 100, 300, 500, or 700 samples per day) when trained using a single GPU or CPU which is no faster in terms of maximum single precision floating point operations per second than an NVidia RTX A6000 GPU equipped with 48GB of RAM.
96. The method of any of the preceding claims, wherein the machine learning model comprises a transformer arranged in a hierarchy with a linear classifier and a random forest aggregator.
97. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform the method of any of the preceding claims.
98. A system configured for characterizing a condition of a subject, the system comprising: a computer comprising a memory operably coupled to at least one processor; and a module executing in the memory of the computer, the module comprising program code enabled upon execution by the at least one processor of the computer to perform the method of any of the preceding claims.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263410054P | 2022-09-26 | 2022-09-26 | |
US63/410,054 | 2022-09-26 | ||
US202363531910P | 2023-08-10 | 2023-08-10 | |
US63/531,910 | 2023-08-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024072802A1 true WO2024072802A1 (en) | 2024-04-04 |
Family
ID=90478980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/033724 WO2024072802A1 (en) | 2022-09-26 | 2023-09-26 | Methods and systems for classification of a condition using mass spectrometry data |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024072802A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220122690A1 (en) * | 2020-07-17 | 2022-04-21 | Genentech, Inc. | Attention-based neural network to predict peptide binding, presentation, and immunogenicity |
Non-Patent Citations (2)
Title |
---|
ANONYMOUS: "MS2-Transformer: An End-to-End Model for MS/MS-Assisted Molecule Identification", under review as a conference paper at ICLR 2022, 1 January 2022 (2022-01-01), XP093158000, Retrieved from the Internet <URL:https://openreview.net/pdf?id=XK4GN6UCTfH> * |
MAUREEN FEUCHEROLLES: "Combination of MALDI-TOF Mass Spectrometry and Machine Learning for Rapid Antimicrobial Resistance Screening: The Case of Campylobacter spp.", FRONTIERS IN MICROBIOLOGY, Frontiers Media, Lausanne, vol. 12, 18 February 2022 (2022-02-18), XP093158002, ISSN: 1664-302X, DOI: 10.3389/fmicb.2021.804484 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Winter et al. | Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations | |
Tian et al. | Clustering single-cell RNA-seq data with a model-based deep learning approach | |
US11587646B2 (en) | Method for simultaneous characterization and expansion of reference libraries for small molecule identification | |
Dührkop et al. | Searching molecular structure databases with tandem mass spectra using CSI: FingerID | |
US20190147983A1 (en) | Systems and methods for de novo peptide sequencing from data-independent acquisition using deep learning | |
WO2020014767A1 (en) | Systems and methods for de novo peptide sequencing from data-independent acquisition using deep learning | |
JP2022525427A (en) | Automatic boundary detection in mass spectrometry data | |
Ahmed et al. | Enhanced feature selection for biomarker discovery in LC-MS data using GP | |
Alqudah | Ovarian cancer classification using serum proteomic profiling and wavelet features a comparison of machine learning and features selection algorithms | |
Brendel et al. | Application of deep learning on single-cell RNA sequencing data analysis: a review | |
Yang et al. | Image-based classification of protein subcellular location patterns in human reproductive tissue by ensemble learning global and local features | |
Liu et al. | Current and future deep learning algorithms for tandem mass spectrometry (MS/MS)‐based small molecule structure elucidation | |
Cadow et al. | On the feasibility of deep learning applications using raw mass spectrometry data | |
Mao et al. | Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model | |
Goldman et al. | Prefix-tree decoding for predicting mass spectra from molecules | |
Butler et al. | MS2Mol: A transformer model for illuminating dark chemical space from mass spectra | |
Litsa et al. | Spec2Mol: An end-to-end deep learning framework for translating MS/MS Spectra to de-novo molecules | |
CN113380337A (en) | Organic fluorescent small molecule optical property prediction method based on deep neural network | |
CN111508565B (en) | Mass spectrometry for determining the presence or absence of a chemical element in an analyte | |
Fan et al. | Intelligence algorithms for protein classification by mass spectrometry | |
Huber et al. | MS2DeepScore-a novel deep learning similarity measure for mass fragmentation spectrum comparisons | |
WO2024072802A1 (en) | Methods and systems for classification of a condition using mass spectrometry data | |
WO2023164518A2 (en) | Predicting chemical structure and properties based on mass spectra | |
Datta | Feature selection and machine learning with mass spectrometry data | |
Webel et al. | Mass spectrometry-based proteomics imputation using self supervised deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application |
Ref document number: 23873529 | Country of ref document: EP | Kind code of ref document: A1 |