WO2023147472A1 - Methods and systems for risk stratification of colorectal cancer - Google Patents

Info

Publication number
WO2023147472A1
Authority
WO
WIPO (PCT)
Prior art keywords
variables
classifier
demographic
features
physiological
Application number
PCT/US2023/061453
Other languages
French (fr)
Inventor
Richard BOURGON
Drew Reid CLAUSEN
Robertino MERA-GILER
Macdonald Morris
Amit Kapish PASUPATHY
Moses Alexander Rubin WINTNER
Original Assignee
Freenome Holdings, Inc.
Application filed by Freenome Holdings, Inc.
Publication of WO2023147472A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407 Specifically defined cancers
    • G01N33/57419 Specifically defined cancers of colon
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/08 Detecting, measuring or recording devices for evaluating the respiratory organs
    • A61B5/082 Evaluation by breath analysis, e.g. determination of the chemical composition of exhaled breath
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/145 Measuring characteristics of blood in vivo, e.g. gas concentration, pH value; Measuring characteristics of body fluids or tissues, e.g. interstitial fluid, cerebral tissue
    • A61B5/14532 Measuring characteristics of blood in vivo, e.g. gas concentration, pH value; Measuring characteristics of body fluids or tissues, e.g. interstitial fluid, cerebral tissue for measuring glucose, e.g. by tissue impedance measurement
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/145 Measuring characteristics of blood in vivo, e.g. gas concentration, pH value; Measuring characteristics of body fluids or tissues, e.g. interstitial fluid, cerebral tissue
    • A61B5/14546 Measuring characteristics of blood in vivo, e.g. gas concentration, pH value; Measuring characteristics of body fluids or tissues, e.g. interstitial fluid, cerebral tissue for measuring analytes not otherwise provided for, e.g. ions, cytochromes
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/42 Detecting, measuring or recording for evaluating the gastrointestinal, the endocrine or the exocrine systems
    • A61B5/4211 Diagnosing or evaluating reflux
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/42 Detecting, measuring or recording for evaluating the gastrointestinal, the endocrine or the exocrine systems
    • A61B5/4216 Diagnosing or evaluating gastrointestinal ulcers
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/42 Detecting, measuring or recording for evaluating the gastrointestinal, the endocrine or the exocrine systems
    • A61B5/4222 Evaluating particular parts, e.g. particular organs
    • A61B5/4255 Intestines, colon or appendix
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7271 Specific aspects of physiological measurement analysis
    • A61B5/7275 Determining trends in physiological measurement data; Predicting development of a medical condition based on physiological measurements, e.g. determining a risk factor
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7271 Specific aspects of physiological measurement analysis
    • A61B5/7282 Event detection, e.g. detecting unique waveforms indicative of a medical condition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • CRC screening modalities: There are a variety of colorectal cancer screening modalities available, including high-sensitivity guaiac fecal occult blood test (HSgFOBT), fecal immunochemical test (FIT), FIT-DNA, computed tomography colonography, flexible sigmoidoscopy, and colonoscopy. While uncommon, harms resulting from CRC screening are primarily due to complications from colonoscopies (i.e., screening, surveillance, or follow-up after a positive non-invasive test) and include serious gastrointestinal bleeding, perforations, and cardiopulmonary events (USPSTF 2021). Clinical and coverage guidelines in the United States generally recommend CRC screening for asymptomatic adults aged 45 to 85. In addition, the Centers for Medicare and Medicaid Services (CMS) has included adherence to CRC screening as a component of its Star Rating program since 2008.
  • CMS: Centers for Medicare and Medicaid Services
  • the present disclosure relates to cancer screening and detection and, more particularly, but not exclusively, to methods and systems of evaluating a risk of cancer by stratifying populations of individuals based on non-invasive features and computational methods.
  • CRC: colorectal cancer
  • microsimulations of a program that only screens individuals with a CRC risk above a certain threshold found similar reductions in CRC incidence and mortality when compared to screening the entire “average risk” population (Buskermolen 2019, Helsingen 2019).
  • Such targeted screening efforts require accurate CRC risk prediction models to improve patient compliance with early enough lead times to enable early intervention and treatment for improved clinical outcomes.
  • the present disclosure provides a classifier for evaluation of colorectal cancer risk of a target individual, wherein the classifier is trained on at least one training data set that comprises a plurality of features based on demographic, physiological, and clinical variables from the target individual, wherein the classifier is generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features from a plurality of sampled individuals, wherein at least one of the plurality of features is derived from at least data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the demographic, physiological, and clinical features comprise at least two features obtained from demographic, symptomatic, lifestyle, diagnosis, or biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the classifier comprises features based on recency, trends, and sequence features.
  • the demographic, physiological, and clinical variables reflect an entire medical history of the target individual.
  • the demographic, physiological, and clinical variables are not generated using a CBC analysis of the target individual.
  • a training data set used to train the classifier comprises two or more variables per sampled individual across two or more time points.
  • the feature data for the plurality of features is collected longitudinally to provide a time series of feature values.
  • the time series of feature values is weighted according to recency or elapsed time, thereby providing a higher significance to recent measurements of the time series of feature values.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 variables are used in the classification model.
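The recency weighting and trend features described in the bullets above can be illustrated with a minimal sketch. The exponential decay, the 365-day half-life, the function names, and the ferritin-like sample values are all assumptions made for illustration; the text does not specify a weighting function or a trend estimator.

```python
from datetime import date


def recency_weighted_mean(series, as_of, half_life_days=365.0):
    """Average a time series, halving a measurement's weight every half_life_days.

    series: list of (date, value) pairs; as_of: date of the risk evaluation.
    The exponential form and the half-life are assumptions, not from the source.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for measured_on, value in series:
        age_days = (as_of - measured_on).days
        if age_days < 0:
            continue  # ignore measurements taken after the evaluation date
        weight = 0.5 ** (age_days / half_life_days)
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no measurements on or before as_of")
    return weighted_sum / weight_total


def trend_per_year(series):
    """Ordinary least-squares slope of value against time, in units per year."""
    if len(series) < 2:
        raise ValueError("need at least two time points")
    origin = min(d for d, _ in series)
    xs = [(d - origin).days / 365.0 for d, _ in series]
    ys = [v for _, v in series]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all measurements share one date")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical ferritin-like values (ng/mL) at three visits.
visits = [(date(2020, 1, 1), 80.0), (date(2021, 1, 1), 60.0), (date(2022, 1, 1), 40.0)]
features = {
    "recency_weighted_mean": recency_weighted_mean(visits, as_of=date(2022, 1, 1)),
    "trend_per_year": trend_per_year(visits),
}
```

With these sample values the weighted mean falls below the plain average of 60 because the most recent, lowest reading carries the largest weight, and the slope comes out negative, reflecting the falling series.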
  • the present disclosure provides a classifier for evaluation of colorectal cancer risk of a target individual, wherein the classifier is trained on at least one training data set that comprises (i) a plurality of features based on two or more demographic, physiological, and clinical variables from the target individual, and (ii) 9 or less blood test features based on a plurality of current blood test results of the target individual, wherein each one of the 9 or less different blood test features is based on a blood test value of one of the plurality of current blood test results of the target individual, wherein at least one of the plurality of features is based on data of the two or more demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the plurality of blood test results comprises (i) 9 or less of the following blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT) and (ii) at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
  • RBC: red blood cells
  • HGB: hemoglobin
  • HCT: hematocrit
  • MCHC: mean corpuscular hemoglobin concentration
  • the blood test results comprise 9 or less results of the following blood tests: white blood cell count (WBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; neutrophils count; monocytes count; lymphocytes count; and neutrophil-lymphocyte ratio (NLR).
  • WBC: white blood cell count
  • MPV: mean platelet volume
  • CBC: platelet count
  • NLR: neutrophil-lymphocyte ratio
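As a small illustration of the blood test features listed above, the neutrophil-lymphocyte ratio (NLR) is the neutrophil count divided by the lymphocyte count. The sketch below derives it from a hypothetical result dictionary and enforces the "9 or less" cap on blood test features; the dictionary keys, helper names, and sample values are assumptions.

```python
MAX_BLOOD_TEST_FEATURES = 9  # the "9 or less" cap stated in the text


def neutrophil_lymphocyte_ratio(neutrophils_count, lymphocytes_count):
    """NLR = neutrophil count / lymphocyte count (same units for both)."""
    if lymphocytes_count <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils_count / lymphocytes_count


def select_blood_test_features(results, wanted):
    """Pick the wanted blood test values, refusing requests above the cap."""
    if len(wanted) > MAX_BLOOD_TEST_FEATURES:
        raise ValueError("at most 9 blood test features are allowed")
    return {name: results[name] for name in wanted if name in results}


# Hypothetical current results (counts in 10^9 cells/L, MPV in fL).
current_results = {
    "WBC": 6.8,
    "MPV": 9.5,
    "neutrophils_count": 4.2,
    "lymphocytes_count": 2.1,
}
current_results["NLR"] = neutrophil_lymphocyte_ratio(
    current_results["neutrophils_count"], current_results["lymphocytes_count"]
)
blood_features = select_blood_test_features(current_results, ["WBC", "MPV", "NLR"])
```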
  • the classifier is trained equally across all prior demographic, physiological, and clinical features. In some embodiments, the equal training reduces or prevents bias towards later detection. In some embodiments, the classifier is configured to provide the evaluation of the colorectal cancer risk of the target individual with advance notice or lead time before a diagnosis that is sufficient to permit treatment intervention at a time point that is clinically actionable.
  • the advance notice or lead time is sufficient to permit a treatment that improves clinical outcomes including treatment efficacy and mortality.
  • the advance notice or lead time provided by the classification model permits a treatment that improves clinical outcomes including treatment efficacy and mortality.
  • the advance notice or lead time has a median that is between 100 and 300 days before colorectal cancer diagnosis.
  • the advance notice or lead time has a median that is at least 300 days, at least 400 days, at least 500 days, at least 600 days, at least 700 days, at least 800 days, at least 900 days, at least 1000 days, at least 1100 days, at least 1200 days, at least 1300 days, at least 1400 days, or at least 1500 days before colorectal cancer diagnosis.
  • the advance notice or lead time is at least 600 days.
  • the advance notice or lead time is at least 1000 days.
  • the plurality of features comprises an age of the target individual; wherein the classifier is generated according to at least an analysis of the age of each of another plurality of sampled individuals.
  • the plurality of features comprises a sex of the target individual; wherein the classifier is generated according to at least an analysis of the sex of each of another plurality of sampled individuals.
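The lead-time figures quoted in the bullets above are medians over individuals. A minimal sketch of how such a median could be computed from flag and diagnosis dates follows; the date pairs are invented, and defining lead time as days from the first elevated-risk flag to diagnosis is an assumption about how it is measured.

```python
from datetime import date
from statistics import median


def lead_time_days(flagged_on, diagnosed_on):
    """Days between the first elevated-risk flag and the CRC diagnosis."""
    return (diagnosed_on - flagged_on).days


# Hypothetical (flag date, diagnosis date) pairs for three individuals.
cases = [
    (date(2019, 3, 1), date(2021, 3, 1)),
    (date(2020, 1, 15), date(2020, 7, 15)),
    (date(2018, 6, 1), date(2021, 6, 1)),
]
lead_times = [lead_time_days(flag, dx) for flag, dx in cases]
median_lead_time = median(lead_times)
meets_600_day_threshold = median_lead_time >= 600
```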
  • the classifier is configured as a machine learning classifier selected from the group consisting of a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, K nearest neighbor (KNN), a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and principal component analysis classifier.
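K nearest neighbor (KNN) is one of the classifier families named above. The following is a minimal pure-Python sketch of KNN majority voting on synthetic, pre-scaled feature vectors; the feature choice, the data, and k = 3 are assumptions, and nothing here reflects how the disclosed classifier is actually trained.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Label a query vector by majority vote of its k nearest training rows."""
    if k < 1 or k > len(train_rows):
        raise ValueError("k must be between 1 and the number of training rows")
    # Euclidean distance from the query to every training row, nearest first.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Synthetic, already-scaled vectors: (age, ferritin trend). Labels: 1 = CRC, 0 = no CRC.
train_rows = [(0.2, 0.1), (0.3, 0.0), (0.25, 0.2), (0.8, -0.9), (0.9, -0.7), (0.7, -0.8)]
train_labels = [0, 0, 0, 1, 1, 1]
predicted = knn_predict(train_rows, train_labels, query=(0.85, -0.75), k=3)
```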
  • the present disclosure provides a method of generating a colorectal cancer risk classifier, comprising: (a) providing a plurality of features based on demographic, physiological and clinical variables from a target individual; (b) generating a dataset having a plurality of sets of features, each set of features generated according to a respective plurality of features based on demographic, physiological and clinical features from a plurality of sampled individuals; and (c) generating at least one classifier based at least in part on an analysis of the dataset, and outputting the at least one classifier, wherein at least one of the plurality of features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
  • the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
  • the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in generating the colorectal cancer risk classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in generating the colorectal cancer risk classifier.
  • the generating of (c) comprises weighting each of the plurality of demographic, physiological and clinical features according to a date of the respective plurality of demographic, physiological and clinical features. In some embodiments, the generating of (c) comprises filtering the plurality of demographic, physiological and clinical features to remove outliers according to a standard deviation maximum threshold. In some embodiments, the plurality of features are weighted according to a date of the respective plurality of demographic, physiological and clinical features.
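The outlier filtering step mentioned above ("a standard deviation maximum threshold") could look like the sketch below. The threshold of 3 standard deviations, the use of the population standard deviation, and the glucose readings are assumptions; the text does not state a threshold value.

```python
from statistics import mean, pstdev


def filter_outliers(values, max_std_devs=3.0):
    """Drop values more than max_std_devs population standard deviations from the mean."""
    if len(values) < 2:
        return list(values)
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - center) <= max_std_devs * spread]


# Hypothetical glucose readings (mg/dL) with one implausible entry.
glucose = [92, 95, 88, 101, 97, 90, 94, 99, 93, 96, 900]
cleaned = filter_outliers(glucose, max_std_devs=3.0)
```

Note that a single extreme value inflates the standard deviation it is judged against, so a fixed threshold can miss outliers in very short series; a median-based rule is a common alternative, though the text does not mention one.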
  • the present disclosure provides a method of generating a risk profile for colorectal cancer in a target individual comprising: a) obtaining a plurality of features based on demographic, physiological, and clinical variables from a target individual, b) providing at least one classifier generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features from a plurality of sampled individuals; and c) evaluating, using a processor, a colorectal cancer risk of the target individual at least in part by classifying the plurality of features using the at least one classifier to provide a risk profile of the target individual, wherein at least one of the plurality of features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
  • the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
  • the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in generating the risk profile for colorectal cancer. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in generating the risk profile for colorectal cancer.
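To make the idea of a risk profile concrete, the sketch below maps a classifier score to a risk stratum using fixed cut points, in the spirit of screening only individuals whose risk exceeds a threshold. The cut points, stratum labels, and the `refer_for_screening` flag are hypothetical and are not taken from the text.

```python
from bisect import bisect_right

# Hypothetical cut points on a classifier score in [0, 1]; not from the source.
RISK_CUTOFFS = [0.05, 0.20]
RISK_LABELS = ["low", "intermediate", "high"]


def risk_profile(score):
    """Map a classifier score to a stratum and a screening-referral flag."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    # bisect_right counts how many cut points are <= score, which indexes the stratum.
    stratum = RISK_LABELS[bisect_right(RISK_CUTOFFS, score)]
    return {"score": score, "stratum": stratum, "refer_for_screening": stratum == "high"}


profiles = [risk_profile(s) for s in (0.01, 0.12, 0.47)]
```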
  • the present disclosure provides a method for evaluating colorectal cancer risk of a target individual, comprising: obtaining a plurality of features based on demographic, physiological, and clinical variables from a target individual; providing at least one classifier generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features of each of a plurality of sampled individuals; and evaluating, using a processor, a colorectal cancer risk of the target individual at least in part by classifying the plurality of features using the at least one classifier, wherein at least one of the features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
  • the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
  • the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
  • the present disclosure provides a method for evaluation of colorectal cancer risk in a target individual, comprising: a) receiving by a computing system associated with a database storing a plurality of classifiers and from a client terminal and via a network, an indication of values of a plurality of features based on demographic, physiological, and clinical variables from a target individual, wherein the clinical variables comprise current blood test results calculated based at least in part on an analysis of a blood sample obtained from the target individual, and wherein at least one of the demographic, physiological, and clinical features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points; b) generating, by said computing system, a combination of features based on two or more of the demographic, physiological and clinical variables, and 9 or less blood test features based on said plurality of current blood test results, each one of said 9 or less different blood test features is based on a blood test value of one of said plurality of current blood test results, wherein at least one of the
  • the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
  • the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
  • the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier.
  • between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier.
  • the present disclosure provides a system for generating a colorectal cancer risk profile comprising: (i) a processor, (ii) a memory unit which stores at least one classifier generated based at least in part on an analysis of a plurality of demographic, physiological, and clinical features of individuals of a plurality of sampled individuals, and an input unit which receives a plurality of demographic, physiological and clinical variables of a target individual, and (iii) a colorectal cancer evaluating module which evaluates, using the processor, a colorectal cancer risk of the target individual at least in part by classifying, using the at least one classifier, a plurality of features based on the plurality of demographic, physiological, and clinical variables, wherein at least one of the plurality of features is based on data of the plurality of demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
  • the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
  • the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
  • the colorectal cancer risk profile identifies individuals at risk of colorectal cancer. In some embodiments, the colorectal cancer risk profile stratifies a population of individuals for cancer risk. In some embodiments, the colorectal cancer risk profile is used to provide treatment recommendations for the individual based on the colorectal cancer risk profile.
  • the present disclosure provides a system for classifying a target individual for colorectal cancer risk comprising: (i) a processor, (ii) a memory unit which stores at least one classifier generated based at least in part on an analysis of a plurality of demographic, physiological, and clinical features of a plurality of sampled individuals, and an input unit which receives a plurality of demographic, physiological, and clinical variables of a target individual, and (iii) a colorectal cancer evaluating module which evaluates, using the processor, a colorectal cancer risk of the target individual at least in part by classifying, using the at least one classifier, a plurality of features based on the plurality of demographic, physiological, and clinical variables, wherein at least one of the plurality of features is based on data of the plurality of demographic, physiological, and clinical variables obtained at 2 or more time points.
  • the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
  • the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
  • the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
  • the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier.
  • between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
  • the demographic, physiological, and clinical features comprise a plurality of historical and current blood test results comprising results of 9 or less of the following plurality of blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
  • Implementation of the method and/or system of embodiments of the disclosure can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the disclosure, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system. In some embodiments, hardware for performing selected tasks according to embodiments of the disclosure could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the disclosure could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system.
  • a data processor such as a computing platform for executing a plurality of instructions.
  • the data processor includes a volatile memory for storing instructions and/or data and/or a nonvolatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data.
  • a network connection is provided as well.
  • a display and/or a user input device such as a keyboard or mouse are optionally provided as well.
  • FIG. 1 provides a schematic of a computer system that is programmed or otherwise configured with the machine learning models and classifiers in order to implement methods provided herein.
  • the ROC curve was calculated on the Northern Ireland Holdout set with a CRC prediction horizon of 24 months.
  • FIG. 4 shows SHAP values for Model A trained on the Dataset 1 cohort.
  • SHAP values measure the contribution of each feature to the prediction. Positive SHAP values indicate that the feature increases a patient’s risk score and negative values indicate that a feature reduces a patient’s risk score.
  • each point represents a single observation.
  • the SHAP values for a given feature in every observation are plotted horizontally in each row of the figure.
  • the vertical thickness of each row captures the relative frequency of a feature’s contribution level across all observations.
  • the color of each point indicates the value of the feature with blue representing low feature values and red representing high feature values. For example, the thick red cloud of points on the right side of the top row indicates that for many observations, higher values of demographic feature 1 contribute to higher CRC risk.
  • seven of the features included in the model are obtained from variables measured at multiple time points: lab feature 3, lab feature 4, lab feature 7, lab feature 8, diagnosis feature 1, BMI feature 2, and lab feature 9.
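  • The sign convention and additive nature of SHAP values described above can be illustrated with a minimal sketch. It assumes a linear risk score and independent features, in which case the SHAP value of a feature reduces to its coefficient times the deviation of the feature from its background mean. The disclosure does not state that Model A is linear; the coefficients, feature names, and values below are hypothetical.

```python
from typing import List


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Linear risk score (an illustrative stand-in, not Model A)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(weights: List[float], x: List[float],
                background_means: List[float]) -> List[float]:
    """SHAP values for a linear model under a feature-independence assumption."""
    return [w * (xi - m) for w, xi, m in zip(weights, x, background_means)]


# Hypothetical coefficients for two features: age (years), hemoglobin (g/dL).
weights = [0.04, -0.5]
bias = -1.0
background_means = [55.0, 14.0]   # hypothetical cohort averages
patient = [70.0, 12.0]

contributions = linear_shap(weights, patient, background_means)
baseline = predict(weights, bias, background_means)
# Positive entries raise the score above the baseline; negative entries lower it.
# The contributions sum to predict(patient) - baseline.
```

  • In this sketch both contributions are positive: an above-average age and a below-average hemoglobin each push the hypothetical score upward.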
  • the present disclosure, in some embodiments thereof, relates to cancer diagnosis and, more particularly, but not exclusively, to methods and systems of evaluating a risk of cancer.
  • features obtained from demographic, physiological and clinical variables are used as input datasets into trained algorithms (e.g., machine learning models or classifiers) to find correlations between features and patient groups.
  • patient groups include presence of diseases or conditions, stages, subtypes, treatment responders vs. non-responders, and progressors vs. non-progressors.
  • feature matrices are generated to compare samples obtained from individuals with known conditions or characteristics.
  • samples are obtained from healthy individuals, or individuals who do not have any of the known indications, and from patients known to have an increased risk of developing cancer, in particular colorectal cancer.
  • feature generally refers to an individual measurable property or characteristic of a phenomenon being observed.
  • the concept of “feature” is related to that of explanatory variable used in statistical techniques such as for example, but not limited to, linear regression and logistic regression.
  • Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition.
  • features may be obtained from demographic, physiological and clinical information or variables.
  • the action of converting this information into features useful for a computational method is referred to herein as “featurization”.
  • input features generally refers to variables that are converted to a form, often a numeric form, used by the trained algorithm (e.g., model or classifier) to predict an output classification (label) of a sample, e.g., a condition or a cancer risk category.
  • Feature values of the variables may be determined for an individual and used to determine a classification.
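  • As a minimal sketch of featurization, the example below converts one raw record into a numeric feature vector. The record layout, field names, category list, and encoding choices are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List

# Hypothetical category list, used only for this illustration.
SMOKING_LEVELS = ["never", "former", "current"]


def featurize(record: Dict[str, object]) -> List[float]:
    """Convert one raw record into a numeric feature vector."""
    features: List[float] = []
    # Numeric variable: used as-is.
    features.append(float(record["age"]))
    # Derived variable: BMI = weight (kg) / height (m) squared.
    height_m = float(record["height_cm"]) / 100.0
    features.append(float(record["weight_kg"]) / (height_m ** 2))
    # Binary variable: presence or absence of a symptom.
    features.append(1.0 if record["rectal_bleeding"] else 0.0)
    # Categorical variable: one-hot encoding over the known levels.
    for level in SMOKING_LEVELS:
        features.append(1.0 if record["smoking"] == level else 0.0)
    return features


example = {
    "age": 62,
    "height_cm": 170,
    "weight_kg": 72.25,
    "rectal_bleeding": True,
    "smoking": "former",
}
vector = featurize(example)
```

  • The resulting vector has six entries: age, BMI, a symptom flag, and three one-hot smoking indicators.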
  • features obtained from demographic, physiological and clinical variables are obtained from historical medical records of an individual.
  • the present methods incorporated a historical view, including aspects such as recency, trends, and sequence features, which are believed to improve the predictive value of the resulting models.
  • Each observation included the entire patient history to that date.
  • the training data set used in the present methods includes multiple observations per patient across multiple time points.
  • feature data is collected longitudinally to provide time series of feature values.
  • the time series of feature values is weighted according to elapsed time to provide higher significance of recent measurements.
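  • One way to weight a time series by elapsed time is exponential decay, sketched below. The decay form and the one-year half-life are assumptions made for illustration; the disclosure does not specify a weighting function.

```python
from typing import List, Tuple


def recency_weighted_mean(series: List[Tuple[float, float]],
                          half_life_days: float = 365.0) -> float:
    """Weighted mean of (days_before_index_date, value) pairs.

    Each weight halves for every `half_life_days` of elapsed time, so
    recent measurements count more than older ones.
    """
    weights = [0.5 ** (elapsed / half_life_days) for elapsed, _ in series]
    total = sum(weights)
    return sum(w * value for w, (_, value) in zip(weights, series)) / total


# Hypothetical hemoglobin values measured 2 years, 1 year, and 0 days ago.
hemoglobin = [(730.0, 14.0), (365.0, 13.0), (0.0, 12.0)]
weighted = recency_weighted_mean(hemoglobin)
```

  • With these inputs the weights are 0.25, 0.5, and 1.0, so the weighted mean lies closer to the most recent value than the unweighted mean of 13.0 does.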
  • a colorectal cancer risk classifier is provided that is trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points.
  • the model trains equally across all prior demographic and clinical features. While not to be bound by any mechanism, it is believed that such equal training prevents bias towards later detection.
  • Each of the plurality of demographic, physiological and clinical features comprises at least two features obtained from demographic, symptomatic, lifestyle, diagnosis or biochemical variables.
  • the demographic variables are selected from age, gender, weight, height, BMI, race, country, geographically determined data such as local air quality, limiting long-term illness, or Townsend deprivation index, and the like.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the set of features comprises an age of the target individual; wherein the at least one classifier is generated according to an analysis of the age of each of another plurality of sampled individuals.
  • the set of features comprises sex of the target individual; wherein the at least one classifier is generated according to an analysis of the sex of each of another plurality of sampled individuals.
  • the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables are selected from smoking and alcohol use, red meat consumption, and medications including progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and/or uric acid.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the present classifiers incorporate a historical view including aspects such as recency, trends, and sequence features which may improve the predictive value of models using the classifiers.
  • each variable includes the entire patient history to that date.
  • the training data set used in the present methods includes multiple variables per patient across multiple time points, in contrast to other methods which trained a model on one variable per patient at one time point.
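  • The idea of several training rows per patient, each carrying the history available up to its own date, can be sketched as follows. The event layout and the choice of index dates are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical event layout: (measurement date, variable name, value).
Event = Tuple[date, str, float]


def build_observations(events: List[Event],
                       index_dates: List[date]) -> List[Dict[str, object]]:
    """Create one observation per index date.

    Each observation holds every event recorded on or before its index
    date, so later observations contain longer histories.
    """
    ordered = sorted(events, key=lambda event: event[0])
    observations = []
    for index_date in index_dates:
        history = [event for event in ordered if event[0] <= index_date]
        observations.append({"index_date": index_date, "history": history})
    return observations


events = [
    (date(2019, 1, 10), "hemoglobin", 14.1),
    (date(2020, 1, 15), "hemoglobin", 13.2),
    (date(2021, 2, 1), "hemoglobin", 12.4),
]
obs = build_observations(events, [date(2020, 6, 1), date(2021, 6, 1)])
```

  • Here a single patient yields two observations, the first with two measurements in its history and the second with three.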
  • feature data is collected longitudinally to provide time series of feature values.
  • the time series of feature values is weighted according to elapsed time to provide higher significance of recent measurements.
  • regularization of machine learning methods is performed to reduce the complexity of the model and prevent overfitting of data.
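  • The effect of regularization can be shown with a minimal sketch that fits a one-feature logistic model by gradient descent, with and without an L2 penalty. The penalty type, learning rate, step count, and toy data are assumptions; the disclosure does not name a specific regularization method.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_weight(xs: List[float], ys: List[int], l2_penalty: float,
               learning_rate: float = 0.5, steps: int = 200) -> float:
    """Fit a one-feature logistic model (no intercept) by gradient descent.

    Minimizes mean log-loss + (l2_penalty / 2) * w**2.
    """
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-loss with respect to w.
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n
        # Gradient of the L2 penalty term, which pulls w toward zero.
        grad += l2_penalty * w
        w -= learning_rate * grad
    return w


# Toy, perfectly separable data: without a penalty the weight keeps growing.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w_unregularized = fit_weight(xs, ys, l2_penalty=0.0)
w_regularized = fit_weight(xs, ys, l2_penalty=0.5)
```

  • The penalized fit ends with a smaller weight than the unpenalized fit, which is the complexity reduction that regularization is intended to provide.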
  • At least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in the classification model.
  • a colorectal cancer risk classifier is provided that is trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points.
  • a colorectal cancer risk classifier trained on data that comprises a plurality of features based on two or more demographic, physiological and clinical variables, and 9 or less blood test features based on said plurality of current blood test results, each one of said 9 or less different blood test features is based on a blood test value of one of said plurality of current blood test results wherein at least one of the features is based on data variables obtained at 2 or more time points.
  • the plurality of blood test results comprises results of 9 or less of the following blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
  • each of the plurality of historical and current blood test results comprises 9 or less of the following blood tests: white blood cell count (WBC) (CBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; neutrophils count; monocytes count; lymphocytes count; and neutrophil-lymphocyte ratio (NLR).
  • the model does not require a CBC analysis of the individual.
  • the present classifiers incorporate a historical view including aspects such as recency, trends, and sequence features which are believed to improve the predictive value of models using the classifiers.
  • a classifier useful in a computerized method of evaluating colorectal cancer risk is generated by training a classification model on a data set comprising a plurality of features obtained from demographic, physiological and clinical variables of a target individual, to provide at least one classifier generated according to an analysis of a plurality of respective demographic, physiological and clinical features of each of another plurality of sampled individuals; and evaluating, using a processor, a colorectal cancer risk of the target individual by classifying the set of features using the at least one classifier, wherein at least one of the features is based on data variables obtained at 2 or more time points.
  • Each of the plurality of demographic, physiological and clinical features comprises at least two features obtained from demographic, symptomatic, lifestyle, diagnosis or biochemical variables.
  • the present methods and systems identify feature sets to input into a trained algorithm (e.g., machine learning model or classifier).
  • the system receives input features of an individual and forms a feature vector from the measured values of the features.
  • the system inputs the feature vector into the machine learning model and obtains an output classification of whether the individual has increased risk of cancer.
  • the machine learning model outputs a classifier capable of distinguishing between two or more groups or classes of individuals or features in a population of individuals or features of the population.
  • the classifier is a trained machine learning classifier.
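  • The flow from feature vector to output classification can be sketched with a logistic scoring function and a decision threshold. The model form, coefficients, threshold, and class labels are hypothetical; the disclosure does not limit the classifier to this form.

```python
import math
from typing import List


def risk_score(feature_vector: List[float], weights: List[float],
               bias: float) -> float:
    """Map a feature vector to a score in (0, 1) with a logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, feature_vector))
    return 1.0 / (1.0 + math.exp(-z))


def classify(score: float, threshold: float = 0.5) -> str:
    """Turn a score into an output classification (label)."""
    return "increased risk" if score >= threshold else "not increased risk"


# Hypothetical coefficients for two features: age (years), symptom flag (0/1).
weights = [0.05, 1.2]
bias = -4.0

score_a = risk_score([60.0, 1.0], weights, bias)
label_a = classify(score_a)
score_b = risk_score([40.0, 0.0], weights, bias)
label_b = classify(score_b)
```

  • In practice the threshold would be chosen from validation data to balance sensitivity and specificity; 0.5 is used here only as a placeholder.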
  • the informative features obtained from the demographic, physiological and clinical variables are assayed using the methods and systems described to form a risk classification profile.
  • Receiver-operating characteristic (ROC) curves may be generated by plotting the performance of a particular feature in distinguishing between at least two populations (e.g., individuals having high or low risk of colorectal cancer, or more than two levels of colorectal cancer risk).
  • the feature data across the entire population e.g., the cases and controls
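  • A ROC curve of the kind described above can be computed by sweeping a threshold over the scores and recording the false positive and true positive rates at each step. The sketch below uses only the standard library; the scores and case/control labels are made-up values for illustration.

```python
from typing import List, Tuple


def roc_points(scores: List[float],
               labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    Labels are 1 for cases and 0 for controls. One point is produced per
    distinct score, used as a threshold from highest to lowest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

  • For this toy data 8 of the 9 case/control pairs are ranked correctly, so the area under the curve is 8/9.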
  • features included in the model may be obtained from variables measured at multiple time points.
  • variables measured at 2 or more time points provide trend data that may be featurized and input to train a classification model.
  • demographic, symptomatic, lifestyle, diagnosis or biochemical variables are measured at different time points such as about 6 months, about 9 months, about 12 months, about 18 months or about 24 months to provide trend data that may be featurized and input to train a classification model.
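  • One simple trend feature is the least-squares slope of a variable over its measurement times, sketched below. The choice of slope as the trend summary and the ferritin values are assumptions made for illustration.

```python
from typing import List, Tuple


def trend_slope(points: List[Tuple[float, float]]) -> float:
    """Least-squares slope of (time in months, value) pairs, per month.

    With exactly two points this equals the change in value divided by
    the change in time.
    """
    if len(points) < 2:
        raise ValueError("at least 2 time points are required")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        raise ValueError("time points must not all be identical")
    return numerator / denominator


# Hypothetical ferritin values at 0, 12, and 24 months.
ferritin = [(0.0, 80.0), (12.0, 62.0), (24.0, 44.0)]
slope = trend_slope(ferritin)
```

  • The negative slope here (a fall of 1.5 units per month) is the kind of trend value that could be supplied to a classifier alongside the most recent measurement.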
  • the demographic variables measured at 2 or more time points to provide trend data are selected from age, gender, weight, height, BMI, race, country, geographically determined data such as local air quality, limiting long-term illness, or Townsend deprivation index, and the like.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the symptomatic variables measured at 2 or more time points to provide trend data are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the lifestyle variables measured at 2 or more time points to provide trend data are selected from (i) smoking and alcohol use, (ii) red meat consumption, and (iii) medications including progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the diagnosis variables measured at 2 or more time points to provide trend data are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
  • the variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
  • the biochemical variables measured at 2 or more time points to provide trend data are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and/or uric acid.
  • a blood test result variable is measured at 2 or more time points to provide trend data and is selected from: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
  • a blood test result variable is measured at 2 or more time points to provide trend data and is selected from: white blood cell count, WBC (CBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; neutrophils count; monocytes count; lymphocytes count; and neutrophil-lymphocyte ratio (NLR).
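  • As a non-limiting illustration of how a variable measured at 2 or more time points might be featurized into trend data, the following minimal sketch derives a latest value, a change, and a rate of change. The function name `trend_features`, the chosen features, and the hemoglobin values are assumptions made for illustration and are not taken from the disclosure.

```python
from datetime import date

def trend_features(measurements):
    """Featurize (date, value) measurements of a single variable.

    Returns the latest value, the change between the first and last
    measurement, and the average rate of change per year.
    """
    if len(measurements) < 2:
        raise ValueError("trend data needs at least 2 time points")
    ordered = sorted(measurements, key=lambda m: m[0])
    (t0, v0), (t1, v1) = ordered[0], ordered[-1]
    years = (t1 - t0).days / 365.25
    if years == 0:
        raise ValueError("measurements must fall on different dates")
    delta = v1 - v0
    return {"latest": v1, "delta": delta, "slope_per_year": delta / years}

# Hypothetical hemoglobin (HGB) values in g/dL, for illustration only.
hgb = [(date(2020, 1, 1), 14.0), (date(2021, 1, 1), 13.0)]
features = trend_features(hgb)
```

  • In this sketch a falling value yields a negative `slope_per_year`, which could then serve as one input feature among many.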
  • a feature of the present classification models may be an increase in lead time or advance notice before colorectal cancer diagnosis to permit treatment intervention at a time point that is still actionable and may improve clinical outcomes including treatment efficacy and mortality.
  • Lead time may correspond to the difference in time from when the cancer would have been detected by symptoms, in the absence of a risk model, to when the detection occurred in the presence of a screening program.
  • This information may be obtained from medical record databases, featurized, and used to train a classification model.
  • Previous models may train on fewer features and on a fixed time point too close to a colorectal cancer diagnosis, and therefore may not provide a measure of colorectal cancer risk early enough to allow appropriate intervention.
  • the process of obtaining screening, scheduling physician appointments and confirmatory colonoscopies, and other treatment logistics often requires more than 6 months of lead time to accomplish.
  • Individuals identified by risk models as having higher level risk scores subsequently may need to navigate the treatment process and so risk profiles with higher sensitivity at earlier lead times are desirable.
  • the present methods and classifiers may be trained at arbitrarily set time points, such as 1 year, 2 years, 3 years, or 4 years in advance, to train across different time horizons.
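  • One way to picture training at a set time horizon is to restrict each individual's records to those available a chosen number of years before diagnosis. The sketch below is an assumption-laden illustration: the helper `snapshot_before`, the record layout, and the example values are hypothetical.

```python
from datetime import date, timedelta

def snapshot_before(records, diagnosis_date, horizon_years):
    """Keep only records dated on or before the horizon cutoff.

    records: list of (record_date, variable_name, value) tuples.
    The cutoff is the diagnosis date minus the horizon in whole days.
    """
    cutoff = diagnosis_date - timedelta(days=int(365.25 * horizon_years))
    return [r for r in records if r[0] <= cutoff]

# Hypothetical records for one individual, for illustration only.
records = [
    (date(2019, 5, 1), "HGB", 14.1),
    (date(2020, 12, 1), "HGB", 13.4),
    (date(2022, 1, 15), "HGB", 12.2),
]
diagnosis = date(2022, 6, 1)
two_year_view = snapshot_before(records, diagnosis, 2)
one_year_view = snapshot_before(records, diagnosis, 1)
```

  • Repeating this for each horizon gives one training set per horizon, so that a model is only shown information that would have been available that far in advance.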
  • the advance notice or lead time provided by the classification model before a diagnosis can permit treatment intervention at a time point that is still actionable and may improve clinical outcomes including treatment efficacy and mortality.
  • the median advance notice or lead time is between 100 and 300 days before colorectal cancer diagnosis and/or is longer than other models.
  • the median advance notice or lead time provided by the classification model is at least 300 days, at least 400 days, at least 500 days, at least 600 days, at least 700 days, at least 800 days, at least 900 days, at least 1000 days, at least 1100 days, at least 1200 days, at least 1300 days, at least 1400 days, or at least 1500 days before colorectal cancer diagnosis.
  • the advance notice provided by the classification model is at least 600 days.
  • the advance notice provided by the classification model is at least 1000 days.
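  • The median advance notice described above could be computed along the following lines. This is a minimal sketch assuming that, for each diagnosed individual, the date on which the model first flagged elevated risk is known; the function name and the example dates are hypothetical.

```python
from datetime import date
from statistics import median

def median_lead_time_days(cases):
    """Median number of days between first model flag and diagnosis.

    cases: list of (first_flag_date, diagnosis_date) pairs. Cases the
    model never flagged (flag is None), or flagged only on or after
    the diagnosis date, contribute no lead time and are skipped.
    """
    lead_times = [
        (dx - flag).days
        for flag, dx in cases
        if flag is not None and flag < dx
    ]
    return median(lead_times) if lead_times else None

# Hypothetical cases, for illustration only.
cases = [
    (date(2020, 1, 1), date(2021, 1, 1)),
    (date(2019, 1, 1), date(2021, 1, 1)),
    (None, date(2021, 5, 1)),
    (date(2020, 7, 1), date(2021, 1, 1)),
]
result = median_lead_time_days(cases)
```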
  • the system comprises a classification circuit that is configured as a machine learning classifier selected from the group consisting of a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, K nearest neighbor (KNN), a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and principal component analysis classifier.
  • the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 25%. In some embodiments, the classification model profile is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 30%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 35%. In some embodiments, the classification model profile is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 40%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 50%.
  • the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 60%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 70%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 80%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 90%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 95%.
  • the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 5%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 10%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 15%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 20%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 25%.
  • the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 30%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 40%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 50%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 60%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 70%.
  • the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 80%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 90%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 95%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 99%.
  • In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 40%.
  • the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 50%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 60%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 70%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 80%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 90%.
  • the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 95%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 99%.
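  • The sensitivity, PPV, and NPV figures recited above are standard ratios of confusion-matrix counts. The following sketch shows one way they might be computed from binary labels and binary predictions; the function name and the example labels are illustrative assumptions, and specificity is included for completeness.

```python
def screening_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, and NPV.

    y_true: 1 for individuals later diagnosed, 0 otherwise.
    y_pred: 1 where the model flagged elevated risk, 0 otherwise.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        # An undefined ratio (empty denominator) is reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = screening_metrics(y_true, y_pred)
```

  • PPV and NPV depend on how common the condition is in the evaluated population, so values measured on an enriched cohort may not carry over to a screening population.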
  • the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.50. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.60. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.70. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.80.
  • the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.90. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.95. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.99.
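  • Assuming the AUC recited above refers to the area under the receiver operating characteristic curve, it equals the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name and the scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve by pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the
    positive case scores higher; ties count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores, for illustration only.
auc = roc_auc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.3, 0.1])
```

  • The pairwise form is quadratic in the number of cases, so a rank-based computation would be preferable for large cohorts.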
  • the present disclosure provides a system, method, or kit having data analysis realized in software application, computing hardware, or both.
  • the analysis application or system comprises at least a data receiving module, a data pre-processing module, a data analysis module (which can operate on one or more types of featurized data), a data interpretation module, or a data visualization module.
  • the data preprocessing module can comprise hardware systems or computer software that performs operations on the data in preparation for analysis. Examples of operations that may be applied to the data in the pre-processing module include affine transformations, denoising operations, data cleaning, reformatting, or subsampling.
  • a data analysis module can perform probabilistic and statistical analysis to identify abnormal patterns related to a disease, pathology, state, risk, condition, or phenotype.
  • a data interpretation module can use analysis methods, for example, drawn from statistics, mathematics, or biology, to support understanding of the relation between the identified abnormal patterns and health conditions, functional states, prognoses, or risks.
  • a data visualization module can use methods of mathematical modeling, computer graphics, or rendering to create visual representations of data that can facilitate the understanding or interpretation of results.
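  • The modules described above might be composed as a simple chain of functions. The following is a minimal sketch under stated assumptions: the function names, the choice of a z-score rule for the analysis step, and the input values are all hypothetical and are not the disclosed implementation.

```python
from statistics import mean, pstdev

def clean(values):
    """Data cleaning: drop missing entries."""
    return [v for v in values if v is not None]

def affine(values, scale, offset):
    """Affine transformation, e.g. a unit conversion."""
    return [scale * v + offset for v in values]

def moving_average(values, window=3):
    """Denoising: trailing moving average over up to `window` points."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def subsample(values, step):
    """Subsampling: keep every `step`-th value."""
    return values[::step]

def analyze(values, threshold=2.0):
    """Statistical analysis: indices whose |z-score| exceeds threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def run_pipeline(raw, scale=1.0, offset=0.0):
    """Pre-process the received data, then analyze it."""
    return analyze(affine(clean(raw), scale, offset))

# Hypothetical input with one missing entry and one outlying value.
raw = [1.0] * 9 + [None, 10.0]
abnormal = run_pipeline(raw)
```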
  • machine learning methods are applied to distinguish samples in a population of samples. In some embodiments, machine learning methods are applied to distinguish samples between individuals at different levels of risk for colorectal cancer.
  • the one or more machine learning operations used to train the prediction engine include one or more of: a generalized linear model, a generalized additive model, a non-parametric regression operation, a random forest classifier, a spatial regression operation, a Bayesian regression model, a time series analysis, a Bayesian network, a Gaussian network, a decision tree learning operation, an artificial neural network, a recurrent neural network, a convolutional neural network, a reinforcement learning operation, linear or nonlinear regression operations, a support vector machine, a clustering operation, and a genetic algorithm operation.
  • computer processing methods are selected from logistic regression, multiple linear regression (MLR), dimension reduction, partial least squares (PLS) regression, principal component regression, autoencoders, variational autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, decision tree, classification and regression trees (CART), tree-based methods, random forest, gradient boost tree, logistic regression, matrix factorization, multidimensional scaling (MDS), dimensionality reduction methods, t-distributed stochastic neighbor embedding (t-SNE), multilayer perceptron (MLP), network clustering, neuro-fuzzy, and artificial neural networks.
  • the disclosed systems and methods provide a classifier generated based on feature information derived from demographic, physiological and clinical variables.
  • the classifier forms part of a predictive engine for distinguishing groups in a population based on risk features identified in demographic, physiological and clinical variables.
  • a classifier is created by normalizing the features by formatting features into a unified format and a unified scale; storing the normalized feature information in a columnar database; training a prediction engine by applying one or more machine learning operations to the stored normalized feature information, the prediction engine mapping, for a particular population, a combination of one or more features; applying the prediction engine to the accessed field information to identify an individual associated with a risk group; and classifying the individual into a risk group.
  • a hierarchy is created by normalizing the feature information by formatting similar portions of the feature information into a unified format and a unified scale; storing the normalized feature information in a columnar database; training a prediction engine by applying one or more machine learning operations to the stored normalized feature information, the prediction engine mapping, for a particular population, a combination of one or more features; applying the prediction engine to the accessed field information to identify an individual associated with a risk group; and classifying the individual into a risk group.
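  • The steps above (normalize, store column-wise, train a prediction engine, apply it, classify) can be sketched end to end as follows. This is an illustration only: min-max scaling, an in-memory dictionary standing in for a columnar database, logistic regression trained by stochastic gradient descent, the 0.5 threshold, and all feature values are assumptions and not the disclosed method.

```python
import math

def fit_scaler(rows):
    """Record the (min, max) of every feature column."""
    return [(min(col), max(col)) for col in zip(*rows)]

def normalize(row, scaler):
    """Rescale one row onto a unified scale (0 to 1 over the training range)."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, scaler)]

def to_columns(names, rows):
    """Store rows column-wise, a stand-in for a columnar database."""
    return {name: list(col) for name, col in zip(names, zip(*rows))}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(columns, names, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights by stochastic gradient descent."""
    rows = list(zip(*(columns[n] for n in names)))
    w, b = [0.0] * len(names), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical features: age in years and hemoglobin slope in g/dL per year.
names = ["age", "hgb_slope"]
raw_rows = [[45, 0.1], [50, 0.0], [52, -0.1], [68, -0.9], [72, -1.2], [75, -0.8]]
labels = [0, 0, 0, 1, 1, 1]

scaler = fit_scaler(raw_rows)
columns = to_columns(names, [normalize(r, scaler) for r in raw_rows])
model = train(columns, names, labels)

def classify(row, threshold=0.5):
    """Apply the trained engine to one individual and assign a risk group."""
    w, b = model
    x = normalize(row, scaler)
    score = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return "elevated" if score >= threshold else "average"
```

  • For example, `classify([70, -1.0])` assigns a hypothetical 70-year-old with a falling hemoglobin trend to a risk group using the same scaling that was fitted on the training rows.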
  • Specificity generally refers to “the probability of assigning a negative risk score among those who are at a low risk of colorectal cancer”. It may be calculated as the number of low-risk persons who tested negative divided by the total number of low-risk individuals.
  • the model, classifier, or predictive test has a specificity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
  • Sensitivity generally refers to “the probability of a positive test among high-risk persons”. It may be calculated by the number of high-risk individuals who tested positive divided by the total number of high-risk individuals.
  • In various examples, the model, classifier, or predictive test has a sensitivity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least
  • the subject matter described herein can include a digital processing device or use of the same.
  • the digital processing device can include one or more hardware central processing units (CPU), graphics processing units (GPU), or tensor processing units (TPU) that carry out the device’s functions.
  • the digital processing device can include an operating system configured to perform executable instructions.
  • the digital processing device can optionally be connected to a computer network. In some examples, the digital processing device may be optionally connected to the Internet. In some examples, the digital processing device may be optionally connected to a cloud computing infrastructure. In some examples, the digital processing device may be optionally connected to an intranet. In some examples, the digital processing device may be optionally connected to a data storage device.
  • Non-limiting examples of suitable digital processing devices include server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, handheld computers, Internet appliances, mobile smartphones, and tablet computers.
  • Suitable tablet computers can include, for example, those with booklet, slate, and convertible configurations.
  • the operating system can include software, including programs and data, which manages the device’s hardware and provides services for execution of applications.
  • Non-limiting examples of operating systems include Ubuntu, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
  • Non-limiting examples of suitable personal computer operating systems include Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
  • the operating system may be provided by cloud computing, and cloud computing resources may be provided by one or more service providers.
  • the device can include a storage and/or memory device.
  • the storage and/or memory device may be one or more physical apparatuses used to store data or programs on a temporary or permanent basis.
  • the device may be volatile memory and require power to maintain stored information.
  • the device may be non-volatile memory and retain stored information when the digital processing device is not powered.
  • the non-volatile memory can include flash memory.
  • the volatile memory can include dynamic random-access memory (DRAM).
  • the non-volatile memory can include ferroelectric random access memory (FRAM).
  • the non-volatile memory can include phase-change random access memory (PRAM).
  • the device may be a storage device including, for example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage.
  • the storage and/or memory device may be a combination of devices such as those disclosed herein.
  • the digital processing device can include a display to send visual information to a user.
  • the display may be a cathode ray tube (CRT).
  • the display may be a liquid crystal display (LCD).
  • the display may be a thin film transistor liquid crystal display (TFT-LCD).
  • the display may be an organic light emitting diode (OLED) display.
  • an OLED display may be a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display.
  • the display may be a plasma display.
  • the display may be a video projector.
  • the display may be a combination of devices such as those disclosed herein.
  • the subject matter disclosed herein can include one or more non-transitory computer-readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device.
  • a computer-readable storage medium may be a tangible component of a digital processing device.
  • a computer-readable storage medium may be optionally removable from a digital processing device.
  • a computer-readable storage medium can include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like.
  • the program and instructions may be permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
E. Computer systems
  • FIG. 1 shows a computer system 101 that is programmed or otherwise configured to store, process, identify, or interpret demographic, physiological and clinical variables.
  • the computer system 101 can process various aspects of patient demographic, physiological and clinical variables of the present disclosure (FIG. 1).
  • the computer system 101 may be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
  • the electronic device may be a mobile electronic device.
  • the computer system 101 comprises a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which may be a single core or multi core processor, or a plurality of processors for parallel processing.
  • the computer system 101 also comprises memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and/or electronic display adapters.
  • the memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard.
  • the storage unit 115 may be a data storage unit (or data repository) for storing data.
  • the computer system 101 may be operatively coupled to a computer network (“network”) 130 with the aid of the communication interface 120.
  • the network 130 may be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
  • the network 130 in some examples is a telecommunication and/or data network.
  • the network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
  • the network 130, in some examples with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.
  • the CPU 105 can execute a sequence of machine-readable instructions, which may be embodied in a program or software.
  • the instructions may be stored in a memory location, such as the memory 110.
  • the instructions may be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.
  • the CPU 105 may be part of a circuit, such as an integrated circuit. One or more other components of the system 101 may be included in the circuit. In some examples, the circuit is an application specific integrated circuit (ASIC).
  • the storage unit 115 can store files, such as drivers, libraries and saved programs.
  • the storage unit 115 can store user data, e.g., user preferences and user programs.
  • the computer system 101 in some examples can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.
  • the computer system 101 can communicate with one or more remote computer systems through the network 130.
  • the computer system 101 can communicate with a remote computer system of a user.
  • Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
  • the user can access the computer system 101 via the network 130.
  • Methods as described herein may be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115.
  • the machine-executable or machine-readable code may be provided in the form of software. During use, the code may be executed by the processor 105. In some examples, the code may be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some examples, the electronic storage unit 115 may be precluded, and machine-executable instructions are stored on memory 110.
  • the code may be pre-compiled and configured for use with a machine having a processor adapted to execute the code or may be interpreted or compiled during runtime.
  • the code may be supplied in a programming language that may be selected to enable the code to execute in a precompiled, interpreted, or as-compiled fashion.
  • aspects of the systems and methods provided herein may be embodied in programming.
  • Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
  • Machine-executable code may be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
  • “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
  • another type of media that may bear the software elements comprises optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • a machine readable medium, such as computer-executable code, may take many forms, including but not limited to a tangible storage medium, a carrier wave medium, or a physical transmission medium.
  • Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
  • Volatile storage media include dynamic memory, such as main memory of such a computer platform.
  • Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
  • Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
  • Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
  • the computer system 101 can include or be in communication with an electronic display 135 that comprises a user interface (UI) 140 for providing, for example, demographic, physiological and clinical variables or features.
  • Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
  • Methods and systems of the present disclosure may be implemented by way of one or more algorithms.
  • An algorithm may be implemented by way of software upon execution by the central processing unit 105.
  • the algorithm can, for example, store, process, identify, or interpret patient demographic, physiological and clinical variables.
  • the subject matter disclosed herein can include at least one computer program or use of the same.
  • a computer program can include a sequence of instructions, executable in the digital processing device’s CPU, GPU, or TPU, written to perform a specified task.
  • Computer-readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types.
  • a computer program may be written in various versions of various languages. The functionality of the computer-readable instructions may be combined or distributed as desired in various environments.
  • a computer program can include one sequence of instructions.
  • a computer program can include a plurality of sequences of instructions. In some examples, a computer program may be provided from one location. In some examples, a computer program may be provided from a plurality of locations. In some examples, a computer program can include one or more software modules. In some examples, a computer program can include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or addons, or combinations thereof. In some examples, the computer processing may be a method of statistics, mathematics, biology, or any combination thereof.
  • the computer processing method comprises a dimension reduction method including, for example, logistic regression, dimension reduction, principal component analysis, autoencoders, singular value decomposition, Fourier bases, singular value decomposition, wavelets, discriminant analysis, support vector machine, tree-based methods, random forest, gradient boost tree, logistic regression, matrix factorization, network clustering, and neural network such as convolutional neural networks.
  • the computer processing method is a supervised machine learning method including, for example, a regression, support vector machine, tree-based method, and network.
  • the computer processing method is an unsupervised machine learning method including, for example, clustering, network, principal component analysis, and matrix factorization.
  • the subject matter disclosed herein can include one or more databases, or use of the same to store patient data, demographic, physiological and clinical variables.
  • suitable databases include electronic medical record (EMR) or electronic health record (EHR) databases.
  • suitable databases can include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases.
  • a database may be internet-based.
  • a database may be web-based.
  • a database may be cloud computing-based.
  • a database may be based on one or more local computer storage devices.
  • the database is one or more medical record database(s) and/or connected to a medical record database interface.
  • the database(s) may include a plurality of individual records, also referred to as a plurality of individual samples, which describe, for each of a plurality of sampled individuals, one or more sets of a plurality of historical test results of that individual, and optionally one or more demographic, physiological, or clinical variables.
  • the set of a plurality of variables may be stored in a common sample record and/or gathered from a number of independent and/or connected databases.
  • the present disclosure provides a non-transitory computer-readable medium comprising instructions that direct a processor to carry out a method disclosed herein.
  • the present disclosure provides a computing device comprising the computer-readable medium.
  • the disclosed methods are directed to classification models used to assess risk of colorectal cancer and stratify patient populations based on the classification. Such classification and stratification are useful to prioritize individuals in a population for targeted colorectal cancer screening and earlier diagnosis and intervention.
  • the present methods may incorporate a historical view including aspects such as recency, trends, and sequence features which may improve the predictive value of resulting models.
  • Each observation may include the entire patient history to that date.
  • the training data set used in the present methods may include multiple observations per patient across multiple time points.
  • feature data is collected longitudinally to provide time series of feature values.
  • the time series of feature values is weighted according to elapsed time to provide higher significance of recent measurements.
  • training data are collected at many points in a patient’s health history and not limited to single data points collected shortly before a patient’s CRC diagnosis. This approach may enable the present methods and machine learning models to learn and identify features indicative of CRC risk at earlier times.
  • observations and data are excluded if they occurred at a time point after a patient was referred to palliative or hospice care; after a colectomy; before the patient was 45; or after the patient turned 86, because screening is not recommended in those circumstances (thus the risk is not clinically relevant).
  • a method of generating a risk profile for cancer comprising: providing a classifier capable of generating a risk profile for cancer in an individual, wherein the colorectal cancer risk classifier is trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points.
  • a method for classification of cancer risk in an individual comprising: providing a colorectal cancer risk classifier trained on data that comprises a plurality of features based on two or more demographic, physiological and clinical variables, and 9 or less blood test features based on said plurality of current blood test results, each one of said 9 or less different blood test features is based on a blood test value of one of said plurality of current blood test results wherein at least one of the features is based on data variables obtained at 2 or more time points.
  • the present disclosure provides a system for performing classifications of individuals based on cancer risk comprising: a) a receiver to receive a plurality of training individuals, each of the plurality of training individuals having a plurality of demographic, physiological and clinical variables wherein each of the plurality of training individuals comprises one or more known labels b) a feature module to identify a set of features based on the variables corresponding to an individual that are operable to be input to the machine learning model for each of the plurality of individuals, wherein the set of features corresponds to variables of the plurality of training individuals, wherein for each of the plurality of training individuals, the system is operable to subject a plurality of variables of the individual to a plurality of different assays to obtain sets of measured values, wherein each set of measured values is from one variable in the individual, wherein a plurality of sets of measured values are obtained for the plurality of individuals, c) an analysis module to analyze the sets of measured values to obtain a training vector for the individual, wherein the training vector comprises feature values
  • a system for generating a risk profile comprising: a) a receiver to receive a plurality of training individuals, each of the plurality of training individuals having a plurality of demographic, physiological and clinical variables wherein each of the plurality of training individuals comprises one or more known labels b) a feature module to identify a set of features based on the variables corresponding to an individual that are operable to be input to the machine learning model for each of the plurality of individuals, wherein the set of features corresponds to variables of the plurality of training individuals, wherein for each of the plurality of training individuals, the system is operable to subject a plurality of variables of the individual to a plurality of different assays to obtain sets of measured values, wherein each set of measured values is from one variable in the individual, wherein a plurality of sets of measured values are obtained for the plurality of individuals, c) an analysis module to analyze the sets of measured values to obtain a training vector for the individual, wherein the training vector comprises feature values of the N set of features
  • the risk profile identifies individuals at risk of colorectal cancer.
  • the risk profile stratifies a population of individuals for cancer risk.
  • the risk profile is used to provide treatment recommendations for the individual based on the risk profile.
  • the set of current blood test results includes 9 or less of the following blood tests hemoglobin (HGB), hematocrit (HCT), and red blood cells (RBC) and at least one result of the following blood tests: mean cell hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC) and the age of the target individual.
  • the set of current blood test results further includes nine or less of the following blood tests: white blood cell count— WBC (CBC); mean platelet volume (MPV); mean cell volume (MCV); red cell distribution width (RDW); platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; neutrophils count; monocytes count; and platelets hematocrit (PCT).
  • the colorectal cancer risk is evaluated by classifying biochemical blood test results of the target individual.
  • the classifiers are generated according to an analysis of historical biochemical blood test results of the plurality of individuals.
  • the biochemical blood test results may include results of any of the following blood tests: Albumin, Calcium, Chloride, Cholesterol, Creatinine, high density lipoprotein (HDL), low density lipoprotein (LDL), Potassium, Sodium, Triglycerides, Urea, and/or Uric Acid.
  • the colorectal cancer risk is evaluated by classifying demographic characteristics of the target individual.
  • the classifiers are generated according to an analysis of demographic characteristics of the plurality of individuals.
  • both the demographic, physiological and clinical variables of the target individual and the demographic, physiological and clinical variables of sampled individuals are used for generating expanded sets of features which include manipulated and/or weighted values.
  • each expanded set of features is based on the demographic characteristics of a respective individual, for example as described below.
  • the one or more classifiers are adapted to one or more demographic characteristics of the target individual.
  • the classifiers are selected to match one or more demographic characteristics of the target individual.
  • different classifiers may be used for women and men.
  • methods and systems of generating one or more classifiers for colorectal cancer risk evaluation are based on analysis of a plurality of demographic, physiological or clinical variables of each of a plurality of sampled individuals and generating accordingly a dataset having a plurality of sets of features each generated according to respective demographic, physiological or clinical variables.
  • the dataset is then used to generate and output one or more classifiers, such as a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, K nearest neighbor (KNN), a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and principal component analysis classifier.
  • the classifiers may be provided as modules for execution on client terminals or used as an online service for evaluating colorectal cancer risk of target individuals based on their current blood test results.
  • each sample record includes one or more sets of a plurality of historical test results of an individual, each of which includes a combination of blood test results, for example a combination of 9 or less and/or any intermediate number of blood test results.
  • each extracted set of unprocessed features includes 9 or less of the following blood test results: red blood cells (RBC); white blood cell count, WBC (CBC); mean platelet volume (MPV); hemoglobin (HGB); hematocrit (HCT); mean cell volume (MCV); mean cell hemoglobin (MCH); mean corpuscular hemoglobin concentration (MCHC); red cell distribution width (RDW); platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; neutrophils count; monocytes count; and platelets hematocrit (PCT).
  • each extracted set of unprocessed features includes at least result of the following blood tests HGB, HCT, and RBC, at least
  • this extracted set of unprocessed features further includes one or more of the following blood tests: RDW, Platelets, and MCV. Additionally, this extracted set of unprocessed features may further include one or more of the following blood tests: WBC, eosinophils count, neutrophils percentage and/or count, basophils percentage and/or count, and monocytes percentage and/or count.
  • Example 1 Modeling Prognostic Colorectal Cancer Risk Using Dataset 1 Data and initial preparation
  • Dataset 1 is a de-identified compilation of several decades of records from over twenty million patients in the UK, collected at over 800 primary care practices across England, Scotland, Wales and Northern Ireland.
  • Dataset 1 incorporates data which was used for the validation of the ColonFlag Model published in Kinar et al. (2016).
  • the dataset was cleaned, curated, and transformed into a format amenable for ML methods combining custom techniques with those described in certain publications: Boursi et al. (2016), Lewis et al. (2015), Khan (2010), Barnett (2012), Cassell (2016), Payne (2020).
  • Model A was designed to help health systems and health plans improve patient outcomes in the context of existing CRC screening programs. To that end, the model uses a 24-month predictive window (in order to identify cancers early enough to improve outcomes), an inclusive population between the ages of 45 and 85 (consistent with current clinical and coverage guidelines), and a broad set of features (in order to improve performance). Further, the data used included data collected at many points in a patient’s health history and did not limit the training set to a single data point collected shortly before a patient’s CRC diagnosis, as Kinar et al. (2016) did, thereby enabling the ML model to learn features indicative of CRC risk at earlier times.
  • the model was trained on a set of observations corresponding to the dates in a patient’s record on which CRC risk can be calculated, is relevant to clinical care, and can be updated based on new information. Dates before January 2011 were excluded from the observation set because several UK initiatives impacting data quality were not complete until that date. Observations were also excluded if they occurred after a patient was referred to palliative or hospice care; after a colectomy; before the patient was 45; or after the patient turned 86, because screening is not recommended in those circumstances (thus the risk is not clinically relevant). Each observation included the entire patient history to that date. By contrast to ColonFlag (Medial Research Ltd), which was trained on one observation per patient, the present training set often included multiple observations per patient.
  • the present machine learning and imputation methods were developed to be as inclusive of potentially informative health data as possible, subject to our strict regularization criteria. While traditional prognostic health models use a static view of patient state at time of prediction (e.g., most recent laboratory values, and presence or absence of historical diagnoses), these methods incorporate temporal features (e.g., recency, trends, and orderings) which are shown to be helpful in increasing model performance.
  • the ColonFlag Model (Kinar et al. 2016) was chosen as a benchmark. This is a natural choice because it has been developed into the commercial product ColonFlag by ColonFlag and is used by some health systems in the United States. As noted above, the present Model A is not suitable for a direct comparison to ColonFlag because it was trained on a larger and more recent dataset using methods that differed in multiple ways. In order to isolate the machine learning methodology, two new models were trained in as close to the same manner as the ColonFlag Model as possible. Specifically, both were trained on the ColonFlag Comparison Cohort, which was constructed with the same study design as the dataset used to train the ColonFlag model.
  • the ColonFlag Model was trained with a dataset from Maccabi Healthcare Services, which is not available. Both models were trained with the same methodological approach (cohort, two year horizon, most recent CBC observation for cases) as ColonFlag.
  • the first model aims to closely replicate the ColonFlag Model and is limited to age, sex, and CBC values. (Referred to herein as the ColonFlag Replicate).
  • the second model uses additional data about a patient’s health history in the same manner as the present Model A. (Referred to herein as the ColonFlag Replicate Model). Information about the models developed for this study, as well as Model A are summarized in Table 1.
  • the validation data (referred to as ColonFlag Comparison Case-Control Holdout set) was constructed to be as similar as possible to the validation set used in Kinar et al. (2016).
  • the models’ ability to identify patients in a case-control dataset between the ages of 50 and 75 who, at the time of a complete blood count result, would be diagnosed with CRC within 3-6 months was evaluated.
  • the same set of performance metrics as Kinar et al. (2016) was used to evaluate models on this dataset.
  • In addition to improving the model’s performance, using a broader feature set in combination with data imputation allows Model A to statistically assess risk for any patient at risk of CRC.
  • the ColonFlag Model requires that a patient has had a recent CBC in order to make a prediction.
  • In Study Cohort 1, the ColonFlag Model would have made predictions on only 68% of the observations, requiring attending doctors to order CBCs for the remainder in order to assess their CRC risk.
  • a risk model such as that described herein is trained and validated for use at any time, not just at a single point in time as is the case in most statistical model validations.
  • Model A uses all prior observations equally.
  • Table 1 Not only does Model A find more of the cases before the actual standard-of-care diagnosis (80.8% vs 76.7%), but it does so with a median lead time that is 231 days longer than the ColonFlag Comparison Model.
  • Example 2 Modeling Prognostic Colorectal Cancer Risk using EMR and Claims data from the US.
  • The model whose performance is depicted here (Model B) was trained and tested using a de-identified dataset from partner B, a compilation of electronic medical records and health insurance claims. It was developed on a dataset comprising all available medical records and insurance history from a subset of the patients in partner B’s data warehouse: all patients with any prior cancer diagnosis and a random 20% sample of all patients without a history of cancer.
  • the combined dataset (Dataset B) was cleaned, curated, and transformed into a format amenable for ML methods using custom techniques similar to those used in developing the Model A described in Example 1.
  • Model B was designed to help health systems and health plans improve patient outcomes in the context of existing CRC screening programs. As with Model A, a 24-month predictive window, an inclusive population between the ages of 45 and 75, and a broad set of features collected at many points in a patient’s health history were used in training and validation. Further, as with Model A, a validation set drawn from a geographically distinct region was used to test the generalizability of the model. Model B was trained on data collected in seven US census geographic divisions and validated on data from an eighth division.
  • Figure 5 represents the performance of the Model B in a facsimile of a population health management real-world use case.
  • a population health team evaluated eligible patients’ two-year CRC risk twice a year, on January 1 and July 1 of each year from 2012 to 2019.
  • Eligible patients on each date comprised patients who, on this date, were continuously enrolled for at least one year prior and two years subsequent (or who died in the subsequent two years).
  • the prior enrollment period requirement was implemented to ensure sufficient data were available to the risk prediction model.
  • the follow-up period requirement ensured two year follow-up CRC incidence was captured in the data.
  • On each January 1 and July 1 from 2012 to 2019, the eligible patients’ risk scores were assessed by both Model B and a similar model which used only the patient’s age and biological sex as features.
  • the area under the ROC curve (AUC) for Model B is 0.719 (95% CI 0.701, 0.736) and the AUC for the age-sex baseline is 0.651 (95% CI 0.632, 0.669).
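As a non-limiting illustration of how an AUC and a confidence interval of the kind reported above might be computed, the following Python sketch uses the rank-based (Mann-Whitney) definition of AUC and a percentile bootstrap. The disclosure does not state how its intervals were obtained; the bootstrap, the function names `roc_auc` and `bootstrap_auc_ci`, and the toy labels and scores are all assumptions made for illustration.

```python
import random


def roc_auc(labels, scores):
    """Probability that a random case scores above a random control (ties count half)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (illustrative choice of method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Skip resamples that contain only cases or only controls.
        if 0 < sum(resampled_labels) < n:
            stats.append(roc_auc(resampled_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Hypothetical toy data: 1 = diagnosed with CRC within the window, 0 = not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ci = bootstrap_auc_ci(toy_labels, toy_scores, n_boot=200)
```

The same two functions could be applied to the scores of a full model and of an age-and-sex baseline on one evaluation cohort to compare them side by side.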

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Veterinary Medicine (AREA)
  • Animal Behavior & Ethology (AREA)
  • Surgery (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Primary Health Care (AREA)
  • Psychiatry (AREA)
  • Gastroenterology & Hepatology (AREA)
  • Endocrinology (AREA)
  • Immunology (AREA)
  • Signal Processing (AREA)
  • Chemical & Material Sciences (AREA)
  • Optics & Photonics (AREA)
  • Evolutionary Computation (AREA)
  • Hematology (AREA)
  • Biotechnology (AREA)
  • Urology & Nephrology (AREA)
  • Microbiology (AREA)
  • Cell Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)

Abstract

The present disclosure provides classification models, methods and systems for cancer screening and detection, including for evaluating a risk of cancer by stratifying populations of individuals based on non-invasive features and computational methods. A classifier is provided that is trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points. The models, methods and systems may provide increased lead time before a diagnosis to permit treatment intervention at a time point that is still actionable and may improve clinical outcomes including treatment efficacy and mortality.

Description

METHODS AND SYSTEMS FOR RISK STRATIFICATION OF COLORECTAL CANCER
CROSS-REFERENCE
[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/304,101, filed January 28, 2022, which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] There are a variety of colorectal cancer screening modalities available, including high-sensitivity guaiac fecal occult blood test (HSgFOBT), fecal immunochemical test (FIT), FIT-DNA, computed tomography colonography, flexible sigmoidoscopy, and colonoscopy. While uncommon, harms resulting from CRC screening are primarily due to complications from colonoscopies (i.e., screening, surveillance, or follow-up after positive non-invasive test) and include serious gastrointestinal bleeding, perforations, and cardiopulmonary events (USPSTF 2021). Clinical and coverage guidelines in the United States generally recommend CRC screening for asymptomatic adults aged 45 to 85. In addition, the Centers for Medicare and Medicaid Services (CMS) has included adherence to CRC screening as a component of its Star Rating program since 2008.
[0003] Prior models have several shortcomings that may limit their ability to support targeted CRC screening programs. For example, some models (e.g., Kinar et al. (2016)) use a small set of preselected variables, which excludes potentially valuable information from other sources; are designed for use in a narrowly defined population (i.e., the patients with the information required by the short variable list); and do not consider a long enough prediction horizon to enable early diagnosis.
SUMMARY
[0004] The present disclosure relates to cancer screening and detection and, more particularly, but not exclusively, to methods and systems of evaluating a risk of cancer by stratifying populations of individuals based on non-invasive features and computational methods. Some current evidence shows that the benefits of routine colorectal cancer (CRC) screening may outweigh the harms for individuals between 45 and 75; however, the use of better risk stratification tools to prioritize outreach to high-risk individuals for screening could further improve benefits, lower risks, boost screening adherence among high-risk individuals who become aware of their risk, and optimize the utilization of limited screening resources (Ladabaum 2020; Robertson 2019). For example, microsimulations of a program that only screens individuals with a CRC risk above a certain threshold found similar reductions in CRC incidence and mortality when compared to screening the entire “average risk” population (Buskermolen 2019, Helsingen 2019). Such targeted screening efforts require accurate CRC risk prediction models to improve patient compliance with early enough lead times to enable early intervention and treatment for improved clinical outcomes.
[0005] In various aspects of the present disclosure, computerized classifiers, methods and systems of evaluating colorectal cancer risk are provided.
[0006] In an aspect, the present disclosure provides a classifier for evaluation of colorectal cancer risk of a target individual, wherein the classifier is trained on at least one training data set that comprises a plurality of features based on demographic, physiological, and clinical variables from the target individual, wherein the classifier is generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features from a plurality of sampled individuals, wherein at least one of the plurality of features is derived from at least data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
[0007] In some embodiments, the demographic, physiological, and clinical features comprise at least two features obtained from demographic, symptomatic, lifestyle, diagnosis, or biochemical variables. In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
[0008] In some embodiments, the classifier comprises features based on recency, trends, and sequence features.
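As a non-limiting illustration, recency, trend, and ordering features of this kind might be derived from dated records as in the following Python sketch. The feature definitions (days since the last measurement, a least-squares slope per year, and a "first recorded before" flag), the function names, and the example values are assumptions chosen for illustration; the disclosure does not specify how such features are computed.

```python
from datetime import date


def temporal_features(measurements, as_of):
    """Recency and trend features from (date, value) pairs observed on or before as_of."""
    history = sorted((d, v) for d, v in measurements if d <= as_of)
    if not history:
        return {"days_since_last": None, "last_value": None, "slope_per_year": None}

    last_date, last_value = history[-1]
    slope = None
    if len(history) >= 2:
        # Least-squares slope of value against time, with time in years relative to as_of.
        xs = [(d - as_of).days / 365.25 for d, _ in history]
        ys = [v for _, v in history]
        x_mean = sum(xs) / len(xs)
        y_mean = sum(ys) / len(ys)
        sxx = sum((x - x_mean) ** 2 for x in xs)
        if sxx > 0:
            slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx

    return {
        "days_since_last": (as_of - last_date).days,
        "last_value": last_value,
        "slope_per_year": slope,
    }


def occurred_before(events, first_code, second_code):
    """Ordering feature: True if first_code was first recorded before second_code."""
    first_dates = [d for d, code in events if code == first_code]
    second_dates = [d for d, code in events if code == second_code]
    if not first_dates or not second_dates:
        return False
    return min(first_dates) < min(second_dates)


# Hypothetical hemoglobin history falling by about 1 g/dL per year.
hgb_history = [(date(2018, 1, 1), 14.0), (date(2019, 1, 1), 13.0), (date(2020, 1, 1), 12.0)]
hgb_features = temporal_features(hgb_history, as_of=date(2020, 7, 1))

# Hypothetical coded events for the ordering feature.
coded_events = [(date(2019, 1, 1), "anemia"), (date(2019, 6, 1), "weight_loss")]
anemia_first = occurred_before(coded_events, "anemia", "weight_loss")
```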
[0009] In some embodiments, the demographic, physiological, and clinical variables reflect an entire medical history of the target individual.
[0010] In some embodiments, the demographic, physiological, and clinical variables are not generated using a CBC analysis of the target individual.
[0011] In some embodiments, a training data set used to train the classifier comprises two or more variables per sampled individual across two or more time points.
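As a non-limiting illustration, a training set with several observations per individual, each restricted to the history available on its observation date, might be assembled as in the following Python sketch. The record layout, the function name `build_observations`, and the example dates are assumptions made for illustration.

```python
from datetime import date


def build_observations(patient_events, observation_dates):
    """One training row per (patient, observation date), each seeing only prior history."""
    rows = []
    for patient_id, events in patient_events.items():
        for as_of in observation_dates:
            # Keep only events dated on or before the observation date to avoid leakage.
            history = [(d, name, value) for d, name, value in events if d <= as_of]
            if history:
                rows.append({"patient_id": patient_id, "as_of": as_of, "history": history})
    return rows


# Hypothetical records: (date, variable name, value).
example_events = {
    "p1": [(date(2015, 3, 1), "HGB", 13.5), (date(2017, 6, 1), "HGB", 12.1)],
    "p2": [(date(2018, 1, 1), "HGB", 14.0)],
}
example_rows = build_observations(example_events, [date(2016, 1, 1), date(2018, 6, 1)])
```

In this toy example, patient p1 contributes two observations and patient p2 contributes one, because p2 has no history on the earlier observation date.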
[0012] In some embodiments, the feature data for the plurality of features is collected longitudinally to provide a time series of feature values.
[0013] In some embodiments, the time series of feature values is weighted according to recency or elapsed time, thereby providing a higher significance to recent measurements of the time series of feature values.
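As a non-limiting illustration, one way to give recent measurements higher significance is an exponentially decaying weight, as in the following Python sketch. The exponential form, the `half_life_days` parameter, and the function name are assumptions made for illustration; the disclosure does not specify the weighting function.

```python
from datetime import date, timedelta


def recency_weighted_mean(measurements, as_of, half_life_days=365.0):
    """Weighted mean of (date, value) pairs where each weight halves every half_life_days."""
    weighted_sum = 0.0
    weight_total = 0.0
    for measured_on, value in measurements:
        age_days = (as_of - measured_on).days
        if age_days < 0:
            continue  # ignore anything recorded after the prediction date
        weight = 0.5 ** (age_days / half_life_days)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical example: a value from one half-life ago counts half as much as today's value.
prediction_date = date(2021, 1, 1)
example_series = [(prediction_date - timedelta(days=365), 10.0), (prediction_date, 13.0)]
example_mean = recency_weighted_mean(example_series, prediction_date)
```

With weights of 0.5 and 1.0, the weighted mean in this example sits closer to the recent value than the unweighted mean of 11.5 would.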
[0014] In some embodiments, complexity and overfitting of the classifier are reduced at least in part by performing a machine learning regularization method.
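As a non-limiting illustration of a regularization method, the following Python sketch fits a logistic regression by batch gradient descent with an L2 (ridge) penalty on the weights. The choice of L2 regularization, the optimizer, the hyperparameter values, and the toy data are assumptions made for illustration; the disclosure does not identify a specific regularization method.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_l2(X, y, l2=0.1, lr=0.1, epochs=2000):
    """Minimize mean log loss + (l2 / 2) * ||w||^2 by gradient descent; the bias is not penalized."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The l2 * wj term shrinks each weight toward zero on every step.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical one-feature data set that is perfectly separable.
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
w_weak, _ = fit_logistic_l2(toy_X, toy_y, l2=0.001)
w_strong, _ = fit_logistic_l2(toy_X, toy_y, l2=1.0)
```

On separable data such as this, a weak penalty lets the weight keep growing, while a strong penalty keeps it small, which is the complexity-limiting effect that regularization is intended to provide.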
[0015] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 variables are used in the classification model.
[0016] In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in the classification model.
[0017] In another aspect, the present disclosure provides a classifier for evaluation colorectal cancer risk of a target individual, wherein the classifier is trained on at least one training data set that comprises (i) a plurality of features based on two or more demographic, physiological, and clinical variables from the target individual, and (ii) 9 or less blood test features based on a plurality of current blood test results of the target individual, wherein each one of the 9 or less different blood test features is based on a blood test value of one of the plurality of current blood test results of the target individual, wherein at least one of the plurality of features is based on data of the two or more demographic, physiological, and clinical variables obtained at 2 or more time points.
[0018] In some embodiments, the plurality of blood test results comprises (i) 9 or less of the following blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and (ii) at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
[0019] In some embodiments, the blood test results comprise 9 or less results of the following blood tests: white blood cell count (WBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; neutrophils count; monocytes count; lymphocytes count; and neutrophil-lymphocyte ratio (NLR).
[0020] In some embodiments, the classifier is trained equally across all prior demographic, physiological, and clinical features. In some embodiments, the equal training reduces or prevents bias towards later detection. In some embodiments, the classifier is configured to provide the evaluation of the colorectal cancer risk of the target individual with advance notice or lead time before a diagnosis that is sufficient to permit treatment intervention at a time point that is clinically actionable.
[0021] In some embodiments, the advance notice or lead time is sufficient to permit a treatment that improves clinical outcomes including treatment efficacy and mortality. In some embodiments, the advance notice or lead time provided by the classification model permits a treatment that improves clinical outcomes including treatment efficacy and mortality.
[0022] In some embodiments, the advance notice or lead time has a median that is between 100 and 300 days before colorectal cancer diagnosis.
[0023] In some embodiments, the advance notice or lead time has a median that is at least 300 days, at least 400 days, at least 500 days, at least 600 days, at least 700 days, at least 800 days, at least 900 days, at least 1000 days, at least 1100 days, at least 1200 days, at least 1300 days, at least 1400 days, or at least 1500 days before colorectal cancer diagnosis.
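The median lead time referred to in paragraphs [0022] and [0023] can be computed as sketched below. The sketch assumes that each individual has a date on which the classifier first flagged elevated risk and a later date of colorectal cancer diagnosis; the function name and the input layout are hypothetical.

```python
from datetime import date
from statistics import median


def median_lead_time_days(flag_dates, diagnosis_dates):
    """Median number of days between a risk flag and the later diagnosis.

    `flag_dates[i]` and `diagnosis_dates[i]` are assumed to belong to the
    same individual; this pairing is an assumption of the example.
    """
    lead_times = [(dx - flag).days for flag, dx in zip(flag_dates, diagnosis_dates)]
    return median(lead_times)
```

For three individuals flagged 100, 200, and 366 days before diagnosis, the function returns 200, which falls within the 100 to 300 day range recited in paragraph [0022].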
[0024] In some embodiments, the advance notice or lead time is at least 600 days.
[0025] In some embodiments, the advance notice or lead time is at least 1000 days.
[0026] In some embodiments, the plurality of features comprises an age of the target individual; wherein the classifier is generated according to at least an analysis of the age of each of another plurality of sampled individuals.
[0027] In some embodiments, the plurality of features comprises a sex of the target individual; wherein the classifier is generated according to at least an analysis of the sex of each of another plurality of sampled individuals.
[0028] In some embodiments, the classifier is configured as a machine learning classifier selected from the group consisting of a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, K nearest neighbor (KNN), a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and principal component analysis classifier.
[0029] In another aspect, the present disclosure provides a method of generating a colorectal cancer risk classifier, comprising: (a) providing a plurality of features based on demographic, physiological and clinical variables from a target individual; (b) generating a dataset having a plurality of sets of features, each set of features generated according to a respective plurality of features based on demographic, physiological and clinical features from a plurality of sampled individuals; and (c) generating at least one classifier based at least in part on an analysis of the dataset, and outputting the at least one classifier, wherein at least one of the plurality of features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
[0030] In some embodiments, the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables. In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model. 
In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
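A minimal sketch of what featurizing such variables for a computational classification model might look like is given below. It assumes a simple encoding in which age is passed through as a number and categorical or binary variables are one-hot encoded; the record layout, the key names, and the `featurize` function are hypothetical and are not taken from the disclosure.

```python
def featurize(record, symptom_vocab, sex_levels=("female", "male")):
    """Turn one individual's record into a flat numeric feature vector.

    Layout of the output (an assumption of this example):
    [age, one-hot sex..., one indicator per symptom in `symptom_vocab`]
    """
    features = [float(record["age"])]
    # One-hot encode the categorical demographic variable.
    features += [1.0 if record["sex"] == level else 0.0 for level in sex_levels]
    # Binary indicator for each symptomatic variable in a fixed vocabulary.
    present = set(record.get("symptoms", ()))
    features += [1.0 if symptom in present else 0.0 for symptom in symptom_vocab]
    return features
```

Keeping the vocabulary order fixed ensures that every individual maps to a vector of the same length and column meaning, which is what a classifier trained on many sampled individuals requires.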
[0031] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in generating the colorectal cancer risk classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in generating the colorectal cancer risk classifier.
[0032] In some embodiments, the generating of (c) comprises weighting each of the plurality of demographic, physiological and clinical features according to a date of the respective plurality of demographic, physiological and clinical features. In some embodiments, the generating of (c) comprises filtering the plurality of demographic, physiological and clinical features to remove outliers according to a standard deviation maximum threshold. In some embodiments, the plurality of features are weighted according to a date of the respective plurality of demographic, physiological and clinical features.
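The outlier filtering in paragraph [0032] can be sketched as follows, assuming that a value is treated as an outlier when it lies more than a chosen number of standard deviations from the mean of its feature. The threshold value and the function name are hypothetical; the disclosure states only that a standard deviation maximum threshold is used.

```python
from statistics import mean, pstdev


def filter_outliers(values, max_sd=3.0):
    """Keep values within `max_sd` population standard deviations of the mean.

    The default threshold of 3.0 is an illustrative assumption.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mu) <= max_sd * sd]
```

Because the mean and standard deviation are themselves affected by extreme values, a single pass like this is a coarse filter; it is shown here only to make the thresholding step concrete.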
[0033] In another aspect, the present disclosure provides a method of generating a risk profile for colorectal cancer in a target individual comprising: a) obtaining a plurality of features based on demographic, physiological, and clinical variables from a target individual, b) providing at least one classifier generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features from a plurality of sampled individuals; and c) evaluating, using a processor, a colorectal cancer risk of the target individual at least in part by classifying the plurality of features using the at least one classifier to provide a risk profile of the target individual, wherein at least one of the plurality of features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
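One simple example of a feature based on data obtained at 2 or more time points is the rate of change of a variable between its earliest and latest measurement. The sketch below assumes measurements are supplied as (date, value) pairs; the function name and the choice of a per-year rate are hypothetical and serve only to illustrate the idea.

```python
from datetime import date


def rate_of_change_per_year(measurements):
    """Change in a variable per year between its first and last measurement.

    `measurements` is a list of (date, value) pairs with at least two
    distinct dates; this input layout is an assumption of the example.
    """
    if len(measurements) < 2:
        raise ValueError("need measurements from at least two time points")
    ordered = sorted(measurements)
    (t0, v0), (t1, v1) = ordered[0], ordered[-1]
    elapsed_years = (t1 - t0).days / 365.25
    if elapsed_years == 0:
        raise ValueError("measurements must span more than one date")
    return (v1 - v0) / elapsed_years
```

For instance, a value that falls from 14.0 to 13.0 over two years yields a feature of roughly -0.5 per year, capturing a trend that a single measurement cannot.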
[0034] In some embodiments, the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
[0035] In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
[0036] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in generating the risk profile for colorectal cancer. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in generating the risk profile for colorectal cancer.
[0037] In another aspect, the present disclosure provides a method for evaluating colorectal cancer risk of a target individual, comprising: obtaining a plurality of features based on demographic, physiological, and clinical variables from a target individual; providing at least one classifier generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features of each of a plurality of sampled individuals; and evaluating, using a processor, a colorectal cancer risk of the target individual at least in part by classifying the plurality of features using the at least one classifier, wherein at least one of the features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
[0038] In some embodiments, the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables. In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model. 
In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
[0039] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
[0040] In another aspect, the present disclosure provides a method for evaluation of colorectal cancer risk in a target individual, comprising: a) receiving by a computing system associated with a database storing a plurality of classifiers and from a client terminal and via a network, an indication of values of a plurality of features based on demographic, physiological, and clinical variables from a target individual, wherein the clinical variables comprise current blood test results calculated based at least in part on an analysis of a blood sample obtained from the target individual, and wherein at least one of the demographic, physiological, and clinical features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points; b) generating, by said computing system, a combination of features based on two or more of the demographic, physiological and clinical variables, and 9 or less blood test features based on said plurality of current blood test results, each one of said 9 or less different blood test features is based on a blood test value of one of said plurality of current blood test results, wherein at least one of the features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points; c) selecting at least one classifier from the plurality of classifiers according to at least one demographic characteristic of the target individual, wherein each of the plurality of classifiers is generated according to a plurality of respective demographic, physiological and clinical variables and historical blood test results of a plurality of sampled individuals having at least one different demographic characteristic, wherein the at least one classifier is generated based at least in part on an analysis of the plurality of respective demographic, physiological, and clinical variables and historical blood test results of each of another of the plurality of
sampled individuals; d) evaluating, using a computer processor of the computing system, a colorectal cancer risk of the target individual at least in part by classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier; and e) outputting the colorectal cancer risk for presentation by the client terminal.
[0041] In some embodiments, the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables. In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
[0042] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier.
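Step (c) of the method of paragraph [0040], selecting a classifier according to a demographic characteristic of the target individual, can be sketched as a lookup over classifiers keyed by demographic stratum. The stratum definition (sex plus an age range), the registry structure, and the constant-score stand-in classifiers below are all hypothetical; the disclosure does not specify how the stored classifiers are indexed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# A classifier is modeled here as any callable mapping features to a risk score.
Classifier = Callable[[Sequence[float]], float]


@dataclass(frozen=True)
class Stratum:
    """Hypothetical demographic stratum that a stored classifier was trained on."""
    sex: str
    min_age: int
    max_age: int

    def matches(self, sex: str, age: int) -> bool:
        return self.sex == sex and self.min_age <= age <= self.max_age


def select_classifier(registry: Dict[Stratum, Classifier], sex: str, age: int) -> Classifier:
    """Return the first stored classifier whose stratum covers the individual."""
    for stratum, classifier in registry.items():
        if stratum.matches(sex, age):
            return classifier
    raise LookupError(f"no classifier registered for sex={sex!r}, age={age}")


# Toy registry: the constant scores are placeholders, not trained models.
registry = {
    Stratum("female", 40, 59): lambda features: 0.1,
    Stratum("male", 60, 120): lambda features: 0.3,
}
```

Calling `select_classifier(registry, "male", 67)` returns the second stand-in classifier, while an individual outside every registered stratum raises `LookupError` so that the caller can decide on a fallback.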
[0043] In another aspect, the present disclosure provides a system for generating a colorectal cancer risk profile comprising: (i) a processor, (ii) a memory unit which stores at least one classifier generated based at least in part on an analysis of a plurality of demographic, physiological, and clinical features of a plurality of sampled individuals, and an input unit which receives a plurality of demographic, physiological and clinical variables of a target individual, and (iii) a colorectal cancer evaluating module which evaluates, using the processor, a colorectal cancer risk of the target individual at least in part by classifying, using the at least one classifier, a plurality of features based on the plurality of demographic, physiological, and clinical variables, wherein at least one of the plurality of features is based on data of the plurality of demographic, physiological, and clinical variables obtained at 2 or more time points.
[0044] In some embodiments, the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables. In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model. 
In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
[0045] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier. In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier. In some embodiments, the colorectal cancer risk profile identifies individuals at risk of colorectal cancer. In some embodiments, the colorectal cancer risk profile stratifies a population of individuals for cancer risk. In some embodiments, the colorectal cancer risk profile is used to provide treatment recommendations for the individual based on the colorectal cancer risk profile.
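Stratifying a population for cancer risk, as described in paragraph [0045], can be illustrated by binning classifier scores into tiers. The three tier names and the cut-off values in this sketch are hypothetical; the disclosure does not state specific thresholds.

```python
def stratify(scores, low_cut=0.2, high_cut=0.6):
    """Group individuals into risk tiers from classifier scores in [0, 1].

    `scores` maps an individual identifier to a risk score. The cut-offs
    are illustrative assumptions, not values from the disclosure.
    """
    tiers = {"low": [], "intermediate": [], "high": []}
    for person_id, score in scores.items():
        if score >= high_cut:
            tier = "high"
        elif score >= low_cut:
            tier = "intermediate"
        else:
            tier = "low"
        tiers[tier].append(person_id)
    return tiers
```

A downstream rule could then attach a recommendation to each tier, for example prioritizing the "high" tier for follow-up, although any such mapping would need to be defined clinically.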
[0046] In another aspect, the present disclosure provides a system for classifying a target individual for colorectal cancer risk comprising: (i) a processor, (ii) a memory unit which stores at least one classifier generated based at least in part on an analysis of a plurality of demographic, physiological, and clinical features of a plurality of sampled individuals, and an input unit which receives a plurality of demographic, physiological, and clinical variables of a target individual, and (iii) a colorectal cancer evaluating module which evaluates, using the processor, a colorectal cancer risk of the target individual at least in part by classifying, using the at least one classifier, a plurality of features based on the plurality of demographic, physiological, and clinical variables, wherein at least one of the plurality of features is based on data of the plurality of demographic, physiological, and clinical variables obtained at 2 or more time points.
[0047] In some embodiments, the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables. In some embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data. In some embodiments, the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index. In some embodiments, the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating belching, abnormal weight loss, and obesity. In some embodiments, the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). In some embodiments, the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model. 
In some embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. In some embodiments, the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model. In some embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid. In some embodiments, the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
[0048] In some embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier.
[0049] In some embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
[0050] In some embodiments, the demographic, physiological, and clinical features comprise a plurality of historical and current blood test results comprising results of 9 or fewer of the following plurality of blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
[0051] Implementation of the method and/or system of embodiments of the disclosure can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the disclosure, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
[0052] In some embodiments, hardware for performing selected tasks according to embodiments of the disclosure could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the disclosure could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system.
[0053] In an example embodiment of the disclosure, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. In some embodiments, the data processor includes a volatile memory for storing instructions and/or data and/or a nonvolatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. In some embodiments, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
[0054] Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE DRAWINGS
[0055] Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
[0056] FIG. 1 provides a schematic of a computer system that is programmed or otherwise configured with the machine learning models and classifiers in order to implement methods provided herein.
[0057] FIG. 2 shows ROC curves for the ColonFlag Replicate (AUC = 0.777) and ColonFlag Comparison (AUC = 0.835) models evaluated on the Medial Comparison Case-Control dataset. The ROC curve was calculated on the holdout set with a CRC prediction horizon of 3-6 months. The sensitivity = 50% and specificity = 99.5% points are marked with ‘x’.
[0058] FIG. 3 shows the ROC curve for Model A (AUC = 0.768). The ROC curve was calculated on the Northern Ireland Holdout set with a CRC prediction horizon of 24 months. The sensitivity = 50% and specificity = 99.5% points are shown with the orange and green ‘x’, respectively.
[0059] FIG. 4 shows SHAP values for Model A trained on the Dataset 1 cohort. SHAP values measure the contribution of each feature to the prediction. Positive SHAP values indicate that the feature increases a patient’s risk score and negative values indicate that a feature reduces a patient’s risk score. In the figure, each point represents a single observation. The SHAP values for a given feature in every observation are plotted horizontally in each row of the figure. The vertical thickness of each row captures the relative frequency of a feature’s contribution level across all observations. Finally, the color of each point indicates the value of the feature, with blue representing low feature values and red representing high feature values. For example, the thick red cloud of points on the right side of the top row indicates that for many observations, higher values of demographic feature 1 contribute to higher CRC risk. To include longitudinal analysis of variables in the present model, seven of the features included in the model are obtained from variables measured at multiple time points: lab feature 3, lab feature 4, lab feature 7, lab feature 8, diagnosis feature 1, BMI feature 2, and lab feature 9.
[0060] FIG. 5 shows ROC curves for Model B (AUC = 0.719) and an age-sex baseline model trained on a subset of Dataset B drawn from seven US Census geographic divisions. The ROC curve was calculated on a validation data set drawn from a distinct geographic division with a prediction horizon of 2 years. The superior performance of Model B over the baseline is statistically significant.
DETAILED DESCRIPTION
[0061] The present disclosure, in some embodiments thereof, relates to cancer diagnosis and, more particularly, but not exclusively, to methods and systems of evaluating a risk of cancer.
I. CLASSIFIERS & MACHINE LEARNING MODELS
[0062] In various examples, features obtained from demographic, physiological and clinical variables are used as input datasets into trained algorithms (e.g., machine learning models or classifiers) to find correlations between features and patient groups. Examples of such patient groups include presence of diseases or conditions, stages, subtypes, treatment responders vs. non-responders, and progressors vs. non-progressors. In various examples, feature matrices are generated to compare samples obtained from individuals with known conditions or characteristics. In some embodiments, samples are obtained from healthy individuals, or individuals who do not have any of the known indications and samples from patients known to have increased risk of developing cancer, and in particular colorectal cancer.
[0063] As used herein, as it relates to machine learning and pattern recognition the term “feature” generally refers to an individual measurable property or characteristic of a phenomenon being observed. The concept of “feature” is related to that of explanatory variable used in statistical techniques such as for example, but not limited to, linear regression and logistic regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition.
[0064] As used herein, features may be obtained from demographic, physiological and clinical information or variables. The action of converting this information into features useful for a computational method is referred to herein as “featurization”.
[0065] The term “input features” (or “features”), as used herein, generally refers to variables that are converted to a form, often a numeric form, used by the trained algorithm (e.g., model or classifier) to predict an output classification (label) of a sample, e.g., a condition or cancer risk category. Feature values of the variables may be determined for an individual and used to determine a classification.
[0066] In various examples, features obtained from demographic, physiological and clinical variables are obtained from historical medical records of an individual. The present methods incorporate a historical view including aspects such as recency, trends, and sequence features which are believed to improve the predictive value of the resulting models.
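As a minimal sketch of featurization, the following Python example converts a hypothetical patient record into a numeric feature vector. The field names (age, sex, smoking, rectal_bleeding, ferritin), the encodings, and the featurize helper are illustrative assumptions and are not taken from the disclosure.

```python
# Illustrative only: field names and encodings are assumptions, not the
# featurization used in the disclosure.
FEATURE_ORDER = ["age", "sex_male", "smoker", "rectal_bleeding", "ferritin"]


def featurize(record):
    """Convert a dict of raw variables into a numeric feature vector."""
    return [
        float(record["age"]),                       # demographic variable
        1.0 if record["sex"] == "male" else 0.0,    # demographic, binary-encoded
        1.0 if record["smoking"] else 0.0,          # lifestyle variable
        1.0 if record["rectal_bleeding"] else 0.0,  # symptomatic variable
        float(record["ferritin"]),                  # biochemical variable
    ]


example_record = {
    "age": 62,
    "sex": "male",
    "smoking": False,
    "rectal_bleeding": True,
    "ferritin": 18.5,
}
example_vector = featurize(example_record)
print(dict(zip(FEATURE_ORDER, example_vector)))
```

Keeping a fixed feature order lets the same vector layout be used both when training a model and when scoring a new individual.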
[0067] Each observation included the entire patient history to that date. By contrast to other methods which trained a model on one observation per patient at one time point, the training data set used in the present methods includes multiple observations per patient across multiple time points.
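One way to picture multiple observations per patient is sketched below in Python. Each observation carries the full history available up to its date. The event tuples, the observation dates, and the build_observations helper are hypothetical and serve only to illustrate the idea.

```python
from datetime import date


def build_observations(events, observation_dates):
    """Return one observation per date, each holding all events up to that date."""
    ordered = sorted(events, key=lambda e: e[0])
    observations = []
    for obs_date in sorted(observation_dates):
        # The history for an observation includes everything recorded so far.
        history = [e for e in ordered if e[0] <= obs_date]
        observations.append({"as_of": obs_date, "history": history})
    return observations


# Hypothetical patient history: (date, variable, value)
events = [
    (date(2019, 3, 1), "hemoglobin", 14.1),
    (date(2020, 3, 1), "hemoglobin", 13.2),
    (date(2021, 3, 1), "hemoglobin", 12.0),
]
obs = build_observations(
    events, [date(2019, 6, 1), date(2020, 6, 1), date(2021, 6, 1)]
)
for o in obs:
    print(o["as_of"], len(o["history"]))
```

A single patient therefore contributes several training rows, and later rows see a longer history than earlier ones.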
[0068] In certain embodiments, feature data is collected longitudinally to provide time series of feature values. In certain embodiments, the time series of feature values is weighted according to elapsed time to provide higher significance of recent measurements.
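The disclosure does not specify a weighting scheme. As one possible illustration, the Python sketch below applies exponential decay so that recent measurements count more than older ones. The half-life of 365 days and the recency_weighted_mean helper are assumptions chosen for the example.

```python
def recency_weighted_mean(series, half_life_days=365.0):
    """Weighted mean of (days_before_observation, value) pairs.

    A measurement's weight halves for every half_life_days of elapsed time
    (an assumed scheme, used here only for illustration).
    """
    weights = [0.5 ** (days / half_life_days) for days, _ in series]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, series)) / total


# Hypothetical hemoglobin values measured 2 years ago, 1 year ago, and today.
series = [(730, 14.0), (365, 13.0), (0, 12.0)]
weighted = recency_weighted_mean(series)
plain = sum(v for _, v in series) / len(series)
print(round(weighted, 3), plain)
```

Because the most recent value is the lowest in this example, the recency-weighted mean falls below the unweighted mean.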
[0069] In one aspect, a colorectal cancer risk classifier is provided that is trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points.
[0070] In some embodiments, the model trains equally across all prior demographic and clinical features. While not to be bound by any mechanism, it is believed that such equal training prevents bias towards later detection.
[0071] Each of the plurality of demographic, physiological and clinical features comprises at least two features obtained from demographic, symptomatic, lifestyle, diagnosis or biochemical variables.
[0072] In certain embodiments, the demographic variables are selected from age, gender, weight, height, BMI, race, country, geographically determined data such as local air quality, limiting long-term illness, or Townsend deprivation index, and the like. The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0073] In certain embodiments, the set of features comprises an age of the target individual; wherein the at least one classifier is generated according to an analysis of the age of each of another plurality of sampled individuals.
[0074] In certain embodiments, the set of features comprises sex of the target individual; wherein the at least one classifier is generated according to an analysis of the sex of each of another plurality of sampled individuals.
[0075] In certain embodiments, the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity. The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0076] In certain embodiments, the lifestyle variables are selected from smoking and alcohol use, red meat consumption, and medications including progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0077] In certain embodiments, the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0078] In certain embodiments, the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and/or uric acid. The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0079] The present classifiers incorporate a historical view including aspects such as recency, trends, and sequence features which may improve the predictive value of models using the classifiers.
[0080] In certain embodiments, each variable includes the entire patient history to that date.
[0081] In certain embodiments, the training data set used in the present methods includes multiple variables per patient across multiple time points, in contrast to other methods which trained a model on one variable per patient at one time point.
[0082] In certain embodiments, feature data is collected longitudinally to provide time series of feature values.
[0083] In certain embodiments, the time series of feature values is weighted according to elapsed time to provide higher significance of recent measurements.
[0084] In certain embodiments, regularization of machine learning methods is performed to reduce the complexity of the model and prevent overfitting of data.
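The effect of regularization can be illustrated with a small Python sketch. It fits a logistic regression by gradient descent with an optional L2 penalty on the weights. The choice of L2 regularization, the toy data, and the train_logistic helper are assumptions for illustration, since the disclosure does not name a specific regularization method.

```python
import math


def train_logistic(X, y, l2=0.0, lr=0.1, epochs=500):
    """Fit logistic regression by gradient descent with an L2 weight penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            # The l2 * w[j] term shrinks weights toward zero (the penalty).
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b


def l2_norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(v * v for v in w))


# Toy, hypothetical data: two features, binary label.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [0, 0, 1, 1]
w_plain, _ = train_logistic(X, y, l2=0.0)
w_reg, _ = train_logistic(X, y, l2=1.0)
print(l2_norm(w_plain), l2_norm(w_reg))
```

The penalized fit ends with a smaller weight norm than the unpenalized fit, which is the sense in which regularization reduces model complexity.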
[0085] In certain embodiments, at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in the classification model.
[0086] In certain embodiments, between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in the classification model.
[0087] In one aspect, a colorectal cancer risk classifier is provided trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points.
[0088] In one aspect, a colorectal cancer risk classifier is provided trained on data that comprises a plurality of features based on two or more demographic, physiological and clinical variables, and 9 or less blood test features based on said plurality of current blood test results, each one of said 9 or less different blood test features is based on a blood test value of one of said plurality of current blood test results wherein at least one of the features is based on data variables obtained at 2 or more time points.
[0089] In certain embodiments, the plurality of blood test results comprises results of 9 or fewer of the following blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
[0090] In certain embodiments that include blood test results as features in the classification model, each of the plurality of historical and current blood test results comprises 9 or less of the following blood tests: white blood cell count— WBC (CBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; and neutrophils count; monocytes count, lymphocytes count; neutrophil-lymphocyte ratio (NLR).
[0091] In certain embodiments, the model does not require a CBC analysis of the individual.
[0092] The present classifiers incorporate a historical view including aspects such as recency, trends, and sequence features which are believed to improve the predictive value of models using the classifiers.
[0093] According to some embodiments of the present disclosure, there is provided a classifier useful in a computerized method of evaluating colorectal cancer risk. The classifier is generated by training a classification model on a data set comprising a plurality of features obtained from demographic, physiological and clinical variables of a target individual, to provide at least one classifier generated according to an analysis of a plurality of respective demographic, physiological and clinical features of each of another plurality of sampled individuals; and evaluating, using a processor, a colorectal cancer risk of the target individual by classifying the set of features using the at least one classifier, wherein at least one of the features is based on data variables obtained at 2 or more time points. Each of the plurality of demographic, physiological and clinical features comprises at least two features obtained from demographic, symptomatic, lifestyle, diagnosis or biochemical variables.
[0094] For a plurality of demographic, physiological and clinical variables, the present methods and systems identify feature sets to input into a trained algorithm (e.g., machine learning model or classifier). In certain embodiments, the system receives features of an individual and forms a feature vector from the measured values of the features. The system inputs the feature vector into the machine learning model and obtains an output classification of whether the individual has increased risk of cancer.
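The flow from feature vector to output classification can be sketched in Python as follows. A logistic scoring function stands in for the trained model. The weights, bias, threshold, and function names are hypothetical values chosen for illustration, not parameters from the disclosure.

```python
import math


def risk_score(feature_vector, weights, bias):
    """Map a feature vector to a score in (0, 1) with a logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, feature_vector))
    return 1.0 / (1.0 + math.exp(-z))


def classify(feature_vector, weights, bias, threshold=0.5):
    """Return an output classification by thresholding the risk score."""
    score = risk_score(feature_vector, weights, bias)
    return "increased risk" if score >= threshold else "no increased risk"


# Hypothetical trained parameters for a three-feature model.
weights = [0.8, 0.5, 1.2]
bias = -1.5

label_a = classify([1.0, 0.0, 1.0], weights, bias)  # z = 0.5
label_b = classify([0.0, 1.0, 0.0], weights, bias)  # z = -1.0
print(label_a, "|", label_b)
```

In practice the threshold would be chosen to reach a target operating point, such as a required specificity, rather than fixed at 0.5.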
[0095] In some embodiments, the machine learning model outputs a classifier capable of distinguishing between two or more groups or classes of individuals or features in a population of individuals or features of the population. In some embodiments, the classifier is a trained machine learning classifier.
[0096] In some embodiments, the informative features obtained from the demographic, physiological and clinical variables are assayed using the methods and systems described to form a risk classification profile. Receiver-operating characteristic (ROC) curves may be generated by plotting the performance of a particular feature in distinguishing between at least two populations (e.g., individuals having high or low risk of colorectal cancer, or more than two levels of colorectal cancer risk). In some embodiments, the feature data across the entire population (e.g., the cases and controls) are sorted in ascending order based on the value of a single feature.
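As a sketch of how an ROC curve may be built from a single feature, the Python example below sorts the population by feature value, sweeps a threshold, and computes the area under the curve with the trapezoidal rule. The sort here is descending so that higher values are called positive first, which is an assumption about the direction of the feature. The data and function names are illustrative.

```python
def roc_points(values, labels):
    """Return (false positive rate, true positive rate) points for one feature."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort descending: the highest feature values are called positive first.
    pairs = sorted(zip(values, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Process tied feature values together so they yield a single point.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical feature values for two controls (0) and two cases (1).
values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
points = roc_points(values, labels)
print(points, auc(points))
```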
[0097] To include longitudinal analysis of variables in the present model, features included in the model may be obtained from variables measured at multiple time points.
[0098] In various embodiments, variables measured at 2 or more time points provide trend data that may be featurized and input to train a classification model. In various embodiments, demographic, symptomatic, lifestyle, diagnosis or biochemical variables are measured at different time points such as about 6 months, about 9 months, about 12 months, about 18 months or about 24 months to provide trend data that may be featurized and input to train a classification model.
[0099] In certain embodiments, the demographic variables measured at 2 or more time points to provide trend data are selected from age, gender, weight, height, BMI, race, country, geographically determined data such as local air quality, limiting long-term illness, or Townsend deprivation index, and the like. The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0100] In certain embodiments, the symptomatic variables measured at 2 or more time points to provide trend data are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
[0101] The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0102] In certain embodiments, the lifestyle variables measured at 2 or more time points to provide trend data are selected from (i) smoking and alcohol use, (ii) red meat consumption, and (iii) medications including progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD). The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
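One simple way to turn repeated measurements into a trend feature is a least-squares slope over time, sketched below in Python. The use of a slope, the units (value change per month), and the trend_slope helper are assumptions for illustration. The disclosure does not prescribe a particular trend statistic.

```python
def trend_slope(measurements):
    """Least-squares slope of (months_since_first, value) pairs, per month."""
    n = len(measurements)
    if n < 2:
        raise ValueError("a trend needs at least 2 time points")
    mean_t = sum(t for t, _ in measurements) / n
    mean_v = sum(v for _, v in measurements) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in measurements)
    den = sum((t - mean_t) ** 2 for t, _ in measurements)
    if den == 0:
        raise ValueError("time points must differ")
    return num / den


# Hypothetical hemoglobin values at 0, 12, and 24 months.
hgb = [(0, 14.0), (12, 13.0), (24, 12.0)]
slope = trend_slope(hgb)
print(slope)  # negative slope: the value is falling over time
```

The resulting slope can be appended to the feature vector alongside the most recent raw value.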
[0103] In certain embodiments, the diagnosis variables measured at 2 or more time points to provide trend data are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease. The variable may then be featurized for use in the computerized method or for use as an input to train a computational classification model.
[0104] In certain embodiments, the biochemical variables measured at 2 or more time points to provide trend data are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and/or uric acid.
[0105] In certain embodiments that may include one or more blood test results, a blood test result variable is measured at 2 or more time points to provide trend data and is selected from: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
[0106] In certain embodiments that may include one or more blood test results, a blood test result variable is measured at 2 or more time points to provide trend data and is selected from: white blood cell count, WBC (CBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; and neutrophils count; monocytes count, lymphocytes count; neutrophil-lymphocyte ratio (NLR).
[0107] A feature of the present classification models may be an increase in lead time or advance notice before colorectal cancer diagnosis to permit treatment intervention at a time point that is still actionable and may improve clinical outcomes including treatment efficacy and mortality. Lead time may correspond to the difference in time from when the cancer would have been detected by symptoms, in the absence of a risk model, to when the detection occurred in the presence of a screening program. This information may be obtained from medical record databases, featurized, and used to train a classification model.
[0108] Previous models may train on fewer features and on a fixed time point too close to a colorectal cancer diagnosis and therefore do not provide a measure of colorectal cancer risk early enough to appropriately intervene. The process of obtaining screening, scheduling physician appointments and confirmatory colonoscopies, and other treatment logistics often requires more than 6 months of lead time to accomplish. Individuals identified by risk models as having higher risk scores may subsequently need to navigate the treatment process, and so risk profiles with higher sensitivity at earlier lead times are desirable.
[0109] The present methods and classifiers may be trained at arbitrarily set time points, such as 1 year, 2 years, 3 years, or 4 years in advance to train across different time horizons.
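The lead time definition above can be computed directly from dates, as in the following Python sketch. The case dates and the lead_time_days helper are hypothetical. Each pair holds the date a model flagged the individual and the date the cancer would otherwise have been detected by symptoms.

```python
from datetime import date
from statistics import median


def lead_time_days(flag_date, symptomatic_detection_date):
    """Days gained: symptomatic detection date minus the earlier flag date."""
    return (symptomatic_detection_date - flag_date).days


# Hypothetical cases: (date flagged by the model, date detected by symptoms)
cases = [
    (date(2020, 1, 1), date(2021, 1, 1)),
    (date(2019, 6, 1), date(2021, 6, 1)),
    (date(2021, 3, 1), date(2021, 9, 1)),
]
lead_times = [lead_time_days(flagged, detected) for flagged, detected in cases]
median_lead_time = median(lead_times)
print(lead_times, median_lead_time)
```

The median is used here because it is the summary statistic the embodiments refer to when describing advance notice.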
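Training across different time horizons can be pictured as relabeling the same observations for each horizon. The Python sketch below marks an observation positive when a diagnosis falls within the horizon after the observation date. This labeling rule, the horizon lengths in days, and the label_for_horizon helper are assumptions for illustration, since the disclosure does not state how labels are derived.

```python
from datetime import date


def label_for_horizon(observation_date, diagnosis_date, horizon_days):
    """Return 1 if a diagnosis occurs within horizon_days after the observation."""
    if diagnosis_date is None:
        return 0  # never diagnosed in the available record
    delta = (diagnosis_date - observation_date).days
    return 1 if 0 < delta <= horizon_days else 0


# Hypothetical observation and diagnosis dates (517 days apart).
observation = date(2020, 1, 1)
diagnosis = date(2021, 6, 1)

labels_by_horizon = {
    horizon: label_for_horizon(observation, diagnosis, horizon)
    for horizon in (365, 730, 1095, 1460)  # roughly 1, 2, 3, and 4 years
}
print(labels_by_horizon)
```

The same observation is negative for the 1-year horizon and positive for the longer horizons, so a separate model or label column can be built per horizon.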
[0110] In certain embodiments, the advance notice or lead time provided by the classification model before a diagnosis can permit treatment intervention at a time point that is still actionable and may improve clinical outcomes including treatment efficacy and mortality.
[0111] In certain embodiments, the median advance notice or lead time is between 100 and 300 days before colorectal cancer diagnosis and/or is longer than other models.
[0112] In certain embodiments, the median advance notice or lead time provided by the classification model is at least 300 days, at least 400 days, at least 500 days, at least 600 days, at least 700 days, at least 800 days, at least 900 days, at least 1000 days, at least 1100 days, at least 1200 days, at least 1300 days, at least 1400 days, or at least 1500 days before colorectal cancer diagnosis.
[0113] In certain embodiments, the advance notice provided by the classification model is at least 600 days.
[0114] In certain embodiments, the advance notice provided by the classification model is at least 1000 days.
[0115] In some embodiments, the system comprises a classification circuit that is configured as a machine learning classifier selected from the group consisting of a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, K nearest neighbor (KNN), a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and principal component analysis classifier.
[0116] In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 25%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 30%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 35%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 40%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 50%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 60%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 70%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 80%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 90%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a sensitivity of at least about 95%.
[0117] In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 5%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 10%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 15%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 20%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 25%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 30%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 40%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 50%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 60%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 70%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 80%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 90%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 95%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a positive predictive value (PPV) of at least about 99%.
[0118] In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 40%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 50%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 60%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 70%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 80%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 90%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 95%. In some embodiments, the classification model is indicative of an elevated risk of colorectal cancer at a negative predictive value (NPV) of at least about 99%.
[0119] In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.50. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.60. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.70. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.80. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.90. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.95. In some embodiments, the trained algorithm determines an elevated risk of the colorectal cancer of the subject with an Area Under Curve (AUC) of at least about 0.99.
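The performance measures named in the preceding paragraphs follow from the counts of true and false positives and negatives. The Python sketch below computes sensitivity, specificity, PPV, and NPV from such counts. The counts are hypothetical and the screening_metrics helper is illustrative only.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # cases correctly flagged
        "specificity": tn / (tn + fp),  # non-cases correctly not flagged
        "ppv": tp / (tp + fp),          # flagged individuals who are cases
        "npv": tn / (tn + fn),          # unflagged individuals who are non-cases
    }


# Hypothetical cohort of 10,000 individuals with 100 cases.
metrics = screening_metrics(tp=50, fp=50, tn=9850, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

PPV and NPV depend on how common the disease is in the evaluated population, so the same sensitivity and specificity can yield different predictive values in different cohorts.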
A. Classifier Data Analysis
[0120] In some examples, the present disclosure provides a system, method, or kit having data analysis realized in software application, computing hardware, or both. In various examples, the analysis application or system comprises at least a data receiving module, a data pre-processing module, a data analysis module (which can operate on one or more types of featurized data), a data interpretation module, or a data visualization module. In some embodiments, the data preprocessing module can comprise hardware systems or computer software that performs operations on the data in preparation for analysis. Examples of operations that may be applied to the data in the pre-processing module include affine transformations, denoising operations, data cleaning, reformatting, or subsampling. A data analysis module can perform probabilistic and statistical analysis to identify abnormal patterns related to a disease, pathology, state, risk, condition, or phenotype. A data interpretation module can use analysis methods, for example, drawn from statistics, mathematics, or biology, to support understanding of the relation between the identified abnormal patterns and health conditions, functional states, prognoses, or risks. A data visualization module can use methods of mathematical modeling, computer graphics, or rendering to create visual representations of data that can facilitate the understanding or interpretation of results. In various examples, machine learning methods are applied to distinguish samples in a population of samples. In some embodiments, machine learning methods are applied to distinguish samples between individuals at different levels of risk for colorectal cancer.
[0121] In some embodiments, the one or more machine learning operations used to train the prediction engine include one or more of: a generalized linear model, a generalized additive model, a non-parametric regression operation, a random forest classifier, a spatial regression operation, a Bayesian regression model, a time series analysis, a Bayesian network, a Gaussian network, a decision tree learning operation, an artificial neural network, a recurrent neural network, a convolutional neural network, a reinforcement learning operation, linear or nonlinear regression operations, a support vector machine, a clustering operation, and a genetic algorithm operation.
[0122] In various examples, computer processing methods are selected from logistic regression, multiple linear regression (MLR), dimension reduction, partial least squares (PLS) regression, principal component regression, autoencoders, variational autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, decision tree, classification and regression trees (CART), tree-based methods, random forest, gradient boost tree, matrix factorization, multidimensional scaling (MDS), dimensionality reduction methods, t-distributed stochastic neighbor embedding (t-SNE), multilayer perceptron (MLP), network clustering, neuro-fuzzy, and artificial neural networks.
B. Classifier Generation
[0123] In an aspect, the disclosed systems and methods provide a classifier generated based on feature information derived from demographic, physiological and clinical variables. The classifier forms part of a predictive engine for distinguishing groups in a population based on risk features identified in demographic, physiological and clinical variables.
[0124] In some embodiments, a classifier is created by normalizing the features by formatting features into a unified format and a unified scale; storing the normalized feature information in a columnar database; training a prediction engine by applying one or more machine learning operations to the stored normalized feature information, the prediction engine mapping, for a particular population, a combination of one or more features; applying the prediction engine to the accessed field information to identify an individual associated with a risk group; and classifying the individual into a risk group.
[0125] In some embodiments, a hierarchy is created by normalizing the feature information by formatting similar portions of the feature information into a unified format and a unified scale; storing the normalized feature information in a columnar database; training a prediction engine by applying one or more machine learning operations to the stored normalized feature information, the prediction engine mapping, for a particular population, a combination of one or more features; applying the prediction engine to the accessed field information to identify an individual associated with a risk group; and classifying the individual into a risk group.
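By way of non-limiting illustration only, the normalize, train, and classify steps described above may be sketched as follows. The sketch assumes min-max scaling as the "unified scale" and a logistic regression fitted by gradient descent as the machine learning operation; neither choice is stated to be the method of the disclosure, and the columnar database step is omitted. All function names, the two example features (age and hemoglobin), the group names, and the toy values are hypothetical.

```python
import math


def fit_min_max(rows):
    """Record the (min, max) of each feature column from training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(row, bounds):
    """Scale one row to the unified [0, 1] scale learned from training data."""
    return [0.0 if hi == lo else (x - lo) / (hi - lo)
            for x, (lo, hi) in zip(row, bounds)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def classify(row, bounds, w, b, threshold=0.5):
    """Return a hypothetical risk-group name and the predicted probability."""
    x = apply_min_max(row, bounds)
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return ("elevated" if p >= threshold else "average"), p


# Toy training rows: [age in years, hemoglobin in g/dL]; illustrative only.
train_rows = [[50, 14.5], [55, 14.0], [60, 13.8],
              [70, 11.0], [75, 10.5], [80, 10.0]]
train_labels = [0, 0, 0, 1, 1, 1]

bounds = fit_min_max(train_rows)
X = [apply_min_max(r, bounds) for r in train_rows]
weights, bias = train_logistic(X, train_labels)

group_a, prob_a = classify([78, 10.2], bounds, weights, bias)
group_b, prob_b = classify([52, 14.3], bounds, weights, bias)
```

The scaling bounds are learned from the training rows only and then reused for each new individual, so that training and classification share the same scale.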
[0126] Specificity, as used herein, generally refers to “the probability of a negative test among those who are at a low risk of colorectal cancer”. It may be calculated by the number of low-risk persons who tested negative divided by the total number of low-risk individuals.
[0127] In various examples, the model, classifier, or predictive test has a specificity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
[0128] Sensitivity, as used herein, generally refers to “the probability of a positive test among high-risk persons”. It may be calculated by the number of high-risk individuals who tested positive divided by the total number of high-risk individuals.
[0129] In various examples, the model, classifier, or predictive test has a sensitivity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
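By way of non-limiting illustration only, sensitivity and specificity as calculated above, together with the positive and negative predictive values, may be derived from the four counts of a two-by-two table. In the following minimal sketch, the function name `screening_metrics`, the 0/1 encoding (1 meaning high-risk or test-positive), and the toy inputs are assumptions made for this example.

```python
def screening_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV and NPV from 0/1 labels.

    y_true: 1 for high-risk individuals, 0 for low-risk (hypothetical encoding).
    y_pred: 1 for a positive test, 0 for a negative test.
    A metric whose denominator is zero is reported as NaN.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # high-risk, positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # high-risk, negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # low-risk, positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # low-risk, negative

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # positives among high-risk
        "specificity": ratio(tn, tn + fp),  # negatives among low-risk
        "ppv": ratio(tp, tp + fp),          # high-risk among positives
        "npv": ratio(tn, tn + fn),          # low-risk among negatives
    }


# Toy labels for illustration only (not data from the disclosure).
metrics = screening_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                            [1, 1, 0, 1, 0, 0, 0, 0])
```

Sensitivity and specificity are properties of the test alone, whereas PPV and NPV also depend on how common high-risk individuals are in the population being tested.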
C. Digital processing device
[0130] In some examples, the subject matter described herein can include a digital processing device or use of the same. In some examples, the digital processing device can include one or more hardware central processing units (CPU), graphics processing units (GPU), or tensor processing units (TPU) that carry out the device’s functions. In some examples, the digital processing device can include an operating system configured to perform executable instructions.
[0131] In some examples, the digital processing device can optionally be connected to a computer network. In some examples, the digital processing device may be optionally connected to the Internet. In some examples, the digital processing device may be optionally connected to a cloud computing infrastructure. In some examples, the digital processing device may be optionally connected to an intranet. In some examples, the digital processing device may be optionally connected to a data storage device.
[0132] Non-limiting examples of suitable digital processing devices include server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, handheld computers, Internet appliances, mobile smartphones, and tablet computers. Suitable tablet computers can include, for example, those with booklet, slate, and convertible configurations.
[0133] In some examples, the digital processing device can include an operating system configured to perform executable instructions. For example, the operating system can include software, including programs and data, which manages the device’s hardware and provides services for execution of applications. Non-limiting examples of operating systems include Ubuntu, FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Non-limiting examples of suitable personal computer operating systems include Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
[0134] In some examples, the operating system may be provided by cloud computing, and cloud computing resources may be provided by one or more service providers.
[0135] In some examples, the device can include a storage and/or memory device. The storage and/or memory device may be one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In some examples, the device may be volatile memory and require power to maintain stored information. In some examples, the device may be non-volatile memory and retain stored information when the digital processing device is not powered. In some examples, the non-volatile memory can include flash memory. In some examples, the volatile memory can include dynamic random-access memory (DRAM). In some examples, the non-volatile memory can include ferroelectric random access memory (FRAM). In some examples, the non-volatile memory can include phase-change random access memory (PRAM).
[0136] In some examples, the device may be a storage device including, for example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage. In some examples, the storage and/or memory device may be a combination of devices such as those disclosed herein. In some examples, the digital processing device can include a display to send visual information to a user. In some examples, the display may be a cathode ray tube (CRT). In some examples, the display may be a liquid crystal display (LCD). In some examples, the display may be a thin film transistor liquid crystal display (TFT-LCD). In some examples, the display may be an organic light emitting diode (OLED) display. In some examples, an OLED display may be a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some examples, the display may be a plasma display. In some examples, the display may be a video projector. In some examples, the display may be a combination of devices such as those disclosed herein.
D. Non-transitory Storage Media
[0137] In some examples, the subject matter disclosed herein can include one or more non-transitory computer-readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device. In some examples, a computer-readable storage medium may be a tangible component of a digital processing device. In some examples, a computer-readable storage medium may be optionally removable from a digital processing device. In some examples, a computer-readable storage medium can include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In some examples, the program and instructions may be permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
E. Computer systems
[0138] The present disclosure provides computer systems that are programmed to implement methods described herein. FIG. 1 shows a computer system 101 that is programmed or otherwise configured to store, process, identify, or interpret demographic, physiological and clinical variables. The computer system 101 can process various aspects of patient demographic, physiological and clinical variables of the present disclosure (FIG. 1). The computer system 101 may be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device may be a mobile electronic device.
[0139] The computer system 101 comprises a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which may be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 101 also comprises memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and/or electronic display adapters. The memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard. The storage unit 115 may be a data storage unit (or data repository) for storing data. The computer system 101 may be operatively coupled to a computer network (“network”) 130 with the aid of the communication interface 120. The network 130 may be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 130 in some examples is a telecommunication and/or data network. The network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 130, in some examples with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.
[0140] The CPU 105 can execute a sequence of machine-readable instructions, which may be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 110. The instructions may be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.
[0141] The CPU 105 may be part of a circuit, such as an integrated circuit. One or more other components of the system 101 may be included in the circuit. In some examples, the circuit is an application specific integrated circuit (ASIC).
[0142] The storage unit 115 can store files, such as drivers, libraries and saved programs. The storage unit 115 can store user data, e.g., user preferences and user programs. The computer system 101 in some examples can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.
[0143] The computer system 101 can communicate with one or more remote computer systems through the network 130. For instance, the computer system 101 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 101 via the network 130.
[0144] Methods as described herein may be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115. The machine-executable or machine-readable code may be provided in the form of software. During use, the code may be executed by the processor 105. In some examples, the code may be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some examples, the electronic storage unit 115 may be precluded, and machine-executable instructions are stored on memory 110.
[0145] The code may be pre-compiled and configured for use with a machine having a processor adapted to execute the code or may be interpreted or compiled during runtime. The code may be supplied in a programming language that may be selected to enable the code to execute in a precompiled, interpreted, or as-compiled fashion.
[0146] Aspects of the systems and methods provided herein, such as the computer system 101, may be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code may be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements comprises optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
[0147] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[0148] The computer system 101 can include or be in communication with an electronic display 135 that comprises a user interface (UI) 140 for providing, for example, demographic, physiological and clinical variables or features. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
[0149] Methods and systems of the present disclosure may be implemented by way of one or more algorithms. An algorithm may be implemented by way of software upon execution by the central processing unit 105. The algorithm can, for example, store, process, identify, or interpret patient demographic, physiological and clinical variables.
[0150] In some examples, the subject matter disclosed herein can include at least one computer program or use of the same. A computer program can include a sequence of instructions, executable in the digital processing device’s CPU, GPU, or TPU, written to perform a specified task. Computer-readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, a computer program may be written in various versions of various languages. The functionality of the computer-readable instructions may be combined or distributed as desired in various environments.
[0151] In some examples, a computer program can include one sequence of instructions. In some examples, a computer program can include a plurality of sequences of instructions. In some examples, a computer program may be provided from one location. In some examples, a computer program may be provided from a plurality of locations. In some examples, a computer program can include one or more software modules. In some examples, a computer program can include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or addons, or combinations thereof. In some examples, the computer processing may be a method of statistics, mathematics, biology, or any combination thereof. In some examples, the computer processing method comprises a dimension reduction method including, for example, logistic regression, dimension reduction, principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, tree-based methods, random forest, gradient boost tree, matrix factorization, network clustering, and neural network such as convolutional neural networks. In some examples, the computer processing method is a supervised machine learning method including, for example, a regression, support vector machine, tree-based method, and network. In some examples, the computer processing method is an unsupervised machine learning method including, for example, clustering, network, principal component analysis, and matrix factorization.
F. Databases
[0152] In some examples, the subject matter disclosed herein can include one or more databases, or use of the same to store patient data, demographic, physiological and clinical variables.
[0153] In view of the disclosure provided herein, many databases may be suitable for storage and retrieval of the demographic, physiological and clinical information. In some examples, suitable databases include electronic medical record (EMR) or electronic health record (EHR) databases. In some examples, suitable databases can include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. In some examples, a database may be internet-based. In some examples, a database may be web-based. In some examples, a database may be cloud computing-based. In some examples, a database may be based on one or more local computer storage devices.
[0154] In some examples, the database is one or more medical record database(s) and/or connected to a medical record database interface. The database(s) may include a plurality of individual records, also referred to as a plurality of individual samples, which describe, for each of a plurality of sampled individuals, one or more sets of a plurality of historical test results, each set belonging to a respective individual, and optionally one or more demographic, physiological, or clinical variables. The set of a plurality of variables may be stored in a common sample record and/or gathered from a number of independent and/or connected databases.
[0155] In an aspect, the present disclosure provides a non-transitory computer-readable medium comprising instructions that direct a processor to carry out a method disclosed herein.
[0156] In an aspect, the present disclosure provides a computing device comprising the computer-readable medium.
II. METHODS AND SYSTEMS OF USE
[0157] The disclosed methods are directed to classification models used to assess risk of colorectal cancer and stratify patient populations based on the classification. Such classification and stratification are useful to prioritize individuals in a population for targeted colorectal cancer screening and earlier diagnosis and intervention.
[0158] The present methods may incorporate a historical view including aspects such as recency, trends, and sequence features which may improve the predictive value of resulting models.
[0159] Each observation may include the entire patient history to that date. By contrast to other methods which trained a model on one observation per patient at one time point, the training data set used in the present methods may include multiple observations per patient across multiple time points.
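By way of non-limiting illustration only, the construction of multiple observations per patient, each carrying the patient history up to its own date, may be sketched as follows. The function name `build_observations`, the event tuple layout `(date, variable, value)`, and the toy records are assumptions made for this example and are not taken from the disclosure.

```python
from datetime import date


def build_observations(patient_id, events, observation_dates):
    """Create one training observation per observation date.

    events: list of (date, variable, value) tuples (hypothetical layout).
    Each observation holds every event dated on or before its own date,
    so later observations contain the earlier ones plus newer history.
    """
    ordered = sorted(events, key=lambda e: e[0])
    observations = []
    for as_of in sorted(observation_dates):
        history = [e for e in ordered if e[0] <= as_of]
        observations.append(
            {"patient_id": patient_id, "as_of": as_of, "history": history}
        )
    return observations


# Toy records for illustration only (not data from the disclosure).
toy_events = [
    (date(2019, 3, 1), "HGB", 14.2),
    (date(2020, 6, 15), "HGB", 13.1),
    (date(2021, 9, 30), "HGB", 11.8),
]
toy_observations = build_observations(
    "patient-001", toy_events, [date(2019, 12, 31), date(2021, 12, 31)]
)
```

Limiting each observation to events dated on or before its own date avoids leaking later measurements into an earlier time point.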
[0160] In certain embodiments, feature data is collected longitudinally to provide time series of feature values. In certain embodiments, the time series of feature values is weighted according to elapsed time to provide higher significance of recent measurements.
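By way of non-limiting illustration only, one possible way to give recent measurements higher significance is an exponentially decaying weight. The disclosure does not specify the weighting function; the half-life form below, the function name `recency_weighted_mean`, the default half-life of 365 days, and the toy values are all assumptions made for this example.

```python
def recency_weighted_mean(series, as_of_day, half_life_days=365.0):
    """Average a time series with weights that halve every half_life_days.

    series: list of (day_number, value) pairs; day numbers are days on any
            common timeline (hypothetical representation).
    Measurements dated after as_of_day are ignored.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for day, value in series:
        if day > as_of_day:
            continue  # do not use measurements from the future
        weight = 0.5 ** ((as_of_day - day) / half_life_days)
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no measurements on or before as_of_day")
    return weighted_sum / weight_total


# Toy hemoglobin values for illustration only: one year old, and current.
toy_series = [(0, 10.0), (365, 14.0)]
toy_feature = recency_weighted_mean(toy_series, as_of_day=365)
```

With these toy inputs the older value receives weight 0.5 and the current value weight 1, so the result lies closer to the current value than the unweighted mean of 12.0 does.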
[0161] In certain embodiments, training data (and therefore training features) are collected at many points in a patient’s health history and not limited to single data points collected shortly before a patient’s CRC diagnosis. This approach may enable the present methods and machine learning models to learn and identify features indicative of CRC risk at earlier times.
[0162] In certain embodiments, observations and data are excluded if they occurred at a time point after a patient was referred to palliative or hospice care, after a colectomy, before the patient was 45, or after the patient turned 86, because screening is not recommended in those circumstances (thus the risk is not clinically relevant).
[0163] In one aspect, a method is provided of generating a risk profile for cancer comprising: providing a classifier capable of generating a risk profile for cancer in an individual wherein a colorectal cancer risk classifier is trained on data that comprises a plurality of features based on demographic, physiological and clinical variables from a target individual, wherein the classifier is generated according to an analysis of a plurality of respective demographic, physiological and clinical features of a plurality of sampled individuals, wherein at least one of the features is based on data variables obtained at 2 or more time points.
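By way of non-limiting illustration only, the exclusion criteria above may be applied as a filter over observations. The sketch below assumes each observation carries a day number and the patient's age in years at that time, and that "after the patient turned 86" means an age of 86 or more. The field names `day`, `age_years`, `palliative_or_hospice_day`, and `colectomy_day` are hypothetical.

```python
def is_eligible(obs, patient, min_age=45, max_age=86):
    """Return True if an observation passes the exclusion criteria.

    obs: {"day": int, "age_years": float} (hypothetical fields).
    patient: {"palliative_or_hospice_day": int or None,
              "colectomy_day": int or None} (hypothetical fields).
    """
    age = obs["age_years"]
    if age < min_age or age >= max_age:
        return False  # outside the assumed 45 to 85 inclusive age window
    for key in ("palliative_or_hospice_day", "colectomy_day"):
        cutoff = patient.get(key)
        if cutoff is not None and obs["day"] > cutoff:
            return False  # observation falls after the excluding event
    return True


def filter_observations(observations, patient):
    """Keep only the observations that pass is_eligible."""
    return [o for o in observations if is_eligible(o, patient)]


# Toy records for illustration only (not data from the disclosure).
toy_patient = {"palliative_or_hospice_day": None, "colectomy_day": 500}
toy_obs = [
    {"day": 100, "age_years": 44.5},  # too young
    {"day": 300, "age_years": 50.0},  # kept
    {"day": 600, "age_years": 51.0},  # after colectomy
]
kept = filter_observations(toy_obs, toy_patient)
```

Whether the boundary ages and the day of the excluding event itself are included is a design choice that the text leaves open; the comparisons above reflect one reading.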
[0164] In one aspect, a method is provided for classification of cancer risk in an individual comprising: providing a colorectal cancer risk classifier trained on data that comprises a plurality of features based on two or more demographic, physiological and clinical variables, and 9 or fewer blood test features based on said plurality of current blood test results, wherein each one of said 9 or fewer different blood test features is based on a blood test value of one of said plurality of current blood test results, and wherein at least one of the features is based on data variables obtained at 2 or more time points.
[0165] In another aspect, the present disclosure provides a system for performing classifications of individuals based on cancer risk comprising: a) a receiver to receive a plurality of training individuals, each of the plurality of training individuals having a plurality of demographic, physiological and clinical variables, wherein each of the plurality of training individuals comprises one or more known labels, b) a feature module to identify a set of features based on the variables corresponding to an individual that are operable to be input to the machine learning model for each of the plurality of individuals, wherein the set of features corresponds to variables of the plurality of training individuals, wherein for each of the plurality of training individuals, the system is operable to subject a plurality of variables of the individual to a plurality of different assays to obtain sets of measured values, wherein each set of measured values is from one variable in the individual, wherein a plurality of sets of measured values are obtained for the plurality of individuals, c) an analysis module to analyze the sets of measured values to obtain a training vector for the individual, wherein the training vector comprises feature values of the N sets of features of the corresponding variable, each feature value corresponding to a feature and including one or more measured values, wherein the training vector is formed using at least one feature from at least two of the N sets of features corresponding to a first subset of the plurality of different variables, d) a labeling module to inform the system on the training vectors using parameters of the machine learning model to obtain output labels for the plurality of individuals, e) a comparator module to compare the output labels to the known labels of the individual, f) a training module to iteratively search for optimal values of the parameters as part of training the machine learning model based on comparing the output labels to the known labels of the individual, and g) an output module to provide the parameters of the machine learning model and the set of features for the machine learning model.
[0166] In one aspect, a system is provided for generating a risk profile comprising: a) a receiver to receive a plurality of training individuals, each of the plurality of training individuals having a plurality of demographic, physiological and clinical variables, wherein each of the plurality of training individuals comprises one or more known labels, b) a feature module to identify a set of features based on the variables corresponding to an individual that are operable to be input to the machine learning model for each of the plurality of individuals, wherein the set of features corresponds to variables of the plurality of training individuals, wherein for each of the plurality of training individuals, the system is operable to subject a plurality of variables of the individual to a plurality of different assays to obtain sets of measured values, wherein each set of measured values is from one variable in the individual, wherein a plurality of sets of measured values are obtained for the plurality of individuals, c) an analysis module to analyze the sets of measured values to obtain a training vector for the individual, wherein the training vector comprises feature values of the N sets of features of the corresponding variable, each feature value corresponding to a feature and including one or more measured values, wherein the training vector is formed using at least one feature from at least two of the N sets of features corresponding to a first subset of the plurality of different variables, d) a labeling module to inform the system on the training vectors using parameters of the machine learning model to obtain output labels for the plurality of individuals, e) a comparator module to compare the output labels to the known labels of the individual, f) a training module to iteratively search for optimal values of the parameters as part of training the machine learning model based on comparing the output labels to the known labels of the individual, and g) an output module to provide the parameters of the machine learning model and the set of features for the machine learning model.
[0167] In some embodiments, the risk profile identifies individuals at risk of colorectal cancer.
[0168] In some embodiments, the risk profile stratifies a population of individuals for cancer risk.
[0169] In some embodiments, the risk profile is used to provide treatment recommendations for the individual based on the risk profile.
[0170] According to some embodiments of the present disclosure, there are provided methods and systems of evaluating colorectal cancer risk by classifying a set of current blood test results of a target individual using one or more classifiers which are generated according to an analysis of historical blood test results of a plurality of individuals.
[0171] In some embodiments, the set of current blood test results includes 9 or less of the following blood tests: hemoglobin (HGB), hematocrit (HCT), and red blood cells (RBC); at least one result of the following blood tests: mean cell hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC); and the age of the target individual.
[0172] In some embodiments, the set of current blood test results further includes nine or less of the following blood tests: white blood cell count— WBC (CBC); mean platelet volume (MPV); mean cell volume (MCV); red cell distribution width (RDW); platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; neutrophils count; monocytes count; and platelets hematocrit (PCT).
[0173] In some embodiments, the colorectal cancer risk is evaluated by classifying biochemical blood test results of the target individual. In such embodiments, the classifiers are generated according to an analysis of historical biochemical blood test results of the plurality of individuals. The biochemical blood test results may include results of any of the following blood tests: Albumin, Calcium, Chloride, Cholesterol, Creatinine, high density lipoprotein (HDL), low density lipoprotein (LDL), Potassium, Sodium, Triglycerides, Urea, and/or Uric Acid.
[0174] In some embodiments, the colorectal cancer risk is evaluated by classifying demographic characteristics of the target individual. In such embodiments, the classifiers are generated according to an analysis of demographic characteristics of the plurality of individuals.
[0175] In some embodiments, both the demographic, physiological and clinical variables of the target individual and the demographic, physiological and clinical variables of sampled individuals are used for generating expanded sets of features which include manipulated and/or weighted values. Optionally, each expanded set of features is based on the demographic characteristics of a respective individual, for example as described below.
[0176] In some embodiments, the one or more classifiers are adapted to one or more demographic characteristics of the target individual. Optionally, the classifiers are selected to match one or more demographic characteristics of the target individual. In such embodiments, different classifiers may be used for women and men.
[0177] According to some embodiments of the present disclosure, there are provided methods and systems of generating one or more classifiers for colorectal cancer risk evaluation. The methods and systems are based on analysis of a plurality of demographic, physiological or clinical variables of each of a plurality of sampled individuals and on generating accordingly a dataset having a plurality of sets of features, each generated according to respective demographic, physiological or clinical variables. The dataset is then used to generate and output one or more classifiers, such as a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, a K nearest neighbor (KNN) classifier, a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and a principal component analysis classifier. The classifiers may be provided as modules for execution on client terminals or used as an online service for evaluating colorectal cancer risk of target individuals based on their current blood test results.
[0178] As described above, each sample record includes one or more sets of a plurality of historical test results of an individual, each including a combination of blood test results, for example a combination of 9 or less and/or any intermediate number of blood test results. In one example, each extracted set of unprocessed features includes 9 or less of the following blood test results: red blood cells (RBC); white blood cell count, WBC (CBC); mean platelet volume (MPV); hemoglobin (HGB); hematocrit (HCT); mean cell volume (MCV); mean cell hemoglobin (MCH); mean corpuscular hemoglobin concentration (MCHC); red cell distribution width (RDW); platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; neutrophils count; monocytes count; and platelets hematocrit (PCT). In another example, each extracted set of unprocessed features includes at least one result of the following blood tests: HGB, HCT, and RBC; at least one result of the following blood tests: MCH and MCHC; and additional data reflecting the age of the target individual.
[0179] In some embodiments, this extracted set of unprocessed features further includes one or more of the following blood tests: RDW, Platelets, and MCV. Additionally, this extracted set of unprocessed features may further include one or more of the following blood tests: WBC, eosinophils count, neutrophils percentage and/or count, basophils percentage and/or count, and monocytes percentage and/or count.
[0180] As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
[0181] The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
[0182] Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
[0183] It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
[0184] Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
[0185] All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.
EXAMPLE
[0186] Example 1: Modeling Prognostic Colorectal Cancer Risk Using Dataset 1
Data and initial preparation
[0187] The models described herein were trained and tested using Dataset 1, a de-identified compilation of several decades of records from over twenty million patients in the UK, collected at over 800 primary care practices across England, Scotland, Wales and Northern Ireland. Dataset 1 incorporates data which was used for the validation of the ColonFlag Model published in Kinar et al. (2016). The dataset was cleaned, curated, and transformed into a format amenable to ML methods by combining custom techniques with those described in certain publications: Boursi et al. (2016), Lewis et al. (2015), Khan (2010), Barnett (2012), Cassell (2018), Payne (2020).
Training and validation of Model A
[0188] Model A was designed to help health systems and health plans improve patient outcomes in the context of existing CRC screening programs. To that end, the model uses a 24-month predictive window (in order to identify cancers early enough to improve outcomes), an inclusive population between the ages of 45 and 85 (consistent with current clinical and coverage guidelines), and a broad set of features (in order to improve performance). Further, the data used included data collected at many points in a patient’s health history, and the training set was not limited to a single data point collected shortly before a patient’s CRC diagnosis, as it was in Kinar et al. (2016), thereby enabling the ML model to learn features indicative of CRC risk at earlier times.
[0189] The model was trained on a set of observations corresponding to the dates in a patient’s record on which CRC risk can be calculated, is relevant to clinical care, and can be updated based on new information. Dates before January 2011 were excluded from the observation set because several UK initiatives impacting data quality were not complete until that date. Observations were also excluded if they occurred after a patient was referred to palliative or hospice care; after a colectomy; before the patient was 45; or after the patient turned 86, because screening is not recommended in those circumstances (thus the risk is not clinically relevant). Each observation included the entire patient history to that date. In contrast to ColonFlag (Medial Research Ltd), which was trained on one observation per patient, the present training set often included multiple observations per patient.
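The exclusion rules for observations can be expressed as a simple predicate. The following is a minimal sketch, assuming each observation carries its date and that the patient's birth date and any palliative-referral or colectomy dates are known; the function and parameter names are hypothetical and not taken from the disclosure.

```python
from datetime import date

# Observations before January 2011 are excluded (data-quality cutoff in the text).
STUDY_START = date(2011, 1, 1)

def age_on(birth_date, on_date):
    """Age in whole years on a given date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def is_eligible_observation(obs_date, birth_date, palliative_date=None, colectomy_date=None):
    """Return True if the observation date passes every exclusion rule."""
    if obs_date < STUDY_START:
        return False
    if palliative_date is not None and obs_date > palliative_date:
        return False
    if colectomy_date is not None and obs_date > colectomy_date:
        return False
    # Keep ages 45 through 85 inclusive (excluded before 45 and after turning 86).
    return 45 <= age_on(birth_date, obs_date) < 86
```

Because each patient can contribute many eligible dates, applying this predicate to a full record yields multiple observations per patient, as the paragraph above describes.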
[0190] In order to test the geographic transferability of the model, and to avoid the overestimates of performance that can arise when the validation set is a random subset, the data from sites in England, Wales, and Scotland were used for training and the data from sites in Northern Ireland (comprising 7% of the observations) were used as the holdout evaluation set. The four countries of the United Kingdom operate independent national healthcare systems (although each is known locally as “the National Health Service”) and Northern Ireland is the most distinct geographically.
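A geographic holdout differs from a random split in that every observation from the held-out region is withheld from training. A minimal sketch follows, assuming each observation is a dictionary with a hypothetical "region" key.

```python
def split_by_region(observations, holdout_region="Northern Ireland"):
    """Send all observations from one region to the holdout set, the rest to training."""
    train = [o for o in observations if o["region"] != holdout_region]
    holdout = [o for o in observations if o["region"] == holdout_region]
    return train, holdout

# Hypothetical observations; only the region field matters for the split.
observations = [
    {"patient": 1, "region": "England"},
    {"patient": 2, "region": "Scotland"},
    {"patient": 3, "region": "Northern Ireland"},
    {"patient": 4, "region": "Wales"},
]
train_set, holdout_set = split_by_region(observations)
```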
[0191] The present machine learning and imputation methods were developed to be as inclusive of potentially informative health data as possible, subject to our strict regularization criteria. While traditional prognostic health models use a static view of patient state at time of prediction (e.g., most recent laboratory values, and presence or absence of historical diagnoses), these methods incorporate temporal features (e.g., recency, trends, and orderings) which are shown to be helpful in increasing model performance.
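As an illustration of temporal features, the sketch below derives a recency feature and a trend feature from the history of one laboratory test. It is an assumption about how such features might be computed, not the method of the disclosure; the function name, feature names, and the least-squares slope are illustrative choices.

```python
from datetime import date

def temporal_features(measurements, as_of):
    """Compute latest value, recency, and trend for one lab test.

    measurements: list of (date, value) pairs in any order.
    as_of: prediction date; later measurements are ignored.
    """
    history = sorted(m for m in measurements if m[0] <= as_of)
    features = {"latest": None, "days_since_latest": None, "slope_per_year": None}
    if not history:
        return features
    latest_date, latest_value = history[-1]
    features["latest"] = latest_value
    features["days_since_latest"] = (as_of - latest_date).days  # recency
    if len(history) >= 2:
        # Trend: ordinary least-squares slope of value against time in years.
        xs = [(d - history[0][0]).days / 365.25 for d, _ in history]
        ys = [v for _, v in history]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx > 0:
            sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            features["slope_per_year"] = sxy / sxx
    return features

# Hypothetical hemoglobin history (g/dL) declining by about 1 per year.
hgb = [(date(2016, 1, 1), 13.0), (date(2015, 1, 1), 14.0), (date(2017, 1, 1), 12.0)]
features = temporal_features(hgb, as_of=date(2017, 3, 1))
```

A static model would see only the latest value of 12.0, whereas the slope feature also exposes the steady decline.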
Comparison to the ColonFlag Model
[0192] In order to compare the present machine learning approach to a more traditional approach, the ColonFlag Model (Kinar et al. 2016) was chosen as a benchmark. This is a natural choice because it has been developed into the commercial product ColonFlag and is used by some health systems in the United States. As noted above, the present Model A is not suitable for a direct comparison to ColonFlag because it was trained on a larger and more recent dataset using methods that differed in multiple ways. In order to isolate the machine learning methodology, two new models were trained in as close to the same manner as the ColonFlag Model as possible. Specifically, both were trained on the ColonFlag Comparison Cohort, which was constructed with the same study design as the dataset used to train the ColonFlag Model. (The ColonFlag Model was trained with a dataset from Maccabi Healthcare Services, which is not available.) Both models were trained with the same methodological approach (cohort, two-year horizon, most recent CBC observation for cases) as ColonFlag. The first model aims to closely replicate the ColonFlag Model and is limited to age, sex, and CBC values (referred to herein as the ColonFlag Replicate Model). The second model uses additional data about a patient’s health history in the same manner as the present Model A (referred to herein as the ColonFlag Comparison Model). Information about the models developed for this study, as well as Model A, is summarized in Table 1.
[0193] Table 1
Figure imgf000044_0001
Figure imgf000045_0001
[0194] To facilitate comparison, the validation data (referred to as the ColonFlag Comparison Case-Control Holdout set) was constructed to be as similar as possible to the validation set used in Kinar et al. (2016). As such, the models were evaluated on their ability to identify patients in a case-control dataset between the ages of 50 and 75 who, at the time of a complete blood count result, would be diagnosed with CRC within 3-6 months. The same set of performance metrics as Kinar et al. (2016) was used to evaluate models on this dataset.
Simulating real-world impact
[0195] To evaluate the clinical impact that Model A could have in practice, given its emphasis on early detection, a simple deployment simulation was developed to estimate the amount of advance notice it would give. In this simulation, we assume that patients’ CRC risk is evaluated every time one of the variables used in our ML model is updated. The first time a patient’s risk score exceeds the average predicted risk in the training population, the patient “accepts” an invitation for a follow-up colonoscopy. Assuming a 60-day scheduling window for colonoscopies, all predictions with a lead time of less than 60 days are dropped, and 60 days are subtracted from the lead time of earlier predictions. Cases without an observation more than 60 days prior to their diagnosis are excluded, representing 5.3% of cases in the Northern Ireland holdout set.
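The deployment simulation can be sketched for a single CRC case as follows. This is an illustrative reading of the rules above, not the code used in the study: the function name and the three outcome strings are hypothetical, and observations exactly 60 days before diagnosis are treated as usable, a boundary the text leaves open.

```python
from datetime import date

SCHEDULING_WINDOW_DAYS = 60  # assumed colonoscopy scheduling delay, per the text

def simulate_case(scored_observations, diagnosis_date, threshold):
    """Simulate one CRC case.

    scored_observations: list of (date, risk_score) pairs for the patient.
    threshold: risk level that triggers a colonoscopy invitation
               (the text uses the average predicted risk in the training population).
    Returns (outcome, adjusted_lead_time_days).
    """
    # Drop predictions made too close to diagnosis to allow scheduling.
    usable = sorted(
        (obs_date, score)
        for obs_date, score in scored_observations
        if (diagnosis_date - obs_date).days >= SCHEDULING_WINDOW_DAYS
    )
    if not usable:
        return "excluded", None
    for obs_date, score in usable:
        if score > threshold:
            # First exceedance triggers the invitation; subtract the scheduling window.
            return "detected", (diagnosis_date - obs_date).days - SCHEDULING_WINDOW_DAYS
    return "missed", None

# Hypothetical case: flagged one year before diagnosis.
case = [(date(2014, 1, 1), 0.001), (date(2014, 6, 1), 0.02), (date(2015, 5, 15), 0.5)]
result = simulate_case(case, diagnosis_date=date(2015, 6, 1), threshold=0.01)
```

Aggregating the adjusted lead times of detected cases (for example with `statistics.median`) gives the kind of median lead time reported for the holdout set.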
(i) Results & Discussion
[0196] The availability of comprehensive EHR data for predicting personal risks and the growing patient engagement with algorithmically driven digital health messaging have increased interest in risk models for prioritizing and personalizing care. At the same time, the availability of massive clinical datasets for training models and new machine learning methods suggests the potential for a new generation of models that are both more widely applicable across patient populations and have greater clinical impact. This model was developed for the risk of CRC, training on a large UK dataset with a 24-month predictive window. The results are encouraging. The ROC curve for Model A applied to the Northern Ireland Holdout set is shown in FIG. 3. On the held-out Northern Irish sites, Model A achieved AUC = 0.768 (95% CI 0.756-0.776). These performance metrics are evaluated for all patients and not the subset of patients who have had a CBC, as is the case with the ColonFlag Model. In the Northern Ireland Holdout, 15% of CRC cases did not have a CBC in the 24 months prior to their diagnosis and would not have been evaluated by the ColonFlag Model.
Comparison Study of Modeling Methods
[0197] The results of the comparison study, in which we built two models that could be compared to the ColonFlag Model in a very similar validation, suggest that our ML methods perform as well as or better than ColonFlag’s methods. Specifically, the ColonFlag Comparison Model outperforms the ColonFlag Model on two of the three metrics. (The results are listed in Table 1.) That said, we are cautious about over-interpreting this study for several reasons. First, the validation sets were similar, but not identical. Second, the model built to "replicate" the ColonFlag Model (the ColonFlag Replicate Model) did not match the published ColonFlag validation results reported by Kinar, but significantly underperformed them.
Comparison of a broad vs. narrow feature set
[0198] The performance of the two comparison models shows very clearly the value of an expanded set of model inputs: the ColonFlag Comparison Model with a broad set of features outperforms the ColonFlag Replicate Model across the board, including a large difference in AUC (0.835 vs 0.777). The ROC curves are shown in FIG. 2. In this case, there is a very clean comparison between the models since they were trained and validated in an identical (as opposed to similar) manner.
[0199] The impact of the broader feature set is well illustrated by FIG. 4, which shows a plot of SHAP values (defined in Lundberg & Lee 2017) for Study Cohort 1 generated from Model A. SHAP values measure the contribution of each feature to each prediction. After regularization (an adjustment to machine learning methods that reduces the complexity of the model and prevents overfitting) 142 variables remained in the model, suggesting that EHRs contain many more informative variables than have traditionally been included in predictive models.
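To make the notion of per-feature contribution concrete, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal effect over all feature orderings. It is an illustration only: it assumes that an "absent" feature takes a fixed baseline value, which is a simplification of the expectation-based formulation in Lundberg & Lee, and the toy model and names are hypothetical. Enumerating orderings is feasible only for a handful of features.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values for one prediction.

    model: callable taking a list of feature values and returning a number.
    instance: feature values for the prediction being explained.
    baseline: values used for features treated as absent (an assumption).
    """
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            # Switch feature i from its baseline value to its actual value.
            current[i] = instance[i]
            output = model(current)
            contributions[i] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]

def toy_model(x):
    """Hypothetical risk score with an interaction between features 0 and 2."""
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]

values = shapley_values(toy_model, instance=[1, 1, 1], baseline=[0, 0, 0])
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.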
[0200] In developing health risk models, there are arguments for restricting the feature set to variables with well-understood relationships to the outcome of interest and that are not dependent on patient behavior or variations in healthcare practice (e.g., patterns of practice or diagnosis that have developed in response to different reimbursement systems). The rationale for not using these features is that their meaning and relationship to the outcome can change when the model is applied in a new context. An example is the presence of a recent negative colonoscopy. In a health system with comprehensive screening, this would likely correlate with reduced risk of CRC, but in a system without screening, the risk reduction from the negative colonoscopy might be more than offset by the risk increase associated with the symptoms or signs that led to the colonoscopy. While it is tempting to restrict oneself to a small set of stable variables because it gives a (possibly illusory) sense of security that the resulting model will work well anywhere without modification, the comparison of broad and narrow models above shows the heavy cost of excluding broader data elements commonly available in the EHR. The difference between the AUC of 0.83 for the broad model and the AUC of 0.77 for the narrow model is large by modeling standards and quite impactful. Rather than limiting the feature set, re-tuning or re-validating the model in the context of different health systems and implementing a data surveillance system to warn of changes over time represent better options for ensuring continued model accuracy without sacrificing the power of a broad feature set.
[0201] In addition to improving the model’s performance, using a broader feature set in combination with data imputation allows Model A to statistically assess risk for any patient at risk of CRC. By contrast, the ColonFlag Model requires that a patient has had a recent CBC in order to make a prediction. In Study Cohort 1, the ColonFlag Model would have made predictions on only 68% of the observations, requiring attending doctors to order CBCs for the remainder in order to assess their CRC risk.
Simulated Real-world impact
[0202] In clinical practice, a risk model such as that described herein is trained and validated for use at any time, not just at a single point in time as is the case in most statistical model validations. We have attempted to assess the clinical impact of the risk model more comprehensively by applying it to longitudinal patient data and simulating a simple use case wherein a risk value over a certain threshold triggers a colonoscopy. This simulation yields an estimate of the "lead time," defined as the difference in the time of diagnosis predicted by the risk model versus diagnosis by the current standard of care. We simulated the impact of two similar models, Model A and the ColonFlag Comparison Model, using the longitudinal data from the Northern Ireland Holdout Set. The most relevant difference between these models is that the ColonFlag Comparison Model (like the ColonFlag Model) trains preferentially on the most recent observation prior to cancer diagnosis, giving it a bias toward later detection, whereas Model A uses all prior observations equally. The results are shown in Table 1. Not only does Model A find more of the cases before the actual standard-of-care diagnosis (80.8% vs 76.7%), but it does so with a median lead time that is 231 days longer than that of the ColonFlag Comparison Model.
(ii) Conclusion
[0203] The recent aggregation of massive sets of EHR data has created an environment well suited for using machine learning models to help improve resource prioritization and care personalization. Dataset 1 from the UK was used to develop a predictive risk model for CRC, paying particular attention in the model design to features that would drive clinical utility. We showed in a comparison study that when our methods are applied to a similar dataset and prediction task, they result in performance that is as good as or better than that of a leading benchmark model. This improved performance is due largely to the inclusion of a richer feature set including features with multiple time points. Finally, we demonstrated that compared to the training approach used for the benchmark model, our training approach resulted in a meaningfully longer lead time before standard-of-care diagnosis, which should result in correspondingly more cancers discovered at a stage where they can be treated effectively.
[0204] Example 2: Modeling Prognostic Colorectal Cancer Risk using EMR and Claims data from the US.
[0205] The model whose performance is depicted here (Model B) was trained and tested using a de-identified dataset from partner B, a compilation of electronic medical records and health insurance claims. It was developed on a dataset comprising all available medical records and insurance history from a subset of the patients in partner B’s data warehouse: all patients with any prior cancer diagnosis and a random 20% sample of all patients without a history of cancer. The combined dataset (Dataset B) was cleaned, curated, and transformed into a format amenable to ML methods using custom techniques similar to those used in developing Model A described in Example 1.
[0206] Model B was designed to help health systems and health plans improve patient outcomes in the context of existing CRC screening programs. As with Model A, a 24-month predictive window, an inclusive population between the ages of 45 and 75, and a broad set of features collected at many points in a patient’s health history were used in training and validation. Further, as with Model A, we use a validation set drawn from a geographically distinct region to test the generalizability of the model. Model B was trained on data collected in seven US census geographic divisions and validated on data from an eighth division.
[0207] Figure 5 represents the performance of Model B in a facsimile of a population health management real-world use case. In this facsimile, a population health team evaluated eligible patients’ two-year CRC risk twice yearly, on January 1 and July 1 of each year from 2012 to 2019. Eligible patients on each date comprised patients who, on that date, had been continuously enrolled for at least one year prior and remained enrolled for two years subsequent (or who died in the subsequent two years). The prior enrollment period requirement was implemented to ensure sufficient data were available to the risk prediction model. The follow-up period requirement ensured that two-year follow-up CRC incidence was captured in the data.
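The eligibility rule for each evaluation date can be sketched as below. The sketch assumes, for simplicity, that each patient has a single continuous enrollment span described by a start and end date; the function and parameter names are hypothetical.

```python
from datetime import date

def evaluation_dates(first_year=2012, last_year=2019):
    """January 1 and July 1 of each year in the evaluation period."""
    return [date(year, month, 1) for year in range(first_year, last_year + 1) for month in (1, 7)]

def is_eligible(eval_date, enroll_start, enroll_end, death_date=None):
    """Check one-year prior enrollment and two-year follow-up (or death in follow-up).

    Assumption: enrollment is one continuous span from enroll_start to enroll_end.
    """
    # Evaluation dates fall on the 1st of January or July, so shifting the year is safe.
    history_start = eval_date.replace(year=eval_date.year - 1)
    followup_end = eval_date.replace(year=eval_date.year + 2)
    has_history = enroll_start <= history_start
    has_followup = enroll_end >= followup_end
    died_in_followup = death_date is not None and eval_date < death_date <= followup_end
    return has_history and (has_followup or died_in_followup)

dates = evaluation_dates()
```

A patient can be eligible on several evaluation dates, so the unit of analysis is the patient-date rather than the patient.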
[0208] In order to ensure our resulting model’s validation performance statistics were representative of those we might expect to see in a real world use case, we re-weighted this patient population to match the approximate age distribution of the USA in 2000, according to SEER statistics.
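Re-weighting to a reference age distribution can be sketched as follows: each patient receives a weight equal to the reference share of their age band divided by the sample share of that band. The five-year bands and the reference shares in the example are placeholders for illustration, not SEER values.

```python
from collections import Counter

def age_band(age, width=5):
    """Label for the age band containing the given age, e.g. 47 -> '45-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def reweight(ages, target_distribution):
    """Per-patient weights so the weighted sample matches the target age distribution.

    target_distribution: mapping from age-band label to target share (summing to 1).
    """
    bands = [age_band(a) for a in ages]
    counts = Counter(bands)
    n = len(ages)
    return [target_distribution[b] / (counts[b] / n) for b in bands]

# Hypothetical sample over-representing the 45-49 band; placeholder target shares.
ages = [46, 47, 48, 52]
target = {"45-49": 0.5, "50-54": 0.5}
weights = reweight(ages, target)
```

After weighting, the two bands carry equal total weight even though three of the four patients fall in the younger band.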
[0209] On each January 1 and July 1 from 2012 to 2019, the eligible patients’ risk scores were assessed by both Model B and a similar model which used only the patient’s age and biological sex as features. The ROC across all eligible patient-dates, with weighting as described above, is depicted in Figure 5. The area under the ROC curve (AUC) for Model B is 0.719 (95% CI 0.701, 0.736) and the AUC for the age-sex baseline is 0.651 (95% CI 0.632, 0.669). The superior performance of Model B is statistically significant: the p-value for an AUC improvement of at least 0.05 is p=0.004.
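A weighted AUC can be computed directly from its pairwise definition: the weighted fraction of case/non-case pairs in which the case receives the higher score, with ties counted as half. The sketch below is a quadratic-time illustration under that definition, with hypothetical labels, scores, and weights; it is not the evaluation code used in the study and omits confidence intervals.

```python
def weighted_auc(labels, scores, weights):
    """Weighted area under the ROC curve via pairwise comparison.

    labels: 1 for cases, 0 for non-cases.
    weights: per-observation weights (for example, age re-weighting).
    """
    positives = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 1]
    negatives = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 0]
    concordant = 0.0
    total = 0.0
    for pos_score, pos_weight in positives:
        for neg_score, neg_weight in negatives:
            pair_weight = pos_weight * neg_weight
            total += pair_weight
            if pos_score > neg_score:
                concordant += pair_weight
            elif pos_score == neg_score:
                concordant += 0.5 * pair_weight
    return concordant / total

# Hypothetical patient-dates.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc_unweighted = weighted_auc(labels, scores, [1, 1, 1, 1])
auc_weighted = weighted_auc(labels, scores, [1, 3, 1, 1])
```

Up-weighting the poorly ranked case lowers the AUC from 0.75 to 0.625, which shows why the choice of reference population affects the reported statistic.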

Claims

CLAIMS What is claimed is:
1. A classifier for evaluation of colorectal cancer risk of a target individual, wherein the classifier is trained on at least one training data set that comprises a plurality of features based on demographic, physiological, and clinical variables from the target individual, wherein the classifier is generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features from a plurality of sampled individuals, wherein at least one of the plurality of features is derived from at least data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
2. The classifier of claim 1, wherein the demographic, physiological, and clinical features comprise at least two features obtained from demographic, symptomatic, lifestyle, diagnosis, or biochemical variables.
3. The classifier of claim 2, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
4. The classifier of claim 3, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
5. The classifier of claim 4, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
6. The classifier of claim 2, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
7. The classifier of claim 6, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
8. The classifier of claim 2, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
9. The classifier of claim 8, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
10. The classifier of claim 2, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
11. The classifier of claim 10, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
12. The classifier of claim 2, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
13. The classifier of claim 12, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
14. The classifier of any of claims 1 to 13, wherein the classifier comprises features based on recency, trends, and sequence features.
15. The classifier of any of claims 1 to 13, wherein the demographic, physiological, and clinical variables reflect an entire medical history of the target individual.
16. The classifier of any of claims 1 to 15, wherein the demographic, physiological, and clinical variables are not generated using a CBC analysis of the target individual.
17. The classifier of any of claims 1 to 16, wherein a training data set used to train the classifier comprises two or more variables per sampled individual across two or more time points.
18. The classifier of any of claims 1 to 17, wherein feature data for the plurality of features is collected longitudinally to provide a time series of feature values.
19. The classifier of claim 18, wherein the time series of feature values is weighted according to recency or elapsed time, thereby providing a higher significance to recent measurements of the time series of feature values.
20. The classifier of claim 18, wherein complexity and overfitting of the classifier are reduced at least in part by performing a machine learning regularization method.
21. The classifier of any of claims 1 to 20, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 variables are used in the classification model.
22. The classifier of any of claims 1 to 20, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in the classification model.
23. A classifier for evaluation of colorectal cancer risk of a target individual, wherein the classifier is trained on at least one training data set that comprises (i) a plurality of features based on two or more demographic, physiological, and clinical variables from the target individual, and (ii) 9 or less blood test features based on a plurality of current blood test results of the target individual, wherein each one of the 9 or less different blood test features is based on a blood test value of one of the plurality of current blood test results of the target individual, wherein at least one of the plurality of features is based on data of the two or more demographic, physiological, and clinical variables obtained at 2 or more time points.
24. The classifier of claim 23, wherein the plurality of blood test results comprises (i) 9 or less results of the following blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT) and (ii) at least one result of the following blood tests: mean cell hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
25. The classifier of claim 23, wherein the blood test results comprise 9 or less results of the following blood tests: white blood cell count (WBC); mean platelet volume (MPV); mean cell; platelet count (CBC); eosinophils count; neutrophils percentage; monocytes percentage; eosinophils percentage; basophils percentage; lymphocytes percentage; neutrophils count; monocytes count; lymphocytes count; and neutrophil-lymphocyte ratio (NLR).
26. The classifier of any of claims 23 to 25, wherein the classifier is trained equally across all prior demographic, physiological and clinical features.
27. The classifier of claim 26, wherein the equal training reduces or prevents bias towards later detection.
28. The classifier of any of claims 23 to 27, wherein the classifier is configured to provide the evaluation of the colorectal cancer risk of the target individual with advance notice or lead time before a diagnosis that is sufficient to permit treatment intervention at a time point that is clinically actionable.
29. The classifier of claim 28, wherein the advance notice or lead time is sufficient to permit a treatment that improves clinical outcomes including treatment efficacy and mortality.
30. The classifier of any of claims 28 to 29, wherein the advance notice or lead time has a median that is between 100 and 300 days before colorectal cancer diagnosis.
31. The classifier of any of claims 28 to 29, wherein the advance notice or lead time has a median that is at least 300 days, at least 400 days, at least 500 days, at least 600 days, at least 700 days, at least 800 days, at least 900 days, at least 1000 days, at least 1100 days, at least 1200 days, at least 1300 days, at least 1400 days, or at least 1500 days before colorectal cancer diagnosis.
32. The classifier of any of claims 28 to 31, wherein the advance notice or lead time is at least 600 days.
33. The classifier of any of claims 28 to 32, wherein the advance notice or lead time is at least 1000 days.
34. The classifier of any of claims 28 to 33, wherein the plurality of features comprises an age of the target individual; wherein the classifier is generated according to at least an analysis of the age of each of another plurality of sampled individuals.
35. The classifier of any of claims 28 to 34, wherein the plurality of features comprises a sex of the target individual; wherein the classifier is generated according to at least an analysis of the sex of each of another plurality of sampled individuals.
36. The classifier of any of claims 28 to 26, wherein the classifier is configured as a machine learning classifier selected from the group consisting of a deep learning classifier, a neural network classifier, a linear discriminant analysis (LDA) classifier, a quadratic discriminant analysis (QDA) classifier, a support vector machine (SVM) classifier, a random forest (RF) classifier, a K nearest neighbor (KNN) classifier, a linear kernel support vector machine classifier, a first or second order polynomial kernel support vector machine classifier, a ridge regression classifier, an elastic net algorithm classifier, a sequential minimal optimization algorithm classifier, a naive Bayes algorithm classifier, and a principal component analysis classifier.
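Claim 20 recites reducing complexity and overfitting through a machine learning regularization method, and claim 36 lists ridge regression and elastic net among the candidate classifiers. The sketch below shows one such technique, an L2 (ridge) penalty on a logistic regression trained by batch gradient descent. It is an illustration under assumptions, not the method of the application: the function `train_logistic_l2`, its hyperparameters, and the four-row toy data set are all hypothetical.

```python
import math


def train_logistic_l2(rows, labels, l2=0.0, lr=0.1, epochs=500):
    """Batch gradient descent for logistic regression with an L2 (ridge) penalty.

    l2 is the penalty strength; l2=0.0 trains an unregularized model.
    The penalty is applied to the feature weights only, not to the bias.
    """
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of the positive class
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        # The l2 * w term shrinks each weight toward zero on every step.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))


# Toy, linearly separable data: two hypothetical standardized features per individual.
rows = [[0.2, 1.0], [0.4, 0.8], [1.6, 0.9], [1.8, 1.1]]
labels = [0, 0, 1, 1]
w_plain, _ = train_logistic_l2(rows, labels, l2=0.0)
w_ridge, _ = train_logistic_l2(rows, labels, l2=1.0)
```

On separable data the unregularized weights keep growing as training continues, while the penalized weights stay small, which is the shrinkage effect that regularization relies on to limit overfitting.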
37. A method of generating a colorectal cancer risk classifier, comprising:
(a) providing a plurality of features based on demographic, physiological and clinical variables from a target individual;
(b) generating a dataset having a plurality of sets of features, each set of features generated according to a respective plurality of features based on demographic, physiological and clinical features from a plurality of sampled individuals; and
(c) generating at least one classifier based at least in part on an analysis of the dataset, and outputting the at least one classifier, wherein at least one of the plurality of features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
38. The method of claim 37, wherein the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
39. The method of claim 38, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
40. The method of claim 39, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
41. The method of claim 40, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
42. The method of claim 38, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
43. The method of claim 42, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
44. The method of claim 38, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
45. The method of claim 44, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
46. The method of claim 38, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
47. The method of claim 46, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
48. The method of claim 38, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
49. The method of claim 48, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
50. The method of claim 37, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in generating the colorectal cancer risk classifier.
51. The method of claim 37, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in generating the colorectal cancer risk classifier.
52. The method of claim 37, wherein the generating of (c) comprises weighting each of the plurality of demographic, physiological and clinical features according to a date of the respective plurality of demographic, physiological and clinical features.
53. The method of claim 37, wherein the generating of (c) comprises filtering the plurality of demographic, physiological and clinical features to remove outliers according to a standard deviation maximum threshold.
54. The method of claim 37, wherein the plurality of features are weighted according to a date of the respective plurality of demographic, physiological and clinical features.
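Claim 53 recites filtering features to remove outliers according to a standard deviation maximum threshold. A minimal sketch of that kind of filter follows. The claims give no threshold value or estimator, so the function name `filter_outliers`, the `max_sd` parameter, the use of the population standard deviation, and the sample ALT readings are assumptions made for illustration.

```python
import statistics


def filter_outliers(values, max_sd=3.0):
    """Drop values lying more than max_sd standard deviations from the mean.

    Uses the population standard deviation of the supplied values; both the
    estimator and the default threshold are assumptions, not claim language.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return list(values)  # all values identical, nothing to filter
    return [v for v in values if abs(v - mean) <= max_sd * sd]


# Hypothetical ALT readings (U/L) with one implausible entry.
alt_values = [22.0, 25.0, 19.0, 24.0, 21.0, 23.0, 20.0, 26.0, 22.0, 400.0]
kept = filter_outliers(alt_values, max_sd=2.0)
```

A threshold of 2 is used here because, in a sample of only ten values, a single extreme value inflates the standard deviation enough that it can never lie more than 3 population standard deviations from the mean. A more robust variant could use the median and median absolute deviation instead.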
55. A method of generating a risk profile for colorectal cancer in a target individual comprising: a. obtaining a plurality of features based on demographic, physiological, and clinical variables from a target individual, b. providing at least one classifier generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features from a plurality of sampled individuals; and c. evaluating, using a processor, a colorectal cancer risk of the target individual at least in part by classifying the plurality of features using the at least one classifier to provide a risk profile of the target individual, wherein at least one of the plurality of features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
56. The method of claim 55, wherein the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
57. The method of claim 56, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
58. The method of claim 57, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
59. The method of claim 58, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
60. The method of claim 56, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
61. The method of claim 60, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
62. The method of claim 56, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
63. The method of claim 62, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
64. The method of claim 56, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
65. The method of claim 64, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
66. The method of claim 56, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
67. The method of claim 66, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
68. The method of any of claims 54 to 67, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in generating the risk profile for colorectal cancer.
69. The method of any of claims 54 to 68, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in generating the risk profile for colorectal cancer.
70. A method for evaluating colorectal cancer risk of a target individual, comprising: obtaining a plurality of features based on demographic, physiological, and clinical variables from a target individual; providing at least one classifier generated based at least in part on an analysis of a plurality of respective demographic, physiological, and clinical features of each of a plurality of sampled individuals; and evaluating, using a processor, a colorectal cancer risk of the target individual at least in part by classifying the plurality of features using the at least one classifier, wherein at least one of the features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points.
71. The method of claim 70, wherein the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
72. The method of claim 71, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
73. The method of claim 72, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
74. The method of claim 73, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
75. The method of claim 71, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
76. The method of claim 73, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
77. The method of claim 71, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
78. The method of claim 77, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
79. The method of claim 71, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
80. The method of claim 79, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
81. The method of claim 71, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
82. The method of claim 81, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
83. The method of any of claims 70 to 82, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier.
84. The method of any of claims 70 to 83, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
85. A method for evaluation of colorectal cancer risk in a target individual, comprising: a) receiving by a computing system associated with a database storing a plurality of classifiers and from a client terminal and via a network, an indication of values of a plurality of features based on demographic, physiological, and clinical variables from a target individual, wherein the clinical variables comprise current blood test results calculated based at least in part on an analysis of a blood sample obtained from the target individual, and wherein at least one of the demographic, physiological, and clinical features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points; b) generating, by said computing system, a combination of features based on two or more of the demographic, physiological and clinical variables, and 9 or less blood test features based on said plurality of current blood test results, each one of said 9 or less different blood test features is based on a blood test value of one of said plurality of current blood test results, wherein at least one of the features is based on data of the demographic, physiological, and clinical variables obtained at 2 or more time points; c) selecting at least one classifier from the plurality of classifiers according to at least one demographic characteristic of the target individual, wherein each of the plurality of classifiers is generated according to a plurality of respective demographic, physiological and clinical variables and historical blood test results of a plurality of sampled individuals having at least one different demographic characteristic, wherein the at least one classifier is generated based at least in part on an analysis of the plurality of respective demographic, physiological, and clinical variables and historical blood test results of each of another of the plurality of sampled individuals; and d) evaluating, using a computer processor of the computing system, a colorectal cancer risk of the target individual at least in part by classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier; and e) outputting the colorectal cancer risk for presentation by the client terminal.
86. The method of claim 85, wherein the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
87. The method of claim 86, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
88. The method of claim 87, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
89. The method of claim 88, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
90. The method of claim 86, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
91. The method of claim 90, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
92. The method of claim 86, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
93. The method of claim 92, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
94. The method of claim 86, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
95. The method of claim 94, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
96. The method of claim 86, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
97. The method of claim 96, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
98. The method of any of claims 85 to 97, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier.
99. The method of any of claims 85 to 98, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the demographic, physiological and clinical variables and combination of 9 or less blood test features using the at least one classifier.
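Steps c) and d) of claim 85 describe selecting one classifier from a stored plurality according to a demographic characteristic of the target individual and then classifying that individual's features. The sketch below shows one way such a lookup could be arranged. Everything in it is hypothetical: the sex and age-band key, the `CLASSIFIERS` registry, the feature names, and the weights are placeholders chosen for illustration and carry no clinical meaning.

```python
import math
from typing import Callable, Dict, Tuple

Features = Dict[str, float]
Classifier = Callable[[Features], float]


def make_logistic_scorer(weights: Dict[str, float], bias: float) -> Classifier:
    """Build a toy classifier; the weights are illustrative placeholders only."""
    def score(features: Features) -> float:
        z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
        return 1.0 / (1.0 + math.exp(-z))  # risk score between 0 and 1
    return score


def demographic_key(sex: str, age: int) -> Tuple[str, str]:
    """Map demographic characteristics to a registry key (assumed banding)."""
    return (sex, "under_50" if age < 50 else "50_plus")


# Hypothetical registry: one classifier per demographic stratum.
CLASSIFIERS: Dict[Tuple[str, str], Classifier] = {
    ("female", "under_50"): make_logistic_scorer({"hgb_z": -0.6, "mch_z": -0.3}, -4.0),
    ("female", "50_plus"): make_logistic_scorer({"hgb_z": -0.8, "mch_z": -0.4}, -3.0),
    ("male", "under_50"): make_logistic_scorer({"hgb_z": -0.7, "mch_z": -0.3}, -4.0),
    ("male", "50_plus"): make_logistic_scorer({"hgb_z": -0.9, "mch_z": -0.5}, -3.0),
}


def evaluate_risk(sex: str, age: int, features: Features) -> Tuple[Tuple[str, str], float]:
    """Select the stratum-specific classifier, then score the features."""
    key = demographic_key(sex, age)
    classifier = CLASSIFIERS[key]  # raises KeyError for an unknown stratum
    return key, classifier(features)


# Hypothetical standardized blood test features for one individual.
selected_key, risk = evaluate_risk("male", 63, {"hgb_z": -1.5, "mch_z": -1.0})
```

Keeping the selection step separate from scoring means each stratum's model can be retrained or replaced without touching the evaluation path.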
100. A system for generating a colorectal cancer risk profile comprising: (i) a processor, (ii) a memory unit which stores at least one classifier generated based at least in part on an analysis of a plurality of demographic, physiological, and clinical features of individuals of a plurality of sampled individuals, and an input unit which receives a plurality of demographic, physiological and clinical variables of a target individual, and (iii) a colorectal cancer evaluating module which evaluates, using the processor, a colorectal cancer risk of the target individual at least in part by classifying, using the at least one classifier, a plurality of features based on the plurality of demographic, physiological, and clinical variables, wherein at least one of the plurality of features is based on data of the plurality of demographic, physiological, and clinical variables obtained at 2 or more time points.
101. The system of claim 100, wherein the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
102. The system of claim 101, wherein the demographic, physiological and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
103. The system of claim 102, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
104. The system of claim 103, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
105. The system of claim 104, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
106. The system of claim 102, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
107. The system of claim 106, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
108. The system of claim 102, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
109. The system of claim 108, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
110. The system of claim 102, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
111. The system of claim 110, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
112. The system of claim 102, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
113. The system of claim 112, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
114. The system of any of claims 100 to 113, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier.
115. The system of any of claims 100 to 114, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
116. The system of any of claims 100 to 115, wherein the colorectal cancer risk profile identifies individuals at risk of colorectal cancer.
117. The system of any of claims 100 to 116, wherein the colorectal cancer risk profile stratifies a population of individuals for cancer risk.
118. The system of any of claims 100 to 117, wherein the colorectal cancer risk profile is used to provide treatment recommendations for the individual based on the colorectal cancer risk profile.
119. A system for classifying a target individual for colorectal cancer risk comprising: (i) a processor, (ii) a memory unit which stores at least one classifier generated based at least in part on an analysis of a plurality of demographic, physiological, and clinical features of a plurality of sampled individuals, and an input unit which receives a plurality of demographic, physiological, and clinical variables of a target individual, and (iii) a colorectal cancer evaluating module which evaluates, using the processor, a colorectal cancer risk of the target individual at least in part by classifying, using the at least one classifier, a plurality of features based on the plurality of demographic, physiological, and clinical variables, wherein at least one of the plurality of features is based on data of the plurality of demographic, physiological, and clinical variables obtained at 2 or more time points.
120. The system of claim 119, wherein the demographic, physiological, and clinical features comprise at least two features selected from demographic, symptomatic, lifestyle, diagnosis, and biochemical variables.
121. The system of claim 120, wherein the demographic variables are selected from age, gender, weight, height, BMI, race, country, and geographically determined data.
122. The system of claim 121, wherein the demographic variables comprise local air quality, limiting long-term illness, or Townsend deprivation index.
123. The system of claim 122, wherein the demographic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
124. The system of claim 120, wherein the symptomatic variables are selected from heartburn, sweating, hemorrhoids, chronic sinusitis, atrial fibrillation, rectal bleeding, abdominal pain, bowel movement change, reflux, constipation, indigestion, hypertension, hematuria, diarrhea, bloating, belching, abnormal weight loss, and obesity.
125. The system of claim 124, wherein the symptomatic variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
126. The system of claim 120, wherein the lifestyle variables are selected from smoking, alcohol use, red meat consumption, medications, progestogen, estrogen and other hormonal modulators, hormonal contraceptives, non-steroidal anti-inflammatory and antirheumatic drugs (NSAID), antithrombotic agents, blood glucose lowering drugs excluding insulin, insulin and analogues, immunosuppressants, sulfonamides and trimethoprim antibacterials, and drugs for peptic ulcer and gastroesophageal reflux disease (GERD).
127. The system of claim 126, wherein the lifestyle variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
128. The system of claim 120, wherein the diagnosis variables are selected from diverticular disease, multiple sclerosis, diabetes 1, diabetes 2, hyperthyroidism, psoriasis, anemia, peripheral arterial disease, AIDS, mental health disorder, inflammatory bowel disease, myocardial infarction, lupus, dementia, heart failure, hepatitis, alcoholic fatty liver disease, cirrhosis, rheumatoid arthritis, acromegaly, hemiplegia, COPD, irritable bowel syndrome, learning disabilities, depression, stroke, and chronic kidney disease.
129. The system of claim 128, wherein the diagnosis variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
130. The system of claim 118, wherein the biochemical variables are selected from alanine aminotransferase (ALT), albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), calcium, carbon dioxide, chloride, cholesterol, creatinine, ferritin, globulin, glucose, high density lipoprotein (HDL), low density lipoprotein (LDL), phosphate, potassium, sodium, triglycerides, urea, and uric acid.
131. The system of claim 120, wherein the biochemical variables are featurized for use in a computerized method or for use as an input to train a computational classification model.
132. The system of any of claims 119 to 131, wherein at least about 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190 or 200 variables are used in classifying the plurality of features using the at least one classifier.
133. The system of any of claims 119 to 132, wherein between 20 and 200, 30 and 190, 40 and 180, 50 and 170, 60 and 160, 70 and 150, 80 and 140, 90 and 130, or 100 and 120 variables are used in classifying the plurality of features using the at least one classifier.
134. The system of any of claims 119 to 133, wherein the demographic, physiological, and clinical features comprise a plurality of historical and current blood test results comprising results of 9 or fewer of the following plurality of blood tests: red blood cells (RBC), hemoglobin (HGB), and hematocrit (HCT), and at least one result of the following blood tests: mean corpuscular hemoglobin (MCH) and mean corpuscular hemoglobin concentration (MCHC).
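For illustration only, the featurization recited in claims 123 to 131 and the variable-count range of claim 133 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the variable subsets, the binary encoding of symptoms, the missingness indicator for laboratory values, and the names `featurize` and `variable_count_ok` are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical subsets of the variables listed in claims 124 and 130.
# The encoding choices below are assumptions, not taken from the claims.
SYMPTOMATIC = ["rectal bleeding", "abdominal pain", "constipation", "diarrhea"]
BIOCHEMICAL = ["albumin", "ferritin", "glucose", "creatinine"]


def featurize(record: Dict[str, Any]) -> List[float]:
    """Turn one subject record into a flat numeric feature vector."""
    features: List[float] = []
    # Symptomatic variables: 1.0 if recorded as present, otherwise 0.0.
    for name in SYMPTOMATIC:
        features.append(1.0 if record.get(name) else 0.0)
    # Biochemical variables: the value followed by a missingness indicator.
    for name in BIOCHEMICAL:
        value = record.get(name)
        if value is None:
            features.extend([0.0, 1.0])
        else:
            features.extend([float(value), 0.0])
    return features


def variable_count_ok(n_variables: int, low: int = 20, high: int = 200) -> bool:
    """Check a variable count against one range from claim 133 (20 to 200)."""
    return low <= n_variables <= high


# Example record with one symptom and two laboratory values present.
record = {"rectal bleeding": True, "albumin": 41.0, "glucose": 5.2}
vector = featurize(record)
```

The resulting vector could then serve as one input row when training a classification model; the other ranges of claim 133 can be checked by passing different `low` and `high` values.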
PCT/US2023/061453 2022-01-28 2023-01-27 Methods and systems for risk stratification of colorectal cancer WO2023147472A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263304101P 2022-01-28 2022-01-28
US63/304,101 2022-01-28

Publications (1)

Publication Number Publication Date
WO2023147472A1 (en) 2023-08-03

Family

ID=87472698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/061453 WO2023147472A1 (en) 2022-01-28 2023-01-27 Methods and systems for risk stratification of colorectal cancer

Country Status (1)

Country Link
WO (1) WO2023147472A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014201515A1 (en) * 2013-06-18 2014-12-24 Deakin University Medical data processing for risk prediction
US20160283686A1 (en) * 2015-03-23 2016-09-29 International Business Machines Corporation Identifying And Ranking Individual-Level Risk Factors Using Personalized Predictive Models
US20170039334A1 (en) * 2012-05-03 2017-02-09 Medial Research Ltd. Methods and systems of evaluating a risk of a gastrointestinal cancer
US20180068083A1 (en) * 2014-12-08 2018-03-08 20/20 Gene Systems, Inc. Methods and machine learning systems for predicting the likelihood or risk of having cancer

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170039334A1 (en) * 2012-05-03 2017-02-09 Medial Research Ltd. Methods and systems of evaluating a risk of a gastrointestinal cancer
US20200050917A1 (en) * 2012-05-03 2020-02-13 Medial Research Ltd. Methods and systems of evaluating a risk of a gastrointestinal cancer
WO2014201515A1 (en) * 2013-06-18 2014-12-24 Deakin University Medical data processing for risk prediction
US20180068083A1 (en) * 2014-12-08 2018-03-08 20/20 Gene Systems, Inc. Methods and machine learning systems for predicting the likelihood or risk of having cancer
US20160283686A1 (en) * 2015-03-23 2016-09-29 International Business Machines Corporation Identifying And Ranking Individual-Level Risk Factors Using Personalized Predictive Models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23747884

Country of ref document: EP

Kind code of ref document: A1