WO2024100632A1 - Systems and methods for prioritizing medical resources for cancer screening
- Publication number
- WO2024100632A1 (application PCT/IB2023/061440)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- patient
- data
- features
- cancer
- candidate
- Prior art date
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Definitions
- Lung cancer most commonly begins with the development of a lung nodule.
- the larger the nodule, the more rapid its growth, or the more irregular its appearance, the more likely it is to be cancer.
- lung nodules in patients remain undetected for periods of time or, even if detected, can already indicate an advanced stage of cancer.
- early prediction of lung cancer risk in patients even prior to the development of one or more lung nodules can be valuable.
- early-stage cancer screening remains difficult as screening large numbers of patients using resource-intensive methodologies would be infeasible. For example, performing tissue biopsies and/or image scanning across large patient populations is untenable. Therefore, there is a need to effectively identify patients most likely at risk of developing cancer.
- Embodiments of the disclosure disclosed herein involve implementing machine learning models to analyze electronic records of patients.
- Electronic records of patients can represent valuable information that is predictive of the risk of cancer.
- electronic records can be accumulated easily and cost-effectively, e.g., during patient visits. Therefore, electronic records can be valuable data for continuous monitoring and evaluation of patients for their risk of developing cancer.
- Methods disclosed herein are useful for identifying such patients that may be at risk of developing cancer, hereafter referred to as candidate patients. Therefore, healthcare providers, who may be caring for large numbers of patients, can appropriately prioritize limited medical resources by providing interventions to candidate patients that are at most risk of developing cancer. For example, candidate patients can undergo subsequent imaging and/or biopsy procedures, which are far more costly procedures for confirming whether the candidate patients are at risk of developing cancer.
- a method for prioritizing medical resources for screening a patient for cancer comprising obtaining a temporally diverse dataset comprising electronic records of a patient; weighting features from data of the electronic records of the patient according to timepoints that the data were recorded in the electronic records of the patient; analyzing the weighted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient for prioritization of medical resources.
- weighting the features according to timepoints that the data were recorded in the electronic records of the patient comprises assigning higher weights to features from data that were more recently recorded in the electronic records in comparison to features from data that were earlier recorded in the electronic records.
- methods disclosed herein further comprise normalizing the data of the electronic records of the patient.
- normalizing the data comprises applying a hyperbolic tangent transformation.
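The hyperbolic tangent normalization mentioned above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: centering on the mean and scaling by the standard deviation before applying tanh are assumptions, as are the function name `tanh_normalize` and the sample values.

```python
import math

def tanh_normalize(values, scale=1.0):
    """Squash raw feature values into (-1, 1) with a hyperbolic tangent.

    Assumption: values are first standardized (mean 0, unit variance);
    the source only states that a tanh transformation is applied.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance) or 1.0  # constant columns fall back to std = 1
    return [math.tanh(scale * (v - mean) / std) for v in values]

# Hypothetical laboratory values containing one outlier.
normalized = tanh_normalize([118.0, 121.0, 135.0, 260.0])
```

A bounded transformation of this kind limits the influence of outliers while preserving the ordering of the original values.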
- the machine learning model outputs a score indicative of cancer risk for the patient.
- the score indicative of cancer risk is a continuous score between 0 and 1.
- the machine learning model further outputs an identification of a feature or feature grouping that contributed to the score indicative of cancer risk.
- providing identification of the candidate patient for prioritization of medical resources further comprises providing the corresponding identifications of features or feature groupings of the candidate patient for prioritization of medical resources.
- the features from data of the electronic records comprises features from electronic health record (EHR) data.
- the features from data of the electronic records comprises features from medical claims data.
- the features from data of the electronic records comprises features from EHR data and medical claims data.
- the features from EHR data comprises one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data.
- the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
- the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke.
- the features from medical claims data comprises one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
- the prior diagnoses data comprises one or more diagnostic codes.
- the one or more diagnostic codes comprise ICD-9 or ICD-10 codes.
- the one or more diagnostic codes comprise ICD-10 codes, wherein one or more ICD-10 codes were converted from one or more ICD-9 codes.
- the prior procedures data comprises one or more procedures codes.
- the one or more procedures codes comprise HCPCS or CPT-4 codes.
- the prior prescriptions data comprises one or more national drug codes (NDCs).
- the patient is between 50-80 years old. In various embodiments, the patient exhibits a prior smoking history.
- the patient has not previously undergone a computed tomography (CT) scan, a positron emission tomography (PET) scan, or a PET-CT scan.
- the patient has not previously undergone a cancer biopsy procedure.
- the patient has not previously received a cancer diagnosis.
- the cancer comprises lung cancer.
- the lung cancer is one of non-small cell lung cancer, small cell lung cancer, adenocarcinoma, and squamous cell carcinoma.
- the prioritization of medical resources comprises prioritizing patients for undergoing computed tomography (CT) scans.
- the machine learning model comprises a logistic regression model, a random forest model, or a neural network.
- methods disclosed herein further comprise obtaining updated electronic records for one or more patients, the updated electronic records comprising additional data recorded in the updated electronic records subsequent to providing identification of the candidate patient; analyzing features from the additional data using a machine learning model to categorize a patient as an additional candidate patient at risk for cancer or a non-candidate patient; and responsive to determining that the patient is an additional candidate patient, providing identification of the additional candidate patient for prioritization of medical resources.
- a method for prioritizing medical resources for screening a patient for cancer comprising obtaining a dataset comprising electronic records of a patient; receiving an indication of available medical resources of a third party; extracting features from data of the electronic records of the patient; analyzing the extracted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient, wherein the categorizing of the patient uses at least a prediction of the machine learning model and a threshold selected according to the received indication; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient for prioritization of medical resources.
- the threshold is selected to account for the available medical resources of the third party.
- a lower threshold is selected for the third party for an indication reflecting higher available medical resources of the third party, in comparison to a higher threshold that is selected for the third party for an indication reflecting lower available resources for the third party.
- methods disclosed herein further comprise weighting the extracted features according to timepoints that the data were recorded in the electronic records of the patient.
- a method for prioritizing medical resources for screening individuals for cancer comprising obtaining a dataset comprising electronic records of a patient; extracting features from data of the electronic records of the patient; analyzing the extracted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient, wherein the machine learning model is configured to output 1) a score indicative of lung cancer risk for the patient and 2) identification of a feature grouping that contributed to the score indicative of lung cancer risk, wherein the feature grouping comprises two or more features; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient and the identification of the feature grouping to the third party for prioritization of medical resources.
- the feature grouping comprises at least 2 features. In some embodiments, the feature grouping may comprise between 2 and 10 features. In various embodiments, the feature grouping comprises one of a lung issue grouping, heart issue grouping, smoking status grouping, patient characteristics grouping, patient behavior grouping, and vaccine grouping. In various embodiments, the lung issue grouping comprises one or more of chronic obstructive pulmonary disease (COPD), chronic bronchitis, pleural effusion, dyspnea, wheezing, and inhaled treatment for COPD and/or asthma.
- the heart issue grouping comprises one or more of atherosclerotic heart disease, iron deficiency anemias, elevated blood pressure, treatment for high blood pressure, and treatment for reducing risk of heart attack and/or stroke.
- the smoking status grouping comprises one or more of tobacco use, nicotine dependence, cigarette use, smoking cessation, number of months actively smoking, never smoked observation, and current smoker observation.
- the patient characteristics grouping comprises one or more of systolic blood pressure, diastolic blood pressure, number of months active, patient age, and geographic location.
- the patient behavior grouping comprises one or more of prior established patient visits and new patient visits.
- the vaccine grouping comprises one or more of pneumonia vaccine and flu vaccine.
- analyzing the extracted features using the machine learning model comprises implementing a Shapley additive contribution algorithm to determine contributions of one or more feature groupings.
- the feature grouping identified by the output of the machine learning model comprises a feature grouping providing the highest contribution to the score.
- the output of the machine learning model further comprises an identification of a second feature grouping providing the second highest contribution to the score.
- the output of the machine learning model further comprises an identification of a third feature grouping providing the third highest contribution to the score.
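Because Shapley-style contributions are additive, per-feature contributions can be summed within each feature grouping and the groupings ranked by their totals. The sketch below assumes the per-feature contributions have already been computed elsewhere; the feature names, the mapping `FEATURE_TO_GROUPING`, and the numeric values are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from individual features to the groupings named in the text.
FEATURE_TO_GROUPING = {
    "copd_diagnosis": "lung issue",
    "dyspnea": "lung issue",
    "nicotine_dependence": "smoking status",
    "current_smoker_observation": "smoking status",
    "atherosclerotic_heart_disease": "heart issue",
    "flu_vaccine": "vaccine",
}

def rank_groupings(feature_contributions, top_k=3):
    """Sum additive per-feature contributions by grouping; return the top_k groupings."""
    totals = defaultdict(float)
    for feature, contribution in feature_contributions.items():
        grouping = FEATURE_TO_GROUPING.get(feature, "ungrouped")
        totals[grouping] += contribution
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Illustrative contributions only; real values would come from a Shapley-style explainer.
example_contributions = {
    "copd_diagnosis": 0.25,
    "dyspnea": 0.125,
    "nicotine_dependence": 0.5,
    "current_smoker_observation": 0.0625,
    "atherosclerotic_heart_disease": 0.03125,
    "flu_vaccine": -0.0625,
}
top_groupings = rank_groupings(example_contributions)
```

The first, second, and third entries of `top_groupings` correspond to the groupings with the highest, second highest, and third highest contribution to the score.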
- categorizing the patient as a candidate patient or a non-candidate patient based on the score further comprises selecting a threshold according to a received indication of available medical resources of the third party; and categorizing the patient as a candidate patient or a non-candidate patient using the score and the threshold.
- FIG. 1A depicts a system overview for prioritizing medical resources for candidate patients, in accordance with an embodiment.
- FIG. 1B depicts a block diagram of an example patient prioritization system, in accordance with an embodiment.
- FIG. 2A depicts an example flow diagram for implementing a risk prediction model for identifying candidate patients, in accordance with an embodiment.
- FIG. 2B depicts an example diagram for organizing the electronic data, in accordance with an embodiment.
- FIG. 3A depicts an example flow process for identifying candidate patients, in accordance with a first embodiment.
- FIG. 3B depicts an example flow process for identifying candidate patients, in accordance with a second embodiment.
- FIG. 3C depicts an example interaction diagram for identifying candidate patients, in accordance with a third embodiment.
- FIG. 4 illustrates an example computer for implementing the entities shown in FIGS. 1A, 1B, 2, 3A, and 3B.
- FIG. 5A depicts an example data pipeline for developing the algorithm (e.g., machine learning model).
- FIG. 5B depicts an example data pipeline for validating the algorithm (e.g., machine learning model).
- FIG. 6A depicts example output for a patient with a standard lung cancer risk.
- FIG. 6B depicts example output for a patient with an elevated lung cancer risk.
- FIG. 7 shows performance of various machine learning models.
- “subject” or “patient” are used interchangeably and encompass a cell, tissue, or organism, human, or non-human, whether in vivo, ex vivo, or in vitro, male, or female.
- “mammal” encompasses both humans and non-humans and includes but is not limited to humans, non-human primates, canines, felines, murines, bovines, equines, and porcines.
- “electronic records” and “electronic data” are used interchangeably and generally refer to patient data stored in electronic form. Examples of electronic records described herein include electronic health records and claims data.
- the term “obtaining a dataset comprising electronic records of a patient” and variants thereof encompasses obtaining a dataset comprising electronic records captured from the patient.
- Obtaining the dataset comprising electronic records can encompass performing steps of capturing the dataset e.g., obtaining data from the patient and recording the data.
- the phrase can also encompass receiving the dataset, e.g., from a third party that has performed the steps of capturing the dataset comprising electronic records from the patient.
- the term “obtaining a dataset comprising electronic records of a patient” can also include having (e.g., instructing) a third party obtain the dataset.
- “sample” or “test sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a patient, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
- a sample can be a biopsy of a tissue, such as a lung tumor or a lung nodule.
- the phrase “at risk for cancer” refers to a risk that a patient will develop cancer within a given time period, e.g., within 1 year.
- the risk of cancer refers to a likelihood that a patient will develop cancer within a given time period from time zero (T0), wherein time zero refers to when electronic data was obtained from the patient.
- the risk of cancer refers to a likelihood that a patient will develop cancer within a certain period, for example, 6 months, 1 year, 10 years, or 20 years.
- the terms “treating,” “treatment,” or “therapy” of lung cancer shall mean slowing, stopping, or reversing a cancer’s progression by administration of treatment.
- treating lung cancer means reversing the cancer’s progression, ideally to the point of eliminating the cancer itself.
- “treating,” “treatment,” or “therapy” of lung cancer includes administering a therapeutic agent or pharmaceutical composition to the patient. Additionally, as used herein, “treating,” “treatment,” or “therapy” of lung cancer further includes administering a therapeutic agent or pharmaceutical composition for prophylactic purposes.
- Prophylaxis of a cancer refers to the administration of a composition or therapeutic agent to prevent the occurrence, development, onset, progression, or recurrence of cancer or some or all the symptoms of lung cancer or to lessen the likelihood of the onset of lung cancer.
- FIG. 1A depicts a system overview for prioritizing medical resources for candidate patients, in accordance with an embodiment.
- the system environment 100 provides context to introduce patients 110, stored electronic data 120, and a patient prioritization system 130 for identifying candidate patients 140.
- Although FIG. 1A depicts a system environment 100 including three patients 110, in various embodiments, the system environment 100 includes additional or fewer patients such that the patient prioritization system 130 identifies a subset of patients as candidate patients 140.
- the system environment 100 may include 2, 100, 1000, 1 million, 100 million, or other number of patients.
- the patients 110 are presumed to be healthy.
- the patients 110 have not been previously diagnosed with cancer.
- the patients 110 have not been previously suspected of having cancer.
- the methods disclosed herein can be beneficial for identifying candidate patients who may be at risk of cancer from patients who are presumed to be healthy.
- the type of cancer is a lung cancer.
- the methods described herein can be beneficial for prioritizing candidate patients for early detection of lung cancer.
- a patient 110 may have been previously diagnosed with a cancer.
- the patient 110 can be in remission and therefore, the methods disclosed herein can be beneficial for determining whether the patient 110 is likely to experience a recurrence of cancer.
- data can be obtained from the patients 110 and stored.
- data can include electronic data.
- FIG. 1A shows example stored electronic data 120 that is obtained from patients 110.
- Exemplary electronic data include electronic health record (EHR) data and claims data.
- EHR data represents an electronic version of a patient’s medical history.
- Claims data includes administrative data covering information such as doctor’s appointments, bills, and insurance information. Additional details and examples of EHR data and claims data are further described herein.
- the stored electronic data 120 can be gathered from patients 110 at one or more patient visits (e.g., patient visits to a medical provider). Thus, upon each patient visit, the stored electronic data 120 can be further augmented or supplemented by the information gathered at that patient visit.
- the stored electronic data 120 represents a temporally diverse dataset of electronic records including electronic data recorded at various timepoints (e.g., at various timepoints when the patient visited).
- the stored electronic data 120 is maintained and updated in real-time as additional information is gathered from a patient.
- the stored electronic data 120 of various patients can be maintained in a cloud service and therefore, can be continuously updated as new or updated information of patients 110 is obtained.
- The database management system for storing electronic data can be any suitable system, for example, EPIC Healthcare Software, EBS PathoSof, HxRx Healthcare Management System, Healcon Practice, Drug Inventory Management System (DIMS), oeHealth, Patientpop, Webptis, GeBBS HIM Solutions, Cerner, WebPT, eClinicalWorks, and NextGen Healthcare EHR.
- the stored electronic data 120 and the patient prioritization system 130 are maintained or employed by a common party.
- the stored electronic data 120 and the patient prioritization system 130 are maintained or employed by different parties.
- a first party may maintain the stored electronic data 120 and/or continuously update the stored electronic data 120 in view of new or updated information from patients 110.
- a different party may operate the patient prioritization system 130 to analyze the stored electronic data 120 to identify candidate patients 140.
- the party that maintains the stored electronic data 120 may be a hospital or physician’s office.
- the patient prioritization system 130 may request and access the stored electronic data 120 maintained by a hospital or a physician’s office to identify candidate patients 140.
- FIG. 1A shows a single stored electronic data 120
- each stored electronic data 120 can be maintained by a different party (e.g., a different hospital or physician’s office) or multiple parties. Therefore, the patient prioritization system 130 can access and analyze stored electronic data 120 from a large number of patients by accessing stored electronic data 120 from various sources maintained by any number of parties.
- the patient prioritization system 130 accesses and analyzes stored electronic data 120 to identify candidate patients 140 that may be at risk of cancer.
- Candidate patients 140 may represent a subset of the patients 110.
- Candidate patients 140 may be prioritized to receive a subsequent intervention, whereas patients that are not identified as candidate patients 140 (patients not identified as candidate patients are hereafter referred to as “non-candidate patients”) may be deprioritized, in an embodiment.
- the patient prioritization system 130 accesses the stored electronic data 120 by sending a request to a party that maintains the stored electronic data 120.
- the patient prioritization system 130 continuously accesses the stored electronic data 120 over time. Continuously accessing the stored electronic data 120 over time may enable the patient prioritization system 130 to access the most up-to-date stored electronic data 120 such that the patient prioritization system 130 can identify the candidate patients based on the most up-to-date stored electronic data 120.
- the patient prioritization system 130 accesses the stored electronic data 120 at predetermined intervals of time (e.g., daily, bi-weekly, weekly, monthly, annually, or other suitable time intervals).
- the party maintaining the stored electronic data 120 can provide the stored electronic data 120 to the patient prioritization system 130 when a trigger event occurs.
- a trigger event may be an update to the stored electronic data 120 in view of new patient information or change to patient information.
- the patient prioritization system 130 analyzes stored electronic data 120, such as EHR data, claims data, or other data that may be easily obtained and/or does not require extensive computing resources to analyze. Using easily obtainable data allows a larger pool of patients to be evaluated, making the approach suitable for early-stage cancer screening.
- the patient prioritization system 130 may analyze stored electronic data 120 of patients 110 by deploying a trained machine learning model, hereafter referred to as a trained risk prediction model.
- the trained risk prediction model may analyze features derived from the stored electronic data 120 of patients 110 to determine which patients are to be categorized as candidate patients 140.
- FIG. 1B depicts a block diagram of an example patient prioritization system 130.
- the patient prioritization system 130 includes a feature extraction module 145, a resource availability module 150, a model training module 155, a model deployment module 160, and a candidate patient identifier module 165.
- the patient prioritization system 130 may be configured differently with additional, fewer, or different modules.
- the feature extraction module 145 may process electronic data, such as electronic data obtained from stored electronic data 120 shown in FIG. 1A.
- the feature extraction module 145 may identify eligible patients according to one or more criteria (e.g., 50-80 years old, non-smoker, and no prior scan, biopsy procedure, or cancer diagnosis).
- the feature extraction module 145 may extract features from the electronic data of eligible patients that meet the one or more criteria.
- the feature extraction module 145 may assign weights to different features. For example, the feature extraction module 145 may assign higher weights to features from the electronic data 120 that were more recently recorded in comparison to features from the electronic data 120 that were earlier recorded in the electronic records.
- the feature extraction module 145 may provide the extracted features to the model deployment module 160, e.g., for inputting into a trained machine learning model.
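The recency-based weighting performed by the feature extraction module 145 can be illustrated with a short sketch. The text states only that more recently recorded data receives higher weights; the exponential decay, the one-year half-life, and the helper names `recency_weight` and `weight_features` are assumptions made for illustration, and the diagnostic codes are examples.

```python
from datetime import date

def recency_weight(recorded_on, reference_date, half_life_days=365.0):
    """Return a weight in (0, 1] that halves every half_life_days before reference_date."""
    age_days = max((reference_date - recorded_on).days, 0)
    return 0.5 ** (age_days / half_life_days)

def weight_features(observations, reference_date):
    """Accumulate recency-weighted counts per code.

    observations: iterable of (code, recorded_on) pairs.
    """
    weighted = {}
    for code, recorded_on in observations:
        weighted[code] = weighted.get(code, 0.0) + recency_weight(recorded_on, reference_date)
    return weighted

reference = date(2023, 1, 1)
observations = [
    ("J44.9", date(2022, 12, 1)),   # recently recorded diagnosis code
    ("J44.9", date(2015, 6, 1)),    # older record of the same code
    ("R06.02", date(2018, 3, 15)),  # older record of a different code
]
features = weight_features(observations, reference)
```

Any monotonically decreasing function of record age would satisfy the description equally well; the half-life simply makes the decay rate easy to tune.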
- the resource availability module 150 manages communications with one or more third parties to assess the availability of resources at the one or more third parties. Availability of resources at each third party may differ. For example, a third party may be a hospital or a physician’s office that provides care to different numbers of patients.
- the resource availability module 150 may communicate with one or more third parties to receive indications identifying the quantity of available medical resources each third party has available. Based on the indications, the resource availability module 150 can determine whether the number of candidate patients identified for a particular third party exceeds the medical resources available at that third party. As an example, the resource availability module 150 may receive, from a third party, an indication that identifies that the third party has the capacity to perform X image scans for candidate patients.
- the resource availability module 150 can ensure that the total number of candidate patients identified to the third party does not exceed the third party’s capacity of X image scans.
- the resource availability module 150 sets a threshold according to the indication from the third party that reflects the available medical resources at the third party. By modulating the threshold, the resource availability module 150 can control the number of candidate patients identified for a particular third party.
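One way the threshold could be modulated so that the number of candidate patients does not exceed a third party's capacity of X image scans is sketched below. The function name `select_threshold`, the tie-handling rule, and the sample scores are hypothetical; the source does not specify how the threshold is derived from the indication.

```python
def select_threshold(scores, capacity, floor=0.0):
    """Pick the lowest threshold that flags at most `capacity` patients.

    A patient is flagged when score >= threshold. If tied scores straddle the
    capacity boundary, the whole tied group is excluded (a conservative assumption).
    """
    if capacity <= 0:
        return float("inf")
    ranked = sorted(scores, reverse=True)
    if capacity >= len(ranked):
        return floor
    cutoff = ranked[capacity - 1]   # lowest score that would still fit
    next_score = ranked[capacity]   # first score that must be excluded
    if next_score == cutoff:
        higher = [s for s in ranked if s > cutoff]
        return min(higher) if higher else float("inf")
    return cutoff

scores = [0.91, 0.75, 0.62, 0.40, 0.18]
low_capacity_threshold = select_threshold(scores, capacity=2)
high_capacity_threshold = select_threshold(scores, capacity=4)
```

Consistent with the description, a third party reporting more available resources receives a lower threshold than one reporting fewer.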
- the model training module 155 may perform steps to train one or more machine learning models.
- the model training module 155 trains machine learning models such that the machine learning models can accurately separate candidate patients, who are likely at higher risk of developing cancer, from non-candidate patients, who are likely at lower risk of developing cancer.
- the model training module 155 trains machine learning models using training data that includes electronic data, such as EHR data and/or claims data.
- the model deployment module 160 may retrieve and deploy trained machine learning models to generate predictions for patients, the predictions being informative for determining whether a patient is categorized as a candidate patient or a non-candidate patient.
- the model deployment module 160 deploys a trained machine learning model that outputs a score that is informative for determining whether a patient is categorized as a candidate patient or a non-candidate patient.
- the trained machine learning model outputs relative contributions of feature groupings that contributed towards the score informative for determining whether a patient is categorized as a candidate patient or a non-candidate patient.
- the candidate patient identifier module 165 may identify a subset of patients as candidate patients using the predictions generated by trained machine learning models. In various embodiments, to determine whether a patient is to be categorized as a candidate patient, the candidate patient identifier module 165 compares a prediction generated by the machine learning model for the patient to a threshold score. Based on the comparison, the candidate patient identifier module 165 may classify the patient as a candidate patient or a non-candidate patient. In some embodiments, the candidate patient identifier module 165 communicates with one or more third parties by providing identification of candidate patients. For example, the identified candidate patients may represent a subset of patients that are under the care of a third party. Thus, by providing identification of the candidate patients to the third party, the third party may then prioritize its available medical resources by providing interventions to the candidate patients over non-candidate patients.
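The comparison performed by the candidate patient identifier module 165 can be sketched as follows. Treating a score greater than or equal to the threshold as a candidate, and returning candidates in descending order of risk, are assumptions of this sketch; the patient identifiers and scores are made up.

```python
def categorize_patients(risk_scores, threshold):
    """Split patients into candidates and non-candidates by comparing scores to a threshold.

    Assumption: score >= threshold marks a candidate patient.
    Candidates are returned highest-risk first so they can be prioritized.
    """
    candidates = [pid for pid, score in risk_scores.items() if score >= threshold]
    non_candidates = [pid for pid, score in risk_scores.items() if score < threshold]
    candidates.sort(key=lambda pid: risk_scores[pid], reverse=True)
    return candidates, non_candidates

risk_scores = {"patient-a": 0.82, "patient-b": 0.15, "patient-c": 0.64}
candidates, non_candidates = categorize_patients(risk_scores, threshold=0.6)
```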
- Embodiments described herein include methods for identifying candidate patients by applying one or more trained risk prediction models. Such methods can be performed by the patient prioritization system 130 described in FIG. 1B. Reference is now made to FIG. 2A, which depicts an example flow diagram for implementing a risk prediction model for identifying candidate patients.
- the patient prioritization system 130 receives or accesses the stored electronic data 120 (e.g., a temporally diverse dataset comprising electronic records of patients) that may be maintained by a third party.
- the patient prioritization system 130 analyzes the stored electronic data 120 to identify eligible patients, for example, by identifying those satisfying one or more criteria.
- the one or more criteria can include a particular age group (e.g., between 30-100 years old, between 40-90 years old, or between 50-80 years old), smoking related observation (e.g., smoking habit of the patient, such as a smoking pack year history), and a lack of a prior scan, prior biopsy, or prior cancer diagnosis.
- the criteria may include one or more of the United States Preventive Services Task Force (USPSTF) recommendations, such as 50-80 years old and a 20+ smoking pack year history.
- the criteria include each of age group (e.g., 50-80 years old), smoking related observation, and lack of each of a prior scan, prior biopsy, and prior cancer diagnosis.
- the patient prioritization system 130 identifies eligible patients by comparing the stored electronic data 120 of patients to the one or more criteria. If the electronic data 120 of a patient satisfies the criteria, the patient prioritization system 130 may identify the patient as an eligible patient and retain that patient’s electronic data 120 for subsequent analysis. If the electronic data of a patient does not satisfy the criteria, in some embodiments the patient prioritization system 130 identifies the patient as an ineligible patient. The electronic data 120 of the ineligible patient may not be retained or may be set aside and not included for subsequent analysis.
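The eligibility check described above can be expressed as a simple filter. The sketch assumes the USPSTF-style criteria mentioned in the text (50-80 years old, 20+ pack-year history) together with the absence of a prior scan, biopsy, or cancer diagnosis; the `PatientRecord` fields and sample records are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PatientRecord:
    patient_id: str
    age: int
    pack_years: float
    prior_scan: bool = False
    prior_biopsy: bool = False
    prior_cancer_diagnosis: bool = False

def is_eligible(record, min_age=50, max_age=80, min_pack_years=20.0):
    """Return True when the record satisfies every eligibility criterion."""
    return (
        min_age <= record.age <= max_age
        and record.pack_years >= min_pack_years
        and not (record.prior_scan or record.prior_biopsy or record.prior_cancer_diagnosis)
    )

records = [
    PatientRecord("p1", age=63, pack_years=30.0),
    PatientRecord("p2", age=45, pack_years=25.0),                   # too young
    PatientRecord("p3", age=70, pack_years=40.0, prior_scan=True),  # prior scan
    PatientRecord("p4", age=55, pack_years=5.0),                    # below pack-year cutoff
]
eligible_ids = [r.patient_id for r in records if is_eligible(r)]
```

Records that fail the check would be set aside and excluded from the subsequent analysis.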
- the patient prioritization system 130 organizes the electronic data 120 of the eligible patients prior to subsequent analysis.
- the patient prioritization system 130 may organize the electronic data of eligible patients into one or more scalar tables to facilitate the subsequent analysis (e.g., to facilitate the later extraction of features).
- different scalar tables can be generated for different types of electronic data.
- a scalar table can be generated for EHR data and a second scalar table can be generated for claims data.
- separate scalar tables can be generated for patient demographic data, and observations data (e.g., diagnoses, procedures, and/or prescription codes).
- four separate scalar tables are generated including 1) patient demographics table, 2) diagnosis table, 3) procedures table, and 4) prescriptions table.
- FIG. 2B depicts an example diagram for organizing the electronic data, in accordance with an embodiment.
- the top of FIG. 2B shows a unique patient demographic table that organizes the patient demographic data of patients.
- the patient demographic data can include the patient’s birth year, patient’s gender, patient’s ethnicity, patient’s race, patient’s division, and the like.
- the patient demographic table may only include demographic data of eligible patients.
- the bottom three tables shown in FIG. 2B include a diagnoses table, a procedures table, and a prescriptions table using for example, National Drug Code (NDC).
- the diagnoses table can include a time (e.g., estimated date) at which the data was recorded (shown as “Est Dt”).
- the procedures table shown in the bottom middle of FIG. 2B includes an identifier of the patient (e.g., “Patient ID”) and further includes codes for one or more procedures (shown as “Code Type” and “Proc Code”).
- Example code types include Current Procedural Terminology (CPT) or Healthcare Common Procedure Coding System (HCPCS) codes.
- Example CPT/HCPCS codes are described in further detail herein.
- the prescriptions (NDC) table shown in the bottom right of FIG. 2B includes an identifier of the patient (e.g., “Patient ID”) and further includes one or more prescriptions (shown as “NDC”).
- Example prescriptions are described in further detail herein.
- a subsequent analysis 220 is performed to analyze the electronic data of the patient and to determine whether the patient is to be categorized as a candidate patient or a non-candidate patient.
- the analysis 220 includes accessing the electronic data 255 of the patient, extracting features 260, and analyzing the extracted features using a risk prediction model 265 to generate a risk prediction 270.
- the risk prediction 270 is useful for determining whether the patient is to be categorized as a candidate patient or a non-candidate patient.
- the analysis 220 can be performed multiple times for multiple patients, and each performance of the analysis 220 is patient specific. For example, for Z eligible patients, the analysis 220 can be performed Z times to determine whether the Z eligible patients are to be categorized as candidate patients or non-candidate patients.
- a first step in the analysis 220 involves extracting features 260 from the electronic data 255 of a patient.
- this step may be performed by the feature extraction module 145 described in FIG. 1B.
- features include any of a patient demographic datapoint, a diagnosis code, a procedures code, a prescriptions code, or a combination thereof.
- a patient demographic datapoint can be a value representing any of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
- a value representing the patient age can, in various embodiments, be the patient age itself.
- a value for a patient ethnicity may be an integer value, where a particular integer value represents a particular patient ethnicity.
- a diagnosis code feature can be a diagnosis code, such as an ICD-9 or an ICD-10 diagnosis code (or information representative thereof).
- Example ICD-9 and ICD-10 codes are shown below in Tables 1 and 2.
- a procedures code feature can be a procedures code, such as a Current Procedural Terminology (CPT) or Healthcare Common Procedure Coding System (HCPCS) code (or information representative thereof).
- CPT and HCPCS codes are shown below in Tables 3 and 4.
- a prescriptions code feature can be a particular prescription. Example prescriptions are shown in Table 5.
- extracting the features 260 involves encoding the features that can then be analyzed by a machine learning model.
- encoding the features can involve encoding the features into an input vector that can be analyzed by a machine learning model.
- Any suitable way to encode categorical variables may be used to encode the features, for example, one-hot encoding, label/ordinal encoding, target encoding, feature hashing, binary encoding, or count encoding.
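- As one illustration of the encoding options listed above, the sketch below applies one-hot encoding per code and combines the results into a single input vector (sometimes called multi-hot encoding). The function names and the placeholder code strings are hypothetical and are not taken from Tables 1 to 5.

```python
from typing import Dict, Iterable, List


def build_vocabulary(code_lists: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign each distinct code a fixed column index (sorted for stability)."""
    codes = sorted({code for codes in code_lists for code in codes})
    return {code: index for index, code in enumerate(codes)}


def multi_hot_encode(patient_codes: Iterable[str], vocabulary: Dict[str, int]) -> List[float]:
    """Encode a patient's codes as a vector with 1.0 where a known code is present."""
    vector = [0.0] * len(vocabulary)
    for code in patient_codes:
        index = vocabulary.get(code)
        if index is not None:  # codes unseen when the vocabulary was built are ignored
            vector[index] = 1.0
    return vector


# Placeholder codes for illustration only (not real ICD, CPT/HCPCS, or NDC values).
vocabulary = build_vocabulary([["DX:A", "RX:C"], ["PROC:B"]])
input_vector = multi_hot_encode(["DX:A", "PROC:B"], vocabulary)
```

- Ignoring unseen codes is a design choice made for this sketch; feature hashing, also listed above, is one way to avoid maintaining a fixed vocabulary.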
- the feature extraction module 145 may differentially weight the extracted features before inputting them into the machine learning model. In some embodiments, differential weighting of the extracted features need not be performed. In some embodiments, the feature extraction module 145 differentially weights the extracted features according to the timepoints at which the data were recorded. For example, the feature extraction module 145 assigns higher weights to features from electronic data that were more recently recorded than to features from data that were recorded earlier in the electronic record. Because more recently recorded electronic data 120 may be more informative of the patient’s current risk for cancer than electronic data 120 that was recorded earlier, the more recently recorded electronic data is assigned a higher weight to reflect its increased informativeness. In some embodiments, weights may be assigned based on, but not limited to, dates, file sizes, medical professional names, and the like.
- differentially weighing the extracted features involves modifying the features according to different weight values. For example, if the features are encoded as an input vector, differentially weighing the features can involve modifying individual entries of the input vector by the different assigned weights to generate weighted features. Thus, values of features that are assigned higher weights can be increased relative to values of features that are assigned smaller weights.
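- The recency-based weighting of an input vector can be sketched as below. The exponential decay and the one-year half-life are assumptions chosen for illustration; the text above only requires that more recently recorded data receive higher weights.

```python
from datetime import date
from typing import List, Sequence


def recency_weight(recorded_on: date, as_of: date, half_life_days: float = 365.0) -> float:
    """Weight in (0, 1] that halves every `half_life_days` (assumed decay scheme)."""
    age_days = max((as_of - recorded_on).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weight_features(
    values: Sequence[float],
    recorded_dates: Sequence[date],
    as_of: date,
    half_life_days: float = 365.0,
) -> List[float]:
    """Multiply each entry of the input vector by the weight for its recording date."""
    if len(values) != len(recorded_dates):
        raise ValueError("each feature value needs a recording date")
    return [
        value * recency_weight(recorded, as_of, half_life_days)
        for value, recorded in zip(values, recorded_dates)
    ]


# Two identical feature values, one recorded a year before the reference date.
weighted = weight_features(
    [1.0, 1.0], [date(2022, 1, 1), date(2023, 1, 1)], as_of=date(2023, 1, 1)
)
```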
- the next step of the analysis 220 involves applying a risk prediction model 265 (e.g., by the model deployment module 160 shown in FIG. 1B) to analyze the features.
- a risk prediction model analyzes the features and generates a risk prediction 270, which may be informative for determining whether the patient is to be categorized as a candidate patient or non-candidate patient.
- the risk prediction 270 can be represented by a value.
- the risk prediction 270 is a binary value (e.g., 0 or 1, where 0 indicates unlikely to develop cancer in a certain time period and 1 indicates likely to develop cancer in a certain time period).
- the risk prediction 270 is represented by a score, such as a continuous value (e.g., between 0 and 1, where a value closer to 1 indicates higher likelihood of developing cancer).
- the risk prediction model 265 is a regression model (e.g., a logistic regression or linear regression model) that calculates a risk prediction 270 by combining a set of trained parameters with the extracted features.
- the risk prediction model can be a neural network model that calculates a risk prediction 270 by combining a set of trained parameters associated with nodes and layers of the neural network with values of the extracted features.
- the risk prediction model 265 can be a random forest model that calculates a risk prediction 270 by combining a set of trained parameters associated with decision tree nodes with values of the extracted features.
- the risk prediction model 265 can be a gradient boosted machine model that calculates a risk prediction 270 by combining a set of trained parameters associated with decision tree nodes with values of the extracted features.
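- Of the model types listed above, the logistic regression case is the simplest to write out. The sketch below combines trained coefficients with extracted feature values to produce a continuous score between 0 and 1; the function name and the example coefficient values are hypothetical.

```python
import math
from typing import Sequence


def logistic_risk(
    features: Sequence[float], coefficients: Sequence[float], intercept: float = 0.0
) -> float:
    """Combine trained parameters with feature values into a score in (0, 1)."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    linear_term = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear_term))


# Hypothetical coefficients; in practice they would come from training.
score = logistic_risk(features=[1.0, 0.0, 1.0], coefficients=[0.8, 0.3, -0.2], intercept=-1.0)
```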
- the risk prediction model 265 analyzes feature groupings, where a feature grouping represents 2 or more extracted features. Extracted features in a feature grouping may be related. For example, extracted features of a feature grouping can be related according to an anatomical organ or according to the patient, examples of which include patient behavior, patient characteristics, smoking status, and vaccination status. Exemplary feature groupings are described herein and further shown below in Table 6.
- the risk prediction model 265 may combine individual features into respective feature groupings, and then analyzes the feature groupings to determine the risk prediction 270. In various embodiments, the risk prediction model 265 analyzes both individual features, as well as feature groupings in generating the risk prediction 270.
- the risk prediction model 265 analyzes only feature groupings in generating the risk prediction 270.
- the risk prediction model 265 further performs an analysis to determine the relative contributions of feature groupings that resulted in the risk prediction 270. The relative contributions of feature groupings are additionally referenced herein as subscores. In various embodiments, the risk prediction model 265 determines a relative contribution of a feature or feature grouping by constructing and calculating outputs across various scenarios, a subset of scenarios including the extracted feature or feature grouping and another subset of scenarios excluding the extracted feature or feature grouping. Thus, by analyzing the changes of the outputs across the various scenarios, the relative contribution of the extracted feature or feature grouping can be deduced.
- Grouping features into feature groupings and determining relative contributions per feature grouping may use fewer resources than determining relative contributions using ungrouped features.
- the risk prediction model 265 groups features into any number of feature groupings such that the risk prediction model 265 only needs to determine a reduced number of relative contributions.
- the risk prediction model 265 performs a Shapley Additive Explanation (SHAP) analysis to determine SHAP values for feature groupings.
- the SHAP analysis takes into account all different combinations of input variables with different subsets of the predictor vector as contributing to the output prediction.
- the risk prediction model 265 performs a Kernel SHAP analysis that calculates contributions of feature groupings across fewer scenarios.
- Kernel SHAP uses a weighted linear regression, where the coefficients of the linear regression represent the contributions of the feature groupings.
- the various scenarios may be used to determine weights of the linear regression to determine the contributions of the feature groupings.
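- The scenario-based attribution described above can be illustrated with exact Shapley values computed over feature groupings, which is tractable when the number of groupings is small. This sketch is not Kernel SHAP, which approximates the same quantities with a weighted linear regression over fewer scenarios. The baseline vector used for excluded groupings, the function name, and the toy model are assumptions.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, List, Sequence


def group_shapley(
    model: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    groups: Dict[str, List[int]],
) -> Dict[str, float]:
    """Exact Shapley contribution of each feature grouping to model(x)."""
    names = list(groups)
    n = len(names)

    def value(included) -> float:
        # Scenario: included groupings take the patient's values,
        # excluded groupings take the (assumed) baseline values.
        scenario = list(baseline)
        for name in included:
            for index in groups[name]:
                scenario[index] = x[index]
        return model(scenario)

    contributions = {}
    for name in names:
        others = [g for g in names if g != name]
        total = 0.0
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                # Output change when this grouping is added to the scenario.
                total += weight * (value(subset + (name,)) - value(subset))
        contributions[name] = total
    return contributions


# Toy linear model standing in for a trained risk prediction model.
def toy_model(v: List[float]) -> float:
    return 0.1 + 0.2 * v[0] + 0.3 * v[1] + 0.4 * v[2]


subscores = group_shapley(
    toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0],
    groups={"smoking": [0], "lung": [1, 2]},
)
```

- With G groupings this evaluates on the order of 2^G scenarios, which is why grouping features reduces the work compared with attributing each feature individually.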
- other machine learning models and/or algorithms, such as supervised, unsupervised, or other machine learning models, may be used to determine contributions of feature groupings, without limitation.
- the risk prediction 270 may be calculated based on the combination of contributions from feature groupings.
- the risk prediction 270 may, in various embodiments, be a summation of the contributions from individual feature groupings.
- the candidate patient identifier module 165 can compare the risk prediction 270 to a threshold value to determine whether the patient is a candidate patient or a non-candidate patient.
- the threshold value may be a fixed score. For example, if the risk prediction 270 is above the threshold score, the patient is classified into one category (e.g., a candidate patient). Alternatively, if the risk prediction 270 is below the threshold score, the patient is classified into a different category (e.g., non-candidate patient).
- the threshold score is set according to a quantity of available medical resources at a third party.
- each threshold score may represent a custom threshold score that is personalized for each third party.
- a third party may provide an indication that reflects the available medical resources of the third party.
- the candidate patient identifier module 165 may set a threshold score according to the indication that reflects the available medical resources. For example, for a third party that is severely limited on resources, the third party sends an indication reflecting those limited resources.
- the candidate patient identifier module 165 may set a high threshold value (e.g., at least 0.6, at least 0.7, or at least 0.8) such that fewer patients have scores above the threshold value and are identified as candidate patients.
- the candidate patient identifier module 165 may set a lower threshold value (e.g., at most 0.4, at most 0.3, or at most 0.2) such that more patients have scores above the threshold value and are identified as candidate patients. Threshold values may be set based on patient categories, classifications, demographics, diagnosis, and the like.
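- The effect of a higher or lower threshold value can be sketched as follows. The function name and the example scores are hypothetical, and treating a score exactly equal to the threshold as non-candidate is an assumption, since the text above addresses only scores above or below the threshold.

```python
from typing import Dict


def categorize(risk_scores: Dict[str, float], threshold: float) -> Dict[str, str]:
    """Label each patient by comparing the risk prediction to the threshold."""
    return {
        patient_id: "candidate" if score > threshold else "non-candidate"
        for patient_id, score in risk_scores.items()
    }


scores = {"p1": 0.85, "p2": 0.55, "p3": 0.25}  # hypothetical risk predictions

# A third party with severely limited resources: high threshold, fewer candidates.
limited = categorize(scores, threshold=0.7)

# A third party with more resources: lower threshold, more candidates.
ample = categorize(scores, threshold=0.3)
```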
- the patient prioritization system 130 provides the identification of the candidate patients 140 to a third party (e.g., a third party managing the care for the candidate patients).
- the third party can appropriately prioritize its available medical resources to provide interventions to the candidate patients.
- the patient prioritization system 130 can additionally provide, to the third party, identification of one or more feature groupings that contributed to the categorization of patients as candidate patients.
- the patient prioritization system 130 can provide identification of feature groupings on a per-patient basis. For example, for each candidate patient, the patient prioritization system 130 provides identification of the specific feature groupings that contributed to the categorization of the patient as a candidate patient.
- the patient prioritization system 130 ranks the features or feature groupings that contributed to the categorization of a patient as a candidate patient. The patient prioritization system 130 can provide the top-ranked feature or top-ranked feature grouping.
- the patient prioritization system 130 can provide one or more features or feature groupings in accordance with their rank. For example, the patient prioritization system 130 provides the top 3 features or feature groupings that contributed to the categorization of a patient as a candidate patient. For another example, the patient prioritization system 130 provides the third-ranked feature or feature grouping that contributed to the categorization of a patient as a candidate patient.
- the third party can use the identification of features or feature groupings to select and provide care to a candidate patient. For example, if the feature grouping that most heavily contributes to the patient being categorized as a candidate patient is the smoking behavior of the patient, the third party can appropriately counsel the candidate patient regarding the smoking behavior (e.g., counsel to reduce smoking or terminate smoking).
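- Ranking feature groupings by their contribution, and reporting either the top few or a specific rank, can be sketched as below. The function names and the contribution values are hypothetical.

```python
from typing import Dict, List


def rank_feature_groupings(contributions: Dict[str, float]) -> List[str]:
    """Order feature groupings from largest to smallest contribution."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked]


def top_feature_groupings(contributions: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k highest-ranked feature groupings for one patient."""
    return rank_feature_groupings(contributions)[:k]


def nth_ranked_grouping(contributions: Dict[str, float], n: int) -> str:
    """Return the grouping at 1-based rank n (e.g., n=3 for the third-ranked)."""
    return rank_feature_groupings(contributions)[n - 1]


# Hypothetical per-patient contributions (subscores).
patient_subscores = {"smoking": 0.31, "lung": 0.12, "heart": 0.05, "vaccine": -0.02}
top_three = top_feature_groupings(patient_subscores, k=3)
```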
- a machine learning model is structured such that it analyzes features extracted from electronic data, such as features extracted from EHR data and/or features extracted from claims data, and generates a prediction informative for classifying a patient as a candidate or non-candidate patient.
- the risk prediction model can use any suitable machine learning model, such as a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, gradient boosting (e.g., a XGBoost gradient boosting model or a CatBoost gradient boosting model), support vector machine, Naive Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, or any combination thereof).
- the risk prediction model can be trained using a machine learning implemented method, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof.
- the risk prediction model is trained using supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms (e.g., partial supervision), weak supervision, transfer, multi-task learning, or any combination thereof.
- the risk prediction model has one or more parameters, such as hyperparameters or model parameters.
- Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function.
- Model parameters are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of neural network, support vectors in a support vector machine, node values in a decision tree, and coefficients in a regression model.
- the model parameters of the risk prediction model are trained (e.g., adjusted) using the training data to improve the predictive capacity of the risk prediction model.
- the model training module 155 trains the risk prediction model using training data.
- the training data includes extracted features from electronic data (e.g., EHR data and/or claims data) obtained from training individuals.
- a training individual may be an individual known to be at risk or not be at risk for cancer.
- a training individual may be an individual known to not be at risk for cancer if the training individual is not subsequently diagnosed with cancer.
- such a training individual known to not be at risk for cancer may be an individual who later underwent an intervention (e.g., CT/PET imaging and/or biopsy) and was determined to not have cancer.
- a training individual may be an individual known to be at risk for cancer if the training individual is subsequently diagnosed with cancer.
- a training individual known to be at risk for cancer may be an individual who later underwent an intervention (e.g., CT/PET imaging and/or biopsy) and was determined to have cancer.
- a training individual may be an individual known to be at risk for cancer if the training individual is subsequently diagnosed with cancer at a timepoint at least A months in the future.
- the at least A months in the future is sufficiently distant in the future such that when the electronic records were obtained from the training individual, the cancer in the training individual represented an early-stage cancer.
- A may be at least 1 month. In some embodiments, A may be between about 1 month to about 55 months.
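- Assigning reference labels with the at-least-A-months rule can be sketched as follows. The function names are hypothetical, and excluding individuals diagnosed sooner than A months (returning `None`) is an assumption; the text above does not state how such individuals are handled.

```python
from datetime import date
from typing import Optional


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


def label_training_individual(
    records_obtained_on: date, diagnosed_on: Optional[date], min_months_a: int = 1
) -> Optional[int]:
    """1 = diagnosed at least A months after the records, 0 = never diagnosed.

    Returns None for a diagnosis sooner than A months (assumed to be excluded).
    """
    if diagnosed_on is None:
        return 0
    if months_between(records_obtained_on, diagnosed_on) >= min_months_a:
        return 1
    return None


label = label_training_individual(date(2020, 1, 15), date(2021, 1, 15), min_months_a=12)
```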
- the training data can be obtained from a split of a dataset.
- the dataset can undergo a 50:50 training:testing dataset split.
- the dataset can undergo a 60:40 training:testing dataset split.
- the dataset can undergo an 80:20 training:testing dataset split.
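- The dataset splits listed above can be sketched with a single helper. The function name, the shuffling step, and the fixed seed are assumptions made so that the example is reproducible.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Iterable[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items and split them into training and testing subsets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


individuals = list(range(10))  # stand-ins for training individuals
train_80, test_20 = split_dataset(individuals, train_fraction=0.8)
train_60, test_40 = split_dataset(individuals, train_fraction=0.6)
train_50, test_50 = split_dataset(individuals, train_fraction=0.5)
```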
- the training data used for training the imputation model includes reference ground truths that indicate that a training individual was subsequently diagnosed with cancer (hereafter also referred to as “positive” or “+”) or whether the training individual was not subsequently diagnosed with cancer (hereafter also referred to as
- the reference ground truths in the training data are binary values, such as “1” or “0.”
- a training individual that was subsequently diagnosed with cancer can be identified in the training data with a value of “1” whereas a training individual who was not diagnosed with cancer can be identified in the training data with a value of “0.”
- the model training module 155 trains the risk prediction model using the training data to minimize a loss function such that the risk prediction model can better generate a prediction (e.g., a score informative for determining whether the patient is a candidate or non-candidate patient) based on the input (e.g., extracted features of the electronic data).
- the loss function is constructed for any of
- risk prediction models disclosed herein achieve a performance metric.
- Example performance metrics include an area under the curve (AUC) of a receiver operating curve, a positive predictive value, and/or a negative predictive value.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.5.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.6.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.7.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.8.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.9.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.95.
- risk prediction models disclosed herein exhibit an AUC value of at least 0.99. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.51. In some embodiments, AUC values may be between about 0.51 to about 0.99, without limitation.
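- For reference, the AUC can be computed directly from reference ground truths and model scores as the fraction of positive/negative pairs in which the positive individual receives the higher score, with ties counted as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve; the function name and the example values are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truths ("1" = subsequently diagnosed) and risk predictions.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

- An AUC of 0.5 corresponds to scores that do not separate the two groups at all, which is why the lowest value listed above is only a floor.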
- risk prediction models disclosed herein achieve an odds ratio, which refers to the relative risk of the higher risk population (e.g., candidate patients) compared to the standard risk. In various embodiments, risk prediction models disclosed herein achieve an odds ratio of at least 1.1 to about 3.0, without limitation.
- example methods for prioritizing medical resources for screening cancer patients involve analysis of features from electronic records using a trained machine learning model.
- the features of the electronic records are weighted according to timepoints that data were received. For example, more recently recorded data in the electronic records can be assigned a higher weight than earlier recorded data in the electronic records. More recently recorded data may be more reflective of a current state of the patient and therefore, a higher weight of the more recently recorded data enables the machine learning model to appropriately account for the timing of the recordation. Altogether, this enables the machine learning model to predict candidate patients more accurately.
- Step 310 involves obtaining a temporally diverse dataset of electronic records.
- the temporally diverse dataset includes information of patients that are recorded at various timepoints.
- the temporally diverse dataset may include EHR and/or claims data obtained from the patient during a first hospital visit, and may further include EHR and/or claims data obtained from the same patient during a second hospital visit.
- Step 315 involves an overall step of categorizing a patient as a candidate patient or a non-candidate patient. As shown in FIG. 3A, step 315 includes step 320 and step 330. Step 315 can be performed multiple times across different patients to determine whether each of the patients is a candidate patient or a non-candidate patient.
- Step 320 involves weighting features from data of the electronic records according to timepoints that the data of the electronic records were recorded. Specifically, features from data more recently recorded in the electronic records are more heavily weighted than features from data that were earlier recorded in the electronic records.
- Step 330 involves analyzing the weighted features using a trained machine learning model.
- the trained machine learning model outputs a prediction for categorizing the patient as a candidate patient or a non-candidate patient.
- Step 335 involves providing identification of the candidate patients.
- step 335 involves providing identification of the candidate patients to a third party that is managing the care of the patient (e.g., a hospital or physician’s office).
- the flow process can restart again at step 310.
- a new temporally diverse dataset of electronic records can be obtained.
- the temporally diverse dataset of electronic records may include data that was newly recorded since a prior version of the temporally diverse dataset was obtained.
- example methods for prioritizing medical resources for screening cancer patients involve analyzing groupings of features from electronic records using a trained machine learning model.
- these feature groupings can be analyzed to determine how much each feature grouping contributed to the prediction of the machine learning model (e.g., a prediction informative of a candidate patient or a non-candidate patient).
- Step 340 involves obtaining a dataset comprising electronic records.
- the electronic records may include one or both of EHR data and claims data for one or more patients.
- Step 345 involves an overall step of categorizing a patient as a candidate patient or a non-candidate patient. As shown in FIG. 3B, step 345 includes step 350 and step 360. Step 345 can be performed multiple times across different patients to determine whether each of the patients is a candidate patient or a non-candidate patient.
- Step 350 involves extracting features from data of the electronic records.
- Example features from EHR data and/or claims data are described herein.
- Step 360 involves analyzing features using a trained machine learning model.
- the machine learning model can output a score indicative of cancer risk for the patient.
- the score indicative of cancer risk is determinative of whether the patient is a candidate patient or a non-candidate patient.
- the machine learning model can further identify a feature grouping that contributed to the score indicative of cancer risk.
- the identified feature grouping may be a feature grouping that most heavily contributed to the score indicative of cancer risk.
- Step 365 involves providing identification of the candidate patients and the corresponding identifications of feature groupings.
- step 365 involves providing identification of the candidate patients and the corresponding identifications of feature groupings to a third party that is managing the care of the patient (e.g., a hospital or physician’s office).
- the flow process can restart again at step 340.
- a new dataset of electronic records can be obtained.
- the new dataset of electronic records may include data that was newly recorded since a prior version of the dataset was obtained.
- example methods for prioritizing medical resources for screening cancer patients involve categorizing patients according to an indication reflecting available medical resources of a third party.
- the third party may manage the care of various patients and may have limited medical resources.
- the third party may need to prioritize the medical resources for a subset of the patients (e.g., candidate patients).
- An example third party may be a hospital or physician’s office that cares for the patients and/or stores electronic records (e.g., EHR data and/or claims data) related to the patients.
- FIG. 3C depicts an example interaction diagram for identifying candidate patients, in accordance with a third embodiment.
- the interaction diagram shows an example patient prioritization system 130 and a third party 370.
- the third party 370 stores electronic data, such as electronic data for one or more patients.
- the patient prioritization system 130 receives the electronic data of patients.
- the patient prioritization system 130 receives an indication reflecting the available medical resources of the third party 370.
- the patient prioritization system 130 extracts features from data of the electronic records (e.g., EHR data and/or claims data). Example features of EHR data and/or claims data are further described herein.
- the patient prioritization system 130 categorizes patients using the indication received from the third party 370.
- step 382 involves analyzing features using a trained machine learning model to generate a prediction of whether patients are to be categorized as candidate patients or non-candidate patients.
- the patient prioritization system 130 establishes a threshold score using the indication received from the third party 370 such that the machine learning model uses the threshold score to categorize patients as candidate patients or non-candidate patients.
- the patient prioritization system 130 sets a threshold score to meet the available medical resources for the third party 370.
- the patient prioritization system 130 can provide identification of a tailored set of candidate patients to the third party 370.
- the third party 370 can provide an intervention to the candidate patients, while withholding the intervention for non-candidate patients.
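- One way a threshold score could be set to meet the available medical resources is sketched below. It assumes the indication from the third party is a count of available intervention slots, which the text above does not specify; the function names, the at-or-above comparison, and the tie handling are likewise assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def threshold_for_capacity(scores: Sequence[float], capacity: int) -> float:
    """Lowest observed score such that at most `capacity` patients score at or above it."""
    ranked = sorted(scores, reverse=True)
    if capacity <= 0 or not ranked:
        return float("inf")  # no slots (or no patients): nobody is a candidate
    if capacity >= len(ranked):
        return ranked[-1]  # enough slots for every patient
    threshold = ranked[capacity - 1]
    if ranked[capacity] == threshold:
        # A tie at the cut-off would exceed capacity, so move up to the next higher score.
        higher = [s for s in ranked if s > threshold]
        return min(higher) if higher else float("inf")
    return threshold


def select_candidates(risk_scores: Dict[str, float], capacity: int) -> Tuple[List[str], float]:
    """Return the tailored set of candidate patient IDs and the threshold used."""
    threshold = threshold_for_capacity(list(risk_scores.values()), capacity)
    candidates = sorted(pid for pid, s in risk_scores.items() if s >= threshold)
    return candidates, threshold


scores = {"p1": 0.9, "p2": 0.7, "p3": 0.7, "p4": 0.2}  # hypothetical risk scores
candidates, threshold = select_candidates(scores, capacity=3)
```

- Moving the threshold up on ties keeps the candidate count within capacity at the cost of leaving some slots unused; a deployment could instead break ties by another rule.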
- Electronic data generally refers to data gathered from patients that are stored in electronic form.
- Exemplary electronic data include electronic health record (EHR) data and claims data of patients.
- electronic data further includes timepoints for which the EHR data and/or claims data of patients were recorded.
- EHR data represents readily available medical information that may have been previously obtained from patients (e.g., obtained over one or more patient visits to a hospital or physician’s office).
- EHR data represents an electronic version of a patient’s medical history.
- EHR data comprises one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data.
- EHR data comprises each of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
- claims data represents administrative data collected from patients, examples of which include information from doctor’s appointments, bills, and insurance information.
- claims data comprise one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
- claims data comprise each of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
- Patient demographics data of the EHR data and/or the claims data can refer to background characteristics of the patient.
- Example patient demographics data include patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active (e.g., number of months for which EHR data was stored for a patient), and/or insurance status.
- patient demographics data includes patient behavior. Examples of patient behavior can include number of prior hospitalizations, number of prior physician visits, number of emergency room visits, and number of unique providers.
- the patient behavior includes the smoking behavior of a patient. The smoking behavior of a patient can be identified as one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoker.
- Prior diagnoses data of the EHR data and/or the claims data can refer to a number of prior diagnoses and/or identifications of prior diagnoses for the patient.
- diagnoses data of the EHR data can include diagnosis codes, which correspond to diagnoses of lung-related or non-lung related issues.
- Example diagnosis codes for diagnosing lung-related issues are shown below in Table 1.
- example diagnosis codes for diagnosing non-lung related issues are shown below in Table 2.
- Prior procedures data of the EHR data and/or the claims data can refer to a number of prior procedures and/or identifications of prior procedures for the patient.
- Such procedures data can include procedure codes, which correspond to performed procedures for lung-related or non-lung related issues.
- Example procedure codes for lung-related procedures are shown below in Table 3.
- example procedure codes for non-lung related procedures are shown below in Table 4.
- Prior prescriptions data of the EHR data and/or the claims data can refer to a number of prior prescriptions and/or identifications of prior prescriptions that were provided to the patient.
- Example prior prescriptions can include prescriptions for treating a lung-related condition or a non-lung related condition.
- Example prior prescriptions are shown below in Table 5. The right-most column of Table 5 shows the target body area of the drug, including lung and non-lung (e.g., blood, digestion, heart, mental, allergies, skin, reproductive, hormone, smoke, vaccine, and general) conditions.
- EHR data and claims data may include overlapping information of patients.
- both the EHR data and claims data can include patient demographics data for patients.
- both the EHR data and claims data can include prior diagnoses data for patients.
- both the EHR data and claims data can include prior procedures data for patients.
- both the EHR data and claims data can include prior prescriptions data for patients. Overlapping patient data between the EHR data and claims data can be useful for verifying patient data, as the overlapping patient data would represent more reliable patient information.
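The verification idea described above can be sketched as a field-by-field comparison of the two sources. This is only a minimal illustration: the function name `cross_check`, the field names, and the sample values are hypothetical and do not come from the disclosure.

```python
from typing import Any, Dict


def cross_check(ehr: Dict[str, Any], claims: Dict[str, Any]) -> Dict[str, str]:
    """Label each field by how the two sources relate (hypothetical labels)."""
    status: Dict[str, str] = {}
    for field in sorted(set(ehr) | set(claims)):
        if field in ehr and field in claims:
            # Present in both sources: agreement suggests a more reliable value.
            status[field] = "verified" if ehr[field] == claims[field] else "conflict"
        elif field in ehr:
            status[field] = "ehr_only"
        else:
            status[field] = "claims_only"
    return status


# Hypothetical records for a single patient.
ehr_record = {"age": 63, "sex": "F", "language": "English", "fev1_l": 2.1}
claims_record = {"age": 63, "sex": "F", "insurance_status": "insured"}
report = cross_check(ehr_record, claims_record)
```

Fields marked `ehr_only` or `claims_only` correspond to the case where one dataset supplements the other rather than verifying it.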
- EHR data may include additional patient information that is not available in the claims data, and vice versa.
- EHR data can further include additional demographics information of additional specificity that may not be available in the claims data.
- additional demographics data can include living situation (e.g., single or living alone, married, or living together) as well as language (e.g., primary spoken language).
- EHR data can further include laboratory test data that may not be available in claims data.
- EHR data can include measurements of characteristics or quantitative values for one or more biomarkers determined for the patient.
- Example laboratory test data can include values for alanine aminotransferase (ALT), body mass index, cholesterol, creatinine, forced expiratory volume (FEV-1), FEV-1/FVC ratio, glucose, high-density lipoprotein (HDL), international normalized ratio (INR), potassium, low density lipoprotein (LDL), mean corpuscular hemoglobin concentration (MCHC-M), platelets, red cell distribution width (RDW), triglycerides, and white blood cells (WBC).
- each dataset can be used to supplement the other dataset.
- using both EHR data and claims data enables more accurate prediction and identification of candidate patients than using either dataset alone.
- Methods disclosed herein further involve analyzing two or more extracted features e.g., in a feature grouping to determine whether a patient is to be categorized as a candidate patient or a non-candidate patient.
- analyzing a feature grouping comprising two or more extracted features e.g., analyzing using a trained machine learning model
- methods disclosed herein can involve determining a contribution of the feature grouping that resulted in the categorization of the patient as a candidate patient or non-candidate patient.
- a “feature grouping” refers to one or more extracted features.
- a feature grouping refers to 2 or more extracted features.
- a feature grouping may refer to about 2 extracted features to about 30 extracted features, without limitation.
- a feature grouping may refer to more than 30 extracted features, in some embodiments.
- extracted features of a feature grouping can be related according to an anatomical organ, such as any one of a brain, heart, blood, thorax, eyes, lung, abdomen, colon, cervix, pancreas, kidney, liver, muscle, lymph nodes, oral cavity, pharynx, larynx, esophagus, intestine, spleen, stomach, and gall bladder.
- extracted features of a feature grouping can be related to the patient, examples of which include patient behavior, patient characteristics, smoking status, and vaccination status.
- Example feature groupings can include Lung, Heart, Preventative Care, Blood, Digestion, Tobacco Use/Smoking, Mental Health, Reproductive, Oral Cavity/Pharynx/Larynx, Pain/Pain Management, Health Measures/Benchmarks, and Vision.
- the cancer in the patient can include one or more of: lymphoma, B cell lymphoma, T cell lymphoma, mycosis fungoides, Hodgkin's Disease, myeloid leukemia, bladder cancer, brain cancer, nervous system cancer, head and neck cancer, squamous cell carcinoma of head and neck, kidney cancer, lung cancer, neuroblastoma/glioblastoma, ovarian cancer, pancreatic cancer, prostate cancer, skin cancer, liver cancer, melanoma, squamous cell carcinomas of the mouth, throat, larynx, and lung, colon cancer, cervical cancer, cervical carcinoma, breast cancer, and epithelial cancer, renal cancer, genitourinary cancer, pulmonary cancer, esophageal carcinoma, stomach cancer, thyroid cancer, head and neck carcinoma, large bowel cancer, hematopoietic cancer, testi
- the cancer in the patient can be a metastatic cancer, including any one of bladder cancer, breast cancer, colon cancer, kidney cancer, lung cancer, melanoma, ovarian cancer, pancreatic cancer, prostatic cancer, rectal cancer, stomach cancer, thyroid cancer, or uterine cancer.
- the cancer is a lung cancer.
- the cancer is a type of lung cancer, including any one of small cell lung cancer, non-small cell lung cancer, non-small cell carcinoma, adenocarcinoma, squamous cell cancer, large cell carcinoma, small cell carcinoma, combined small cell carcinoma, neuroendocrine tumor, lung sarcoma, lung lymphoma, bronchial carcinoids.
- the cancer is an early-stage cancer.
- the early-stage cancer is a stage I cancer.
- the early-stage cancer is a stage II cancer.
- the early-stage cancer is an early-stage lung cancer.
- the early-stage lung cancer refers to a stage prior to the development of nodules, such as lung nodules or lymph node nodules.
- the early-stage lung cancer may not yet have been previously diagnosed or identified (e.g., via biopsy or imaging). Thus, methods disclosed herein can be useful for prioritizing patients that would most benefit from subsequent analysis (e.g., via biopsy or imaging).
- Embodiments described herein involve prioritizing medical resources by identifying candidate patients likely to be at risk of a cancer using risk prediction models.
- the methods disclosed herein are performed on patients who have not previously received any of the following: an image scan (e.g., any of a LDCT/Chest-CT/PET/PET-CT scan), a lung cancer biopsy procedure, or a lung cancer diagnosis.
- the intervention can be any one of: application of a diagnostic, application of a prophylactic therapeutic agent, or a subsequent action.
- Example subsequent actions can include a subsequent testing of the patient to confirm whether the patient develops cancer.
- Subsequent testing can include any of a subsequent biopsy (e.g., cancer biopsy or lymph node biopsy) or subsequent image scanning (e.g., CT scanning, PET scanning, MRI scanning, ultrasound imaging, or X-ray imaging).
- the subsequent testing includes performing a CT or PET image scanning. The CT or PET image scanning can then be used to confirm the risk of cancer in the patient.
- the subsequent testing includes performing a chest CT or PET image scanning.
- subsequent testing of the patient can occur at a next scheduled visit or after a pre-determined amount of time such as, but not limited to, about 1 month to about 24 months after predicting the future risk of cancer. In some embodiments, a pre-determined amount of time may be less than 1 month or greater than 24 months.
- additional subsequent actions can include subsequent actions to treat a cancer that has developed in the patient, such as tumor resection, bronchoscopic diagnosis, selection and/or administration of therapeutic(s), selection/administration of pharmaceutical composition, or any combination thereof.
- a therapeutic agent can be selected and/or administered to the patient based on the predicted future risk of cancer.
- the selected therapeutic agent is likely to delay or prevent the development of the cancer, such as lung cancer.
- exemplary therapeutic agents include chemotherapies, energy therapies (e.g., external beam, microwave, radiofrequency ablation, brachytherapy, electroporation, cryoablation, photothermal ablation, laser therapy, photodynamic therapy, electrocauterization, chemoembolization, high intensity focused ultrasound, low intensity focused ultrasound), antigen-specific monoclonal antibodies, anti-inflammatories, oncolytic viral therapies, or immunotherapies.
- the selected therapeutic agent is an energy therapy and the amount (e.g., dose and duration) of the energy applied can be tailored to achieve a desired therapeutic effect.
- the therapeutic agent is a small molecule or biologic, e.g., a cytokine, antibody, soluble cytokine receptor, anti-sense oligonucleotide, siRNA, etc.
- biologic agents encompass muteins and derivatives of the biological agent, which derivatives can include, for example, fusion proteins, PEGylated derivatives, cholesterol conjugated derivatives, and the like as known in the art.
- antagonists of cytokines and cytokine receptors e.g., traps and monoclonal antagonists.
- Therapeutic agents for lung cancer can include chemotherapeutics such as docetaxel, cisplatin, carboplatin, gemcitabine, Nab-paclitaxel, paclitaxel, pemetrexed, gefitinib, erlotinib, brigatinib (Alunbrig®), capmatinib (Tabrecta®), selpercatinib (Retevmo®), entrectinib (Rozlytrek®), lorlatinib (Lorbrena®), larotrectinib (Vitrakvi®), dacomitinib (Vizimpro®), and vinorelbine.
- Therapeutic agents for lung cancer can include antibody therapies such as durvalumab (Imfinzi®), nivolumab (Opdivo®), pembrolizumab (Keytruda®), atezolizumab (Tecentriq®), canakinumab, and ramucirumab.
- one or more of the therapeutic agents described can be combined as a combination therapy for treating the patient.
- a pharmaceutical composition can be selected and/or administered to the patient based on the patient level risk of metastatic cancer, the selected therapeutic agent being likely to exhibit efficacy against the cancer.
- a pharmaceutical composition administered to an individual includes an active agent such as the therapeutic agent described above.
- the active ingredient is present in a therapeutically effective amount, i.e., an amount sufficient when administered to treat a disease or medical condition mediated thereby.
- the compositions can also include various other agents to enhance delivery and efficacy, e.g., to enhance delivery and stability of the active ingredients.
- the compositions can also include, depending on the formulation desired, pharmaceutically acceptable, nontoxic carriers or diluents, which are defined as vehicles commonly used to formulate pharmaceutical compositions for animal or human administration.
- the diluent may be selected so as not to affect the biological activity of the combination.
- examples of such diluents are distilled water, buffered water, physiological saline, PBS, Ringer’s solution, dextrose solution, and Hank’s solution.
- the pharmaceutical composition or formulation can include other carriers, adjuvants, or non-toxic, nontherapeutic, nonimmunogenic stabilizers, excipients and the like.
- the compositions can also include additional substances to approximate physiological conditions, such as pH adjusting and buffering agents, toxicity adjusting agents, wetting agents, and detergents.
- the composition can also include any of a variety of stabilizing agents, such as an antioxidant.
- compositions or therapeutic agents described herein can be administered in numerous ways. Examples include administering a composition containing a pharmaceutically acceptable carrier via oral, intranasal, intramodular, intralesional, rectal, topical, intraperitoneal, intravenous, intramuscular, subcutaneous, subdermal, transdermal, intrathecal, endobronchial, transthoracic, or intracranial method.
- a clinical response can be provided to the patient based on the predicted future risk of cancer generated for the patient by implementing risk prediction models.
- a clinical response can include providing counseling to modify a behavior of the patient (e.g., counsel the patient about smoking cessation to reduce risk), initiating an inhaled/topical, intravenous, or enteral (by mouth) therapeutic that could delay/prevent malignant transformation, slow tumor growth, or even prevent spread of disease (metastasis), establishing an adaptive screening schedule for future risk similar to what is done with colonoscopy for polyps (e.g., individuals predicted to be higher risk for future lung cancer should have more frequent follow-up and imaging), or performing or scheduling to be performed an additional risk prediction test to confirm the predicted future risk of lung cancer (e.g., persons deemed to be higher risk for lung cancer may also then undergo additional testing to either confirm that risk or narrow the cancer type the person is at greatest risk for).
- the additional risk prediction test could include blood-based biomarkers (to look for non-specific inflammation, which is a known risk for lung cancer), metabolomics/proteomics/gene expression/genetic sequencing.
- the person could also have additional sampling of tissue (nasal epithelium, bronchial epithelium, etc.) to look at changes in gene expression in the respiratory tract.
- the methods disclosed herein, including the prioritizing medical resources by identifying candidate patients likely to be at risk of a cancer using risk prediction models, are, in some embodiments, performed on one or more computers.
- the patient prioritization system 130 can include one or more computers. Therefore, in various embodiments, the steps described in reference to the patient prioritization system 130 are performed in silico.
- the building and deployment of a risk prediction model can be implemented in hardware or software, or a combination of both.
- a machine-readable storage medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of executing the training or deployment of risk prediction models and/or displaying any of the datasets or results (e.g., future risk of cancer predictions for patients) described herein.
- the disclosure can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, a pointing device, a network adapter, at least one input device, and at least one output device.
- a display is coupled to the graphics adapter.
- Program code is applied to input data to perform the functions described above and generate output information.
- the output information is applied to one or more output devices, in known fashion.
- the computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
- Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system.
- the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language.
- Each such computer program is preferably stored on a storage media or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein.
- the system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
- the signature patterns and databases thereof can be provided in a variety of media to facilitate their use.
- Media refers to a manufacture that contains the signature pattern information of the present disclosure.
- the databases of the present disclosure can be recorded on computer readable media, e.g., any medium that can be read and accessed directly by a computer.
- Such media include, but are not limited to magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
- Recorded refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.
- the methods of the disclosure are performed on one or more computers in a distributed computing system environment (e.g., in a cloud computing environment).
- cloud computing is defined as a model for enabling on-demand network access to a shared set of configurable computing resources. Cloud computing can be employed to offer on-demand access to the shared set of configurable computing resources.
- the shared set of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
- a cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth.
- a cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”).
- a cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
- a “cloud-computing environment” is an environment in which cloud computing is employed.
- FIG. 4 illustrates an example computer for implementing the entities shown in FIGs. 1A, 1B, 2, 3A, and 3B.
- the computer 400 includes at least one processor 402 coupled to a chipset 404.
- the chipset 404 includes a memory controller hub 420 and an input/output (I/O) controller hub 422.
- a memory 406 and a graphics adapter 412 are coupled to the memory controller hub 420, and a display 418 is coupled to the graphics adapter 412.
- a storage device 408, an input device 414, and network adapter 416 are coupled to the I/O controller hub 422.
- Other embodiments of the computer 400 have different architectures.
- the storage device 408 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
- the memory 406 holds instructions and data used by the processor 402.
- the input device 414 is a touch-screen interface, a mouse, track ball, or other type of pointing device, a keyboard, or some combination thereof, and is used to input data into the computer 400.
- the computer 400 may be configured to receive input (e.g., commands) from the input device 414 via gestures from the user.
- the network adapter 416 couples the computer 400 to one or more computer networks.
- the graphics adapter 412 displays images and other information on the display 418.
- the display 418 is configured such that the user (e.g., radiologist, oncologist, pulmonologist) may input user selections on the display 418 to, for example, initiate risk prediction for a patient, order any additional exams or procedures, and/or set parameters for the risk prediction models.
- the display 418 may include a touch interface.
- the display 418 can show one or more predictions of a risk prediction model.
- the display 418 can show a score indicative of lung cancer risk for the patient.
- the display 418 can show scores for feature groupings that contribute to the score indicative of lung cancer risk. Example information shown on a display 418 is depicted in FIGs. 6A and 6B, and described in further detail below.
- a user who accesses the display 418 can inform the patient of the score indicative of lung cancer risk.
- the display 418 can show information such as the feature groupings that most heavily contributed to the score indicative of lung cancer risk. Displaying the top contributing feature groupings can provide context to a user (e.g., a clinician user) in understanding the features that resulted in the score indicative of lung cancer risk.
- the computer 400 is adapted to execute computer program modules for providing functionality described herein.
- module refers to computer program logic used to provide the specified functionality.
- a module can be implemented in hardware, firmware, and/or software.
- program modules are stored on the storage device 408, loaded into the memory 406, and executed by the processor 402.
- the types of computers 400 can vary depending upon the embodiment and the processing power required by the entity.
- the patient prioritization system 130 can run in a single computer 400 or multiple computers 400 communicating with each other through a network such as in a server farm.
- the computers 400 can lack some of the components described above, such as graphics adapters 412 and displays 418.
- Such a system can include at least the patient prioritization system 130 described above in FIG. 1A.
- the patient prioritization system 130 is embodied as a computer system, such as a computer system with example computer 400 described in FIG. 4.
- Example 1: Example Categorization of Patients
- Example 1 describes an algorithm to identify candidate patients. It is noted that the current United States Preventive Services Task Force (USPSTF) recommendations are based on certain factors. For example, the USPSTF recommends that patients who are 50-80 years old and have a 20+ pack-year smoking history pursue a preliminary low-dose computed tomography (LDCT) scan for possible lung cancer. However, there are drawbacks to using the USPSTF recommendations. For example, the 20+ pack-year smoking history is largely a self-reported data point, and its accuracy depends on patients reporting the correct number, if they report at all.
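For comparison with the machine learning approach, the rule-based baseline can be sketched as follows. The sketch encodes only the two criteria named in the passage above (age 50-80 and a 20+ pack-year history) and is not a full statement of the guideline; the function names are hypothetical, and pack-years are computed in the usual way as packs per day multiplied by years smoked.

```python
from typing import Optional


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years: average packs smoked per day times years of smoking."""
    return packs_per_day * years_smoked


def meets_rule(age: int,
               packs_per_day: Optional[float],
               years_smoked: Optional[float]) -> bool:
    """Apply the two criteria named in the text (hypothetical helper)."""
    if packs_per_day is None or years_smoked is None:
        # A missing self-report means the rule cannot flag the patient at all.
        return False
    return 50 <= age <= 80 and pack_years(packs_per_day, years_smoked) >= 20


flagged = meets_rule(age=62, packs_per_day=1.0, years_smoked=25)
unreported = meets_rule(age=62, packs_per_day=None, years_smoked=None)
```

The second call illustrates the drawback noted above: a patient whose smoking history was never recorded is not flagged, regardless of actual exposure.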
- the machine learning algorithm identifies candidate patients at risk of lung cancer.
- Various machine learning models can be used (the results of which are described below in Example 2).
- machine learning models may be developed using a gradient boosting decision tree algorithm such as CatBoost or XGBoost.
- Other machine learning models may be developed, e.g., a neural network.
- the algorithm may be trained on features extracted from stored electronic data (e.g., lung issues, heart issues) and claims data (e.g., procedure codes), and may be analyzed through Shapley analysis.
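To make the Shapley analysis concrete, the following sketch computes exact Shapley values over three feature groupings for a toy scoring function. It is an illustration only: the brute-force enumeration, the function names, and every number in `toy_score` are assumptions, and practical tools typically rely on faster approximations or tree-specific algorithms rather than enumerating every subset.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi: Dict[str, float] = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_score(present: FrozenSet[str]) -> float:
    """Hypothetical risk score as a function of which groupings are 'present'."""
    score = 0.10  # baseline
    if "Tobacco Use/Smoking" in present:
        score += 0.30
    if "Lung" in present:
        score += 0.20
    if "Tobacco Use/Smoking" in present and "Lung" in present:
        score += 0.10  # interaction term, shared equally by the two groupings
    if "Heart" in present:
        score += 0.05
    return score


groups = ["Tobacco Use/Smoking", "Lung", "Heart"]
phi = shapley_values(groups, toy_score)
```

The enumeration visits every subset of the other players, so its cost grows exponentially with the number of players. Treating a grouping of related features as a single player keeps that number small.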
- Each third party (e.g., a clinical site) can adjust a threshold for demarcating elevated versus standard risk of lung cancer, based on site preference.
- the raw output of the algorithm may be a floating-point number (propensity score) between 0 and 1 corresponding to the lung cancer risk score. A higher number may indicate higher risk.
- the raw output may also include a list of “features” (each “feature” is either an individual feature, or a feature grouping of related features to ease Shapley computational burden) and their Shapley additive contribution to lung cancer score.
- the formatted output may be visible to the user and may be a binary output of elevated versus standard risk.
- the user interface may also show feature(s) that contributed most to the score.
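The step from raw output to formatted output can be sketched as below. This is a minimal illustration under stated assumptions: the threshold of 0.5, the choice to rank contributions by absolute value, the names `FormattedOutput` and `format_output`, and the contribution values are all hypothetical; only the propensity score of 0.73 echoes a value reported in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FormattedOutput:
    risk_category: str                     # "elevated" or "standard"
    core_drivers: List[Tuple[str, float]]  # top contributing features or groupings


def format_output(score: float,
                  contributions: Dict[str, float],
                  threshold: float = 0.5,  # assumed; each site may set its own
                  top_k: int = 2) -> FormattedOutput:
    """Turn a propensity score and Shapley contributions into a binary result."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("propensity score must lie between 0 and 1")
    category = "elevated" if score >= threshold else "standard"
    # Assumption: rank by absolute contribution, largest first.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return FormattedOutput(category, ranked[:top_k])


# Hypothetical contributions for one patient.
shap_by_group = {
    "Smoking Status": 0.21,
    "COPD": 0.15,
    "Heart": 0.03,
    "Preventative Care": -0.04,
}
result = format_output(0.73, shap_by_group)
```

Because the threshold is a parameter rather than a constant, a clinical site can adjust it without changing the model.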
- FIG. 5A depicts an example data pipeline for developing the algorithm (e.g., machine learning model).
- Native electronic data e.g., electronic health record (EHR) or claims data
- the study criteria e.g., 50-80 years old, smoking related observation, and no prior scan, biopsy procedure, or cancer diagnosis
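The example study criteria above can be sketched as a simple cohort filter. The `Patient` record and its field names are hypothetical and stand in for whatever the native EHR or claims data actually provide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Patient:
    patient_id: str
    age: int
    has_smoking_observation: bool
    prior_scan: bool        # e.g., a prior LDCT, chest CT, PET, or PET-CT scan
    prior_biopsy: bool
    prior_cancer_dx: bool


def meets_study_criteria(p: Patient) -> bool:
    """Age 50-80, a smoking-related observation, and no prior scan, biopsy, or diagnosis."""
    no_prior_workup = not (p.prior_scan or p.prior_biopsy or p.prior_cancer_dx)
    return 50 <= p.age <= 80 and p.has_smoking_observation and no_prior_workup


# Hypothetical patients.
patients: List[Patient] = [
    Patient("p1", 63, True, False, False, False),   # eligible
    Patient("p2", 47, True, False, False, False),   # too young
    Patient("p3", 70, True, True, False, False),    # already scanned
    Patient("p4", 58, False, False, False, False),  # no smoking observation
]
cohort = [p for p in patients if meets_study_criteria(p)]
```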
- the selected patient data can be provided (from health care provider or parsed from health care provider output) in the form of four tables listing the following info for each patient:
- diagnosis (Dx) codes e.g., lung issues such as COPD, chronic bronchitis, pneumonia, emphysema; heart issues such as hypertension, vascular disease, hyperlipidemia
- Patients may be split into training, testing, and validation datasets.
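One way to perform such a split is to assign each patient to a partition by hashing the patient identifier, so that all records for a patient stay together and the assignment is reproducible. The text does not state how the split was made; the hashing approach, the 70/15/15 proportions, and the function name `assign_split` are assumptions for illustration.

```python
import hashlib
from collections import Counter


def assign_split(patient_id: str, train: float = 0.70, test: float = 0.15) -> str:
    """Map a patient ID to 'train', 'test', or 'validation' deterministically."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train:
        return "train"
    if u < train + test:
        return "test"
    return "validation"


# Hypothetical patient identifiers.
ids = [f"patient-{i}" for i in range(1000)]
counts = Counter(assign_split(pid) for pid in ids)
```

Splitting at the patient level, rather than at the record level, avoids the same patient appearing in both the training data and the data used to evaluate the model.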
- the data underwent feature identification.
- the patient data may undergo a transformation (e.g., a hyperbolic transformation) to change input values to be between -1 and 1.
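The text does not specify the hyperbolic transformation. One plausible reading is the hyperbolic tangent applied to standardized values, which bounds every input between -1 and 1 and dampens outliers. The sketch below follows that reading; the standardization step, the function name `tanh_scale`, and the sample values are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List


def tanh_scale(values: List[float]) -> List[float]:
    """Standardize a feature column, then squash it with tanh."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zero.
        return [0.0 for _ in values]
    return [math.tanh((v - mu) / sigma) for v in values]


# Hypothetical feature column: number of prior physician visits per patient.
visits = [1.0, 2.0, 3.0, 50.0]
scaled = tanh_scale(visits)
```

Because tanh is monotonic, the ordering of patients on the feature is preserved while the extreme value is pulled toward 1.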
- the patient data may be used to train the machine learning models.
- the parameters and/or hyperparameters of the machine learning models may be tuned during training and final versions of the models may be saved after training.
- the machine learning models may undergo further validation.
- FIG. 5B depicts an example data pipeline for validating the algorithm (e.g., machine learning model).
- no further training or tuning of the machine learning models occurred in this phase.
- the machine learning models may be deployed to determine final performance metrics, measuring the performance of the machine learning models.
- software may process patient data into “scalar tables.”
- the patient data may be input into the algorithm, and the algorithm may analyze designated input features (diagnoses, procedures, prescriptions, demographics).
- the example features are described herein in Tables 1-5.
- the raw output of the algorithm may include:
- lung cancer score e.g., normalized to 0-1, or any other suitable scale.
- the lung cancer score may be compared to a predetermined threshold to classify a patient as having an elevated risk or a standard risk for future lung cancer.
- the output to a health care provider may include:
- FIGs. 6A and 6B depict example outputs for a patient with a standard lung cancer risk and a patient with an elevated lung cancer risk, respectively.
- FIG. 6A shows the results of a patient with standard lung cancer risk (e.g., a non-candidate patient).
- the machine learning model predicted an overall score for the patient (e.g., “propensity score”) of 0.30. Given that the score is below a threshold value, the patient is categorized as a non-candidate patient.
- the chart on the left as well as the table on the right in FIG. 6A show individual contributions of various features. The contributions are denoted as “SHAP values.”
- the “core drivers” shown in FIG. 6A indicate the most influential and important feature groupings that contributed to the propensity score.
- FIG. 6B shows the results of a patient with elevated lung cancer risk (e.g., a candidate patient).
- the machine learning model predicted an overall score for the patient (e.g., “propensity score”) of 0.73. Given that the score is above a threshold value, the patient is categorized as a candidate patient.
- the chart on the left as well as the table on the right in FIG. 6B show individual contributions of various features. The contributions are denoted as “SHAP values.”
- the “core drivers” shown in FIG. 6B of Smoking Status and COPD indicate the most influential and important feature groupings that contributed to the propensity score.
- the candidate patient corresponding to the results in FIG. 6B can be prioritized for subsequent screening (e.g., imaging, such as a subsequent CT scan) whereas the non-candidate patient corresponding to the results in FIG. 6A can be withheld from subsequent screening.
- Example 2 Categorization of Patients Using Logistic Regression, Neural Network,
- Various machine learning models may be developed to show the applicability of the disclosed methodology across different machine learning model architectures.
- each of a logistic regression machine learning model, a neural network machine learning model, an XGBoost gradient boosted decision tree, and a CatBoost gradient boosted decision tree may be developed.
- FIG. 7 shows performance of various machine learning models (e.g., logistic regression, neural network, XGBoost, and CatBoost).
- each of the machine learning models identified patients that were of high risk of developing lung cancer.
- the various machine learning approaches may provide varying results.
- FIG. 7 shows that gradient boosting decision tree algorithms (CatBoost, XGBoost) appear to provide the best results, followed by neural network and logistic regression.
- each machine learning model achieved an Odds Ratio (e.g., Odds Ratio refers to the relative risk of the high-risk population compared to the standard risk) greater than 1, indicating that all machine learning models were successful.
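The metric can be sketched from a 2x2 table of model categorization against observed outcome. The text glosses the odds ratio as a relative risk; the two are distinct quantities that are close to each other when the outcome is rare, so both are computed here. All counts are hypothetical and are not results from the disclosure.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for a 2x2 table.

    a: elevated risk, developed cancer    b: elevated risk, no cancer
    c: standard risk, developed cancer    d: standard risk, no cancer
    """
    if 0 in (a, b, c, d):
        raise ValueError("all four cells must be non-zero for a plain odds ratio")
    return (a * d) / (b * c)


def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Risk in the elevated group divided by risk in the standard group."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: 1000 patients per group.
a, b, c, d = 30, 970, 10, 990
or_value = odds_ratio(a, b, c, d)     # (30 * 990) / (970 * 10)
rr_value = relative_risk(a, b, c, d)  # 0.03 / 0.01
```

A value greater than 1 means that patients categorized as elevated risk developed cancer more often than those categorized as standard risk.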
Abstract
Methods involve deploying trained risk prediction models to analyze electronic records of patients to identify which patients are at risk of developing cancer. Electronic records are obtainable in large quantities and in a cost-effective manner and, therefore, can be valuable data for continuous monitoring and evaluation of large patient populations for patients that are at risk of developing cancer. Patients at risk of developing cancer, referred to herein as candidate patients, can be prioritized for medical interventions, thereby enabling healthcare providers to appropriately divert medical attention to candidate patients that are most in need. Candidate patients can undergo subsequent imaging and/or biopsy procedures to confirm the risk of developing cancer.
Description
SYSTEMS AND METHODS FOR PRIORITIZING MEDICAL RESOURCES FOR CANCER SCREENING
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims priority to and the benefit of U.S. Provisional Application No. 63/424,570, filed November 11, 2022, which is incorporated by reference herein in its entirety.
BACKGROUND
[0002] Lung cancer most commonly begins with the development of a lung nodule. Generally, the larger the nodule, the more rapid its growth, or the more irregular its appearance, the more likely it is to be cancer. In many scenarios, lung nodules in patients remain undetected for periods of time or, even if detected, can already indicate an advanced stage of cancer. Thus, early prediction of lung cancer risk in patients, even prior to the development of one or more lung nodules, can be valuable. However, early-stage cancer screening remains difficult because screening large numbers of patients using resource-intensive methodologies is infeasible. For example, performing tissue biopsies and/or image scanning across large patient populations is untenable. Therefore, there is a need to effectively identify patients most likely at risk of developing cancer.
SUMMARY
[0003] Embodiments disclosed herein involve implementing machine learning models to analyze electronic records of patients. Electronic records of patients can represent valuable information that is predictive of the risk of cancer. Furthermore, electronic records can be accumulated easily and cost-effectively, e.g., during patient visits. Therefore, electronic records can be valuable data for continuous monitoring and evaluation of patients for their risk of developing cancer. Methods disclosed herein are useful for identifying patients that may be at risk of developing cancer, hereafter referred to as candidate patients. Healthcare providers, who may be caring for large numbers of patients, can therefore appropriately prioritize limited medical resources by providing interventions to candidate patients that are at greatest risk of developing cancer. For example, candidate patients can undergo subsequent imaging and/or biopsy procedures, which are far more costly procedures for confirming whether the candidate patients are at risk of developing cancer.
[0004] Disclosed herein is a method for prioritizing medical resources for screening a patient for cancer, the method comprising obtaining a temporally diverse dataset comprising electronic records of a patient; weighting features from data of the electronic records of the patient according to timepoints that the data were recorded in the electronic records of the patient; analyzing the weighted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient for prioritization of medical resources.
[0005] In various embodiments, weighting the features according to timepoints that the data were recorded in the electronic records of the patient comprises assigning higher weights to features from data that were more recently recorded in the electronic records in comparison to features from data that were earlier recorded in the electronic records. In various embodiments, methods disclosed herein further comprise normalizing the data of the electronic records of the patient. In various embodiments, normalizing the data comprises applying a hyperbolic tangent transformation. In various embodiments, the machine learning model outputs a score indicative of cancer risk for the patient. In various embodiments, the score indicative of cancer risk is a continuous score between 0 and 1. In various embodiments, the machine learning model further outputs an identification of a feature or feature grouping that contributed to the score indicative of cancer risk.
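As an illustration only, the recency weighting and hyperbolic tangent normalization described above might be sketched as follows. The exponential decay form, the `half_life_days` and `scale` parameters, and all function names are assumptions made for this sketch; the disclosure does not specify a particular weighting function.

```python
import math
from datetime import date


def recency_weight(recorded, reference, half_life_days=365.0):
    """Hypothetical weight that halves for every `half_life_days` of record age."""
    age_days = max((reference - recorded).days, 0)
    return 0.5 ** (age_days / half_life_days)


def tanh_normalize(value, scale=1.0):
    """Squash a raw value into the open interval (-1, 1) with a hyperbolic tangent."""
    return math.tanh(value / scale)


def weighted_feature(observations, reference, scale=1.0):
    """Combine (recorded_date, raw_value) pairs into one recency-weighted feature.

    More recently recorded observations receive higher weights than earlier ones.
    """
    total = 0.0
    weight_sum = 0.0
    for recorded, raw in observations:
        w = recency_weight(recorded, reference)
        total += w * tanh_normalize(raw, scale)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Example: two systolic blood pressure readings, the later one weighted more heavily.
reference_date = date(2023, 1, 1)
readings = [(date(2020, 6, 1), 150.0), (date(2022, 12, 1), 120.0)]
feature_value = weighted_feature(readings, reference_date, scale=100.0)
```

In this sketch the weighted average is dominated by the most recent reading, which is one simple way to make newer data count for more than older data.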
[0006] In various embodiments, providing identification of the candidate patient for prioritization of medical resources further comprises providing the corresponding identifications of features or feature groupings of the candidate patient for prioritization of medical resources. In various embodiments, the features from data of the electronic records comprise features from electronic health record (EHR) data. In various embodiments, the features from data of the electronic records comprise features from medical claims data. In various embodiments, the features from data of the electronic records comprise features from EHR data and medical claims data. In various embodiments, the features from EHR data comprise one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data. In various embodiments, the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status. In various embodiments, the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke. In various embodiments, the features from medical claims data comprise one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
[0007] In various embodiments, the prior diagnoses data comprises one or more diagnostic codes. In various embodiments, the one or more diagnostic codes comprise ICD-9 or ICD-10 codes. In various embodiments, the one or more diagnostic codes comprise ICD-10 codes, wherein one or more ICD-10 codes were converted from one or more ICD-9 codes. In various embodiments, the prior procedures data comprises one or more procedures codes. In various embodiments, the one or more procedures codes comprise HCPCS or CPT-4 codes. In various embodiments, the prior prescriptions data comprises one or more national drug codes (NDCs). In various embodiments, the patient is between 50-80 years old. In various embodiments, the patient exhibits a prior smoking history. In various embodiments, the patient has not previously undergone a computed tomography (CT) scan, a positron emission tomography (PET) scan, or a PET-CT scan. In various embodiments, the patient has not previously undergone a cancer biopsy procedure. In various embodiments, the patient has not previously received a cancer diagnosis.
[0008] In various embodiments, the cancer comprises lung cancer. In various embodiments, the lung cancer is one of non-small cell lung cancer, small cell lung cancer, adenocarcinoma, and squamous cell carcinoma. In various embodiments, the prioritization of medical resources comprises prioritizing patients for undergoing computed tomography (CT) scans. In various embodiments, the machine learning model comprises a logistic regression model, a random forest model, or a neural network.
[0009] In various embodiments, methods disclosed herein further comprise obtaining updated electronic records for one or more patients, the updated electronic records comprising additional data recorded in the updated electronic records subsequent to providing identification of the candidate patient; analyzing features from the additional data using a machine learning model to categorize a patient as an additional candidate patient at risk for cancer or a non-candidate patient; and responsive to determining that the patient is an additional candidate patient, providing identification of the additional candidate patient for prioritization of medical resources.
[0010] Additionally disclosed herein is a method for prioritizing medical resources for screening a patient for cancer, the method comprising obtaining a dataset comprising electronic records of a patient; receiving an indication of available medical resources of a third party; extracting features from data of the electronic records of the patient; analyzing the extracted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient, wherein the categorizing of the patient uses at least a prediction of the machine learning model and a threshold selected according to the received indication; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient for prioritization of medical resources. In various embodiments, the threshold is selected to account for the available medical resources of the third party. In various embodiments, a lower threshold is selected for the third party for an indication reflecting higher available medical resources of the third party, in comparison to a higher threshold that is selected for the third party for an indication reflecting lower available resources for the third party.
[0011] In various embodiments, methods disclosed herein further comprise weighting the extracted features according to timepoints that the data were recorded in the electronic records of the patient.
[0012] Additionally disclosed herein is a method for prioritizing medical resources for screening individuals for cancer, the method comprising obtaining a dataset comprising electronic records of a patient; extracting features from data of the electronic records of the patient; analyzing the extracted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient, wherein the machine learning model is configured to output 1) a score indicative of lung cancer risk for the patient and 2) identification of a feature grouping that contributed to the score indicative of lung cancer risk, wherein the feature grouping comprises two or more features; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient and the identification of the feature grouping to the third party for prioritization of medical resources.
[0013] In various embodiments, the feature grouping comprises at least 2 features. In some embodiments, the feature grouping may comprise between 2 and 10 features. In various embodiments, the feature grouping comprises one of a lung issue grouping, heart issue grouping, smoking status grouping, patient characteristics grouping, patient behavior grouping, and vaccine grouping. In various embodiments, the lung issue grouping comprises one or more of chronic obstructive pulmonary disease (COPD), chronic bronchitis, pleural effusion, dyspnea, wheezing, and inhaled treatment for COPD and/or asthma. In various embodiments, the heart issue grouping comprises one or more of atherosclerotic heart disease, iron deficiency anemias, elevated blood pressure, treatment for high blood pressure, and treatment for reducing risk of heart attack and/or stroke. In various embodiments, the smoking status grouping comprises one or more of tobacco use, nicotine dependence, cigarette use, smoking cessation, number of months actively smoking, never smoked observation, and current smoker observation. In various embodiments, the patient characteristics grouping comprises one or more of systolic blood pressure, diastolic blood pressure, number of months active, patient age, and geographic location. In various embodiments, the patient behavior grouping comprises one or more of prior established patient visits and new patient visits. In various embodiments, the vaccine grouping comprises one or more of pneumonia vaccine and flu vaccine. In various embodiments, analyzing the extracted features using the machine learning model comprises implementing a Shapley additive contribution algorithm to determine contributions of one or more feature groupings. In various embodiments, the feature grouping identified by the output of the machine learning model comprises a feature grouping providing the highest contribution to the score. In various embodiments, the output of the machine learning model further comprises an identification of a second feature grouping providing the second highest contribution to the score. In various embodiments, the output of the machine learning model further comprises an identification of a third feature grouping providing the third highest contribution to the score. In various embodiments, categorizing the patient as a candidate patient or a non-candidate patient based on the score further comprises selecting a threshold according to a received indication of available medical resources of the third party; and categorizing the patient as a candidate patient or a non-candidate patient using the score and the threshold.
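The ranking of feature groupings by contribution can be illustrated with a minimal sketch. It assumes that per-feature additive contributions (for example, Shapley-style values) have already been computed by a separate attribution step, and it sums them within each grouping; this summation is one possible approach and is not stated in the disclosure. The feature names, the grouping map, and the numeric values are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from feature name to feature grouping.
FEATURE_TO_GROUP = {
    "copd_diagnosis": "lung issue",
    "dyspnea": "lung issue",
    "nicotine_dependence": "smoking status",
    "smoking_cessation": "smoking status",
    "patient_age": "patient characteristics",
    "flu_vaccine": "vaccine",
}


def group_contributions(feature_contributions):
    """Sum per-feature additive contributions within each feature grouping."""
    totals = defaultdict(float)
    for feature, value in feature_contributions.items():
        totals[FEATURE_TO_GROUP.get(feature, "other")] += value
    return dict(totals)


def top_groupings(feature_contributions, k=3):
    """Return the k groupings with the highest summed contribution to the score."""
    totals = group_contributions(feature_contributions)
    return sorted(totals, key=totals.get, reverse=True)[:k]


# Hypothetical contributions for a single patient.
example = {
    "copd_diagnosis": 0.12,
    "dyspnea": 0.03,
    "nicotine_dependence": 0.20,
    "smoking_cessation": -0.02,
    "patient_age": 0.05,
    "flu_vaccine": -0.01,
}
ranked = top_groupings(example)
```

Here `ranked[0]` plays the role of the grouping providing the highest contribution, and `ranked[1]` and `ranked[2]` the second and third highest.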
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] These and other features, aspects, and advantages of the present disclosure will become better understood with regard to the following description and accompanying drawings.
[0015] FIG. 1A depicts a system overview for prioritizing medical resources for candidate patients, in accordance with an embodiment.
[0016] FIG. 1B depicts a block diagram of an example patient prioritization system, in accordance with an embodiment.
[0017] FIG. 2A depicts an example flow diagram for implementing a risk prediction model for identifying candidate patients, in accordance with an embodiment.
[0018] FIG. 2B depicts an example diagram for organizing the electronic data, in accordance with an embodiment.
[0019] FIG. 3A depicts an example flow process for identifying candidate patients, in accordance with a first embodiment.
[0020] FIG. 3B depicts an example flow process for identifying candidate patients, in accordance with a second embodiment.
[0021] FIG. 3C depicts an example interaction diagram for identifying candidate patients, in accordance with a third embodiment.
[0022] FIG. 4 illustrates an example computer for implementing the entities shown in FIGS. 1A, 1B, 2, 3A, and 3B.
[0023] FIG. 5A depicts an example data pipeline for developing the algorithm (e.g., machine learning model).
[0024] FIG. 5B depicts an example data pipeline for validating the algorithm (e.g., machine learning model).
[0025] FIG. 6A depicts example output for a patient with a standard lung cancer risk.
[0026] FIG. 6B depicts example output for a patient with an elevated lung cancer risk.
[0027] FIG. 7 shows performance of various machine learning models.
DETAILED DESCRIPTION
I. Definitions
[0028] Terms used in the claims and specification are defined as set forth below unless otherwise specified.
[0029] The terms “subject” or “patient” are used interchangeably and encompass a cell, tissue, or organism, human, or non-human, whether in vivo, ex vivo, or in vitro, male, or female.
[0030] The term “mammal” encompasses both humans and non-humans and includes but is not limited to humans, non-human primates, canines, felines, murines, bovines, equines, and porcines.
[0031] The phrases “electronic records” and “electronic data” are used interchangeably and generally refer to patient data stored in electronic form. Examples of electronic records described herein include electronic health records and claims data.
[0032] The term “obtaining a dataset comprising electronic records of a patient” and variants thereof encompasses obtaining a dataset comprising electronic records captured from the patient. Obtaining the dataset comprising electronic records can encompass performing steps of capturing the dataset, e.g., obtaining data from the patient and recording the data. The phrase can also encompass receiving the dataset, e.g., from a third party that has performed the steps of capturing the dataset comprising electronic records from the patient. The term “obtaining a dataset comprising electronic records of a patient” can also include having (e.g., instructing) a third party obtain the dataset.
[0033] The term “sample” or “test sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a patient, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art. Examples of an aliquot of body fluid include amniotic fluid, aqueous humor, bile, lymph, breast milk, interstitial fluid, blood, blood plasma, cerumen (earwax), Cowper’s fluid (pre- ejaculatory fluid), chyle, chyme, female ejaculate, menses, mucus, saliva, urine, vomit, tears, vaginal lubrication, sweat, serum, semen, sebum, pus, pleural fluid, cerebrospinal fluid, synovial fluid, intracellular fluid, and vitreous humour. In various embodiments, a sample can be a biopsy of a tissue, such as a lung tumor or a lung nodule.
[0034] The phrase “at risk for cancer” refers to a risk that a patient will develop cancer within a given time period, e.g., within 1 year. In various embodiments, the risk of cancer refers to a likelihood that a patient will develop cancer within a given time period from time zero (T0), wherein time zero refers to when electronic data was obtained from the patient. In various embodiments, the risk of cancer refers to a likelihood that a patient will develop cancer within a certain period, for example, 6 months, 1 year, 10 years, or 20 years.
[0035] The terms “treating,” “treatment,” or “therapy” of lung cancer shall mean slowing, stopping, or reversing a cancer’s progression by administration of treatment. In some embodiments, treating lung cancer means reversing the cancer’s progression, ideally to the point of eliminating the cancer itself. In various embodiments, “treating,” “treatment,” or “therapy” of lung cancer includes administering a therapeutic agent or pharmaceutical composition to the patient. Additionally, as used herein, “treating,” “treatment,” or “therapy” of lung cancer further includes administering a therapeutic agent or pharmaceutical composition for prophylactic purposes. Prophylaxis of a cancer refers to the administration of a composition or therapeutic agent to prevent the occurrence, development, onset, progression, or recurrence of cancer or some or all the symptoms of lung cancer or to lessen the likelihood of the onset of lung cancer.
[0036] It must be noted that, as used in the specification, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.
II. System Environment Overview
[0037] FIG. 1A depicts a system overview for prioritizing medical resources for candidate patients, in accordance with an embodiment. The system environment 100 provides context to introduce patients 110, stored electronic data 120, and a patient prioritization system 130 for identifying candidate patients 140. Although FIG. 1A depicts a system environment 100 including three patients 110, in various embodiments, the system environment 100 includes additional or fewer patients such that the patient prioritization system 130 identifies a subset of patients as candidate patients 140. For example, the system environment 100 may include 2, 100, 1000, 1 million, 100 million, or other number of patients.
[0038] In various embodiments, the patients 110 are presumed to be healthy. For example, the patients 110 have not been previously diagnosed with cancer. As another example, the patients 110 have not been previously suspected of having cancer. Thus, the methods disclosed herein can be beneficial for identifying candidate patients who may be at risk of cancer from patients who are presumed to be healthy. In various embodiments, the type of cancer is a lung cancer. Thus, the methods described herein can be beneficial for prioritizing candidate patients for early detection of lung cancer. In various embodiments, a patient 110 may have been previously diagnosed with a cancer. For example, the patient 110 can be in remission and therefore, the methods disclosed herein can be beneficial for determining whether the patient 110 is likely to experience a recurrence of cancer.
[0039] Generally, data can be obtained from the patients 110 and stored. For example, such data can include electronic data. FIG. 1A shows example stored electronic data 120 that is obtained from patients 110. Exemplary electronic data include electronic health record (EHR) data and claims data. EHR data represents an electronic version of a patient’s medical history. Claims data includes administrative data covering information such as doctor’s appointments, bills, and insurance information. Additional details and examples of EHR data and claims data are further described herein.
[0040] In various embodiments, the stored electronic data 120 can be gathered from patients 110 at one or more patient visits (e.g., patient visits to a medical provider). Thus, upon each patient visit, the stored electronic data 120 can be further augmented or supplemented by the information gathered at that patient visit. As referred to herein, the stored electronic data 120 represents a temporally diverse dataset of electronic records including electronic data recorded at various timepoints (e.g., at various timepoints when the patient visited). In various embodiments, the stored electronic data 120 is maintained and updated in real-time as additional information is gathered from a patient. For example, the stored electronic data 120 of various patients can be maintained in a cloud service and therefore, can be continuously updated as new or updated information of patients 110 is obtained. The database management system for storing electronic data can be any suitable system, for example, EPIC Healthcare Software, EBS PathoSof, HxRx Healthcare Management System, Healcon Practice, Drug Inventory Management System (DIMS), oeHealth, Patientpop, Webptis, GeBBS HIM Solutions, Cerner, WebPT, eClinicalWorks, and NextGen Healthcare EHR.
[0041] In various embodiments, the stored electronic data 120 and the patient prioritization system 130 are maintained or employed by a common party. In some embodiments, the stored electronic data 120 and the patient prioritization system 130 are maintained or employed by different parties. For example, a first party may maintain the stored electronic data 120 and/or continuously update the stored electronic data 120 in view of new or updated information from patients 110. A different party may operate the patient prioritization system 130 to analyze the stored electronic data 120 to identify candidate patients 140. In one example, the party that maintains the stored electronic data 120 may be a hospital or physician’s office. Thus, the patient prioritization system 130 may request and access the stored electronic data 120 maintained by a hospital or a physician’s office to identify candidate patients 140.
[0042] Although FIG. 1A shows a single stored electronic data 120, in various embodiments there may be a plurality of stored electronic data 120. For example, each stored electronic data 120 can be maintained by a different party (e.g., a different hospital or physician’s office) or multiple parties. Therefore, the patient prioritization system 130 can access and analyze stored electronic data 120 from a large number of patients by accessing stored electronic data 120 from various sources maintained by any number of parties.
[0043] Referring to the patient prioritization system 130, it accesses and analyzes stored electronic data 120 to identify candidate patients 140 that may be at risk of cancer. Candidate patients 140 may represent a subset of the patients 110. Candidate patients 140 may be prioritized to receive a subsequent intervention, whereas patients that are not identified as candidate patients 140 (patients not identified as candidate patients are hereafter referred to as “non-candidate patients”) may be deprioritized, in an embodiment.
[0044] In various embodiments, the patient prioritization system 130 accesses the stored electronic data 120 by sending a request to a party that maintains the stored electronic data 120. In various embodiments, the patient prioritization system 130 continuously accesses the stored electronic data 120 over time. Continuously accessing the stored electronic data 120 over time may enable the patient prioritization system 130 to access the most up-to-date stored electronic data 120 such that the patient prioritization system 130 can identify the candidate patients based on the most up-to-date stored electronic data 120. In various embodiments, the patient prioritization system 130 accesses the stored electronic data 120 at predetermined intervals of time (e.g., daily, bi-weekly, weekly, monthly, annually, or other suitable time intervals). In various embodiments, the party maintaining the stored electronic data 120 can provide the stored electronic data 120 to the patient prioritization system 130 when a trigger event occurs. For example, a trigger event may be an update to the stored electronic data 120 in view of new patient information or change to patient information.
[0045] In some embodiments, the patient prioritization system 130 analyzes stored electronic data 120, such as EHR data, claims data, or other data that may be easily obtained and/or does not require extensive computing resources to analyze. Using easily obtainable data allows a larger pool of patients to be analyzed, making the approach suitable for use in early-stage cancer screening.
[0046] To identify candidate patients 140, the patient prioritization system 130 may analyze stored electronic data 120 of patients 110 by deploying a trained machine learning model, hereafter referred to as a trained risk prediction model. The trained risk prediction model may analyze features derived from the stored electronic data 120 of patients 110 to determine which patients are to be categorized as candidate patients 140.
[0047] Reference is now made to FIG. 1B, which depicts a block diagram of an example patient prioritization system 130. In this example, the patient prioritization system 130 includes a feature extraction module 145, a resource availability module 150, a model training module 155, a model deployment module 160, and a candidate patient identifier module 165. In various embodiments, the patient prioritization system 130 may be configured differently with additional, fewer, or different modules.
[0048] The feature extraction module 145 may process electronic data, such as electronic data obtained from stored electronic data 120 shown in FIG. 1A. In various embodiments, the feature extraction module 145 may identify eligible patients according to one or more criteria (e.g., 50-80 years old, non-smoker, and no prior scan, biopsy procedure, or cancer diagnosis). The feature extraction module 145 may extract features from the electronic data of eligible patients that meet the one or more criteria. In various embodiments, the feature extraction module 145 may assign weights to different features. For example, the feature extraction module 145 may assign higher weights to features from the electronic data 120 that were more recently recorded in comparison to features from the electronic data 120 that were earlier recorded in the electronic records. The feature extraction module 145 may provide the extracted features to the model deployment module 160, e.g., for inputting into a trained machine learning model.
[0049] The resource availability module 150 manages communications with one or more third parties to assess the availability of resources at the one or more third parties. Availability of resources at each third party may differ. For example, a third party may be a hospital or a physician’s office that provides care to different numbers of patients. The resource availability module 150 may communicate with one or more third parties to receive indications identifying the quantity of medical resources each third party has available. Based on the indications, the resource availability module 150 can determine whether the number of candidate patients identified for a particular third party exceeds the medical resources available at that third party. As an example, the resource availability module 150 may receive, from a third party, an indication that identifies that the third party has the capacity to perform X image scans for candidate patients. The resource availability module 150 can ensure that the total number of candidate patients identified to the third party does not exceed the third party’s capacity of X image scans. In one embodiment, the resource availability module 150 sets a threshold according to the indication from the third party that reflects the available medical resources at the third party. By modulating the threshold, the resource availability module 150 can control the number of candidate patients identified for a particular third party.
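One way the threshold could be tied to a reported capacity is sketched below. This is an assumption-laden illustration rather than the disclosed implementation: the function names, the `min_threshold` floor, and the strict `>` comparison (chosen so that tied scores never push the count above capacity) are all hypothetical.

```python
def select_threshold(scores, capacity, min_threshold=0.5):
    """Pick a threshold so that at most `capacity` scores lie strictly above it.

    `min_threshold` is a hypothetical floor applied when capacity is plentiful.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    ranked = sorted(scores, reverse=True)
    if capacity >= len(ranked):
        return min_threshold
    # The (capacity + 1)-th highest score becomes the cut-off.
    return max(min_threshold, ranked[capacity])


def identify_candidates(scores_by_patient, capacity, min_threshold=0.5):
    """Return patient identifiers whose score is strictly above the threshold."""
    threshold = select_threshold(scores_by_patient.values(), capacity, min_threshold)
    return [pid for pid, score in scores_by_patient.items() if score > threshold]


# Hypothetical risk scores between 0 and 1 for four patients.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2}
limited = identify_candidates(scores, capacity=2)
ample = identify_candidates(scores, capacity=10)
```

With a capacity of 2 scans the sketch yields a higher threshold and two candidates; with a capacity of 10 it falls back to the lower floor and yields three, which mirrors the relationship between available resources and threshold described above.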
[0050] The model training module 155 may perform steps to train one or more machine learning models. In some embodiments, the model training module 155 trains machine learning models such that the machine learning models can accurately separate candidate patients, who are likely at higher risk of developing cancer, from non-candidate patients, who are likely at lower risk of developing cancer. In some embodiments, the model training module 155 trains machine learning models using training data that includes electronic data, such as EHR data and/or claims data.
[0051] The model deployment module 160 may retrieve and deploy trained machine learning models to generate predictions for patients, the predictions being informative for determining whether a patient is categorized as a candidate patient or a non-candidate patient. In various embodiments, the model deployment module 160 deploys a trained machine learning model that outputs a score that is informative for determining whether a patient is categorized as a candidate patient or a non-candidate patient. In various embodiments, the trained machine learning model outputs relative contributions of feature groupings that contributed towards the score informative for determining whether a patient is categorized as a candidate patient or a non-candidate patient.
[0052] The candidate patient identifier module 165 may identify a subset of patients as candidate patients using the predictions generated by trained machine learning models. In various embodiments, to determine whether a patient is to be categorized as a candidate patient, the candidate patient identifier module 165 compares a prediction generated by the machine learning model for the patient to a threshold score. Based on the comparison, the candidate patient identifier module 165 may classify the patient as a candidate patient or a non-candidate patient. In some embodiments, the candidate patient identifier module 165 communicates with one or more third parties by providing identification of candidate patients. For example, the identified candidate patients may represent a subset of patients that are under the care of a third party. Thus, by providing identification of the candidate patients to the third party, the third party may then prioritize its available medical resources by providing interventions to the candidate patients over non-candidate patients.
III. Methods for Predicting Risk of Cancer
[0053] Embodiments described herein include methods for identifying candidate patients by applying one or more trained risk prediction models. Such methods can be performed by the patient prioritization system 130 described in FIG. 1B. Reference is now made to FIG. 2A, which depicts an example flow diagram for implementing a risk prediction model for identifying candidate patients.
[0054] In FIG. 2A, the patient prioritization system 130 receives or accesses the stored electronic data 120 (e.g., a temporally diverse dataset comprising electronic records of patients) that may be maintained by a third party. At step 210, the patient prioritization system 130 analyzes the stored electronic data 120 to identify eligible patients, for example, by identifying those satisfying one or more criteria. The one or more criteria can include a particular age group (e.g., between 30-100 years old, between 40-90 years old, or between 50-80 years old), smoking related observation (e.g., smoking habit of the patient, such as a smoking pack year history), and a lack of a prior scan, prior biopsy, or prior cancer diagnosis. Inclusion of the criteria of lacking a prior scan, prior biopsy, or prior cancer diagnosis may allow patients that may not typically be suspected of being at risk of cancer to be included as eligible patients. In various embodiments, the criteria may include one or more of the United States Preventive Services Task Force (USPSTF) recommendations, such as 50-80 years old and a 20+ smoking pack year history. In some embodiments, the criteria include each of age group (e.g., 50-80 years old), smoking related observation, and lack of each of a prior scan, prior biopsy, and prior cancer diagnosis.
[0055] In some embodiments, the patient prioritization system 130 identifies eligible patients by comparing the stored electronic data 120 of patients to the one or more criteria. If the electronic data 120 of a patient satisfies the criteria, the patient prioritization system 130 may identify the patient as an eligible patient and retain that patient’s electronic data 120 for subsequent analysis. If the electronic data of a patient does not satisfy the criteria, in some embodiments the patient prioritization system 130 identifies the patient as an ineligible patient. The electronic data 120 of the ineligible patient may not be retained or may be set aside and not included for subsequent analysis.
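By way of non-limiting illustration, the eligibility comparison described above can be sketched as follows. The record fields, the function name `is_eligible`, and the default cut-offs (50-80 years old, 20+ pack years) are assumptions chosen for this example; they mirror the USPSTF-style criteria mentioned in the text but are not a schema defined by this disclosure.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # Hypothetical fields; the disclosure does not define a record schema.
    patient_id: str
    age: int
    pack_years: float
    has_prior_scan: bool
    has_prior_biopsy: bool
    has_prior_cancer_diagnosis: bool


def is_eligible(p, min_age=50, max_age=80, min_pack_years=20.0):
    """Return True when the record satisfies every illustrative criterion."""
    in_age_group = min_age <= p.age <= max_age
    smoking_history = p.pack_years >= min_pack_years
    no_prior_workup = not (
        p.has_prior_scan or p.has_prior_biopsy or p.has_prior_cancer_diagnosis
    )
    return in_age_group and smoking_history and no_prior_workup


records = [
    PatientRecord("A", 62, 30.0, False, False, False),
    PatientRecord("B", 45, 25.0, False, False, False),  # outside the age group
    PatientRecord("C", 70, 40.0, True, False, False),   # has a prior scan
]

# Eligible records are retained; ineligible records are set aside.
eligible = [p for p in records if is_eligible(p)]
ineligible = [p for p in records if not is_eligible(p)]
```

In this sketch, only patient "A" is retained for subsequent analysis.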
[0056] In various embodiments, the patient prioritization system 130 organizes the electronic data 120 of the eligible patients prior to subsequent analysis. For example, the patient prioritization system 130 may organize the electronic data of eligible patients into one or more scalar tables to facilitate the subsequent analysis (e.g., to facilitate the later extraction of features). In various embodiments, different scalar tables can be generated for different types of electronic data. For example, a scalar table can be generated for EHR data and a second scalar table can be generated for claims data. As another example, separate scalar tables can be generated for patient demographic data, and observations data (e.g., diagnoses, procedures, and/or prescription codes). In some embodiments, four separate scalar tables are generated including 1) patient demographics table, 2) diagnosis table, 3) procedures table, and 4) prescriptions table.
[0057] Reference is now made to FIG. 2B, which depicts an example diagram for organizing the electronic data, in accordance with an embodiment. Specifically, the top of FIG. 2B shows a unique patient demographic table that organizes the patient demographic data of patients. For example, the patient demographic data can include the patient’s birth year, patient’s gender, patient’s ethnicity, patient’s race, patient’s division, and the like. In various embodiments, the patient demographic table may only include demographic data of eligible patients. The bottom three tables shown in FIG. 2B include a diagnoses table, a procedures table, and a prescriptions table using, for example, National Drug Codes (NDC). The diagnoses table shown on the bottom left of FIG. 2B includes an identifier of the patient (e.g., “Patient ID”) and further includes codes for one or more diagnoses (shown as “Code Type” and “Diag Code”). Example code types include ICD-9 or ICD-10 codes. Example ICD-9 and ICD-10 codes are described in further detail herein. Furthermore, the diagnoses table can include a time (e.g., estimated date) at which the data was recorded (shown as “Est Dt”). The procedures table shown in the bottom middle of FIG. 2B includes an identifier of the patient (e.g., “Patient ID”) and further includes codes for one or more procedures (shown as “Code Type” and “Proc Code”). Example code types include Current Procedural Terminology (CPT) or Healthcare Common Procedure Coding System (HCPCS) codes. Example CPT/HCPCS codes are described in further detail herein. Furthermore, the procedures table can include a time (e.g., estimated date) at which the data was recorded (shown as “Est Dt”). The prescriptions (NDC) table shown in the bottom right of FIG. 2B includes an identifier of the patient (e.g., “Patient ID”) and further includes one or more prescriptions (shown as “NDC”). Example prescriptions are described in further detail herein. Furthermore, the prescriptions table can include a time (e.g., estimated date) at which the data was recorded (shown as “Est Dt”).
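By way of non-limiting illustration, the four-table organization described above can be sketched as follows. The dictionary keys (`patient_id`, `code_type`, `code`, `ndc`, `est_dt`), the `kind` routing field, and the function name `build_scalar_tables` are assumptions made for this example, and the code values are placeholders rather than real ICD, CPT, or NDC codes.

```python
def build_scalar_tables(demographics, events):
    """Route flat event rows into diagnoses, procedures, and prescriptions tables."""
    tables = {
        "demographics": [dict(row) for row in demographics],
        "diagnoses": [],
        "procedures": [],
        "prescriptions": [],
    }
    route = {
        "diagnosis": "diagnoses",
        "procedure": "procedures",
        "prescription": "prescriptions",
    }
    for event in events:
        # Drop the routing field; every other column is kept as-is.
        row = {key: value for key, value in event.items() if key != "kind"}
        tables[route[event["kind"]]].append(row)
    return tables


demographics = [{"patient_id": "P1", "birth_year": 1958, "gender": "F"}]
events = [
    {"kind": "diagnosis", "patient_id": "P1", "code_type": "ICD-10",
     "code": "DX_PLACEHOLDER", "est_dt": "2021-03-01"},
    {"kind": "procedure", "patient_id": "P1", "code_type": "CPT",
     "code": "PROC_PLACEHOLDER", "est_dt": "2021-04-12"},
    {"kind": "prescription", "patient_id": "P1",
     "ndc": "NDC_PLACEHOLDER", "est_dt": "2021-04-20"},
]

tables = build_scalar_tables(demographics, events)
```

Each resulting table carries the patient identifier and the recording date, so later feature extraction can join on the patient and order observations by time.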
[0058] Returning to FIG. 2A, for each of the eligible patients, a subsequent analysis 220 is performed to analyze the electronic data of the patient and to determine whether the patient is to be categorized as a candidate patient or a non-candidate patient. As shown in FIG. 2A, the analysis 220 includes accessing the electronic data 255 of the patient, extracting features 260, and analyzing the extracted features using a risk prediction model 265 to generate a risk prediction 270. The risk prediction 270 is useful for determining whether the patient is to be categorized as a candidate patient or a non-candidate patient. As shown in FIG. 2A, the analysis 220 can be performed multiple times for multiple patients, and each performance of the analysis 220 is patient specific. For example, for Z eligible patients, the analysis 220 can be performed Z times to determine whether the Z eligible patients are to be categorized as candidate patients or non-candidate patients.
[0059] A first step in the analysis 220 involves extracting features 260 from the electronic data 255 of a patient. Here, this step may be performed by the feature extraction module 145 described in FIG. 1B. In various embodiments, features include any of a patient demographic datapoint, a diagnosis code, a procedures code, a prescriptions code, or a combination thereof.
[0060] A patient demographic datapoint can be a value representing any of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status. For example, a value representing the patient age can, in various embodiments, be the patient age itself. As another example, a value for a patient ethnicity may be an integer value, where a particular integer value represents a particular patient ethnicity.
[0061] A diagnosis code feature can be a diagnosis code, such as an ICD-9 or an ICD-10 diagnosis code (or information representative thereof). Example ICD-9 and ICD-10 codes are shown below in Tables 1 and 2. A procedures code feature can be a procedures code, such as a Current Procedural Terminology (CPT) or Healthcare Common Procedure Coding System (HCPCS) code (or information representative thereof). Example CPT and HCPCS codes are shown below in Tables 3 and 4. A prescriptions code feature can be a particular prescription. Example prescriptions are shown in Table 5.
[0062] In various embodiments, extracting the features 260 involves encoding the features that can then be analyzed by a machine learning model. For example, encoding the features can involve encoding the features into an input vector that can be analyzed by a machine learning model. Any suitable way to encode categorical variables may be used to encode the features, for example, one-hot encoding, label/ordinal encoding, target encoding, feature hashing, binary encoding, or count encoding.
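By way of non-limiting illustration, one of the listed options, a presence/absence form of one-hot encoding, can be sketched as follows. The function names `build_index` and `encode` and the code strings are placeholders assumed for this example.

```python
def build_index(code_sets):
    """Map every code seen in the reference data to a fixed vector position."""
    vocabulary = sorted({code for codes in code_sets for code in codes})
    return {code: position for position, code in enumerate(vocabulary)}


def encode(codes, index):
    """Encode a patient's codes as a presence/absence input vector."""
    vector = [0.0] * len(index)
    for code in codes:
        position = index.get(code)
        if position is not None:  # codes outside the vocabulary are ignored
            vector[position] = 1.0
    return vector


training_codes = [{"DX_A", "PROC_X"}, {"DX_B"}]
index = build_index(training_codes)
vector = encode({"DX_B", "PROC_X", "UNSEEN"}, index)
```

Fixing the vocabulary once keeps the position of each code stable, so vectors for different patients are directly comparable.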
[0063] In various embodiments, the feature extraction module 145 may differentially weigh the extracted features before inputting them into the machine learning model. In some embodiments, differential weighing of the extracted features need not be performed. In some embodiments, the feature extraction module 145 differentially weighs the extracted features according to timepoints that the data were recorded. For example, the feature extraction module 145 assigns higher weights to features from the electronic data that were more recently recorded in comparison to features from data that were earlier recorded in the electronic record. Given that more recently recorded electronic data 120 may be more informative of the patient’s current risk for cancer as opposed to electronic data 120 that was recorded earlier, the more recently recorded electronic data is assigned a higher weight to reflect its increased informativeness. In some embodiments, weights may be assigned based on, but not limited to, dates, file sizes, medical professional names, and the like.
[0064] In various embodiments, differentially weighing the extracted features involves modifying the features according to different weight values. For example, if the features are encoded as an input vector, differentially weighing the features can involve modifying individual entries of the input vector by the different assigned weights to generate weighted features. Thus, values of features that are assigned higher weights can be increased relative to values of features that are assigned smaller weights.
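By way of non-limiting illustration, recency-based weighting of an encoded input vector can be sketched as follows. The exponential decay, the one-year half-life, and the rule of keeping the largest weight per code are assumptions made for this example; the disclosure only requires that more recent data receive higher weights.

```python
from datetime import date


def recency_weight(recorded, as_of, half_life_days=365.0):
    """Weight in (0, 1] that halves for every half-life of record age."""
    age_days = max((as_of - recorded).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weighted_features(events, index, as_of):
    """Scale each vector entry by the weight of its most recent observation."""
    vector = [0.0] * len(index)
    for code, recorded in events:
        position = index[code]
        vector[position] = max(vector[position], recency_weight(recorded, as_of))
    return vector


index = {"DX_A": 0, "DX_B": 1}  # placeholder codes
as_of = date(2024, 1, 1)
events = [
    ("DX_A", date(2023, 1, 1)),  # one year old
    ("DX_B", date(2024, 1, 1)),  # recorded on the reference date
    ("DX_A", date(2022, 1, 1)),  # two years old, superseded by the newer entry
]
weighted = weighted_features(events, index, as_of)
```

The entry for the code recorded on the reference date keeps full weight, while the entry last recorded a year earlier is halved.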
[0065] The next step of the analysis 220 involves applying a risk prediction model 265 (e.g., by the model deployment module 160 shown in FIG. 1B) to analyze the features. In various embodiments, a risk prediction model analyzes the features and generates a risk prediction 270, which may be informative for determining whether the patient is to be categorized as a candidate patient or non-candidate patient. In various embodiments, the risk prediction 270 can be represented by a value. In various embodiments, the risk prediction 270 is a binary value (e.g., 0 or 1, where 0 indicates unlikely to develop cancer in a certain time period and 1 indicates likely to develop cancer in a certain time period). In various embodiments, the risk prediction 270 is represented by a score, such as a continuous value (e.g., between 0 and 1, where a value closer to 1 indicates higher likelihood of developing cancer).
[0066] In various embodiments, the risk prediction model 265 is a regression model (e.g., a logistic regression or linear regression model) that calculates a risk prediction 270 by combining a set of trained parameters with the extracted features. As another example, the risk prediction model can be a neural network model that calculates a risk prediction 270 by combining a set of trained parameters associated with nodes and layers of the neural network with values of the extracted features. As another example, the risk prediction model 265 can be a random forest model that calculates a risk prediction 270 by combining a set of trained parameters associated with decision tree nodes with values of the extracted features. As another example, the risk prediction model 265 can be a gradient boosted machine model that calculates a risk prediction 270 by combining a set of trained parameters associated with decision tree nodes with values of the extracted features.
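By way of non-limiting illustration, the logistic regression case, in which trained parameters are combined with the extracted features to produce a score between 0 and 1, can be sketched as follows. The coefficient and intercept values are made up for this example and do not come from any trained model.

```python
import math


def risk_prediction(features, coefficients, intercept):
    """Combine trained parameters with feature values and squash to (0, 1)."""
    linear = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


coefficients = [2.0, -1.0]  # illustrative trained parameters
intercept = -2.0

score_neutral = risk_prediction([1.0, 0.0], coefficients, intercept)
score_lower = risk_prediction([0.0, 1.0], coefficients, intercept)
```

A score closer to 1 indicates a higher predicted likelihood; here the first input yields 0.5 and the second a lower score.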
[0067] In various embodiments, the risk prediction model 265 analyzes feature groupings, where a feature grouping represents 2 or more extracted features. Extracted features in a feature grouping may be related. For example, extracted features of a feature grouping can be related according to an anatomical organ or according to the patient, examples of which include patient behavior, patient characteristics, smoking status, and vaccination status. Exemplary feature groupings are described herein and further shown below in Table 6. In such embodiments, the risk prediction model 265 may combine individual features into respective feature groupings, and then analyzes the feature groupings to determine the risk prediction 270. In various embodiments, the risk prediction model 265 analyzes both individual features, as well as feature groupings in generating the risk prediction 270. In various embodiments, the risk prediction model 265 analyzes only feature groupings in generating the risk prediction 270.
[0068] In various embodiments, the risk prediction model 265 further performs an analysis to determine the relative contributions of feature groupings that resulted in the risk prediction 270. The relative contributions of feature groupings are additionally referenced herein as subscores. In various embodiments, the risk prediction model 265 determines a relative contribution of a feature or feature grouping by constructing and calculating outputs across various scenarios, a subset of scenarios including the extracted feature or feature grouping and another subset of scenarios excluding the extracted feature or feature grouping. Thus, by analyzing the changes of the outputs across the various scenarios, the relative contribution of the extracted feature or feature grouping can be deduced.
[0069] Grouping features into feature groupings and performing analysis using feature grouping to determine relative contribution may use less resources than determining relative contribution using ungrouped features. In various embodiments, the risk prediction model 265 groups features into any number of feature groupings such that the risk prediction model 265 only needs to determine a reduced number of relative contributions.
[0070] In various embodiments, the risk prediction model 265 performs a Shapley Additive Explanation (SHAP) analysis to determine SHAP values for feature groupings. Generally, the SHAP analysis takes into account all different combinations of input variables with different subsets of the predictor vector as contributing to the output prediction. In various embodiments, the risk prediction model 265 performs a Kernel SHAP analysis that calculates contributions of feature groupings across fewer scenarios. For example, and without limitation, Kernel SHAP uses a weighted linear regression, where the coefficients of the linear regression represent the contributions of the feature groupings. The various scenarios may be used to determine weights of the linear regression to determine the contributions of the feature groupings. In some embodiments, other machine learning models and/or algorithms may be used to determine contributions of feature groupings, without limitation, such as supervised, unsupervised, or other machine learning models.
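By way of non-limiting illustration, the scenario-based calculation of contributions for feature groupings can be sketched as follows. This example computes exact Shapley values by enumerating every scenario over the groupings; it is not the Kernel SHAP approximation, which would sample fewer scenarios. The stand-in `model`, the all-zero baseline used to represent an excluded grouping, and the grouping names are assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def group_shapley(model, x, baseline, groups):
    """Exact Shapley contribution of each feature grouping to model(x)."""
    names = list(groups)
    n = len(names)

    def value(present):
        # Scenario: groupings in `present` take the patient's values,
        # all other groupings are replaced by the baseline values.
        masked = list(baseline)
        for name in present:
            for i in groups[name]:
                masked[i] = x[i]
        return model(masked)

    contributions = {}
    for name in names:
        others = [other for other in names if other != name]
        total = 0.0
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                total += weight * (value(subset + (name,)) - value(subset))
        contributions[name] = total
    return contributions


def model(v):
    # Stand-in linear scorer so the expected contributions are easy to check.
    return 0.1 + 0.2 * v[0] + 0.3 * v[1] + 0.4 * v[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
groups = {"smoking": [0, 1], "lung": [2]}  # hypothetical groupings

subscores = group_shapley(model, x, baseline, groups)
```

With G groupings the enumeration visits 2^G scenarios, which is why working with a small number of groupings rather than individual features keeps the calculation tractable. The contributions sum to the difference between the prediction for the patient and the prediction for the baseline.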
[0071] The risk prediction 270 may be calculated based on the combination of contributions from feature groupings. For example, the risk prediction 270 may, in various embodiments, be a summation of the contributions from individual feature groupings.
[0072] Given the risk prediction 270, the candidate patient identifier module 165, as shown in FIG. 1B, can compare the risk prediction 270 to a threshold value to determine whether the patient is a candidate patient or a non-candidate patient. The threshold value may be a fixed score. For example, if the risk prediction 270 is above the threshold score, the patient is classified into one category (e.g., a candidate patient). Alternatively, if the risk prediction 270 is below the threshold score, the patient is classified into a different category (e.g., non-candidate patient).
[0073] In various embodiments, the threshold score is set according to a quantity of available medical resources at a third party. Thus, each threshold score may represent a custom threshold score that is personalized for each third party. For example, as described herein, a third party may provide an indication that reflects the available medical resources of the third party. Thus, the candidate patient identifier module 165 may set a threshold score according to the indication that reflects the available medical resources. For example, for a third party that is severely limited on resources, the third party sends an indication reflecting those limited resources. The candidate patient identifier module 165 may set a high threshold value (e.g., at least 0.6, at least 0.7, or at least 0.8) such that fewer patients have scores above the threshold value and are identified as candidate patients. As another example, for a third party that is not limited on resources, the third party sends an indication reflecting nonlimited resources. The candidate patient identifier module 165 may set a lower threshold value (e.g., at most 0.4, at most 0.3, or at most 0.2) such that more patients have scores above the threshold value and are identified as candidate patients. Threshold values may be set based on patient categories, classifications, demographics, diagnosis, and the like.
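By way of non-limiting illustration, one way to derive a custom threshold score from a third party's indication can be sketched as follows. This example assumes the indication is simply the number of patients the third party can serve, and places the threshold between the last patient who fits within that capacity and the next one; the disclosure does not prescribe this particular rule.

```python
def threshold_for_capacity(scores, capacity):
    """Pick a threshold so that at most `capacity` scores lie above it."""
    if capacity <= 0:
        return float("inf")   # no resources: nobody is a candidate
    ranked = sorted(scores, reverse=True)
    if capacity >= len(ranked):
        return float("-inf")  # resources are not limiting
    # Midpoint between the last included and the first excluded score.
    return (ranked[capacity - 1] + ranked[capacity]) / 2.0


def categorize(scores_by_patient, threshold):
    """Label each patient by comparing the risk score to the threshold."""
    return {
        patient_id: "candidate" if score > threshold else "non-candidate"
        for patient_id, score in scores_by_patient.items()
    }


scores_by_patient = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2}

limited = threshold_for_capacity(scores_by_patient.values(), capacity=2)
labels_limited = categorize(scores_by_patient, limited)

ample = threshold_for_capacity(scores_by_patient.values(), capacity=10)
labels_ample = categorize(scores_by_patient, ample)
```

A third party with tighter resources ends up with a higher threshold and fewer candidate patients, consistent with the behavior described above. When scores tie at the cut-off, this rule selects fewer patients than the capacity rather than more.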
[0074] Following identification of the candidate patients 140, the patient prioritization system 130 provides the identification of the candidate patients 140 to a third party (e.g., a third party managing the care for the candidate patients). Thus, the third party can appropriately prioritize its available medical resources to provide interventions to the candidate patients.
[0075] In various embodiments, the patient prioritization system 130 can additionally provide, to the third party, identification of one or more feature groupings that contributed to the categorization of patients as candidate patients. Here, the patient prioritization system 130 can provide identification of feature groupings on a per-patient basis. For example, for each candidate patient, the patient prioritization system 130 provides identification of the specific feature groupings that contributed to the categorization of the patient as a candidate patient.
[0076] In various embodiments, the patient prioritization system 130 ranks the features or feature groupings that contributed to the categorization of a patient as a candidate patient. The patient prioritization system 130 can provide the top-ranked feature or top-ranked feature grouping. In various embodiments, the patient prioritization system 130 can provide one or more features or feature groupings in accordance with their rank. For example, the patient prioritization system 130 provides the top 3 features or feature groupings that contributed to the categorization of a patient as a candidate patient. For another example, the patient prioritization system 130 provides the third-ranked feature or feature grouping that contributed to the categorization of a patient as a candidate patient.
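By way of non-limiting illustration, ranking the feature groupings for one candidate patient can be sketched as follows. The grouping names and subscore values are invented for this example.

```python
def rank_groupings(subscores):
    """Order feature groupings from largest to smallest contribution."""
    ranked = sorted(subscores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked]


# Hypothetical per-patient subscores for four feature groupings.
patient_subscores = {
    "smoking": 0.31,
    "lung": 0.22,
    "heart": 0.05,
    "patient characteristics": 0.12,
}

ranking = rank_groupings(patient_subscores)
top_three = ranking[:3]
third_ranked = ranking[2]
```

The same ranking supports both reporting styles described above: a top-N list or a single grouping at a given rank.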
[0077] The third party can use the identification of features or feature groupings to select and provide care to a candidate patient. For example, if the top feature grouping that most heavily contributes to the patient being categorized as a candidate patient is the smoking behavior of the patient, the third party can appropriately counsel the candidate patient regarding the smoking behavior (e.g., counsel to reduce smoking or terminate smoking).
IV. Training a Machine Learning Model
[0078] Generally, a machine learning model is structured such that it analyzes features extracted from electronic data, such as features extracted from EHR data and/or features extracted from claims data, and generates a prediction informative for classifying a patient as a candidate or non-candidate patient.
[0079] The risk prediction model can use any suitable machine learning model, such as a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, gradient boosting (e.g., a XGBoost gradient boosting model or a CatBoost gradient boosting model), support vector machine, Naive Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, or any combination thereof).
[0080] The risk prediction model can be trained using a machine learning implemented method, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naive Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof. In various embodiments, the risk prediction model is trained using supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms (e.g., partial supervision), weak supervision, transfer, multi-task learning, or any combination thereof.
[0081] In various embodiments, the risk prediction model has one or more parameters, such as hyperparameters or model parameters. Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function. Model parameters are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of neural network, support vectors in a support vector machine, node values in a decision tree, and coefficients in a regression model. The model parameters of the risk prediction model are trained (e.g., adjusted) using the training data to improve the predictive capacity of the risk prediction model.
[0082] The model training module 155 trains the risk prediction model using training data. In various embodiments, the training data includes extracted features from electronic data (e.g., EHR data and/or claims data) obtained from training individuals. As used herein, a training individual may be an individual known to be at risk or not be at risk for cancer. In various embodiments, a training individual may be an individual known to not be at risk for cancer if the training individual is not subsequently diagnosed with cancer. For example, such a training individual known to not be at risk for cancer may be an individual who later underwent an intervention (e.g., CT/PET imaging and/or biopsy) and was determined to not have cancer. In various embodiments, a training individual may be an individual known to be at risk for cancer if the training individual is subsequently diagnosed with cancer. For example, such a training individual known to be at risk for cancer may be an individual who later underwent an intervention (e.g., CT/PET imaging and/or biopsy) and was determined to have cancer. In various embodiments, a training individual may be an individual known to be at risk for cancer if the training individual is subsequently diagnosed with cancer at a timepoint at least A months in the future. For example, the at least A months in the future is sufficiently distant in the future such that when the electronic records were obtained from the training individual, the cancer in the training individual represented an early-stage cancer. In various embodiments, A may be at least 1 month. In some embodiments, A may be between about 1 month to about 55 months.
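By way of non-limiting illustration, assigning a reference label to a training individual using the look-ahead period of A months can be sketched as follows. The function name, the whole-month arithmetic, and the choice to return `None` for a diagnosis that falls sooner than A months are assumptions made for this example; the disclosure does not state how such individuals are handled.

```python
from datetime import date


def label_training_individual(record_date, diagnosis_date, min_months=1):
    """Return 1 (at risk), 0 (not at risk), or None (left unlabeled here)."""
    if diagnosis_date is None:
        return 0  # never subsequently diagnosed with cancer
    months = (diagnosis_date.year - record_date.year) * 12
    months += diagnosis_date.month - record_date.month
    if diagnosis_date.day < record_date.day:
        months -= 1  # count only completed months
    # Assumption: diagnoses sooner than A months are not used as positives.
    return 1 if months >= min_months else None


record_date = date(2020, 1, 15)
label_positive = label_training_individual(record_date, date(2020, 7, 15), min_months=6)
label_too_soon = label_training_individual(record_date, date(2020, 7, 14), min_months=6)
label_negative = label_training_individual(record_date, None, min_months=6)
```

Requiring the diagnosis to fall at least A months after the record date is what ties a positive label to records captured while any cancer was still early-stage.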
[0083] In various embodiments, the training data can be obtained from a split of a dataset. For example, the dataset can undergo a 50:50 training:testing dataset split. In some embodiments, the dataset can undergo a 60:40 training:testing dataset split. In some embodiments, the dataset can undergo an 80:20 training:testing dataset split.
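By way of non-limiting illustration, a shuffled training:testing split can be sketched as follows. The fixed seed and the function name `split_dataset` are assumptions made for this example.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it at the requested fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten training individuals
train_80, test_20 = split_dataset(rows, train_fraction=0.8)
train_50, test_50 = split_dataset(rows, train_fraction=0.5)
```

Every row lands in exactly one of the two partitions, so the testing data remains unseen during training.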
[0084] In various embodiments, the training data used for training the imputation model includes reference ground truths that indicate whether a training individual was subsequently diagnosed with cancer (hereafter also referred to as “positive” or “+”) or was not subsequently diagnosed with cancer (hereafter also referred to as “negative” or “-”). In various embodiments, the reference ground truths in the training data are binary values, such as “1” or “0.” For example, a training individual that was subsequently diagnosed with cancer can be identified in the training data with a value of “1” whereas a training individual who was not diagnosed with cancer can be identified in the training data with a value of “0.” In various embodiments, the model training module 155 trains the risk prediction model using the training data to minimize a loss function such that the risk prediction model can better generate a prediction (e.g., a score informative for determining whether the patient is a candidate or non-candidate patient) based on the input (e.g., extracted features of the electronic data). In various embodiments, the loss function is constructed for any of a least absolute shrinkage and selection operator (LASSO) regression, Ridge regression, or ElasticNet regression.
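By way of non-limiting illustration, a loss function covering the LASSO, Ridge, and ElasticNet cases for a logistic scorer can be sketched as follows. The parameter names `alpha` and `l1_ratio`, the averaging of the log loss, and the factor of one half on the squared penalty are conventions assumed for this example, not values given in this disclosure.

```python
import math


def elastic_net_log_loss(coefficients, intercept, X, y, alpha=0.1, l1_ratio=0.5):
    """Mean binary log loss plus a blended L1/L2 penalty on the coefficients."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for features, label in zip(X, y):
        linear = intercept + sum(w * v for w, v in zip(coefficients, features))
        p = 1.0 / (1.0 + math.exp(-linear))
        p = min(max(p, eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    data_term = total / len(y)
    l1 = sum(abs(w) for w in coefficients)
    l2 = sum(w * w for w in coefficients)
    # l1_ratio = 1.0 gives a LASSO penalty, 0.0 gives a Ridge penalty.
    return data_term + alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


X = [[1.0], [2.0]]
y = [1, 0]  # reference ground truths: 1 = positive, 0 = negative
loss_at_zero = elastic_net_log_loss([0.0], 0.0, X, y)
loss_lasso = elastic_net_log_loss([2.0], 0.0, [[0.0]], [1], alpha=0.1, l1_ratio=1.0)
```

Training then amounts to searching for the coefficients and intercept that make this quantity as small as possible on the training data.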
[0085] In various embodiments, risk prediction models disclosed herein achieve a performance metric. Example performance metrics include an area under the curve (AUC) of a receiver operating characteristic (ROC) curve, a positive predictive value, and/or a negative predictive value. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.5. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.6. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.7. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.8. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.9. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.95. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.99. In various embodiments, risk prediction models disclosed herein exhibit an AUC value of at least 0.51. In some embodiments, AUC values may be between about 0.51 to about 0.99, without limitation.
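By way of non-limiting illustration, the AUC can be computed from held-out labels and scores as the fraction of positive/negative pairs that the model orders correctly, with ties counted as one half. The labels and scores below are invented for this example.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

Three of the four positive/negative pairs are ordered correctly here, giving an AUC of 0.75; a value of 0.5 corresponds to chance-level ordering.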
[0086] In various embodiments, risk prediction models disclosed herein achieve an odds ratio, which refers to the relative risk of the higher risk population (e.g., candidate patients) compared to the standard risk. In various embodiments, risk prediction models disclosed herein achieve an odds ratio of at least 1.1 to about 3.0, without limitation.
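By way of non-limiting illustration, an odds ratio can be computed from outcome counts in two groups as follows. This example assumes the comparison group is the set of non-candidate patients; the disclosure refers only to "the standard risk", so the choice of comparison group and the counts shown are assumptions.

```python
def odds_ratio(cases_high, total_high, cases_reference, total_reference):
    """Odds of cancer in the higher-risk group divided by the reference odds."""
    non_cases_high = total_high - cases_high
    non_cases_reference = total_reference - cases_reference
    if min(cases_reference, non_cases_high, non_cases_reference) <= 0:
        raise ValueError("each cell used as a divisor must be positive")
    odds_high = cases_high / non_cases_high
    odds_reference = cases_reference / non_cases_reference
    return odds_high / odds_reference


# Invented counts: 30 of 100 candidates vs. 10 of 100 reference patients.
ratio = odds_ratio(30, 100, 10, 100)
```

A ratio above 1 indicates that cancer is more common among the patients flagged as candidates than in the comparison group.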
V. Example Methods for Prioritizing Medical Resources for Screening Cancer Patients
[0087] As described herein, example methods for prioritizing medical resources for screening cancer patients involve analysis of features from electronic records using a trained machine learning model. In some embodiments, the features of the electronic records are weighted according to timepoints that data were received. For example, more recently recorded data in the electronic records can be assigned a higher weight than earlier recorded data in the electronic records. More recently recorded data may be more reflective of a current state of the patient and therefore, a higher weight of the more recently recorded data enables the machine learning model to appropriately account for the timing of the recordation. Altogether, this enables the machine learning model to predict candidate patients more accurately.
[0088] Reference is now made to FIG. 3A, which depicts an example flow process for identifying candidate patients, in accordance with a first embodiment. Step 310 involves obtaining a temporally diverse dataset of electronic records. Here, the temporally diverse dataset includes information of patients that is recorded at various timepoints. For example, for a patient, the temporally diverse dataset may include EHR and/or claims data obtained from the patient during a first hospital visit, and may further include EHR and/or claims data obtained from the same patient during a second hospital visit.
[0089] Step 315 involves an overall step of categorizing a patient as a candidate patient or a non-candidate patient. As shown in FIG. 3A, step 315 includes step 320 and step 330. Step 315 can be performed multiple times across different patients to determine whether each of the patients is a candidate patient or a non-candidate patient.
[0090] Step 320 involves weighting features from data of the electronic records according to timepoints that the data of the electronic records were recorded. Specifically, features from data more recently recorded in the electronic records are more heavily weighted than features from data that were earlier recorded in the electronic records.
[0091] Step 330 involves analyzing the weighted features using a trained machine learning model. The trained machine learning model outputs a prediction for categorizing the patient as a candidate patient or a non-candidate patient.
[0092] Step 335 involves providing identification of the candidate patients. In various embodiments, step 335 involves providing identification of the candidate patients to a third party that is managing the care of the patient (e.g., a hospital or physician’s office).
[0093] As shown in FIG. 3A, the flow process can restart again at step 310. Here, a new temporally diverse dataset of electronic records can be obtained. Here, the temporally diverse dataset of electronic records may include data that was newly recorded since a prior version of the temporally diverse dataset was obtained.
[0094] In various embodiments, example methods for prioritizing medical resources for screening cancer patients involve analyzing groupings of features from electronic records using a trained machine learning model. Here, these feature groupings can be analyzed to determine how much each feature grouping contributed to the prediction of the machine learning model (e.g., a prediction informative of a candidate patient or a non-candidate patient).
[0095] Reference is now made to FIG. 3B, which depicts an example flow process for identifying candidate patients, in accordance with a second embodiment. Step 340 involves obtaining a dataset comprising electronic records. The electronic records may include one or both of EHR data and claims data for one or more patients.
[0096] Step 345 involves an overall step of categorizing a patient as a candidate patient or a non-candidate patient. As shown in FIG. 3B, step 345 includes step 350 and step 360. Step 345 can be performed multiple times across different patients to determine whether each of the patients is a candidate patient or a non-candidate patient.
[0097] Step 350 involves extracting features from data of the electronic records. Example features from EHR data and/or claims data are described herein.
[0098] Step 360 involves analyzing features using a trained machine learning model. The machine learning model can output a score indicative of cancer risk for the patient. The score indicative of cancer risk is determinative of whether the patient is a candidate patient or a non-candidate patient. In various embodiments, the machine learning model can further identify a feature grouping that contributed to the score indicative of cancer risk. For example, the identified feature grouping may be a feature grouping that most heavily contributed to the score indicative of cancer risk.
[0099] Step 365 involves providing identification of the candidate patients and the corresponding identifications of feature groupings. In various embodiments, step 365 involves providing identification of the candidate patients and the corresponding identifications of feature groupings to a third party that is managing the care of the patient (e.g., a hospital or physician’s office).
[00100] As shown in FIG. 3B, the flow process can restart again at step 340. Here, a new dataset of electronic records can be obtained. The new dataset of electronic records may include data that was newly recorded since a prior version of the temporally diverse dataset was obtained.
[00101] In various embodiments, example methods for prioritizing medical resources for screening cancer patients involve categorizing patients according to an indication reflecting available medical resources of a third party. For example, the third party may manage the care of various patients and may have limited medical resources. Thus, the third party may need to prioritize the medical resources for a subset of the patients (e.g., candidate patients). An example third party may be a hospital or physician’s office that cares for the patients and/or stores electronic records (e.g., EHR data and/or claims data) related to the patients. Methods disclosed herein can be useful for identifying candidate patients such that the third party can prioritize medical resources and provide interventions to the candidate patients first.
[00102] Reference is now made to FIG. 3C, which depicts an example interaction diagram for identifying candidate patients, in accordance with a third embodiment. The interaction diagram shows an example patient prioritization system 130 and a third party 370. At step 372, the third-party stores electronic data, such as electronic data for one or more patients. At step 375, the patient prioritization system 130 receives the electronic data of patients. Furthermore, at step 378, the patient prioritization system 130 receives an indication reflecting the available medical resources of the third party 370. At step 380, the patient prioritization system 130 extracts features from data of the electronic records (e.g., EHR data and/or claims data). Example features of EHR data and/or claims data is further described herein.
[00103] At step 382, the patient prioritization system 130 categorizes patients using the indication received from the third party 370. Here, step 382 involves analyzing features using a trained machine learning model to generate a prediction of whether patients are to be categorized as candidate patients or non-candidate patients. In some embodiments, the patient prioritization system 130 establishes a threshold score using the indication received from the third party 370 such that the machine learning model uses the threshold score to categorize patients as candidate patients or non-candidate patients. The patient prioritization system 130 sets a threshold score to meet the available medical resources for the third party 370. Thus, at step 385, the patient prioritization system 130 can provide identification of a tailored set of candidate patients to the third party 370. Then, at step 390, the third party 370 can provide an intervention to the candidate patients, while withholding the intervention for non-candidate patients.
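One way to picture how a threshold score could be derived from the third party's resource indication is sketched below. This is a minimal illustration, not the disclosed implementation: it assumes the indication is simply a count of patients the third party can serve (`capacity`), and the function names, patient identifiers, and scores are hypothetical.

```python
def threshold_for_capacity(scores, capacity):
    """Return the lowest score that still qualifies as a candidate when
    only `capacity` patients can receive the intervention (assumed input)."""
    ranked = sorted(scores.values(), reverse=True)
    if capacity <= 0 or not ranked:
        return float("inf")  # no capacity or no patients: nobody qualifies
    # Score of the last patient that fits within the available capacity.
    return ranked[min(capacity, len(ranked)) - 1]


def categorize(scores, threshold):
    """Label each patient as candidate or non-candidate by threshold."""
    return {
        pid: ("candidate" if score >= threshold else "non-candidate")
        for pid, score in scores.items()
    }


# Hypothetical model scores for four patients.
scores = {"p1": 0.91, "p2": 0.15, "p3": 0.67, "p4": 0.42}
threshold = threshold_for_capacity(scores, capacity=2)
labels = categorize(scores, threshold)
```

Note that tied scores at the threshold would admit more patients than `capacity`; a real deployment would need an explicit tie-breaking rule.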
VI. Example Electronic Data
[00104] Methods disclosed herein involve analyzing electronic data of patients to categorize patients as candidate or non-candidate patients. Electronic data generally refers to data gathered from patients that is stored in electronic form. Exemplary electronic data include electronic health record (EHR) data and claims data of patients. In various embodiments, electronic data further includes timepoints for which the EHR data and/or claims data of patients were recorded.
[00105] Referring first to EHR data, it represents readily available medical information that may have been previously obtained from patients (e.g., obtained over one or more patient visits to a hospital or physician’s office). For example, EHR data represents an electronic version of a patient’s medical history. In various embodiments, EHR data comprises one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data. In some embodiments, EHR data comprises each of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
[00106] Referring to claims data, it represents administrative data collected from patients, examples of which include information from doctor’s appointments, bills, and insurance information. In various embodiments, claims data comprise one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data. In various embodiments, claims data comprise each of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
[00107] Patient demographics data of the EHR data and/or the claims data can refer to background characteristics of the patient. Example patient demographics data include patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active (e.g., number of months for which EHR data was stored for a patient), and/or insurance status. In some embodiments, patient demographics data includes patient behavior. Examples of patient behavior can include number of prior hospitalizations, number of prior physician visits, number of emergency room visits, and number of unique providers. In some embodiments, the patient behavior includes the smoking behavior of a patient. The smoking behavior of a patient can be identified as one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoker.
[00108] Prior diagnoses data of the EHR data and/or the claims data can refer to a number of prior diagnoses and/or identifications of prior diagnoses for the patient. Such diagnoses data of the EHR data can include diagnosis codes, which correspond to diagnoses of lung-related or non-lung related issues. Example diagnosis codes for diagnosing lung-related issues are shown below in Table 1. Furthermore, example diagnosis codes for diagnosing non-lung related issues are shown below in Table 2.
[00109] Prior procedures data of the EHR data and/or the claims data can refer to a number of prior procedures and/or identifications of prior procedures for the patient. Such procedures data can include procedure codes, which correspond to performed procedures for lung-related or non-lung related issues. Example procedure codes for lung-related procedures are shown below in Table 3. Furthermore, example procedure codes for non-lung related procedures are shown below in Table 4.
[00110] Prior prescriptions data of the EHR data and/or the claims data can refer to a number of prior prescriptions and/or identifications of prior prescriptions that were provided to the patient. Example prior prescriptions can include prescriptions for treating a lung-related condition or a non-lung related condition. Example prior prescriptions are shown below in Table 5. The right-most column of Table 5 shows the target body area of the drug, including lung and non-lung (e.g., blood, digestion, heart, mental, allergies, skin, reproductive, hormone, smoke, vaccine, and general) conditions.
[00111] In various embodiments, EHR data and claims data may include overlapping information of patients. For example, both the EHR data and claims data can include patient demographics data for patients. As another example, both the EHR data and claims data can include prior diagnoses data for patients. As another example, both the EHR data and claims data can include prior procedures data for patients. As another example, both the EHR data and claims data can include prior prescriptions data for patients. Overlapping patient data between the EHR data and claims data can be useful for verifying patient data, as the overlapping patient data would represent more reliable patient information.
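As a rough illustration of how overlapping fields might be cross-checked, the sketch below compares the fields that two records share and reports which ones agree. The record layout, field names, and placeholder codes are assumptions made for this example only; the disclosure does not specify a verification procedure.

```python
def verify_overlap(ehr, claims):
    """Compare fields present in both records.

    Returns a mapping of shared field name -> True if the two sources agree.
    Fields found in only one source are ignored here.
    """
    shared = ehr.keys() & claims.keys()
    return {field: ehr[field] == claims[field] for field in shared}


# Hypothetical records for one patient; codes are placeholders.
ehr_record = {
    "age": 63,
    "sex": "F",
    "dx_codes": {"DX_COPD"},
    "language": "en",  # EHR-only field, not present in claims
}
claims_record = {
    "age": 63,
    "sex": "F",
    "dx_codes": {"DX_COPD", "DX_HYPERTENSION"},
}

agreement = verify_overlap(ehr_record, claims_record)
```

Fields that agree across both sources could be treated as corroborated, while disagreements (here, the diagnosis code sets) could be flagged for review or merged.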
[00112] In various embodiments, EHR data may include additional patient information that is not available in the claims data, and vice versa. For example, EHR data can further include additional demographics information of additional specificity that may not be available in the claims data. Such additional demographics data can include living situation (e.g., single or living alone, married, or living together) as well as language (e.g., primary spoken language). In various embodiments, EHR data can further include laboratory test data that may not be available in claims data. For example, EHR data can include measurements of characteristics or quantitative values for one or more biomarkers determined for the patient. Example laboratory test data can include values for alanine aminotransferase (ALT), body mass index, cholesterol, creatinine, forced expiratory volume (FEV-1), FEV-1/FVC ratio, glucose, high-density lipoprotein (HDL), international normalized ratio (INR), potassium, low-density lipoprotein (LDL), mean corpuscular hemoglobin concentration (MCHC-M), platelets, red cell distribution width (RDW), triglycerides, and white blood cells (WBC). Further examples of EHR data and claims data are described in Franklin, Jessica M., et al. "The relative benefits of claims and electronic health record data for predicting medication adherence trajectory." American heart journal 197 (2018): 153-162, which is hereby incorporated by reference in its entirety.
[00113] Altogether, in such embodiments where the EHR data and claims data differ, each dataset can be used to supplement the other dataset. Thus, using both EHR data and claims data enables more accurate prediction and identification of candidate patients in comparison to the use of either dataset alone.
Example Feature Groupings
[00114] Methods disclosed herein further involve analyzing two or more extracted features (e.g., in a feature grouping) to determine whether a patient is to be categorized as a candidate patient or a non-candidate patient. By analyzing a feature grouping comprising two or more extracted features (e.g., analyzing using a trained machine learning model), methods disclosed herein can involve determining a contribution of the feature grouping that resulted in the categorization of the patient as a candidate patient or non-candidate patient.
[00115] As used herein, a “feature grouping” refers to one or more extracted features. In some embodiments, a feature grouping refers to 2 or more extracted features. A feature grouping may refer to about 2 extracted features to about 30 extracted features, without limitation. A feature grouping may refer to more than 30 extracted features, in some embodiments. For example, extracted features of a feature grouping can be related according to an anatomical organ, such as any one of a brain, heart, blood, thorax, eyes, lung, abdomen, colon, cervix, pancreas, kidney, liver, muscle, lymph nodes, oral cavity, pharynx, larynx, esophagus, intestine, spleen, stomach, and gall bladder. As another example, extracted features of a feature grouping can be related to the patient, examples of which include patient behavior, patient characteristics, smoking status, and vaccination status. Example feature groupings can include Lung, Heart, Preventative Care, Blood, Digestion, Tobacco Use/Smoking, Mental Health, Reproductive, Oral Cavity/Pharynx/Larynx, Pain/Pain Management, Health Measures/Benchmarks, and Vision.
[00116] Further exemplary feature groupings are shown below in Table 6.
VII. Cancers
[00117] Methods described herein involve prioritizing medical resources by identifying candidate patients likely to be at risk of a cancer using risk prediction models. In various embodiments, the cancer in the patient can include one or more of: lymphoma, B cell lymphoma, T cell lymphoma, mycosis fungoides, Hodgkin's Disease, myeloid leukemia, bladder cancer, brain cancer, nervous system cancer, head and neck cancer, squamous cell carcinoma of head and neck, kidney cancer, lung cancer, neuroblastoma/glioblastoma, ovarian cancer, pancreatic cancer, prostate cancer, skin cancer, liver cancer, melanoma, squamous cell carcinomas of the mouth, throat, larynx, and lung, colon cancer, cervical cancer, cervical carcinoma, breast cancer, and epithelial cancer, renal cancer, genitourinary cancer, pulmonary cancer, esophageal carcinoma, stomach cancer, thyroid cancer, head and neck carcinoma, large bowel cancer, hematopoietic cancer, testicular cancer, colon and/or rectal cancer, uterine cancer, or prostatic cancer. In some embodiments, the cancer in the patient can be a metastatic cancer, including any one of bladder cancer, breast cancer, colon cancer, kidney cancer, lung cancer, melanoma, ovarian cancer, pancreatic cancer, prostatic cancer, rectal cancer, stomach cancer, thyroid cancer, or uterine cancer. In some embodiments, the cancer is a lung cancer. In some embodiments, the cancer is a type of lung cancer, including any one of small cell lung cancer, non-small cell lung cancer, non-small cell carcinoma, adenocarcinoma, squamous cell cancer, large cell carcinoma, small cell carcinoma, combined small cell carcinoma, neuroendocrine tumor, lung sarcoma, lung lymphoma, or bronchial carcinoids.
[00118] In some embodiments, the cancer is an early-stage cancer. In some embodiments, the early-stage cancer is a stage I cancer. In some embodiments, the early-stage cancer is a stage II cancer. In various embodiments, the early-stage cancer is an early-stage lung cancer. In various embodiments, the early-stage lung cancer refers to a stage prior to the development of nodules, such as lung nodules or lymph node nodules. In various embodiments, the early-stage lung cancer may not yet have been previously diagnosed or identified (e.g., via biopsy or imaging). Thus, methods disclosed herein can be useful for prioritizing patients that would most benefit from subsequent analysis (e.g., via biopsy or imaging).
VIII. Interventions
[00119] Embodiments described herein involve prioritizing medical resources by identifying candidate patients likely to be at risk of a cancer using risk prediction models. In various embodiments, the methods disclosed herein are performed on patients who have not previously received any of the following: an image scan (e.g., any of a LDCT/Chest-CT/PET/PET-CT scan), a lung cancer biopsy procedure, or a lung cancer diagnosis. Thus, analyzing electronic records (e.g., electronic health records and/or claims data) in silico, without the need for images or biopsy information, enables rapid and cost-effective evaluation of patients to guide the provision of interventions only to candidate patients that would most likely benefit from the intervention.
[00120] In various embodiments, the intervention can be any one of: application of a diagnostic, application of a prophylactic therapeutic agent, or a subsequent action. Example subsequent actions can include a subsequent testing of the patient to confirm whether the patient develops cancer. Subsequent testing can include any of a subsequent biopsy (e.g., cancer biopsy or lymph node biopsy) or subsequent image scanning (e.g., CT scanning, PET scanning, MRI scanning, ultrasound imaging, or X-ray imaging). In some embodiments, the subsequent testing includes performing a CT or PET image scanning. The CT or PET image scanning can then be used to confirm the risk of cancer in the patient. In some embodiments, the subsequent testing includes performing a chest CT or PET image scanning.
[00121] In various embodiments, subsequent testing of the patient can occur at a next scheduled visit or at a pre-determined amount of time such as, but not limited to, about 1 month to about 24 months after predicting the future risk of cancer. In some embodiments, a pre-determined amount of time may be less than 1 month or greater than 24 months. In various embodiments, additional subsequent actions can include subsequent actions to treat a cancer that has developed in the patient, such as tumor resection, bronchoscopic diagnosis, selection and/or administration of therapeutic(s), selection/administration of pharmaceutical composition, or any combination thereof.
[00122] In various embodiments, a therapeutic agent can be selected and/or administered to the patient based on the predicted future risk of cancer. The selected therapeutic agent is likely to delay or prevent the development of the cancer, such as lung cancer. Exemplary therapeutic agents include chemotherapies, energy therapies (e.g., external beam, microwave, radiofrequency ablation, brachytherapy, electroporation, cryoablation, photothermal ablation, laser therapy, photodynamic therapy, electrocauterization, chemoembolization, high intensity focused ultrasound, low intensity focused ultrasound), antigen-specific monoclonal antibodies, anti-inflammatories, oncolytic viral therapies, or immunotherapies. In various embodiments, the selected therapeutic agent is an energy therapy and the amount (e.g., dose and duration) of the energy applied can be tailored to achieve a desired therapeutic effect. In various embodiments the therapeutic agent is a small molecule or biologic, e.g., a cytokine, antibody, soluble cytokine receptor, anti-sense oligonucleotide, siRNA, etc. Such biologic agents encompass muteins and derivatives of the biological agent, which derivatives can include, for example, fusion proteins, PEGylated derivatives, cholesterol conjugated derivatives, and the like as known in the art. Also included are antagonists of cytokines and cytokine receptors, e.g., traps and monoclonal antagonists. Also included are biosimilar or bioequivalent drugs to the active agents set forth herein.
[00123] Therapeutic agents for lung cancer can include chemotherapeutics such as docetaxel, cisplatin, carboplatin, gemcitabine, Nab-paclitaxel, paclitaxel, pemetrexed, gefitinib, erlotinib, brigatinib (Alunbrig®), capmatinib (Tabrecta®), selpercatinib (Retevmo®), entrectinib (Rozlytrek®), lorlatinib (Lorbrena®), larotrectinib (Vitrakvi®), dacomitinib (Vizimpro®), and vinorelbine. Therapeutic agents for lung cancer can include antibody therapies such as durvalumab (Imfinzi®), nivolumab (Opdivo®), pembrolizumab (Keytruda®), atezolizumab (Tecentriq®), canakinumab, and ramucirumab.
[00124] In various embodiments, one or more of the therapeutic agents described can be combined as a combination therapy for treating the patient.
[00125] In various embodiments, a pharmaceutical composition can be selected and/or administered to the patient based on the patient-level risk of metastatic cancer, the selected therapeutic agent being likely to exhibit efficacy against the cancer. A pharmaceutical composition administered to an individual includes an active agent such as the therapeutic agent described above. The active ingredient is present in a therapeutically effective amount, i.e., an amount sufficient when administered to treat a disease or medical condition mediated thereby. The compositions can also include various other agents to enhance delivery and efficacy, e.g., to enhance delivery and stability of the active ingredients. Thus, for example, the compositions can also include, depending on the formulation desired, pharmaceutically acceptable, nontoxic carriers or diluents, which are defined as vehicles commonly used to formulate pharmaceutical compositions for animal or human administration. The diluent may be selected so as not to affect the biological activity of the combination. Examples of such diluents are distilled water, buffered water, physiological saline, PBS, Ringer’s solution, dextrose solution, and Hank’s solution. In addition, the pharmaceutical composition or formulation can include other carriers, adjuvants, or non-toxic, nontherapeutic, nonimmunogenic stabilizers, excipients and the like. The compositions can also include additional substances to approximate physiological conditions, such as pH adjusting and buffering agents, toxicity adjusting agents, wetting agents, and detergents. The composition can also include any of a variety of stabilizing agents, such as an antioxidant.
[00126] The pharmaceutical compositions or therapeutic agents described herein can be administered in numerous ways. Examples include administering a composition containing a pharmaceutically acceptable carrier via oral, intranasal, intramodular, intralesional, rectal, topical, intraperitoneal, intravenous, intramuscular, subcutaneous, subdermal, transdermal, intrathecal, endobronchial, transthoracic, or intracranial method.
[00127] In various embodiments, a clinical response can be provided to the patient based on the predicted future risk of cancer generated for the patient by implementing risk prediction models. In various embodiments, a clinical response can include providing counseling to modify a behavior of the patient (e.g., counsel the patient about smoking cessation to reduce risk), initiating an inhaled/topical, intravenous or enteral (by mouth) therapeutic that could delay/prevent malignant transformation, slow tumor growth or even prevent spread of disease (metastasis), establishing an adaptive screening schedule for future risk similar to what is done with colonoscopy for polyps (e.g., individuals predicted to be higher risk for future lung cancer should have more frequent follow up and imaging), or performing or scheduling to be performed an additional risk prediction test to confirm the predicted future risk of lung cancer (e.g., persons deemed to be higher risk for lung cancer may also then undergo additional testing to either confirm that risk or narrow the cancer type the person is at greatest risk for). In various embodiments, the additional risk prediction test could include blood-based biomarkers (to look for non-specific inflammation, which is a known risk for lung cancer) or metabolomics/proteomics/gene expression/genetic sequencing. The person could also have additional sampling of tissue (nasal epithelium, bronchial epithelium, etc.) to look at changes in gene expression in the respiratory tract.
IX. Computer Implementation
[00128] The methods disclosed herein, including prioritizing medical resources by identifying candidate patients likely to be at risk of a cancer using risk prediction models, are, in some embodiments, performed on one or more computers. For example, as shown in reference to FIGs. 1A and 1B, the patient prioritization system 130 can include one or more computers. Therefore, in various embodiments, the steps described in reference to the patient prioritization system 130 are performed in silico.
[00129] In various embodiments, the building and deployment of a risk prediction model can be implemented in hardware or software, or a combination of both. In one embodiment of the disclosure, a machine-readable storage medium is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of executing the training or deployment of risk prediction models and/or displaying any of the datasets or results (e.g., future risk of cancer predictions for patients) described herein. The disclosure can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, a pointing device, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
[00130] Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage medium or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
[00131] The signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information of the present disclosure. The databases of the present disclosure can be recorded on computer readable media, e.g., any medium that can be read and accessed directly by a computer. Such media include, but are not limited to magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. One of skill in the art can readily appreciate how any of the presently known computer readable mediums can be used to create a manufacture comprising a recording of the present database information. “Recorded” refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.
[00132] In some embodiments, the methods of the disclosure, including the methods of prioritizing medical resources by identifying candidate patients, are performed on one or more computers in a distributed computing system environment (e.g., in a cloud computing environment). In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared set of configurable computing resources. Cloud computing can be employed to offer on-demand access to the shared set of configurable computing resources. The shared set of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly. A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[00133] FIG. 4 illustrates an example computer for implementing the entities shown in FIGs. 1A, 1B, 2, 3A, and 3B. The computer 400 includes at least one processor 402 coupled to a chipset 404. The chipset 404 includes a memory controller hub 420 and an input/output (I/O) controller hub 422. A memory 406 and a graphics adapter 412 are coupled to the memory controller hub 420, and a display 418 is coupled to the graphics adapter 412. A storage device 408, an input device 414, and network adapter 416 are coupled to the I/O controller hub 422. Other embodiments of the computer 400 have different architectures.
[00134] The storage device 408 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 406 holds instructions and data used by the processor 402. The input device 414 is a touch-screen interface, a mouse, track ball, or other type of pointing device, a keyboard, or some combination thereof, and is used to input data into the computer 400. In some embodiments, the computer 400 may be configured to receive input (e.g., commands) from the input device 414 via gestures from the user. The network adapter 416 couples the computer 400 to one or more computer networks.
[00135] The graphics adapter 412 displays images and other information on the display 418. In various embodiments, the display 418 is configured such that the user (e.g., radiologist, oncologist, pulmonologist) may input user selections on the display 418 to, for example, initiate risk prediction for a patient, order any additional exams or procedures, and/or set parameters for the risk prediction models. In one embodiment, the display 418 may include a touch interface.
[00136] In various embodiments, the display 418 can show one or more predictions of a risk prediction model. For example, the display 418 can show a score indicative of lung cancer risk for the patient. As another example, the display 418 can show scores for feature groupings that contribute to the score indicative of lung cancer risk. Example information shown on a display 418 is depicted in FIGs. 6A and 6B, and described in further detail below.
[00137] A user who accesses the display 418 can inform the patient of the score indicative of lung cancer risk. In various embodiments, the display 418 can show information such as the feature groupings that most heavily contributed to the score indicative of lung cancer risk. Displaying the top contributing feature groupings can provide context to a user (e.g., a clinician user) in understanding the features that resulted in the score indicative of lung cancer risk.
[00138] The computer 400 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 408, loaded into the memory 406, and executed by the processor 402.
[00139] The types of computers 400 can vary depending upon the embodiment and the processing power required by the entity. For example, the patient prioritization system 130 can run in a single computer 400 or multiple computers 400 communicating with each other through a network such as in a server farm. The computers 400 can lack some of the components described above, such as graphics adapters 412 and displays 418.
[00140] Further disclosed herein are systems for prioritizing medical resources by identifying candidate patients. In various embodiments, such a system can include at least the patient prioritization system 130 described above in FIG. 1A. In various embodiments, the patient prioritization system 130 is embodied as a computer system, such as a computer system with example computer 400 described in FIG. 4.
EXAMPLES
[00141] Below are example embodiments for carrying out the present disclosure. The examples are offered for illustrative purposes only and are not intended to limit the scope of the present disclosure in any way.
Example 1: Example Categorization of Patients
[00142] Example 1 describes an algorithm to identify candidate patients. It is noted that the current USPSTF recommendations are based on certain factors. For example, the USPSTF recommends that patients who are between 50-80 years old and have a 20+ pack-year smoking history pursue a preliminary low dose computed tomography (LDCT) scan for possible lung cancer. However, there are drawbacks to using the USPSTF recommendations. For example, the 20+ pack-year smoking history is largely a self-reported data point, and its accuracy depends on patients reporting the correct number, if they report at all.
[00143] According to data from the United States Preventive Services Task Force (USPSTF) Recommendations, there were an estimated 90 million current and former smokers across all ages. 45 million current and former smokers are between the ages of 50-80 years old, with only 15 million of those individuals having a 20 pack-year smoking history. The related statistics using the USPSTF Recommendations are as follows: a one-year positive predictive value (PPV) of 0.62% and a population incidence of 0.21%.
[00144] Comparatively, the machine learning algorithm in accordance with this disclosure analyzed patients between 50-80 years old with a smoking related observation who also have not received any of the following:
- an LDCT/Chest-CT/PET/PET-CT scan,
- a lung cancer biopsy procedure, or
- a lung cancer diagnosis.
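The eligibility criteria listed above can be sketched as a simple filter. The `Patient` record, its field names, and the sample cohort are hypothetical, and the age range is assumed to be inclusive at both ends; this is an illustration of the stated criteria rather than the disclosed implementation.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical per-patient summary derived from electronic records."""
    patient_id: str
    age: int
    has_smoking_observation: bool
    prior_scan: bool = False            # LDCT / Chest-CT / PET / PET-CT
    prior_biopsy: bool = False          # lung cancer biopsy procedure
    prior_lung_cancer_dx: bool = False  # lung cancer diagnosis


def is_eligible(p: Patient) -> bool:
    """Apply the study criteria: age 50-80 (assumed inclusive), a smoking
    related observation, and no prior scan, biopsy, or diagnosis."""
    no_prior_workup = not (p.prior_scan or p.prior_biopsy or p.prior_lung_cancer_dx)
    return 50 <= p.age <= 80 and p.has_smoking_observation and no_prior_workup


cohort = [
    Patient("p1", 63, True),
    Patient("p2", 47, True),                   # too young
    Patient("p3", 70, True, prior_scan=True),  # already scanned
    Patient("p4", 55, False),                  # no smoking observation
]
eligible_ids = [p.patient_id for p in cohort if is_eligible(p)]
```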
[00145] At a high level, the machine learning algorithm identifies candidate patients at risk of lung cancer. Various machine learning models can be used (the results of which are described below in Example 2). For example, machine learning models may be developed using a gradient boosting decision tree algorithm such as CatBoost or XGBoost. Other machine learning models may be developed, e.g., a neural network. The algorithm may be trained on features extracted from stored electronic data (e.g., lung issues, heart issues) and claims data (e.g., procedure codes), and may be analyzed through Shapley analysis. Each third party, e.g., clinical site, can adjust a threshold for demarcating elevated versus standard risk of lung cancer, based on site preference. Once the threshold has been set, the patient’s electronic health records and/or claims data may be used as input. The raw output of the algorithm may be a floating-point number (propensity score) between 0 and 1 corresponding to a lung cancer risk score. A higher number may indicate higher risk. The raw output may also include a list of “features” (each “feature” is either an individual feature, or a feature grouping of related features to ease Shapley computational burden) and their Shapley additive contribution to the lung cancer score. The formatted output may be visible to the user and may be a binary output of elevated versus standard risk. The user interface may also show feature(s) that contributed most to the score.
[00146] FIG. 5A depicts an example data pipeline for developing the algorithm (e.g., machine learning model). Native electronic data (e.g., electronic health record (EHR) or claims data) were retrieved and eligible patient with the study criteria (e.g., 50-80 years old, smoking related observation, and no prior scan, biopsy procedure, or cancer diagnosis) were identified. Here, the selected patient data can be provided (from health care provider or parsed from health care provider output) in the form of four tables listing the following info for each patient:
• diagnosis (Dx) codes (e.g., lung issues such as COPD, chronic bronchitis, pneumonia, emphysema; heart issues such as hypertension, vascular disease, hyperlipidemia)
• procedure codes
• prescription (NDC) codes
• patient demographic codes
[00147] Patients may be split into training, testing, and validation datasets. Using the training cohort, the data may undergo feature identification. Furthermore, the patient data may undergo a transformation (e.g., a hyperbolic tangent transformation) to change input values to be between -1 and 1. Next, the patient data may be used to train the machine learning models. The parameters and/or hyperparameters of the machine learning models may be tuned during training, and final versions of the models may be saved after training.
[00148] Following training, the machine learning models may undergo further validation. Reference is now made to FIG. 5B, which depicts an example data pipeline for validating the algorithm (e.g., machine learning model). No further training or tuning of the machine learning models occurs in this phase. Using patient data from patients categorized in the validation set or testing set, the machine learning models may be deployed to determine final performance metrics.
[00149] Specifically, software may process patient data into “scalar tables.” The patient data may be input into the algorithm, and the algorithm may analyze designated input features (diagnoses, procedures, prescriptions, demographics). The example features are described herein in Tables 1-5.
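The splitting and normalization steps described above can be sketched as follows. This is an illustration under stated assumptions: the transformation is taken to be a hyperbolic tangent (as recited in the claims), and the `scale` parameter, the 70/15/15 split fractions, and the seeded shuffle are hypothetical choices that are not specified in this disclosure.

```python
import math
import random


def tanh_normalize(values, scale=1.0):
    """Map raw feature values into the range -1 to 1 with a hyperbolic tangent.

    `scale` is a hypothetical tuning parameter. Mathematically the output is
    strictly inside (-1, 1); floating-point rounding can return exactly
    -1.0 or 1.0 for inputs of very large magnitude.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return [math.tanh(v / scale) for v in values]


def split_patients(patient_ids, train_frac=0.70, test_frac=0.15, seed=0):
    """Shuffle patient IDs and split them into training, testing, and validation sets.

    The fractions are assumptions; the remainder goes to the validation set.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_test = int(len(ids) * test_frac)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]
    return train, test, validation
```

Splitting by patient rather than by record keeps all data for a given patient in a single dataset, which avoids leaking a patient's information between training and evaluation.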
[00150] The raw output of the algorithm may include:
• “Propensity score” or “lung cancer score” (e.g., normalized to 0-1, or any other suitable scale). The lung cancer score may be compared to a predetermined threshold to classify a patient as having an elevated risk or a standard risk for future lung cancer.
• Shapley additive contribution to lung cancer score for each feature or feature grouping.
[00151] The output to a health care provider may include:
• Classification of patient as having elevated or standard risk
• Top 3 features or feature groupings contributing most to the lung cancer score per their Shapley contributions
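The mapping from raw output to provider-facing output can be sketched as below. The function name `summarize_patient`, the summing of per-feature Shapley values within a grouping, the ranking by signed (rather than absolute) contribution, and the treatment of a score equal to the threshold as elevated are all assumptions made for illustration; the numeric values in the usage note are invented.

```python
def summarize_patient(score, shap_by_feature, feature_groups, threshold, top_n=3):
    """Turn a raw score and per-feature Shapley values into provider-facing output.

    shap_by_feature: feature name -> Shapley contribution to the score.
    feature_groups:  feature name -> grouping name; features without an entry
                     are treated as their own grouping.
    """
    group_totals = {}
    for feature, value in shap_by_feature.items():
        group = feature_groups.get(feature, feature)
        # Shapley values are additive, so a grouping's contribution is
        # approximated here as the sum over its member features (assumption).
        group_totals[group] = group_totals.get(group, 0.0) + value

    # Rank groupings by how much they raise the score (assumption: signed ranking).
    ranked = sorted(group_totals.items(), key=lambda kv: kv[1], reverse=True)

    # A score exactly at the threshold is treated as elevated (assumption).
    risk = "elevated" if score >= threshold else "standard"
    return {"risk": risk, "top_drivers": [name for name, _ in ranked[:top_n]]}
```

For instance, with a score of 0.73, a site threshold of 0.5, and hypothetical contributions in which smoking-related features sum highest, the function returns an elevated classification with the smoking status grouping listed first.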
[00152] For example, FIGs. 6A and 6B depict example outputs for a patient with a standard lung cancer risk and a patient with an elevated lung cancer risk, respectively. FIG. 6A shows the results of a patient with standard lung cancer risk (e.g., a non-candidate patient). Here, the machine learning model predicted an overall score for the patient (e.g., “propensity score”) of 0.30. Given that the score is below a threshold value, the patient is categorized as a non-candidate patient. The chart on the left as well as the table on the right in FIG. 6A show individual contributions of various features. The contributions are denoted as “SHAP values.” The “core drivers” shown in FIG. 6A indicate the most influential and important feature groupings that contributed to the propensity score.
[00153] FIG. 6B shows the results of a patient with elevated lung cancer risk (e.g., a candidate patient). Here, the machine learning model predicted an overall score for the patient (e.g., “propensity score”) of 0.73. Given that the score is above a threshold value, the patient is categorized as a candidate patient. The chart on the left as well as the table on the right in FIG. 6B show individual contributions of various features. The contributions are denoted as “SHAP values.” The “core drivers” shown in FIG. 6B of Smoking Status and COPD indicate the most influential and important feature groupings that contributed to the propensity score.
[00154] Altogether, the candidate patient corresponding to the results in FIG. 6B can be prioritized for subsequent screening (e.g., imaging, such as a subsequent CT scan) whereas the non-candidate patient corresponding to the results in FIG. 6A can be withheld from subsequent screening.
Example 2: Categorization of Patients Using Logistic Regression, Neural Network, XGBoost, and CatBoost Machine Learning Models
[00155] Various machine learning models may be developed to show the applicability of the disclosed methodology across different machine learning model architectures. Using the methodology described in Example 1, each of a logistic regression machine learning model, a neural network machine learning model, an XGBoost gradient boosted decision tree, and a CatBoost gradient boosted decision tree may be developed.
[00156] Reference is now made to FIG. 7, which shows performance of various machine learning models (e.g., logistic regression, neural network, XGBoost, and CatBoost). As shown in FIG. 7, each of the machine learning models identified patients that were at high risk of developing lung cancer. The various machine learning approaches may provide varying results. For example, FIG. 7 shows that the gradient boosting decision tree algorithms (CatBoost, XGBoost) appear to provide the best results, followed by the neural network and then logistic regression. However, each machine learning model achieved an Odds Ratio (here, the Odds Ratio refers to the relative risk of the high-risk population compared to the standard-risk population) greater than 1, indicating that all machine learning models were successful.
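As an illustration of how an odds ratio greater than 1 can be read, the sketch below computes the standard epidemiological odds ratio from a 2x2 table of outcomes. This is an assumption about the metric: the exact computation behind FIG. 7 is not specified here, and the counts used in the usage note are invented.

```python
def odds_ratio(cases_high, noncases_high, cases_std, noncases_std):
    """Odds of lung cancer in the high-risk group divided by the odds in the standard-risk group.

    Arguments are patient counts from a 2x2 table:
      cases_high / noncases_high: patients classified as elevated risk
      cases_std  / noncases_std:  patients classified as standard risk
    """
    if min(noncases_high, cases_std, noncases_std) <= 0:
        raise ValueError("counts in the denominators must be positive")
    odds_high = cases_high / noncases_high
    odds_std = cases_std / noncases_std
    return odds_high / odds_std
```

With hypothetical counts of 30 cases among 1,000 elevated-risk patients and 10 cases among 1,000 standard-risk patients, `odds_ratio(30, 970, 10, 990)` is roughly 3.06, meaning the odds of a diagnosis are about three times higher in the elevated-risk group. A value of exactly 1 would indicate that the classification carries no information about risk.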
Claims
1. A method for prioritizing medical resources for screening a patient for cancer, the method comprising: obtaining a temporally diverse dataset comprising electronic records of a patient; weighting features from data of the electronic records of the patient according to timepoints that the data were recorded in the electronic records of the patient; analyzing the weighted features for the patient using a machine learning model to categorize the patient as a candidate patient or a non-candidate patient; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient for prioritization of medical resources.
2. The method of claim 1, wherein weighting the features according to the timepoints that the data were recorded in the electronic records of the patient comprises assigning higher weights to features from data that were more recently recorded in the electronic records in comparison to features from data that were earlier recorded in the electronic records.
3. The method of claim 1 or 2, further comprising: normalizing the data of the electronic records of the patient.
4. The method of claim 3, wherein normalizing the data comprises applying a hyperbolic tangent transformation.
5. The method of any one of claims 1-4, wherein the machine learning model outputs a score indicative of cancer risk for the patient.
6. The method of claim 5, wherein the score indicative of cancer risk is a continuous score between 0 and 1.
7. The method of claim 5 or 6, wherein the machine learning model further outputs an identification of a feature or feature grouping that contributed to the score indicative of cancer risk.
8. The method of claim 7, wherein providing identification of the candidate patient for prioritization of medical resources further comprises providing the corresponding identifications of features or feature groupings of the candidate patient for prioritization of medical resources.
9. The method of any one of claims 1-8, wherein the features from data of the electronic records comprises features from electronic health record (EHR) data.
10. The method of any one of claims 1-8, wherein the features from data of the electronic records comprises features from medical claims data.
11. The method of any one of claims 1-8, wherein the features from data of the electronic records comprises features from EHR data and medical claims data.
12. The method of claim 9 or 11, wherein the features from EHR data comprises one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data.
13. The method of claim 12, wherein the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
14. The method of claim 13, wherein the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke.
15. The method of claim 10 or 11, wherein the features from medical claims data comprises one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
16. The method of claim 15, wherein the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
17. The method of claim 16, wherein the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke.
18. The method of any one of claims 12-17, wherein the prior diagnoses data comprises one or more diagnostic codes.
19. The method of claim 18, wherein the one or more diagnostic codes comprise ICD-9 or ICD-10 codes.
20. The method of claim 18, wherein the one or more diagnostic codes comprise ICD-10 codes, wherein one or more ICD-10 codes were converted from one or more ICD-9 codes.
21. The method of any one of claims 12-16, wherein the prior procedures data comprises one or more procedures codes.
22. The method of claim 21, wherein the one or more procedures codes comprise HCPCS or CPT-4 codes.
23. The method of any one of claims 12-16, wherein the prior prescriptions data comprises one or more national drug codes (NDCs).
24. The method of any one of claims 1-23, wherein the patient is between 50-80 years old.
25. The method of any one of claims 1-24, wherein the patient exhibits a prior smoking history.
26. The method of any one of claims 1-25, wherein the patient has not previously undergone a computed tomography (CT) scan, a positron emission tomography (PET) scan, or a PET-CT scan.
27. The method of any one of claims 1-26, wherein the patient has not previously undergone a cancer biopsy procedure.
28. The method of any one of claims 1-27, wherein the patient has not previously received a cancer diagnosis.
29. The method of any one of claims 1-28, wherein the cancer comprises lung cancer.
30. The method of claim 29, wherein the lung cancer is one of non-small cell lung cancer, small cell lung cancer, adenocarcinoma, and squamous cell carcinoma.
31. The method of any one of claims 1-29, wherein the prioritization of medical resources comprises prioritizing patients for undergoing computed tomography (CT) scans.
32. The method of any one of claims 1-31, wherein the machine learning model comprises a logistic regression model, a random forest model, or a neural network.
33. The method of any one of claims 1-31, wherein the machine learning model comprises a gradient boosted model.
34. The method of any one of claims 1-31, wherein the machine learning model comprises a gradient boosted model.
35. The method of any one of claims 1-31, wherein the machine learning model comprises a neural network.
36. The method of any one of claims 1-35, further comprising: obtaining updated electronic records for one or more patients, the updated electronic records comprising additional data recorded in the updated electronic records subsequent to providing identification of the candidate patient; for a patient of the one or more patients: analyzing features from at least additional data of the updated electronic records for the patient using a machine learning model to categorize the patient as an additional candidate patient at risk for cancer or a non-candidate patient; and responsive to determining that the patient of the one or more patients is an additional candidate patient, providing identification of the additional candidate patient for prioritization of medical resources.
37. A method for prioritizing medical resources for screening a patient for cancer, the method comprising obtaining a dataset comprising electronic records of a patient; receiving an indication of available medical resources of the third party; extracting features from data of the electronic records of the patient; analyzing the extracted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient, wherein the categorizing of the patient uses at least a prediction of the machine learning model and a threshold selected according to the received indication; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient for prioritization of medical resources.
38. The method of claim 37, wherein the threshold is selected to account for the available medical resources of the third party.
39. The method of claim 37 or 38, wherein a lower threshold is selected for the third party for an indication reflecting higher available medical resources of the third party, in comparison to a higher threshold that is selected for the third party for an indication reflecting lower available resources for the third party.
40. The method of any one of claims 37-39, further comprising: weighting the extracted features according to timepoints that the data were recorded in the electronic records of the patient.
41. The method of claim 40, wherein weighting the features according to timepoints that the data were recorded in the electronic records of the patient comprises assigning higher weights to features from data that were more recently recorded in the electronic records in comparison to values of features from data that were earlier recorded in the electronic records.
42. The method of any one of claims 37-41, further comprising: normalizing the data of the electronic records of the patient.
43. The method of claim 42, wherein normalizing the data comprises applying a hyperbolic tangent transformation.
44. The method of any one of claims 37-43, wherein the prediction of the machine learning model comprises a score indicative of cancer risk for the patient.
45. The method of claim 44, wherein the score indicative of cancer risk is a continuous score between 0 and 1.
46. The method of claim 43 or 44, wherein the prediction of the machine learning model further comprises an identification of feature groupings that contributed to the score indicative of cancer risk.
47. The method of claim 46, wherein providing identification of the candidate patient for prioritization of medical resources further comprises providing the corresponding identifications of feature groupings of the candidate patient for prioritization of medical resources.
48. The method of any one of claims 37-47, wherein the features from data of the electronic records comprises features from electronic health record (EHR) data.
49. The method of any one of claims 37-47, wherein the features from data of the electronic records comprises features from medical claims data.
50. The method of any one of claims 37-47, wherein the features from data of the electronic records comprises features from EHR data and medical claims data.
51. The method of claim 48 or 50, wherein the features from EHR data comprises one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data.
52. The method of claim 51, wherein the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
53. The method of claim 52, wherein the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke.
54. The method of claim 49 or 50, wherein the features from medical claims data comprises one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
55. The method of claim 54, wherein the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
56. The method of claim 55, wherein the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke.
57. The method of any one of claims 51-56, wherein the prior diagnoses data comprises one or more diagnostic codes.
58. The method of claim 57, wherein the one or more diagnostic codes comprise ICD-9 or ICD-10 codes.
59. The method of claim 57, wherein the one or more diagnostic codes comprise ICD-10 codes, wherein one or more ICD-10 codes were converted from one or more ICD-9 codes.
60. The method of any one of claims 51-56, wherein the prior procedures data comprises one or more procedures codes.
61. The method of claim 60, wherein the one or more procedures codes comprise HCPCS or CPT-4 codes.
62. The method of any one of claims 51-56, wherein the prior prescriptions data comprises one or more national drug codes (NDCs).
63. The method of any one of claims 37-62, wherein the patient is between 50-80 years old.
64. The method of any one of claims 37-63, wherein the patient exhibits a prior smoking history.
65. The method of any one of claims 37-64, wherein the patient has not previously undergone a computed tomography (CT) scan, a positron emission tomography (PET) scan, or a PET-CT scan.
66. The method of any one of claims 37-65, wherein the patient has not previously undergone a cancer biopsy procedure.
67. The method of any one of claims 37-66, wherein the patient has not previously received a cancer diagnosis.
68. The method of any one of claims 37-67, wherein the cancer comprises lung cancer.
69. The method of claim 68, wherein the lung cancer is one of non-small cell lung cancer, small cell lung cancer, adenocarcinoma, and squamous cell carcinoma.
70. The method of any one of claims 37-69, wherein the prioritization of medical resources comprises prioritizing patients for undergoing computed tomography (CT) scans.
71. The method of any one of claims 37-70, wherein the machine learning model comprises a logistic regression model.
72. The method of any one of claims 37-70, wherein the machine learning model comprises a random forest model.
73. The method of any one of claims 37-70, wherein the machine learning model comprises a gradient boosted model.
74. The method of any one of claims 37-70, wherein the machine learning model comprises a neural network.
75. The method of any one of claims 37-74, further comprising: obtaining updated electronic records for one or more patients, the updated electronic records comprising additional data recorded in the updated electronic records subsequent to providing identification of the candidate patient; for a patient of the one or more patients: analyzing features from at least additional data of the updated electronic records for the patient using a machine learning model to categorize the patient as an additional candidate patient at risk for cancer or a non-candidate patient; and responsive to determining that the patient of the one or more patients is an additional candidate patient, providing identification of the additional candidate patient for prioritization of medical resources.
76. A method for prioritizing medical resources for screening individuals for cancer, the method comprising: obtaining a dataset comprising electronic records of a patient; extracting features from data of the electronic records of the patient; analyzing the extracted features for the patient using a machine learning model to categorize the patient as a candidate patient at risk for cancer or a non-candidate patient, wherein the machine learning model is configured to output 1) a score indicative of lung cancer risk for the patient and 2) identification of a feature grouping that contributed to the score indicative of lung cancer risk, wherein the feature grouping comprises two or more features; and responsive to determining the patient as a candidate patient, providing identification of the candidate patient and the identification of the feature grouping to the third party for prioritization of medical resources.
77. The method of claim 76, wherein the feature grouping comprises between 2 and 10 features.
78. The method of claim 76 or 77, wherein the feature grouping comprises one of a lung issue grouping, heart issue grouping, smoking status grouping, patient characteristics grouping, patient behavior grouping, and vaccine grouping.
79. The method of claim 78, wherein the lung issue grouping comprises one or more of chronic obstructive pulmonary disease (COPD), chronic bronchitis, pleural effusion, dyspnea, wheezing, and inhaled treatment for COPD and/or asthma.
80. The method of claim 78, wherein the heart issue grouping comprises one or more of atherosclerotic heart disease, iron deficiency anemias, elevated blood pressure, treatment for high blood pressure, and treatment for reducing risk of heart attack and/or stroke.
81. The method of claim 78, wherein the smoking status grouping comprises one or more of tobacco use, nicotine dependence, cigarette use, smoking cessation, number of months actively smoking, never smoked observation, and current smoker observation.
82. The method of claim 78, wherein the patient characteristics grouping comprises one or more of systolic blood pressure, diastolic blood pressure, number of months active, patient age, and geographic location.
83. The method of claim 78, wherein the patient behavior grouping comprises one or more of prior established patient visits and new patient visits.
84. The method of claim 78, wherein the vaccine grouping comprises one or more of pneumonia vaccine and flu vaccine.
85. The method of any one of claims 76-84, wherein analyzing the extracted features using the machine learning model comprises implementing a Shapley additive contribution algorithm to determine contributions of one or more feature groupings.
86. The method of any one of claims 76-85, wherein the feature grouping identified by the output of the machine learning model comprises a feature grouping providing the highest contribution to the score.
87. The method of any one of claims 76-86, wherein the output of the machine learning model further comprises an identification of a second feature grouping providing the second highest contribution to the score.
88. The method of any one of claims 76-87, wherein the output of the machine learning model further comprises an identification of a third feature grouping providing the third highest contribution to the score.
89. The method of any one of claims 76-88, wherein categorizing the patient as a candidate patient or a non-candidate patient based on the score further comprises: selecting a threshold according to a received indication of available medical resources of the third party; and categorizing the patient as a candidate patient or a non-candidate patient using the score and the threshold.
90. The method of claim 89, wherein a lower threshold is selected for the third party for an indication reflecting higher available medical resources of the third party, in comparison to a higher threshold that is selected for the third party for an indication reflecting lower available resources for the third party.
91. The method of any one of claims 76-90, further comprising: weighting the extracted features according to timepoints that the data were recorded in the electronic records of the patient.
92. The method of claim 91, wherein weighting the features according to timepoints that the data were recorded in the electronic records of the patient comprises assigning higher weights to features from data that were more recently recorded in the electronic records in comparison to features from data that were earlier recorded in the electronic records.
93. The method of any one of claims 76-92, further comprising: normalizing the data of the electronic records of the patient.
94. The method of claim 93, wherein normalizing the data comprises applying a hyperbolic tangent transformation.
95. The method of any one of claims 76-94, wherein the features from data of the electronic records comprises features from electronic health record (EHR) data.
96. The method of any one of claims 76-94, wherein the features from data of the electronic records comprises features from medical claims data.
97. The method of any one of claims 76-94, wherein the features from data of the electronic records comprises features from EHR data and medical claims data.
98. The method of claim 95 or 97, wherein the features from EHR data comprises one or more of patient demographics data, laboratory test data, prior diagnoses data, prior procedures data, and prior prescriptions data.
99. The method of claim 98, wherein the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
100. The method of claim 99, wherein the patient behavior comprises smoking behavior selected from one of current smoker, previously smoked, never smoked, other smoking, not currently smoking, or unknown smoke.
101. The method of claim 96 or 97, wherein the features from medical claims data comprises one or more of patient demographics data, prior diagnoses data, prior procedures data, and prior prescriptions data.
102. The method of claim 101, wherein the patient demographics data comprises one or more of patient gender, patient age, patient behavior, patient ethnicity, patient race, patient geographic location, patient socioeconomic status, education status, number of months active, and/or insurance status.
103. The method of any one of claims 98-102, wherein the prior diagnoses data comprises one or more diagnostic codes.
104. The method of claim 103, wherein the one or more diagnostic codes comprise ICD-9 or ICD-10 codes.
105. The method of claim 103, wherein the one or more diagnostic codes comprise ICD-10 codes, wherein one or more ICD-10 codes were converted from one or more ICD-9 codes.
106. The method of any one of claims 98-102, wherein the prior procedures data comprises one or more procedures codes.
107. The method of claim 106, wherein the one or more procedures codes comprise HCPCS or CPT-4 codes.
108. The method of any one of claims 98-102, wherein the prior prescriptions data comprises one or more national drug codes (NDCs).
109. The method of any one of claims 76-108, wherein the patient is between 50-80 years old.
110. The method of any one of claims 76-109, wherein the patient exhibits a prior smoking history.
111. The method of any one of claims 76-110, wherein the patient has not previously undergone a computed tomography (CT) scan, a positron emission tomography (PET) scan, or a PET-CT scan.
112. The method of any one of claims 76-111, wherein the patient has not previously undergone a cancer biopsy procedure.
113. The method of any one of claims 76-112, wherein the patient has not previously received a cancer diagnosis.
114. The method of any one of claims 76-113, wherein the cancer comprises lung cancer.
115. The method of claim 114, wherein the lung cancer is one of non-small cell lung cancer, small cell lung cancer, adenocarcinoma, or squamous cell carcinoma.
116. The method of any one of claims 76-115, wherein the prioritization of medical resources comprises prioritizing patients for undergoing computed tomography (CT) scans.
117. The method of any one of claims 76-116, wherein the machine learning model comprises a logistic regression model.
118. The method of any one of claims 76-116, wherein the machine learning model comprises a random forest model.
119. The method of any one of claims 76-116, wherein the machine learning model comprises a gradient boosted model.
120. The method of any one of claims 76-116, wherein the machine learning model comprises a neural network.
121. The method of any one of claims 76-120, further comprising: obtaining updated electronic records for one or more patients, the updated electronic records comprising additional data recorded in the updated electronic records subsequent to providing identification of the candidate patient; for a patient of the one or more patients: analyzing features from at least additional data of the updated electronic records for the patient using a machine learning model to categorize the patient as an additional candidate patient at risk for cancer or a non-candidate patient; and responsive to determining that the patient of the one or more patients is an additional candidate patient, providing identification of the additional candidate patient for prioritization of medical resources.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263424570P | 2022-11-11 | 2022-11-11 | |
US63/424,570 | 2022-11-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024100632A1 (en) | 2024-05-16 |
Family
ID=88839511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2023/061440 WO2024100632A1 (en) | 2022-11-11 | 2023-11-13 | Systems and methods for prioritizing medical resources for cancer screening |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024100632A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100312581A1 (en) * | 2009-06-08 | 2010-12-09 | Peter James Wachtell | Process and system for efficient allocation of medical resources |
US20160125143A1 (en) * | 2014-10-31 | 2016-05-05 | Cerner Innovation, Inc. | Identification, stratification, and prioritization of patients who qualify for care management services |
US20190198163A1 (en) * | 2016-08-15 | 2019-06-27 | Babylon Partners Limited | A system and method for optimising supply networks |
US20210398660A1 (en) * | 2020-03-13 | 2021-12-23 | Kairoi Healthcare Strategies, Inc. | Time-based resource allocation for long-term integrated health computer system |
Non-Patent Citations (1)
Title |
---|
FRANKLIN, JESSICA M. ET AL.: "The relative benefits of claims and electronic health record data for predicting medication adherence trajectory", AMERICAN HEART JOURNAL, vol. 197, 2018, pages 153 - 162 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11869187B2 (en) | System and method for predicting the risk of future lung cancer | |
Subramanian et al. | An integrated breast cancer risk assessment and management model based on fuzzy cognitive maps | |
Selvanambi et al. | Lung cancer prediction using higher-order recurrent neural network based on glowworm swarm optimization | |
US8929625B2 (en) | Method and device for side-effect prognosis and monitoring | |
McNutt et al. | Using big data analytics to advance precision radiation oncology | |
Baek et al. | Survival time prediction by integrating cox proportional hazards network and distribution function network | |
US20230027734A1 (en) | System and Method for Predicting the Risk of Future Lung Cancer | |
Janik et al. | Machine learning–assisted recurrence prediction for patients with early-stage non–small-cell lung cancer | |
Shi et al. | An intelligent decision support algorithm for diagnosis of colorectal cancer through serum tumor markers | |
Jin et al. | Identification of immune-related biomarkers for sciatica in peripheral blood | |
Valencia-Moreno et al. | Breast cancer risk estimation with intelligent algorithms and risk factors for Cuban women | |
Li et al. | Ensemble learning-assisted prediction of prolonged hospital length of stay after spine correction surgery: a multi-center cohort study | |
Polat et al. | An improved approach to medical data sets classification: artificial immune recognition system with fuzzy resource allocation mechanism | |
US20240104340A1 (en) | Apparatus for enhancing longevity and a method for its use | |
Li et al. | ML3 LASSO (least absolute shrinkage and selection operator) and XGBoost (eXtreme gradient boosting) models for predicting depression-related work impairment in US working adults | |
WO2024100632A1 (en) | Systems and methods for prioritizing medical resources for cancer screening | |
WO2022256018A1 (en) | Machine learning based decision support system for spinal cord stimulation long term response | |
Setiawan et al. | Comparing Decision Tree and Logistic Regression for Pancreatic Cancer Classification | |
Xu et al. | A weighted distance-based dynamic ensemble regression framework for gastric cancer survival time prediction | |
Pasha et al. | Machine learning to predict completion of treatment for pancreatic cancer | |
WO2024071242A1 (en) | Suggestion device, suggestion method, suggestion system, program, and information recording medium for suggesting factor parameters for estimating target label | |
Ershadi et al. | A hierarchical machine learning model based on Glioblastoma patients' clinical, biomedical, and image data to analyze their treatment plans | |
CN117917956A (en) | System and method for predicting risk of future lung cancer | |
Marinaro et al. | ML2 Supervised Machine Learning Predicts Mortality in COVID-19 Patients Using Electronic Health Records | |
Gala et al. | ML1 Comparing Mortality in Cardiac Patient Surgical Clusters with Machine Learning Clusters in the National Inpatient Sample |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23808901; Country of ref document: EP; Kind code of ref document: A1 |