WO2024072921A1 - System for assessing the risk of developing breast cancer and associated methods - Google Patents

System for assessing the risk of developing breast cancer and associated methods

Info

Publication number
WO2024072921A1
Authority
WO
WIPO (PCT)
Prior art keywords
dataset
risk
model
human subjects
disorder
Prior art date
Application number
PCT/US2023/033921
Other languages
English (en)
Inventor
Kaitlin CHRISTINE
Ashmitha RAJENDRAN
Balaji KESAVAN
Haiyue LI
Original Assignee
Gabbi, Inc.
Priority date
Filing date
Publication date
Application filed by Gabbi, Inc. filed Critical Gabbi, Inc.
Publication of WO2024072921A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/60 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
    • G16H40/63 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for local operation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H40/00 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
    • G16H40/60 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
    • G16H40/67 ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for remote operation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • the present disclosure relates to methods and systems for assessing the risk of a subject (e.g., human male or human female patients referred to as “users” throughout this disclosure) for developing a disorder, a disease (e.g., cancer), medical condition, and/or syndrome using machine learning, and more particularly, to digital health software designed to improve accessibility and interpretability of breast cancer risk assessment and early detection measures for female patients (defined as people who are assigned female at birth) based on machine learning algorithms.
  • breast cancer is the second deadliest cancer behind lung cancer.
  • health insurers or “payors” pay $163 billion due to late diagnoses and the associated healthcare costs.
  • At least one aspect of the present disclosure is directed to a computer-implemented method for training a machine learning (ML) model to assess the risk of a human subject for developing at least one disorder.
  • the method includes receiving an input dataset including at least medical claim data corresponding to a plurality of human subjects over a target prediction period, creating a modified dataset that includes compounded risk factors derived from the input dataset, splitting the modified dataset into a first dataset corresponding to a first portion of the plurality of human subjects and a second dataset corresponding to a second portion of the plurality of human subjects, selecting at least one risk factor associated with developing the at least one disorder from the first dataset, training a machine learning (ML) model using the first dataset and the at least one risk factor, providing the second dataset to the ML model to generate a risk prediction for developing the at least one disorder by the end of the target prediction period for each human subject included in the second portion of the plurality of human subjects, and tuning at least one parameter of the ML model based on the generated risk predictions for the second portion of the plurality of human subjects.
  • the machine learning model comprises a supervised learning binary classifier including, but not limited to, a simple model (e.g., logistic regression and decision trees), a Bayesian model, an ensemble method (e.g., gradient and other boosted trees and random forest classifiers), and a deep learning model (e.g., neural network, multilayer perceptron).
  • the input dataset includes a plurality of input risk factors.
  • the method includes creating a reduced dataset including a portion of the plurality of input risk factors from the input dataset and generating the compounded risk factors by performing a series of pairwise multiplications using the input risk factor types included in the reduced dataset.
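As a minimal sketch of the compounded-risk-factor step described above, the snippet below forms pairwise products of per-member claim-count columns; the column names, values, and use of pandas are illustrative assumptions, not taken from the patent.

```python
from itertools import combinations

import pandas as pd

# Hypothetical reduced dataset: per-member claim counts for a few risk factor types.
reduced = pd.DataFrame({
    "GEN021": [0, 2, 1],   # menstrual disorders
    "MBD005": [1, 0, 3],   # anxiety and fear related disorders
    "FAC021": [0, 1, 1],
})

# Compounded risk factors: one new column per pair of input risk factors,
# formed by element-wise multiplication of the two claim-count columns.
compounded = {
    f"{a}_x_{b}": reduced[a] * reduced[b]
    for a, b in combinations(reduced.columns, 2)
}

modified = pd.concat([reduced, pd.DataFrame(compounded)], axis=1)
print(modified.head())
```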
  • the method further comprises selecting features or risk factors, reducing features or risk factors, and applying feature engineering techniques based on statistical analysis results and embeddings to structure the input and/or represent the input mathematically.
  • the first dataset is a training dataset and the second dataset is a validation dataset.
  • the method includes creating a third dataset corresponding to a third portion of the plurality of human subjects, and providing the third dataset to the ML model with the at least one adjusted parameter to generate a risk prediction for developing the at least one disorder by the end of the target prediction period for each human subject included in the third portion of the plurality of human subjects.
  • the method includes determining whether each human subject of the plurality of human subjects has developed the at least one disorder by the end of the target prediction period, labeling a portion of the plurality of human subjects who have developed the at least one disorder by the end of the target prediction period as positive for the disorder, and labeling a remaining portion of the plurality of human subjects as healthy.
  • determining that a human subject has developed the at least one disorder includes detecting at least one identifying factor in a final year of the target prediction period.
  • the first portion of the plurality of human subjects has a first ratio of positive to healthy human subjects and the second portion of the plurality of human subjects has a second ratio of positive to healthy human subjects.
  • the first ratio and the second ratio are different.
  • selecting the at least one risk factor associated with developing the at least one disorder includes identifying at least one risk factor in a first year of the target prediction period associated with a diagnosis of the at least one disorder by the end of the target time period.
  • the at least one risk factor corresponds to at least one Clinical Classifications Software Refined (CCSR) category. Risk factors may also be grouped based on the updated International Classification of Diseases 10th Revision (ICD10) hierarchy published by the World Health Organization (WHO), where the ICD10 hierarchy includes clinical descriptions of medical classifications.
  • the at least one disorder is breast cancer.
  • the trained ML model is configured to receive input data corresponding to a user and provide a risk prediction indicating the user’s risk of being diagnosed with the at least one disorder by the end of the target prediction period.
  • the risk prediction includes a risk score.
  • the risk prediction includes linear and other types of regression techniques that fall under the machine learning model, along with suitable generative models.
  • Another aspect of the present disclosure is directed to a computer-implemented method for training a machine learning (ML) model to assess the risk of a human subject for developing at least one disorder.
  • the method includes receiving an input dataset including at least medical claim data corresponding to a plurality of human subjects over a target prediction period, creating a binary dataset representing at least a portion of the input dataset, splitting the binary dataset into a first dataset corresponding to a first portion of the plurality of human subjects and a second dataset corresponding to a second portion of the plurality of human subjects, selecting at least one risk factor associated with developing the at least one disorder from the first dataset, training a machine learning (ML) model using the first dataset and the at least one risk factor, the ML model including at least one logistic regression model, providing the second dataset to the ML model to generate a risk prediction for developing the at least one disorder by the end of the target prediction period for each human subject included in the second portion of the plurality of human subjects, and tuning at least one parameter of the ML model based on the generated risk predictions for the second portion of the plurality of human subjects.
  • the input dataset includes a plurality of input risk factors.
  • creating the binary dataset includes assigning a binary value to each input risk factor based on a corresponding claim count of the risk factor from the input dataset.
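A minimal sketch of the binarization described above, assuming the rule is "any claim for a risk factor maps to 1, otherwise 0"; the data and the specific threshold are assumptions.

```python
import pandas as pd

# Hypothetical per-member claim counts for each input risk factor.
claim_counts = pd.DataFrame({
    "GEN021": [0, 2, 1],
    "MBD005": [1, 0, 3],
})

# Assign a binary value per risk factor based on the corresponding claim count:
# 1 if the member has at least one claim for that factor, 0 otherwise (assumed rule).
binary_dataset = (claim_counts > 0).astype(int)
print(binary_dataset)
```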
  • Another aspect of the present disclosure is directed to a system for training a machine learning (ML) model to assess the risk of a human subject for developing at least one disorder.
  • the system includes one or more computer systems programmed to perform operations that include receiving an input dataset including at least medical claim data corresponding to a plurality of human subjects over a target prediction period, creating a modified dataset that includes compounded risk factors derived from the input dataset, splitting the modified dataset into a first dataset corresponding to a first portion of the plurality of human subjects and a second dataset corresponding to a second portion of the plurality of human subjects, selecting at least one risk factor associated with developing the at least one disorder from the first dataset, training a machine learning (ML) model using the first dataset and the at least one risk factor, the ML model including at least one logistic regression model, providing the second dataset to the ML model to generate a risk prediction for developing the at least one disorder by the end of the target prediction period for each human subject included in the second portion of the plurality of human subjects, and tuning at least one parameter of the ML model based on the generated risk predictions for the second portion of the plurality of human subjects.
  • the input dataset includes a plurality of input risk factors.
  • the one or more computer systems is programmed to perform operations that include creating a reduced dataset including a portion of the plurality of input risk factors from the input dataset and generating the compounded risk factors by performing a series of pairwise multiplications using the input risk factor types included in the reduced dataset.
  • the one or more computer systems is programmed to perform operations comprising: creating a third dataset corresponding to a third portion of the plurality of human subjects, and providing the third dataset to the ML model with the at least one adjusted parameter to generate a risk prediction for developing the at least one disorder by the end of the target prediction period for each human subject included in the third portion of the plurality of human subjects.
  • the one or more computer systems is programmed to perform operations comprising: determining whether each human subject of the plurality of human subjects has developed the at least one disorder by the end of the target prediction period, labeling a portion of the plurality of human subjects who have developed the at least one disorder by the end of the target prediction period as positive for the disorder, and labeling a remaining portion of the plurality of human subjects as healthy.
  • determining that a human subject has developed the at least one disorder includes detecting at least one identifying factor in a final year of the target prediction period.
  • the first portion of the plurality of human subjects has a first ratio of positive to healthy human subjects and the second portion of the plurality of human subjects has a second ratio of positive to healthy human subjects.
  • selecting the at least one risk factor associated with developing the at least one disorder includes identifying at least one risk factor in a first year of the target prediction period associated with a diagnosis of the at least one disorder by the end of the target prediction period.
  • the at least one risk factor corresponds to at least one Clinical Classifications Software Refined (CCSR) category.
  • the at least one disorder is breast cancer.
  • the trained ML model is configured to receive input data corresponding to a user and provide a risk prediction indicating the user’s risk of being diagnosed with the at least one disorder by the end of the target prediction period.
  • the risk prediction includes a risk score.
  • Another aspect of the present disclosure is directed to a system for training a machine learning (ML) model to assess the risk of a human subject for developing at least one disorder.
  • the system includes one or more computer systems programmed to perform operations that include receiving an input dataset including at least medical claim data corresponding to a plurality of human subjects over a target prediction period, creating a binary dataset representing at least a portion of the input dataset, splitting the binary dataset into a first dataset corresponding to a first portion of the plurality of human subjects and a second dataset corresponding to a second portion of the plurality of human subjects, selecting at least one risk factor associated with developing the at least one disorder from the first dataset, training a machine learning (ML) model using the first dataset and the at least one risk factor, the ML model including at least one logistic regression model, providing the second dataset to the ML model to generate a risk prediction for developing the at least one disorder by the end of the target prediction period for each human subject included in the second portion of the plurality of human subjects, and tuning at least one parameter of the ML model based
  • the input dataset includes a plurality of input risk factors.
  • creating the binary dataset includes assigning a binary value to each input risk factor based on a corresponding claim count of the risk factor from the input dataset.
  • FIG. 1 is a block diagram of a platform in accordance with aspects described herein;
  • FIG. 2 is a flow diagram of a method for building and training a risk assessment model in accordance with aspects described herein;
  • FIGS. 3A-3C illustrate example input parameters and characteristics in accordance with aspects described herein;
  • FIG. 4 illustrates several example hypothesized risk factors in accordance with aspects described herein;
  • FIG. 5 illustrates several example cohorts in accordance with aspects described herein;
  • FIG. 6 illustrates several example plots of log odds for various risk factors in accordance with aspects described herein;
  • FIG. 7 illustrates several example plots of log odds for various transformed risk factors in accordance with aspects described herein;
  • FIG. 8A is a flow diagram of a method for building and training a risk assessment model in accordance with aspects described herein;
  • FIG. 8B is a flow diagram of a method for building and training a risk assessment model in accordance with aspects described herein;
  • FIG. 9A is a flow diagram of a method for building and training a risk assessment model in accordance with aspects described herein;
  • FIG. 9B is a flow diagram of a method for building and training a risk assessment model in accordance with aspects described herein.
  • FIG. 10 illustrates an example computing device.
  • Risk assessment models currently available attempt to solve the aforementioned problems; however, oftentimes these existing risk assessment models fall short of expectations with respect to optimal risk assessment.
  • models such as the BOADICEA, Tyrer-Cuzick, Gail, and/or BRCAPRO models can determine a woman's initial risk using regression-based assessments, which are considered the standard of care and are present in the medical guidelines.
  • These risk assessment tools are developed and trained on limited datasets (e.g., only hundreds of subjects at most) and with a severe lack of participant diversity (e.g., focused on Caucasian women in North America and Western Europe). The foregoing leads to skewed models that, as a result, only truly benefit those who fit into the limited parameters of the training sets.
  • breast cancer 5-year relative survival rates are approximately 90%.
  • a majority of people who are susceptible to breast cancer are unaware of the risks and, therefore, do not maintain breast health, including necessary medical check-ups and monitoring for signs. Therefore, a need exists to address all of these gaps in public awareness.
  • the standard of care for risk assessment is inadequate for diverse populations (e.g., White, Black, Hispanic/Latino, Asian, etc.) and varying demographics (e.g., socioeconomic diversity, communities of color, low income or poverty-stricken communities, etc.). It is known that there are a multitude of risk factors associated with breast cancer development, but conventional assessment tools do not account for all possible ones, quantitate the influence of each one, or associate different factors with each other.
  • the systems and methods described in the present disclosure relate to digital health software solutions designed to improve accessibility and interpretability of disease (e.g., breast cancer) risk assessment.
  • the solutions provided herein utilize machine learning to achieve improved accessibility and interpretability of disease risk assessment.
  • the benefits of this approach include non-linearity, whereas existing models and standards of care assume linear relationships between risk factors and breast cancer risk assessment.
  • the systems and methods described herein, according to exemplary embodiments, describe risk assessment model(s) that incorporate the use of specific machine learning techniques within the workflow to quantitatively define the learned relationships in order to remove uncertainty.
  • the systems and related methods herein use a central platform that aims to solve the aforementioned problems (i.e., loss of lives due to failed early detection of a disease and the cost burden on the healthcare system of attempting to treat or cure that disease) by saving lives and saving payors’ money by providing users (e.g., human males and human females concerned with potentially developing a lethal disease such as breast cancer) with a risk assessment analysis using a wide berth of data together with machine learning.
  • the systems and methods described herein provide a community-based approach to holistic disease assessment and management.
  • the platform provided herein is capable of aggregating meaningful data to provide personalized assessments, tailored action plans, and engaging communities.
  • FIG. 1 illustrates a platform 100 in accordance with aspects described herein.
  • the platform 100 is referred to by its brand name, “Gabbi.”
  • the platform 100 refers to a digital health software designed to improve accessibility and interpretability of breast cancer risk assessment and early detection measures for female patients (defined as people who are assigned female at birth and referred to as users throughout).
  • the platform 100 is used in conjunction with associated hardware (e.g., servers, storage, processors, etc.) to either store and/or execute the software/algorithms according to exemplary embodiments taught throughout this disclosure.
  • the platform 100 is implemented on a platform server 102.
  • the platform server 102 comprises software components and databases that are, in some embodiments, deployed at one or more data centers (not shown) in one or more geographic locations, for example.
  • the platform 100 is in electronic communication with devices capable of a user input (e.g., user devices) or having graphical user interfaces (GUIs) for entering data into the platform 100’s software.
  • the devices include mobile phones, personal computers, tablets, or any other suitable electronic device for inputting a user selection or user data (e.g., data related to a human subject, such as height, weight, demographic information, socioeconomic information, etc.).
  • this data is acquired through insurance companies and/or payors.
  • the platform 100 presents necessary breast cancer and breast health information digestible for the layperson, and suggests a personalized medical plan.
  • the personalized medical plan is based on the National Comprehensive Cancer Network and/or other medical and clinician-based literature. Based on zip code, insurance, and personal factors, users (e.g., human male or human female subjects in search of determining their risk assessment for developing a particular disease or disorder) are suggested appropriate physicians and clinics for follow ups.
  • the platform 100 utilizes one or more machine learning algorithms (or models) and/or artificial intelligence models.
  • the machine learning algorithms and/or artificial intelligence models enable the platform 100 to assist the user (e.g., human subject) in determining a more accurate risk assessment of developing a disease, disorder, etc.
  • the machine learning algorithms, in some embodiments, are configured to provide personalization, and adapted to foster improved engagement between users and the platform 100 (e.g., the digital health software taught herein) to improve clinical outcomes (e.g., surviving beyond the anticipated mortality associated with a disease such as cancer).
  • the platform 100 includes a community engagement engine 104.
  • users are grouped into cohorts based on risk factors and action plan similarities such as social determinants of health (i.e. geographic location, ethnicity, socioeconomic status, etc.), risk level, and family history.
  • the community engagement engine 104 incites cohort engagement with activations and intra-cohort communications with open discussions and activities.
  • the goal of incorporating cohort participation into the user's action plan is to encourage adherence to health plans and treatments, improve attitudes toward the medical field, and promote better prevention, earlier detection, and improved health outcomes.
  • Encouraging engagement involves understanding which features and topics are most successful at doing so.
  • the community engagement engine 104 analyzes which sentiments are associated with which engagement topics and deduces which topics promote the most interactions.
  • the platform 100 includes software that is used by the user to be their “best friend,” “sister,” and “mom” with the medical expertise of a personal physician (e.g., oncologist).
  • the platform 100 incorporates personalization and precision based action plans and presentation by taking in personal determinants of health, aggregating them in a meaningful way, to provide unique insights.
  • the platform 100 includes a risk assessment engine 106 using an interpretable machine learning model that quantifies relationships between patient characteristics, demographics, and other information from medical claims and breast cancer risk.
  • the risk assessment engine 106 includes a risk assessment model, which is referred throughout this disclosure as the Gabbi Risk Assessment Model or “GRAM.”
  • the risk assessment engine 106 is configured to develop the risk assessment model using interpretable machine learning with all factors available from medical claims datasets. For example, in some embodiments machine learning-based clustering and feature selection identify the most impactful clinical factors for breast cancer prediction. The risk assessment engine 106 is used to significantly enhance clinical understanding of influential risk factors and/or improve risk prediction.
  • the platform 100 includes a community action engine 108.
  • the community action engine 108 is configured to provide a “Gabbi Action Plan” (GAP) and/or a “Gabbi Community,” which provide personalized clinical suggestions and create an encouraging community.
  • the platform 100 is used to develop personalized outcome reports from clinically and medically derived recommendations for breast health maintenance and steps personalized for each risk assessment and combination of risk factors and validate with medical and clinical expertise.
  • the GAP encourages maintenance of breast health, and early detection of nodules or cancer.
  • the GAP plan is derived from clinical and medical literature that correlates risk, general health, and life stages to clinically driven plans (e.g., from the National Comprehensive Cancer Network, the National Society of Genetic Counselors, the United States Preventive Services Task Force, and/or clinicians associated with the platform 100).
  • users are given a notification of their next actionable item. In some embodiments, this also includes notifications when their engagement cohorts are in communication or active. As users complete tasks or communicate in the engagement platform, they are asked to “mark” their completion off (e.g., like a task list).
  • the medical tasks (e.g., breast self-exams, mammograms, setting up general medical check-up appointments) are clearly defined with user-friendly instructions and reasoning. It should be appreciated that the medical tasks are recommendations and that official medical evaluations are performed by the user’s clinician (resources for one will be provided per user on their action plan based on zip code, insurance, and physician preferences).
  • the platform 100 is configured to provide an engagement sub-platform, which is a subset of the overall platform 100.
  • the engagement sub-platform significantly increases engagement, resulting in improved health outcomes.
  • the GRAM workflow groups subjects (e.g., women) into specific groups, or cohorts, based on the GRAM results (specifically grouping women who have similar risk levels into cohorts). Women are grouped based on similar risks and put into curated cohorts. In some examples, similarities are determined by comparing top risk factors.
  • non-cohort members cannot join and be part of the internal cohort conversations; however, there are a variety of “open groups” based on popular topics of discussion found within the cohort (e.g., breast health, mammograms, biopsies, family history).
  • the social component of the platform 100 is configured to send notifications that prompt joining conversations, and notifications around new topics or questions of discussion within cohort and groups.
  • Each individual female member creates a profile, similar to a resume or CV, but with pertinent personal information and health-related data such as risk factors, family history, and diagnoses.
  • Members initiate conversations, participate in group conversations, and share their action plans as they make progress together.
  • although a “female” subject is referred to as the main user or subject using the platform 100, the platform 100 is not limited to human females and may also be used for risk assessment of human males.
  • IoT devices (e.g., smart watches, fitness trackers, etc.)
  • a member is able to self-select which path he or she wants to take based on difficulty: minimal/small interventions like lifestyle changes, medium interventions like diagnostic exams, and large interventions like surgical interventions.
  • Notifications are connected to the user’s device (e.g., smartphone, smart watch, etc.) to alert and notify the member on community discussions, new questions and conversations, when to take the next step on the action plan and when they have “unlocked” risk reduction over time.
  • the platform 100 is configured to determine ease of use of the platform 100 by using quantitative metrics to guide user testing. For example, in some embodiments quantitative success is derived from search engine analytics (e.g., Google) by measuring threshold of engagement, short- and long-term retention, screen flow, and identifying which features, topics, and content are most interacted.
  • an alpha version of the platform 100 (or a respective application) is released to a consumer base representative of the platform 100’s target market (e.g., female at birth, ages 35+, and access to a phone or computer). For example, in some embodiments this initial release is performed with at least 150 participants.
  • the amount of success is determined by the frequency and amount of active user interaction with the Gabbi engagement platform and success with GAP completion.
  • the number of guided engagement tasks and prompts per week is adjusted to maintain a specific percentage (e.g., 40% or about 40%) of users interacting and a specific percentage (e.g., 50% or about 50%) of users adhering to medical plans. These metrics indicate the ease of use of each feature, identifying where qualitative tests could indicate improvement tactics.
  • the platform 100 is optimized to improve user engagement, retention, and/or ease of use. For example, in some embodiments after the first 3 months (or about 3 months), the platform 100 determines user perception of the platform and experience (e.g., emotional reactions and ease of use) through usability interviews. Based on quantitative metrics (as described earlier), a group of users (e.g., 8-12 users) are selected based on ease of use, amount of engagement, and personal factors to identify user satisfaction of each feature and optimization opportunities. The group of users are selected based on varying analytical experiences, assessment results, and demographics for usability interviews. These interviews include task-based activities (“Can you calculate your risk?”, “Can you find the interaction platform?”, etc.) and expectation and real time walk-through feedback. In some examples, these tests are repeated (e.g., every 3 months) to optimize the platform 100.
  • Based on the quantitative metrics and results from the usability research reports, the platform 100 improves user experience and launches a beta version to a larger consumer base (e.g., at least 100 new users). Following the same qualitative and quantitative usability research steps, the platform 100 is configured to gauge experience, modify, and release the final product.
  • the platform 100 is configured to commercialize products by selling to entities such as health insurers (e.g., payors), employers, providers/systems, and/or governments using a “top-down approach.” For example, in some embodiments after proving significance, accuracy, and engagement, the platform 100 is configured to sell to the payors of medium to large organizations.
  • the platform 100 is used to identify each woman specifically who will be getting breast cancer this year and ensure that she engages with necessary preventative strategies at the earliest possible stages, when it is most life-saving, treatable, and cheapest for the payor. Payors may want to purchase the platform 100 because the platform 100 significantly decreases their annual spending related to delayed breast cancer diagnosis and treatment today.
  • the platform 100 increases member engagement and increases the lifetime value of the member.
  • the platform 100 is configured to commercialize products by selling to end users (e.g., individuals) using a “bottom-up approach.” The end users interact with the platform 100 and deliver the platform 100 or associated results (e.g., risk assessment scores) to their employer, provider, payor etc.
  • the risk assessment engine 106 of the platform 100 incorporates a demographically inclusive risk assessment tool referred to as the Gabbi Risk Assessment Model (GRAM).
  • the GRAM utilizes interpretable machine learning methods and non-invasive user information to improve breast cancer risk prediction.
  • the GRAM utilizes interpretable machine learning methods along with other models including deep learning models, large language models, and/or Natural Language Processing (NLP) models combined with clinical understanding of risks.
  • the GRAM provides quantitative insight into breast cancer risk factors where previous studies have had conflicting correlations - including menstrual disorders, other breast conditions, and family histories of cancer. With additional validation through other datasets and biological validation, the GRAM is used as an alternative to standard of care in order to provide the most accurate assessment for patients of all ages and ethnicities.
  • developing and training the GRAM includes four phases: member selection, feature selection, training, and prediction.
  • in the member selection phase, eligible members are selected to create an input dataset.
  • the input data is structured with all of the diagnosis information, pharmacy information, and lab information for each member and split into training datasets (e.g., used for the model to learn from) and validation datasets (e.g., to test the model and validate the prediction accuracies).
  • the feature selection phase includes identifying the most influential input risk factors in predicting breast cancer diagnosis using automated methods that return the optimal number and combination of factors.
  • the feature selection phase includes eliminating factors that are correlated with each other and provide redundant information (e.g., multicollinearity).
  • the model is trained with the optimal factors retained from the feature selection phase.
  • the model learns the subtleties in the input dataset to derive weights and coefficients per feature, resulting in a quantitative prediction of probability of breast cancer at the end of a target timeframe.
  • the parameters of the model are fine-tuned using at least one first validation dataset.
  • at least one second validation dataset is used to calculate the final accuracy metrics.
  • FIG. 2 illustrates a method 200 for developing and training a risk assessment model in accordance with aspects described herein.
  • the method 200 corresponds to the development and training of the GRAM associated with the platform 100.
  • at least a portion of the method 200 is performed by the risk assessment engine 106 of the platform 100.
  • a medical claims dataset is received.
  • the medical claims dataset includes medical claims (e.g., diagnostic information), pharmacy information, biomarker information (e.g., genetic testing done on certain members), and/or lab information associated with a plurality of different individuals (e.g., patients).
  • the plurality of individuals represented in the dataset are members of one or more health insurers.
  • the dataset corresponds to a large number of members (e.g., 5 million, 10 million, 100 million) and includes data collected over different periods of time (e.g., 6 months, 1 year, 5 years). Given the large size of the dataset, the dataset is aggregated and/or summarized prior to use for model development. For example, in some embodiments the dataset is organized in groups of diagnoses, labs, and/or pharmacy data and narrowed to members of representative cohorts.
  • each entry has a corresponding International Classification of Diseases (ICD) code representing the diagnosis, disease, or injury associated with the claim entry.
  • ICD codes are grouped into Clinical Classifications Software Refined (CCSR) categories, which are established overarching diagnostic groups where each group encompasses tens to hundreds of similar ICD codes.
  • the ICD codes corresponding to entries #1 and #2 are both included in the same CCSR group (NEO030).
  • the ICD code associated with entry #3 (N63) and the ICD code associated with entry #5 (D0510) are included in different CCSR groups (GEN017, NEO029).
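As an illustrative sketch of the ICD-to-CCSR grouping in the example entries above, the snippet below maps claim-level ICD codes to CCSR groups and counts claims per member and group; the partial mapping mirrors the example entries, the C50.x codes are illustrative assumptions, and this is not a complete CCSR crosswalk.

```python
import pandas as pd

# Partial ICD-10 -> CCSR mapping mirroring the example entries above
# (assumed codes for the "breast cancer of all types" group).
ICD_TO_CCSR = {
    "C50911": "NEO030",
    "C50912": "NEO030",
    "N63": "GEN017",
    "D0510": "NEO029",
}

claims = pd.DataFrame({
    "member_id": [101, 101, 102, 103],
    "icd_code": ["C50911", "C50912", "N63", "D0510"],
})
claims["ccsr"] = claims["icd_code"].map(ICD_TO_CCSR)

# Per-member claim counts per CCSR group (the risk factor representation used later).
ccsr_counts = claims.groupby(["member_id", "ccsr"]).size().unstack(fill_value=0)
print(ccsr_counts)
```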
  • the input dataset includes other parameters and user characteristics.
  • these parameters and characteristics are provided with the medical claim data and/or derived from the medical claim data.
  • the parameters are collected from users directly (e.g., a survey or interface provided via a user application in communication with the platform 100). Several examples of such parameters and characteristics are shown in FIGS. 3A-3C.
  • risk factors are assigned to different CCSR groups.
  • the hypothesized risk factors correspond to factors, symptoms, or conditions that are believed, suspected, and/or proven to be related to breast cancer diagnoses.
  • FIG. 4 illustrates two example hypothesized risk factors: amenorrhea and anxiety.
  • the first hypothesized risk factor “amenorrhea” is linked to a first CCSR group GEN021 (menstrual disorders).
  • all medical claim entries having ICD codes included in the first CCSR group are classified under “Risk Factor 1” (i.e., amenorrhea).
  • the second hypothesized risk factor “anxiety” is linked to a second CCSR group MBD005 (anxiety and fear related disorders).
  • all medical claim entries having ICD codes included in the second CCSR group are classified under “Risk Factor 2” (i.e., anxiety).
  • specific ICD codes are used to identify CCSR groups that are relevant to breast cancer diagnosis (e.g., to be used as risk factors).
  • any number of risk factors is considered in some embodiments (e.g., dozens, hundreds, thousands, etc.).
  • types of risk factors used for model development are narrowed based on biological validity.
  • biological validity of different risk factors is determined by a board of practicing medical professionals. For example, in some embodiments endometriosis is considered as a risk factor based on evidence of environmental and molecular links with breast cancer.
  • in some embodiments, polycystic ovarian syndrome (PCOS), abnormal uterine bleeding (e.g., menorrhagia, metrorrhagia, anovulation), metabolic and endocrine related syndromes, gastrointestinal disorders and syndromes, anxiety, depression and related disorders, complications during or post-pregnancy, and any personal or familial histories of said conditions are used or considered as risk factors for model development.
  • because endometrial hyperplasia, endometrial cancer, and ovarian cancer have been associated with specific genetic mutations that also increase the risk of breast cancer, these conditions are used or considered as risk factors.
  • conditions, diseases, disorders, symptoms, or any other attribute are used or considered as risk factors without medical evaluation.
  • eligible members are identified from the medical claims dataset.
  • the eligibility of each member is determined based on medical claim data over a target prediction period.
  • an n-year prediction period corresponds to n+1 years of medical claim data (e.g., n years and a prediction year).
  • a two-year prediction period is used to develop the model.
  • mammography can detect the presence of a lump two years before it can be felt externally. Given this, identifying high risk patients prior to any palpable mass allows for potential clinical action to minimize the burden of disease. Additionally, on average, employees stay with a single employer for two to five years.
  • a member is determined as eligible if the medical claim data associated with said member does not include any breast cancer related claims prior to the last year of the prediction period. For example, in some embodiments a member is determined as eligible for a two-year prediction period if the member’s medical claim data does not include any breast cancer related claims in the first and second years. In certain examples, eligibility is determined based on additional criteria (e.g., female gender).
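A minimal sketch of the eligibility rule described above for a two-year prediction period, assuming a claim-level table with a within-window claim year and a precomputed breast-cancer flag; all column names and values are assumptions.

```python
import pandas as pd

# Hypothetical claim-level data: one row per claim within the 3-year window.
claims = pd.DataFrame({
    "member_id": [1, 1, 2, 2, 3],
    "claim_year": [1, 2, 1, 3, 2],          # year within the window (1, 2, or 3)
    "is_breast_cancer_claim": [False, False, True, False, False],
})

PREDICTION_YEAR = 3  # last year of the window (the prediction year)

# A member is eligible if they have no breast-cancer related claims
# before the prediction (final) year.
prior_bc = (
    claims[claims["claim_year"] < PREDICTION_YEAR]
    .groupby("member_id")["is_breast_cancer_claim"]
    .any()
)
eligible_members = prior_bc[~prior_bc].index.tolist()
print(eligible_members)  # members 1 and 3 remain eligible
```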
  • the target prediction period is used to define multiple cohorts of eligible members.
  • FIG. 5 illustrates an example plurality of cohorts 502 distributed across a five year time period.
  • each cohort of the plurality of cohorts 502 corresponds to a two-year prediction period (e.g., three years of medical claim data).
  • Each cohort includes eligible members with continuous coverage across three years.
  • the first cohort 502a includes eligible members having medical claim coverage from 2014 to 2016.
  • Predictive features of the first cohort 502a members are identified in year 1 (e.g., 1/1/2014 to 12/31/2014) to predict whether a breast cancer diagnosis would be given in year 3 (e.g., 1/1/2016-12/31/2016).
  • the second cohort 502b includes eligible members having medical claim coverage from 2015 to 2017. Predictive features of the second cohort 502b members are identified in year 1 (e.g., 1/1/2015 to 12/31/2015) to predict whether a breast cancer diagnosis would be given in year 3 (e.g., 1/1/2017-12/31/2017).
  • the third cohort 502c includes eligible members having medical claim coverage from 2016 to 2018. Predictive features of the third cohort 502c members are identified in year 1 (e.g., 1/1/2016 to 12/31/2016) to predict whether a breast cancer diagnosis would be given in year 3 (e.g., 1/1/2018-12/31/2018).
  • the breast cancer diagnosis of eligible members is determined based on the presence of one or more ICD codes in a member’s medical claim data.
  • a breast cancer diagnosis is defined using codes in the NEO030 CCSR group (“breast cancer of all types”).
  • diagnoses are determined using the following criteria: the member has at least one inpatient claim associated with NEO030 CCSR diagnosis codes and/or the member has at least two distinct outpatient claims associated with NEO030 CCSR diagnosis codes.
  • a different CCSR code or combinations of CCSR codes are used to define a breast cancer diagnosis.
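A sketch of the diagnosis-labeling criteria above (at least one inpatient claim, or at least two distinct outpatient claims, carrying codes in the breast cancer CCSR group); the claim-type encoding and sample data are assumptions.

```python
import pandas as pd

claims = pd.DataFrame({
    "member_id":  [1, 1, 2, 3, 3],
    "ccsr":       ["NEO030", "GEN017", "NEO030", "NEO030", "NEO030"],
    "claim_type": ["outpatient", "outpatient", "inpatient", "outpatient", "outpatient"],
    "claim_id":   [10, 11, 12, 13, 14],
})

bc = claims[claims["ccsr"] == "NEO030"]
members = claims["member_id"].unique()

# Criterion 1: at least one inpatient claim with a breast cancer CCSR code.
inpatient = (bc[bc["claim_type"] == "inpatient"].groupby("member_id").size() >= 1)
# Criterion 2: at least two distinct outpatient claims with breast cancer CCSR codes.
outpatient = (bc[bc["claim_type"] == "outpatient"]
              .groupby("member_id")["claim_id"].nunique() >= 2)

# Positive label if either criterion is met.
diagnosed = (inpatient.reindex(members, fill_value=False)
             | outpatient.reindex(members, fill_value=False))
print(diagnosed)  # member 2 (inpatient) and member 3 (two outpatient claims) are positive
```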
  • the medical claim data associated with eligible members is split into training and validation datasets.
  • the cohort datasets include CCSR (e.g., diagnostic) claim counts, pharmacy information, lab information, and biomarker data.
  • Each eligible member is assigned a binary label indicating whether they had a breast cancer diagnosis in year n+1 or not.
  • the dataset of cohorts is split 50-25-25% into training and two validation datasets, respectively. In other examples, different dataset configurations are used (e.g., 40-30-30%). The distribution of healthy to breast cancer positive members is maintained when splitting up the datasets.
  • the training dataset (e.g., 50% of the input dataset) was down-sampled where the minority class (e.g., members with breast cancer) was left intact but the majority class (e.g., members without breast cancer) was under-sampled.
  • members of the majority class were randomly selected until a training dataset having an equal (or substantially equal) number of minority class members and majority class members was achieved.
  • down-stream predictions with the validation datasets incorporated prediction probability adjustments to counter the under-sampling of the training dataset.
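A sketch of the stratified 50-25-25% split and the training-set down-sampling described above, using scikit-learn for the stratified splits; the random seeds, column names, and function structure are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_downsample(data: pd.DataFrame, seed: int = 0):
    """Split member-level data (with a binary `label` column) 50-25-25% and
    down-sample the healthy majority class in the training split only."""
    # Stratified splits preserve the healthy/positive ratio in each subset.
    train, rest = train_test_split(
        data, test_size=0.5, stratify=data["label"], random_state=seed)
    val1, val2 = train_test_split(
        rest, test_size=0.5, stratify=rest["label"], random_state=seed)

    # Keep all positive (minority) members; under-sample healthy (majority) members.
    pos = train[train["label"] == 1]
    neg = train[train["label"] == 0].sample(n=len(pos), random_state=seed)
    train_balanced = pd.concat([pos, neg]).sample(frac=1, random_state=seed)
    return train_balanced, val1, val2
```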
  • the feature selection process includes calculating one or more metrics associated with the training dataset.
  • the feature selection process includes determining the association of each risk factor type with a breast cancer diagnosis. For example, in some embodiments an odds analysis is performed to determine the probability of a breast cancer diagnosis for each CCSR type.
  • the odds (or probability) for each CCSR type is defined as the odds of having breast cancer at the end of an n-year prediction period (e.g., in year n+1) given a risk factor in the first year.
  • the log odds for each CCSR type are calculated as follows: for each ccsr_n among all N CCSRs, the number of claims associated with all diagnoses in that ccsr_n is counted per eligible member. A unique count, m, is retained for each ccsr_n. In certain examples, claim counts associated with fewer than a predetermined number of members (e.g., 5, 10, 15, etc.) were removed and considered outlier unique counts. For each unique count m of ccsr_n, the probability of having a breast cancer diagnosis is calculated.
  • this probability is equal to the number of members with a breast cancer diagnosis (member_BC) having unique count m of ccsr_n divided by the total number of members.
  • the probabilities and claim counts in Table 2 correspond to the CCSR type END016 (“Other specified and unspecified nutritional and metabolic disorders”).
  • for the CCSR type END016, there are 243 different diagnostic codes (e.g., ICD codes). The number of claims associated with any of these 243 diagnostic codes is counted per member. Then, for each unique count, the probability of having that count and having a breast cancer diagnosis in n years is calculated. In some embodiments, the natural log of the probabilities is taken to determine the log odds associated with each claim count per CCSR type.
  • the log odds are used to determine trends (or associations) between each risk factor type and a breast cancer diagnosis.
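A sketch of the per-count log-odds calculation for a single CCSR type. Here the probability is computed per unique claim count (positive members with that count divided by all members with that count), which is one plausible reading of the description above; the outlier threshold and example data are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical per-member data: claim count for one CCSR type plus the outcome label.
df = pd.DataFrame({
    "ccsr_count": [0, 0, 1, 1, 1, 2, 2, 3, 0, 1],
    "label":      [0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
})

MIN_MEMBERS = 2  # drop unique counts supported by too few members (assumed threshold)

grouped = df.groupby("ccsr_count")["label"].agg(["sum", "count"])
grouped = grouped[grouped["count"] >= MIN_MEMBERS]

# Probability of a breast cancer label given each unique claim count, and its
# natural log (the "log odds" as the term is used in this disclosure).
grouped["probability"] = grouped["sum"] / grouped["count"]
grouped["log_odds"] = np.log(grouped["probability"].replace(0, np.nan))
print(grouped)
```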
  • FIG. 6 illustrates several example plots 602 of log odds for different CCSR types (e.g., characteristics, disorders, symptoms, diseases, etc.).
  • the x-axis represents the claim count and the y-axis represents the log odds of having a breast cancer diagnosis.
  • the first plot 602a corresponds to the log odds of a breast cancer diagnosis relative to a first CCSR type (e.g., anxiety and fear related disorders)
  • the second plot 602b corresponds to the log odds of a breast cancer diagnosis relative to a second CCSR type (e.g., diabetes or abnormal glucose tolerance complicating pregnancy)
  • the third plot 602c corresponds to the log odds of a breast cancer diagnosis relative to a third CCSR type (e.g., menstrual disorders)
  • the fourth plot 602d corresponds to the log odds of a breast cancer diagnosis relative to a fourth CCSR type (e.g., menopausal disorders).
  • each CCSR type has a different association (or correlation) to breast cancer diagnosis relative to claim count.
  • the risk factors that showed a positive trend with a breast cancer diagnosis were retained for model development.
  • the feature selection process includes calculating a variance inflation factor (VIF) and Akaike information criterion (AIC) with stepwise regression.
  • the VIF and AIC metrics are calculated for the risk factor types that pass the log odds analysis (e.g., showed a positive trend); however, in other examples, the VIF and AIC metrics are calculated for all risk factor types (e.g., all hypothesized risk factors).
  • the VIF and AIC metrics are calculated to determine the multicollinearity of the training dataset.
  • Multicollinearity defines a situation where input predictive features in a regression are linearly related to each other and therefore redundant.
  • redundancy inflates a model’s confidence intervals and reduces the effects of the independent variables. Additionally, independent individual predictors narrow the input space and allow for more reliable interpretation of input weights.
  • VIF quantitatively identifies correlations between each predictor variable (e.g., risk factor). In some embodiments, high correlations are an indication that the input feature space is capable of being narrowed.
  • the VIF metric uses a least squares regression analysis to determine collinearity across features. Several example CCSR types and corresponding VIF values are shown in Table 3 below:
  • a VIF below a predetermined threshold (e.g., 3, 5, 7, etc.) indicates no significant collinearity for the corresponding risk factor type.
  • risk factor types with VIF values that exceed the predetermined threshold are eliminated from model development.
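A sketch of the VIF screen using statsmodels; this is a simple one-pass filter with an example threshold of 5, whereas in practice VIF values are often recomputed after each removal. The function structure and threshold are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_filter(features: pd.DataFrame, threshold: float = 5.0) -> list:
    """Return the feature names whose VIF falls below the threshold."""
    X = sm.add_constant(features)
    kept = []
    for i, name in enumerate(X.columns):
        if name == "const":
            continue  # the intercept column is not a risk factor
        if variance_inflation_factor(X.values, i) < threshold:
            kept.append(name)
    return kept
```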
  • the second metric, AIC with stepwise regression, identifies the optimal combination of features while penalizing model complexity and input space size.
  • the AIC stepwise regression provides an estimated coefficient, a standard error, a t-value, and a p-value for each risk factor type included.
  • the p-value of each risk factor is used to analyze the potential prediction contributions associated with each risk factor.
  • Several example CCSR types and corresponding p-values determined from the AIC calculation are shown in Table 4 below:
  • both forward and backward steps are included in the AIC calculation, where a regression model is built in a stepwise fashion, including and removing predictor variables (e.g., risk factor types) with each iteration.
  • the output variables correspond to the “best” model.
  • the predictor variables deemed significant by stepwise AIC and that passed the VIF check were retained in the data.
  • risk factor types having a p-value below a predetermined threshold (e.g., 0.001, 0.01, etc.) are retained for model development.
  • the CCSR types FAC021, GEN017, GEN021, GEN023, MBD005, PRG028, and PRG029 are retained for model development.
  • different metrics are used to determine which risk factor types are retained for model development (e.g., coefficients, standard errors, t-values, etc.).
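A sketch of stepwise selection by AIC using statsmodels logistic regression; for brevity only forward steps are shown, whereas the description above includes both forward and backward steps. The function structure is an assumption.

```python
import pandas as pd
import statsmodels.api as sm

def forward_stepwise_aic(X: pd.DataFrame, y: pd.Series) -> list:
    """Greedy forward selection: repeatedly add the feature that lowers AIC the most."""
    selected, remaining = [], list(X.columns)
    best_aic = float("inf")
    while remaining:
        scores = []
        for cand in remaining:
            model = sm.Logit(y, sm.add_constant(X[selected + [cand]])).fit(disp=0)
            scores.append((model.aic, cand))
        aic, cand = min(scores)
        if aic >= best_aic:
            break  # no candidate improves the AIC; stop
        best_aic = aic
        selected.append(cand)
        remaining.remove(cand)
    return selected
```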
  • a machine learning (ML) model is trained using the training dataset and the selected predictor variables.
  • the ML model is a logistic regression model (or algorithm).
  • the ML model is one of a random forest model, a support vector machine(s) model, a decision tree(s) model, or a gradient boosted algorithm.
  • the logistic regression model is iteratively trained using a series of logistic regressions. For example, in some embodiments an initial logistic regression is performed using the training data. The initial logistic regression provides an estimated coefficient, a standard error, a z-value, and a p-value for each predictor variable.
  • predictor variables (e.g., risk factors) having a p-value below a predetermined threshold (e.g., 0.001, 0.01, etc.) are retained for further model development.
  • the CCSR types FAC021, GEN017, GEN021, GEN023, and PRG029 are retained for model development.
  • different metrics are used to determine which predictor variables are retained for model development (e.g., coefficients, standard errors, z-values, etc.).
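A sketch of the fit-and-prune pattern described above: fit a logistic regression, inspect p-values, drop non-significant predictors, and refit. The 0.01 threshold is one of the example values mentioned; the loop structure is an assumption.

```python
import pandas as pd
import statsmodels.api as sm

def fit_and_prune(X: pd.DataFrame, y: pd.Series, p_threshold: float = 0.01):
    """Refit the logistic regression until every retained predictor is significant."""
    cols = list(X.columns)
    while True:
        result = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = result.pvalues.drop("const")   # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= p_threshold or len(cols) == 1:
            return result, cols
        cols.remove(worst)  # drop the least significant predictor and refit
```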
  • the log odds for each resulting predictor variable are recalculated with different transformations to determine best fit.
  • these transformations include square, log and/or natural log transformations.
  • a correlation coefficient (e.g., R-value) is used to determine the transformation with the best fit for each predictor variable.
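A sketch of choosing the best-fitting transformation (square, log, natural log, as described above) by correlating the transformed counts with the per-count log odds; the +1 shift before taking logs and the use of Pearson correlation are assumptions.

```python
import numpy as np

def best_transformation(counts: np.ndarray, log_odds: np.ndarray) -> str:
    """Pick the transformation of the claim counts that best fits the log odds."""
    shifted = counts + 1.0  # avoid log(0); assumed handling of zero counts
    candidates = {
        "identity": counts.astype(float),
        "square": counts.astype(float) ** 2,
        "log10": np.log10(shifted),
        "natural_log": np.log(shifted),
    }
    scores = {
        name: abs(np.corrcoef(values, log_odds)[0, 1])
        for name, values in candidates.items()
    }
    return max(scores, key=scores.get)

# Example: per-count data from the log-odds analysis for one predictor variable.
counts = np.array([0, 1, 2, 3, 4, 5])
log_odds = np.array([-6.0, -5.2, -4.9, -4.7, -4.6, -4.5])
print(best_transformation(counts, log_odds))
```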
  • FIG. 7 illustrates several example plots 702 of log odds for different transformed predictor variables (e.g., characteristics, disorders, symptoms, diseases).
  • the x-axis represents the transformed claim count and the y-axis represents the log odds of having a breast cancer diagnosis.
  • the first plot 702a corresponds to the log odds of a breast cancer diagnosis relative to a log transformation of the claim count of a first predictor variable (e.g., age)
  • the second plot 702b corresponds to the log odds of a breast cancer diagnosis relative to a log transformation of the claim count of a second predictor variable (e.g., the FAC021 CCSR type)
  • the third plot 702c corresponds to the log odds of a breast cancer diagnosis relative to a log transformation of the claim count of a third predictor variable (e.g., the GEN017 CCSR type)
  • the fourth plot 702d corresponds to the log odds of a breast cancer diagnosis relative to a square transformation of the claim count of a fourth predictor variable (e.g., the GEN021 CCSR type).
  • a second logistic regression is performed using the training data and the transformed predictor variables. Similar to the initial logistic regression, the second logistic regression provides an estimated coefficient, a standard error, a z-value, and a p-value for each predictor variable.
  • Several example predictor variables and corresponding p-values output from the second logistic regression are shown in Table 6 below:
  • predictor variables having a p-value below a predetermined threshold (e.g., 0.001, 0.01, etc.) are retained.
  • the CCSR types FAC021 and GEN021 are retained as predictor variables.
  • different metrics are used to determine which predictor variables are retained for model development (e.g., coefficients, standard errors, z-values, etc.).
  • the logistic regression model is tested with the first validation dataset.
  • the model is configured to output a breast cancer diagnosis prediction for each member included in the first validation dataset.
  • the resulting predicted probabilities are adjusted to account for the previous down-sampling of the training dataset. For example, in some embodiments the probabilities are adjusted using the following equations:
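  • The adjustment equations themselves are not reproduced in this extract. As a hedged sketch only, a commonly used prior-correction for class down-sampling (an assumption here, not necessarily the equations referenced above) rescales the predicted odds by the fraction of negatives kept during training:

```python
# Assumed prior-correction for down-sampled negatives (not necessarily the
# equations referenced in the description).
import numpy as np

def adjust_for_downsampling(p_model: np.ndarray, neg_sampling_fraction: float) -> np.ndarray:
    """Map model probabilities back to the original class balance: keeping only
    a fraction `s` of negatives inflates the odds by 1/s, so multiply by s."""
    p = np.clip(p_model, 1e-9, 1 - 1e-9)
    adjusted_odds = (p / (1 - p)) * neg_sampling_fraction
    return adjusted_odds / (1 + adjusted_odds)

# Example: negatives were down-sampled to 10% of their original count.
# p_adjusted = adjust_for_downsampling(p_model, neg_sampling_fraction=0.10)
```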
  • the adjusted prediction results corresponding to the first validation dataset are used to finetune the logistic regression model.
  • the results are used to select cutoff and beta metrics for the resulting logistic regression model.
  • the cutoff and beta metrics are referred to as “hyperparameters” of the logistic regression model. For example, in some embodiments selection is performed across a comparison of beta values 1 through 10 and cutoff thresholds between 0.01 and 1 at increments of 0.01. In some embodiments, selection is performed across a comparison of beta values of about 1 through about 10 and cutoff thresholds between about 0.01 and about 1 at increments of about 0.01. In some embodiments, different ranges and increments are used. The beta value and cutoff threshold that produced the best F-measure were selected for the final version of the logistic regression model. Adjusted predicted probabilities greater than or equal to the cutoff were identified as increased breast cancer risk, and those below the cutoff as average or low breast cancer risk.
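  • A short sketch of the sweep described above, using scikit-learn's F-beta score (the scoring function and tie-breaking are assumptions):

```python
# Grid search over beta (1-10) and cutoff (0.01-1.00, step 0.01) scored by the
# F-beta measure on the adjusted validation probabilities (illustrative only).
import numpy as np
from sklearn.metrics import fbeta_score

def select_cutoff_and_beta(y_true: np.ndarray, p_adjusted: np.ndarray):
    """Return the (cutoff, beta) pair with the best F-beta score."""
    best, best_score = (0.5, 1), -1.0
    for beta in range(1, 11):
        for cutoff in np.arange(0.01, 1.01, 0.01):
            y_pred = (p_adjusted >= cutoff).astype(int)  # >= cutoff => increased risk
            score = fbeta_score(y_true, y_pred, beta=beta, zero_division=0)
            if score > best_score:
                best_score, best = score, (round(float(cutoff), 2), beta)
    return best
```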
  • the logistic regression model is updated.
  • the model is updated by adjusting (or optimizing) at least one hyperparameter of the model.
  • the logistic regression model is updated by applying the selected beta value(s) and/or the selected cutoff threshold(s) from step 214.
  • the logistic regression model with the updated hyperparameters is tested with the second validation dataset.
  • the model is configured to output a breast cancer diagnosis prediction for each member included in the second validation dataset.
  • the breast cancer diagnosis prediction is provided as a risk score.
  • the risk score is a grade, ranking, or percentage (e.g., B+, 90%, etc.).
  • a high score (e.g., A, 95%, etc.) indicates a low risk for breast cancer; however, in other embodiments, a high score indicates a high risk for breast cancer.
  • the risk score is compared to one or more thresholds to determine a risk level (e.g., high risk, low risk, etc.) of each member.
  • the logistic regression model resulted in an accuracy of 89.53%, area under the curve (AUC) of 0.82, sensitivity of 0.39, and specificity of 0.90. In some embodiments, the logistic regression model resulted in an accuracy of approximately 89.53%, AUC of approximately 0.82, sensitivity of approximately 0.39, positive predictive value of approximately 1%, and negative predictive value of approximately 99%. In some embodiments, additional metrics such as observed-expected ratios and entropy metrics are used to assess model performance.
  • the same input CCSR types are narrowed based on medical relevance to breast cancer using relevant research sources and medical and scientific expertise. This reduction results in a set of input groups derived from different CCSR types (e.g., non-malignant breast conditions, age, personal family history, and menstrual disorders).
  • the different CCSR types correspond to those shown in FIG. 7.
  • there are 18 input groups including 17 groups of ICD codes and one group directed to age.
  • Examples of the ICD code groups are provided in Table 7 below:
  • the individual input groups are then iterated through the workflow of step 204.
  • the input groups are paired based on biological relevancy and possible compounded effects on breast cancer.
  • each input group is paired with the other input groups.
  • input group 1 is paired with the age group and groups 2-17
  • group 2 is paired with the age group, group 1, and groups 3-17, and so on.
  • the claim counts of these pairs are multiplied together, creating an additional, compounded input risk factor.
  • the individual groups and multiplied compounded groups are then used to create the input dataset.
  • the workflow of steps 206 to 218 is repeated for the input dataset.
  • multiple models are developed in step 210.
  • four models are developed, including a logistic regression model, a decision tree model, an xgboost model, and a random forest model.
  • the logistic regression developed from the 18 input groups resulted in accuracies that include an AUC of 0.77, a sensitivity of 0.59, a specificity of 0.76, a positive predictive value of 21%, and a negative predictive value of 94% without any model tuning, VIF, or AIC.
  • the performance (or accuracies) of each model type are evaluated relative to one or more thresholds.
  • the one or more thresholds correspond to the performance of the logistic regression model that was developed using the full set of CCSR types.
  • the model type that exceeds the one or more thresholds (or outperforms the other model types) is selected for use.
  • the different model types are ensembled to provide a combined performance result.
  • the input dataset is converted from claim counts to binary inputs (e.g., ‘1’ or ‘0’).
  • for the input groups described above (e.g., the 17 ICD groups and the age group), each group is assigned a binary value of ‘1’ if claim counts are present and a binary value of ‘0’ if no claim counts are present.
  • a threshold is used to determine the binary assignments. If a group has a claim count number exceeding the threshold (e.g., 2, 5, 10, 20, etc.), the group is assigned a binary value of ‘1’. Otherwise, the group is assigned a binary value of ‘0’.
  • each group has a unique threshold that is used for the comparison.
  • a first group is compared to a first threshold
  • a second group is compared to a second threshold
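  • A simple sketch of this binarization step (group names and thresholds below are hypothetical):

```python
# Convert per-group claim counts to binary inputs, optionally with a
# per-group threshold (illustrative; defaults to "any claim present => 1").
import pandas as pd

def binarize_claim_counts(counts: pd.DataFrame, thresholds=None) -> pd.DataFrame:
    """Assign 1 if a group's claim count exceeds its threshold, else 0."""
    thresholds = thresholds or {}
    binary = pd.DataFrame(index=counts.index)
    for group in counts.columns:
        threshold = thresholds.get(group, 0)
        binary[group] = (counts[group] > threshold).astype(int)
    return binary

# Hypothetical usage with per-group thresholds:
# binary_inputs = binarize_claim_counts(counts, thresholds={"group_1": 2, "group_2": 5})
```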
  • the binarized input dataset is filtered through the same workflow of steps 204-218.
  • the binarized input dataset mimics “Yes or No” responses to survey questions.
  • a sample dataset of user-provided survey responses (e.g., from at least 500 users) is used as the input features to test whether the binary GRAM can be used with binary survey-based input features.
  • the survey includes questions such as “Have you been diagnosed with endometriosis?” which corresponds to one of the input groups containing a list of ICD codes associated with an endometriosis diagnosis.
  • the results of the survey-based dataset are compared to the results of the binary-based dataset to evaluate the performance (e.g., accuracy) of each.
  • the GRAM is used to provide breast cancer diagnosis predictions (e.g., risk scores) to users of the platform 100.
  • an input dataset corresponding to a user (e.g., patient, member) is provided to the platform 100 or the risk assessment engine 106 to generate a user-specific breast cancer prediction.
  • an alert or notification is automatically sent to the user, the user’s physician, and/or the user’s payor via the platform 100.
  • the GRAM is updated or retrained periodically by the risk assessment engine 106 as new data becomes available (e.g., data not included in the original training or validation datasets).
  • the risk assessment model may be advantageously used for various other applications.
  • the GRAM may be used to scan for other types of cancers or diseases.
  • the algorithms, concepts, and/or techniques used to build, develop, and train the risk assessment model for breast cancer are applied to other cancers such as endometrial or ovarian cancers, or applied to preventable or difficult to diagnose syndromes such as polycystic ovary syndrome.
  • the risk assessment calculator (e.g., the GRAM) is offered to users outside of specific payor partnerships by setting up a fee for access to the calculator and subsequent community/resource access.
  • the GRAM becomes part of the standard of care for all women at the first sign of any breast pathology or part of the regular pap smear exam schedule.
  • the GRAM could help women without the ultimate end user needing to know about and seek the GRAM platform.
  • Full integration of the GRAM calculator would allow for the least amount of disruption and most impact in women’s lives.
  • the GRAM could also be integrated into hospital/clinic EMRs such that physicians and/or nurse practitioners can order or prescribe the GRAM calculator be run for a particular patient.
  • the GRAM (or platform 100) is leveraged to gain an understanding of how many women are scheduled for but not booking/attending their prescribed breast cancer screening examinations (e.g., mammogram, MRI, ultrasound, etc.). This information could not only help providers ensure that their patients are getting screened when they need to be but will also help payors increase revenue.
  • the platform 100 (or another computing device or system) is configured to use a Latent Dirichlet Allocation (LDA) based model to provide a probabilistic topic modeling approach.
  • the LDA model assumes latent features are associated with different classes. For example, in some embodiments the LDA model assumes that features - clinical data, RX information, personal family history, etc. - are driven by the different classes. In this case, the classes are breast cancer positive and breast cancer negative (e.g., healthy), as well as the different stages of breast cancer relative to one another.
  • LDA models are given a set of documents about several topics.
  • when deriving “signatures” associated with specific classes, the platform 100 uses inputs and outputs.
  • the inputs include “documents with words.”
  • a digital media company uses topic modeling to understand what content their readers/users like to take in by assessing a customer’s past content (e.g., input) and then using the topics generated (e.g., output) to suggest relevant stories/content for future consumption.
  • members with breast cancer are treated as a first document and members without breast cancer are treated as a second document.
  • the associated claims data and social determinants of health are the “words” of each document.
  • members with different stages of breast cancer are treated as different documents.
  • the associated claims data and social determinants of health are the “words” of each document.
  • the resulting topic associations are combined using ensemble learning, resulting in a predictive claims data “signature”.
  • the “words” are the diagnoses (e.g., ICD, CCSR codes), procedures (e.g., CPT codes), demographics, lab information, pharmacy information, and other information provided by the user and payor.
  • the “documents” with those “words” are fed into the LDA model or algorithm.
  • the output is a series of topics with “words”. A quantitative association between topics and documents (e.g., topics being correlated member information and documents being breast cancer stages) is calculated.
  • Table 8 is an example of “topics” and potential “words” associated with such “topics” to illustrate this concept further:
  • these topics and associated weights constitute the “breast cancer signature” and the “null signature”.
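  • As an illustrative, non-authoritative sketch of this topic-modeling idea using scikit-learn (the documents, codes, and topic count below are assumptions, not the actual pipeline), each class of members can be treated as a “document” whose “words” are claim codes:

```python
# LDA over class-level "documents" of claim codes (a sketch under assumed data).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical "documents": space-separated claim codes for each member class.
documents = {
    "breast_cancer_positive": "FAC021 GEN021 PRG029 FAC021 GEN017 GEN021",
    "breast_cancer_negative": "GEN023 MBD005 GEN017 GEN023 FAC021",
}

vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")  # keep codes intact
word_counts = vectorizer.fit_transform(documents.values())

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(word_counts)  # quantitative topic-document association

# The top-weighted "words" per topic form candidate signature vocabularies.
vocab = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_codes = [vocab[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {topic_idx}: {top_codes}")
```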
  • the LDA workflow is performed without separating the types of features - all features are input without a label of personal, familial, clinical, socioeconomic, or demographic.
  • the risk assessment model (e.g., the GRAM) is a logistic regression machine learning model configured to provide risk assessment scores.
  • different types of models and/or algorithms are used to build and develop the risk assessment model, including supervised learning binary classifiers such as simple models (e.g., logistic regression and decision trees), Bayesian models, ensemble methods (e.g., gradient and other boosted trees, and random forest classifiers), and deep learning models (e.g., neural networks and multilayer perceptrons).
  • the model uses a deep learning backbone (e.g., a neural network).
  • the model may correspond to a deep learning workflow that leverages nonlinear machine learning modalities to identify the most influential features and integrate them in a weighted fashion.
  • This workflow uniquely adapts multi-omic and computer vision techniques in a clinical setting (e.g., neural networks, ensemble machine learning, and image masking and saliency).
  • the deep learning model is built to be adaptable (e.g., consistently finetuned and improved with each new dataset).
  • the platform 100 includes a neural network which is constructed with each layer corresponding to sets of related risk factors. One or more factors are selected based on overlapping features from a subset of the Sandbox, UK Biobank, Nurses’ Health Study, and other publicly available datasets.
  • Cross validation is incorporated with each iteration to track prediction errors and prevent overfitting.
  • Image masking - a sensitivity analysis used to quantitatively describe the impact that each input component has on the end prediction - is adapted for each feature, an approach called feature masking.
  • individual features and combinations of up to three features per individual are retained and used as input in the fully trained expanded model.
  • the final adjusted risk assessment is compared to the original assessment for that individual using divergence metrics.
  • Features are ranked based on divergence metrics to determine rank of influence.
  • the biological implication of these features including novel application of combined clinical, social determinant, demographic, and/or lifestyle-based factors are validated by clinical and research experts and published literature.
  • the final adjusted risk assessment is compared to or combined with existing models (e.g., Gail model, TC model).
  • the expanded model and the combined model with the Gail and TC models are incorporated using the remaining subset of the participant data (e.g., Sandbox, UK Biobank, nurses’ health study, and other publicly available sets).
  • shifts in the overall risk assessment between the weighted Gail and TC models, the expanded model, and the combined model are assessed through AUC change and/or other evaluation metrics that fit the use case, including sensitivity, specificity, precision, and recall rate.
  • noise and entropy on final predictions across models are quantified by divergence metrics, and the final GRAM is based on the model with the highest accuracy and least entropy.
  • the risk assessment model includes an autoencoder that is used to learn a distribution of features in the claims data and general associations with classes (e.g., breast cancer or healthy).
  • the autoencoder is used by the platform 100 when determining a risk assessment.
  • the autoencoder takes in or receives a user’s claims data (e.g., a user may refer to a member of an insurance policy) as input, compresses the information, and then decompresses the information to reconstruct the data. Through training the autoencoder determines weights that are used to efficiently compress and reconstruct the claims data.
  • the trained autoencoder weights are used to expedite the training of downstream models (e.g., the claims model) by providing a baseline for the parameters instead of starting with random parameters. This allows the claims model to begin training with a set of learned parameters.
  • the trained autoencoder weights are also used as nonrandom weight initialization and/or frozen weights in the claims model.
  • the weights correspond to parameters that weigh features (e.g., input features).
  • the autoencoder narrows the feature space and expedites the training of the claims model by providing a baseline for parameter values instead of starting at random (e.g., with a large number of parameters and/or values).
  • the claims model is then trained with claims data (e.g., from insurance companies/payors) and the weights (e.g., parameters) derived from the autoencoder.
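  • A hedged Keras sketch of this pretraining-and-transfer step (layer sizes, feature width, and training settings are assumptions, not values from the description):

```python
# Pretrain an autoencoder on claims features, then reuse its encoder layers as
# a non-random starting point for the downstream claims model (illustrative).
import tensorflow as tf

n_features = 256  # hypothetical width of the claims feature vector

# 1) Autoencoder: compress and reconstruct the claims vector.
inputs = tf.keras.Input(shape=(n_features,))
encoded = tf.keras.layers.Dense(64, activation="relu", name="encoder_1")(inputs)
encoded = tf.keras.layers.Dense(16, activation="relu", name="encoder_2")(encoded)
decoded = tf.keras.layers.Dense(64, activation="relu")(encoded)
decoded = tf.keras.layers.Dense(n_features, activation="linear")(decoded)
autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(claims_matrix, claims_matrix, epochs=10, batch_size=256)

# 2) Claims model: share the trained encoder layers (optionally frozen).
encoder = tf.keras.Model(inputs, encoded)
# encoder.trainable = False            # freeze to use the weights as fixed features
clf_in = tf.keras.Input(shape=(n_features,))
risk = tf.keras.layers.Dense(1, activation="sigmoid")(encoder(clf_in))
claims_model = tf.keras.Model(clf_in, risk)
claims_model.compile(optimizer="adam", loss="binary_crossentropy",
                     metrics=[tf.keras.metrics.AUC()])
# claims_model.fit(claims_matrix, breast_cancer_labels, epochs=10, batch_size=256)
```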
  • debiasing and generalizability techniques are used when training the claims model to make it more generalizable.
  • the output of the claims model is a risk assessment model that is used to predict breast cancer diagnosis in users (e.g., provide a breast cancer risk score).
  • the output from the claims model is interpreted using two methods - masking and saliency. In some embodiments, both of these methods occur simultaneously and provide quantitative insight on the association of each feature with a breast cancer diagnosis. These interpretation methods allow for biological/literature validation and downstream clinical validation.
  • an alternative to the claims model and interpretation steps includes using topic modeling (e.g., alternative information). This method derives signatures that are used as the predictive model (e.g., instead of the neural networks and machine learning in the claims model).
  • the claims model is configured to determine associations between the features from the claims data and breast cancer diagnoses.
  • breast cancer positive is determined by either the presence of two breast cancer diagnosis claims and/or the presence of a pathology-related breast cancer code. If a member is breast cancer positive based on these criteria, then the first instance of a breast cancer suspicion on their claims record is used as the platform 100’s target. In some embodiments, this suspicion includes any procedure code used for breast cancer diagnosis and screening immediately before a breast cancer diagnosis code, or the presence of a diagnosis code pertaining to a benign breast nodule, breast lump, and/or related breast anomaly.
  • the claims model is configured as a “risk factor type model.” As such, the claims model analyzes the medical claims dataset as subsets of groupings based on groups of risk factors. In some embodiments, these groups include: personal factors (e.g., age, height, weight, place of birth, race, gender identity), socioeconomic factors (e.g., income, education, employment, zip code, marital status, etc.), and/or clinical and medical factors (e.g., known genetic risks, underlying conditions).
  • Individual models are developed from risk factor groups. For example, in some embodiments personal factors (e.g., familial history, lifestyle, age, height, weight, etc.) form one subset of the data with all patients included. This subset is fed into a feed-forward neural network to produce a classification model that identifies the probability of a breast cancer diagnosis by stage. To assist in training, factor weights from the optimized autoencoder are transferred - resulting in models that have the same encoding architecture, activation function, and optimizer. Additionally, instead of a random initializer, the starting factors are based on the encoding architecture.
  • FIG. 8A illustrates a method 800 for developing and training a claims model in accordance with aspects described herein.
  • the method 800 corresponds to the development and training of a risk factor type model.
  • at step 802, medical claims data is received and preprocessed, where columns are all factors and rows are all members (or patients).
  • the medical claims data includes at least 30 million claims.
  • the distribution of each of the factor types is taken and a subset is extracted - maintaining the same distribution.
  • final subset dataset includes at least 5 million claims per factor type.
  • the subset data is segregated by factor type, with columns as features and rows as members.
  • the encoding architecture from the autoencoder is incorporated with the presence of a breast cancer diagnosis or not (e.g., healthy) as the target.
  • each factor type is fed into a corresponding feed-forward neural network.
  • each factor type is fed into a common neural network.
  • the neural networks provide a class probability for each member.
  • FIG. 8B illustrates another method 850 for developing and training a claims model in accordance with aspects described herein.
  • the method 850 corresponds to the development and training of a risk factor type model.
  • at step 852, medical claims data is received and preprocessed, where columns are all factors and rows are all members (or patients).
  • the medical claims data includes at least 30 million claims.
  • the distribution of each of the factor types is taken and a subset is extracted - maintaining the same distribution.
  • final subset dataset includes at least 5 million claims per factor type.
  • the subset data is segregated by factor type, with columns as features and rows as members.
  • the data is fed into different machine learning models and/or algorithms.
  • the data for each factor type is provided to a k-fold split model (e.g., ¾ training and ¼ testing).
  • the training data from the k-fold split model is used with a logistic regression model, a support vector machine model, a random forest model, a decision tree model, or any other suitable machine learning model or algorithm.
  • the best performing algorithm is selected per factor type dataset.
  • the best performing algorithm is selected using metrics such as accuracy, AUC, sensitivity, and/or specificity.
  • at step 860, the class probability for breast cancer risk prediction based on factor type is provided.
  • the resulting models are combined using stacked generalization or ensemble learning - involving combining multiple weak classifiers to boost prediction accuracy, sensitivity, and specificity.
  • a weighted combination of the individual factor type models is provided, where the weights are the coefficients of the ensemble model.
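  • A short sketch of this per-factor-type selection and stacking step using scikit-learn (the candidate models, 4-fold split, and AUC criterion are assumptions; in practice each base model would be wrapped so it only sees its own factor-type columns):

```python
# Pick the best algorithm per factor-type dataset by cross-validated AUC, then
# stack the winners (illustrative; data handling per factor type is simplified).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

CANDIDATES = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "random_forest": RandomForestClassifier(n_estimators=200),
    "decision_tree": DecisionTreeClassifier(),
}

def best_model_for_factor_type(X, y):
    """Return the name and estimator with the highest mean cross-validated AUC."""
    scores = {
        name: cross_val_score(model, X, y, cv=4, scoring="roc_auc").mean()
        for name, model in CANDIDATES.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, CANDIDATES[best_name]

# Hypothetical stacking of the per-factor-type winners; the final logistic
# regression's coefficients act as the per-model weights.
# winners = [(f"type_{i}", best_model_for_factor_type(X_t, y)[1]) for i, X_t in enumerate(factor_type_sets)]
# ensemble = StackingClassifier(estimators=winners, final_estimator=LogisticRegression()).fit(X_all, y)
```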
  • the claims model is configured as an “individual risk factor model.” As such, the claims model analyzes the medical claims dataset as subsets of groupings based on masking one individual risk factor. For example, in some embodiments if there are n factors in the claims dataset, the first subset consists of all factors - factor_1, further delineated with all members grouped by categorical features of factor_1. The second subset consists of all factors - factor_2 with all members grouped by categorical features of factor_2, and so on, with subsets consisting of all factors - factor_n and members grouped by categorical features of factor_n. In some embodiments, one factor is excluded at a time, followed by grouping the members by that excluded factor to account for biases prior to combining (or ensembling) the models.
  • the result is a dataset of all factors - factor_age.
  • the dataset is subset further based on member age.
  • Each of the subsets has the same columns (factors), but the members for each subset are different.
  • a first subset has members ages up to 12 (or about 12), a second subset has members ages 13-18 (or about 13-18), a third subset has members ages 19-24 (or about 19-24), and so on.
  • a new member is run through all the models of all factors - factor_age for each age bin. The differences in risk assessment across models indicate quantitatively how much of an effect age plays in the model development as well. This indicates whether individual models should be built with age categorized and separated then ensembled or not. This same process is repeated for all factors and different combinations of factors as well.
  • the individual claims model is developed from a subset based on an individual’s risk. For example, in one embodiment all patients of ages 15-35, 36-55, 56-75, 76 and above, are all individual subsets resulting in individual models.
  • each model is fed into feed forward neural networks to provide a classification model that identifies the probability of a breast cancer diagnosis by stage.
  • factor weights from the autoencoder are transferred - resulting in models that have the same encoding architecture, activation function, and optimizer. Additionally, instead of a random initializer, the starting factors are based on the encoding architecture.
  • FIG. 9A illustrates a method 900 for developing and training a claims model in accordance with aspects described herein.
  • the method 900 corresponds to the development and training of an individual risk factor model.
  • medical claims data is received and preprocessed where columns are all factors and rows are all members (or patients).
  • the medical claims data includes at least 30 million claims. In some embodiments, the medical claims data includes about 30 million claims.
  • the distribution of each of the factor types is taken and a subset is extracted - maintaining the same distribution.
  • final subset dataset includes at least 5 million claims per factor type. In some embodiments, the final subset dataset includes about 5 million claims per factor type.
  • the dataset is split into n datasets, where n represents the total number of factors. Each dataset contains all but one of the factors. The removed factor is used to categorically group the members. Therefore, the columns of the first dataset consist of all factors - factor_1 with members grouped by factor_1, the second consists of all factors - factor_2 with members grouped by factor_2, and so on.
  • the encoding architecture from the autoencoder is incorporated with the presence of a breast cancer diagnosis or not (e.g., healthy) as the target.
  • each factor group is fed into a corresponding feed-forward neural network.
  • each factor group is fed into a common neural network.
  • the neural networks provide a class probability for each member.
  • step 914 the class probabilities for each factor group are combined to provide a risk assessment score for each member.
  • FIG. 9B illustrates another method 950 for developing and training a claims model in accordance with aspects described herein.
  • the method 950 corresponds to the development and training of an individual risk factor model.
  • medical claims data is received and preprocessed where columns are all factors and rows are all members (or patients).
  • the medical claims data includes at least 30 million claims. In some embodiments, the medical claims data includes about 30 million claims.
  • the distribution of each of the factor types is taken and a subset is extracted - maintaining the same distribution.
  • final subset dataset includes at least 5 million claims per factor type. In some embodiments, the final subset dataset includes about 5 million claims per factor type.
  • the dataset is split into n datasets, where n represents the total number of factors. Each dataset contains all but one of the factors. The removed factor is used to categorically group the members. Therefore, the columns of the first dataset consist of all factors - factor_1 with members grouped by factor_1, the second consists of all factors - factor_2 with members grouped by factor_2, and so on.
  • the data is fed into different machine learning models and/or algorithms.
  • the data for each factor type is provided to a k-fold split model (e.g., ¾ training and ¼ testing).
  • the training data from the k-fold split model is used with a logistic regression model, a support vector machine model, a random forest model, a decision tree model, or any other suitable machine learning model or algorithm.
  • the best performing algorithm is selected per factor dataset. For example, in some embodiments the best performing algorithm is selected using metrics such as accuracy, AUC, sensitivity, and/or specificity.
  • the class probability for breast cancer risk prediction based on factor group is provided.
  • the resulting models are combined using stacked generalization or ensemble learning - involving combining multiple weak classifiers to boost prediction accuracy, sensitivity, and specificity.
  • a weighted combination of the individual factor group models is provided, where the weights are the coefficients of the ensemble model.
  • the platform 100 utilizes interpretation to determine a risk assessment model. For example, in some embodiments, the platform 100 utilizes “factor masking.”
  • Image masking is generally a type of computer vision analysis where a mask is applied to a portion of an image (e.g., setting all pixels in that region to 0) while retaining the rest of the image. This is a sensitivity analysis to quantitatively determine the impact of the retained image on the final classification.
  • the platform 100 uses what is termed as “factor masking.”
  • the platform 100 performs two types of factor masking: masking groups of risk factors (e.g., factor types), and masking of the individual risk factors.
  • the purpose of factor type masking is to quantitatively determine the influence of a group of risk factors (e.g., clinical, personal, socioeconomic, etc.) on breast cancer risk.
  • masking underlies and defines the associations developed from the claims model - solving the problem of black box interpretability of neural networks and deep learning.
  • risk factor masking is performed by grouping the input dataset based on factor types, where f_n is the individual factor and F_M indicates the factor type it belongs to (e.g., personal factors F_P, clinical factors F_C, socioeconomic factors F_S, etc.).
  • the data is organized such that the factor types are the columns and patients are the rows.
  • One of the factor types is retained while the rest are masked (e.g., set to zero).
  • the masked data is fed as the input into the trained claims model, resulting in predictions for each patient based solely on the retained factor type. This process is repeated for each factor type F_M.
  • the amount of change between the predicted risk assessment from the full dataset for a patient versus the predicted risk assessment from the masked dataset for a patient is determined using average precision (AP) and AUC metrics and is calculated per F_M.
  • based on the AP and AUC metrics, the factor types are ranked to determine their strength of influence on the risk assessment predictions.
  • the ranked list of factor types indicates defined associations with breast cancer risk. The list exposes factor types that have yet to be explored in the context of cancer risk from a quantitative perspective.
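  • An illustrative sketch of the factor-type masking loop (the column-group mapping and the model's predict_proba interface are assumptions); the same loop applies to individual factor masking by using one column per group:

```python
# Retain one factor type's columns, zero the rest, and measure how the masked
# predictions score and how far they shift from the full-data predictions.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def mask_all_but(X: np.ndarray, keep_cols) -> np.ndarray:
    """Return a copy of X with every column zeroed except keep_cols."""
    masked = np.zeros_like(X)
    masked[:, keep_cols] = X[:, keep_cols]
    return masked

def factor_type_influence(model, X, y, factor_groups: dict) -> dict:
    """Score each retained factor type by AP/AUC and by mean prediction shift."""
    full_pred = model.predict_proba(X)[:, 1]
    results = {}
    for name, cols in factor_groups.items():
        masked_pred = model.predict_proba(mask_all_but(X, cols))[:, 1]
        results[name] = {
            "auc": roc_auc_score(y, masked_pred),
            "ap": average_precision_score(y, masked_pred),
            "mean_shift": float(np.mean(np.abs(full_pred - masked_pred))),
        }
    return results

# factor_groups = {"personal": [0, 1, 2], "clinical": [3, 4, 5], "socioeconomic": [6, 7]}
# ranked = sorted(factor_type_influence(model, X, y, factor_groups).items(),
#                 key=lambda kv: kv[1]["auc"], reverse=True)
```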
  • a similar process is performed for individual risk factor masking.
  • the input dataset is grouped based on the individual factors, where f_n is the individual factor.
  • the data is organized such that the factors are the columns and patients are the rows.
  • One of the factors is retained while the rest are masked (e.g., set to zero).
  • all factors are set to 0 except for one.
  • the masked data is fed as the input into the trained claims model, resulting in predictions for each patient based solely on the retained factor. This process is repeated for each factor f_n.
  • the amount of change between the predicted risk assessment from the full dataset for a patient versus the predicted risk assessment from the masked dataset for a patient is determined using AP and AUC metrics and is calculated per f_n.
  • the factors are ranked to determine the strength of influence of the risk assessment predictions.
  • the ranked list of factors defines quantitative associations with breast cancer risk. The list exposes factors that have yet to be explored as well in the context of cancer risk from a quantitative perspective.
  • both of these masking analyses are used to validate the association results from the claims model. Additionally, insights from the calculated interpretations dictate where and what unbiasing steps are required.
  • the platform 100 also utilizes class saliency to better determine a risk assessment model.
  • Class saliency is another computer vision technique that is used to compute the gradient of an output class prediction with respect to an input via back propagation.
  • class saliency is used by the platform 100 to identify the relevant input components whose values would affect the positive class probability in a trained neural network.
  • the platform 100 uses what is termed as “factor saliency” to be the factors whose change in value would increase the model’s belief of the positive class label. In this case, the more salient factors were the ones that increased the probability of developing a disorder or disease, such as the probability of developing breast cancer.
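  • A hedged TensorFlow sketch of this “factor saliency” idea (the Keras model and input matrix are assumptions): the gradient of the positive-class output with respect to each input factor is computed by back propagation and averaged over members:

```python
# Gradient-based factor saliency for a trained Keras classifier (illustrative).
import numpy as np
import tensorflow as tf

def factor_saliency(model: tf.keras.Model, X: np.ndarray) -> np.ndarray:
    """Mean gradient of the positive-class output w.r.t. each input factor;
    larger positive values mean increasing that factor raises predicted risk."""
    inputs = tf.convert_to_tensor(X, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(inputs)
        positive_prob = model(inputs)  # sigmoid output = P(positive class)
    grads = tape.gradient(positive_prob, inputs)
    return tf.reduce_mean(grads, axis=0).numpy()

# most_salient_first = np.argsort(-factor_saliency(claims_model, X_validation))
```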
  • machine learning models and algorithms, when implemented, generate bias.
  • the machine learning features of the platform 100 generate bias over time.
  • a de-biasing technique is employed by the platform 100.
  • a de-biasing technique is implemented using generative adversarial networks (GANs).
  • GANs are unsupervised learning tasks in machine learning that involve two sub-models: a generator model, which creates new plausible data examples from the original data, and a discriminator model, which classifies the data (both new and original) as real (e.g., original) or fake (e.g., new).
  • GANs learn the “ins and outs” of the dataset and generate new data points that would plausibly be a part of the original set.
  • the modified datasets are run as a supervised learning problem and the discriminator model attempts to identify examples as either “real” (e.g., from the original data) or “fake” (e.g., from the generated set).
  • the two sub-models are trained together until the model is not able to tell the two groups apart (e.g., 50% of the time or about 50% of the time). This means that the model is generating truly plausible examples.
  • GANs enable the platform 100 to decrease the bias introduced to the platform 100’s risk assessment model (e.g., the GRAM model) by the datasets.
  • many marginalized populations within the platform 100’s datasets are not as well represented as others (i.e., Caucasian, middle-class, and over age 45).
  • the platform 100 evens out the representation and creates a risk assessment model that consciously tackles bias from all angles, ensuring that the platform 100 is capable of providing all women (or men) with an accurate risk assessment regardless of age, ethnicity, socio-economic status, and all other differentiators.
  • the method of de-biasing uses the learned representations of the model to adjust the training data. Adjustments are based on where the model fails or where accuracy varies significantly for certain demographics or groups of members that lack dimensionality in the training data. This is an unsupervised approach where the platform 100 does not specify the underrepresented groups (e.g., sensitive features). The unbalanced representations are learned during the training process. For example, in some embodiments when deciding if a member has breast cancer or not, latent variables (e.g., demographics or geographical features) should not drastically impact the classification results.
  • variational autoencoders (VAEs) are used to learn the latent distribution for this de-biasing approach.
  • an overall workflow to create the debiased training dataset is defined by: 1) estimating the latent distribution through the encoder, and 2) adaptively adjusting the frequency of occurrence of overrepresented regions and increasing sampling of underrepresented regions.
  • the training dataset is used by the platform 100 to train the risk assessment (or claims) model.
  • this method provides insight into which latent variables introduce the most bias in the original algorithm, based on the amount of adjustment used per variable.
  • the adjustment amount is used to rank the latent variables according to how much bias each introduces.
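  • As a non-authoritative sketch of this adaptive-resampling idea (assuming latent codes have already been produced by the encoder; the histogram-based density estimate and the smoothing constant are assumptions):

```python
# Weight training examples inversely to the density of their latent region so
# that underrepresented regions are sampled more often (illustrative only).
import numpy as np

def debias_sampling_weights(latents: np.ndarray, bins: int = 10, alpha: float = 0.01) -> np.ndarray:
    """Per-dimension histogram density; rare latent values receive larger weight."""
    n, d = latents.shape
    weights = np.ones(n)
    for j in range(d):
        hist, edges = np.histogram(latents[:, j], bins=bins, density=True)
        idx = np.clip(np.digitize(latents[:, j], edges[:-1]) - 1, 0, bins - 1)
        weights *= 1.0 / (hist[idx] + alpha)
    return weights / weights.sum()

# resample_idx = np.random.choice(len(latents), size=len(latents),
#                                 p=debias_sampling_weights(latents))
```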
  • transfer learning using autoencoders for generalizability is employed by the platform 100.
  • Transfer learning is a method where information gained from training one model is transferred to a different problem space. This decreases the training time and sample size needed for each subsequent problem.
  • the platform 100 utilizes pre-trained layers of machine learning networks and adds subsequent dense layers at the end for recognizing members of the new dataset.
  • the risk assessment model is initially created, trained, and tested with data from a first payor. Subsequently, using transfer learning techniques, the model is then adapted for a second payor, a third payor space, etc. The goal is to create additional layers in the machine learning approach associated with each individual payor and the unique information they provide while also tuning the existing overlapping layers.
  • a first level corresponds to features overlapping with most payors
  • a second level corresponds to payor specific features
  • a third level corresponds to user survey features.
  • the risk assessment model (e.g., the GRAM) is trained and tested using data from a first payor to develop three model levels.
  • the first model (e.g., Level 1) contains features that are overlapping across most national payors - diagnoses, procedures, labs, pharmacy, and basic member information (e.g., age, height, weight).
  • the second model (e.g., Level 2) contains features specific to the payor (e.g., the payor associated with the input dataset). Payor-specific information generally varies in the social determinants of health information it includes.
  • the third model (e.g., Level 3) contains features that are from end user surveys.
  • the end user surveys include information such as: “at what age did the member start their first period?”, “did the member breastfeed after giving birth and for how long?”, “at what age did they start menopause?”, and other information that is not available from claims data.
  • the three levels are created and trained with a first payor’s claims data and employees under the first payor (e.g., who are the end users).
  • the levels are ensembled to create a final risk assessment prediction model.
  • the model is taken to a second payor and fine-tuned with the second payor’s data and end users.
  • all four models are ensembled to provide a combined model. This approach creates a “generalizable” model. This model can then be used across users with any insurance, and without insurance, since it has levels pertaining to multiple types of information.
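  • A hedged Keras sketch of the payor-to-payor transfer step (layer choices, feature widths, and the assumption that the first payor's model was built with the functional API are all illustrative):

```python
# Freeze the shared (Level 1) layers learned from the first payor and add
# trainable payor-specific layers fine-tuned on the second payor's data.
import tensorflow as tf

def adapt_to_new_payor(base_model: tf.keras.Model, n_payor_features: int) -> tf.keras.Model:
    """Keep the shared representation fixed and learn a new payor-specific head."""
    base_model.trainable = False                # preserve the Level 1 knowledge
    shared_in = base_model.input
    shared_repr = base_model.layers[-2].output  # representation before the old output head
    payor_in = tf.keras.Input(shape=(n_payor_features,), name="payor_specific_features")
    x = tf.keras.layers.Concatenate()([shared_repr, payor_in])
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    risk = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs=[shared_in, payor_in], outputs=risk)

# new_model = adapt_to_new_payor(first_payor_model, n_payor_features=12)
# new_model.compile(optimizer="adam", loss="binary_crossentropy",
#                   metrics=[tf.keras.metrics.AUC()])
# new_model.fit([X_shared_payor2, X_specific_payor2], y_payor2)
```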
  • clinical risk assessments and genetic risk assessments are combined to improve risk analysis of a particular disease (e.g., cancer), medical condition, and/or syndrome.
  • machine learning models and algorithms are utilized to determine such a risk assessment of a user (e.g., human female or human male).
  • FIG. 10 shows an example of a generic computing device 1000, which may be used with some of the techniques described in this disclosure.
  • Computing device 1000 includes a processor 1002, memory 1004, an input/output device such as a display 1006, a communication interface 1008, and a transceiver 1010, among other components.
  • the device 1000 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage.
  • Each of the components 1000, 1002, 1004, 1006, 1008, and 1010 is interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 1002 can execute instructions within the computing device 1000, including instructions stored in the memory 1004.
  • the processor 1002 may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
  • the processor 1002 may provide, for example, for coordination of the other components of the device 1000, such as control of user interfaces, applications run by device 1000, and wireless communication by device 1000.
  • Processor 1002 may communicate with a user through control interface 1012 and display interface 1014 coupled to a display 1006.
  • the display 1006 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
  • the display interface 1014 may comprise appropriate circuitry for driving the display 1006 to present graphical and other information to a user.
  • the control interface 1012 may receive commands from a user and convert them for submission to the processor 1002.
  • an external interface 1016 may be provided in communication with processor 1002, so as to enable near area communication of device 1000 with other devices.
  • External interface 1016 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
  • the memory 1004 stores information within the computing device 1000.
  • the memory 1004 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
  • Expansion memory 1018 may also be provided and connected to device 1000 through expansion interface 1020, which may include, for example, a SIMM (Single In Line Memory Module) card interface.
  • expansion memory 1018 may provide extra storage space for device 1000, or may also store applications or other information for device 1000.
  • expansion memory 1018 may include instructions to carry out or supplement the processes described above, and may include secure information also.
  • expansion memory 1018 may be provided as a security module for device 1000, and may be programmed with instructions that permit secure use of device 1000.
  • secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
  • the memory may include, for example, flash memory and/or NVRAM memory, as discussed below.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 1004, expansion memory 1018, memory on processor 1002, or a propagated signal that may be received, for example, over transceiver 1010 or external interface 1016.
  • Device 1000 may communicate wirelessly through communication interface 1008, which may include digital signal processing circuitry where necessary.
  • Communication interface 1008 may in some cases be a cellular modem.
  • Communication interface 1008 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others.
  • Such communication may occur, for example, through radio-frequency transceiver 1010.
  • short-range communication may occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown).
  • a GPS (Global Positioning System) receiver module 1022 may provide additional navigation- and location-related wireless data to device 1000, which may be used as appropriate by applications running on device 1000.
  • Device 1000 may also communicate audibly using audio codec 1024, which may receive spoken information from a user and convert it to usable digital information. Audio codec 1024 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 1000. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 1000.
  • the device 1000 includes a microphone to collect audio (e.g., speech) from a user.
  • the device 1000 may include an input to receive a connection from an external microphone.
  • the computing device 1000 may be implemented in a number of different forms, as shown in FIG. 10. For example, it may be implemented as a computer (e.g., laptop) 1026. It may also be implemented as part of a smartphone 1028, smart watch, tablet, personal digital assistant, or other similar mobile device.
  • implementations of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Implementations of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded on an artificially- generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • a computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
  • the term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing.
  • the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them.
  • the apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language resource), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read-only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
  • Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • implementations of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending resources to and receiving resources from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
  • Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network.
  • Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device).
  • Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data or signals between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used.
  • the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, wireless connections, and so forth.
  • a service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
  • a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements).
  • the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
  • This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
  • “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements).
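The bullets above describe the disclosed subject matter at the level of a generic client-server architecture over a communication network. As an illustration only (it is not part of the application), the following minimal Python sketch shows a server transmitting an HTML page to a client device and receiving user input from it; the handler name, port, and query parameter are hypothetical stand-ins.

```python
# Minimal illustrative sketch (not from the application): a server that sends an
# HTML page to a client device and receives user input as a query parameter.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


class ExampleHandler(BaseHTTPRequestHandler):
    """Hypothetical front-end handler: serves a page and echoes submitted input."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        name = query.get("name", ["user"])[0]  # data generated at the client device
        body = f"<html><body><p>Hello, {name}.</p></body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # HTML page transmitted to the client


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), ExampleHandler).serve_forever()
```

A Web browser pointed at http://127.0.0.1:8000/?name=Ada would play the role of the client device in this sketch.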

Abstract

A computer-implemented method is disclosed for training a machine learning (ML) model to assess the risk that a human subject will develop at least one disorder. The method includes receiving an input dataset corresponding to a plurality of human subjects. A modified dataset is created that includes compound risk factors derived from the input dataset. The input dataset is divided into a first dataset and a second dataset. At least one risk factor associated with developing the at least one disorder is selected. The ML model is trained using the first dataset and the at least one risk factor. The second dataset is provided to the ML model to generate a prediction of the risk of developing the at least one disorder before the end of a target prediction period. At least one parameter of the ML model is adjusted based on the generated risk predictions.
PCT/US2023/033921 2022-09-28 2023-09-28 System for assessing the risk of developing breast cancer and related methods WO2024072921A1 (fr)
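To make the sequence of steps in the abstract concrete, the sketch below shows one way such a training workflow could be laid out with scikit-learn. It is a hypothetical reconstruction for orientation only, not the claimed implementation: the column names, the compound risk factor (BMI), the estimator, and the tuning rule are all illustrative assumptions.

```python
# Hypothetical sketch of the training workflow summarized in the abstract.
# Column names, the compound risk factor, and the estimator are illustrative only.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_risk_model(input_df: pd.DataFrame) -> GradientBoostingClassifier:
    # 1. Create a modified dataset containing compound risk factors derived
    #    from the input dataset (example: body-mass index from height/weight).
    df = input_df.copy()
    df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)

    features = ["age", "bmi", "age_at_menarche", "relatives_with_disorder"]
    X, y = df[features], df["developed_disorder_within_window"]

    # 2. Divide the dataset into a first (training) and second (evaluation) dataset.
    X_train, X_eval, y_train, y_eval = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )

    # 3. Select risk factors associated with developing the disorder.
    selector = SelectKBest(f_classif, k=3).fit(X_train, y_train)
    X_train_sel = selector.transform(X_train)
    X_eval_sel = selector.transform(X_eval)

    # 4. Train the ML model on the first dataset and the selected risk factors.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train_sel, y_train)

    # 5. Generate risk predictions for the target prediction window on the
    #    second dataset, then adjust a model parameter based on how they score.
    auc = roc_auc_score(y_eval, model.predict_proba(X_eval_sel)[:, 1])
    if auc < 0.75:  # illustrative tuning rule, not from the application
        model.set_params(n_estimators=300).fit(X_train_sel, y_train)
    return model
```

The application itself leaves each of these choices (which compound risk factors, which estimator, how the parameter is adjusted) open; the sketch only mirrors the ordering of steps recited in the abstract.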

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263377500P 2022-09-28 2022-09-28
US63/377,500 2022-09-28

Publications (1)

Publication Number Publication Date
WO2024072921A1 true WO2024072921A1 (fr) 2024-04-04

Family

ID=90478996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/033921 WO2024072921A1 (fr) 2022-09-28 2023-09-28 System for assessing the risk of developing breast cancer and related methods

Country Status (1)

Country Link
WO (1) WO2024072921A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068083A1 (en) * 2014-12-08 2018-03-08 20/20 Gene Systems, Inc. Methods and machine learning systems for predicting the likelihood or risk of having cancer
US20180211727A1 (en) * 2017-01-24 2018-07-26 Basehealth, Inc. Automated Evidence Based Identification of Medical Conditions and Evaluation of Health and Financial Benefits Of Health Management Intervention Programs
US20200342958A1 (en) * 2019-04-23 2020-10-29 Cedars-Sinai Medical Center Methods and systems for assessing inflammatory disease with deep learning
WO2021163619A1 (fr) * 2020-02-14 2021-08-19 Icahn School Of Medicine At Mount Sinai Methods and apparatus for diagnosing progressive decline in kidney function using a machine learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PRIYANKA RAJENDRA: "Prediction of diabetes using logistic regression and ensemble techniques", COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE UPDATE, vol. 1, 1 January 2021 (2021-01-01), pages 100032, XP093151969, ISSN: 2666-9900, DOI: 10.1016/j.cmpbup.2021.100032 *

Similar Documents

Publication Publication Date Title
US11600390B2 (en) Machine learning clinical decision support system for risk categorization
US20230142594A1 (en) Application of bayesian networks to patient screening and treatment
Tariq et al. Detecting developmental delay and autism through machine learning models using home videos of Bangladeshi children: Development and validation study
US10565309B2 (en) Interpreting the meaning of clinical values in electronic medical records
US20190043606A1 (en) Patient-provider healthcare recommender system
US20180181719A1 (en) Virtual healthcare personal assistant
US20150254754A1 (en) Methods and apparatuses for consumer evaluation of insurance options
US20140129256A1 (en) System and method for identifying healthcare fraud
Ma et al. An app for detecting bullying of nurses using convolutional neural networks and web-based computerized adaptive testing: development and usability study
US11610679B1 (en) Prediction and prevention of medical events using machine-learning algorithms
US20210118557A1 (en) System and method for providing model-based predictions of beneficiaries receiving out-of-network care
Jiang et al. Analysis of massive online medical consultation service data to understand physicians’ economic return: Observational data mining study
Frochen et al. Functional status and adaptation: measuring activities of daily living and device use in the National Health and aging trends study
Barak-Corren et al. Prediction of patient disposition: comparison of computer and human approaches and a proposed synthesis
Yerrapragada et al. Machine learning to predict tamoxifen nonadherence among US commercially insured patients with metastatic breast cancer
Ren et al. Issue of data imbalance on low birthweight baby outcomes prediction and associated risk factors identification: establishment of benchmarking key machine learning models with data rebalancing strategies
US20240161932A1 (en) Systems for assessing risk of developing breast cancer and related methods
US20220130505A1 (en) Method, System, and Computer Program Product for Pharmacy Substitutions
WO2024072921A1 (fr) 2024-04-04 System for assessing the risk of developing breast cancer and related methods
WO2022212293A1 (fr) 2022-10-06 System for assessing the risk of developing breast cancer and related methods
Wong et al. Predicting primary care use among patients in a large integrated health system: the role of patient experience measures
Liang Developing Clinical Prediction Models for Post-treatment Substance Use Relapse with Explainable Artificial Intelligence
US20240071623A1 (en) Patient health platform
US20230018521A1 (en) Systems and methods for generating targeted outputs
Matias et al. Approaches to projecting future healthcare demand