WO2020120301A1 - Hierarchical local model for social determinants of health index prediction - Google Patents

Hierarchical local model for social determinants of health index prediction

Info

Publication number
WO2020120301A1
Authority
WO
WIPO (PCT)
Prior art keywords
geographic
area
interest
sdoh
training
Prior art date
Application number
PCT/EP2019/083934
Other languages
French (fr)
Inventor
Jin Liu
Eran Simhon
Original Assignee
Koninklijke Philips N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips N.V.
Publication of WO2020120301A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Definitions

  • Various exemplary embodiments disclosed herein relate generally to a hierarchical local model for social determinants of health index prediction.
  • SDoH Social determinants of health
  • the community level SDoH index may provide healthcare organizations an overview of health status and contributing social factors of the region that they are serving.
  • a ZIP Code-level SDoH index was developed using 3rd party commercial data, and this application was limited by purchasing agreements and limited geographic regions.
  • Public free survey/statistic datasets are a good source of SDoH information, but they are limited because the health outcomes are only available at the county or state level.
  • Various embodiments relate to a method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test data set.
  • the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
  • the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.
  • Various embodiments are described, further including the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.
  • determining a training set of data based upon the determined area of interest similarity scores includes selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.
  • Various embodiments are described, further including the performance parameter is mean square error on the test data set.
  • Various embodiments are described, further including training the hierarchical machine learning model uses Lasso regression.
  • Various embodiments are described, further including the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.
  • Various embodiments are described, further including calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
  • instructions for determining a training set of data based upon the determined area of interest similarity scores include instructions for selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.
  • instructions for training the hierarchical machine learning model includes instructions for learning the weights used in the weighted sum.
  • FIG. 1 illustrates a block diagram of a hierarchical local tool for producing a SDoH index
  • FIG. 2 illustrates a user interface for use in the hierarchical local tool
  • FIG. 3 illustrates the training process used by the local training set optimization module
  • FIG. 4 illustrates a block diagram of the hierarchical machine learning module.
  • SDoH information such as income, education, and housing at different geographic levels.
  • Health outcome information such as life expectancy and mortality rate may also be obtained from public databases, but health outcome information is often only available at the county or state level.
  • renting versus owning a house may be a good differentiator in rural regions, providing a good reflection of economic stability, while it may make less of a difference in urban cities such as New York City, where even affluent people rent rather than own a home.
  • a healthcare organization would prefer a model that is based on data for the area it serves.
  • the number of counties in that area may be too small to adequately develop such a model.
  • creating a reliable model would include using data from other counties with different socio-economic challenges.
  • this problem is solved by finding the optimal set of counties that should be used for developing the model.
  • the embodiments described herein include a hierarchical machine learning model with optimized local training set selection to integrate multilevel public datasets into a predictive SDoH model that produces a SDoH index.
  • the hierarchical machine learning model trains the model at the county level and predicts the SDoH index at the ZIP code level or some other geographic level defined by the users.
  • the ZIP code level is used herein as an example to make the description concise without losing generalization; while county level for training is also likewise used as an example to represent the geographic level same as the health outcome variable. Other geographic levels may be used as well.
  • FIG. 1 illustrates a block diagram of a hierarchical local tool for producing a SDoH index.
  • the hierarchical local tool 100 includes a user interface 105, a region scoring module 110, a local training set optimization module 115, and a hierarchical machine learning module 120.
  • FIG. 2 illustrates a user interface for use in the hierarchical local tool.
  • the user interface 105 may include a map 225 that shows the geographic area to be used with the hierarchical local tool 100. In this specific example, the United States is shown, but other countries, regions, continents, etc., including the whole world, may also be shown by this map.
  • the specific map displayed may be set by the location where the hierarchical local tool 100 is being used. For example, if the hierarchical local tool 100 is being used in Denmark, the map may just be for Denmark, all of Scandinavia, or all of Europe.
  • the map 225 may be zoomed in and out as well as panned to get to a specific area of interest.
  • the user interface also includes a region of interest pane 230 that allows a user to define a region of interest either by drawing on and/or selecting from the map 225 or by importing a file that specifies the region of interest.
  • a drop down menu or other interface elements may be used to select the region of interest.
  • the user interface may also include a geographic level pane 235 where the user may select the geographic level for SDoH index.
  • As shown in the geographic level pane 235, there are check boxes for the ZIP code, Block, and County levels. A user may click a specific check box to select the desired geographic level for the SDoH index. Also, a drop down menu or other interface elements may be used to select the geographic level for the SDoH index.
  • the user interface 105 may display a selection information pane 240 and a local map 245.
  • the selection information pane 240 may display various information related to the region of interest such as the number of people in the region and the resulting SDoH index for the region once it is calculated.
  • the local map 245 may show the boundaries of the different geographic areas as defined by the selected geographic level, for example, by zip code, block or county.
  • the map 245 may also be color coded to show via specific colors the SDoH index values for each geographic area in the map 245.
  • the user interface 105 may also include a feature plot 250 that plots the feature data related to the SDoH index value that is calculated. This provides a user additional insight to see which features are contributing to the SDoH index value.
  • the region scoring module 110 receives input data from the user interface defining the region of interest.
  • the region scoring module 110 then takes each region outside of the region of interest and calculates a similarity score that indicates how similar that region is to the region of interest. These similarity scores will then be used to define training sets to use to train the hierarchical machine learning model.
  • the region scoring module 110 produces a weighted average similarity score ($S_j$) for each remaining region $j$ by comparing it with each subregion $i$ in the region of interest, based upon: its distance to each subregion of the region of interest ($dist_{ij}$); region similarity (non-SDoH features such as rural/urban, average age, gender, and race); and health outcome difference.
  • the following function may be used, where the weights ($w_1$, $w_2$, and $w_3$) can be predetermined by the user or be optimized in the local training set optimization module 115.
  • $S_j$ is the overall similarity score for region $j$
  • $S_{ij}$ is the similarity score between county $i$ and county $j$
  • $N$ is the total number of counties in the region of interest
  • $dist_{ij}$ is the distance between county $i$ and county $j$
  • $dif_{ij}$ is the difference in non-SDoH features between county $i$ and county $j$
  • $dif_{ij}^{\text{health outcome}}$ is the difference in health outcome between county $j$ and county $i$.
  • the hierarchical local tool 100 uses the local training set optimization module 115 to determine an optimal training set for training the hierarchical machine learning model.
  • FIG. 3 illustrates the training process used by the local training set optimization module.
  • the map 305 shows the region of interest 310 and potential similar regions 312, 314, and 316 to be used for training.
  • the local training set optimization module 115 uses the similarity scores for each county outside of the region of interest to find an optimal training set to use in training the hierarchical machine learning module.
  • the local training set optimization module 115 may first use an initial value M for the training set size and then select M number of counties outside the region of interest having the highest similarity score. For example, counties in regions 312, 314, and 316 may have the highest similarity scores to region of interest 310.
  • the local training set optimization module 115 performs model fitting with regularization 325 (e.g., Lasso regression, but other methods may be used as well) for both model fitting and feature selection (the number of features is decided by cross-validation in the training set only), resulting in sub health scores for each different SDoH category.
  • the hierarchical machine learning model uses a series of weights $\omega_1, \omega_2, \ldots, \omega_k$, where each weight is applied to the sub health score of a different SDoH category used to calculate the SDoH index (this will be explained further below).
  • MSE mean squared error
  • the local training set optimization module 115 seeks to minimize the test MSE 330.
  • a simulation based optimization 320 may be used to change training set size M and weights ($w_1$, $w_2$, $w_3$) for the SDoH index calculation and then repeat training the hierarchical machine learning model.
  • the simulation based optimization 320 may include successively selecting more or fewer areas outside the area of interest with the highest similarity scores. Heuristic methods as well as other methods may be used to speed up the simulation process to find a solution with the minimum MSE.
  • FIG. 4 illustrates a block diagram of the hierarchical machine learning module.
  • the hierarchical machine learning module 120 implements the hierarchical machine learning model defined by the local training set optimization module 115.
  • county data related to SDoH features may be found in the public/private database (e.g., the ACS database) 400.
  • This data is grouped into multiple SDoH categories (one category may include multiple SDoH features) and used in the training process, along with health status data 410, to train 115 a hierarchical machine learning model for each SDoH category, respectively.
  • the values for the feature coefficients 425 used by the model as well as the series of weights $\omega_1, \omega_2, \ldots, \omega_k$
  • the weights $\omega_1, \omega_2, \ldots, \omega_k$ for each category can be determined by normalizing the explained deviance ($R^2$) fitted for each category: $\omega_i = R_i^2 / \sum_{j=1}^{k} R_j^2$ (so that $\sum_{i=1}^{k} \omega_i = 1$).
  • the features were grouped to five categories which include neighborhood 401, economic status 402, education status 403, social context 404, and health and health care status 405.
  • the hierarchical machine learning model was trained at the county level, but the hierarchical machine learning model will be used to make predictions at the zip code or even the block level.
  • SDoH data for the features may be extracted from the ACS data 415 at the desired geographic level, (i.e., zip code or block), and then this feature data is fed into the hierarchical machine learning model and the feature coefficients 425 are used to calculate a sub health score 431-435 for each of the categories.
  • the sub health scores 431-435 are then multiplied by their associated weights $\omega_1, \omega_2, \ldots, \omega_5$ 427 to produce weighted health scores 441-445.
  • the weighted health scores 441-445 are then summed to produce the SDoH index 450.
  • the embodiments of the hierarchical local tool described herein solve various technological problems. Often a user will want to determine a SDoH index for a specific geographic area, for example, for all of the zip codes in a region of interest. The challenge is that data relating to health outcomes may only be available at the county level, so the embodiments of the hierarchical local tool described herein present a solution where a machine learning model for calculating the SDoH index is trained using county level data, but then may be used to make predictions at the zip code level. This is accomplished by finding an optimum training set that predicts outcomes in the region of interest using data from similar counties outside the region of interest.
  • the embodiments described herein may be implemented as software running on a processor with an associated memory and storage.
  • the processor may be any hardware device capable of executing instructions stored in memory or storage or otherwise processing data.
  • the processor may include a microprocessor, field programmable gate array (FPGA), application-specific integrated circuit (ASIC), graphics processing units (GPU), specialized neural network processors, cloud computing systems, or other similar devices.
  • the memory may include various memories such as, for example, L1, L2, or L3 cache or system memory.
  • the memory may include static random-access memory (SRAM), dynamic RAM (DRAM), flash memory, read only memory (ROM), or other similar memory devices.
  • the storage may include one or more machine-readable storage media such as read-only memory (ROM), random-access memory (RAM), magnetic disk storage media, optical storage media, flash-memory devices, or similar storage media.
  • the storage may store instructions for execution by the processor or data upon which the processor may operate. This software may implement the various embodiments described above.
  • embodiments may be implemented on multiprocessor computer systems, distributed computer systems, and cloud computing systems.
  • the embodiments may be implemented as software on a server, on a specific computer, on a cloud computing platform, or on another computing platform.
  • the term "non-transitory machine-readable storage medium" will be understood to exclude a transitory propagation signal but to include all forms of volatile and non-volatile memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test set.

Description

HIERARCHICAL LOCAL MODEL FOR SOCIAL DETERMINANTS OF HEALTH INDEX PREDICTION
TECHNICAL FIELD
[0001] Various exemplary embodiments disclosed herein relate generally to a hierarchical local model for social determinants of health index prediction.
BACKGROUND
[0002] Social determinants of health (SDoH) have been widely recognized as an important factor for health outcomes. The community level SDoH index may provide healthcare organizations an overview of health status and contributing social factors of the region that they are serving. In the past, a ZIP Code-level SDoH index was developed using 3rd party commercial data, and this application was limited by purchasing agreements and limited geographic regions. Public free survey/statistic datasets are a good source of SDoH information, but they are limited because the health outcomes are only available at the county or state level.
SUMMARY
[0003] A summary of various exemplary embodiments is presented below. Some simplifications and omissions may be made in the following summary, which is intended to highlight and introduce some aspects of the various exemplary embodiments, but not to limit the scope of the invention. Detailed descriptions of an exemplary embodiment adequate to allow those of ordinary skill in the art to make and use the inventive concepts will follow in later sections.
[0004] Various embodiments relate to a method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test data set.
[0005] The method of claim 1, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
[0006] Various embodiments are described, further including the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.
[0007] Various embodiments are described, further including the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.
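The scoring described in paragraphs [0006] and [0007] can be illustrated with a minimal Python sketch. The exact scoring function is not reproduced in this text, so the form below (a negated weighted sum of distance, non-SDoH feature difference, and health outcome difference, averaged over the areas in the region of interest) is an assumption. The dictionary keys `x`, `y`, `non_sdoh`, and `outcome`, the function names, and the sample values are all hypothetical.

```python
from math import hypot

def pair_similarity(area_in, area_out, w1=1.0, w2=1.0, w3=1.0):
    """Hypothetical similarity between one area inside the region of
    interest and one area outside it."""
    dist = hypot(area_in["x"] - area_out["x"], area_in["y"] - area_out["y"])
    dif = sum(abs(a - b) for a, b in zip(area_in["non_sdoh"], area_out["non_sdoh"]))
    dif_health = abs(area_in["outcome"] - area_out["outcome"])
    # Assumed form: a larger distance or difference means a lower similarity.
    return -(w1 * dist + w2 * dif + w3 * dif_health)

def area_of_interest_similarity(region, area_out, **weights):
    """Average the pairwise scores over all areas in the region of interest."""
    return sum(pair_similarity(a, area_out, **weights) for a in region) / len(region)

# Made-up example data: two areas in the region of interest, two candidates outside it.
region = [
    {"x": 0.0, "y": 0.0, "non_sdoh": [0.2, 0.5], "outcome": 78.0},
    {"x": 1.0, "y": 0.0, "non_sdoh": [0.3, 0.4], "outcome": 79.0},
]
near = {"x": 0.5, "y": 0.0, "non_sdoh": [0.25, 0.45], "outcome": 78.5}
far = {"x": 10.0, "y": 10.0, "non_sdoh": [0.9, 0.1], "outcome": 70.0}

score_near = area_of_interest_similarity(region, near)
score_far = area_of_interest_similarity(region, far)
```

In practice the three terms would likely need to be put on comparable scales before weighting; that normalization step is omitted here.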
[0008] Various embodiments are described, further including determining a training set of data based upon the determined area of interest similarity scores includes selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.
[0009] Various embodiments are described, further including the value of M changes between iterations of determining a training set of data.
[0010] Various embodiments are described, further including the performance parameter is mean square error on the test data set.
[0011] Various embodiments are described, further including training the hierarchical machine learning model uses Lasso regression.
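The repeated steps in paragraphs [0008] to [0010] (pick the M most similar areas, train, score on the test set, and vary M) can be sketched as follows. This is a simplified illustration, not the patented procedure: a one-feature ordinary least squares fit stands in for the regularized model, the candidate sizes are tried exhaustively rather than by simulation-based optimization, and the data layout, function names, and sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for the regularized fit)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def search_training_size(candidates, test_set, sizes):
    """candidates: (similarity, feature, outcome) tuples for areas outside the region.
    test_set: (feature, outcome) tuples used only for evaluation.
    Returns (best M, its test MSE, the fitted model)."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    test_x = [f for f, _ in test_set]
    test_y = [o for _, o in test_set]
    best = None
    for m in sizes:
        top = ranked[:m]  # the M most similar areas
        model = fit_line([c[1] for c in top], [c[2] for c in top])
        err = mse(model, test_x, test_y)
        if best is None or err < best[1]:
            best = (m, err, model)
    return best

# Made-up data: the three most similar areas follow outcome = 2 * feature.
candidates = [(0.9, 1.0, 2.0), (0.8, 2.0, 4.0), (0.7, 3.0, 6.0),
              (0.2, 1.0, 9.0), (0.1, 2.0, 1.0)]
test_set = [(1.5, 3.0), (2.5, 5.0)]
best_m, best_mse, best_model = search_training_size(candidates, test_set, sizes=[2, 3, 5])
```

With this toy data, adding the two least similar areas (M = 5) degrades the fit on the test set, which is the effect the training set search is meant to detect.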
[0012] Various embodiments are described, further including training the hierarchical machine learning model includes feature selection.
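To show why Lasso regression also performs feature selection, here is a minimal coordinate descent sketch in pure Python. It assumes centered features and outcome (no intercept) and minimizes the common objective (1/(2n))·||y − Xβ||² + α·||β||₁; the function names, the value of `alpha`, and the data are illustrative assumptions, not values from this document.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for the Lasso objective (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                # Prediction that excludes feature j's current contribution.
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, alpha) / z if z else 0.0
    return beta

# Made-up, centered data: y = 3*x1 + 0.2*x2, so the second feature is weak.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-5.8, -3.2, 0.0, 2.8, 6.2]
beta = lasso_coordinate_descent(X, y, alpha=0.5)
```

The penalty shrinks the strong coefficient slightly and sets the weak one exactly to zero, which is the sense in which the fit and the feature selection happen in one step.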
[0013] Various embodiments are described, further including the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.
[0014] Various embodiments are described, further including training the hierarchical machine learning model includes learning the weights used in the weighted sum.
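One way to obtain the category weights, described elsewhere in this document, is to normalize the explained deviance (R²) fitted for each category so that the weights sum to 1. The sketch below assumes a Gaussian model, for which explained deviance equals the ordinary R², and assumes non-negative R² values; the function names and the R² numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination for one category's fitted model."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

def category_weights(r2_by_category):
    """Normalize per-category R^2 values so the weights sum to 1."""
    total = sum(r2_by_category.values())
    return {name: r2 / total for name, r2 in r2_by_category.items()}

# Hypothetical R^2 values for the five SDoH categories named in this document.
r2_by_category = {
    "neighborhood": 0.30,
    "economic status": 0.45,
    "education status": 0.15,
    "social context": 0.10,
    "health and health care status": 0.50,
}
weights = category_weights(r2_by_category)
perfect_fit = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Categories whose models explain more of the health outcome therefore contribute more to the final index.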
[0015] Various embodiments are described, further including calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
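The step in paragraph [0015], applying a model trained at the first (e.g., county) level to features from a lower level such as a ZIP code, can be sketched as below. The model layout, the feature names (`median_income`, `unemployment`, `college_rate`), and all numeric values are hypothetical; only the structure (per-category sub health scores combined by a weighted sum) follows the text.

```python
def sub_health_score(features, coefficients, intercept=0.0):
    """Linear score for one SDoH category from its feature coefficients."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())

def sdoh_index(area_features, model):
    """model maps category -> {"coef": {...}, "intercept": float, "weight": float}.
    The coefficients and weights are assumed to come from county-level training."""
    total = 0.0
    for category, spec in model.items():
        score = sub_health_score(area_features[category], spec["coef"], spec["intercept"])
        total += spec["weight"] * score
    return total

# Hypothetical county-trained model with two categories.
model = {
    "economic status": {
        "coef": {"median_income": 2.0, "unemployment": -1.0},
        "intercept": 1.0,
        "weight": 0.75,
    },
    "education status": {
        "coef": {"college_rate": 4.0},
        "intercept": 0.0,
        "weight": 0.25,
    },
}

# Hypothetical, already-scaled features for a single ZIP code.
zip_features = {
    "economic status": {"median_income": 0.5, "unemployment": 1.0},
    "education status": {"college_rate": 0.5},
}
index = sdoh_index(zip_features, model)
```

This only works if the lower-level features are defined and scaled the same way as the features used for training, which is an assumption of the sketch.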
[0016] Further various embodiments relate to a non-transitory machine-readable storage medium encoded with instructions for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: instructions for receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; instructions for determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; instructions for determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the instructions for: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test data set.
[0017] Various embodiments are described, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
[0018] Various embodiments are described, wherein the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.
[0019] Various embodiments are described, wherein the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.
[0020] Various embodiments are described, wherein instructions for determining a training set of data based upon the determined area of interest similarity scores include instructions for selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.
[0021] Various embodiments are described, wherein the value of M changes between iterations of determining a training set of data.
[0022] Various embodiments are described, wherein the performance parameter is mean square error on the test data set.
[0023] Various embodiments are described, wherein the instructions for training the hierarchical machine learning model use Lasso regression.
[0024] Various embodiments are described, wherein the instructions for training the hierarchical machine learning model include feature selection.
[0025] Various embodiments are described, wherein the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.
[0026] Various embodiments are described, wherein instructions for training the hierarchical machine learning model includes instructions for learning the weights used in the weighted sum.
[0027] Various embodiments are described, further including instructions for calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
BRIEF DESCRIPTION OF THE DRAWINGS
[0028] In order to better understand various exemplary embodiments, reference is made to the accompanying drawings, wherein:
[0029] FIG. 1 illustrates a block diagram of a hierarchical local tool for producing a SDoH index;
[0030] FIG. 2 illustrates a user interface for use in the hierarchical local tool;
[0031] FIG. 3 illustrates the training process used by the local training set optimization module; and
[0032] FIG. 4 illustrates a block diagram of the hierarchical machine learning module.
[0033] To facilitate understanding, identical reference numerals have been used to designate elements having substantially the same or similar structure and/or substantially the same or similar function.
DETAILED DESCRIPTION
[0034] The description and drawings illustrate the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within its scope. Furthermore, all examples recited herein are principally intended expressly to be for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art and are to be construed as being without limitation to such specifically recited examples and conditions. Additionally, the term “or,” as used herein, refers to a non-exclusive or (i.e., and/or), unless otherwise indicated (e.g., “or else” or “or in the alternative”). Also, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.
[0035] Socioeconomic and behavior factors have been widely shown to affect health outcomes. In recent years, the study of socioeconomic and behavior factors has been a popular research area, and many hospitals are eager to incorporate these factors into their workflows. However, collecting social determinants of health (SDoH) information is not only time-consuming but also often unrealistic for healthcare organizations, considering that many populations are outside of the healthcare organization. Previous work in this area developed a method to produce a ZIP code level SDoH index using commercial data to show and explain the overall health status within different ZIP codes and the contributing SDoH factors. However, this method is limited to data that have the same geographic level (e.g., ZIP code level), and usage of the commercial statistical data is controlled by a third party.
[0036] In fact, many free public survey and statistical datasets include SDoH information such as income, education, and housing at different geographic levels. Health outcome information such as life expectancy and mortality rate may also be obtained from public databases, but health outcome information is often only available at the county or state level.
[0037] One of the challenges in using these public datasets is that the health outcome datasets are not available at the ZIP code level. For example, public datasets such as the American Community Survey (ACS) have multiple geographic levels including state, county, ZIP Code Tabulation Area (ZCTA), and block, while health outcome data are often aggregated at the county or state level. Considering the large size of many counties, it would be more informative to provide SDoH information at a more granular level, such as the ZIP code level or even the block level.
[0038] Another challenge is how to train a SDoH model that best fits the needs of the local community. For example, renting versus owning a house may be a good differentiator in rural regions, providing a good reflection of economic stability, while it may make less of a difference in urban cities such as New York City, where even affluent people rent rather than own a home. Hence, a healthcare organization would prefer a model that is based on data for the area it serves. However, the number of counties in that area may be too small to adequately develop such a model. In this case, creating a reliable model would include using data from other counties with different socio-economic challenges. In embodiments described herein, this problem is solved by finding the optimal set of counties that should be used for developing the model.
[0039] The embodiments described herein include a hierarchical machine learning model with optimized local training set selection to integrate multilevel public datasets into a predictive SDoH model that produces a SDoH index. The hierarchical machine learning model is trained at the county level and predicts the SDoH index at the ZIP code level or some other geographic level defined by the user. The ZIP code level is used herein as an example to keep the description concise without loss of generality; likewise, the county level is used for training as an example of the geographic level at which the health outcome variable is available. Other geographic levels may be used as well.
[0040] FIG. 1 illustrates a block diagram of a hierarchical local tool for producing a SDoH index. The hierarchical local tool 100 includes a user interface 105, a region scoring module 110, a local training set optimization module 115, and a hierarchical machine learning module 120.
[0041] FIG. 2 illustrates a user interface for use in the hierarchical local tool. The user interface 105 may include a map 225 that shows the geographic area to be used with the hierarchical local tool 100. In this specific example, the United States is shown, but other countries, regions, continents, etc., including the whole world, may also be shown by this map. The specific map displayed may be set by the location where the hierarchical local tool 100 is being used. For example, if the hierarchical local tool 100 is being used in Denmark, the map may be just for Denmark, all of Scandinavia, or all of Europe. The map 225 may be zoomed in and out as well as panned to get to a specific area of interest. The user interface also includes a region of interest pane 230 that allows a user to determine a region of interest either by drawing and/or selecting on the map 225 or by importing a file that specifies the region of interest. In other embodiments, a drop down menu or other interface elements may be used to select the region of interest.
[0042] The user interface may also include a geographic level pane 235 where the user may select the geographic level for the SDoH index. In the geographic level pane 235 as shown, there are check boxes for the ZIP code, Block, and County levels. A user may click a specific check box to select the desired geographic level for the SDoH index. Also, a drop down menu or other interface elements may be used to select the geographic level for the SDoH index. Once the user has selected a region of interest and a geographic level, the user interface 105 may display a selection information pane 240 and a local map 245. The selection information pane 240 may display various information related to the region of interest, such as the number of people in the region and the resulting SDoH index for the region once it is calculated. The local map 245 may show the boundaries of the different geographic areas as defined by the selected geographic level, for example, by ZIP code, block, or county. The map 245 may also be color coded to show via specific colors the SDoH index values for each geographic area in the map 245.
[0043] The user interface 105 may also include a feature plot 250 that plots the feature data related to the SDoH index value that is calculated. This provides a user additional insight to see which features are contributing to the SDoH index value.
[0044] The region scoring module 110 receives input data from the user interface defining the region of interest. The region scoring module 110 then takes each region outside of the region of interest and calculates a similarity score that indicates how similar that region is to the region of interest. These similarity scores are then used to define the training sets used to train the hierarchical machine learning model.
[0045] The region scoring module 110 produces a weighted average similarity score ($S_j$) for each remaining region $j$, comparing it with each subregion $i$ in the region of interest, based upon: its distance to each subregion in the region of interest ($\mathrm{dist}_{ij}$); region similarity (non-SDoH features such as rural/urban, average age, gender, and race); and health outcome difference. The following function may be used, where the weights ($w_1$, $w_2$, and $w_3$) can be predetermined by the user or be optimized in the local training set optimization module 115.
[Equation image imgf000010_0001 not reproduced; from the definitions below, it gives $S_j = \frac{1}{N}\sum_{i=1}^{N} S_{ij}$, with each $S_{ij}$ computed from the three quantities above using the weights $w_1$, $w_2$, and $w_3$.]
where $S_j$ is the overall similarity score for region $j$, $S_{ij}$ is the similarity score between county $i$ and county $j$, $N$ is the total number of counties in the region of interest, $\mathrm{dist}_{ij}$ is the distance between county $i$ and county $j$, $\mathrm{dif}_{ij}(\text{non-SDoH features})$ is the difference in non-SDoH features between county $i$ and county $j$, and $\mathrm{dif}_{ij}(\text{health outcome})$ is the difference in health outcome between county $j$ and county $i$. These three features have been found to provide an indication of the similarity between two counties, and additional features can also be added to calculate the similarity score. The final value $S_j$ is an average of the similarity between each county $i$ in the region of interest and the county $j$ outside the region of interest.
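By way of illustration only, the scoring step may be sketched as follows. The exact combining function appears in an equation image that is not reproduced in this text, so the sketch assumes that the pairwise score $S_{ij}$ is the negated weighted sum of the three differences, so that a higher score means a more similar county. The function names, the default weights, and the sample values are hypothetical.

```python
from statistics import mean


def pair_similarity(dist, dif_non_sdoh, dif_health, w1=1.0, w2=1.0, w3=1.0):
    """Pairwise score S_ij between county i (inside) and county j (outside).

    ASSUMED form: negated weighted sum of the three differences, so that
    smaller differences give a higher (more similar) score.
    """
    return -(w1 * dist + w2 * dif_non_sdoh + w3 * dif_health)


def overall_similarity(pairs, weights=(1.0, 1.0, 1.0)):
    """Overall score S_j: the average of S_ij over the N counties i in the region of interest."""
    return mean(pair_similarity(d, n, h, *weights) for d, n, h in pairs)


# Hypothetical input: for each outside county j, one tuple
# (dist_ij, dif_ij(non-SDoH features), dif_ij(health outcome))
# per county i in the region of interest (here N = 2).
candidates = {
    "county_A": [(0.5, 0.25, 0.25), (1.5, 0.25, 0.25)],
    "county_B": [(2.0, 1.0, 0.5), (3.0, 1.0, 0.5)],
}

scores = {j: overall_similarity(pairs) for j, pairs in candidates.items()}

# Counties ordered from most to least similar to the region of interest.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sketch the inputs are assumed to be pre-scaled to comparable ranges; in practice the three quantities would likely need normalization before being weighted.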
[0046] Next, the hierarchical local tool 100 uses the local training set optimization module 115 to determine an optimal training set for training the hierarchical machine learning model. FIG. 3 illustrates the training process used by the local training set optimization module. In FIG. 3, the map 305 shows the region of interest 310 and potential similar regions 312, 314, and 316 to be used for training. The local training set optimization module 115 uses the similarity scores for each county outside of the region of interest to find an optimal training set to use in training the hierarchical machine learning model. The local training set optimization module 115 may first use an initial value M for the training set size and then select the M counties outside the region of interest having the highest similarity scores. For example, counties in regions 312, 314, and 316 may have the highest similarity scores to region of interest 310. Then the local training set optimization module 115 performs regularized model fitting 325 (e.g., Lasso regression, but other methods may be used as well) for both model fitting and feature selection (with the number of features decided by cross-validation within the training set only), resulting in sub health scores for each different SDoH category. Further, the hierarchical machine learning model uses a series of weights $\omega_1, \omega_2, \ldots, \omega_k$, where each weight is applied to the sub health score of one SDoH category when calculating the SDoH index (this will be explained further below). Using a test set of data for the region of interest, the mean squared error (MSE) is determined for the current training set size and score weights. Other error measures may be used as appropriate. Because the prediction is at the county level, data from the region of interest may be used as the test set to get an unbiased evaluation. The local training set optimization module 115 seeks to minimize the test MSE 330.
In order to minimize the test MSE, a simulation based optimization 320 may be used to change the training set size M and the weights ($w_1$, $w_2$, $w_3$) for the SDoH index calculation and then repeat training of the hierarchical machine learning model. The simulation based optimization 320 may include successively selecting more or fewer areas outside the area of interest with the highest similarity scores. Heuristic methods as well as other methods may be used to speed up the simulation process to find a solution with the minimum MSE. Using SDoH features from the training set identified by the local training set optimization module 115, the feature selection, the feature coefficients, and the series of weights $\omega_1, \omega_2, \ldots, \omega_k$ are fitted to define the hierarchical machine learning model that minimizes the MSE for a region of interest.
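The loop described above may be sketched, under simplifying assumptions, as follows. The sketch replaces the simulation based optimization with an exhaustive search over a few candidate values of M, fits a single Lasso model by plain coordinate descent without an intercept, and leaves out cross-validation and the tuning of the similarity weights. All function names and data values are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def fit_lasso(X, y, alpha=0.01, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + alpha*||b||_1 by coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return beta


def predict(X, beta):
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def optimize_training_set(candidates, X_test, y_test, sizes, alpha=0.01):
    """Try each training set size M; keep the model with the lowest test MSE.

    candidates: (similarity score, feature vector, health outcome) for each
    county outside the region of interest.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    best = None
    for m in sizes:
        top = ranked[:m]  # the M most similar outside counties
        beta = fit_lasso([c[1] for c in top], [c[2] for c in top], alpha)
        err = mse(y_test, predict(X_test, beta))
        if best is None or err < best[0]:
            best = (err, m, beta)
    return best


# Hypothetical data: the three most similar counties behave like the region
# of interest; the two least similar ones follow a different relationship.
candidates = [
    (0.9, [1.0, 0.0], 2.0),
    (0.8, [2.0, 1.0], 4.0),
    (0.7, [3.0, 1.0], 6.0),
    (0.2, [1.0, 2.0], 8.0),
    (0.1, [2.0, 3.0], 13.0),
]
X_test = [[1.5, 0.5], [2.5, 1.0]]  # counties inside the region of interest
y_test = [3.0, 5.0]

best_err, best_m, best_beta = optimize_training_set(candidates, X_test, y_test, sizes=[2, 3, 5])
```

With these made-up values, adding the two dissimilar counties degrades the fit on the region of interest, which is the effect the training set size search is meant to detect.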
[0047] FIG. 4 illustrates a block diagram of the hierarchical machine learning module. The hierarchical machine learning module 120 implements the hierarchical machine learning model defined by the local training set optimization module 115. Specifically, county data related to SDoH features may be found in a public/private database (e.g., the ACS database) 400. This data is grouped into multiple SDoH categories (one category may include multiple SDoH features) and used in the training process, along with health status data 410, to train 115 a hierarchical machine learning model for each SDoH category, respectively. During this training process, the values of the feature coefficients 425 used by the model, the series of weights $\omega_1, \omega_2, \ldots, \omega_k$, and the specific features to be used are fitted and recorded. The weights $\omega_1, \omega_2, \ldots, \omega_k$ for each category can be determined by normalizing the explained deviance ($R^2$) fitted for each category: $\omega_i = R_i^2 / \sum_{j=1}^{k} R_j^2$ (so that $\sum_{i=1}^{k} \omega_i = 1$). In this example, the features were grouped into five categories, which include neighborhood 401, economic status 402, education status 403, social context 404, and health and health care status 405. In this example, the hierarchical machine learning model was trained at the county level, but the hierarchical machine learning model will be used to make predictions at the ZIP code or even the block level. In order to do that, SDoH data for the features may be extracted from the ACS data 415 at the desired geographic level (i.e., ZIP code or block), and then this feature data is fed into the hierarchical machine learning model and the feature coefficients 425 are used to calculate a sub health score 431-435 for each of the categories. These sub health scores 431-435 are then multiplied by their associated weights $\omega_1, \omega_2, \ldots, \omega_5$ 427 to produce weighted health scores 441-445. The weighted health scores 441-445 are then summed to produce the SDoH index 450.
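As a minimal sketch of the final calculation, assuming that each category model is linear with no intercept and that the category weights are the per-category explained deviance values normalized to sum to one, the SDoH index for a single ZIP code may be computed as follows. Three categories are used instead of five for brevity; the feature names, coefficients, and $R^2$ values are hypothetical.

```python
def category_weights(r2_by_category):
    """Normalize per-category R^2 values so that the weights sum to 1."""
    total = sum(r2_by_category.values())
    return {cat: r2 / total for cat, r2 in r2_by_category.items()}


def sub_health_score(features, coefficients):
    """Linear sub health score for one category (ASSUMED: no intercept)."""
    return sum(coefficients[name] * value for name, value in features.items())


def sdoh_index(features_by_category, coef_by_category, weights):
    """Weighted sum of the category sub health scores."""
    return sum(
        weights[cat] * sub_health_score(features_by_category[cat], coef_by_category[cat])
        for cat in weights
    )


# Hypothetical values fitted at the county level.
r2 = {"economic": 0.5, "education": 0.25, "neighborhood": 0.25}
coefs = {
    "economic": {"median_income": 2.0},
    "education": {"college_rate": 4.0},
    "neighborhood": {"vacancy_rate": -8.0},
}

# Hypothetical (already scaled) features for one ZIP code.
zip_features = {
    "economic": {"median_income": 0.5},
    "education": {"college_rate": 0.5},
    "neighborhood": {"vacancy_rate": 0.25},
}

weights = category_weights(r2)
index = sdoh_index(zip_features, coefs, weights)
```

The coefficients and weights come from county-level training, while the feature values passed in are taken at the finer geographic level; this only works if the same features are defined consistently at both levels.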
[0048] The embodiments described herein use the example of training at a county level and then making predictions at the ZIP code or block level. These are only meant to be examples, and the training and prediction geographic levels may differ from these examples depending upon the specific application and the data available.
[0049] The embodiments of the hierarchical local tool described herein solve various technological problems. Often a user will want to determine a SDoH index for a specific geographic area, for example, for all of the ZIP codes in a region of interest. The challenge is that data relating to health outcomes may only be available at the county level, so the embodiments of the hierarchical local tool described herein present a solution where a machine learning model for calculating the SDoH index is trained using county level data, but then may be used to make predictions at the ZIP code level. This is accomplished by finding an optimum training set that predicts outcomes in the region of interest using data from similar counties outside the region of interest.
[0050] The embodiments described herein may be implemented as software running on a processor with an associated memory and storage. The processor may be any hardware device capable of executing instructions stored in memory or storage or otherwise processing data. As such, the processor may include a microprocessor, field programmable gate array (FPGA), application-specific integrated circuit (ASIC), graphics processing unit (GPU), specialized neural network processor, cloud computing system, or other similar device.
[0051] The memory may include various memories such as, for example, L1, L2, or L3 cache or system memory. As such, the memory may include static random-access memory (SRAM), dynamic RAM (DRAM), flash memory, read only memory (ROM), or other similar memory devices.
[0052] The storage may include one or more machine-readable storage media such as read-only memory (ROM), random-access memory (RAM), magnetic disk storage media, optical storage media, flash-memory devices, or similar storage media. In various embodiments, the storage may store instructions for execution by the processor or data upon which the processor may operate. This software may implement the various embodiments described above.
[0053] Further, such embodiments may be implemented on multiprocessor computer systems, distributed computer systems, and cloud computing systems. For example, the embodiments may be implemented as software on a server, a specific computer, a cloud computing system, or another computing platform.
[0054] Any combination of specific software running on a processor to implement the embodiments of the invention constitutes a specific dedicated machine.
[0055] As used herein, the term “non-transitory machine-readable storage medium” will be understood to exclude a transitory propagation signal but to include all forms of volatile and non-volatile memory.
[0056] Although the various exemplary embodiments have been described in detail with particular reference to certain exemplary aspects thereof, it should be understood that the invention is capable of other embodiments and its details are capable of modifications in various obvious respects. As is readily apparent to those skilled in the art, variations and modifications can be effected while remaining within the spirit and scope of the invention. Accordingly, the foregoing disclosure, description, and figures are for illustrative purposes only and do not in any way limit the invention, which is defined only by the claims.

Claims

What is claimed is:
1. A method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, comprising:
receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level;
determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest;
determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of:
determining a training set of data based upon the determined area of interest similarity scores;
training the hierarchical machine learning model using the determined training set of data; and
calculating the performance metric for the trained model based upon the test data set.
2. The method of claim 1, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
3. The method of claim 1, wherein the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.
4. The method of claim 1, wherein the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.
5. The method of claim 1, wherein determining a training set of data based upon the determined area of interest similarity scores includes selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.
6. The method of claim 5, wherein the value of M changes between iterations of determining a training set of data.
7. The method of claim 1, wherein the performance parameter is mean square error on the test data set.
8. The method of claim 1, wherein training the hierarchical machine learning model uses Lasso regression.
9. The method of claim 1, wherein training the hierarchical machine learning model includes feature selection.
10. The method of claim 1, wherein the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.
11. The method of claim 1, wherein training the hierarchical machine learning model includes learning the weights used in the weighted sum.
12. The method of claim 1, further comprising calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
13. A non-transitory machine-readable storage medium encoded with instructions for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, comprising:
instructions for receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level;
instructions for determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest;
instructions for determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the instructions for:
determining a training set of data based upon the determined area of interest similarity scores;
training the hierarchical machine learning model using the determined training set of data; and
calculating the performance metric for the trained model based upon the test data set.
14. The non-transitory machine-readable storage medium of claim 13, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
15. The non-transitory machine-readable storage medium of claim 13, wherein the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.
16. The non-transitory machine-readable storage medium of claim 13, wherein the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non- SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.
17. The non-transitory machine-readable storage medium of claim 13, wherein instructions for determining a training set of data based upon the determined area of interest similarity scores include instructions for selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.
18. The non-transitory machine-readable storage medium of claim 17, wherein the value of M changes between iterations of determining a training set of data.
19. The non-transitory machine-readable storage medium of claim 13, wherein the performance parameter is mean square error on the test data set.
20. The non-transitory machine-readable storage medium of claim 13, wherein instructions training the hierarchical machine learning model uses Lasso regression.
21. The non-transitory machine-readable storage medium of claim 13, wherein instructions training the hierarchical machine learning model includes feature selection.
22. The non-transitory machine-readable storage medium of claim 13, wherein the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.
23. The non-transitory machine-readable storage medium of claim 13, wherein instructions for training the hierarchical machine learning model includes instructions for learning the weights used in the weighted sum.
24. The non-transitory machine-readable storage medium of claim 13, further comprising instructions for calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.
PCT/EP2019/083934 2018-12-10 2019-12-06 Hierarchical local model for social determinants of health index prediction WO2020120301A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862777488P 2018-12-10 2018-12-10
US62/777,488 2018-12-10

Publications (1)

Publication Number Publication Date
WO2020120301A1 (en) 2020-06-18

Family

ID=68835221

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/083934 WO2020120301A1 (en) 2018-12-10 2019-12-06 Hierarchical local model for social determinants of health index prediction

Country Status (1)

Country Link
WO (1) WO2020120301A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160188824A1 (en) * 2013-07-31 2016-06-30 Koninklijke Philips N.V. Healthcare decision support system for tailoring patient care
US20180101617A1 (en) * 2016-10-12 2018-04-12 Salesforce.Com, Inc. Ranking Search Results using Machine Learning Based Models
WO2018120426A1 (en) * 2016-12-29 2018-07-05 平安科技(深圳)有限公司 Personal health status evaluation method, apparatus and device based on location service, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19817253

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19817253

Country of ref document: EP

Kind code of ref document: A1