WO2020120301A1

WO2020120301A1 - Hierarchical local model for social determinants of health index prediction

Info

Publication number: WO2020120301A1
Application number: PCT/EP2019/083934
Authority: WO
Inventors: Jin Liu; Eran Simhon
Original assignee: Koninklijke Philips N.V.
Priority date: 2018-12-10
Filing date: 2019-12-06
Publication date: 2020-06-18

Abstract

A method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test set.

Description

HIERARCHICAL LOCAL MODEL FOR SOCIAL DETERMINANTS OF HEALTH INDEX PREDICTION

TECHNICAL FIELD

[0001] Various exemplary embodiments disclosed herein relate generally to a hierarchical local model for social determinants of health index prediction.

BACKGROUND

[0002] Social determinants of health (SDoH) have been widely recognized as an important factor for health outcomes. The community level SDoH index may provide healthcare organizations an overview of health status and contributing social factors of the region that they are serving. In the past a ZIP Code-level SDoH index was developed using 3rd party commercial data, and this application was limited by purchasing agreements and limited geographic regions. Free public survey/statistical datasets are a good source for SDoH information, but they are limited by the health outcomes which are only available at the county or state level.

SUMMARY

[0003] A summary of various exemplary embodiments is presented below. Some simplifications and omissions may be made in the following summary, which is intended to highlight and introduce some aspects of the various exemplary embodiments, but not to limit the scope of the invention. Detailed descriptions of an exemplary embodiment adequate to allow those of ordinary skill in the art to make and use the inventive concepts will follow in later sections.

[0004] Various embodiments relate to a method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test data set.

[0005] Various embodiments are described, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

[0006] Various embodiments are described, further including the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.

[0007] Various embodiments are described, further including the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.

[0008] Various embodiments are described, further including determining a training set of data based upon the determined area of interest similarity scores includes selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.

[0009] Various embodiments are described, further including the value of M changes between iterations of determining a training set of data.

[0010] Various embodiments are described, further including the performance parameter is mean square error on the test data set.

[0011] Various embodiments are described, further including training the hierarchical machine learning model uses Lasso regression.

[0012] Various embodiments are described, further including training the hierarchical machine learning model includes feature selection.

[0013] Various embodiments are described, further including the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.

[0014] Various embodiments are described, further including training the hierarchical machine learning model includes learning the weights used in the weighted sum.

[0015] Various embodiments are described, further including calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

[0016] Further various embodiments relate to a non-transitory machine-readable storage medium encoded with instructions for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, including: instructions for receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level; instructions for determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest; instructions for determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the instructions for: determining a training set of data based upon the determined area of interest similarity scores; training the hierarchical machine learning model using the determined training set of data; and calculating the performance metric for the trained model based upon the test data set.

[0017] Various embodiments are described, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

[0018] Various embodiments are described, wherein the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.

[0019] Various embodiments are described, wherein the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.

[0020] Various embodiments are described, wherein instructions for determining a training set of data based upon the determined area of interest similarity scores includes instructions for selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.

[0021] Various embodiments are described, wherein the value of M changes between iterations of determining a training set of data.

[0022] Various embodiments are described, wherein the performance parameter is mean square error on the test data set.

[0023] Various embodiments are described, wherein instructions for training the hierarchical machine learning model uses Lasso regression.

[0024] Various embodiments are described, wherein instructions for training the hierarchical machine learning model includes feature selection.

[0025] Various embodiments are described, wherein the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.

[0026] Various embodiments are described, wherein instructions for training the hierarchical machine learning model includes instructions for learning the weights used in the weighted sum.

[0027] Various embodiments are described, further including instructions for calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

BRIEF DESCRIPTION OF THE DRAWINGS

[0028] In order to better understand various exemplary embodiments, reference is made to the accompanying drawings, wherein:

[0029] FIG. 1 illustrates a block diagram of a hierarchical local tool for producing a SDoH index;

[0030] FIG. 2 illustrates a user interface for use in the hierarchical local tool;

[0031] FIG. 3 illustrates the training process used by the local training set optimization module; and

[0032] FIG. 4 illustrates a block diagram of the hierarchical machine learning module.

[0033] To facilitate understanding, identical reference numerals have been used to designate elements having substantially the same or similar structure and/or substantially the same or similar function.

DETAILED DESCRIPTION

[0034] The description and drawings illustrate the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within its scope. Furthermore, all examples recited herein are principally intended expressly to be for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art and are to be construed as being without limitation to such specifically recited examples and conditions. Additionally, the term “or,” as used herein, refers to a non-exclusive or (i.e., and/or), unless otherwise indicated (e.g., “or else” or “or in the alternative”). Also, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.

[0035] Socioeconomic and behavioral factors have been widely shown to affect health outcomes. In recent years, the study of socioeconomic and behavioral factors has been a popular research area, and many hospitals are also eager to apply socioeconomic and behavioral factors in their workflow. However, collecting social determinants of health (SDoH) information is not only time-consuming but also often unrealistic for healthcare organizations considering that many populations are outside of the healthcare organization. Previous work in this area developed a method to produce a ZIP code level SDoH index using commercial data to show and explain the overall health status within different ZIP codes and the contributing SDoH factors. However, this method is limited to data that have the same geographic level (e.g., ZIP Code-level) and the commercial statistical data’s usage is controlled by a 3rd party.

[0036] In fact, many free public survey or statistical data include SDoH information such as income, education, and housing at different geographic levels. Health outcome information such as life expectancy and mortality rate may also be obtained from public databases, but health outcome information is often only available at the county or state level.

[0037] In order to use these public datasets, one of the challenges is that these health outcome datasets are not available at the ZIP code level. For example, public datasets such as the American Community Survey (ACS) have multiple geographic levels including state, county, ZIP Code Tabulation Area (ZCTA) and block, while health outcome data are often aggregated at the county or state level. Considering the large size of many counties, it would be more informative to provide SDoH information at a more granular level such as at the ZIP code-level or even at the block-level.

[0038] Another challenge is how to train a SDoH model that best fits the local community needs. For example, renting versus owning a house may be a good differentiator in rural regions providing a good reflection of economic stability, while it may make less of a difference in urban cities such as New York City, where even affluent people rent rather than own a home. Hence, a healthcare organization would prefer a model that is based on data for the area it serves. However, the number of counties in that area may be too small to adequately develop such a model. In this case, creating a reliable model would include using data from other counties with different socio-economic challenges. In embodiments described herein, this problem is solved by finding the optimal set of counties that should be used for developing the model.

[0039] The embodiments described herein include a hierarchical machine learning model with optimized local training set selection to integrate multilevel public datasets into a predictive SDoH model that produces a SDoH index. The hierarchical machine learning model is trained at the county level and predicts the SDoH index at the ZIP code level or some other geographic level defined by the users. The ZIP code level is used herein as an example to make the description concise without loss of generality; the county level for training is likewise used as an example to represent the same geographic level as the health outcome variable. Other geographic levels may be used as well.

[0040] FIG. 1 illustrates a block diagram of a hierarchical local tool for producing a SDoH index. The hierarchical local tool 100 includes a user interface 105, a region scoring module 110, a local training set optimization module 115, and a hierarchical machine learning module 120.

[0041] FIG. 2 illustrates a user interface for use in the hierarchical local tool. The user interface 105 may include a map 225 that shows the geographic area to be used with the hierarchical local tool 100. In this specific example, the United States is shown, but other countries, regions, continents, etc. may also be shown by this map, including the whole world. The specific map displayed may be set by the location where the hierarchical local tool 100 is being used. For example, if the hierarchical local tool 100 is being used in Denmark, the map may just be for Denmark, all of Scandinavia, or all of Europe. The map 225 may be zoomed in and out as well as panned to get to a specific area of interest. The user interface also includes a region of interest pane 230 that allows a user to either determine a region of interest by drawing and/or selection from the map 225 or by importing a file that specifies the region of interest. In other embodiments, a drop down menu or other interface elements may be used to select the region of interest.

[0042] The user interface may also include a geographic level pane 235 where the user may select the geographic level for SDoH index. In the geographic level pane 235 as shown, there are check boxes for ZIP code, Block, and County levels shown. A user may click a specific check box to select the desired geographic level for the SDoH index. Also, a drop down menu or other interface elements may be used to select the geographic level for the SDoH index. Once the user has selected a region of interest and a geographic level, the user interface 105 may display a selection information pane 240 and a local map 245. The selection information pane 240 may display various information related to the region of interest such as the number of people in the region and the resulting SDoH index for the region once it is calculated. The local map 245 may show the boundaries of the different geographic areas as defined by the selected geographic level, for example, by zip code, block or county. The map 245 may also be color coded to show via specific colors the SDoH index values for each geographic area in the map 245.

[0043] The user interface 105 may also include a feature plot 250 that plots the feature data related to the SDoH index value that is calculated. This provides a user additional insight to see which features are contributing to the SDoH index value.

[0044] The region scoring module 110 receives input data from the user interface defining the region of interest. The region scoring module 110 then takes each region outside of the region of interest and calculates a similarity score that indicates how similar each region outside the region of interest is to the region of interest. These similarity scores will then be used to define training sets to use to train the hierarchical machine learning model.

[0045] The region scoring module 110 produces a weighted average similarity score ($S_j$) for each remaining region $j$ by comparing it with each subregion $i$ in the region of interest based upon: its distance to each subregion of the region of interest ($dist_{i\_j}$); region similarity (non-SDoH features such as rural/urban, average age, gender and race); and health outcome difference. The following function may be used, where the weights (w1, w2 and w3) applied to these three terms can be predetermined by the user or be optimized in the local training set optimization module 115.

$$S_j = \frac{1}{N} \sum_{i=1}^{N} S_j^i$$

where $S_j$ is the overall similarity score for region $j$, $S_j^i$ is the similarity score between county $i$ and county $j$, $N$ is the total number of counties in the region of interest, $dist_{i\_j}$ is the distance between county $i$ and county $j$, $dif_{i\_j}(\text{non-SDoH features})$ is the difference in non-SDoH features between county $i$ and county $j$, and $dif_{i\_j}(\text{health outcome})$ is the difference in health outcome between county $j$ and county $i$. These three features have been found to provide an indication of the similarity between two counties, and other additional features can also be added to calculate the similarity score. The final value $S_j$ is an average of the similarity between each county $i$ in the region of interest and the county $j$ outside the region of interest.
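The averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not give the functional form of the pairwise score, so the sketch assumes each of the three differences is mapped to a similarity as 1 / (1 + difference) and then combined with the weights w1, w2 and w3. The field names (`x`, `y`, `non_sdoh`, `outcome`) and the sample counties are hypothetical.

```python
from math import hypot


def pair_similarity(inside, outside, w1=1.0, w2=1.0, w3=1.0):
    """Similarity between county i (inside the region) and county j (outside).

    ASSUMPTION: each difference is converted to a similarity as
    1 / (1 + difference); the source text does not specify this form.
    """
    dist = hypot(inside["x"] - outside["x"], inside["y"] - outside["y"])
    dif_non_sdoh = sum(
        abs(a - b) for a, b in zip(inside["non_sdoh"], outside["non_sdoh"])
    )
    dif_outcome = abs(inside["outcome"] - outside["outcome"])
    return (
        w1 / (1.0 + dist)
        + w2 / (1.0 + dif_non_sdoh)
        + w3 / (1.0 + dif_outcome)
    )


def area_similarity(region, outside, **weights):
    """S_j: average of the pairwise scores over the N counties in the region."""
    return sum(pair_similarity(c, outside, **weights) for c in region) / len(region)


# Hypothetical counties: coordinates, two non-SDoH features, one health outcome.
region = [
    {"x": 0.0, "y": 0.0, "non_sdoh": [0.2, 40.0], "outcome": 78.0},
    {"x": 1.0, "y": 0.0, "non_sdoh": [0.3, 42.0], "outcome": 77.0},
]
near = {"x": 0.5, "y": 0.5, "non_sdoh": [0.25, 41.0], "outcome": 77.5}
far = {"x": 30.0, "y": 40.0, "non_sdoh": [0.9, 30.0], "outcome": 70.0}

scores = {
    "near": area_similarity(region, near),
    "far": area_similarity(region, far),
}
```

With these inputs the nearby, demographically similar county receives the higher score, so it would be ranked ahead of the distant one when candidate training counties are selected.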

[0046] Next, the hierarchical local tool 100 uses the local training set optimization module 115 to determine an optimal training set for training the hierarchical machine learning model. FIG. 3 illustrates the training process used by the local training set optimization module. In FIG. 3, the map 305 shows the region of interest 310 and potential similar regions 312, 314, and 316 to be used for training. The local training set optimization module 115 uses the similarity scores for each county outside of the region of interest to find an optimal training set to use in training the hierarchical machine learning module. The local training set optimization module 115 may first use an initial value M for the training set size and then select the M counties outside the region of interest having the highest similarity scores. For example, counties in regions 312, 314, and 316 may have the highest similarity scores to region of interest 310. Then the local training set optimization module 115 performs model fitting with regularization 325 (e.g., Lasso regression, but other methods may be used as well) for both model fitting and feature selection (number of features decided by cross-validation in the training set only), resulting in sub health scores for each different SDoH category. Further, the hierarchical machine learning model uses a series of weights $\omega_1, \omega_2, \ldots, \omega_k$, where each weight weights a sub health score for each different SDoH category used to calculate the SDoH index (this will be explained further below). Using a test set of data for the region of interest, the mean squared error (MSE) is determined for the current training set size and score weights. Other error measures may be used as appropriate. Because the prediction is at the county level, data from the region of interest may be used as the test set to get an unbiased evaluation. The local training set optimization module 115 seeks to minimize the test MSE 330. In order to minimize the test MSE, a simulation based optimization 320 may be used to change the training set size M and the weights (w1, w2, w3) for the SDoH index calculation and then repeat training the hierarchical machine learning model. The simulation based optimization 320 may include successively selecting more or fewer areas outside the area of interest with the highest similarity scores. Heuristic methods as well as other methods may be used to speed up the simulation process to find a solution with the minimum MSE. Using SDoH features from the training set identified by the local training set optimization module 115, the feature selection, feature coefficients, and series of weights $\omega_1, \omega_2, \ldots, \omega_k$ are fitted to define the hierarchical machine learning model that minimizes the MSE for a region of interest.
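The selection loop described above can be sketched as follows. This is a minimal illustration, not the implementation described in the source. It makes three simplifying assumptions: a plain search over a few candidate values of M stands in for the simulation based optimization, the Lasso penalty `alpha` is fixed rather than chosen by cross-validation, and the model has no intercept. The function names and the toy data are hypothetical, and the small coordinate-descent solver is written out only to keep the example dependency-free.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used by coordinate-descent Lasso."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha=0.05, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + alpha*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return beta


def mse(X, y, beta):
    """Mean squared error of the linear predictions X b against y."""
    preds = [sum(a * b for a, b in zip(row, beta)) for row in X]
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


def optimise_training_set(candidates, test_X, test_y, sizes, alpha=0.05):
    """Try each training set size M and keep the one with the lowest test MSE.

    candidates: (similarity score, feature vector, health outcome) per county
    outside the region of interest.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    best = None
    for m in sizes:
        top = ranked[:m]  # the M most similar counties
        beta = lasso_fit([c[1] for c in top], [c[2] for c in top], alpha)
        err = mse(test_X, test_y, beta)
        if best is None or err < best[0]:
            best = (err, m, beta)
    return best


# Toy data (hypothetical): the four most similar counties follow outcome = 2 * x1,
# the two least similar counties follow a different relationship.
candidates = [
    (0.9, [1.0, 0.0], 2.0),
    (0.8, [2.0, 1.0], 4.0),
    (0.7, [3.0, 0.5], 6.0),
    (0.6, [1.5, 1.0], 3.0),
    (0.2, [1.0, 1.0], -2.0),
    (0.1, [2.0, 0.0], -4.0),
]
# Counties inside the region of interest serve as the test set.
test_X = [[1.0, 0.5], [2.5, 0.2]]
test_y = [2.0, 5.0]

best_err, best_m, best_beta = optimise_training_set(
    candidates, test_X, test_y, sizes=[2, 4, 6]
)
```

In this toy setup the search settles on the four most similar counties: adding the two dissimilar counties raises the test error sharply, and the L1 penalty drives the coefficient of the uninformative second feature to zero, which is the feature selection effect the text attributes to the regularized fit.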

[0047] FIG. 4 illustrates a block diagram of the hierarchical machine learning module. The hierarchical machine learning module 120 implements the hierarchical machine learning model defined by the local training set optimization module 115. Specifically, county data related to SDoH features may be found in the public/private data base (e.g., ACS data base) 400. This data is grouped into multiple SDoH categories (one category may include multiple SDoH features) and used in the training process along with health status data 410 to train 115 the hierarchical machine learning model for each SDoH category, respectively. During this training process the values for the feature coefficients 425 used by the model as well as the series of weights $\omega_1, \omega_2, \ldots, \omega_k$ for each feature were fitted and recorded as well as the specific features to be used. The weights $\omega_1, \omega_2, \ldots, \omega_k$ for each category can be determined by normalizing the explained deviance ($R^2$) fitted for each category:

$$\omega_i = \frac{R_i^2}{\sum_{j=1}^{k} R_j^2}$$

(so that $\sum_{i=1}^{k} \omega_i = 1$). In this example, the features were grouped into five categories which include neighborhood 401, economic status 402, education status 403, social context 404, and health and health care status 405. In this example, the hierarchical machine learning model was trained at the county level, but the hierarchical machine learning model will be used to make predictions at the zip code or even the block level. In order to do that, SDoH data for the features may be extracted from the ACS data 415 at the desired geographic level (i.e., zip code or block), and then this feature data is fed into the hierarchical machine learning model and the feature coefficients 425 are used to calculate a sub health score 431-435 for each of the categories. These sub health scores 431-435 are then multiplied by their associated weights $\omega_1, \omega_2, \ldots, \omega_5$ 427 to produce weighted health scores 441-445. The weighted health scores 441-445 are then summed to produce the SDoH index 450.
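The weighting and summation steps described above can be illustrated with a short Python sketch. It assumes each sub health score is a linear combination of the category's features and the county-level coefficients. The R² values, coefficients, and ZIP code level feature values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def category_weights(r2_by_category):
    """Normalise per-category R^2 values so that the weights sum to 1."""
    total = sum(r2_by_category.values())
    if total <= 0:
        raise ValueError("at least one category must have a positive R^2")
    return {cat: r2 / total for cat, r2 in r2_by_category.items()}


def sdoh_index(features, coefficients, weights):
    """Return the SDoH index and the per-category sub health scores.

    ASSUMPTION: each sub health score is the dot product of the category's
    feature values with the coefficients fitted at the county level.
    """
    sub_scores = {
        cat: sum(c * x for c, x in zip(coefficients[cat], values))
        for cat, values in features.items()
    }
    index = sum(weights[cat] * score for cat, score in sub_scores.items())
    return index, sub_scores


# Hypothetical explained deviance per category from county-level training.
r2 = {
    "neighborhood": 0.2,
    "economic status": 0.6,
    "education status": 0.4,
    "social context": 0.3,
    "health and health care status": 0.5,
}
weights = category_weights(r2)

# Hypothetical county-level coefficients, applied to ZIP code level features.
coefficients = {
    "neighborhood": [1.0, -0.5],
    "economic status": [2.0],
    "education status": [1.5],
    "social context": [1.0, 1.0],
    "health and health care status": [0.8],
}
zip_features = {
    "neighborhood": [0.4, 0.2],
    "economic status": [0.5],
    "education status": [0.4],
    "social context": [0.1, 0.3],
    "health and health care status": [0.5],
}

index, sub_scores = sdoh_index(zip_features, coefficients, weights)
```

Because the coefficients and weights are fixed once training is complete, the same function can be applied to feature values extracted at any finer geographic level, which is how a model trained on counties can score ZIP codes or blocks.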

[0048] The embodiments described herein use the example of training at a county level and then making predictions at the zip code or block level. These are only meant to be examples, and the scope of the training and prediction geographic levels may differ from these examples depending upon the specific application and the data available.

[0049] The embodiments of the hierarchical local tool described herein solve various technological problems. Often a user will want to determine a SDoH index for a specific geographic area, for example, for all of the zip codes in a region of interest. The challenge is that data relating to health outcomes may only be available at the county level, so the embodiments of the hierarchical local tool described herein present a solution where a machine learning model for calculating the SDoH index is trained using county level data, but then may be used to make predictions at the zip code level. This is accomplished by finding an optimum training set that predicts outcomes in the region of interest using data from similar counties outside the region of interest.

[0050] The embodiments described herein may be implemented as software running on a processor with an associated memory and storage. The processor may be any hardware device capable of executing instructions stored in memory or storage or otherwise processing data. As such, the processor may include a microprocessor, field programmable gate array (FPGA), application-specific integrated circuit (ASIC), graphics processing units (GPU), specialized neural network processors, cloud computing systems, or other similar devices.

[0051] The memory may include various memories such as, for example, L1, L2, or L3 cache or system memory. As such, the memory may include static random-access memory (SRAM), dynamic RAM (DRAM), flash memory, read only memory (ROM), or other similar memory devices.

[0052] The storage may include one or more machine-readable storage media such as read-only memory (ROM), random-access memory (RAM), magnetic disk storage media, optical storage media, flash-memory devices, or similar storage media. In various embodiments, the storage may store instructions for execution by the processor or data upon which the processor may operate. This software may implement the various embodiments described above.

[0053] Further, such embodiments may be implemented on multiprocessor computer systems, distributed computer systems, and cloud computing systems. For example, the embodiments may be implemented as software on a server, a specific computer, a cloud computing system, or other computing platform.

[0054] Any combination of specific software running on a processor to implement the embodiments of the invention constitutes a specific dedicated machine.

[0055] As used herein, the term “non-transitory machine-readable storage medium” will be understood to exclude a transitory propagation signal but to include all forms of volatile and non-volatile memory.

[0056] Although the various exemplary embodiments have been described in detail with particular reference to certain exemplary aspects thereof, it should be understood that the invention is capable of other embodiments and its details are capable of modifications in various obvious respects. As is readily apparent to those skilled in the art, variations and modifications can be effected while remaining within the spirit and scope of the invention. Accordingly, the foregoing disclosure, description, and figures are for illustrative purposes only and do not in any way limit the invention, which is defined only by the claims.

Claims

What is claimed is:

1. A method for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, comprising:

receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level;

determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest;

determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the steps of:

determining a training set of data based upon the determined area of interest similarity scores;

training the hierarchical machine learning model using the determined training set of data; and

calculating the performance metric for the trained model based upon the test data set.

2. The method of claim 1, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

3. The method of claim 1, wherein the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.

4. The method of claim 1, wherein the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.

5. The method of claim 1, wherein determining a training set of data based upon the determined area of interest similarity scores includes selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.

6. The method of claim 5, wherein the value of M changes between iterations of determining a training set of data.

7. The method of claim 1, wherein the performance parameter is mean square error on the test data set.

8. The method of claim 1, wherein training the hierarchical machine learning model uses Lasso regression.

9. The method of claim 1, wherein training the hierarchical machine learning model includes feature selection.

10. The method of claim 1, wherein the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.

11. The method of claim 1, wherein training the hierarchical machine learning model includes learning the weights used in the weighted sum.

12. The method of claim 1, further comprising calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

13. A non-transitory machine-readable storage medium encoded with instructions for training a hierarchical machine learning model that produces a social determinants of health (SDoH) index, comprising:

instructions for receiving a description of a geographic area of interest made up of a set of first geographic areas having a first geographic level;

instructions for determining an area of interest similarity score for each of a plurality of geographic areas having the first geographic hierarchy outside the geographic area of interest;

instructions for determining an optimal hierarchical machine learning model by minimizing a performance metric based upon the determined area of interest similarity scores for a set of SDoH features by repeating the instructions for:

training the hierarchical machine learning model using the determined training set of data; and

calculating the performance metric for the trained model based upon the test data set.

14. The non-transitory machine-readable storage medium of claim 13, wherein the optimal machine learning module is configured to produce the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.

15. The non-transitory machine-readable storage medium of claim 13, wherein the area of interest similarity score is an average of the similarity scores between each of the set of first geographic areas and a geographic area outside the geographic area of interest.

16. The non-transitory machine-readable storage medium of claim 13, wherein the area of interest similarity score is based upon the distance between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, the difference of non-SDoH features between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas, and the difference in health outcomes between the geographic area outside the geographic area of interest and the geographic areas in the set of first geographic areas.

17. The non-transitory machine-readable storage medium of claim 13, wherein instructions for determining a training set of data based upon the determined area of interest similarity scores includes instructions for selecting the M areas outside the geographic area of interest having the highest area of interest similarity scores, wherein M is an integer.

18. The non-transitory machine-readable storage medium of claim 17, wherein the value of M changes between iterations of determining a training set of data.

19. The non-transitory machine-readable storage medium of claim 13, wherein the performance parameter is mean square error on the test data set.

20. The non-transitory machine-readable storage medium of claim 13, wherein instructions for training the hierarchical machine learning model uses Lasso regression.

21. The non-transitory machine-readable storage medium of claim 13, wherein instructions for training the hierarchical machine learning model includes feature selection.

22. The non-transitory machine-readable storage medium of claim 13, wherein the hierarchical machine learning model calculates a sub health score for each of the SDoH categories and the SDoH index is a weighted sum of the sub health scores for each of the SDoH categories.

23. The non-transitory machine-readable storage medium of claim 13, wherein instructions for training the hierarchical machine learning model includes instructions for learning the weights used in the weighted sum.

24. The non-transitory machine-readable storage medium of claim 13, further comprising instructions for calculating the SDoH index based upon the SDoH features for a geographic area having a second geographic level, wherein the second geographic level is less than the first geographic level.