WO2023128780A1

WO2023128780A1 - Method for the early diagnosis of chronic diseases in a patient

Info

Publication number: WO2023128780A1
Application number: PCT/RU2021/000605
Authority: WO
Inventors: Роман Эдвардович НОВИЦКИЙ; Александр Владимирович ГУСЕВ
Original assignee: Общество с ограниченной ответственностью "К-Скай"
Priority date: 2021-12-27
Filing date: 2021-12-28
Publication date: 2023-07-06

Abstract

A method for the early diagnosis of chronic diseases in a patient, implemented by a computer containing a processor and a memory, includes the following steps of: receiving, on a computer, anonymized electronic medical record data from a medical information system; extracting medical and social features of a patient's state of health, as well as risk factor features, using natural language processing, wherein the extracted features are sent to the input of a trained suite of classifiers for assigning the features to a probability class for the presence of chronic diseases; dividing the extracted medical and social features about the patient's state of health into groups, and dividing the resulting groups into clusters according to diseases diagnosed, wherein groups are combined into clusters with common diagnoses; obtaining values for the medical and social features of the patient's state of health and for the risk factor features in each cluster; evaluating the risk of the development of complications of chronic diseases; obtaining, at the output of the suite of classifiers, the probability class for the presence of chronic diseases and an evaluation of the risk of the development of complications.

Description

METHOD FOR EARLY DIAGNOSIS OF CHRONIC DISEASES OF A PATIENT

FIELD OF TECHNOLOGY

The invention relates to the field of medicine, as well as to the field of information and communication technologies for processing medical data, in particular to a method for early diagnosis of a patient's chronic diseases, based on cluster analysis of big data.

The presented solution can be used, at least in clinical practice, by doctors and other medical professionals who are involved in the diagnosis, treatment and prevention of diseases, in predicting the onset of various medical events for a patient.

BACKGROUND OF THE INVENTION

The prior art includes RU 2 698 007 C1, published on August 21, 2019, which discloses an automatic medical decision support system for comorbidities. The system contains an interface block, an input data block in the format of an electronic medical record, a decision storage block for each of the therapeutic areas, a computing block, an input data inconsistency check block, a decision issuing block and a treatment recommendations block. The interface block is able to obtain data from a clinician, from databases with electronic medical records, or from "big data" repositories, and is configured to exchange data with the input data block, with the decision block and with the treatment recommendations block. The input data block is made in the format of an electronic medical record, can transmit the input information to the computing block, can store reference models of various diseases corresponding to different human organs and systems, and is configured to transfer data to the input data inconsistency check block. The computing block is configured to recalculate the weight coefficients assigned to the symptoms or signs for each disease of a particular organ or system of the human body, and the scores in favor of specific diseases. The interface block, the input data block, the inconsistency check block, the decision storage block, the computing block, the decision issuing block and the treatment recommendations block are able to work on calls, and/or in remote areas, and/or in emergency zones without stable access to the Internet. The decision block is able to: display information about all reference case histories, highlighting the signs or symptoms identified during the examination of the patient, or display information about those reference diseases that have signs or symptoms in common with the data obtained during the examination; compare each reference disease model with the entered patient examination data; analyze whether the examination data are sufficient to determine diseases; request additional examinations if it is impossible to determine the disease or prescribe a treatment plan; and group diseases by organs or systems of the human body. The treatment recommendations block is configured to receive data from the input data block and the decision issuing block and to output the optimal treatment plan to the interface block.

The disadvantage of the known solution is that the diagnosis of diseases is based on the symptoms of the disease. The proposed solution, by contrast, uses only patient health information and risk factors extracted from the patient's electronic health record. A further difference is that the proposed solution diagnoses chronic diseases by means of an ensemble of classifiers.

SUMMARY OF THE INVENTION

The technical problem to be solved by the claimed solution is the need to develop a method for early diagnosis of a patient's chronic diseases, as well as to create a set of classifiers for determining chronic diseases, which is characterized in the independent claim. Additional embodiments of the present invention are presented in the dependent claims.

The technical result consists in increasing the accuracy of early diagnosis of a patient's chronic diseases and of determining the risk of developing complications of chronic diseases by using a set of classifiers, since the data obtained will be used to check for all chronic diseases for which a classifier is trained, and to assess the risks of developing chronic diseases in the patient. An additional technical result is an increase in the performance of the server infrastructure on which the method is implemented (i.e., the described method makes it possible to process data and obtain results in less time), thereby reducing the load on the central processors of the computing devices / servers by reducing the number of requests processed.

The claimed result is achieved by a method for early diagnosis of a patient's chronic diseases, running on a computing device containing a processor and a memory that stores instructions executed by the processor, and containing the following steps: depersonalized medical data of an electronic medical record are received from a medical information system on the computing device; on the computing device, medical and social signs of the patient's health status, as well as signs of risk factors, are extracted by means of natural language processing; the extracted signs are input to a trained set of classifiers to assign the signs to a probability class of the presence of chronic diseases, where training the classifier set consists in training at least one classifier for at least one chronic disease and contains the following steps: the extracted medical and social signs of the patient's health status are divided into groups; the resulting groups are divided into clusters according to the diseases diagnosed in them, according to the data of the electronic medical record and the medical and social signs of the patient's health status, and the groups are combined into clusters with common diagnoses; the values of the medical and social signs of the patient's health status and of the signs of risk factors in each cluster are obtained, according to which the extracted signs will be assigned to the probability classes of the presence of chronic diseases; the risk of developing complications of chronic diseases is assessed for patients assigned to the probability class of the presence of chronic diseases according to the ICD class; at the output of the set of classifiers, a probability class for the presence of chronic diseases, according to the ICD class, and an assessment of the risk of developing complications of chronic diseases are obtained.

In a particular implementation of the proposed solution, medical and social signs of the patient's health include: gender, age, social status, region of residence, physiological parameters, laboratory parameters.

In another particular embodiment of the proposed solution, the signs of the risk factor characterize the signs that adversely affect the health of the patient.

In another particular implementation of the proposed solution, the grouping into clusters is carried out using the k-means method and / or the c-means method and / or layered clustering and / or the selection of connected components and / or the minimum spanning tree method.

DESCRIPTION OF THE DRAWINGS

The implementation of the invention will be described hereinafter in accordance with the accompanying drawings, which are presented to explain the essence of the invention and in no way limit the scope of the invention. The following drawings are attached to the application:

Fig. 1 illustrates an example of diagnosing cardiovascular disease.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description of the implementation of the invention, numerous implementation details are provided to provide a clear understanding of the present invention. However, it will be apparent to one skilled in the art how the present invention can be used, both with and without these implementation details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to unnecessarily obscure the features of the present invention.

Furthermore, it will be clear from the foregoing that the invention is not limited to the present implementation. Numerous possible modifications, changes, variations and substitutions that retain the spirit and form of the present invention will be apparent to those skilled in the subject area.

In the medical information system used in a medical organization, various data about patients are accumulated in the course of work, including general and medical information: height, weight, blood pressure numbers, etc., registered diseases and visits to medical organizations, examination protocols, data from medical examinations, surgical interventions, etc.

The proposed solution is integrated with the medical information system of a medical organization through open API systems.

Upon request, a package of depersonalized medical data from the patient's electronic medical record is automatically generated in the medical information system and sent to the computing device.

Electronic medical record (electronic patient passport, EMR; also electronic health record, EHR) - a database containing information about the patient: the patient's physiological parameters, anamnesis, medical histories and their treatment (methods and course of treatment, prescribed drugs, etc.), which is created in a medical institution. The electronic medical record contains patient records that include at least the following data: the date the record was added, codes for diagnoses, symptoms, procedures and drugs, a textual description of the medical history in natural language, biomedical images associated with the medical history, and the results of the patient's examinations and tests.

On the computing device, natural language processing methods are used to extract medical and social signs of the patient's health status, as well as signs of risk factors, from the received package of depersonalized medical data from the patient's electronic medical record.

Medical signs of the patient's health status are understood to be signs that characterize body mass index, blood pressure, pulse pressure, etc., and laboratory parameters (general and clinical blood tests, blood biochemistry, etc.).

Social signs of the patient's health status are understood to be signs that characterize gender, age, social status and region of residence.

Risk factors are signs that characterize the negative impact on the patient's health, for example, but not limited to smoking, personal and family history, physical inactivity, obesity, etc.

The extracted features are input to a trained set of classifiers to assign the features to the probability class of chronic diseases, such as cardiovascular diseases, endocrinological diseases, kidney diseases, and respiratory diseases. Each classifier from the set is trained to identify one chronic disease. Each classifier has its own set of input data, which it uses to assign features to the class of probability of having chronic diseases. If the extracted features lack even one parameter from the data set of a classifier, then this classifier is not used in the evaluation. For example, the classifier for determining chronic heart failure will not be used in the assessment if the extracted features do not include a hemoglobin level.
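The rule that a classifier is skipped when any of its required inputs is missing can be sketched as follows. This is a minimal illustration, not the implementation from the application: the registry `REQUIRED_FEATURES`, the feature names and the function `applicable_classifiers` are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Mapping, Optional

# Hypothetical registry: disease name -> features its classifier requires.
REQUIRED_FEATURES: Dict[str, FrozenSet[str]] = {
    "chronic_heart_failure": frozenset({"age", "systolic_bp", "hemoglobin"}),
    "diabetes": frozenset({"age", "bmi", "blood_glucose"}),
}


def applicable_classifiers(features: Mapping[str, Optional[float]]) -> List[str]:
    """Return the classifiers whose required inputs are all present."""
    # A feature counts as available only if it was actually extracted.
    available = {name for name, value in features.items() if value is not None}
    return sorted(
        disease
        for disease, required in REQUIRED_FEATURES.items()
        if required <= available  # subset test: nothing required is missing
    )


# The hemoglobin level was not extracted, so the heart failure classifier is skipped.
patient = {"age": 65, "bmi": 31.0, "blood_glucose": 6.4, "systolic_bp": 150}
print(applicable_classifiers(patient))
```

With the sample patient above, only the diabetes classifier remains, which mirrors the hemoglobin example in the text.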

Training a set of classifiers consists in training at least one classifier to determine at least one chronic disease (for example, determining the risk of having atherosclerosis, chronic lung disease and chronic heart failure, chronic kidney disease, diabetes). The stages of classifier training are shown below.

Classifiers are trained using libraries such as scikit-learn, TensorFlow, CatBoost, XGBoost, etc. The candidate classifiers are trained on the same data set, after which the obtained accuracy metrics and error (confusion) matrices are analyzed, which makes it possible to determine the optimal algorithm for each of the tasks. The set of analyzed accuracy metrics includes:

For classification algorithms:

- accuracy (percentage of correct answers);

- precision;

- recall (completeness);

- F-measure (harmonic mean of precision and recall);

- AUC-ROC (area under the ROC curve);

- the logistic loss function.
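The listed metrics can be computed from true labels and predicted probabilities as in the sketch below. It uses only the Python standard library so that each formula is visible; the function name `classification_metrics`, the 0.5 decision threshold and the sample data are assumptions made for illustration. In practice the corresponding functions of a library such as scikit-learn would normally be used.

```python
import math


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute the listed metrics for binary labels and predicted probabilities.

    Assumes y_true contains at least one positive and one negative label.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # AUC-ROC as the share of (positive, negative) pairs ranked correctly;
    # ties count as half a correct ranking.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    roc_auc = wins / (len(pos) * len(neg))

    # Logistic loss, with probabilities clipped away from 0 and 1.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / len(y_true)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc,
        "log_loss": log_loss,
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


metrics = classification_metrics([1, 0, 1, 1, 0, 0], [0.9, 0.2, 0.6, 0.4, 0.7, 0.1])
print(metrics)
```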

Data for disease prediction include:

• demographic data (gender, age, region of residence);

• patient history (codes for chronic noncommunicable diseases);

• medical history of the patient (frequency of visits and hospitalizations, diagnostic manipulations, etc.);

• signs extracted from patients' EHR in their dynamic interpretation - defined as functions of the trend of variability of physiological and laboratory parameters.
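The text does not specify how the "functions of the trend" of a parameter are computed. One plausible reading, shown here purely as an assumption, is the least-squares slope of a physiological or laboratory parameter over time; the function name `trend_slope` and the sample readings are hypothetical.

```python
def trend_slope(times, values):
    """Least-squares slope of values over times (units of value per unit of time)."""
    n = len(times)
    if n < 2:
        return 0.0  # a single measurement carries no trend
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    denom = sum((t - mean_t) ** 2 for t in times)
    if denom == 0:
        return 0.0  # all measurements taken at the same time
    return sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / denom


# Systolic blood pressure measured on days 0, 30, 60 and 90.
slope = trend_slope([0, 30, 60, 90], [130, 134, 138, 142])
print(slope)  # rise in mmHg per day
```

A positive slope for blood pressure, for instance, could then be passed to a classifier as one more numeric feature alongside the latest measured value.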

The implementation of this technical solution makes it possible to carry out the initial stage of identifying suspected diseases without using precise medical information. Based on the patient's belonging to one of the selected clusters, the probability of having chronic diseases is determined. The output data of the classifiers are interpreted using a confidence threshold: above the threshold, the probability of a patient having chronic diseases is defined as high; below it, as low. For example, for the representatives of the cluster “men aged 45-50 years, suffering from obesity and arterial hypertension, with a small history of visits to healthcare facilities”, the probability of having cardiovascular diseases is defined as high.

To deploy a chronic disease prediction model based on big data cluster analysis, the following minimum technical requirements must be met.

Minimum characteristics of computing elements:

• Processor: at least 6 cores with a frequency of 2.0 GHz or more with support for AVX instructions;

• RAM: at least 24 GB;

• Disk subsystem: 100 GB of free disk space.

The operating system is one of the following:

• Ubuntu 18.04 LTS;

• Ubuntu 20.04 LTS;

• Astra Linux Common Edition 2.12.29;

• CentOS 7.7.

Required software:

• docker 19.03 and above

• docker-compose 1.25.5 and higher

1. The extracted medical and social signs of the patient's health status are divided into groups using a hierarchical clustering algorithm. Groups are formed according to a limited number of characteristics: gender, age category, area of residence. Age groups are distinguished first, according to the classification of the World Health Organization and on the basis of published clinical studies, depending on the disease: for example, for cardiovascular diseases a step of 5 years is used within the 40+ age group, while for the study of gynecological syndromes the fundamental cut-off is the age of menopause. After the age groups are separated, the data are broken down by sex (men and women), and after the breakdown by sex a third feature is added: the territory of residence (according to the EHR data). As a result, the algorithm produces groups such as “men aged 40-50 years old living in the Far North” or “women aged 60-65 years old living in the Southern Federal District”.

2. The obtained groups are divided into clusters according to the diseases diagnosed in them, according to the data of the electronic medical record and the medical and social signs of the patient's health, and the groups are combined into clusters with common diagnoses. Groups are combined into clusters with a common diagnosis by one of the following algorithms: the k-means method, the c-means method, layered clustering, the selection of connected components, or the minimum spanning tree method. The clustering methods use relationship measures obtained by comparing diagnoses with each other. Accordingly, within the selected cluster, a community will be formed that is characterized by the presence of common diagnoses. For example, for the group “men 40-55 years old, residents of large cities”, a subcluster “with the presence of atherosclerosis” or “having had a heart attack” will be allocated. The subcluster code is passed to the model as a categorical value (for example, for a heart attack: 1 for a history of a heart attack, 0 for no heart attack).
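The successive split by age band, then sex, then territory in step 1 can be sketched as a simple grouping by a composite key. This is an illustrative simplification under stated assumptions, not the hierarchical clustering algorithm of the application: the functions `age_band` and `group_patients`, the record fields and the 5-year step starting at age 40 are taken from the cardiovascular example or invented for the sketch.

```python
from collections import defaultdict


def age_band(age, step=5, start=40):
    """Map an age to a band label; 5-year steps within the 40+ group."""
    if age < start:
        return f"under {start}"
    low = start + ((age - start) // step) * step
    return f"{low}-{low + step - 1}"


def group_patients(patients):
    """Split patients by age band, then sex, then territory of residence."""
    groups = defaultdict(list)
    for p in patients:
        key = (age_band(p["age"]), p["sex"], p["territory"])
        groups[key].append(p["id"])
    return dict(groups)


# Hypothetical depersonalized records.
patients = [
    {"id": 1, "age": 47, "sex": "M", "territory": "Far North"},
    {"id": 2, "age": 45, "sex": "M", "territory": "Far North"},
    {"id": 3, "age": 62, "sex": "F", "territory": "Southern Federal District"},
]
for key, ids in group_patients(patients).items():
    print(key, ids)
```

Each resulting key plays the role of a group such as “men aged 45-49 living in the Far North”; the groups would then be passed to the diagnosis-based clustering of step 2.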

k-means method:

- the number of clusters k is determined;

- k rows are randomly selected from the initial data set to act as the initial cluster centers;

- for each data row, the nearest cluster center is determined;

- centroids (centers of gravity of clusters) are calculated;

- the center of each cluster is shifted to its centroid;

- the steps from determining the nearest center to shifting the center to the centroid are repeated iteratively, which ensures the growth of intercluster distances. The smaller the intercluster distance, the more likely the patient is to be referred to several clusters; if the intercluster distance is large, the patient is assigned to a cluster with a common diagnosis.
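The k-means steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: rows are numeric tuples, distance is Euclidean, and the function names, the fixed random seed and the sample data are hypothetical. A production system would more likely rely on a library implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, max_iter=100, seed=0):
    """Cluster rows (tuples of numbers) into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # k randomly selected rows as initial centers
    labels = None
    for _ in range(max_iter):
        # Determine the nearest cluster center for each row.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers no longer move
        labels = new_labels
        # Shift each center to the centroid of the rows assigned to it.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


rows = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(rows, k=2)
print(labels)
print(centers)
```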

c-means method: - a membership matrix is formed to divide objects into k clusters; - the values of the error criterion are determined; - all objects are regrouped to reduce the value of the error criterion; - the last two procedures are repeated until the changes in the matrix during the regrouping become insignificant, which indicates that a common diagnosis has been determined.
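The text does not state which variant of the c-means method is meant. The sketch below assumes the standard fuzzy c-means algorithm with Euclidean distance and fuzzifier m = 2; the function name, parameters and sample data are hypothetical. The loop stops when the membership matrix changes by less than a tolerance, which corresponds to the stopping rule described above.

```python
import random


def fuzzy_c_means(points, k, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Return (membership matrix, centers); u[j][i] is the degree to which
    point j belongs to cluster i, and each row of u sums to 1."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Initial membership matrix: random rows normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    power = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Centers are membership-weighted means of the points.
        centers = []
        for i in range(k):
            w = [u[j][i] ** m for j in range(n)]
            total = sum(w)
            centers.append(
                tuple(sum(w[j] * points[j][d] for j in range(n)) / total for d in range(dim))
            )
        # Regroup: recompute memberships from distances to the new centers.
        new_u = []
        for j in range(n):
            dists = [sum((a - b) ** 2 for a, b in zip(points[j], c)) ** 0.5 for c in centers]
            if min(dists) == 0.0:
                hit = dists.index(0.0)  # point coincides with a center
                new_u.append([1.0 if i == hit else 0.0 for i in range(k)])
            else:
                new_u.append(
                    [1.0 / sum((dists[i] / dists[c]) ** power for c in range(k)) for i in range(k)]
                )
        change = max(abs(new_u[j][i] - u[j][i]) for j in range(n) for i in range(k))
        u = new_u
        if change < tol:  # changes in the matrix have become insignificant
            break
    return u, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
membership, centers = fuzzy_c_means(points, k=2)
for point, row in zip(points, membership):
    print(point, [round(v, 3) for v in row])
```

Unlike k-means, each patient receives a degree of membership in every cluster, so a patient lying between clusters can be related to several of them.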

Minimum spanning tree algorithm: a minimum spanning tree is built, and then the edges with the highest weight are sequentially removed. The criterion for belonging to the common diagnosis of a cluster is minimum edge weight.
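A minimal sketch of this procedure, assuming Euclidean distance between numeric points as the edge weight and Kruskal's algorithm for building the tree (the text names neither): the k - 1 heaviest edges of the tree are removed, and the remaining connected components are the clusters. The function name `mst_clusters` and the sample data are hypothetical.

```python
import math


def mst_clusters(points, k):
    """Cluster points by building a minimum spanning tree and removing
    its k - 1 heaviest edges; returns one label per point."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )

    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal's algorithm: take edges in ascending order, skipping cycles.
    mst = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            mst.append((weight, i, j))

    # The tree has n - 1 edges in ascending order; keep the n - k lightest.
    kept = mst[: n - k]
    parent = list(range(n))
    for _, i, j in kept:
        root_i, root_j = find(i), find(j)
        parent[root_i] = root_j

    # Relabel components 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(points, k=2))
```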

Layer-by-layer clustering: connected components of the graph are distinguished, and the clustering algorithm forms a sequence of subgraphs that reflect the connections between clusters. A distance threshold is set; by changing it, the depth of the cluster hierarchy can be controlled. A tree structure is selected by the value of this threshold (it is calculated based on the selection of connected components at a certain level of distances between objects).
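The role of the distance threshold can be illustrated with the selection of connected components: two objects are linked when their distance does not exceed the threshold, and each connected component forms a cluster. The sketch below works on one-dimensional values for brevity; the function name and the data are hypothetical.

```python
from collections import deque


def threshold_components(values, threshold):
    """Label connected components of the graph in which two values are
    linked when their absolute difference does not exceed the threshold."""
    n = len(values)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Breadth-first search from an unlabeled object.
        labels[start] = current
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if labels[j] == -1 and abs(values[i] - values[j]) <= threshold:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


values = [1.0, 1.5, 2.0, 8.0, 8.4, 20.0]
print(threshold_components(values, threshold=1.0))  # finer layer
print(threshold_components(values, threshold=6.5))  # coarser layer
```

Raising the threshold merges components, so running the function over a series of thresholds yields the sequence of nested subgraphs from which the cluster hierarchy is read.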

3. The values of the medical and social signs of the patient's health status and of the signs of risk factors in each cluster are obtained, according to which the extracted signs will be assigned to the probability classes of the presence of chronic diseases. These values are the training sample. After the patient is assigned to one of the clusters, a categorical value is obtained for the patient, for example, 5, which means that the patient is assigned to the group “men 40-55 years old living in large cities”. It is this value that becomes an input parameter for the risk assessment models, i.e. the input is not only the data for each patient (initial data), but also the value of the patient's cluster.

4. Each classifier from the set is trained on the basis of a multilayer (at least 3 layers) feedforward neural network with normalization in the input layer and / or gradient boosting, which assign the patients of each cluster to a positive or negative class (for example, positive class for the probability of atherosclerosis, positive class for the probability of stroke, negative class for chronic kidney disease, etc.).

5. As a result, a class of the probability of having chronic diseases, according to the ICD class, is obtained for each cluster (1 or 0, where 0 means that there is no probability of the disease and 1 means that the disease will develop). For example, for atherosclerosis of the brachiocephalic arteries, the risk of developing the disease in the age group of 20-30 years, of any gender, in the absence of overweight is 0.1, which is interpreted as a low probability.

6. An assessment is made of the risk of developing complications of chronic diseases in patients assigned to the probability class of the presence of chronic diseases according to the ICD class. This assessment is based on machine learning algorithms (neural network, gradient boosting or random forest algorithm). Risk assessment refers to the numerical output of the model and its interpretation. For each algorithm, a threshold is calculated, above which the risk of an event (development of a complication) is considered high and below which it is considered low. It is possible to introduce multiple thresholds for interpretation, according to which the degree of risk is assessed as low, moderate, high, very high, etc. When the models are trained, the algorithms learn from big datasets of patients who developed these complications and patients who did not. For the numerical assessment of the risk of developing complications, regression algorithms are used that make it possible not only to classify patient data, such as “there is a risk of stroke” or “no risk”, but also to quantify the likelihood of a complication, which the platform interprets depending on the degree of risk. For example, the assessment of the likelihood of this patient developing a stroke is issued in the form of "high risk", which refers the patient to the group of increased attention.
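The interpretation of a numerical risk score through multiple thresholds can be sketched as below. The threshold values 0.25, 0.5 and 0.75 and the function name `risk_level` are placeholders chosen for illustration; the text says that thresholds are calculated separately for each algorithm.

```python
from bisect import bisect_right

# Hypothetical cut-offs between the four risk degrees.
THRESHOLDS = (0.25, 0.5, 0.75)
LEVELS = ("low", "moderate", "high", "very high")


def risk_level(score, thresholds=THRESHOLDS, levels=LEVELS):
    """Map a model output in [0, 1] to a qualitative degree of risk."""
    # bisect_right counts how many thresholds the score has reached.
    return levels[bisect_right(thresholds, score)]


for score in (0.1, 0.3, 0.6, 0.9):
    print(score, risk_level(score))
```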

A qualitative interpretation is developed with the participation of medical experts and, in the implementation, looks like a color scheme with explanations (red - high risk, green - low risk or no risk). For example, the probability of having atherosclerotic plaques in the brachiocephalic arteries in obese men (body mass index >= 25) aged 45-50 years, without diagnosed chronic cardiovascular diseases, in the presence of risk factors (smoking, physical inactivity, family history, dyslipidemia) was rated by the model as 0.6. The cutoff threshold for this model, calculated using the Youden criterion, is 0.55. Thus, a patient from the described group has a high risk of having atherosclerotic plaques.
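The Youden criterion selects the cutoff that maximizes J = sensitivity + specificity - 1 on labeled data. The sketch below shows this selection on a small made-up data set; the function name `youden_threshold` and the scores are hypothetical and unrelated to the 0.55 value reported above.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its score is at or above the threshold.
    Assumes y_true contains both positive (1) and negative (0) cases.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


y_true = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold, j = youden_threshold(y_true, scores)
print(threshold, round(j, 3))
```

A new patient whose model output is at or above the selected threshold would then be interpreted as high risk, as in the example with the output 0.6 and the threshold 0.55.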

At the output of a set of classifiers, a probability class for the presence of chronic diseases is obtained, according to the ICD class, and an assessment of the risk of developing complications of chronic diseases.

FIG. 1 illustrates an example in which, based on the medical and social signs of the patient's health status (age: 65 years; social status: military pensioner; territory of residence: the Far North; history: hypodynamia, high cholesterol, arthritis, arterial hypertension, a previous heart attack, antihypertensive therapy, the presence of injuries) and on the signs of risk factors (high body mass index, smoking, heart rate, waist circumference), the probability of coronary heart disease was determined to be more than 50% and the probability of developing type 2 diabetes mellitus to be more than 45%.

The degree of risk of cardiovascular disease is linked to factors such as gender, age, region of residence and social status. The next step in segmenting patients is to identify categories of patients with a burdened history (for example, patients with type 2 diabetes, patients with kidney disease, etc.) and to identify risk factors (family history, obesity, smoking, physical inactivity) within the age and sex groups. Using the combined clustering method (hierarchical clustering together with the other clustering methods described above), each group of patients with a similar burdened history and risk factors is assigned a group number (for example, a 68-year-old man, retired, body mass index = 34, smoker, with a history of diseases of the musculoskeletal system, namely arthritis that leads to physical inactivity). This group number and the data about the patient are fed into the trained set of classifiers. The classifier set contains classifiers for determining the risk of the following diseases: atherosclerosis, chronic lung disease and chronic heart failure, chronic kidney disease, atherosclerosis of the brachiocephalic arteries, diabetes. If the input contains the patient's health data (high cholesterol, arthritis, arterial hypertension, a previous heart attack, antihypertensive therapy, the presence of injuries) and risk factors (family history, obesity, smoking, physical inactivity), then the following classifiers begin to work on these data: the classifier for determining the risk of diabetes mellitus and the classifier for determining the risk of coronary heart disease. The rest of the classifiers are not involved, since there are no data for them to work on. At the output, a probability class for the presence of chronic diseases in the patient according to the ICD class is obtained.

Next, the risk of developing complications of the patient's chronic diseases is assessed; the threshold value for each algorithm is determined using the Youden statistic.

If it is necessary to assess the presence of atherosclerosis of the brachiocephalic arteries, then the following data are submitted to the input of the classifier set: general information about the patient (age, smoking); general medical information (weight, height, waist circumference, body mass index); medical examination data (blood pressure, heart rate, respiratory rate); information about the anamnesis (COVID-19, gout, diabetes mellitus, psoriasis, rheumatoid arthritis, atrial fibrillation); laboratory parameters (cholesterol, LDL, HDL, blood glucose, creatinine, AST, ALT, blood protein, triglycerides); data of instrumental measurements (myocardial mass of the left ventricle).

The set of classifiers is based on the decision tree algorithm. To determine atherosclerosis of the brachiocephalic arteries, only the classifier for atherosclerosis of the brachiocephalic arteries is launched, the rest of the classifiers from the set are not involved in the analysis.

Decision tree analysis uses a visual and analytical decision support tool to calculate expected values (or expected benefits) of competing alternatives. The structure of a tree consists of "leaves" and "branches". The features on which the objective function depends are written on the edges ("branches") of the decision tree, the values of the objective function are written in the "leaves", and the other nodes hold the features by which the cases differ. To classify a new case, one goes down the tree to a leaf and returns the corresponding value.

A decision tree in general can be described by the following formula:

(x, y) = (x1, x2, x3, ..., xk, y)

The dependent variable y is the target variable to be analysed, classified and summarized. The vector x consists of the input variables x1, x2, x3, etc., which are used to complete this task.
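Classifying a case (x, y) by going down the tree to a leaf can be sketched as follows. The `Node` structure, the feature names, the split thresholds and the leaf values are all hypothetical and chosen only to show the traversal; they are not taken from the trained model of the application.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Node:
    feature: Optional[str] = None    # feature tested at an internal node
    threshold: float = 0.0           # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None    # set only on leaves: the returned y


def predict(node: Node, x: Mapping[str, float]) -> float:
    """Go down the tree from the root to a leaf and return the leaf value."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


# A tiny hand-built tree over two input variables.
tree = Node(
    feature="age",
    threshold=45,
    left=Node(value=0.1),
    right=Node(
        feature="bmi",
        threshold=30,
        left=Node(value=0.3),
        right=Node(value=0.7),
    ),
)

print(predict(tree, {"age": 50, "bmi": 34}))
print(predict(tree, {"age": 30, "bmi": 34}))
```

In an ensemble, the outputs of many such trees are combined (for example, averaged or summed, depending on the algorithm) to give the final probability.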

The decision tree consists of three types of nodes:

Decision nodes

Chance nodes

End nodes

The output of the model, produced by the ensemble of decision trees, is as follows:

The probability of having plaques in the brachiocephalic arteries, as a number from 0 to 1. The output is interpreted using a threshold value (0.55): an output value above the threshold is considered a high risk, and a value below it a low risk, of the presence of plaques in the brachiocephalic arteries.

These application materials present a preferred implementation of the claimed technical solution, which should not be construed as limiting other, particular embodiments that do not go beyond the scope of the requested legal protection and are obvious to specialists in the relevant field of technology.

Claims

Formula

1. A computer-implemented method for early diagnosis of a patient's chronic diseases, running on a computing device containing a processor and a memory that stores instructions executed by the processor, and containing the following steps: receiving depersonalized medical data of an electronic medical record from a medical information system on the computing device; extracting, on the computing device, medical and social signs of the patient's health status, as well as signs of risk factors, by means of natural language processing; assigning the extracted medical and social signs of the patient's health status to a group of patients; assigning the group of patients to a specific cluster according to the diseases diagnosed in the patient, according to the data of the electronic medical record and the medical and social signs of the patient's health status, and combining the groups of patients into clusters with common diagnoses; obtaining the values of the medical and social signs of the patient's health status and of the signs of risk factors in each cluster, according to which the extracted signs will be assigned to the probability classes of the presence of chronic diseases; feeding the clusters with the extracted signs to the input of a trained set of classifiers for assigning the signs to the probability class of the presence of chronic diseases, where training the classifier set consists in training at least one classifier for at least one chronic disease, and a classifier from the set does not participate in the classification if the extracted signs do not include the input signs for this classifier; assessing the risk of developing complications of chronic diseases; obtaining, at the output of the set of classifiers, a probability class for the presence of chronic diseases, according to the ICD class, and an assessment of the risk of developing complications of chronic diseases.

2. The method according to claim 1, characterized in that the medical and social signs of the patient's health include: gender, age, social status, region of residence, physiological parameters, laboratory parameters.

3. The method according to claim 1, characterized in that the signs of the risk factor characterize the signs that negatively affect the health of the patient.

4. The method according to claim 1, characterized in that the assignment of groups into clusters is carried out by means of the k-means method and / or the c-means method and / or layered clustering and / or the selection of connected components and / or the minimum spanning tree method.