CN116864139A

CN116864139A - Disease risk assessment method, device, computer equipment and readable storage medium

Info

Publication number: CN116864139A
Application number: CN202310801002.XA
Authority: CN
Inventors: 唐蕊
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-10-10

Abstract

The invention relates to the technical fields of artificial intelligence and medical health, and discloses a disease risk assessment method, device, equipment and medium. The method comprises the following steps: collecting a sample data set; analyzing the sample data set to determine the influencing factors of a target disease; calculating a feature selection score for each influencing factor; selecting, according to the feature selection scores, a preset number of top-ranked influencing factors as a key factor set; setting the feature values of the key factors for each sample data; constructing a disease risk assessment model based on the extreme gradient boosting (XGBoost) algorithm; outputting a disease risk assessment result for each sample data through the disease risk assessment model; comparing the disease risk assessment result of each sample data with the actual result, and adjusting the model parameters of the disease risk assessment model to complete its training; obtaining a sample to be assessed; and outputting a disease risk assessment result for the sample to be assessed through the disease risk assessment model.

Description

Disease risk assessment method, device, computer equipment and readable storage medium

Technical Field

The invention relates to the technical field of artificial intelligence and medical health, in particular to a disease risk assessment method, a disease risk assessment device, computer equipment and a readable storage medium.

Background

As health awareness grows, more and more people in modern society pay attention to their physical health. By paying attention to health, people can better manage their own physical condition, prevent diseases, improve their quality of life and lay a solid foundation for long-term well-being.

Disease risk assessment questionnaires typically define their questions and corresponding options based on medical knowledge of the disease, epidemiological data and physicians' clinical experience. The user fills out the questionnaire, and the probability that the user will suffer from a certain disease in the future is assessed from the answers.

The questions and options of existing disease risk assessment questionnaires usually have to be designed according to medical knowledge of the disease, the clinical experience of doctors and the like. However, because there are many influencing factors and the number of questions in a questionnaire is limited, the questions can hardly cover all influencing factors, so in actual use the accuracy of disease risk assessment based on a patient's questionnaire answers is low. In addition, when the questionnaire answers are assessed, professional doctors are often required to review them manually, which takes up a great deal of their diagnosis time and makes risk assessment inefficient.

Disclosure of Invention

The invention provides a disease risk assessment method, a disease risk assessment device, computer equipment and a readable storage medium, which are used for solving the technical problems of low accuracy and low risk assessment efficiency of the current disease risk assessment according to questionnaire answers of patients.

In a first aspect, a method for disease risk assessment is provided, comprising:

collecting a sample data set comprising a plurality of sample data;

analyzing the sample data set to determine influencing factors of the target diseases;

calculating a feature selection score of each influencing factor;

selecting, according to the feature selection scores, a preset number of top-ranked influencing factors as a key factor set, wherein the key factor set comprises a plurality of key factors;

setting feature values of all key factors for each sample data, and forming a feature vector from the feature values of the key factors;

constructing a disease risk assessment model based on the extreme gradient boosting algorithm;

inputting the feature vector of each sample data into a disease risk assessment model, and outputting a disease risk assessment result of each sample data through the disease risk assessment model;

comparing the disease risk assessment results of each sample data with the actual results, and adjusting model parameters of a disease risk assessment model to complete training of the disease risk assessment model;

acquiring a sample to be assessed, and inputting a feature vector of the sample to be assessed into the disease risk assessment model;

and outputting a disease risk assessment result of the sample to be assessed through the disease risk assessment model.

In a second aspect, there is provided a disease risk assessment apparatus comprising:

a first acquisition module for acquiring a sample data set including a plurality of sample data;

the analysis module is used for analyzing the sample data set and determining influencing factors of the target diseases;

the computing module is used for computing the feature selection score of each influence factor;

the selection module is used for selecting, according to the feature selection scores, a preset number of top-ranked influencing factors as a key factor set, wherein the key factor set comprises a plurality of key factors;

the setting module is used for setting the feature values of all key factors for each sample data and forming a feature vector from the feature values of the key factors;

the construction module is used for constructing a disease risk assessment model based on the extreme gradient boosting algorithm;

the first output module is used for inputting the feature vector of each sample data into the disease risk assessment model and outputting the disease risk assessment result of each sample data through the disease risk assessment model;

the adjusting module is used for comparing the disease risk assessment result of each sample data with the actual result, adjusting the model parameters of the disease risk assessment model and completing the training of the disease risk assessment model;

the second acquisition module is used for acquiring a sample to be evaluated, and inputting the feature vector of the sample to be evaluated into the disease risk evaluation model;

and the second output module is used for outputting a disease risk assessment result of the sample to be assessed through the disease risk assessment model.

In a third aspect, a computer device is provided comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the disease risk assessment method described above when executing the computer program.

In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the disease risk assessment method described above.

In the scheme implemented by the disease risk assessment method, device, computer equipment and readable storage medium, given the many influencing factors that affect the target disease, the feature selection score of each influencing factor is calculated and a preset number of top-ranked influencing factors are selected as key factors. The feature values of the key factors of the sample to be assessed are then input into the disease risk assessment model as a feature vector, and a result indicating whether the sample to be assessed is at risk of the disease is given automatically. Because the disease risk assessment is based on the top-ranked key factors, its accuracy is improved; and because no manual review by professional doctors is needed, its efficiency is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic view of an application environment of a disease risk assessment method according to an embodiment of the invention;

FIG. 2 is a flow chart of a method for disease risk assessment according to an embodiment of the invention;

FIG. 3 is a schematic diagram of a disease risk assessment apparatus according to an embodiment of the invention;

FIG. 4 is a schematic diagram of a computer device according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of another configuration of a computer device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The disease risk assessment method provided by the embodiment of the application can be applied to an application environment as shown in FIG. 1, wherein a client communicates with a server through a network. The server may construct a disease risk assessment model based on the extreme gradient boosting algorithm, input the feature vector of the key factors of the sample to be assessed into the disease risk assessment model, and output the disease risk assessment result of the sample to be assessed through the model. The server then sends the disease risk assessment result to the client, and the client displays it. The client may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device. The server may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers. The present application is described in detail below with reference to specific embodiments.

Referring to fig. 2, fig. 2 is a schematic flow chart of a disease risk assessment method according to an embodiment of the invention.

The disease risk assessment method provided by the invention can be applied to the field of medical care, and helps doctors and medical care professionals to more accurately assess the risk of patients suffering from certain diseases and take corresponding preventive or therapeutic measures. The method can also be applied to the insurance field, evaluates the health condition and risk of individuals or groups, and helps insurance companies determine insurance rates, formulate personalized insurance plans or adjust policy terms. Further, it can also be applied in epidemiological studies to stratify the risk of a population, identify potentially high risk populations, and infer factors related to the occurrence of disease.

The disease risk assessment method comprises the following steps:

S101: Acquire a sample data set, the sample data set comprising a plurality of sample data.

Specifically, in the medical health field, medical record data from a large number of existing patient visits can be acquired from a medical institution, and sample data for disease risk assessment can be selected from these medical record data. In the insurance field, medical record data covering multiple visits can be obtained from the information voluntarily disclosed by a large number of insurance applicants, and sample data for disease risk assessment can be selected from these medical record data.

Wherein the sample dataset should include both samples diagnosed with the target disease and healthy samples not diagnosed with the target disease.

For example, for patients diagnosed with type 2 diabetes, the medical record data of these patients from a period before the diagnosis (for example, 6 months or 1 year earlier) are selected as sample data for type 2 diabetes risk assessment.

Alternatively, the medical record data of one patient may be used to construct several pieces of sample data. For example, if a patient diagnosed with type 2 diabetes has three visit records from 6 months, 12 months and 18 months before the diagnosis, three pieces of sample data for type 2 diabetes risk assessment may be constructed, one from each record.
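The construction of several samples from one patient's history can be sketched as follows. This is only an illustration: the record layout, the field names and the function `build_samples` are hypothetical and do not appear in the embodiment.

```python
from datetime import date

def build_samples(patient_id, diagnosis_date, visits, label=1):
    """Turn each pre-diagnosis visit into one labeled sample record.

    `visits` is a list of (visit_date, findings) pairs. The field names
    are hypothetical and chosen only for this illustration.
    """
    samples = []
    for visit_date, findings in visits:
        if visit_date >= diagnosis_date:
            continue  # only data recorded before the diagnosis is used
        months_before = (diagnosis_date.year - visit_date.year) * 12 + (
            diagnosis_date.month - visit_date.month
        )
        samples.append({
            "patient_id": patient_id,
            "months_before_diagnosis": months_before,
            "findings": findings,
            "label": label,  # 1 = later diagnosed with the target disease
        })
    return samples

visits = [
    (date(2021, 12, 1), {"thirst"}),
    (date(2022, 6, 1), {"thirst", "frequent urination"}),
    (date(2022, 12, 1), {"thirst", "frequent urination", "fatigue"}),
    (date(2023, 7, 1), {"fatigue"}),  # recorded after the diagnosis, skipped
]
samples = build_samples("p001", date(2023, 6, 1), visits)
print([s["months_before_diagnosis"] for s in samples])  # [18, 12, 6]
```

Samples built from the same patient are correlated, so in practice it may be advisable to keep all of a patient's samples on the same side of any train/test split.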

S102: Analyze the sample data set to determine the influencing factors of the target disease.

The target disease may be a single specific disease or a group of diseases. For example, a chronic disease assessment covers cardiovascular and cerebrovascular diseases, cancers, diabetes, chronic respiratory diseases and the like, whose influencing factors overlap to a large extent.

Specifically, data mining may be performed on the sample data in the sample data set, such as currently reported symptoms (e.g., weakness, dizziness, headache, chest tightness, shortness of breath), abnormal examination items (e.g., elevated blood pressure, elevated blood lipids, elevated blood uric acid) and family disease history (e.g., hypertension, diabetes, coronary heart disease). The influencing factors related to the target disease are then determined in combination with domain knowledge of the target disease. This supports accurate risk assessment, personalized medical decisions, disease prevention and intervention measures, and further research on the disease.

S103: a feature selection score is calculated for each influencing factor.

The feature selection score can represent the contribution value of each influence factor to disease diagnosis, and the most important key factors can be selected from a plurality of influence factors according to the feature selection score.

Specifically, the feature selection score of each influencing factor may be calculated in a manner of information gain, analysis of variance, mutual information or L1 regularization.

The information gain is a commonly used feature selection method, and is used for measuring importance of a feature to a target variable. It evaluates the information gain of a feature by calculating the degree of uncertainty reduction of the feature to a target variable based on the concept of information theory. The larger the information gain, the higher the feature selection score.

Wherein analysis of variance is used to compare whether there is a significant difference in mean between the different groups. In feature selection, variance analysis may be used to calculate the variance contribution of each influencing factor to the target variable. The analysis of variance results may be used as part of the feature selection score.

Wherein mutual information is an indicator of the interdependence between two variables. In feature selection, mutual information between each feature and the target variable may be calculated as a feature selection score. The larger the mutual information, the higher the feature selection score.

Wherein, L1 regularization is a common regularization method used to make the feature weights sparse. By introducing an L1 regularization term into the objective function, the weights of some features are driven to zero, so that the importance of each feature is measured indirectly. Features with non-zero weights may be considered to have a higher feature selection score.
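As one illustration of the options listed above, the following sketch scores factors by mutual information, assuming binary feature values and binary disease labels. For a discrete feature and target, this quantity equals the information gain H(Y) - H(Y|X). The toy data and factor names are invented for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of the feature
    py = Counter(ys)            # marginal counts of the label
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data (illustrative only): 1 = diagnosed, 0 = healthy
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "thirst":   [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly aligned with the label
    "fatigue":  [1, 1, 1, 0, 1, 0, 0, 0],  # partly informative
    "headache": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}
scores = {name: mutual_information(col, labels) for name, col in features.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A factor that splits the samples exactly along the label scores 1 bit here, while a factor that is statistically independent of the label scores 0.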

In one possible implementation manner, the present invention provides a completely new method for calculating feature selection scores of respective influencing factors, and S103 specifically includes substeps S1031 to S1033:

S1031: Obtain the medical knowledge score, similar key factor score and comprehensive key factor score of each influencing factor.

Wherein the medical knowledge score reflects the importance and relevance of influencing factors in the medical field for diagnosing a target disease.

Wherein the similar key factor score reflects the importance of one influencing factor relative to the other factors of the same class.

Wherein the composite key factor score reflects the importance of one influencing factor relative to all other factors.

Specifically, the medical knowledge score of each influence factor can be determined by an expert scoring method, and the similar key factor score and the comprehensive key factor score of each influence factor are determined by correlation analysis, analysis of variance, regression model and the like.

In the invention, the medical knowledge score, the similar key factor score and the comprehensive key factor score are used as indexes for evaluating the importance of the influencing factors, the factors which have important influence on the risk evaluation of the diseases are comprehensively determined, and the risk evaluation of the target diseases is focused on the factors with the most influence, so that the accuracy and the effectiveness of the risk evaluation of the target diseases can be improved. Further, processing a large number of influencing factors may result in an increase in complexity, an increase in computational cost, and a decrease in interpretability of the model, while determining key factors from the large number of influencing factors may help simplify the model, reduce unnecessary features, and improve interpretability and operability of the model.

S1032: and obtaining the weights of the medical knowledge scores, the similar key factor scores and the comprehensive key factor scores.

Wherein the weights of the medical knowledge score, the similar key factor score and the comprehensive key factor score represent the importance of the medical knowledge score, the similar key factor score and the comprehensive key factor score to the feature selection score of one influencing factor.

For example, the weights of the medical knowledge score, the similar key factor score and the comprehensive key factor score may be set to 0.3, 0.3 and 0.4, respectively, in which case the comprehensive key factor score is more important than the medical knowledge score and the similar key factor score.

Specifically, the weights of the medical knowledge score, the similar key factor score and the comprehensive key factor score can be determined by means of the analytic hierarchy process, an expert scoring method and the like.

S1033: calculating the feature selection score a of each influencing factor according to the following formula:

$$a = \mu_1 a_1 + \mu_2 a_2 + \mu_3 a_3$$

where $a_1$ represents the medical knowledge score, $\mu_1$ the weight of the medical knowledge score, $a_2$ the similar key factor score, $\mu_2$ the weight of the similar key factor score, $a_3$ the comprehensive key factor score, and $\mu_3$ the weight of the comprehensive key factor score.

The feature selection score combines the medical knowledge score, the similar key factor score and the comprehensive key factor score by weighted summation, which can improve the comprehensiveness, accuracy and practicability of the feature selection score.

Further, the feature selection score provides a quantitative way to assess the importance of each influencing factor to the target disease. According to the feature selection score, factors which have important roles in disease diagnosis, prediction or treatment can be preferentially considered, so that the accuracy of disease risk prediction is improved.
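The weighted summation of S1033 can be sketched in a few lines. The weights (0.3, 0.3, 0.4) and the three component scores below are illustrative assumptions, not values prescribed by the embodiment.

```python
def feature_selection_score(a1, a2, a3, weights=(0.3, 0.3, 0.4)):
    """Weighted sum a = mu1*a1 + mu2*a2 + mu3*a3.

    a1: medical knowledge score (0 or 1)
    a2: similar key factor score (share within the factor's own class)
    a3: comprehensive key factor score (share among all factors)
    The default weights are an assumed example and should sum to 1.
    """
    mu1, mu2, mu3 = weights
    return mu1 * a1 + mu2 * a2 + mu3 * a3

# Hypothetical factor: medically relevant, half of its class's
# contribution, one fifth of the total contribution.
score = feature_selection_score(a1=1, a2=0.5, a3=0.2)
print(round(score, 2))  # 0.53
```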

In one possible implementation, the calculation of the medical knowledge score includes:

labeling the medical knowledge score of each influencing factor according to medical knowledge.

Specifically, a medical expert or researcher can label the medical knowledge score of each influencing factor according to their own medical knowledge and professional judgment. Because the labels rest on the knowledge and experience of medical professionals, they provide an important basis for subsequent feature selection and analysis.

When an influencing factor affects the diagnosis of the target disease, its medical knowledge score is labeled 1; otherwise, it is labeled 0.

Taking type 2 diabetes as an example, each influencing factor is labeled according to medical knowledge: typical symptoms of type 2 diabetes include thirst, frequent urination and fatigue, so the influencing factors thirst, frequent urination and fatigue are labeled 1, while influencing factors unrelated to the typical symptoms of type 2 diabetes are labeled 0.

In one possible implementation, the calculation of the similar key factor score includes:

taking the feature value of each influencing factor as input and the disease risk assessment result of each sample data as output, and constructing a similar key factor score determination model based on the extreme gradient boosting algorithm.

The extreme gradient boosting algorithm (eXtreme Gradient Boosting, XGBoost) is a machine learning algorithm based on gradient boosting decision trees (Gradient Boosting Decision Trees, GBDT). It is an ensemble learning algorithm that solves regression and classification problems by combining multiple weak learners (typically decision trees) into one strong learner.

calculating, through a model interpreter, the contribution value of each influencing factor to the disease risk assessment result.

The model interpreter can be SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-Agnostic Explanations), ELI5 (Explain Like I'm 5), etc.

In particular, the model interpreter may be a SHAP (SHapley Additive exPlanations) model interpreter. SHAP is a method for interpreting machine learning models; by calculating the contribution of each feature to the model's predictions, it provides a global, interpretable assessment of feature importance.

Specifically, by means of the SHAP model interpreter, the contribution value of each influencing factor to the disease risk assessment result can be calculated.

calculating the proportion of the contribution value of each influencing factor relative to the contribution values of the other influencing factors of the same class, and normalizing it.

The purpose of the normalization is to eliminate dimensional differences between different influencing factors and to make them comparable within the same class.

taking the normalized proportion value of each influencing factor as its similar key factor score.

Because the normalized proportion value of each influencing factor is used as its similar key factor score, the similar key factor score reflects the importance of one influencing factor relative to the other factors of the same class.

In one possible implementation, the calculation method of the comprehensive key factor score includes:

The model interpreter may be a SHAP model interpreter, through which the contribution value of each influencing factor to the disease risk assessment result may be calculated.

calculating the proportion of the contribution value of each influencing factor relative to the contribution values of all influencing factors, and normalizing it.

taking the normalized proportion value of each influencing factor as its comprehensive key factor score.

The normalized ratio value of each influence factor is used as a comprehensive key factor score, so that the comprehensive key factor score can reflect the importance of one influence factor relative to all other factors.
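The two normalizations can be sketched as follows, under one reading of the text: the similar key factor score is a factor's share of the total contribution within its own class, and the comprehensive key factor score is its share of the total contribution over all factors. The contribution values below are made-up stand-ins for what a model interpreter such as SHAP would produce (for example, mean absolute contribution per factor), and the class names are hypothetical.

```python
from collections import defaultdict

# Hypothetical non-negative contribution values per influencing factor
contributions = {
    "thirst": 0.30,
    "frequent urination": 0.20,
    "elevated blood glucose": 0.40,
    "elevated blood lipids": 0.10,
}
# Hypothetical class of each factor
categories = {
    "thirst": "symptom",
    "frequent urination": "symptom",
    "elevated blood glucose": "abnormal examination item",
    "elevated blood lipids": "abnormal examination item",
}

def similar_key_factor_scores(contributions, categories):
    """Share of each factor within the total contribution of its class."""
    class_totals = defaultdict(float)
    for factor, value in contributions.items():
        class_totals[categories[factor]] += value
    return {f: v / class_totals[categories[f]] for f, v in contributions.items()}

def comprehensive_key_factor_scores(contributions):
    """Share of each factor within the total contribution of all factors."""
    total = sum(contributions.values())
    return {f: v / total for f, v in contributions.items()}

similar = similar_key_factor_scores(contributions, categories)
comprehensive = comprehensive_key_factor_scores(contributions)
print(round(similar["thirst"], 2), round(comprehensive["thirst"], 2))  # 0.6 0.3
```

The sketch assumes every class has a strictly positive total contribution; a class whose factors all contribute zero would need separate handling.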

In one possible implementation, obtaining the weights of the medical knowledge score, the similar key factor score and the comprehensive key factor score specifically includes:

comparing the medical knowledge score, the similar key factor score and the comprehensive key factor score in pairs and, using the nine-level scale method, establishing a discrimination matrix B:

$$B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}$$

where $b_{ij}$ represents the importance score of the i-th score relative to the j-th score, with i = 1, 2, 3 and j = 1, 2, 3; the value of $b_{ij}$ can be determined by the nine-level scale method.

The nine-level scale method is a commonly used subjective evaluation method for rating the relative importance of different factors or objects. It divides relative importance into nine levels, each corresponding to a score, typically from 1 to 9. The medical knowledge score, the similar key factor score and the comprehensive key factor score are compared in pairs according to the nine-level scale method to determine their relative importance, which yields the discrimination matrix from which the weight values are then calculated.

Calculating the eigenvectors and eigenvalues of the discrimination matrix B:

$$B\omega = \lambda\omega \;\Rightarrow\; (B - \lambda I)\,\omega = 0$$

where $\lambda$ represents an eigenvalue of the discrimination matrix B, $I$ represents the identity matrix, and $\omega$ represents an eigenvector of the discrimination matrix B. The largest eigenvalue is denoted $\lambda_{max}$, and the eigenvector corresponding to the largest eigenvalue is $\omega_{max} = (\omega_1, \omega_2, \omega_3)$.

Normalizing the eigenvector $\omega_{max}$:

$$\bar{\omega}_i = \frac{\omega_i}{\sum_{j=1}^{3} \omega_j}$$

where i = 1, 2, 3. The components $\bar{\omega}_1, \bar{\omega}_2, \bar{\omega}_3$ of the normalized vector $\bar{\omega}$ represent the weights of the respective evaluation indexes and are denoted $\mu_1$, $\mu_2$, $\mu_3$.

In the invention, the complex decision problem is decomposed into the hierarchical structure by comparing the medical knowledge score, the similar key factor score and the comprehensive key factor score in pairs, and the relationships among different hierarchies and factors can be systematically considered, so that the decision process is more structured, and the reliability of the weights of the medical knowledge score, the similar key factor score and the comprehensive key factor score is improved.
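The eigenvector step can be sketched with power iteration, one common way to approximate the principal eigenvector of a positive matrix; the embodiment does not say which numerical method is used. The pairwise comparison values in `B` are an assumed example in which the comprehensive key factor score is judged twice as important as each of the other two scores.

```python
def ahp_weights(B, iterations=100):
    """Approximate the principal eigenvector of B by power iteration and
    normalize it so that its components sum to 1."""
    n = len(B)
    w = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # Estimate the largest eigenvalue from B*w = lambda*w, averaged over rows
    Bw = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(Bw[i] / w[i] for i in range(n)) / n
    return w, lambda_max

# Assumed discrimination matrix; rows and columns are ordered as
# (medical knowledge, similar key factor, comprehensive key factor).
B = [
    [1.0, 1.0, 0.5],
    [1.0, 1.0, 0.5],
    [2.0, 2.0, 1.0],
]
weights, lambda_max = ahp_weights(B)
print([round(w, 3) for w in weights], round(lambda_max, 3))  # [0.25, 0.25, 0.5] 3.0
```

For a perfectly consistent 3 x 3 comparison matrix such as this one, the largest eigenvalue equals 3; a noticeably larger value indicates inconsistent pairwise judgments.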

S104: Select, according to the feature selection scores, a preset number of top-ranked influencing factors as a key factor set, wherein the key factor set comprises a plurality of key factors.

The preset number may be 10 or 20, etc. The size of the preset number can be set by a person skilled in the art according to practical situations, and the invention is not limited.
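Selecting the top-ranked factors amounts to sorting by feature selection score and keeping the first k entries. The scores and the value k = 3 below are illustrative only.

```python
def select_key_factors(scores, k):
    """Return the names of the k factors with the highest feature selection score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical feature selection scores
feature_scores = {
    "thirst": 0.53,
    "headache": 0.05,
    "frequent urination": 0.47,
    "fatigue": 0.31,
}
key_factors = select_key_factors(feature_scores, k=3)
print(key_factors)  # ['thirst', 'frequent urination', 'fatigue']
```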

S105: Set the feature values of each key factor for each sample data, and form a feature vector from the feature values of the key factors.

Specifically, the feature value of a key factor may be set to 1 when the sample data satisfies the key factor, and to 0 otherwise.

For example, typical symptoms of type 2 diabetes include thirst, frequent urination and fatigue. If the sample data of a patient show thirst and frequent urination but no fatigue, the feature values of thirst and frequent urination are set to 1 and the feature value of fatigue is set to 0.
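The 0/1 encoding of this example can be sketched as follows; the ordering of the key factors and the function name are assumptions made for the illustration.

```python
# Key factors in a fixed order, so that every sample yields a vector
# with the same layout.
KEY_FACTORS = ["thirst", "frequent urination", "fatigue"]

def to_feature_vector(findings, key_factors=KEY_FACTORS):
    """Encode a sample as 1 where the key factor is present, else 0."""
    return [1 if factor in findings else 0 for factor in key_factors]

# Patient with thirst and frequent urination but no fatigue
vector = to_feature_vector({"thirst", "frequent urination"})
print(vector)  # [1, 1, 0]
```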

S106: Construct a disease risk assessment model based on the extreme gradient boosting algorithm.

The extreme gradient boosting algorithm (Extreme Gradient Boosting, XGBoost for short) is an ensemble learning algorithm used to solve regression and classification problems. It is a variant of the gradient boosting algorithm and is characterized by high efficiency and accuracy. XGBoost combines gradient boosting decision trees with regularization techniques: it iteratively trains multiple weak learners (regression trees) and combines their predictions to form a strong learner.

Specifically, the disease risk assessment model takes the feature vector of each sample data as input, and takes the disease risk assessment result as output.

S107: Input the feature vector of each sample data into the disease risk assessment model, and output the disease risk assessment result of each sample data through the disease risk assessment model.

Specifically, the disease risk assessment model based on the extreme gradient boosting algorithm contains a plurality of weak learners (usually decision trees). Each weak learner produces its own output, and the outputs of all weak learners are combined to obtain the final disease risk assessment result of the disease risk assessment model.
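How the outputs of the weak learners are combined can be sketched with hand-written one-split trees (stumps). In gradient boosting the tree outputs are summed; for a binary outcome a logistic function is commonly applied to the sum to obtain a probability, which is assumed here. The stumps and their leaf scores are invented for illustration and are not a trained model.

```python
import math

# Each hypothetical stump tests one binary feature and returns a leaf score:
# (feature_index, score_if_feature_is_0, score_if_feature_is_1)
STUMPS = [
    (0, -0.4, 0.6),  # thirst
    (1, -0.3, 0.5),  # frequent urination
    (2, -0.2, 0.4),  # fatigue
]

def predict_risk(x, stumps=STUMPS, base_score=0.0):
    """Sum the leaf scores of all stumps and map the sum to a probability."""
    margin = base_score + sum(s1 if x[i] == 1 else s0 for i, s0, s1 in stumps)
    return 1.0 / (1.0 + math.exp(-margin))

p_symptomatic = predict_risk([1, 1, 0])  # thirst and frequent urination present
p_none = predict_risk([0, 0, 0])         # no key factor present
print(round(p_symptomatic, 3), round(p_none, 3))
```

A threshold on the probability (for example 0.5) would then turn the output into an "at risk" or "not at risk" result; the choice of threshold is left open here.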

S108: Compare the disease risk assessment result of each sample data with the actual result, adjust the model parameters of the disease risk assessment model, and complete the training of the disease risk assessment model.

Specifically, various evaluation indexes such as accuracy, recall, F1 score, and the like can be used to evaluate the comparison result of the disease risk evaluation result and the actual result, thereby calculating the accuracy and predictive ability of the disease risk evaluation model. When the accuracy and the prediction capability of the disease risk assessment model are poor, the parameters of the disease risk assessment model are adjusted, the disease risk assessment model is gradually improved and optimized, and the prediction capability and the accuracy of the disease risk assessment model are improved.
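The evaluation indexes named above can be computed directly from the predicted and actual results. The sketch below assumes binary labels (1 = at risk, 0 = not at risk); the label lists are toy data.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 score for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy case there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four indexes equal 0.75.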

In one possible implementation, the disease risk assessment model is trained by constructing an objective function. The extreme gradient boosting algorithm includes a plurality of regression trees, whose leaf nodes represent the prediction results of the regression trees, and S108 specifically includes substeps S1081 and S1082:

S1081: Construct the objective function Obj of the disease risk assessment model from the error L between the disease risk assessment result and the actual result of each sample data and the model complexity Ω:

Obj＝L+Ω

wherein n represents the number of sample data, y _i Representing the disease risk assessment result of the ith sample data,representing the actual result of the ith sample data, T representing the number of leaf nodes, w representing the real number fraction of leaf nodes, α and β representing hyper-parameters preventing the disease risk assessment model from overfitting, αT representing the L1 regularization term,/->Representing the L2 regularization term.

Wherein the error term L measures the difference between the predicted outcome and the actual outcome of the model on the training data.

Wherein the model complexity term Ω is used to control the complexity of the model to prevent overfitting.

By combining the error term L and the model complexity term Ω, the objective function Obj strikes a balance between the prediction accuracy of the model and its complexity.
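The structure of the objective function can be sketched as follows. The form of the error term is not specified here, so the sketch assumes a logarithmic loss summed over the samples; the factor 0.5 in the L2 term follows the usual gradient boosting convention and is likewise an assumption. All numeric values are hypothetical.

```python
import math

def log_loss(predicted, actual, eps=1e-12):
    """Assumed error term L: logarithmic loss summed over all samples."""
    total = 0.0
    for p, y in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

def complexity(leaf_scores, alpha, beta):
    """Complexity term: alpha * T + 0.5 * beta * sum of squared leaf scores."""
    return (alpha * len(leaf_scores)
            + 0.5 * beta * sum(w * w for w in leaf_scores))

def objective(predicted, actual, leaf_scores, alpha, beta):
    """Obj = L + Omega."""
    return log_loss(predicted, actual) + complexity(leaf_scores, alpha, beta)

leaf_scores = [0.5, -0.5, 1.0]                    # hypothetical tree with T = 3
omega = complexity(leaf_scores, alpha=1.0, beta=2.0)
obj = objective([0.5, 0.5], [1, 0], leaf_scores, alpha=1.0, beta=2.0)
```

Raising alpha penalizes trees with more leaves, and raising beta penalizes large leaf scores, so both hyper-parameters push the model toward simpler trees.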

S1082: adjusting the model parameters of the disease risk assessment model with the goal of minimizing the value of the objective function, thereby completing the training of the disease risk assessment model.

That is, the model parameters are adjusted so as to minimize the objective function.

According to the invention, the model parameters are adjusted by minimizing the objective function, so that the predicted result of the disease risk assessment model on the training data approximates to the actual result, and meanwhile, the complexity of the disease risk assessment model is controlled, so that the generalization capability and the robustness of the disease risk assessment model are improved.
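The optimization procedure itself is not detailed here. As an illustration only, gradient boosting implementations commonly minimize an objective of this form through a second-order approximation, under which the best score of a leaf with a fixed tree structure has a closed form. The sketch below assumes a squared-error loss and hypothetical sample values; the function names are hypothetical as well.

```python
def optimal_leaf_score(gradients, hessians, beta):
    """Closed-form minimizer of the approximated objective for one leaf."""
    return -sum(gradients) / (sum(hessians) + beta)

def leaf_objective(w, gradients, hessians, beta):
    """Second-order approximation of one leaf's share of the objective.

    The alpha * T term is constant for a fixed tree structure, so it does
    not influence the best leaf score and is omitted.
    """
    return sum(gradients) * w + 0.5 * (sum(hessians) + beta) * w * w

# Assumed loss 0.5 * (prediction - actual) ** 2, current prediction 0.0
actual = [1.0, 1.0, 0.0, 1.0]
current = [0.0, 0.0, 0.0, 0.0]
gradients = [p - y for p, y in zip(current, actual)]  # first derivatives
hessians = [1.0] * len(actual)                        # second derivatives
w_star = optimal_leaf_score(gradients, hessians, beta=1.0)
```

With beta set to zero the leaf score would equal the mean of the actual results (0.75); the L2 term shrinks it toward zero, which illustrates how the complexity term limits overfitting.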

S109: acquiring a sample to be evaluated, and inputting the feature vector of the sample to be evaluated into the disease risk assessment model.

In the field of healthcare, a patient may be invited to fill out a questionnaire covering the key factors in order to obtain a sample to be evaluated. In the field of insurance, an applicant may likewise be asked to fill out a questionnaire covering the key factors so as to obtain a sample to be evaluated.

For the method of obtaining the feature vector of the sample to be evaluated, reference may be made to the method of obtaining the feature vector of the sample data in S105; to avoid repetition, details are not repeated here.
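As a minimal sketch of turning questionnaire answers into a feature vector, the following assumes a hypothetical key factor set and hypothetical encoding rules; the actual feature value settings are those of S105 and are not reproduced here.

```python
# Hypothetical key factor set; its order fixes the layout of the vector
KEY_FACTORS = ["age", "smoker", "family_history"]

# Hypothetical encoders that turn a questionnaire answer into a feature value
ENCODERS = {
    "age": lambda answer: float(answer),
    "smoker": lambda answer: 1.0 if answer == "yes" else 0.0,
    "family_history": lambda answer: 1.0 if answer == "yes" else 0.0,
}

def to_feature_vector(answers):
    """Encode the answers in key factor order; a missing answer raises KeyError."""
    return [ENCODERS[name](answers[name]) for name in KEY_FACTORS]

vector = to_feature_vector(
    {"age": "57", "smoker": "yes", "family_history": "no"}
)
```

Keeping the key factor order in one place ensures that the sample to be evaluated is encoded with the same layout as the training samples.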

S110: outputting a disease risk assessment result of the sample to be assessed through the disease risk assessment model.

For the specific way of outputting the disease risk assessment result of the sample to be assessed through the disease risk assessment model, reference may be made to the specific way of outputting the disease risk assessment result of each sample data through the disease risk assessment model in S107; to avoid repetition, details are not repeated here.

It can be seen that, in the above scheme, faced with the many influencing factors that affect the target disease, the feature selection score of each influencing factor is calculated, and a preset number of top-scoring influencing factors are selected as key factors. The feature values of the key factors of the sample to be evaluated are then input into the disease risk assessment model as a feature vector, and a disease risk result indicating whether the sample to be evaluated is at risk of the disease is given automatically. Because the disease risk assessment is based on the top-scoring key factors, its accuracy is improved; and because no manual review by a professional doctor is required, its efficiency is improved.
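The scoring and selection of key factors summarized above can be sketched as follows. The sketch assumes that each influencing factor already has its three component scores (medical knowledge score, similar key factor score, comprehensive key factor score) and that the three weights are known; all factor names and numeric values are hypothetical.

```python
def feature_selection_score(a1, a2, a3, weights):
    """Weighted sum a = mu1 * a1 + mu2 * a2 + mu3 * a3."""
    mu1, mu2, mu3 = weights
    return mu1 * a1 + mu2 * a2 + mu3 * a3

def select_key_factors(scores, preset_number):
    """Return the names of the preset number of top-scoring factors."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:preset_number]]

weights = (0.5, 0.3, 0.2)          # hypothetical mu1, mu2, mu3
component_scores = {               # hypothetical (a1, a2, a3) per factor
    "age": (0.9, 0.6, 0.5),
    "smoker": (0.8, 0.7, 0.4),
    "shoe_size": (0.1, 0.2, 0.05),
}
scores = {
    name: feature_selection_score(a1, a2, a3, weights)
    for name, (a1, a2, a3) in component_scores.items()
}
key_factors = select_key_factors(scores, preset_number=2)
```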

It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiment of the present invention in any way.

In one possible embodiment, a disease risk assessment device is provided, which corresponds one-to-one to the disease risk assessment method in the above embodiment. As shown in fig. 3, the disease risk assessment apparatus 30 includes a first acquisition module 301, an analysis module 302, a calculation module 303, a selection module 304, a setting module 305, a construction module 306, a first output module 307, an adjustment module 308, a second acquisition module 309, and a second output module 310. The functional modules are described in detail as follows:

a first acquisition module 301, configured to acquire a sample data set, where the sample data set includes a plurality of sample data;

an analysis module 302, configured to analyze the sample data set and determine influencing factors of the target disease;

a calculation module 303, configured to calculate a feature selection score of each influencing factor;

a selecting module 304, configured to select, according to the feature selection score, a preset number of influencing factors with the highest feature selection scores as a key factor set, where the key factor set includes a plurality of key factors;

a setting module 305, configured to set feature values of each key factor of each sample data, and form feature vectors from the feature values of each key factor;

a construction module 306, configured to construct a disease risk assessment model based on an extreme gradient lifting algorithm;

a first output module 307, configured to input the feature vector of each sample data to a disease risk assessment model, and output a disease risk assessment result of each sample data through the disease risk assessment model;

the adjustment module 308 is configured to compare the disease risk assessment result with the actual result of each sample data, adjust model parameters of the disease risk assessment model, and complete training of the disease risk assessment model;

a second obtaining module 309, configured to obtain a sample to be evaluated, and input a feature vector of the sample to be evaluated to a disease risk assessment model;

the second output module 310 is configured to output a disease risk assessment result of the sample to be assessed through the disease risk assessment model.

In one possible implementation, the calculation module 303 is specifically configured to:

acquiring medical knowledge scores of all influence factors, similar key factor scores and comprehensive key factor scores;

acquiring the medical knowledge score, the similar key factor score and the weight of the comprehensive key factor score;

calculating the feature selection score a of each influencing factor according to the following formula:

$a = \mu_1 \cdot a_1 + \mu_2 \cdot a_2 + \mu_3 \cdot a_3$

labeling the medical knowledge scores of all the influence factors according to the medical knowledge;

taking the characteristic value of each influence factor as input, taking the disease risk assessment result of each sample data as output, and constructing a similar key factor score determining model based on an extreme gradient lifting algorithm;

calculating the contribution value of each influence factor to the disease risk assessment result through a model interpreter;

calculating the proportion of the contribution value of each influence factor relative to the contribution values of other influence factors of the same type, and carrying out normalization processing;

calculating the proportion of the contribution value of each influence factor relative to the contribution value of all influence factors, and carrying out normalization processing;

wherein $b_{ij}$, an element of the discrimination matrix $B$, represents the importance of the $i$-th score relative to the $j$-th score; the value of $b_{ij}$ can be determined using the nine-level scale, $i = 1, 2, 3$, $j = 1, 2, 3$;

calculating eigenvectors and eigenvalues of the discrimination matrix B:

$B\omega = \lambda\omega \rightarrow (B - \lambda I)\omega = 0$

wherein $\lambda$ represents an eigenvalue of the discrimination matrix $B$, $I$ represents the identity matrix, and $\omega$ represents an eigenvector of the discrimination matrix $B$; the largest eigenvalue is denoted $\lambda_{\max}$, and the eigenvector corresponding to the largest eigenvalue is $\omega_{\max}$, $\omega_{\max} = (\omega_1, \omega_2, \omega_3)$;

normalizing the eigenvector $\omega_{\max}$:

wherein the elements of the normalized vector respectively represent the weights of the evaluation indexes and are denoted $\mu_1$, $\mu_2$, $\mu_3$.
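The weight calculation performed on the discrimination matrix can be sketched as follows. The sketch assumes that the normalization divides each element of the principal eigenvector by the sum of the elements, which is the usual convention for pairwise comparison matrices, and it uses power iteration to find that eigenvector. The example matrix is hypothetical.

```python
def ahp_weights(matrix, iterations=100):
    """Estimate the principal eigenvector of a positive matrix, scaled to sum 1.

    Returns the normalized weights and the estimate of the largest eigenvalue.
    """
    n = len(matrix)
    weights = [1.0 / n] * n
    lambda_max = float(n)
    for _ in range(iterations):
        product = [
            sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)
        ]
        # The weights sum to 1, so the sum of B * w estimates lambda_max.
        lambda_max = sum(product)
        weights = [value / lambda_max for value in product]
    return weights, lambda_max

# Hypothetical 3 x 3 discrimination matrix with b_ji = 1 / b_ij
B = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
(mu1, mu2, mu3), lambda_max = ahp_weights(B)
```

For this perfectly consistent example the largest eigenvalue equals the matrix order, and the weights are proportional to the first column of the matrix.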

In one possible implementation, the extreme gradient lifting algorithm includes a plurality of regression trees, and the leaf nodes represent the prediction results of the regression trees, and the adjustment module 308 is specifically configured to:

constructing an objective function $Obj$ of the disease risk assessment model according to the error $L$ between the disease risk assessment result and the actual result of each sample data and the model complexity $\Omega$:

$Obj = L + \Omega$

wherein $n$ represents the number of sample data, $y_i$ represents the disease risk assessment result of the $i$-th sample data, $\hat{y}_i$ represents the actual result of the $i$-th sample data, $T$ represents the number of leaf nodes, $w$ represents the real-valued scores of the leaf nodes, $\alpha$ and $\beta$ represent hyper-parameters that prevent the disease risk assessment model from overfitting, $\alpha T$ represents the L1 regularization term, and $\frac{1}{2}\beta\lVert w\rVert^2$ represents the L2 regularization term;

adjusting the model parameters of the disease risk assessment model with the goal of minimizing the value of the objective function, thereby completing the training of the disease risk assessment model.

The invention provides a disease risk assessment device 30 which, faced with the many influencing factors that affect the target disease, calculates the feature selection score of each influencing factor and selects a preset number of top-scoring influencing factors as key factors. The feature values of the key factors of the sample to be assessed are then input into the disease risk assessment model as a feature vector, and a disease risk result indicating whether the sample to be assessed is at risk of the disease is given automatically. Because the disease risk assessment is based on the top-scoring key factors, its accuracy is improved; and because no manual review by a professional doctor is required, its efficiency is improved.
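The proportion-based scores used by the calculation module 303 can be sketched as follows. The sketch assumes that the model interpreter has already produced one non-negative contribution value per influencing factor, and it reads "proportion relative to factors of the same type" as each factor's share within its type group. The factor names, type labels and contribution values are hypothetical.

```python
from collections import defaultdict

def proportion_scores(contributions, factor_types):
    """Return (share within the same type, share among all factors)."""
    total = sum(contributions.values())
    type_totals = defaultdict(float)
    for name, value in contributions.items():
        type_totals[factor_types[name]] += value
    same_type = {
        name: value / type_totals[factor_types[name]]
        for name, value in contributions.items()
    }
    overall = {name: value / total for name, value in contributions.items()}
    return same_type, overall

# Hypothetical contribution values produced by a model interpreter
contributions = {"age": 0.30, "sex": 0.10, "smoker": 0.40, "alcohol": 0.20}
factor_types = {
    "age": "demographic", "sex": "demographic",
    "smoker": "lifestyle", "alcohol": "lifestyle",
}
same_type, overall = proportion_scores(contributions, factor_types)
```

Under these assumptions the within-type shares play the role of the similar key factor score and the overall shares play the role of the comprehensive key factor score; each set of shares sums to 1 within its scope.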

For specific limitations of the disease risk assessment apparatus 30, reference may be made to the above limitations of the disease risk assessment method, which are not repeated here. The modules in the above disease risk assessment apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one possible embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external client via a network connection. The computer program, when executed by a processor, performs functions or steps of a server side of a disease risk assessment method.

In one possible embodiment, a computer device is provided, which may be a client, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external server via a network connection. The computer program, when executed by the processor, performs the functions or steps of the client side of the disease risk assessment method.

In one possible embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

obtaining a sample data set comprising a plurality of sample data;

calculating the feature selection score of each influence factor;

selecting, according to the feature selection scores, a preset number of top-scoring influence factors as key factors;

In one possible embodiment, a computer readable storage medium is provided, having stored thereon a computer program which when executed by a processor performs the steps of:

obtaining a sample data set comprising a plurality of sample data;

calculating the feature selection score of each influence factor;

It should be noted that, the functions or steps implemented by the computer readable storage medium or the computer device may correspond to the relevant descriptions of the server side and the client side in the foregoing method embodiments, and are not described herein for avoiding repetition.

Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims

1. A method of disease risk assessment comprising:

obtaining a sample dataset comprising a plurality of sample data;

calculating a feature selection score of each influence factor;

selecting, according to the feature selection score, a preset number of influence factors with the highest feature selection scores as a key factor set, wherein the key factor set comprises a plurality of key factors;

inputting the feature vector of each sample data into the disease risk assessment model, and outputting a disease risk assessment result of each sample data through the disease risk assessment model;

comparing the disease risk assessment results of each sample data with the actual results, and adjusting model parameters of the disease risk assessment model to complete training of the disease risk assessment model;

acquiring a sample to be evaluated, and inputting a feature vector of the sample to be evaluated into the disease risk evaluation model;

2. The disease risk assessment method of claim 1, wherein said calculating a feature selection score for each of said influencing factors, in particular comprises:

acquiring medical knowledge scores of the influence factors, similar key factor scores and comprehensive key factor scores;

acquiring weights of the medical knowledge scores, the similar key factor scores and the comprehensive key factor scores;

calculating a feature selection score a of each of the influencing factors according to the following formula:

$a = \mu_1 \cdot a_1 + \mu_2 \cdot a_2 + \mu_3 \cdot a_3$

3. The disease risk assessment method of claim 2, wherein the means for calculating the medical knowledge score comprises:

labeling the medical knowledge scores of the influence factors according to the medical knowledge;

4. The disease risk assessment method of claim 2, wherein the means for calculating the homogeneous key factor score comprises:

taking the characteristic value of each influence factor as input, taking the disease risk assessment result of each sample data as output, and constructing a similar key factor score determination model based on an extreme gradient lifting algorithm;

calculating the contribution value of each influence factor to the disease risk assessment result through a model interpreter;

calculating the proportion of the contribution value of each influence factor relative to the contribution value of other influence factors of the same type, and carrying out normalization processing;

5. The disease risk assessment method of claim 2, wherein the means for calculating the integrated key factor score comprises:

taking the normalized proportion value of each influence factor as the comprehensive key factor score.

6. The disease risk assessment method of claim 2, wherein the obtaining weights for the medical knowledge score, the homogeneous key factor score, and the comprehensive key factor score, comprises:

calculating the eigenvector and eigenvalue of the discrimination matrix B:

$B\omega = \lambda\omega \rightarrow (B - \lambda I)\omega = 0$

normalizing the eigenvector $\omega_{\max}$:

wherein the elements of the normalized vector respectively represent the weights of the evaluation indexes and are denoted $\mu_1$, $\mu_2$, $\mu_3$.

7. The disease risk assessment method according to claim 1, wherein the extreme gradient lifting algorithm comprises a plurality of regression trees, leaf nodes represent predicted results of the regression trees, the disease risk assessment results of each sample data are compared with actual results, model parameters of the disease risk assessment model are adjusted, and training of the disease risk assessment model is completed, specifically comprising:

constructing an objective function $Obj$ of the disease risk assessment model according to the error $L$ between the disease risk assessment result and the actual result of each sample data and the model complexity $\Omega$:

$Obj = L + \Omega$

wherein $n$ represents the number of the sample data, $y_i$ represents the disease risk assessment result of the $i$-th sample data, $\hat{y}_i$ represents the actual result of the $i$-th sample data, $T$ represents the number of leaf nodes, $w$ represents the real-valued scores of the leaf nodes, $\alpha$ and $\beta$ represent hyper-parameters that prevent the disease risk assessment model from overfitting, $\alpha T$ represents the L1 regularization term, and $\frac{1}{2}\beta\lVert w\rVert^2$ represents the L2 regularization term;

and adjusting the model parameters of the disease risk assessment model with the goal of minimizing the value of the objective function, thereby completing the training of the disease risk assessment model.

8. A disease risk assessment apparatus, comprising:

a first acquisition module for acquiring a sample data set, the sample data set comprising a plurality of sample data;

a selection module for selecting, according to the feature selection score, a preset number of influence factors with the highest feature selection scores as a key factor set, wherein the key factor set comprises a plurality of key factors;

the first output module is used for inputting the feature vector of each sample data into the disease risk assessment model, and outputting the disease risk assessment result of each sample data through the disease risk assessment model;

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the disease risk assessment method according to any one of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the disease risk assessment method according to any one of claims 1 to 7.