WO2020181805A1 - Diabetes prediction method and apparatus, storage medium, and computer device - Google Patents

Publication number
WO2020181805A1
Authority
WO
WIPO (PCT)
Prior art keywords
physical examination
data
index value
user
recognition model
Prior art date
Application number
PCT/CN2019/117217
Other languages
French (fr)
Chinese (zh)
Inventor
金晓辉
阮晓雯
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020181805A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • This application relates to the field of computer technology, in particular to a method and device for predicting diabetes, a storage medium, and computer equipment.
  • Diabetes is a group of metabolic diseases characterized by high blood sugar. It can damage large blood vessels and capillaries and endanger the heart, brain, kidneys, peripheral nerves, eyes, feet and other parts of the body, and it has many complications. It is therefore necessary to strengthen the prediction of diabetes. With the advancement of technology, the diagnosis of disease types is no longer limited to analysis by doctors, and using artificial intelligence to predict diabetes is more in line with today's development trend.
  • However, existing diabetes prediction methods can only judge whether a patient has diabetes; they cannot judge the severity of the disease. The diagnosis result is therefore incomplete, supporting control treatment cannot be matched to the degree of the disease, and the patient's condition may deteriorate.
  • This application provides a diabetes prediction method and device, a storage medium, and computer equipment.
  • The main purpose is to solve the problem that, when a constructed 0-1 classification model is used to predict diabetes, it can only be judged whether the user has diabetes and not how severe the disease is, which leads to an incomplete diagnosis.
  • A method for predicting diabetes includes: obtaining sample user data from original health files and electronic medical record data; creating a numerical regression prediction model based on user characteristics in the sample user data; using the regression prediction model to determine a first physical examination index value for the fasting blood glucose of the target user and a second physical examination index value for the blood glucose at a preset duration after a meal; and determining the degree of illness of the target user according to the first physical examination index value and/or the second physical examination index value.
  • a device for predicting diabetes comprising:
  • the obtaining unit is used to obtain sample user data in the original health file and electronic medical record data;
  • the creation unit is used to create a numerical regression prediction model based on the user characteristics in the sample user data;
  • a judging unit configured to use the regression prediction model to judge the first physical examination index value of the fasting blood glucose of the target user and the second physical examination index value of the blood glucose of the preset duration after a meal;
  • the determining unit is configured to determine the disease degree of the target user according to the first physical examination index value and/or the second physical examination index value.
  • a non-volatile readable storage medium having computer readable instructions stored thereon, and when the computer readable instructions are executed by a processor, the aforementioned diabetes prediction method is implemented.
  • a computer device including a non-volatile readable storage medium, a processor, and computer-readable instructions that are stored on the non-volatile readable storage medium and can run on the processor; when the processor executes the computer-readable instructions, the aforementioned diabetes prediction method is implemented.
  • Compared with the current approach of predicting diabetes with a constructed 0-1 classification model, the diabetes prediction method and device, storage medium, and computer equipment provided by this application add a regression prediction model of fasting blood glucose and 2h postprandial blood glucose.
  • The regression prediction model can be used to determine the first physical examination index value for the target user's fasting blood glucose and the second physical examination index value for the blood glucose at a preset duration after a meal. These physical examination index values can be used to determine whether the target user has diabetes and, further, the degree of the target user's disease.
  • Fig. 1 shows a schematic flow chart of a method for predicting diabetes according to an embodiment of the present application;
  • Fig. 2 shows a schematic flow chart of another method for predicting diabetes according to an embodiment of the present application;
  • Fig. 3 shows a schematic structural diagram of a diabetes prediction device provided by an embodiment of the present application;
  • Fig. 4 shows a schematic structural diagram of another diabetes prediction device provided by an embodiment of the present application.
  • Obtain sample user data from original health files and electronic medical record data.
  • The sample user data may include patient visit data, physical examination index data, medication data, and health notification data.
  • The regression prediction model can be constructed from multiple different framework models based on decision trees; that is, ensemble learning is used to combine multiple decision-tree-based prediction models and improve the accuracy of the prediction results.
  • A decision tree is a relatively simple supervised machine learning classification algorithm. It is a predictive model that represents a mapping between object attributes and object values.
  • Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node.
  • A decision tree has only a single output. If multiple outputs are needed, independent decision trees can be built to handle the different outputs.
  • Decision tree algorithms include ID3, C4.5, and CART. What they have in common is that they are all greedy algorithms; they differ in the measure used to choose a split. For example, ID3 uses information gain, while C4.5 uses the gain ratio.
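  • The split measures named above can be illustrated with a minimal sketch. The function names and the toy labels are assumptions made for illustration; the Gini measure is included because it is the measure commonly associated with CART, not because the text specifies it here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(labels, groups):
    """ID3 measure: parent entropy minus the weighted entropy of the child groups."""
    total = len(labels)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups)


def gain_ratio(labels, groups):
    """C4.5 measure: information gain divided by the split information."""
    total = len(labels)
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups if g)
    return information_gain(labels, groups) / split_info if split_info else 0.0


def gini_index(labels, groups):
    """Weighted Gini impurity of the child groups (lower is better)."""
    total = len(labels)
    return sum(len(g) / total * gini(g) for g in groups)


# Toy example: a split that separates the two classes perfectly.
labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
```

  • A greedy tree builder evaluates one of these measures for every candidate split and keeps the best one, which is why all three algorithms are described as greedy.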
  • The regression prediction model created in this way reflects well the fasting blood glucose values and 2h postprandial blood glucose values of sample users with different blood pressure, sebum thickness, insulin, body mass index (BMI), diabetes genetic information, age, diagnosis results, and so on.
  • The target user is a user whose diabetes condition needs to be predicted; the first physical examination index value corresponds to the test result for the target user's fasting blood glucose; the second physical examination index value corresponds to the test result for the target user's blood glucose at a preset time after a meal; the preset time can be determined according to actual needs.
  • The target user's characteristics are matched against those of the sample users to find the fasting blood glucose value and 2h postprandial blood glucose value corresponding to the characteristics of the matched sample user.
  • Whether the target user's fasting blood glucose is normal can be judged from the first physical examination index value, and whether the target user's blood glucose at the preset time after a meal is normal can be judged from the second physical examination index value.
  • When the first physical examination index value and/or the second physical examination index value is abnormal, it can be judged that the user has diabetes, and the degree of the disease can be further judged by comparison with threshold values.
  • A numerical regression prediction model can be created based on the user characteristics in the sample user data, and the regression prediction model can be used to determine the first physical examination index value for the target user's fasting blood glucose and the second physical examination index value for the blood glucose at a preset duration after a meal. The first and/or second physical examination index value then determines whether the target user is ill and how severe the disease is, so that the diagnosis is more accurate and more complete, and timely, effective supporting treatment can be given according to the stage of development of the diabetes to curb the progress of the disease.
  • the method includes:
  • Obtain sample user data from original health files and electronic medical record data. For example, a total of about 100 sample user records with complete user characteristics are obtained from the original health files and electronic medical record data, and the sample user data is then further analyzed and processed.
  • Two prediction methods are described below. One uses the fasting blood glucose value as the physical examination index value for prediction (the process shown in steps 202a to 205a); the other uses the blood glucose value two hours after a meal as the physical examination index value for prediction (the process shown in steps 202b to 205b).
  • Use the fasting blood glucose value in the user characteristics of the sample user as label information Y1, and use the target feature data of the sample user other than the fasting blood glucose value and the two-hour postprandial blood glucose value as the feature information X1, to create a first model training set.
  • The user characteristics are extracted from the sample user data using regular expressions, and the target characteristic data includes at least one or more of the sample user's medical history data, hospitalization data, medical treatment data, physical examination data, and health notification data.
  • The created first model training set contains each feature information X1 and its corresponding label information Y1, that is, the fasting blood glucose values corresponding to sample users with different medical history records, hospitalization records, medication status, physical examination status, and health notifications.
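  • As an illustration of building such a training set, the following minimal sketch extracts numeric fields from a text record with a regular expression and separates the label from the remaining features. The record layout, the field names, and the helper names are hypothetical; the source does not specify the format of the health files.

```python
import re

# Hypothetical record layout: "name: value" pairs separated by semicolons.
FIELD_PATTERN = re.compile(r"([A-Za-z0-9 ]+?):\s*([0-9]+(?:\.[0-9]+)?)")

FASTING = "fasting glucose"               # assumed field name
POSTPRANDIAL = "2h postprandial glucose"  # assumed field name


def parse_record(text):
    """Extract every numeric 'name: value' pair from one text record."""
    return {name.strip(): float(value) for name, value in FIELD_PATTERN.findall(text)}


def build_training_set(records, label_key):
    """Return (features, labels); both glucose values are excluded from the features."""
    features, labels = [], []
    for text in records:
        fields = parse_record(text)
        if label_key not in fields:
            continue  # a record without the label cannot be used for training
        labels.append(fields[label_key])
        features.append(
            {k: v for k, v in fields.items() if k not in (FASTING, POSTPRANDIAL)}
        )
    return features, labels


records = [
    "age: 52; BMI: 27.4; fasting glucose: 7.8 mmol/L; 2h postprandial glucose: 12.1 mmol/L"
]
X1, Y1 = build_training_set(records, FASTING)
```

  • Calling `build_training_set(records, POSTPRANDIAL)` on the same records would give the analogous set with the two-hour postprandial value as the label.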
  • the preset regression prediction algorithm is obtained by the fusion of four algorithms: Random Forest (Random Forest), Gradient Boosting Decision Tree (GBDT), Xgboost, and LightGBM.
  • The first recognition model is evaluated with the mean absolute percentage error (MAPE) index. When the MAPE index value corresponding to the first recognition model is less than the preset standard comparison threshold, it is determined that the first recognition model meets the evaluation standard.
  • The MAPE indicator is used to evaluate the error between the predicted value of the model and the true value.
  • Common evaluation indicators for regression models are MAE, MSE, RMSE and MAPE. MAE, MSE, and RMSE consider only the magnitude of the error, whereas MAPE also considers the ratio between the error and the true value.
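  • The calculation formula is: $\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{X_i - Y_i}{X_i} \right|$
  • N is the total number of samples
  • X_i is the measured value of the i-th sample
  • Y_i is the simulated value of the i-th sample.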
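  • The MAPE indicator can be computed with a few lines of code. This is a minimal sketch that assumes the standard definition of MAPE (mean of the absolute errors divided by the measured values, expressed as a percentage); the function name is chosen for illustration.

```python
def mape(measured, simulated):
    """Mean absolute percentage error between measured and simulated values."""
    if not measured or len(measured) != len(simulated):
        raise ValueError("both sequences must be non-empty and of equal length")
    # Measured values must be non-zero, otherwise the ratio is undefined.
    ratios = (abs((x - y) / x) for x, y in zip(measured, simulated))
    return 100.0 / len(measured) * sum(ratios)


# Two samples, each predicted with a 10% relative error.
example = mape([10.0, 20.0], [9.0, 22.0])
```

  • Because each error is divided by the measured value, the same absolute error counts for more on a low glucose reading than on a high one.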
  • the standard comparison threshold can be set according to the actual situation. When the MAPE is less than the standard comparison threshold, it means that the first recognition model meets the evaluation standard. Prediction through the recognition model that meets the evaluation criteria can ensure the accuracy of the prediction results.
  • the first mapping relationship between the feature information X1 and the label information Y1 can be determined by the first recognition model that meets the evaluation standard.
  • the process may specifically include:
  • (1) Use random sampling to obtain the first, second, third, and fourth training sample sets from the first model training set. For example, n training samples are drawn at random from the first model training set in each of four rounds, giving four training sets (the four training sets are independent of each other, and elements may be repeated);
  • the LightGBM algorithm is used to train the fourth classifier, where each training sample set contains different feature information X1 and the corresponding label information Y1. Each of these four classifiers can be trained with its corresponding model training algorithm, and each resulting classifier can be used on its own to predict a user's diabetes: the feature data of the user to be tested (whose content corresponds to the feature information X1) is input, and the classifier finds the corresponding label information Y1.
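  • The random sampling in step (1) can be sketched as drawing with replacement, which matches the statement that elements may be repeated. The function name, the seed parameter, and the integer stand-ins for samples are assumptions for illustration.

```python
import random


def draw_training_sets(model_training_set, n, rounds=4, seed=0):
    """Draw `rounds` independent training sets of n samples each, with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [rng.choices(model_training_set, k=n) for _ in range(rounds)]


# Ten placeholder samples stand in for (X1, Y1) pairs.
training_sets = draw_training_sets(list(range(10)), n=5)
```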
  • (2) The n training sets are used to train n decision tree models separately (each can be constructed with the ID3 algorithm, the C4.5 algorithm, the CART algorithm or other existing algorithms); (3) for a single decision tree model, assuming that the number of training sample features is n, each split selects the best feature according to the information gain, the information gain ratio, or the Gini index; (4) each tree keeps splitting in this way until all training examples at a node belong to the same category, and the decision tree does not need to be pruned during splitting; (5) the generated decision trees form a random forest. For regression problems, the final prediction result is the average of the values predicted by the individual trees, and this average is the prediction result of the first classifier.
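  • The averaging step for regression can be sketched as follows. The trees are represented by plain callables, which is an assumption made to keep the example self-contained; real decision tree models would take their place.

```python
from statistics import mean


def forest_predict(trees, features):
    """Random forest regression output: the average of the individual tree predictions."""
    return mean(tree(features) for tree in trees)


# Three stand-in "trees" that each return a fixed fasting glucose estimate (mmol/L).
stub_trees = [lambda f: 6.0, lambda f: 7.0, lambda f: 8.0]
prediction = forest_predict(stub_trees, {"age": 52})
```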
  • c is the set constant.
  • I is the set of training samples of all leaf node regions Rtj.
  • the second classifier is obtained by training.
  • k represents the number of trees
  • F represents each tree structure constructed
  • x_i represents the i-th sample
  • the sum of the score values of x_i on each tree is the predicted value of x_i, and ŷ_i is the predicted value.
  • y_i is the actual value of the sample corresponding to x_i.
  • I_j represents all samples included in the j-th leaf
  • w_j represents the weight of the j-th leaf
  • T corresponds to the number of leaves.
  • A) Use the existing LightGBM algorithm to fit the data in the fourth training sample set, and test the model obtained after each fitting on a test set selected from the fourth training sample set to obtain the corresponding coefficient of determination and mean square error; B) when the coefficient of determination is greater than a certain threshold and the mean square error is less than a certain threshold, determine that the fitted model meets the standard, and take the model that meets the standard as the fourth classifier.
  • The specific integration method is voting, that is, the majority principle is adopted and the minority defers to the majority.
  • For example, if the fasting blood glucose values predicted by three of the four classifiers meet the criteria for diabetes, it can be determined that the user to be tested has diabetes; if the fasting blood glucose value predicted by only one classifier meets the criteria for diabetes and the values predicted by the other three classifiers do not, it can be determined that the user to be tested does not have diabetes.
  • If the MAPE index value of the first recognition model is not less than the preset standard comparison threshold, the first model training set can be re-divided to obtain a new first training sample set, second training sample set, third training sample set, and fourth training sample set. The new first training sample set is then used to continue training the first classifier, the new second training sample set to continue training the second classifier, the new third training sample set to continue training the third classifier, and the new fourth training sample set to continue training the fourth classifier. The four newly trained classifiers are fused, and it is determined whether the MAPE index value of the new first recognition model is less than the preset standard comparison threshold. If it is still greater than the preset standard comparison threshold, the process of re-dividing the model training set and retraining the classifiers is repeated until the MAPE index value of the most recently obtained first recognition model is less than the preset standard comparison threshold, that is, until it meets the evaluation standard.
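  • The retraining loop can be sketched as below. The callback `train_and_evaluate` is hypothetical: it stands for one round of re-dividing the training set, retraining the four classifiers, fusing them, and computing the MAPE of the fused model. The `max_rounds` limit is an added safeguard, not part of the described method.

```python
def train_until_acceptable(train_and_evaluate, mape_threshold, max_rounds=20):
    """Repeat resampling and retraining until the fused model's MAPE is below the threshold.

    train_and_evaluate(round_index) must return a (model, mape_value) pair.
    """
    for round_index in range(max_rounds):
        model, mape_value = train_and_evaluate(round_index)
        if mape_value < mape_threshold:
            return model, mape_value  # the model meets the evaluation standard
    raise RuntimeError("no model met the evaluation standard within max_rounds")
```

  • In practice the callback would wrap the sampling, training, and fusion steps described above.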
  • Step 204a: input the characteristic information of the target user into the first recognition model to perform similarity matching with the feature information X1.
  • the characteristic information of the target user corresponds to the target characteristic data of the target user except the fasting blood glucose value and the blood glucose value two hours after a meal.
  • step 204a may specifically include: subjecting the feature information of the target user to data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information of the structured data; combining the feature information of the structured data with The feature information X1 performs similarity matching.
  • The characteristic information of the target user sometimes contains useless data, missing values, and/or outliers; that is, it is unstructured data that is not suitable for direct prediction with the first recognition model. The characteristic information of the target user can therefore first be cleaned to remove useless data (for example, removing the user's current place of residence and place of household registration, and keeping only medical history data, hospitalization data, medical treatment data, physical examination data, health notification data, etc.). Feature extraction is then performed on the retained data (for example, extracting medical history data, hospitalization data, medical treatment data, physical examination data, health notification data, etc.). If there are missing values in the extracted feature data, they can be filled with the value 0 (for example, a missing height or weight in the user's physical examination data can be filled with 0), so that the data remain comparable in the subsequent matching with the feature information X1 in the model and unmatched errors during feature matching are avoided. If there are abnormal values in the extracted feature data, they can be corrected with reference to the actual situation (for example,
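  • The cleaning and missing-value steps can be sketched as follows. All field names are hypothetical. Outlier correction is left out because the text does not state how abnormal values are to be corrected.

```python
# Hypothetical field names; the source only names the kinds of data involved.
USELESS_FIELDS = {"residence", "household_registration"}
NUMERIC_FIELDS = ("age", "height_cm", "weight_kg")


def preprocess(raw):
    """Drop useless fields and fill missing numeric values with 0."""
    cleaned = {k: v for k, v in raw.items() if k not in USELESS_FIELDS}
    for field in NUMERIC_FIELDS:
        if cleaned.get(field) is None:
            cleaned[field] = 0  # fill with 0 so every record has comparable fields
    return cleaned


structured = preprocess({"residence": "Shenzhen", "age": 52, "height_cm": None})
```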
  • The preset threshold can be set in advance according to actual needs: the larger the preset threshold, the higher the matching accuracy required of the corresponding feature, and a similarity of 100% means that the feature matches completely. For example, inputting the target user's medical history data, hospitalization data, medical treatment data, physical examination data, and health notification data into the first recognition model is equivalent to inputting these data into the four classifiers of step 203a, where they are matched by similarity against the feature information in each classifier. The feature information that is most similar and whose similarity is greater than a certain threshold is found, and the corresponding physical examination index values, namely the fasting blood glucose values of the target user, are then obtained from the four classifiers.
  • If three of the four fasting blood glucose values meet the criteria for diabetes, the target user can be determined to have diabetes, and the average of these three fasting blood glucose values is taken as the first physical examination index value calculated by the first recognition model. If two of the four fasting blood glucose values do not meet the criteria for diabetes and the other two do, the average of the four fasting blood glucose values is taken as the first physical examination index value calculated by the first recognition model, and whether the target user has diabetes is determined from this average.
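  • The fusion of the four predicted values can be sketched as below. The cases of three positive votes and of a two-two split follow the text; treating four positive votes like three, and returning the average of the non-qualifying values when at most one classifier votes positive, are assumptions. "Meets the criteria" is taken here to mean greater than or equal to the threshold.

```python
from statistics import mean


def fuse_predictions(values, threshold):
    """Fuse four predicted glucose values into (index_value, has_diabetes)."""
    positive = [v for v in values if v >= threshold]
    if len(positive) >= 3:
        # Majority meets the criteria: average the qualifying values.
        return mean(positive), True
    if len(positive) == 2:
        # Tie: average all four values and decide on the average.
        fused = mean(values)
        return fused, fused >= threshold
    # At most one classifier meets the criteria: no diabetes (averaging rule assumed).
    negative = [v for v in values if v < threshold]
    return mean(negative), False


majority_value, majority_flag = fuse_predictions([7.2, 7.8, 8.1, 6.5], 7.0)
tie_value, tie_flag = fuse_predictions([6.0, 6.4, 7.4, 7.8], 7.0)
```

  • The same function applies to the two-hour postprandial values by passing the corresponding threshold.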
  • Step 202b, which is parallel to step 202a: use the two-hour postprandial blood glucose value in the user characteristics of the sample user as the label information Y2, and use the target feature data of the sample user other than the fasting blood glucose value and the two-hour postprandial blood glucose value as the feature information X2, to create a second model training set.
  • Step 202b is similar to step 202a.
  • The target characteristic data of the sample user includes at least one or more of the sample user's medical history data, hospitalization data, medical treatment data, physical examination data, and health notification data.
  • The created second model training set includes each feature information X2 and its corresponding label information Y2, that is, the two-hour postprandial blood glucose values corresponding to sample users with different medical history records, hospitalization records, medication status, physical examination status, and health notifications.
  • Using the second model training set, train a second recognition model for judging the second physical examination index value based on the preset regression prediction algorithm.
  • The second recognition model is also evaluated with the MAPE index.
  • When the MAPE index value corresponding to the second recognition model is less than the preset standard comparison threshold, it is determined that the second recognition model meets the evaluation standard; the second mapping relationship between the feature information X2 and the label information Y2 can then be determined by the second recognition model that meets the evaluation standard.
  • The specific process of step 203b may include: (1) using random sampling to obtain the fifth, sixth, seventh, and eighth training sample sets from the second model training set; (2) based on the fifth training sample set, using the random forest algorithm to train the fifth classifier; based on the sixth training sample set, using the GBDT algorithm to train the sixth classifier; based on the seventh training sample set, using the Xgboost algorithm to train the seventh classifier; and based on the eighth training sample set, using the LightGBM algorithm to train the eighth classifier; (3) fusing the fifth, sixth, seventh, and eighth classifiers with the bagging method to obtain the second recognition model.
  • The specific fusion method is again voting, that is, the majority principle is adopted and the minority defers to the majority.
  • For example, if the two-hour postprandial blood glucose values predicted by three of the four classifiers meet the criteria for diabetes, it can be determined that the user to be tested has diabetes; if the two-hour postprandial blood glucose value predicted by only one classifier meets the criteria for diabetes and the values predicted by the other three classifiers do not, it can be determined that the user to be tested does not have diabetes.
  • If the MAPE index value of the second recognition model is not less than the preset standard comparison threshold, the second model training set can be re-divided to obtain a new fifth training sample set, sixth training sample set, seventh training sample set, and eighth training sample set. The new fifth training sample set is then used to continue training the fifth classifier, the new sixth training sample set to continue training the sixth classifier, the new seventh training sample set to continue training the seventh classifier, and the new eighth training sample set to continue training the eighth classifier. The four newly trained classifiers are fused, and it is determined whether the MAPE index value of the new second recognition model is less than the preset standard comparison threshold. If it is still greater than the preset standard comparison threshold, the process of re-dividing the model training set and retraining the classifiers is repeated until the MAPE index value of the most recently obtained second recognition model is less than the preset standard comparison threshold, that is, until it meets the evaluation standard.
  • Step 204b: input the feature information of the target user into the second recognition model to perform similarity matching with the feature information X2.
  • the characteristic information of the target user corresponds to the target characteristic data of the target user except the fasting blood glucose value and the blood glucose value two hours after a meal.
  • step 204b may specifically include: subjecting the feature information of the target user to data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information of the structured data; combining the feature information of the structured data with The feature information X2 performs similarity matching.
  • Similar to the optional method in step 204a, the data cleaning, feature extraction, missing value filling, and outlier processing in this optional method ensure that comparable structured data are used when matching against the feature information X2 in the second recognition model, which avoids unmatched errors during feature matching, removes outliers, and improves the accuracy of feature matching.
  • The predetermined threshold can be set in advance according to actual needs: the larger the predetermined threshold, the higher the matching accuracy required of the corresponding feature, and a similarity of 100% means that the feature matches completely. For example, inputting the target user's medical history data, hospitalization data, medical treatment data, physical examination data, and health notification data into the second recognition model is equivalent to inputting these data into the four classifiers of step 203b, where they are matched by similarity against the feature information in each classifier.
  • The feature information that is most similar and whose similarity is greater than a certain threshold is found, and the corresponding physical examination index values, namely the blood glucose values of the target user two hours after a meal, are then obtained from these four classifiers. If three of the four two-hour postprandial blood glucose values meet the criteria for diabetes, the target user can be determined to have diabetes, and the average of these three values is taken as the second physical examination index value calculated by the second recognition model. If two of the four two-hour postprandial blood glucose values do not meet the criteria for diabetes and the other two do, the average of the four values is taken as the second physical examination index value calculated by the second recognition model, and whether the target user has diabetes is determined from this average.
  • step 206 may specifically include: if the first physical examination index value corresponding to the target user is greater than or equal to a first preset threshold, and/or the second physical examination index value is greater than or equal to a second preset threshold, determining the target The user suffers from diabetes; then, the disease degree of the target user is judged by the first numerical interval where the first physical examination index value is located and/or the second numerical interval where the second physical examination index value is located.
  • the first preset threshold value is determined according to the setting standard of fasting blood glucose for judging diabetes, such as 7.0mmol/L
  • the second preset threshold value is determined according to the setting standard of blood glucose two hours after a meal for judging diabetes, such as 11.1mmol/L.
  • For example, if the first physical examination index value of the target user is 8.0 mmol/L and the second physical examination index value is 7.6 mmol/L, the first physical examination index value is greater than the first preset threshold, so it can be determined that the target user has diabetes.
  • If the first physical examination index value of the target user is 5.7 mmol/L and the second physical examination index value is 11.9 mmol/L, the second physical examination index value is greater than the second preset threshold, so it can be determined that the target user has diabetes.
  • If the first physical examination index value of the target user is 8.3 mmol/L and the second physical examination index value is 11.7 mmol/L, the first physical examination index value is greater than the first preset threshold and the second physical examination index value is greater than the second preset threshold, so it can be determined that the target user has diabetes.
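  • The determination rule of step 206 can be sketched with the example thresholds given in the text (7.0 mmol/L for fasting blood glucose and 11.1 mmol/L two hours after a meal). The function name and the handling of a missing value are assumptions.

```python
FASTING_THRESHOLD = 7.0        # mmol/L, first preset threshold from the text
POSTPRANDIAL_THRESHOLD = 11.1  # mmol/L, second preset threshold from the text


def has_diabetes(fasting_value=None, postprandial_value=None):
    """Apply the and/or rule: either index value at or above its threshold is enough."""
    fasting_high = fasting_value is not None and fasting_value >= FASTING_THRESHOLD
    postprandial_high = (
        postprandial_value is not None and postprandial_value >= POSTPRANDIAL_THRESHOLD
    )
    return fasting_high or postprandial_high
```

  • For instance, `has_diabetes(5.7, 11.9)` is true because the postprandial value alone exceeds its threshold.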
  • Judging the degree of disease may specifically include: dividing a plurality of increasing numerical intervals greater than the first preset threshold according to a predetermined numerical rule; creating a third mapping relationship between the plurality of numerical intervals and the degree of diabetes; determining the first numerical interval, among the plurality of numerical intervals, to which the first physical examination index value corresponds; and determining the first diabetes degree of the target user according to the third mapping relationship and the first numerical interval.
  • For example, the multiple numerical intervals greater than the first preset threshold of 7.0 mmol/L and the corresponding diabetes degrees in the third mapping relationship are set as: mild diabetes: 7.0-8.4 mmol/L; moderate diabetes: 8.4-10.1 mmol/L; severe diabetes: greater than 10.11 mmol/L. If the first physical examination index value is determined to be 9.6 mmol/L, the first numerical interval in which it lies is 8.4-11.1 mmol/L, and according to the third mapping relationship and the first numerical interval, the first diabetes degree of the target user is determined to be moderate diabetes.
  • Similarly, the multiple numerical intervals greater than the second preset threshold of 11.1 mmol/L and the corresponding diabetes degrees in the fourth mapping relationship are set as: moderate diabetes: 11.1-16.7 mmol/L; severe diabetes: greater than 16.7 mmol/L (above 16.7 mmol/L, ketoacidosis is prone to occur). If the second physical examination index value is determined to be 12.6 mmol/L, the second numerical interval in which it lies is 11.1-16.7 mmol/L, and according to the fourth mapping relationship and the second numerical interval, the second diabetes degree of the target user is determined to be moderate diabetes.
  • the first weight corresponding to the first recognition model and the second weight corresponding to the second recognition model are obtained respectively; when the first weight is greater than the second weight, the first degree of diabetes is determined as the target user's degree of illness; when the second weight is greater than the first weight, the second degree of diabetes is determined as the target user's degree of illness.
  • the weights corresponding to the two prediction methods can be set according to the accuracy or adoption rate fed back by users. Specifically, the weight values corresponding to different accuracy or adoption rates can be compiled statistically, and the weight of each prediction method can then be looked up through the resulting mapping relationship. In this embodiment, the accuracy or adoption rate fed back by users reflects which prediction method predicts more accurately, and the result of the more accurate method is selected as the final judgment. Alternatively, the weights of the two prediction methods can be preset manually according to the actual situation.
  • a weight of 70% can be configured for the prediction method based on the first physical examination index value, and a weight of 30% for the prediction method based on the second physical examination index value.
  • the two recognition models in this embodiment can be continuously retrained on new sample training sets to achieve higher prediction accuracy.
  • the mapping relationship between feature information and label information can be determined by training on the model training set; the structured data of the target user is matched against the regression prediction model, and the first physical examination index value (fasting blood glucose) and the second physical examination index value (blood glucose two hours after a meal) are then determined through the mapping relationship. These values can be compared with the first preset threshold and the second preset threshold to determine whether the user has diabetes. Starting from the diabetes diagnosis indices, the method not only predicts whether the user is ill but also uses the first numerical interval in which the first physical examination index value lies, and/or the second numerical interval in which the second physical examination index value lies, to determine the degree of the target user's illness, making the diagnosis result more complete.
  • an embodiment of the present application provides a diabetes prediction device.
  • the device includes: an acquisition unit 31, a creation unit 32, a judgment unit 33, and a determination unit 34.
  • the acquisition unit 31 can be used to obtain sample user data from the original health file and electronic medical record data; the creation unit 32 can be used to create a numerical regression prediction model based on user characteristics in the sample user data; the judgment unit 33 can be used to determine, with the regression prediction model, the first physical examination index value (fasting blood glucose) of the target user and the second physical examination index value (blood glucose a preset duration after a meal); the determination unit 34 can be used to determine the target user's degree of illness according to the first physical examination index value and/or the second physical examination index value.
  • the creating unit 32 may specifically include: a creating module 321, a training module 322, and a determining module 323.
  • the creation module 321 can be specifically configured to take the fasting blood glucose value in the user characteristics as the label information Y1, and the target feature data of the sample user other than the fasting blood glucose value and the blood glucose value two hours after a meal as the feature information X1, to create a first model training set, where the target feature data includes at least one of the sample user's medical history data, hospitalization data, medical treatment and medication data, physical examination data, and health notification data; the training module 322 can be specifically used to train, through the first model training set and based on a preset regression prediction algorithm, a first recognition model for judging the first physical examination index value, wherein the preset regression prediction algorithm is obtained by fusing four algorithms: random forest, gradient boosting decision tree (GBDT), Xgboost, and LightGBM, and the first recognition model is evaluated with the mean absolute percentage error (MAPE) index.
  • the determining module 323 can be specifically used to determine the first mapping relationship between the feature information X1 and the label information Y1 through the first recognition model that meets the evaluation standard; the creation module 321 can also be specifically used to take the blood glucose value two hours after a meal in the user characteristics as label information Y2, and the target feature data of the sample user as feature information X2, to create a second model training set; the training module 322 can also be specifically used to train, through the second model training set and based on the preset regression prediction algorithm, a second recognition model for judging the second physical examination index value, wherein the second recognition model is evaluated with the MAPE index, and when the MAPE index value corresponding to the second recognition model is less than the predetermined standard comparison threshold, the second recognition model is determined to meet the evaluation standard; the determining module
  • the judgment unit 33 may specifically include: a matching module 331 and a determination module 332.
  • the matching module 331 may be specifically configured to input the feature information of the target user into the first recognition model to perform similarity matching with the feature information X1, where the feature information of the target user corresponds to the target user's target feature data other than the fasting blood glucose value and the blood glucose value two hours after a meal; the determination module 332 can be specifically used to determine the first physical examination index value corresponding to the target user from the feature information X1 that has the highest similarity, with that similarity greater than the preset threshold, together with the first mapping relationship; the matching module 331 may also be specifically used to input the feature information of the target user into the second recognition model to perform similarity matching with the feature information X2; the determination module 332 can also be specifically configured to determine the second physical examination index value corresponding to the target user from the feature information X2 that has the highest similarity, with that similarity greater than the preset threshold, together with the second mapping relationship.
  • the determining unit 34 may specifically include: a determining module 341 and a judging module 342.
  • the determining module 341 can be used to determine that the target user has diabetes if the first physical examination index value corresponding to the target user is greater than or equal to the first preset threshold, and/or the second physical examination index value is greater than or equal to the second preset threshold; the judging module 342 can be used to judge the degree of illness of the target user through the first numerical interval in which the first physical examination index value lies and/or the second numerical interval in which the second physical examination index value lies.
  • the judging module 342 is specifically used to divide multiple numerical intervals that are greater than the first preset threshold and increase according to a predetermined numerical law; create the third mapping relationship between the numerical intervals and the degree of diabetes; determine the first numerical interval, among the multiple numerical intervals, to which the first physical examination index value corresponds; and judge the first degree of diabetes of the target user according to the third mapping relationship and the first numerical interval.
  • the judging module 342 is specifically further configured to: if the first degree of diabetes and the second degree of diabetes differ, obtain the first weight corresponding to the first recognition model and the second weight corresponding to the second recognition model according to the accuracy or adoption rate fed back by users for the prediction methods of the first recognition model and the second recognition model; when the first weight is greater than the second weight, determine the first degree of diabetes as the target user's degree of illness; when the second weight is greater than the first weight, determine the second degree of diabetes as the target user's degree of illness.
  • the matching module 331 can be specifically used to process the feature information of the target user through data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information of structured data;
  • the feature information is matched with the feature information X1 for similarity;
  • the matching module 331 can be specifically used to process the feature information of the target user through data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information of structured data ;
  • the training module 322 can be specifically used to obtain a first training sample set, a second training sample set, a third training sample set, and a fourth training sample set from the first model training set by random sampling; train a first classifier on the first training sample set using the random forest algorithm; train a second classifier on the second training sample set using the GBDT algorithm; train a third classifier on the third training sample set using the Xgboost algorithm; train a fourth classifier on the fourth training sample set using the LightGBM algorithm; and fuse the first, second, third, and fourth classifiers using the bagging method to obtain the first recognition model.
  • the training module 322 can also be specifically used to obtain a fifth training sample set, a sixth training sample set, a seventh training sample set, and an eighth training sample set from the second model training set by random sampling; train a fifth classifier on the fifth training sample set using the random forest algorithm; train a sixth classifier on the sixth training sample set using the GBDT algorithm; train a seventh classifier on the seventh training sample set using the Xgboost algorithm; train an eighth classifier on the eighth training sample set using the LightGBM algorithm; and fuse the fifth, sixth, seventh, and eighth classifiers using the bagging method to obtain the second recognition model.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods in each implementation scenario of the present application.
  • an embodiment of the present application also provides a computer device, which may be a personal computer, a server, a network device, etc.
  • the physical device includes a storage medium and a processor; the storage medium is used to store computer-readable instructions; the processor is used to execute the computer-readable instructions to implement the above-mentioned diabetes prediction method shown in FIGS. 1 and 2.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on.
  • the user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a USB interface, a card reader interface, and the like.
  • the network interface can optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface), etc.
  • the non-volatile readable storage medium may also include an operating system and a network communication module.
  • the operating system is a program that manages the hardware and software resources of the physical equipment for the prediction of diabetes, and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to implement communication between various components in the non-volatile readable storage medium and communication with other hardware and software in the physical device.
  • this application can further determine the severity of the disease once the target user is found to have diabetes, so that the diagnosis result is more complete, the development of the target user's condition can be tracked in a timely manner, and the corresponding supporting treatment can be carried out.
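The weight-based choice between the two prediction methods described in the list above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `resolve_degree` and the string labels are hypothetical, and the behaviour for equal weights is an assumption, since the text does not define it.

```python
def resolve_degree(first_degree, second_degree, first_weight, second_weight):
    """Pick the final degree of illness when the two prediction methods disagree."""
    if first_degree == second_degree:
        return first_degree
    if first_weight > second_weight:
        return first_degree
    if second_weight > first_weight:
        return second_degree
    # Assumption of this sketch: the source does not say what to do
    # when both weights are equal, so the case is rejected explicitly.
    raise ValueError("equal weights: no tie-break is defined in the source")


# Example weights of 70% and 30%, as in the embodiment above.
final_degree = resolve_degree("moderate", "severe", first_weight=0.7, second_weight=0.3)
```

With the example weights of 70% and 30%, the result of the fasting-glucose method wins whenever the two methods disagree.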

Abstract

The present application discloses a diabetes prediction method and apparatus, a storage medium and a computer device, relating to the field of computer technology, being able to effectively solve the problem in the prior art that it can only be determined whether a user suffers from diabetes, but the severity of the disease thereof cannot be determined. Said method comprises: acquiring sample user data from original health profile and electronic medical record data; creating, according to user features in the sample user data, a regression prediction model of a numerical type; determining, using the regression prediction model, a first physical examination index value of fasting blood glucose of a target user and a second physical examination index value of blood glucose within a pre-set time period after a meal; and determining, according to the first physical examination index value and/or the second physical examination index value, the disease severity of the target user. The present application is applicable to the prediction of diabetes, and to the determination of the severity of diabetes.

Description

Diabetes prediction method and device, storage medium, and computer equipment
This application claims priority to Chinese patent application No. 2019101850792, filed with the Chinese Patent Office on March 12, 2019 and entitled "Diabetes prediction method and device, storage medium, and computer equipment", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of computer technology, in particular to a method and device for predicting diabetes, a storage medium, and computer equipment.
Background
Diabetes is a group of metabolic diseases characterized by high blood sugar. It can damage large and small blood vessels and endanger the heart, brain, kidneys, peripheral nerves, eyes, feet, and other parts of the body, and it is often accompanied by multiple complications, so strengthening the prediction of diabetes is clearly necessary. With the advancement of technology, diagnosis is no longer limited to a doctor's analysis, and using artificial intelligence to predict diabetes is in line with current trends.
The inventors found that a common method for diabetes prediction in the industry is to collect diabetes medical records, compare data from diabetic patients with data from healthy people, build a 0-1 classification model, and judge from the patient's feature data across various dimensions whether the user has diabetes. However, existing diabetes prediction methods can only judge whether a patient has diabetes, not how severe the disease is. The diagnosis result is therefore incomplete, supporting control and treatment cannot be matched to the degree of the disease, and the patient's condition may worsen as a result.
Summary of the invention
In view of this, the present application provides a diabetes prediction method and device, a storage medium, and computer equipment. The main purpose is to solve the problem that, when a constructed 0-1 classification model is used to predict diabetes, it can only be judged whether the user has diabetes, while the severity of the disease cannot be judged, which leaves the diagnosis result incomplete.
According to one aspect of the present application, a method for predicting diabetes is provided. The method includes: obtaining sample user data from original health files and electronic medical record data; creating a numerical regression prediction model based on user characteristics in the sample user data; using the regression prediction model to determine a first physical examination index value (fasting blood glucose) of a target user and a second physical examination index value (blood glucose a preset duration after a meal); and determining the degree of illness of the target user according to the first physical examination index value and/or the second physical examination index value.
According to another aspect of the present application, a device for predicting diabetes is provided, the device comprising:
an acquisition unit, used to obtain sample user data from the original health files and electronic medical record data;
a creation unit, used to create a numerical regression prediction model based on the user characteristics in the sample user data;
a judgment unit, used to determine, with the regression prediction model, the first physical examination index value (fasting blood glucose) of the target user and the second physical examination index value (blood glucose a preset duration after a meal);
a determination unit, used to determine the degree of illness of the target user according to the first physical examination index value and/or the second physical examination index value.
According to another aspect of the present application, a non-volatile readable storage medium is provided, on which computer-readable instructions are stored; when executed by a processor, the computer-readable instructions implement the above diabetes prediction method. According to yet another aspect of the present application, a computer device is provided, including a non-volatile readable storage medium, a processor, and computer-readable instructions stored on the non-volatile readable storage medium and executable on the processor; when the processor executes the computer-readable instructions, the above diabetes prediction method is implemented.
With the above technical solutions, compared with the current approach of predicting diabetes with a constructed 0-1 classification model, the diabetes prediction method and device, storage medium, and computer equipment provided by this application add regression prediction models for fasting blood glucose and 2-hour postprandial blood glucose on top of the existing diabetes prediction model. The regression prediction model can be used to determine the first physical examination index value (fasting blood glucose) of the target user and the second physical examination index value (blood glucose a preset duration after a meal). These index values can then be used to determine whether the target user has diabetes and, further, the degree of the disease.
The above description is only an overview of the technical solution of this application. So that the technical means of this application can be understood more clearly and implemented according to the content of the specification, and so that the above and other purposes, features, and advantages of this application are easier to understand, specific implementations of this application are set out below.
Description of the drawings
The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of the present application. In the drawings:
Fig. 1 shows a schematic flow chart of a diabetes prediction method provided by an embodiment of the present application; Fig. 2 shows a schematic flow chart of another diabetes prediction method provided by an embodiment of the present application; Fig. 3 shows a schematic structural diagram of a diabetes prediction device provided by an embodiment of the present application; Fig. 4 shows a schematic structural diagram of another diabetes prediction device provided by an embodiment of the present application.
Detailed description
Hereinafter, the present application is described in detail with reference to embodiments and in conjunction with the drawings. It should be noted that the embodiments in this application and the features in the embodiments can be combined with each other where there is no conflict. To address the problem that the severity of diabetes cannot be judged from user data when a constructed 0-1 classification model is used to predict diabetes, this embodiment provides a diabetes prediction method. As shown in Fig. 1, the method includes:
101. Obtain sample user data from original health files and electronic medical record data. The sample user data may include patient visit data, physical examination index data, medication data, health notification data, and the like.
102. Create a numerical regression prediction model based on user characteristics in the sample user data. The user characteristics can include multiple types of feature dimension data, such as fasting blood glucose and 2-hour postprandial blood glucose, blood pressure, skinfold thickness, insulin, body mass index (BMI), diabetes genetic information, age, and diagnosis results. In specific implementations, the regression prediction model can be built by fusing several different decision-tree-based framework models, that is, by using the idea of ensemble learning to combine multiple decision-tree-based prediction models and so improve the accuracy of the prediction results. Decision trees are among the simpler supervised learning algorithms in machine learning. A decision tree is a predictive model that represents a mapping between object attributes and object values. Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf. A decision tree has only a single output; if multiple outputs are needed, independent decision trees can be built to handle the different outputs. Decision tree algorithms include ID3, C4.5, and CART. All of them are greedy algorithms; they differ in the splitting measure they use. For example, ID3 uses information gain, while C4.5 uses the gain ratio. The resulting regression prediction model reflects the fasting blood glucose values and 2-hour postprandial blood glucose values that correspond to sample users with different blood pressure, skinfold thickness, insulin, BMI, diabetes genetic information, age, diagnosis results, and so on.
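The ensemble idea in step 102 can be illustrated with a small, standard-library-only sketch. It is an assumption-laden toy: the two base learners (`fit_mean_model`, `fit_nearest_model`) merely stand in for tree-based regressors, the sample values are invented, and averaging over bootstrap resamples is one common way to fuse models, not necessarily the fusion the application uses.

```python
import random
from statistics import mean

# Hypothetical samples: ((age, BMI), fasting blood glucose in mmol/L).
SAMPLES = [((45, 24.0), 5.6), ((60, 31.0), 8.9), ((52, 27.5), 7.2), ((38, 22.0), 5.1)]


def bootstrap(samples, rng):
    """Draw len(samples) items with replacement (one bagging resample)."""
    return [rng.choice(samples) for _ in samples]


def fit_mean_model(samples):
    """Toy base learner: always predicts the mean label of its resample."""
    label_mean = mean([label for _, label in samples])
    return lambda features: label_mean


def fit_nearest_model(samples):
    """Toy base learner: predicts the label of the closest training point."""
    def predict(features):
        def squared_distance(sample):
            return sum((a - b) ** 2 for a, b in zip(sample[0], features))
        return min(samples, key=squared_distance)[1]
    return predict


def fit_ensemble(samples, fitters, seed=0):
    """Train each base learner on its own resample and average their outputs."""
    rng = random.Random(seed)
    models = [fit(bootstrap(samples, rng)) for fit in fitters]
    return lambda features: mean([model(features) for model in models])


predict = fit_ensemble(SAMPLES, [fit_mean_model, fit_nearest_model])
estimate = predict((50, 26.0))
```

Because every base learner returns either a training label or a mean of training labels, the averaged estimate always stays within the range of observed labels.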
103. Use the regression prediction model to determine the first physical examination index value (fasting blood glucose) of the target user and the second physical examination index value (blood glucose a preset duration after a meal). The target user is the user whose diabetes condition is to be predicted; the first physical examination index value corresponds to the test result for the target user's fasting blood glucose; the second physical examination index value corresponds to the test result for the target user's blood glucose a preset duration after a meal; the preset duration can be set according to actual needs. In this embodiment, based on the fasting blood glucose values and 2-hour postprandial blood glucose values of sample users with different characteristics, the target user's characteristics are matched against those of the sample users to find the fasting and 2-hour postprandial blood glucose values corresponding to the matching sample user characteristics.
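The matching in step 103 can be sketched as a nearest-match lookup. Everything concrete here is an assumption made for illustration: the similarity measure (1 / (1 + Euclidean distance)), the feature tuples, the sample values, and the `min_similarity` cut-off are not specified by the application in this form.

```python
import math

# Hypothetical samples: ((age, BMI), fasting blood glucose in mmol/L).
SAMPLES = [((45, 24.0), 5.6), ((60, 31.0), 8.9)]


def similarity(a, b):
    """Map Euclidean distance to a score in (0, 1]; 1 means identical features."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def predict_by_matching(target, samples, min_similarity):
    """Return the label of the most similar sample, or None if none is similar enough."""
    features, label = max(samples, key=lambda sample: similarity(target, sample[0]))
    if similarity(target, features) > min_similarity:
        return label
    return None


fasting_estimate = predict_by_matching((58, 30.0), SAMPLES, min_similarity=0.2)
```

Returning `None` when no sample exceeds the cut-off keeps a poor match from being reported as a prediction; a real system would also need to scale the features before comparing them.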
104. Determine the degree of illness of the target user according to the first physical examination index value and/or the second physical examination index value.
In a specific application scenario, the first physical examination index value can be used to judge whether the target user's fasting blood glucose is normal, and the second physical examination index value can be used to judge whether the target user's blood glucose a preset duration after a meal is normal. When the first and/or second physical examination index value is abnormal, the user can be judged to have diabetes, and the degree of illness can be further judged by comparison with critical values. With the diabetes prediction method of this embodiment, a numerical regression prediction model can be created from the user characteristics in the sample user data; the model is used to determine the target user's first physical examination index value (fasting blood glucose) and second physical examination index value (blood glucose a preset duration after a meal); and from these values it is determined whether the target user is ill and how severe the illness is. This makes the diagnosis result more precise and its content more complete, and allows timely and effective supporting treatment matched to the stage of the diabetes, which helps curb the progression of the disease.
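Step 104 can be sketched as a threshold check followed by an interval lookup. The numbers below are the example thresholds and intervals given for the embodiment elsewhere in this document (7.0 and 11.1 mmol/L, with cut points at 8.4 and 16.7 mmol/L); the function names and the choice to treat each lower bound as inclusive are assumptions of this sketch.

```python
FASTING_THRESHOLD = 7.0        # mmol/L, example first preset threshold
POSTPRANDIAL_THRESHOLD = 11.1  # mmol/L, example second preset threshold

# (lower bound in mmol/L, degree), in increasing order.
FASTING_GRADES = [(7.0, "mild"), (8.4, "moderate"), (11.1, "severe")]
POSTPRANDIAL_GRADES = [(11.1, "moderate"), (16.7, "severe")]


def has_diabetes(fasting=None, postprandial=None):
    """True if either available index value reaches its threshold."""
    fasting_high = fasting is not None and fasting >= FASTING_THRESHOLD
    postprandial_high = postprandial is not None and postprandial >= POSTPRANDIAL_THRESHOLD
    return fasting_high or postprandial_high


def grade(value, grades):
    """Return the degree of the highest interval whose lower bound is reached."""
    degree = None
    for lower_bound, name in grades:
        if value >= lower_bound:
            degree = name
    return degree


first_degree = grade(9.6, FASTING_GRADES)
second_degree = grade(12.6, POSTPRANDIAL_GRADES)
```

A value below the first lower bound yields `None`, which corresponds to "no diabetes indicated by this index" rather than to a degree of illness.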
Further, as a refinement and extension of the specific implementation of the above embodiment, and to fully explain the implementation process of the embodiments of the present application, another diabetes prediction method is provided. As shown in Fig. 2, the method includes:
201. Obtain sample user data from original health files and electronic medical record data. For example, about 100 items of sample user data with complete user characteristics are obtained from the original health files and electronic medical record data, and the sample user data is then analyzed and processed further.
Two prediction methods are described below. One predicts using the fasting blood glucose value as the physical examination index (steps 202a to 205a); the other predicts using the blood glucose value two hours after a meal as the physical examination index (steps 202b to 205b).
202a. Take the fasting blood glucose value in the sample user's user characteristics as label information Y1, and take the sample user's target feature data other than the fasting blood glucose value and the blood glucose value two hours after a meal as feature information X1, to create a first model training set. The user characteristics are extracted from the sample user data with regular expressions. The target feature data includes at least one of the sample user's medical history data, hospitalization data, medical treatment and medication data, physical examination data, and health notification data; for example, it may include the user's medical history records, hospitalization records, medication, physical examination results, health notifications, and other related information. The resulting first model training set contains each item of feature information X1 together with its corresponding label information Y1, that is, the fasting blood glucose values corresponding to sample users with different medical history records, hospitalization records, medication, physical examination results, health notifications, and so on.
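The construction of the first model training set in step 202a can be sketched as follows. The record layout, the field names, and the regular expression are hypothetical; real health files and medical records would need their own extraction rules.

```python
import re

# Hypothetical free-text records; the real data layout is not given in the source.
RECORDS = [
    "age: 52; bmi: 27.5; fasting glucose: 7.8 mmol/L; 2h glucose: 12.4 mmol/L",
    "age: 34; bmi: 22.1; fasting glucose: 5.1 mmol/L; 2h glucose: 6.9 mmol/L",
]

# Matches "<field name>: <number>" pairs.
FIELD = re.compile(r"([a-z0-9 ]+?):\s*([0-9.]+)")

LABEL = "fasting glucose"
EXCLUDED = {"fasting glucose", "2h glucose"}  # neither glucose value may be a feature


def parse(record):
    """Extract every numeric field of one record into a dict."""
    return {name.strip(): float(value) for name, value in FIELD.findall(record)}


def build_training_set(records):
    """Split each record into feature information X1 and label information Y1."""
    features, labels = [], []
    for record in records:
        fields = parse(record)
        labels.append(fields[LABEL])
        features.append({k: v for k, v in fields.items() if k not in EXCLUDED})
    return features, labels


X1, Y1 = build_training_set(RECORDS)
```

Excluding both glucose values from X1 mirrors the text: the label must not leak into the features. Using "2h glucose" as the label instead would give the second training set (X2, Y2) in the same way.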
203a. Using the first model training set, train a first recognition model for determining the first physical examination index value based on a preset regression prediction algorithm.
The preset regression prediction algorithm is obtained by fusing four algorithms: random forest, gradient boosting decision tree (GBDT), XGBoost, and LightGBM. The first recognition model is evaluated using the mean absolute percentage error (MAPE) index: when the MAPE index value of the first recognition model is less than a preset standard comparison threshold, the first recognition model is determined to meet the evaluation standard. The MAPE index evaluates the error between the model's predicted values and the true values. Common evaluation indices for regression models include MAE, MSE, RMSE, and MAPE; MAE, MSE, and RMSE consider only the magnitude of the error, whereas MAPE also considers the ratio of the error to the true value. It is calculated as:
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{X_i - Y_i}{X_i} \right|$$
In the above formula, N is the total number of samples, X is the measured value, and Y is the simulated (predicted) value. The smaller the MAPE value, the smaller the error between the model's predicted values and the true values. In specific implementations, the standard comparison threshold can be set according to the actual situation; when the MAPE is less than the standard comparison threshold, the first recognition model meets the evaluation standard. Predicting with a recognition model that meets the evaluation standard helps ensure the accuracy of the prediction results. The first recognition model that meets the evaluation standard determines the first mapping relationship between feature information X1 and label information Y1. To illustrate how the first recognition model is trained with the preset regression prediction algorithm obtained by fusing the above four algorithms, as an optional manner, the process may specifically include:
(1) Obtain a first training sample set, a second training sample set, a third training sample set, and a fourth training sample set from the first model training set by random sampling. For example, randomly draw n training samples from the first model training set and repeat the draw for four rounds to obtain four training sets (the four training sets are independent of each other, and elements may be repeated);
(2) Train a first classifier on the first training sample set using the random forest algorithm; train a second classifier on the second training sample set using the GBDT algorithm; train a third classifier on the third training sample set using the XGBoost algorithm; and train a fourth classifier on the fourth training sample set using the LightGBM algorithm. Each training sample set contains different feature information X1 and the corresponding label information Y1. Each of the four classifiers is trained with its corresponding model training algorithm, and each can be used on its own to predict a user's diabetes: the feature data of the user to be tested (corresponding in content to feature information X1) is input, and the classifier finds the corresponding label information Y1.
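The sampling of step (1) might look like the sketch below. It assumes that "elements may be repeated" means sampling with replacement; the function name, the seed, and the toy data are illustrative only.

```python
import random


def draw_training_subsets(training_set, n, rounds=4, seed=0):
    """Draw `rounds` independent subsets of n samples each, sampling with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [rng.choices(training_set, k=n) for _ in range(rounds)]


# Hypothetical (features, label) pairs standing in for the first model training set.
first_model_training_set = [({"inpatient_days": d}, 5.0 + 0.1 * d) for d in range(20)]
subsets = draw_training_subsets(first_model_training_set, n=10)
```

Each of the four subsets would then be handed to one of the four training algorithms.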
The specific training process of the first classifier is as follows: ① from the first training sample set, use the bootstrapping method to select m samples by random sampling with replacement, and perform n rounds of sampling to generate n training sets; ② train n decision tree models on the n training sets respectively (these can be built with existing algorithms such as ID3, C4.5, and CART); ③ for a single decision tree model, assuming the number of training sample features is n, select the best feature at each split according to information gain, information gain ratio, or the Gini index; ④ each tree keeps splitting in this way until all training examples at a node belong to the same class, and no pruning is needed during splitting; ⑤ combine the generated decision trees into a random forest. For a regression problem, the final prediction is the mean of the values predicted by the individual trees, which serves as the prediction result of the first classifier.
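The bootstrap-and-average idea of steps ① to ⑤ can be illustrated with a toy regression forest. The sketch makes several simplifying assumptions that are not in the text: a single numeric feature, a squared-error split criterion (the usual CART choice for regression, in place of information gain or the Gini index), and hypothetical function names.

```python
import random
from statistics import mean


def sse(samples):
    """Sum of squared errors of the targets around their mean."""
    ys = [y for _, y in samples]
    centre = mean(ys)
    return sum((y - centre) ** 2 for y in ys)


def build_tree(samples):
    """Grow an unpruned regression tree on (x, y) pairs with one numeric feature."""
    ys = [y for _, y in samples]
    xs = sorted({x for x, _ in samples})
    if len(xs) == 1 or len(set(ys)) == 1:
        return mean(ys)  # leaf: nothing left to split
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [s for s in samples if s[0] <= threshold]
        right = [s for s in samples if s[0] > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, left, right)
    _, threshold, left, right = best
    return (threshold, build_tree(left), build_tree(right))


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def train_forest(samples, n_trees, m, seed=0):
    """Steps 1-2 and 5: one bootstrap sample of size m per tree, n_trees trees."""
    rng = random.Random(seed)
    return [build_tree(rng.choices(samples, k=m)) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Regression output: the mean of the individual tree predictions."""
    return mean(predict_tree(tree, x) for tree in forest)


# Toy data: fasting glucose jumps from 5.0 to 9.0 at feature value 5.
samples = [(x, 5.0 if x < 5 else 9.0) for x in range(10)]
forest = train_forest(samples, n_trees=5, m=10)
```

A production random forest would also sample a random subset of features at each split; that refinement is omitted here for brevity.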
The specific training process of the second classifier is as follows. The inputs are the second training sample set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$, the maximum number of iterations T, and the loss function L. The output is the strong learner f(x):
A) Initialize the weak learner:
$$f_0(x) = \arg\min_{c} \sum_{i=1}^{m} L(y_i, c)$$
where c is a set constant.
B) For iteration rounds t = 1, 2, ..., T:
a) For samples i = 1, 2, ..., m, calculate the negative gradient $r_{ti}$:
$$r_{ti} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{t-1}(x)}$$
b) Using $(x_i, r_{ti})$, i = 1, 2, ..., m, fit a CART regression tree to obtain the t-th regression tree, whose leaf node regions are $R_{tj}$, j = 1, 2, ..., J, where J is the number of leaf nodes of regression tree t.
c) For leaf regions j = 1, 2, ..., J, calculate the best fit value $c_{tj}$:
$$c_{tj} = \arg\min_{c} \sum_{x_i \in R_{tj}} L\big(y_i, f_{t-1}(x_i) + c\big)$$
d) Update the strong learner:
$$f_t(x) = f_{t-1}(x) + \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj})$$
where $I(x \in R_{tj})$ indicates whether x belongs to the training samples of leaf node region $R_{tj}$.
C) Obtain the expression of the strong learner f(x):
$$f(x) = f_T(x) = f_0(x) + \sum_{t=1}^{T} \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj})$$
Based on the above strong learner f(x), the second classifier is obtained by training.
The specific training process of the third classifier is as follows:
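Steps A) to C) of the GBDT procedure can be traced in a small sketch. It assumes the squared loss L = (y - f)^2 / 2, for which the negative gradient is simply the residual y - f and the best leaf value is the mean residual, and it uses one-split trees (J = 2) on a single feature. All names and data are illustrative.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Step b): fit a one-split regression tree (two leaf regions) to the residuals."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        # Step c): under squared loss the best fit value of a leaf is its mean residual.
        c_left, c_right = mean(left), mean(right)
        cost = sum((r - c_left) ** 2 for r in left) + sum((r - c_right) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, threshold, c_left, c_right)
    return best[1:]


def train_gbdt(xs, ys, rounds):
    f0 = mean(ys)  # step A): the constant that minimises the squared loss
    stumps = []
    predictions = [f0] * len(xs)
    for _ in range(rounds):  # step B): t = 1, ..., T
        residuals = [y - p for y, p in zip(ys, predictions)]  # a) negative gradient
        threshold, c_left, c_right = fit_stump(xs, residuals)
        stumps.append((threshold, c_left, c_right))
        # d) update the strong learner on the training points
        predictions = [
            p + (c_left if x <= threshold else c_right)
            for x, p in zip(xs, predictions)
        ]
    return f0, stumps


def predict_gbdt(model, x):
    """Step C): f(x) = f0(x) plus the contribution of every fitted tree."""
    f0, stumps = model
    return f0 + sum(c_l if x <= thr else c_r for thr, c_l, c_r in stumps)


# Toy data: one numeric feature against a fasting glucose value.
xs = [1, 2, 3, 4, 5, 6]
ys = [5.0, 5.2, 5.1, 8.0, 8.3, 8.1]
model = train_gbdt(xs, ys, rounds=5)
```

With a different loss function, only the residual computation and the leaf-value formula would change; the loop structure stays the same.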
A) Establish an initial model, as given by the following formula:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where K is the number of trees, F denotes the tree structures that are constructed, $x_i$ is the i-th sample, and the sum of the scores of $x_i$ on each tree is its predicted value $\hat{y}_i$.
The objective function of this initial model is
$$Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $y_i$ is the actual sample value corresponding to $x_i$, l is the loss function, and $\Omega(f_k)$ is the regularization term of tree $f_k$.
B) As trees are added, recursion over t rounds gives the final objective function
$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$
where
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
Here $I_j$ denotes all samples included in the j-th leaf, $w_j$ is the weight of the j-th leaf, the term $\gamma T$ corresponds to the number of leaves T, and $g_i$ and $h_i$ are the first- and second-order derivatives of the loss with respect to the previous round's prediction for sample i.
C) Substitute the data of the third training sample set into the above initial model for fitting, and use the above final objective function to measure how well the model fits the training data (that is, compute the loss with the objective function; the smaller the loss, the better the model fits the training data), so that the bias and variance of the model meet the standard requirements. The third classifier is thus obtained.
The specific training process of the fourth classifier is as follows: A) use the existing LightGBM algorithm to fit the data in the fourth training sample set, and after each fit, test the resulting model with a test set selected from the fourth training sample set to obtain the corresponding coefficient of determination and mean squared error; B) when the coefficient of determination is greater than a certain threshold and the mean squared error is less than a certain threshold, determine that the fitted model meets the standard, and take the model that meets the standard as the fourth classifier.
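How the final objective scores a candidate tree can be shown with a small worked example. The sketch assumes the standard XGBoost form Obj = sum_j [G_j w_j + (H_j + lambda) w_j^2 / 2] + gamma T, the squared loss l = (y - y_hat)^2 / 2 (so that g_i = y_hat_i - y_i and h_i = 1), and illustrative values for lambda and gamma; none of these values come from the text.

```python
def leaf_statistics(leaves, y_true, y_pred):
    """Return (G_j, H_j) per leaf for the squared loss l = 0.5 * (y - y_hat) ** 2."""
    stats = []
    for sample_ids in leaves:  # I_j: indices of the samples falling in leaf j
        G = sum(y_pred[i] - y_true[i] for i in sample_ids)  # g_i = y_hat_i - y_i
        H = float(len(sample_ids))  # h_i = 1 for every sample under this loss
        stats.append((G, H))
    return stats


def objective(stats, weights, lam, gamma):
    """Obj = sum_j [G_j * w_j + 0.5 * (H_j + lam) * w_j ** 2] + gamma * T."""
    T = len(stats)  # number of leaves
    fit = sum(G * w + 0.5 * (H + lam) * w * w for (G, H), w in zip(stats, weights))
    return fit + gamma * T


def optimal_weights(stats, lam):
    """Leaf weights that minimise the objective: w_j = -G_j / (H_j + lam)."""
    return [-G / (H + lam) for G, H in stats]


# Four samples, previous-round prediction 6.0 for all, split into two leaves.
y_true = [7.0, 8.0, 5.0, 5.5]
y_pred = [6.0, 6.0, 6.0, 6.0]
leaves = [[0, 1], [2, 3]]

stats = leaf_statistics(leaves, y_true, y_pred)
weights = optimal_weights(stats, lam=1.0)
score = objective(stats, weights, lam=1.0, gamma=0.1)
```

A lower score means the tree structure fits the training data better after accounting for the penalty on the number of leaves.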
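The acceptance test in step B) for the fourth classifier relies on two standard metrics, which can be computed as sketched below. The threshold values `min_r2` and `max_mse` are placeholders, since the text leaves them open, and the fitting itself (done with LightGBM in the text) is not shown.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    centre = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - centre) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def meets_standard(actual, predicted, min_r2=0.8, max_mse=0.5):
    """Both thresholds are assumed placeholder values."""
    return (
        r_squared(actual, predicted) > min_r2
        and mean_squared_error(actual, predicted) < max_mse
    )


# Hypothetical held-out fasting glucose values and model predictions.
actual = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.0, 7.0, 7.5]
```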
(3) Finally, fuse the first classifier, the second classifier, the third classifier, and the fourth classifier using the bagging method to obtain the first recognition model. The fusion is carried out by voting, following the majority principle: the minority yields to the majority. For example, after the medical history data, hospitalization data, medical visit and medication data, physical examination data, and health notification data of the user to be tested are input into the four classifiers, if the fasting blood glucose values predicted by three of the four classifiers meet the criterion for diabetes, the user to be tested can be determined to have diabetes; if the fasting blood glucose value predicted by only one classifier meets the criterion for diabetes and the values predicted by the other three classifiers do not, the user to be tested can be determined not to have diabetes.
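The voting rule of step (3) reduces to counting how many predicted values reach the diagnostic criterion. The sketch below uses 7.0 mmol/L, the fasting criterion given in step 206, and treats a 2-to-2 split as "no majority"; that tie handling is an assumption of this sketch.

```python
FASTING_THRESHOLD = 7.0  # mmol/L, fasting criterion cited in step 206


def majority_vote(predicted_values, threshold=FASTING_THRESHOLD):
    """Return True when more than half of the classifiers predict a diabetic value."""
    votes = sum(1 for value in predicted_values if value >= threshold)
    return votes > len(predicted_values) / 2


three_of_four = majority_vote([7.4, 7.9, 6.8, 8.1])
one_of_four = majority_vote([7.4, 6.1, 6.8, 5.9])
```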
It should be noted that if the MAPE index value of the first recognition model is greater than the preset standard comparison threshold, that is, the trained first recognition model does not meet the evaluation standard, the first model training set can be re-divided to obtain a new first training sample set, second training sample set, third training sample set, and fourth training sample set. The first, second, third, and fourth classifiers are then further trained on the new first, second, third, and fourth training sample sets respectively, the four newly trained classifiers are fused, and it is determined whether the MAPE index value of the new first recognition model is less than the preset standard comparison threshold. If it is still greater than the preset standard comparison threshold, the process of re-dividing the model training set and retraining the classifiers is repeated until the MAPE index value of the latest first recognition model is less than the preset standard comparison threshold, that is, until it meets the evaluation standard.
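The MAPE check and the retraining loop described above can be sketched together. Several details are assumptions of the sketch rather than statements of the text: the model is scored on the full training set, a `max_rounds` guard stops the loop, and `fit_ratio_model` is a hypothetical stand-in for training and fusing the four classifiers.

```python
import random


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    return 100.0 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def train_until_acceptable(training_set, train_fn, threshold, max_rounds=10, seed=0):
    """Redraw the four training subsets and retrain until MAPE < threshold."""
    rng = random.Random(seed)
    n = len(training_set)
    actual = [y for _, y in training_set]
    for round_no in range(1, max_rounds + 1):
        subsets = [rng.choices(training_set, k=n) for _ in range(4)]
        model = train_fn(subsets)
        score = mape(actual, [model(x) for x, _ in training_set])
        if score < threshold:
            return model, score, round_no
    raise RuntimeError("no model met the MAPE threshold within max_rounds")


def fit_ratio_model(subsets):
    """Hypothetical stand-in for training and fusing the four classifiers."""
    ratios = [y / x for subset in subsets for x, y in subset]
    slope = sum(ratios) / len(ratios)
    return lambda x: slope * x


# Toy (feature, label) pairs following y = 2x exactly.
training_set = [(float(x), 2.0 * x) for x in range(1, 6)]
model, score, rounds_used = train_until_acceptable(training_set, fit_ratio_model, threshold=5.0)
```

In practice the evaluation would more likely use held-out data, and the guard value would be chosen to match the available training budget.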
204a. Input the feature information of the target user into the first recognition model and match it for similarity against feature information X1. The feature information of the target user corresponds to the target feature data of the target user other than the fasting blood glucose value and the two-hour postprandial blood glucose value. As an optional manner, step 204a may specifically include: subjecting the feature information of the target user to data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information in the form of structured data; and matching the structured feature information for similarity against feature information X1.
The feature information of the target user sometimes contains useless data, missing values, and/or outliers; that is, it is unstructured data that is not suitable for direct prediction with the first recognition model. Therefore, the feature information of the target user is first cleaned to remove useless data (for example, data such as the user's current place of residence and place of household registration is removed, and only the medical history data, hospitalization data, medical visit and medication data, physical examination data, health notification data, and the like are retained). Features are then extracted from the retained data (for example, medical history data, hospitalization data, medical visit and medication data, physical examination data, and health notification data). If the extracted feature data contains missing values, they can be filled with 0 (for example, if the height and weight items in the user's physical examination data are empty, they can be filled with 0, so that the data remains comparable when it is subsequently matched against feature information X1 in the model and unmatchable errors are avoided). If the extracted feature data contains outliers, they can be corrected with reference to the actual situation (for example, a hospitalization length of 99999 days is clearly abnormal; the correct length can be calculated from the start and end times of the hospitalization and the value then corrected). Through the data cleaning, feature extraction, missing value filling, and outlier processing of this optional manner, structured data that is comparable with feature information X1 in the first recognition model is obtained, unmatchable errors during feature matching are avoided, outliers are removed, and the precision of feature matching is improved.
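The cleaning, zero filling, and outlier correction just described might be implemented along the following lines. The field names, the list of retained fields, and the plausibility bound `MAX_PLAUSIBLE_STAY` are assumptions made for the example.

```python
from datetime import date

# Fields retained for prediction; everything else is treated as useless data.
KEPT_FIELDS = ("medical_history", "inpatient_days", "medication", "height_cm", "weight_kg")
MAX_PLAUSIBLE_STAY = 3650  # assumed upper bound in days for flagging an outlier


def preprocess(raw: dict) -> dict:
    """Turn a raw user record into structured feature data."""
    # Data cleaning and feature extraction: keep only the prediction-relevant fields.
    cleaned = {field: raw.get(field) for field in KEPT_FIELDS}

    # Outlier processing: recompute an implausible stay from the admission dates.
    stay = cleaned["inpatient_days"]
    if stay is not None and stay > MAX_PLAUSIBLE_STAY:
        admitted, discharged = raw.get("admitted"), raw.get("discharged")
        if admitted and discharged:
            cleaned["inpatient_days"] = (discharged - admitted).days
        else:
            cleaned["inpatient_days"] = None  # cannot be corrected; treat as missing

    # Missing value filling: replace every missing value with 0.
    return {field: 0 if value is None else value for field, value in cleaned.items()}


raw_record = {
    "residence": "Shenzhen",  # useless for prediction, dropped by cleaning
    "medical_history": "hypertension",
    "inpatient_days": 99999,  # clearly abnormal
    "admitted": date(2019, 3, 1),
    "discharged": date(2019, 3, 8),
    "medication": "metformin",
    "height_cm": None,  # missing, filled with 0
}
structured = preprocess(raw_record)
```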
205a. Determine the first physical examination index value of the target user using the first mapping relationship and the feature information X1 whose similarity is the highest and greater than a preset threshold. The preset threshold can be set in advance according to actual needs. The larger the preset threshold, the higher the feature matching precision; a similarity of 100% means the features match completely. For example, inputting the target user's medical history data, hospitalization data, medical visit and medication data, physical examination data, and health notification data into the first recognition model is equivalent to inputting these data into each of the four classifiers of step 203a, matching them for similarity against the feature information of each classifier, and finding in each the feature information that is most similar and exceeds a certain threshold. The four classifiers then each yield a physical examination index value, namely a fasting blood glucose value of the target user. If three of the four fasting blood glucose values meet the criterion for diabetes, the target user can be determined to have diabetes, and the average of these three fasting blood glucose values is taken as the first physical examination index value calculated by the first recognition model. If two of the four fasting blood glucose values do not meet the criterion for diabetes and the other two do, the average of the four fasting blood glucose values is taken as the first physical examination index value calculated by the first recognition model, and whether the target user has diabetes is determined from this average.
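The similarity lookup of steps 204a and 205a can be sketched as a nearest-match search. The text does not name a similarity measure, so cosine similarity over numeric feature vectors is an assumption here, as are the function names and the toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_index_value(target, training_features, training_labels, threshold):
    """Return (label, similarity) of the most similar X1 above threshold, else None."""
    best_label, best_score = None, threshold
    for features, label in zip(training_features, training_labels):
        score = cosine_similarity(target, features)
        if score > best_score:
            best_label, best_score = label, score
    return None if best_label is None else (best_label, best_score)


# Hypothetical encoded feature information X1 with fasting glucose labels Y1.
X1 = [[1.0, 0.0, 3.0], [0.0, 1.0, 0.0]]
Y1 = [7.8, 5.2]

match = lookup_index_value([1.0, 0.0, 3.0], X1, Y1, threshold=0.9)
no_match = lookup_index_value([0.0, 0.0, 1.0], X1, Y1, threshold=0.99)
```

Returning `None` when nothing exceeds the threshold leaves the caller to decide how to handle a target user with no sufficiently similar sample.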
Step 202b, parallel to step 202a: use the two-hour postprandial blood glucose value among the user characteristics of each sample user as label information Y2, and use the target feature data of the sample user other than the fasting blood glucose value and the two-hour postprandial blood glucose value as feature information X2, to create a second model training set. It should be noted that step 202b is similar to step 202a: the target feature data of the sample user includes at least one or more of the sample user's medical history data, hospitalization data, medical visit and medication data, physical examination data, and health notification data. The created second model training set contains each item of feature information X2 and its corresponding label information Y2, that is, the two-hour postprandial blood glucose values corresponding to sample users with different medical history records, hospitalization records, medication status, physical examination status, health notifications, and so on.
203b. Using the second model training set, train a second recognition model for determining the second physical examination index value based on the preset regression prediction algorithm. The second recognition model is likewise evaluated using the MAPE index: when the MAPE index value of the second recognition model is less than a predetermined standard comparison threshold, the second recognition model is determined to meet the evaluation standard, and the second recognition model that meets the evaluation standard determines the second mapping relationship between feature information X2 and label information Y2. As an optional manner, step 203b may specifically include: (1) obtaining a fifth training sample set, a sixth training sample set, a seventh training sample set, and an eighth training sample set from the second model training set by random sampling; (2) training a fifth classifier on the fifth training sample set using the random forest algorithm, training a sixth classifier on the sixth training sample set using the GBDT algorithm, training a seventh classifier on the seventh training sample set using the XGBoost algorithm, and training an eighth classifier on the eighth training sample set using the LightGBM algorithm; (3) fusing the fifth classifier, the sixth classifier, the seventh classifier, and the eighth classifier using the bagging method to obtain the second recognition model.
Similar to the optional manner in step 203a, the fusion is also carried out by voting, following the majority principle: the minority yields to the majority. For example, after the medical history data, hospitalization data, medical visit and medication data, physical examination data, and health notification data of the user to be tested are input into the four classifiers, if the two-hour postprandial blood glucose values predicted by three of the four classifiers meet the criterion for diabetes, the user to be tested can be determined to have diabetes; if the two-hour postprandial blood glucose value predicted by only one classifier meets the criterion for diabetes and the values predicted by the other three classifiers do not, the user to be tested can be determined not to have diabetes. It should be noted that if the MAPE index value of the second recognition model is greater than the preset standard comparison threshold, that is, the trained second recognition model does not meet the evaluation standard, the second model training set can be re-divided to obtain a new fifth training sample set, sixth training sample set, seventh training sample set, and eighth training sample set. The fifth, sixth, seventh, and eighth classifiers are then further trained on the new fifth, sixth, seventh, and eighth training sample sets respectively, the four newly trained classifiers are fused, and it is determined whether the MAPE index value of the new second recognition model is less than the preset standard comparison threshold. If it is still greater than the preset standard comparison threshold, the process of re-dividing the model training set and retraining the classifiers is repeated until the MAPE index value of the latest second recognition model is less than the preset standard comparison threshold, that is, until it meets the evaluation standard.
204b. Input the feature information of the target user into the second recognition model and match it for similarity against feature information X2. In this step, the feature information of the target user corresponds to the target feature data of the target user other than the fasting blood glucose value and the two-hour postprandial blood glucose value. As an optional manner, step 204b may specifically include: subjecting the feature information of the target user to data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information in the form of structured data; and matching the structured feature information for similarity against feature information X2.
Similar to the optional manner in step 204a, the data cleaning, feature extraction, missing value filling, and outlier processing of this optional manner yield structured data that is comparable with feature information X2 in the second recognition model, avoid unmatchable errors during feature matching, remove outliers, and improve the precision of feature matching.
205b. Determine the second physical examination index value of the target user using the second mapping relationship and the feature information X2 whose similarity is the highest and greater than a predetermined threshold. The predetermined threshold can be set in advance according to actual needs. The larger the predetermined threshold, the higher the feature matching precision; a similarity of 100% means the features match completely. For example, inputting the target user's medical history data, hospitalization data, medical visit and medication data, physical examination data, and health notification data into the second recognition model is equivalent to inputting these data into each of the four classifiers of step 203b, matching them for similarity against the feature information of each classifier, and finding in each the feature information that is most similar and exceeds a certain threshold. The four classifiers then each yield a physical examination index value, namely a two-hour postprandial blood glucose value of the target user. If three of the four two-hour postprandial blood glucose values meet the criterion for diabetes, the target user can be determined to have diabetes, and the average of these three values is taken as the second physical examination index value calculated by the second recognition model. If two of the four two-hour postprandial blood glucose values do not meet the criterion for diabetes and the other two do, the average of the four values is taken as the second physical examination index value calculated by the second recognition model, and whether the target user has diabetes is determined from this average.
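The way the four classifier outputs are combined into a single second physical examination index value can be sketched as follows, using the 11.1 mmol/L two-hour criterion from step 206. The text covers only the three-to-one and two-to-two cases; applying the same averaging to a four-to-zero majority, and averaging the non-diabetic values when they are in the majority, are assumptions of this sketch.

```python
POSTPRANDIAL_THRESHOLD = 11.1  # mmol/L, two-hour criterion cited in step 206


def fuse_predictions(values, threshold=POSTPRANDIAL_THRESHOLD):
    """Return (index_value, has_diabetes) from the four classifier outputs."""
    positive = [v for v in values if v >= threshold]
    negative = [v for v in values if v < threshold]
    if len(positive) > len(negative):
        # Majority meets the criterion: average the values that meet it.
        return sum(positive) / len(positive), True
    if len(positive) == len(negative):
        # Tie: average all values and judge by that average.
        index_value = sum(values) / len(values)
        return index_value, index_value >= threshold
    # Majority does not meet the criterion (averaging them is an assumption).
    return sum(negative) / len(negative), False


majority_case = fuse_predictions([11.5, 12.0, 10.0, 12.5])
tie_case = fuse_predictions([11.5, 12.5, 10.0, 9.0])
```

Calling the same function with `threshold=7.0` would give the corresponding fusion for the fasting values of step 205a.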
206. Determine the degree of illness of the target user according to the first physical examination index value and/or the second physical examination index value.
As an optional manner, step 206 may specifically include: if the first physical examination index value of the target user is greater than or equal to a first preset threshold, and/or the second physical examination index value is greater than or equal to a second preset threshold, determining that the target user has diabetes; and then determining the degree of illness of the target user from the first numerical interval in which the first physical examination index value lies and/or the second numerical interval in which the second physical examination index value lies. The first preset threshold is determined according to the standard for diagnosing diabetes from fasting blood glucose, for example 7.0 mmol/L; the second preset threshold is determined according to the standard for diagnosing diabetes from two-hour postprandial blood glucose, for example 11.1 mmol/L. For example, if the first physical examination index value of the target user is determined to be 8.0 mmol/L and the second physical examination index value 7.6 mmol/L, the target user can be determined to have diabetes because the first physical examination index value is greater than the first preset threshold. If the first physical examination index value is determined to be 5.7 mmol/L and the second physical examination index value 11.9 mmol/L, the target user can be determined to have diabetes because the second physical examination index value is greater than the second preset threshold. If the first physical examination index value is determined to be 8.3 mmol/L and the second physical examination index value 11.7 mmol/L, the target user can be determined to have diabetes because the first physical examination index value is greater than the first preset threshold and the second physical examination index value is greater than the second preset threshold.
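The threshold test of step 206 is a simple disjunction over the two index values, as the sketch below shows with the example thresholds of 7.0 mmol/L and 11.1 mmol/L. The function name and the handling of a missing index value are illustrative assumptions.

```python
FASTING_THRESHOLD = 7.0        # mmol/L, first preset threshold
POSTPRANDIAL_THRESHOLD = 11.1  # mmol/L, second preset threshold


def has_diabetes(fasting=None, postprandial=None):
    """Return True if either available index value reaches its threshold."""
    if fasting is None and postprandial is None:
        raise ValueError("at least one physical examination index value is required")
    fasting_hit = fasting is not None and fasting >= FASTING_THRESHOLD
    postprandial_hit = postprandial is not None and postprandial >= POSTPRANDIAL_THRESHOLD
    return fasting_hit or postprandial_hit


# The three worked examples from the text.
example_1 = has_diabetes(fasting=8.0, postprandial=7.6)
example_2 = has_diabetes(fasting=5.7, postprandial=11.9)
example_3 = has_diabetes(fasting=8.3, postprandial=11.7)
```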
The degree of diabetes is discussed below in three cases. (1) Determination using only the first physical examination index value, that is, determining the degree of illness of the target user from the first numerical interval in which the first physical examination index value lies. This may specifically include: dividing a plurality of numerical intervals that are greater than the first preset threshold and increase according to a predetermined numerical rule; creating a third mapping relationship between the plurality of numerical intervals and the degree of diabetes; determining the first numerical interval, among the plurality of numerical intervals, in which the first physical examination index value lies; and determining the first degree of diabetes of the target user according to the third mapping relationship and the first numerical interval. For example, the plurality of numerical intervals greater than the first preset threshold of 7.0 mmol/L and the corresponding degrees of diabetes in the third mapping relationship are set as: mild diabetes, 7.0 to 8.4 mmol/L; moderate diabetes, 8.4 to 10.1 mmol/L; severe diabetes, greater than 10.1 mmol/L. If the first physical examination index value is determined to be 9.6 mmol/L, the first numerical interval in which it lies is 8.4 to 10.1 mmol/L, and according to the third mapping relationship and the first numerical interval, the target user's degree of diabetes is determined to be moderate diabetes.
(2) Determination using only the second physical examination index value, that is, determining the degree of illness of the target user from the second numerical interval in which the second physical examination index value lies. This specifically includes: dividing a plurality of numerical intervals that are greater than the second preset threshold and increase according to a predetermined numerical rule; creating a fourth mapping relationship between the plurality of numerical intervals and the degree of diabetes; determining the second numerical interval, among the plurality of numerical intervals, in which the second physical examination index value lies; and determining the second degree of diabetes of the target user according to the fourth mapping relationship and the second numerical interval. For example, the plurality of numerical intervals greater than the second preset threshold of 11.1 mmol/L and the corresponding degrees of diabetes in the fourth mapping relationship are set as: moderate diabetes, 11.1 to 16.7 mmol/L; severe diabetes, greater than 16.7 mmol/L (ketoacidosis is prone to occur above 16.7 mmol/L). If the second physical examination index value is determined to be 12.6 mmol/L, the second numerical interval in which it lies is 11.1 to 16.7 mmol/L.
According to the fourth mapping relationship and the second numerical interval, It is determined that the degree of diabetes of the target user is already moderate diabetes. (3) Combine the first physical examination index value and the second physical examination index value to make a comprehensive judgment (this judgment method takes into account many factors, so the prediction accuracy is relatively high), that is, the first value through the first physical examination index value The interval and the second numerical interval where the second physical examination index value is located are used to determine the disease degree of the target user, including: if the disease degree of the first diabetes and the disease degree of the second diabetes are the same, according to the same diabetes The degree of illness determines the final degree of illness. If the degree of prevalence of the first diabetes and the prevalence of the second diabetes are different, according to the accuracy or acceptance rate of the user’s feedback on the two prediction methods of the first recognition model and the second recognition model, the corresponding first recognition model is obtained respectively The first weight of and the second weight corresponding to the second recognition model; when the first weight is greater than the second weight, the first diabetes is determined as the target user’s disease; when the second weight is greater than the first weight , Determine the degree of the second diabetes as the degree of the target user.
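The interval lookup in cases (1) and (2) can be illustrated with a minimal Python sketch. The thresholds and the moderate interval of 8.4 to 11.1 mmol/L follow the worked example in the text; the function names, the use of `bisect`, and the choice to treat each interval's upper limit as inclusive are assumptions made only for illustration.

```python
from bisect import bisect_left


def severity(value, threshold, upper_limits, grades):
    """Map an index value (mmol/L) to a severity grade.

    Returns None below the diagnostic threshold. Each entry of upper_limits
    closes one interval; treating the limits as inclusive is an assumption.
    """
    if value < threshold:
        return None
    return grades[bisect_left(upper_limits, value)]


def fasting_severity(value):
    # Third mapping relationship (fasting blood glucose).
    return severity(value, 7.0, [8.4, 11.1], ["mild", "moderate", "severe"])


def postprandial_severity(value):
    # Fourth mapping relationship (two-hour postprandial blood glucose).
    return severity(value, 11.1, [16.7], ["moderate", "severe"])


print(fasting_severity(9.6))        # moderate
print(postprandial_severity(12.6))  # moderate
```

A value below the threshold yields `None`, which corresponds to the user not being judged diabetic by that index.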
In this embodiment, the weights corresponding to the two prediction methods can be set according to the accuracy rate or adoption rate fed back by users. Specifically, the weight values corresponding to different accuracy rates or adoption rates can be compiled statistically, and the weight of each prediction method is then looked up through the resulting mapping relationship. The accuracy rate or adoption rate fed back by users reflects which prediction method predicts more accurately, so the result of the more accurate method is selected as the final judgment. Alternatively, the weights of the two prediction methods can be preset manually according to the actual situation. For example, if user feedback shows that predicting diabetes severity from the first physical examination index value is more accurate, a weight of 70% can be configured for the prediction method based on the first physical examination index value and a weight of 30% for the prediction method based on the second physical examination index value; when the two predictions differ, the result predicted from the first physical examination index value is fed back to the target user as the final diagnosis result. Suppose the first physical examination index value predicts moderate diabetes and the second physical examination index value predicts severe diabetes; then, according to the configured weights, the target user's diabetes severity is finally determined to be moderate diabetes. Subsequently, once the target user's actual fasting blood glucose value and two-hour postprandial blood glucose value are obtained, they can be used as new training samples to continue training the two recognition models in this embodiment and further improve prediction accuracy.

With the above diabetes prediction method, the mapping relationship between feature information and label information is determined by training on the model training set; the structured data of the target user is matched against the regression prediction model, and the first physical examination index value for fasting blood glucose and the second physical examination index value for blood glucose two hours after a meal are determined through the mapping relationship. Comparing these values with the first preset threshold and the second preset threshold determines whether the user has diabetes. Because the method starts from the diagnostic indicators for diabetes, it can not only predict whether the user has the disease but also judge the target user's disease severity from the first numerical interval in which the first physical examination index value falls and/or the second numerical interval in which the second physical examination index value falls, making the diagnosis result more complete.
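The weight-based resolution described above can be sketched in a few lines of Python. The function name and the handling of equal weights are assumptions; the text only specifies what happens when one weight is strictly greater than the other.

```python
def final_severity(first, second, first_weight, second_weight):
    """Combine the severities predicted from the two index values.

    first / second are the grades from the fasting and postprandial
    predictions; the weights come from user feedback or manual configuration.
    """
    if first == second:
        return first
    if first_weight > second_weight:
        return first
    if second_weight > first_weight:
        return second
    # The text does not cover equal weights; raising is an assumption.
    raise ValueError("weights are equal and the predictions differ")


# Example from the text: weights of 70% and 30%, predictions differ.
print(final_severity("moderate", "severe", 0.7, 0.3))  # moderate
```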
Further, as a concrete implementation of the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application provides a diabetes prediction apparatus. As shown in FIG. 3, the apparatus includes an acquisition unit 31, a creation unit 32, a judgment unit 33, and a determination unit 34. The acquisition unit 31 can be used to obtain sample user data from original health records and electronic medical record data; the creation unit 32 can be used to create a numerical regression prediction model based on user characteristics in the sample user data; the judgment unit 33 can be used to judge, using the regression prediction model, the target user's first physical examination index value for fasting blood glucose and second physical examination index value for blood glucose at a preset time after a meal; the determination unit 34 can be used to determine the target user's disease severity according to the first physical examination index value and/or the second physical examination index value.
In a specific application scenario, in order to create the numerical regression prediction model based on the user characteristics in the sample user data, the creation unit 32 may, as shown in FIG. 4, include a creation module 321, a training module 322, and a determination module 323.

The creation module 321 can be used to take the fasting blood glucose value in the user characteristics as label information Y1 and the sample users' target feature data, excluding the fasting blood glucose value and the two-hour postprandial blood glucose value, as feature information X1 to create a first model training set, where the target feature data includes at least one or more of the sample users' medical history data, hospitalization data, consultation and medication data, physical examination data, and health declaration data. The training module 322 can be used to train, on the first model training set and based on a preset regression prediction algorithm, a first recognition model for judging the first physical examination index value, where the preset regression prediction algorithm is obtained by fusing four algorithms: random forest, gradient boosting decision tree (GBDT), Xgboost, and LightGBM. The first recognition model is evaluated with the mean absolute percentage error (MAPE) metric; when the MAPE value of the first recognition model is less than a preset standard comparison threshold, the first recognition model is determined to meet the evaluation standard. The determination module 323 can be used to determine, through the first recognition model that meets the evaluation standard, a first mapping relationship between the feature information X1 and the label information Y1.

The creation module 321 can further be used to take the two-hour postprandial blood glucose value in the user characteristics as label information Y2 and the sample users' target feature data as feature information X2 to create a second model training set. The training module 322 can further be used to train, on the second model training set and based on the preset regression prediction algorithm, a second recognition model for judging the second physical examination index value, where the second recognition model is evaluated with the MAPE metric; when the MAPE value of the second recognition model is less than a predetermined standard comparison threshold, the second recognition model is determined to meet the evaluation standard. The determination module 323 can further be used to determine, through the second recognition model that meets the evaluation standard, a second mapping relationship between the feature information X2 and the label information Y2.
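For reference, MAPE over n samples is (100/n) multiplied by the sum of |(y_i - p_i) / y_i|, where y_i is the actual blood glucose value and p_i the predicted one. The evaluation step can be sketched as follows; the function names and the example threshold are illustrative assumptions, since the text does not give the value of the comparison threshold.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Actual values must be non-zero, which holds for blood glucose readings.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def meets_evaluation_standard(actual, predicted, threshold_percent):
    # The model passes when its MAPE is strictly below the threshold.
    return mape(actual, predicted) < threshold_percent


# Hypothetical values: both predictions are off by 10% of the actual value.
print(meets_evaluation_standard([5.0, 10.0], [5.5, 9.0], 15.0))  # True
```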
Correspondingly, in order to judge the target user's first physical examination index value for fasting blood glucose and second physical examination index value for blood glucose at a preset time after a meal, the judgment unit 33 may, as shown in FIG. 4, include a matching module 331 and a determination module 332. The matching module 331 can be used to input the target user's feature information into the first recognition model for similarity matching against the feature information X1, where the target user's feature information corresponds to the target user's target feature data excluding the fasting blood glucose value and the two-hour postprandial blood glucose value. The determination module 332 can be used to determine the first physical examination index value corresponding to the target user by using the feature information X1 whose similarity is greater than a preset threshold and is the highest, together with the first mapping relationship. The matching module 331 can further be used to input the target user's feature information into the second recognition model for similarity matching against the feature information X2. The determination module 332 can further be used to determine the second physical examination index value corresponding to the target user by using the feature information X2 whose similarity is greater than a predetermined threshold and is the highest, together with the second mapping relationship. In a specific application scenario, in order to determine the target user's disease severity according to the first physical examination index value and/or the second physical examination index value, the determination unit 34 may, as shown in FIG. 4, include a determination module 341 and a judgment module 342.
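The matching step above can be pictured as a lookup of the most similar training sample. The following sketch assumes numeric feature vectors and cosine similarity; the text does not name a similarity measure, so that choice, the function names, and the sample data are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_index_value(target, samples, min_similarity):
    """Return the label of the most similar sample above min_similarity.

    samples is a list of (feature_vector, label) pairs standing in for the
    learned mapping between feature information and label information.
    Returns None when no sample is similar enough.
    """
    best_label, best_similarity = None, min_similarity
    for features, label in samples:
        similarity = cosine_similarity(target, features)
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    return best_label


# Hypothetical two-feature samples with fasting glucose labels (mmol/L).
samples = [([1.0, 0.0], 6.1), ([0.0, 1.0], 9.6)]
print(predict_index_value([0.1, 0.9], samples, 0.8))  # 9.6
```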
The determination module 341 can be used to determine that the target user has diabetes if the first physical examination index value corresponding to the target user is greater than or equal to the first preset threshold and/or the second physical examination index value is greater than or equal to the second preset threshold. The judgment module 342 can be used to judge the target user's disease severity from the first numerical interval in which the first physical examination index value falls and/or the second numerical interval in which the second physical examination index value falls. In a specific application scenario, in order to judge the target user's disease severity accurately, the judgment module 342 is further used to: divide the range above the first preset threshold into a plurality of numerical intervals that increase according to a predetermined numerical rule; create a third mapping relationship between the plurality of numerical intervals and degrees of diabetes severity; determine the first numerical interval, among the plurality of numerical intervals, in which the first physical examination index value falls; and judge the target user's diabetes severity according to the third mapping relationship and the first numerical interval. It is likewise used to: divide the range above the second preset threshold into a plurality of numerical intervals that increase according to a predetermined numerical rule; create a fourth mapping relationship between the plurality of numerical intervals and degrees of diabetes severity; determine the second numerical interval, among the plurality of numerical intervals, in which the second physical examination index value falls; and judge the target user's diabetes severity according to the fourth mapping relationship and the second numerical interval.
The judgment module 342 is further used to: if the first diabetes severity and the second diabetes severity differ, obtain a first weight corresponding to the first recognition model and a second weight corresponding to the second recognition model according to the accuracy rate or adoption rate that users have fed back for these two prediction methods; when the first weight is greater than the second weight, determine the first diabetes severity as the target user's disease severity; when the second weight is greater than the first weight, determine the second diabetes severity as the target user's disease severity. In a specific application scenario, the matching module 331 can be used to subject the target user's feature information to data cleaning, feature extraction, missing value filling, and outlier processing to obtain feature information in the form of structured data, and to perform similarity matching between this structured feature information and the feature information X1. The matching module 331 can further be used to subject the target user's feature information to the same data cleaning, feature extraction, missing value filling, and outlier processing, and to perform similarity matching between the resulting structured feature information and the feature information X2.
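Two of the preprocessing steps named above, missing value filling and outlier processing, can be sketched as follows. The text does not say how either step is carried out; filling with the column median and clipping to a plausible range are assumptions chosen only for illustration, as are the function names and sample values.

```python
from statistics import median


def fill_missing(column):
    """Replace None entries with the median of the known values (assumed strategy)."""
    known = [v for v in column if v is not None]
    if not known:
        raise ValueError("column has no known values")
    fill = median(known)
    return [fill if v is None else v for v in column]


def clip_outliers(column, low, high):
    """Clamp every value into [low, high] (assumed outlier strategy)."""
    return [min(max(v, low), high) for v in column]


# Hypothetical fasting glucose readings (mmol/L) with a gap and an outlier.
readings = [5.0, None, 7.0, 50.0]
print(clip_outliers(fill_missing(readings), 3.0, 30.0))  # [5.0, 7.0, 7.0, 30.0]
```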
In a specific application scenario, the training module 322 can be used to: obtain a first training sample set, a second training sample set, a third training sample set, and a fourth training sample set from the first model training set by random sampling; train a first classifier on the first training sample set using the random forest algorithm; train a second classifier on the second training sample set using the GBDT algorithm; train a third classifier on the third training sample set using the Xgboost algorithm; train a fourth classifier on the fourth training sample set using the LightGBM algorithm; and fuse the first, second, third, and fourth classifiers using the bagging method to obtain the first recognition model. The training module 322 can further be used to: obtain a fifth training sample set, a sixth training sample set, a seventh training sample set, and an eighth training sample set from the second model training set by random sampling; train a fifth classifier on the fifth training sample set using the random forest algorithm; train a sixth classifier on the sixth training sample set using the GBDT algorithm; train a seventh classifier on the seventh training sample set using the Xgboost algorithm; train an eighth classifier on the eighth training sample set using the LightGBM algorithm; and fuse the fifth, sixth, seventh, and eighth classifiers using the bagging method to obtain the second recognition model.
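The sample-then-fuse procedure above can be sketched in plain Python. Sampling with replacement and fusing by averaging the four predictions are assumptions (the text only says "random sampling" and "bagging method"), and `train_mean_regressor` is a trivial stand-in; in practice each trainer would wrap a random forest, GBDT, Xgboost, or LightGBM regressor.

```python
import random
from statistics import mean


def train_fused_model(training_set, trainers, sample_size, seed=0):
    """Train one base model per trainer on its own random sample and fuse them.

    training_set is a list of (feature_vector, label) pairs; each trainer
    takes such a list and returns a callable mapping features to a prediction.
    """
    rng = random.Random(seed)
    models = []
    for train in trainers:
        # Bootstrap sample (drawn with replacement) for this base learner.
        subset = [rng.choice(training_set) for _ in range(sample_size)]
        models.append(train(subset))

    def predict(features):
        # Fusion by simple averaging of the base predictions (assumed).
        return mean(model(features) for model in models)

    return predict


def train_mean_regressor(subset):
    """Stand-in learner: always predicts the mean label of its subset."""
    average = mean(label for _, label in subset)
    return lambda features: average


# Hypothetical data in which every label is 6.0 mmol/L.
training_set = [([1.0], 6.0), ([2.0], 6.0), ([3.0], 6.0)]
predict = train_fused_model(training_set, [train_mean_regressor] * 4, sample_size=3)
print(predict([1.5]))  # 6.0
```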
It should be noted that, for other descriptions of the functional units of the diabetes prediction apparatus provided in this embodiment, reference may be made to the corresponding descriptions of FIG. 1 and FIG. 2, which are not repeated here. Based on the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application correspondingly also provides a storage medium storing computer-readable instructions which, when executed by a processor, implement the diabetes prediction method shown in FIG. 1 and FIG. 2. On this understanding, the technical solution of the present application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) and includes a number of instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods of the implementation scenarios of the present application.
Based on the methods shown in FIG. 1 and FIG. 2 and the virtual apparatus embodiments shown in FIG. 3 and FIG. 4, and in order to achieve the above purpose, an embodiment of the present application also provides a computer device, which may be a personal computer, a server, a network device, or the like. This physical device includes a storage medium and a processor: the storage medium is used to store computer-readable instructions, and the processor is used to execute the computer-readable instructions to implement the diabetes prediction method shown in FIG. 1 and FIG. 2. Optionally, the computer device may also include a user interface, a network interface, a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface or a Wi-Fi interface), and the like.
Those skilled in the art will understand that the computer device structure provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently. The non-volatile readable storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the physical device used for diabetes prediction and supports the running of information processing programs and other software and/or programs. The network communication module is used for communication among the components inside the non-volatile readable storage medium and for communication with other hardware and software in the physical device. From the description of the above implementations, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform, or by hardware. Compared with the prior art, applying the technical solution of the present application makes it possible, once the target user has been found to have diabetes, to further judge the severity of the disease. This makes the diagnosis result more complete, so that the progression of the target user's condition can be tracked in a timely manner and the corresponding treatment provided.
Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required for implementing the present application. Those skilled in the art will also understand that the modules of the apparatus in an implementation scenario may be distributed in that apparatus as described, or may be changed accordingly and located in one or more apparatuses different from the one in this implementation scenario. The modules of the above implementation scenarios may be combined into one module or further split into multiple sub-modules. The serial numbers used in the present application are for description only and do not indicate the relative merits of the implementation scenarios. The above discloses only a few specific implementation scenarios of the present application; the present application is not limited to them, and any variation conceivable by those skilled in the art shall fall within the scope of protection of the present application.

Claims (20)

  1. A method for predicting diabetes, comprising:
    obtaining sample user data from original health records and electronic medical record data;
    creating a numerical regression prediction model based on user characteristics in the sample user data;
    using the regression prediction model to judge a first physical examination index value for fasting blood glucose of a target user and a second physical examination index value for blood glucose at a preset time after a meal;
    determining the disease severity of the target user according to the first physical examination index value and/or the second physical examination index value.
  2. The method according to claim 1, wherein the user characteristics are extracted from the sample user data using regular expressions, and the preset time is two hours; and wherein creating the numerical regression prediction model based on the user characteristics in the sample user data specifically comprises: taking the fasting blood glucose value in the user characteristics as label information Y1 and the sample users' target feature data, excluding the fasting blood glucose value and the two-hour postprandial blood glucose value, as feature information X1 to create a first model training set, the target feature data including at least one or more of the sample users' medical history data, hospitalization data, consultation and medication data, physical examination data, and health declaration data; training, on the first model training set and based on a preset regression prediction algorithm, a first recognition model for judging the first physical examination index value, wherein the preset regression prediction algorithm is obtained by fusing four algorithms, namely random forest, gradient boosting decision tree (GBDT), Xgboost, and LightGBM, the first recognition model is evaluated with the mean absolute percentage error (MAPE) metric, the first recognition model is determined to meet the evaluation standard when its MAPE value is less than a preset standard comparison threshold, and a first mapping relationship between the feature information X1 and the label information Y1 can be determined through the first recognition model that meets the evaluation standard; taking the two-hour postprandial blood glucose value in the user characteristics as label information Y2 and the sample users' target feature data as feature information X2 to create a second model training set; and training, on the second model training set and based on the preset regression prediction algorithm, a second recognition model for judging the second physical examination index value, wherein the second recognition model is evaluated with the MAPE metric, the second recognition model is determined to meet the evaluation standard when its MAPE value is less than a predetermined standard comparison threshold, and a second mapping relationship between the feature information X2 and the label information Y2 can be determined through the second recognition model that meets the evaluation standard.
  3. The method according to claim 2, wherein using the regression prediction model to judge the first physical examination index value for fasting blood glucose of the target user and the second physical examination index value for blood glucose at a preset time after a meal specifically comprises: inputting the target user's feature information into the first recognition model for similarity matching against the feature information X1, the target user's feature information corresponding to the target feature data excluding the fasting blood glucose value and the two-hour postprandial blood glucose value; determining the first physical examination index value corresponding to the target user by using the feature information X1 whose similarity is greater than a preset threshold and is the highest, together with the first mapping relationship; inputting the target user's feature information into the second recognition model for similarity matching against the feature information X2; and determining the second physical examination index value corresponding to the target user by using the feature information X2 whose similarity is greater than a predetermined threshold and is the highest, together with the second mapping relationship.
  4. The method according to claim 3, wherein determining the disease severity of the target user according to the first physical examination index value and/or the second physical examination index value specifically comprises: if the first physical examination index value of the target user is greater than or equal to a first preset threshold, and/or the second physical examination index value is greater than or equal to a second preset threshold, determining that the target user has diabetes; and determining the disease severity of the target user from the first numerical interval in which the first physical examination index value falls and/or the second numerical interval in which the second physical examination index value falls.
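The threshold test in claim 4 amounts to a simple comparison, sketched below. The default thresholds of 7.0 mmol/L (fasting) and 11.1 mmol/L (two hours after a meal) are assumptions taken from commonly used diagnostic cut-offs; the claim itself only refers to first and second preset thresholds.

```python
def has_diabetes(fasting=None, postprandial=None,
                 fasting_threshold=7.0, postprandial_threshold=11.1):
    """Apply the "and/or" rule: either index value reaching its threshold suffices.

    Values are in mmol/L. A value of None means that index was not predicted.
    The default thresholds are assumed, not taken from the claim.
    """
    fasting_hit = fasting is not None and fasting >= fasting_threshold
    postprandial_hit = (postprandial is not None
                        and postprandial >= postprandial_threshold)
    return fasting_hit or postprandial_hit


# Hypothetical predicted index values for three target users.
only_fasting_high = has_diabetes(fasting=7.0, postprandial=9.5)
only_postprandial_known = has_diabetes(postprandial=12.3)
both_normal = has_diabetes(fasting=5.4, postprandial=7.8)
```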
  5. The method according to claim 4, wherein determining the disease severity of the target user from the first numerical interval in which the first physical examination index value falls specifically comprises: dividing the range above the first preset threshold into a plurality of numerical intervals that increase according to a predetermined numerical rule; creating a third mapping relationship between the plurality of numerical intervals and degrees of diabetes severity; determining the first numerical interval, among the plurality of numerical intervals, in which the first physical examination index value falls; and determining a first diabetes severity of the target user according to the third mapping relationship and the first numerical interval;
    determining the disease severity of the target user from the second numerical interval in which the second physical examination index value falls specifically comprises: dividing the range above the second preset threshold into a plurality of numerical intervals that increase according to a predetermined numerical rule; creating a fourth mapping relationship between the plurality of numerical intervals and degrees of diabetes severity; determining the second numerical interval, among the plurality of numerical intervals, in which the second physical examination index value falls; and determining a second diabetes severity of the target user according to the fourth mapping relationship and the second numerical interval;
    determining the disease severity of the target user from the first numerical interval in which the first physical examination index value falls and the second numerical interval in which the second physical examination index value falls specifically comprises: if the first diabetes severity and the second diabetes severity differ, obtaining a first weight for the first recognition model and a second weight for the second recognition model according to the accuracy rate or adoption rate that users have fed back for the two prediction approaches, namely the first recognition model and the second recognition model; when the first weight is greater than the second weight, taking the first diabetes severity as the disease severity of the target user; and when the second weight is greater than the first weight, taking the second diabetes severity as the disease severity of the target user.
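The interval grading and weight-based tie-breaking of claim 5 can be sketched as follows. The fixed step width, the three severity labels, the thresholds, and the weights are all hypothetical; the claim only requires intervals that increase by a predetermined rule and weights derived from user feedback.

```python
from bisect import bisect_right


def make_grader(threshold, step, labels):
    """Build an interval-to-severity mapping above `threshold`.

    Intervals are [threshold, threshold + step), [threshold + step,
    threshold + 2 * step), and so on; the last label is open-ended.
    """
    # Upper boundaries of every interval except the open-ended last one.
    boundaries = [threshold + step * i for i in range(1, len(labels))]

    def grade(value):
        if value < threshold:
            return None  # below the diagnostic threshold: no severity assigned
        return labels[bisect_right(boundaries, value)]

    return grade


def resolve(first_grade, second_grade, first_weight, second_weight):
    """Pick one severity when the two models disagree, by feedback weight."""
    if first_grade == second_grade:
        return first_grade
    # The claim does not cover equal weights; falling back to the first
    # model in that case is an assumption of this sketch.
    return first_grade if first_weight >= second_weight else second_grade


LABELS = ["mild", "moderate", "severe"]          # hypothetical severity grades
grade_fasting = make_grader(7.0, 2.0, LABELS)    # third mapping relationship
grade_postprandial = make_grader(11.1, 3.0, LABELS)  # fourth mapping relationship

first = grade_fasting(7.5)          # falls in [7.0, 9.0)
second = grade_postprandial(15.0)   # falls in the second interval
final = resolve(first, second, first_weight=0.6, second_weight=0.4)
```

Using `bisect_right` keeps the lookup logarithmic in the number of intervals and makes each lower boundary inclusive, which matches the "greater than or equal to" wording of the threshold test.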
  6. The method according to claim 3, wherein inputting the feature information of the target user into the first recognition model for similarity matching against the feature information X1 specifically comprises: subjecting the feature information of the target user to data cleaning, feature extraction, missing-value filling, and outlier handling to obtain feature information in the form of structured data; and performing similarity matching between the structured feature information and the feature information X1;
    inputting the feature information of the target user into the second recognition model for similarity matching against the feature information X2 specifically comprises: subjecting the feature information of the target user to data cleaning, feature extraction, missing-value filling, and outlier handling to obtain feature information in the form of structured data; and performing similarity matching between the structured feature information and the feature information X2.
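The preprocessing steps named in claim 6 could look roughly like the sketch below. Filling missing values with the sample median and clipping outliers to the range observed among sample users are assumed strategies, and the field names are hypothetical; the claim lists the steps without specifying how each is carried out.

```python
from statistics import median


def fit_preprocessor(sample_rows, fields):
    """Learn per-field fill values and clipping bounds from sample users."""
    stats = {}
    for field in fields:
        values = sorted(r[field] for r in sample_rows if r.get(field) is not None)
        stats[field] = {"fill": median(values), "low": values[0], "high": values[-1]}
    return stats


def to_feature_vector(raw, fields, stats):
    """Turn one raw record into a structured numeric feature vector."""
    vector = []
    for field in fields:
        value = raw.get(field)
        if isinstance(value, str):
            # Data cleaning: strip whitespace and parse numeric text.
            value = float(value.strip()) if value.strip() else None
        if value is None:
            # Missing-value filling with the median seen in the sample data.
            value = stats[field]["fill"]
        # Outlier handling: clip to the range observed in the sample data.
        value = min(max(value, stats[field]["low"]), stats[field]["high"])
        vector.append(float(value))
    return vector


FIELDS = ["age", "bmi"]  # hypothetical target feature names
samples = [
    {"age": 40, "bmi": 22.0},
    {"age": 55, "bmi": 27.5},
    {"age": 63, "bmi": None},
    {"age": 70, "bmi": 31.0},
]
stats = fit_preprocessor(samples, FIELDS)
cleaned = to_feature_vector({"age": " 58 ", "bmi": None}, FIELDS, stats)
clipped = to_feature_vector({"age": 130, "bmi": "25"}, FIELDS, stats)
```

The resulting vectors are what would then be compared against X1 or X2 in the similarity-matching step.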
  7. The method according to claim 2, wherein training, on the first model training set and based on the preset regression prediction algorithm, the first recognition model for determining the first physical examination index value specifically comprises: obtaining a first training sample set, a second training sample set, a third training sample set, and a fourth training sample set from the first model training set by random sampling; training a first classifier on the first training sample set using the random forest algorithm; training a second classifier on the second training sample set using the GBDT algorithm; training a third classifier on the third training sample set using the Xgboost algorithm; training a fourth classifier on the fourth training sample set using the LightGBM algorithm; and fusing the first classifier, the second classifier, the third classifier, and the fourth classifier using the bagging method to obtain the first recognition model;
    training, on the second model training set and based on the preset regression prediction algorithm, the second recognition model for determining the second physical examination index value specifically comprises: obtaining a fifth training sample set, a sixth training sample set, a seventh training sample set, and an eighth training sample set from the second model training set by random sampling; training a fifth classifier on the fifth training sample set using the random forest algorithm; training a sixth classifier on the sixth training sample set using the GBDT algorithm; training a seventh classifier on the seventh training sample set using the Xgboost algorithm; training an eighth classifier on the eighth training sample set using the LightGBM algorithm; and fusing the fifth classifier, the sixth classifier, the seventh classifier, and the eighth classifier using the bagging method to obtain the second recognition model.
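The sample-then-fuse structure of claim 7 can be sketched with the standard library alone. The two toy learners below merely stand in for random forest, GBDT, Xgboost, and LightGBM so that the example stays self-contained. Bootstrap sampling with replacement and plain averaging of predictions are assumptions about how the random sampling and the bagging fusion are carried out.

```python
import random


def bootstrap(rows, rng):
    """Draw len(rows) rows with replacement (the random sampling step)."""
    return [rng.choice(rows) for _ in rows]


def fit_mean(rows):
    """Stand-in learner: always predicts the mean label of its sample."""
    mean = sum(y for _, y in rows) / len(rows)
    return lambda x: mean


def fit_nearest(rows):
    """Stand-in learner: predicts the label of the nearest training point."""
    def predict(x):
        nearest = min(rows, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))
        return nearest[1]
    return predict


def fit_bagged(rows, learners, seed=0):
    """Train each learner on its own bootstrap sample and average the outputs."""
    rng = random.Random(seed)
    models = [fit(bootstrap(rows, rng)) for fit in learners]
    return lambda x: sum(m(x) for m in models) / len(models)


# Hypothetical (X1, Y1) training rows; Y1 is fasting blood glucose in mmol/L.
training_set = [([1.0], 5.0), ([2.0], 6.0), ([3.0], 9.0)]
four_learners = [fit_mean, fit_nearest, fit_mean, fit_nearest]
first_recognition_model = fit_bagged(training_set, four_learners)
prediction = first_recognition_model([2.1])
```

In a real pipeline each entry of `four_learners` would wrap one of the four named algorithms; the fusion code would not need to change.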
  8. A diabetes prediction apparatus, comprising:
    an obtaining unit, configured to obtain sample user data from original health records and electronic medical record data;
    a creation unit, configured to create a numerical regression prediction model according to user features in the sample user data;
    a judging unit, configured to use the regression prediction model to determine a first physical examination index value for a target user's fasting blood glucose and a second physical examination index value for blood glucose a preset duration after a meal;
    a determining unit, configured to determine the disease severity of the target user according to the first physical examination index value and/or the second physical examination index value.
  9. The apparatus according to claim 8, wherein the creation unit specifically comprises a creation module, a training module, and a determining module;
    the creation module is specifically configured to use the fasting blood glucose value among the user features as label information Y1, and to use the target feature data of the sample user other than the fasting blood glucose value and the two-hour postprandial blood glucose value as feature information X1, to create a first model training set, the target feature data comprising at least one or more of the sample user's medical history data, hospitalization data, consultation and medication data, physical examination data, and health disclosure data;
    the training module is specifically configured to train, on the first model training set and based on a preset regression prediction algorithm, a first recognition model for determining the first physical examination index value, wherein the preset regression prediction algorithm is obtained by fusing four algorithms, namely random forest, gradient boosting decision tree (GBDT), Xgboost, and LightGBM, the first recognition model is evaluated using the mean absolute percentage error (MAPE) index, and when the MAPE value of the first recognition model is less than a preset standard comparison threshold, the first recognition model is determined to meet the evaluation standard;
    the determining module is specifically configured to determine the first mapping relationship between the feature information X1 and the label information Y1 through the first recognition model that meets the evaluation standard;
    the creation module is further specifically configured to use the two-hour postprandial blood glucose value among the user features as label information Y2, and to use the target feature data of the sample user as feature information X2, to create a second model training set;
    the training module is further specifically configured to train, on the second model training set and based on the preset regression prediction algorithm, a second recognition model for determining the second physical examination index value, wherein the second recognition model is evaluated using the MAPE index, and when the MAPE value of the second recognition model is less than a predetermined standard comparison threshold, the second recognition model is determined to meet the evaluation standard;
    the determining module is further specifically configured to determine the second mapping relationship between the feature information X2 and the label information Y2 through the second recognition model that meets the evaluation standard.
  10. The apparatus according to claim 9, wherein the judging unit specifically comprises a matching module and a determining module;
    the matching module is specifically configured to input the feature information of the target user into the first recognition model for similarity matching against the feature information X1, the feature information of the target user corresponding to the target feature data of the target user other than the fasting blood glucose value and the two-hour postprandial blood glucose value;
    the determining module is specifically configured to determine the first physical examination index value of the target user using the first mapping relationship and the feature information X1 whose similarity is the highest and greater than a preset threshold;
    the matching module is further specifically configured to input the feature information of the target user into the second recognition model for similarity matching against the feature information X2;
    the determining module is further specifically configured to determine the second physical examination index value of the target user using the second mapping relationship and the feature information X2 whose similarity is the highest and greater than a predetermined threshold.
  11. The apparatus according to claim 10, wherein the determining unit specifically comprises a determining module and a judging module;
    the determining module is configured to determine that the target user has diabetes if the first physical examination index value of the target user is greater than or equal to a first preset threshold, and/or the second physical examination index value is greater than or equal to a second preset threshold;
    the judging module is configured to determine the disease severity of the target user from the first numerical interval in which the first physical examination index value falls and/or the second numerical interval in which the second physical examination index value falls.
  12. The apparatus according to claim 11, wherein the judging module is further specifically configured to: divide the range above the first preset threshold into a plurality of numerical intervals that increase according to a predetermined numerical rule; create a third mapping relationship between the plurality of numerical intervals and degrees of diabetes severity; determine the first numerical interval, among the plurality of numerical intervals, in which the first physical examination index value falls; and determine the diabetes severity of the target user according to the third mapping relationship and the first numerical interval; and to: divide the range above the second preset threshold into a plurality of numerical intervals that increase according to a predetermined numerical rule; create a fourth mapping relationship between the plurality of numerical intervals and degrees of diabetes severity; determine the second numerical interval, among the plurality of numerical intervals, in which the second physical examination index value falls; and determine the diabetes severity of the target user according to the fourth mapping relationship and the second numerical interval;
    the judging module is further specifically configured to: if the first diabetes severity and the second diabetes severity differ, obtain a first weight for the first recognition model and a second weight for the second recognition model according to the accuracy rate or adoption rate that users have fed back for the two prediction approaches, namely the first recognition model and the second recognition model; when the first weight is greater than the second weight, take the first diabetes severity as the disease severity of the target user; and when the second weight is greater than the first weight, take the second diabetes severity as the disease severity of the target user.
  13. The apparatus according to claim 10, wherein the matching module is specifically configured to subject the feature information of the target user to data cleaning, feature extraction, missing-value filling, and outlier handling to obtain feature information in the form of structured data, and to perform similarity matching between the structured feature information and the feature information X1;
    the matching module is specifically configured to subject the feature information of the target user to data cleaning, feature extraction, missing-value filling, and outlier handling to obtain feature information in the form of structured data, and to perform similarity matching between the structured feature information and the feature information X2.
  14. The apparatus according to claim 9, wherein the training module is specifically configured to: obtain a first training sample set, a second training sample set, a third training sample set, and a fourth training sample set from the first model training set by random sampling; train a first classifier on the first training sample set using the random forest algorithm; train a second classifier on the second training sample set using the GBDT algorithm; train a third classifier on the third training sample set using the Xgboost algorithm; train a fourth classifier on the fourth training sample set using the LightGBM algorithm; and fuse the first classifier, the second classifier, the third classifier, and the fourth classifier using the bagging method to obtain the first recognition model;
    the training module is further specifically configured to: obtain a fifth training sample set, a sixth training sample set, a seventh training sample set, and an eighth training sample set from the second model training set by random sampling; train a fifth classifier on the fifth training sample set using the random forest algorithm; train a sixth classifier on the sixth training sample set using the GBDT algorithm; train a seventh classifier on the seventh training sample set using the Xgboost algorithm; train an eighth classifier on the eighth training sample set using the LightGBM algorithm; and fuse the fifth classifier, the sixth classifier, the seventh classifier, and the eighth classifier using the bagging method to obtain the second recognition model.
  15. A non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement a diabetes prediction method comprising: obtaining sample user data from original health records and electronic medical record data; creating a numerical regression prediction model according to user features in the sample user data; using the regression prediction model to determine a first physical examination index value for a target user's fasting blood glucose and a second physical examination index value for blood glucose a preset duration after a meal; and determining the disease severity of the target user according to the first physical examination index value and/or the second physical examination index value.
  16. The non-volatile readable storage medium according to claim 15, wherein the user features are extracted from the sample user data using regular expressions, and the preset duration is two hours; and the computer-readable instructions, when executed by the processor, implement the creating of a numerical regression prediction model according to the user features in the sample user data, which specifically comprises: using the fasting blood glucose value among the user features as label information Y1, and using the target feature data of the sample user other than the fasting blood glucose value and the two-hour postprandial blood glucose value as feature information X1, to create a first model training set, the target feature data comprising at least one or more of the sample user's medical history data, hospitalization data, consultation and medication data, physical examination data, and health disclosure data; training, on the first model training set and based on a preset regression prediction algorithm, a first recognition model for determining the first physical examination index value, wherein the preset regression prediction algorithm is obtained by fusing four algorithms, namely random forest, gradient boosting decision tree (GBDT), Xgboost, and LightGBM, the first recognition model is evaluated using the mean absolute percentage error (MAPE) index, and when the MAPE value of the first recognition model is less than a preset standard comparison threshold, the first recognition model is determined to meet the evaluation standard, and the first mapping relationship between the feature information X1 and the label information Y1 can be determined through the first recognition model that meets the evaluation standard; using the two-hour postprandial blood glucose value among the user features as label information Y2, and using the target feature data of the sample user as feature information X2, to create a second model training set; and training, on the second model training set and based on the preset regression prediction algorithm, a second recognition model for determining the second physical examination index value, wherein the second recognition model is evaluated using the MAPE index, and when the MAPE value of the second recognition model is less than a predetermined standard comparison threshold, the second recognition model is determined to meet the evaluation standard, and the second mapping relationship between the feature information X2 and the label information Y2 can be determined through the second recognition model that meets the evaluation standard.
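Claim 16 states that user features are extracted with regular expressions. A minimal sketch of that idea follows; the free-text record format, the two patterns, and the field names are hypothetical, since the claim does not describe the layout of the health records.

```python
import re

# Hypothetical patterns for English free-text records; real records may differ.
FASTING = re.compile(
    r"fasting (?:blood )?glucose\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
POSTPRANDIAL = re.compile(
    r"2\s*h(?:our)?s?[- ]postprandial (?:blood )?glucose\s*[:=]\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE)


def extract_labels(text):
    """Pull the two glucose readings (mmol/L) out of a free-text record.

    Returns None for a reading that does not appear in the text.
    """
    fasting = FASTING.search(text)
    postprandial = POSTPRANDIAL.search(text)
    return {
        "fasting": float(fasting.group(1)) if fasting else None,
        "postprandial_2h": float(postprandial.group(1)) if postprandial else None,
    }


note = ("Visit 2019-03-01. Fasting blood glucose: 7.8 mmol/L; "
        "2h postprandial glucose: 12.4 mmol/L.")
labels = extract_labels(note)
```

The `fasting` value would serve as label Y1 and the `postprandial_2h` value as label Y2 when the two training sets are built.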
  17. The non-volatile readable storage medium according to claim 16, wherein the computer-readable instructions, when executed by the processor, implement the using of the regression prediction model to determine the first physical examination index value for the target user's fasting blood glucose and the second physical examination index value for blood glucose a preset duration after a meal, which specifically comprises: inputting the feature information of the target user into the first recognition model for similarity matching against the feature information X1, the feature information of the target user corresponding to the target feature data other than the fasting blood glucose value and the two-hour postprandial blood glucose value; determining the first physical examination index value of the target user using the first mapping relationship and the feature information X1 whose similarity is the highest and greater than a preset threshold; inputting the feature information of the target user into the second recognition model for similarity matching against the feature information X2; and determining the second physical examination index value of the target user using the second mapping relationship and the feature information X2 whose similarity is the highest and greater than a predetermined threshold.
  18. A computer device, comprising a non-volatile readable storage medium, a processor, and computer-readable instructions stored on the non-volatile readable storage medium and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements a diabetes prediction method comprising: obtaining sample user data from original health records and electronic medical record data; creating a numerical regression prediction model according to user features in the sample user data; using the regression prediction model to determine a first physical examination index value for a target user's fasting blood glucose and a second physical examination index value for blood glucose a preset duration after a meal; and determining the disease severity of the target user according to the first physical examination index value and/or the second physical examination index value.
  19. The computer device according to claim 18, wherein the user features are extracted from the sample user data using regular expressions, and the preset duration is two hours; and the computer-readable instructions, when executed by the processor, implement the creating of a numerical regression prediction model according to the user features in the sample user data, which specifically comprises: using the fasting blood glucose value among the user features as label information Y1, and using the target feature data of the sample user other than the fasting blood glucose value and the two-hour postprandial blood glucose value as feature information X1, to create a first model training set, the target feature data comprising at least one or more of the sample user's medical history data, hospitalization data, consultation and medication data, physical examination data, and health disclosure data; training, on the first model training set and based on a preset regression prediction algorithm, a first recognition model for determining the first physical examination index value, wherein the preset regression prediction algorithm is obtained by fusing four algorithms, namely random forest, gradient boosting decision tree (GBDT), Xgboost, and LightGBM, the first recognition model is evaluated using the mean absolute percentage error (MAPE) index, and when the MAPE value of the first recognition model is less than a preset standard comparison threshold, the first recognition model is determined to meet the evaluation standard, and the first mapping relationship between the feature information X1 and the label information Y1 can be determined through the first recognition model that meets the evaluation standard; using the two-hour postprandial blood glucose value among the user features as label information Y2, and using the target feature data of the sample user as feature information X2, to create a second model training set; and training, on the second model training set and based on the preset regression prediction algorithm, a second recognition model for determining the second physical examination index value, wherein the second recognition model is evaluated using the MAPE index, and when the MAPE value of the second recognition model is less than a predetermined standard comparison threshold, the second recognition model is determined to meet the evaluation standard, and the second mapping relationship between the feature information X2 and the label information Y2 can be determined through the second recognition model that meets the evaluation standard.
  20. The computer device according to claim 19, wherein the computer-readable instructions, when executed by the processor, implement the using of the regression prediction model to determine the first physical examination index value for the target user's fasting blood glucose and the second physical examination index value for blood glucose a preset duration after a meal, which specifically comprises: inputting the feature information of the target user into the first recognition model for similarity matching against the feature information X1, the feature information of the target user corresponding to the target feature data other than the fasting blood glucose value and the two-hour postprandial blood glucose value; determining the first physical examination index value of the target user using the first mapping relationship and the feature information X1 whose similarity is the highest and greater than a preset threshold; inputting the feature information of the target user into the second recognition model for similarity matching against the feature information X2; and determining the second physical examination index value of the target user using the second mapping relationship and the feature information X2 whose similarity is the highest and greater than a predetermined threshold.
PCT/CN2019/117217 2019-03-12 2019-11-11 Diabetes prediction method and apparatus, storage medium, and computer device WO2020181805A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910185079.2 2019-03-12
CN201910185079.2A CN110197720A (en) 2019-03-12 2019-03-12 Prediction technique and device, storage medium, the computer equipment of diabetes

Publications (1)

Publication Number Publication Date
WO2020181805A1 (en) 2020-09-17

Family

ID=67751751

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117217 WO2020181805A1 (en) 2019-03-12 2019-11-11 Diabetes prediction method and apparatus, storage medium, and computer device

Country Status (2)

Country Link
CN (1) CN110197720A (en)
WO (1) WO2020181805A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164454A (en) * 2020-10-10 2021-01-01 联仁健康医疗大数据科技股份有限公司 Diagnosis prediction method and device and electronic equipment
CN113035357A (en) * 2021-04-06 2021-06-25 昆明医科大学第一附属医院 Diabetic kidney disease risk assessment system
CN113057586A (en) * 2021-03-17 2021-07-02 上海电气集团股份有限公司 Disease early warning method, device, equipment and medium
CN113113134A (en) * 2021-04-07 2021-07-13 闵东 Clinical etiology prejudgment device and system
CN113488166A (en) * 2021-07-28 2021-10-08 联仁健康医疗大数据科技股份有限公司 Diabetes data analysis model training and data management method, device and equipment
CN113808744A (en) * 2021-09-22 2021-12-17 河北工程大学 Diabetes risk prediction method, device, equipment and storage medium
CN113921144A (en) * 2021-09-23 2022-01-11 清华大学 Disease prediction set processing method and device, electronic equipment and storage medium
CN116189896A (en) * 2023-04-24 2023-05-30 北京快舒尔医疗技术有限公司 Cloud-based diabetes health data early warning method and system
CN117112729A (en) * 2023-08-21 2023-11-24 北京科文思数据管理有限公司 Medical resource docking method and system based on artificial intelligence
CN117494688A (en) * 2023-12-29 2024-02-02 深圳智能思创科技有限公司 Form information extraction method, device, equipment and storage medium

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197720A (en) * 2019-03-12 2019-09-03 平安科技(深圳)有限公司 Prediction technique and device, storage medium, the computer equipment of diabetes
CN111429289B (en) * 2020-03-23 2023-03-24 平安医疗健康管理股份有限公司 Single disease identification method and device, computer equipment and storage medium
CN111599470B (en) * 2020-04-23 2022-07-29 中国科学院上海技术物理研究所 Method for improving near-infrared noninvasive blood glucose detection precision
CN111710420B (en) * 2020-05-15 2024-03-19 深圳先进技术研究院 Complication onset risk prediction method, system, terminal and storage medium based on electronic medical record big data
CN111696662A (en) * 2020-05-26 2020-09-22 平安科技(深圳)有限公司 Disease prediction method, device and storage medium
CN111739646A (en) * 2020-06-22 2020-10-02 平安医疗健康管理股份有限公司 Data verification method and device, computer equipment and readable storage medium
CN111657873B (en) * 2020-07-07 2021-06-22 四川长虹电器股份有限公司 Physical constitution prediction method based on visible light and near infrared spectrum technology
CN111797284A (en) * 2020-07-08 2020-10-20 北京康健德科技有限公司 Graph database construction method and device, electronic equipment and storage medium
CN112382394A (en) * 2020-11-05 2021-02-19 苏州麦迪斯顿医疗科技股份有限公司 Event processing method and device, electronic equipment and storage medium
CN113658704A (en) * 2021-09-17 2021-11-16 平安国际智慧城市科技股份有限公司 Diabetes risk prediction device, apparatus and storage medium
CN113796852B (en) * 2021-09-30 2023-09-08 太原理工大学 Diabetes foot prediction method based on gradient lifting decision tree model algorithm
CN114242247A (en) * 2021-12-30 2022-03-25 吉林大学第一医院 Non-obese MAFLD prediction system, device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682412A (en) * 2016-12-22 2017-05-17 浙江大学 Diabetes prediction method based on medical examination data
US20180158552A1 (en) * 2016-12-01 2018-06-07 University Of Southern California Interpretable deep learning framework for mining and predictive modeling of health care data
CN109308545A (en) * 2018-08-21 2019-02-05 中国平安人寿保险股份有限公司 The method, apparatus, computer equipment and storage medium of diabetes probability are suffered from prediction
CN109378072A (en) * 2018-10-13 2019-02-22 中山大学 A kind of abnormal fasting blood sugar method for early warning based on integrated study Fusion Model
CN110197720A (en) * 2019-03-12 2019-09-03 平安科技(深圳)有限公司 Prediction technique and device, storage medium, the computer equipment of diabetes

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2850518Y (en) * 2005-10-24 2006-12-27 北京软测科技有限公司 Portable diabetes condition monitoring apparatus
US20150347707A1 (en) * 2014-05-30 2015-12-03 Anthony Michael Albisser Computer-Implemented System And Method For Improving Glucose Management Through Cloud-Based Modeling Of Circadian Profiles

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180158552A1 (en) * 2016-12-01 2018-06-07 University Of Southern California Interpretable deep learning framework for mining and predictive modeling of health care data
CN106682412A (en) * 2016-12-22 2017-05-17 浙江大学 Diabetes prediction method based on medical examination data
CN109308545A (en) * 2018-08-21 2019-02-05 中国平安人寿保险股份有限公司 The method, apparatus, computer equipment and storage medium of diabetes probability are suffered from prediction
CN109378072A (en) * 2018-10-13 2019-02-22 中山大学 A kind of abnormal fasting blood sugar method for early warning based on integrated study Fusion Model
CN110197720A (en) * 2019-03-12 2019-09-03 平安科技(深圳)有限公司 Prediction technique and device, storage medium, the computer equipment of diabetes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BAI, HONGLIANG ET AL.: "Diabetes Prediction Models Based on Health Check-Up Population in a Hospital", PRACTICAL PREVENTIVE MEDICINE, vol. 25, no. 1, 31 January 2018 (2018-01-31), DOI: 20200109115144A *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164454A (en) * 2020-10-10 2021-01-01 联仁健康医疗大数据科技股份有限公司 Diagnosis prediction method and device and electronic equipment
CN113057586B (en) * 2021-03-17 2024-03-12 上海电气集团股份有限公司 Disease early warning method, device, equipment and medium
CN113057586A (en) * 2021-03-17 2021-07-02 上海电气集团股份有限公司 Disease early warning method, device, equipment and medium
CN113035357A (en) * 2021-04-06 2021-06-25 昆明医科大学第一附属医院 Diabetic kidney disease risk assessment system
CN113113134A (en) * 2021-04-07 2021-07-13 闵东 Clinical etiology prejudgment device and system
CN113488166A (en) * 2021-07-28 2021-10-08 联仁健康医疗大数据科技股份有限公司 Diabetes data analysis model training and data management method, device and equipment
CN113808744A (en) * 2021-09-22 2021-12-17 河北工程大学 Diabetes risk prediction method, device, equipment and storage medium
CN113921144A (en) * 2021-09-23 2022-01-11 清华大学 Disease prediction set processing method and device, electronic equipment and storage medium
CN116189896B (en) * 2023-04-24 2023-08-08 北京快舒尔医疗技术有限公司 Cloud-based diabetes health data early warning method and system
CN116189896A (en) * 2023-04-24 2023-05-30 北京快舒尔医疗技术有限公司 Cloud-based diabetes health data early warning method and system
CN117112729A (en) * 2023-08-21 2023-11-24 北京科文思数据管理有限公司 Medical resource docking method and system based on artificial intelligence
CN117494688A (en) * 2023-12-29 2024-02-02 深圳智能思创科技有限公司 Form information extraction method, device, equipment and storage medium
CN117494688B (en) * 2023-12-29 2024-03-29 深圳智能思创科技有限公司 Form information extraction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110197720A (en) 2019-09-03

Similar Documents

Publication Publication Date Title
WO2020181805A1 (en) Diabetes prediction method and apparatus, storage medium, and computer device
US9861308B2 (en) Method and system for monitoring stress conditions
US10912508B2 (en) Method and system for assessing mental state
US20140344208A1 (en) Context-aware prediction in medical systems
CN111613289A (en) Individualized drug dose prediction method, individualized drug dose prediction device, electronic equipment and storage medium
WO2021217867A1 (en) Xgboost-based data classification method and apparatus, computer device, and storage medium
WO2020181806A1 (en) Method and apparatus for predicting future blood glucose level and computer device
WO2021151295A1 (en) Method, apparatus, computer device, and medium for determining patient treatment plan
CN108648827A (en) Cardiovascular and cerebrovascular disease Risk Forecast Method and device
WO2021151327A1 (en) Triage data processing method and apparatus, and device and medium
CN114783580B (en) Medical data quality evaluation method and system
Xiang et al. Integrated Architectures for Predicting Hospital Readmissions Using Machine Learning
CN112542242A (en) Data transformation/symptom scoring
WO2020206172A1 (en) Confidence evaluation to measure trust in behavioral health survey results
CN112967803A (en) Early mortality prediction method and system for emergency patients based on integrated model
CN115938593A (en) Medical record information processing method, device and equipment and computer readable storage medium
TWI790479B (en) Physiological status evaluation method and physiological status evaluation device
CN114694779A (en) Method and system for improving nursing satisfaction degree of ICU patient
CN114068036A (en) Infection propagation prediction method and system based on Internet of things perception
KR102167161B1 (en) Systme and method for recommanding symptoms of diseases
TWM605545U (en) Risk assessment apparatus for chronic disease
US20220270716A1 (en) Confidence evaluation to measure trust in behavioral health survey results
JP2021507392A (en) Learning and applying contextual similarities between entities
WO2024091292A1 (en) Cancer progression assessment method and system thereof
KR20230116458A (en) Method and apparatus for processing health examination data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919456

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919456

Country of ref document: EP

Kind code of ref document: A1