WO2018149300A1 - Method, apparatus, device and computer-readable storage medium for detecting disease probability

Method, apparatus, device and computer-readable storage medium for detecting disease probability

Info

Publication number
WO2018149300A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
disease probability
feature
decision tree
value
Prior art date
Application number
PCT/CN2018/074808
Other languages
English (en)
French (fr)
Inventor
李菲菲
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Priority to US16/305,884 (published as US20200126662A1)
Priority to SG11201810380VA
Priority to JP2018559946A (published as JP2019521418A)
Publication of WO2018149300A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G16H70/00: ICT specially adapted for the handling or processing of medical references
    • G16H70/60: ICT specially adapted for the handling or processing of medical references relating to pathologies

Definitions

  • The present application relates to the field of disease information processing, and in particular to a method, an apparatus, a device and a computer-readable storage medium for detecting disease probability.
  • The main purpose of the present application is to provide a method, an apparatus, a device and a computer-readable storage medium for detecting disease probability, intended to solve the technical problem that disease probability detection in the prior art takes a long time and is costly.
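  • The present application provides a method for detecting disease probability, and the method includes:
  • collecting various data associated with a user, and performing feature processing on each collected data;
  • constructing a multi-dimensional data set from the feature-processed data;
  • randomly sampling the multi-dimensional data set to divide it into a test set and a training set;
  • building a model based on the training set to obtain a regression decision tree;
  • testing the regression decision tree according to the test set to calculate the user's disease probability.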
  • The present application further provides a device for detecting disease probability, and the device includes:
  • a processing module configured to collect various data associated with the user, and perform feature processing on each collected data;
  • a constructing module configured to construct a multi-dimensional data set from the feature-processed data;
  • a dividing module configured to randomly sample the multi-dimensional data set to divide it into a test set and a training set;
  • a building module configured to build a model based on the training set to obtain a regression decision tree;
  • a calculation module configured to test the regression decision tree according to the test set to calculate the disease probability of the user.
  • The present application further provides a detection apparatus for disease probability, comprising a processor and a memory storing a disease probability detection program; the processor is configured to execute the disease probability detection program to implement the steps of the method for detecting disease probability described above.
  • The present application further provides a computer-readable storage medium storing a disease probability detection program, which is executed by a processor to implement the steps of the method for detecting disease probability described above.
  • The method and device for detecting disease probability proposed by the present application first collect various data associated with a user and perform feature processing on each collected data; a multi-dimensional data set is then constructed from the feature-processed data and randomly sampled to divide it into a test set and a training set; a model is built based on the training set to obtain a regression decision tree; finally, the regression decision tree is tested according to the test set to calculate the user's disease probability.
  • This solution builds a model from the collected data and calculates the user's disease probability from that model, without needing a physical examination to detect the disease probability, so detection is efficient and its cost is low.
  • FIG. 1 is a schematic flow chart of a first embodiment of a method for detecting disease probability according to the present application;
  • FIG. 2 is a schematic diagram of a refinement process of step S10 in FIG. 1;
  • FIG. 3 is a schematic diagram of a refinement process of step S20 in FIG. 1;
  • FIG. 4 is a schematic diagram of a refinement process of step S50 in FIG. 1;
  • FIG. 5 is a schematic diagram of functional modules of a first embodiment of a device for detecting disease probability according to the present application
  • FIG. 6 is a schematic diagram of a refinement function module of the processing module 10 of FIG. 5;
  • FIG. 7 is a schematic diagram of a refinement function module of the construction module 20 of FIG. 5;
  • FIG. 8 is a schematic diagram of a refinement function module of the calculation module 50 of FIG. 5;
  • FIG. 9 is a schematic structural diagram of a device in a hardware operating environment according to an embodiment of the present application.
  • The main solution of the embodiments of the present application is: collecting various data associated with the user, performing feature processing on each collected data, constructing a multi-dimensional data set from the feature-processed data, randomly sampling the multi-dimensional data set to divide it into a test set and a training set, building a model based on the training set to obtain a regression decision tree, and finally testing the regression decision tree according to the test set to calculate the user's disease probability.
  • This solves the problem that existing disease probability detection must be performed by means of physical examination and laboratory tests, so the disease probability cannot be detected quickly and the cost of detection is high.
  • the present application provides a method for detecting disease probability.
  • FIG. 1 is a schematic flowchart diagram of a first embodiment of a method for detecting disease probability according to the present application.
  • the method for detecting the probability of disease includes:
  • Step S10 collecting various data associated with the user, and performing feature processing on each collected data
  • In this embodiment, the method for detecting disease probability is preferably applied to an insurance system. It can be understood that, before being insured, a user reports health information related to medical examinations, or some behavior information, to the insurance system, which performs a comprehensive analysis to detect the user's disease probability and then determines whether to insure. Therefore, collecting the data associated with the user in the database actually means collecting the data associated with the user in the database corresponding to the insurance system.
  • In this embodiment, the data includes behavior information and health information, which represent information in different dimensions.
  • the step S10 includes:
  • Step S11 performing feature analysis on each collected data to determine a feature type of each data
  • Step S12 when the data is missing-value data, performing mean imputation or multiple imputation on the missing-value data;
  • Step S13 when the data is abnormal-value data, screening the abnormal-value data to select the data whose abnormal value is less than the preset threshold, and processing the selected data as missing-value data.
  • The feature types of the data include abnormal values (outliers) and missing values.
  • After the feature type of each data is determined, if the data is found to be missing-value data, mean imputation or multiple imputation is performed on it; which imputation method is used is decided according to the actual situation.
  • Mean imputation has two modes: 1) imputation using the average value; 2) imputation using the mode.
  • The attributes of the data are first divided into interval (numeric) type and non-interval type. If the missing value is of interval type, it is imputed with the average of the existing values of that attribute; if the missing value is of non-interval type, then, following the principle of the mode in statistics, the mode of the attribute (that is, the value with the highest frequency of occurrence) is used to fill in the missing value.
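  • The two mean-imputation modes described above can be sketched as follows. This is a minimal illustration only: the function name impute_column, the use of None to mark a missing value, and the sample values are assumptions made for this example and do not come from the application.

```python
from statistics import mean, mode

def impute_column(values, numeric):
    """Fill None entries with the mean (interval-type attribute)
    or the mode (non-interval attribute) of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from; leave unchanged
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample columns: one numeric, one categorical.
ages = impute_column([30, None, 50, 40], numeric=True)
smoker = impute_column(["no", "yes", None, "no"], numeric=False)
```

  • Here the missing age is replaced by the average of the three observed ages, and the missing categorical value by the most frequent observed category.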
  • Multiple imputation (MI) treats the value to be imputed as random, with its value derived from the observed values. In practice, the value to be imputed is usually estimated first, and different noise is then added to form multiple sets of candidate imputed values.
  • Multiple imputation is carried out in three steps: (1) a set of possible imputed values is generated for each missing value, reflecting the uncertainty of the non-response model; each value can be used to impute the missing values in the data set, producing several complete data sets; (2) each imputed data set is analyzed using statistical methods for complete data sets; (3) the results from the imputed data sets are selected according to a scoring function to generate the final imputed values.
  • For example, suppose a set of data contains three variables Y1, Y2 and Y3 whose joint distribution is normal, and the data is divided into three groups: group A retains the original data, group B is missing only Y3, and group C is missing Y1 and Y2.
  • During multiple imputation, group A is not processed; for group B, a set of estimated values of Y3 is generated (by regressing Y3 on Y1 and Y2); for group C, a set of paired estimated values of Y1 and Y2 is generated (by regressing Y1 and Y2 on Y3).
  • For groups B and C, the complete samples are randomly drawn to form m groups (m being the number of candidate sets of imputed values); each group needs only enough cases to estimate the parameters effectively.
  • The estimation method used is the maximum likelihood method, and its specific implementation on a computer is the expectation-maximization (EM) algorithm.
  • In this way, mean imputation or multiple imputation can be applied to the missing-value data.
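  • The "estimate, then add different noise" idea behind multiple imputation can be sketched as follows. This is a simplified illustration under stated assumptions: it uses two variables instead of three, fits the regression by ordinary least squares rather than the EM algorithm named in the text, and omits the final scoring step; the names fit_line and multiple_impute and the sample data are hypothetical.

```python
import random
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x (assumes xs are not all equal)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def multiple_impute(rows, m=5, seed=0):
    """rows: list of (y1, y3) pairs where y3 may be None.
    Returns m completed copies of rows, each with different noise."""
    rng = random.Random(seed)
    complete = [(x, y) for x, y in rows if y is not None]
    a, b = fit_line([x for x, _ in complete], [y for _, y in complete])
    # Spread of the residuals sets the scale of the added noise.
    resid_sd = pstdev([y - (a + b * x) for x, y in complete])
    datasets = []
    for _ in range(m):
        datasets.append([
            (x, y if y is not None else a + b * x + rng.gauss(0, resid_sd))
            for x, y in rows
        ])
    return datasets

# Hypothetical data: the fourth record is missing its second variable.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, 9.8)]
candidates = multiple_impute(rows, m=3)
```

  • Each of the m candidate data sets keeps the observed values unchanged and differs only in the imputed entry, which is what lets later analysis reflect the uncertainty of the imputation.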
  • If the data is found to be abnormal-value data, the abnormal-value data is screened to select the data whose abnormal value is less than the preset threshold.
  • The preset threshold is set according to the specific situation. The data selected in this way can then be processed as missing-value data; the processing of missing-value data has been described in detail above and is not repeated here.
  • Imputing the data is equivalent to filling in the content of data that has missing values. The content is filled in because some information in the data collected from the database may not have been filled in completely, and a disease probability calculated from incomplete data may be inaccurate. Therefore, in this embodiment, filling in the data with missing values improves the saturation of the data and helps ensure the accuracy of the subsequent disease probability calculation.
  • Outliers are screened to eliminate data with more serious abnormalities, so that such data does not affect the disease probability detection result.
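  • One possible reading of the outlier screening step is sketched below. The application does not define how the abnormal value is measured, so everything here is an assumption: the score is taken as the distance from the median, normal_limit is a hypothetical bound below which a value counts as normal, mild outliers (score below the preset threshold) are turned into missing values for later imputation, and severe outliers are removed.

```python
from statistics import median

def screen_outliers(values, normal_limit, threshold):
    """Return (kept, dropped_count).

    score = |v - median(values)| (an assumed abnormality measure)
    score <= normal_limit             -> normal, kept as is
    normal_limit < score < threshold  -> mild outlier, replaced by None
    score >= threshold                -> severe outlier, removed
    """
    center = median(values)
    kept, dropped = [], 0
    for v in values:
        score = abs(v - center)
        if score <= normal_limit:
            kept.append(v)
        elif score < threshold:
            kept.append(None)  # handled later as a missing value
        else:
            dropped += 1
    return kept, dropped

# Hypothetical blood-pressure readings with one mild and one severe outlier.
kept, dropped = screen_outliers([120, 118, 125, 160, 400],
                                normal_limit=20, threshold=100)
```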
  • Step S20 constructing a multi-dimensional data set according to each data processed by the feature
  • the step S20 includes:
  • Step S21 determining feature saturation corresponding to each data after the feature processing
  • Step S22 screening the data according to the feature saturation to select the data whose feature saturation reaches a preset saturation;
  • Step S23 constructing a multi-dimensional data set according to the selected data.
  • In this embodiment, the feature saturation corresponding to each feature-processed data is determined first, and the data is then screened according to the feature saturation to select the data whose feature saturation reaches the preset saturation.
  • Finally, a multi-dimensional data set is constructed from the selected data. This is equivalent to cleaning the collected data to select the data that meets the requirements, so that the subsequently calculated disease probability is more accurate.
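  • Steps S21 to S23 can be sketched as follows. The application does not define feature saturation; this sketch assumes it is the fraction of records in which a feature has a non-missing value. The names saturation and build_dataset, the 0.7 cutoff and the sample features are hypothetical.

```python
def saturation(column):
    """Fraction of non-missing entries in a feature column (assumed definition)."""
    return sum(v is not None for v in column) / len(column)

def build_dataset(features, min_saturation):
    """Keep only the features whose saturation reaches the preset level."""
    return {name: col for name, col in features.items()
            if saturation(col) >= min_saturation}

# Hypothetical feature columns, one list entry per user record.
features = {
    "age": [30, 41, 50, 38],
    "bmi": [22.5, None, 27.1, 24.0],
    "hdl": [None, None, 1.2, None],
}
dataset = build_dataset(features, min_saturation=0.7)
```

  • With these sample values, the sparsely populated "hdl" feature is dropped, and the remaining features form the multi-dimensional data set.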
  • Step S30 performing random sampling on the multi-dimensional data set to divide a test set and a training set
  • After the multi-dimensional data set is obtained, it is randomly sampled to divide the multi-dimensional data into a test set and a training set.
  • The sizes of the test set and the training set are not limited and are set according to the specific situation, but the training set must be larger than the test set; for example, 70% of the data is assigned to the training set and 30% to the test set.
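  • The 70%/30% random division can be sketched as follows. The function name split, the fixed seed and the stand-in records are assumptions made for the example.

```python
import random

def split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical records, identified here only by their index.
train, test = split(list(range(10)))
```

  • Shuffling before cutting means that every record has the same chance of landing in either set, which is what the random sampling in step S30 requires.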
  • Step S40 constructing a model based on the training set, and obtaining a regression decision tree
  • After the test set and the training set are divided, a model is built based on the training set to obtain a regression decision tree.
  • The model is built from the training set in the same way that models are conventionally built from a data set, so this is not described further here.
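  • Since the application leaves model construction to existing techniques, the following is only a toy illustration of what fitting a regression tree involves: a single-split tree (a "stump") on one feature, chosen by minimizing squared error. The function name fit_stump, the single age feature and the 0/1 outcome labels are assumptions; a real model would use many features, deeper trees and many trees.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature.
    Returns a predict(x) function giving the mean outcome of the matching side."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

# Hypothetical training data: age versus a 0/1 disease outcome.
predict = fit_stump([25, 32, 47, 51, 62], [0.0, 0.0, 1.0, 1.0, 1.0])
```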
  • Step S50 testing the regression decision tree according to the test set to calculate a disease probability of the user.
  • After the regression decision tree is obtained, it is tested according to the test set to calculate the disease probability of the user.
  • the step S50 includes:
  • Step S51 inputting data of the test set into the regression decision tree to obtain a number of values corresponding to the number of trees in the regression decision tree;
  • Step S52 computing a weighted average of the values using the weight values of the trees in the regression decision tree to obtain a total value of the regression decision tree;
  • Step S53 taking the total value as the disease probability of the user.
  • In this embodiment, testing the regression decision tree according to the test set to calculate the user's disease probability essentially means inputting the data of the test set into the regression decision tree and obtaining a number of values corresponding to the number of trees in the regression decision tree.
  • For example, if the regression decision tree currently contains 3000-5000 trees, the number of values obtained is also 3000-5000.
  • Because the weight values of the trees in the regression decision tree are preset, once the values have been obtained, a weighted average of the values using the weight values of the trees gives the total value of the regression decision tree.
  • For example, suppose the regression decision tree has four trees with weights of 0.3, 0.15, 0.2 and 0.35, and the values obtained from the four trees are A, B, C and D, respectively.
  • The resulting total value is Q = 0.3 * A + 0.15 * B + 0.2 * C + 0.35 * D.
  • The total value is the user's disease probability.
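  • The four-tree example above can be written out as a short sketch. The weights are the ones given in the text; the function name ensemble_probability and the per-tree output values standing in for A, B, C and D are hypothetical.

```python
def ensemble_probability(tree_values, weights):
    """Combine per-tree outputs into one total value using preset tree weights."""
    if len(tree_values) != len(weights):
        raise ValueError("one weight is required per tree")
    return sum(w * v for w, v in zip(weights, tree_values))

weights = [0.3, 0.15, 0.2, 0.35]   # preset weights from the example
values = [0.2, 0.4, 0.1, 0.3]      # hypothetical tree outputs A, B, C, D
q = ensemble_probability(values, weights)
```

  • Because the four weights sum to 1, this weighted sum is also the weighted average of the tree outputs.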
  • In this embodiment, the user is a user whose disease status is unknown; the regression decision tree model outputs its prediction, which gives that user's disease probability.
  • The method for detecting disease probability proposed in this embodiment first collects the data associated with the user and performs feature processing on each collected data; a multi-dimensional data set is then constructed from the feature-processed data and randomly sampled to divide it into a test set and a training set; a model is built based on the training set to obtain a regression decision tree; finally, the regression decision tree is tested according to the test set to calculate the user's disease probability.
  • This solution builds a model from the collected data and calculates the user's disease probability from that model, without needing a physical examination to detect the disease probability, so detection is efficient and its cost is low.
  • the above-mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.
  • the application further provides a device for detecting the probability of disease.
  • FIG. 5 is a schematic diagram of functional modules of a first embodiment of a disease probability detecting apparatus 100 of the present application.
  • The functional block diagram shown in FIG. 5 is merely an exemplary diagram of a preferred embodiment; those skilled in the art can easily add new functional modules around the functional modules of the disease probability detecting apparatus 100 shown in FIG. 5.
  • The name of each functional module is a custom name used only to assist in understanding the program function blocks of the disease probability detecting apparatus 100, and is not intended to limit the technical solution of the present application.
  • the disease probability detecting apparatus 100 includes:
  • the processing module 10 is configured to collect various data associated with the user, and perform feature processing on each collected data.
  • the constructing module 20 is configured to construct a multi-dimensional data set according to each data processed by the feature;
  • a dividing module 30 configured to randomly sample the multi-dimensional data set to divide a test set and a training set
  • the building module 40 is configured to build a model based on the training set to obtain a regression decision tree
  • the calculating module 50 is configured to test the regression decision tree according to the test set to calculate a disease probability of the user.
  • In this embodiment, the detection device for disease probability is preferably applied to an insurance system. It can be understood that, before being insured, a user reports health information related to medical examinations, or some behavior information, to the insurance system, which performs a comprehensive analysis to detect the user's disease probability and then determines whether to insure. Therefore, when the processing module 10 collects the data associated with the user in the database, it actually collects the data associated with the user in the database corresponding to the insurance system. In this embodiment, the data includes behavior information and health information, which represent information in different dimensions.
  • After collecting the data associated with the user, the processing module 10 performs feature processing on each collected data. Specifically, referring to FIG. 6, the processing module 10 includes:
  • the feature analyzing unit 11 is configured to perform feature analysis on each collected data to determine a feature type of each data
  • the interpolation processing unit 12 is configured to perform mean imputation or multiple imputation on the missing-value data when the data is missing-value data;
  • the filtering processing unit 13 is configured to screen the abnormal-value data when the data is abnormal-value data, to select the data whose abnormal value is less than the preset threshold, and to process the selected data as missing-value data.
  • In this embodiment, the feature analyzing unit 11 performs feature analysis on each collected data to determine the feature type of each data.
  • The feature types of the data include abnormal values (outliers) and missing values.
  • After the feature type of each data is determined, if the data is found to be missing-value data, the interpolation processing unit 12 performs mean imputation or multiple imputation on it; which imputation method is used is decided according to the actual situation.
  • Mean imputation has two modes: 1) imputation using the average value; 2) imputation using the mode.
  • The attributes of the data are first divided into interval (numeric) type and non-interval type. If the missing value is of interval type, it is imputed with the average of the existing values of that attribute; if the missing value is of non-interval type, then, following the principle of the mode in statistics, the mode of the attribute (that is, the value with the highest frequency of occurrence) is used to fill in the missing value.
  • Multiple imputation (MI) treats the value to be imputed as random, with its value derived from the observed values. In practice, the value to be imputed is usually estimated first, and different noise is then added to form multiple sets of candidate imputed values.
  • Multiple imputation is carried out in three steps: (1) a set of possible imputed values is generated for each missing value, reflecting the uncertainty of the non-response model; each value can be used to impute the missing values in the data set, producing several complete data sets; (2) each imputed data set is analyzed using statistical methods for complete data sets; (3) the results from the imputed data sets are selected according to a scoring function to generate the final imputed values.
  • For example, suppose a set of data contains three variables Y1, Y2 and Y3 whose joint distribution is normal, and the data is divided into three groups: group A retains the original data, group B is missing only Y3, and group C is missing Y1 and Y2.
  • During multiple imputation, group A is not processed; for group B, a set of estimated values of Y3 is generated (by regressing Y3 on Y1 and Y2); for group C, a set of paired estimated values of Y1 and Y2 is generated (by regressing Y1 and Y2 on Y3).
  • For groups B and C, the complete samples are randomly drawn to form m groups (m being the number of candidate sets of imputed values); each group needs only enough cases to estimate the parameters effectively.
  • The estimation method used is the maximum likelihood method, and its specific implementation on a computer is the expectation-maximization (EM) algorithm.
  • In this way, mean imputation or multiple imputation can be applied to the missing-value data.
  • If the data is found to be abnormal-value data, the filtering processing unit 13 screens the abnormal-value data to select the data whose abnormal value is less than the preset threshold.
  • The preset threshold is set according to the specific situation. The data selected in this way can then be processed as missing-value data; the processing of missing-value data has been described in detail above and is not repeated here.
  • Imputing the data is equivalent to filling in the content of data that has missing values. The content is filled in because some information in the data collected from the database may not have been filled in completely, and a disease probability calculated from incomplete data may be inaccurate. Therefore, in this embodiment, filling in the data with missing values improves the saturation of the data and helps ensure the accuracy of the subsequent disease probability calculation.
  • Outliers are screened to eliminate data with more serious abnormalities, so that such data does not affect the disease probability detection result.
  • After feature processing is performed on each collected data, the constructing module 20 constructs a multi-dimensional data set from the feature-processed data. It can be understood that, although the data with missing values has been filled in as described above, the filled-in data may still not meet the saturation requirement; if such data were used in subsequent calculation, the accuracy of the disease probability could still be reduced. Therefore, in this embodiment, in order to improve the accuracy of the disease probability calculation, referring to FIG. 7, the constructing module 20 includes:
  • a determining unit 21 configured to determine a feature saturation corresponding to each data after the feature processing
  • the screening unit 22 is configured to screen the data according to the feature saturation to select the data whose feature saturation reaches a preset saturation;
  • the constructing unit 23 is configured to construct a multi-dimensional data set according to the selected data.
  • In this embodiment, the determining unit 21 first determines the feature saturation corresponding to each feature-processed data, and the screening unit 22 then screens the data according to the feature saturation to select the data whose feature saturation reaches the preset saturation.
  • Finally, the constructing unit 23 constructs the multi-dimensional data set from the selected data. This is equivalent to cleaning the collected data to select the data that meets the requirements, so that the subsequently calculated disease probability is more accurate.
  • After the multi-dimensional data set is obtained, the dividing module 30 randomly samples it to divide the multi-dimensional data into a test set and a training set.
  • The sizes of the test set and the training set are not limited and are set according to the specific situation, but the training set must be larger than the test set; for example, 70% of the data is assigned to the training set and 30% to the test set.
  • After the test set and the training set are divided, the building module 40 builds a model based on the training set to obtain a regression decision tree.
  • The model is built from the training set in the same way that models are conventionally built from a data set, so this is not described further here.
  • the calculation module 50 tests the regression decision tree according to the test set to calculate the disease probability of the user.
  • the computing module 50 includes:
  • the input unit 51 is configured to input data of the test set into the regression decision tree to obtain a number of values corresponding to the number of trees in the regression decision tree;
  • the calculating unit 52 is configured to compute a weighted average of the values using the weight values of the trees in the regression decision tree to obtain a total value of the regression decision tree;
  • the processing unit 53 is configured to use the total value as the disease probability of the user.
  • In this embodiment, when the calculation module 50 tests the regression decision tree according to the test set to calculate the user's disease probability, the input unit 51 essentially inputs the data of the test set into the regression decision tree and obtains a number of values corresponding to the number of trees in the regression decision tree. For example, if the regression decision tree currently contains 3000-5000 trees, the number of values obtained is also 3000-5000. Because the weight values of the trees in the regression decision tree are preset, once the values have been obtained, the calculating unit 52 computes a weighted average of the values using the weight values of the trees to obtain the total value of the regression decision tree.
  • In this embodiment, the user is a user whose disease status is unknown; the regression decision tree model outputs its prediction, which gives that user's disease probability.
  • The device for detecting disease probability proposed in this embodiment first collects the data associated with the user and performs feature processing on each collected data; a multi-dimensional data set is then constructed from the feature-processed data and randomly sampled to divide it into a test set and a training set; a model is built based on the training set to obtain a regression decision tree; finally, the regression decision tree is tested according to the test set to calculate the user's disease probability.
  • This solution builds a model from the collected data and calculates the user's disease probability from that model, without needing a physical examination to detect the disease probability, so detection is efficient and its cost is low.
  • In terms of hardware implementation, the processing module 10, the construction module 20, the partitioning module 30, the building module 40, the calculation module 50, and so on may be embedded in, or independent of, the apparatus for detecting disease probability in hardware form, or may be stored in software form in the memory of the apparatus, so that the processor can call and execute the operations corresponding to each module.
  • the processor can be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
  • FIG. 9 is a schematic structural diagram of a device in a hardware operating environment according to an embodiment of the present application.
  • the detecting device for the disease probability in the embodiment of the present application may be a PC, or may be a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the detection device of the disease probability may include a processor 1001, such as a CPU, a network interface 1002, a user interface 1003, and a memory 1004. Connection communication between these components can be achieved via a communication bus.
  • the network interface 1002 may optionally include a standard wired interface (for connecting to a wired network), a wireless interface (such as a WI-FI interface, a Bluetooth interface, an infrared interface, etc. for connecting to a wireless network).
  • the user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface (for example, for connecting a wired keyboard or a wired mouse) and a wireless interface (for example, for connecting a wireless keyboard or a wireless mouse).
  • the memory 1004 may be a high-speed RAM memory or a non-volatile memory, such as disk storage.
  • the memory 1004 can also optionally be a storage device independent of the aforementioned processor 1001.
  • the detection device of the disease probability may further include a camera, RF (Radio Frequency) circuits, sensors, audio circuits, WiFi modules, and so on.
  • Those skilled in the art will understand that the structure shown in FIG. 9 does not limit the detection device of the disease probability, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.
  • As shown in FIG. 9, the memory 1004, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a detection program for disease probability.
  • The operating system is a program that manages and controls the hardware and software resources of the detection device and supports the running of the network communication module, the user interface module, the detection program for disease probability, and other programs or software; the network communication module manages and controls the network interface 1002; the user interface module manages and controls the user interface 1003.
  • the processor 1001 can be used to execute a detection program of the disease probability stored in the memory 1004 to implement the respective steps of the detection method of the disease probability as described above.
  • The present application provides a computer-readable storage medium storing a detection program of disease probability; the detection program is executed by a processor to implement the steps of the method for detecting disease probability described above.
  • From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation.
  • Based on this understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, may be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

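A method, apparatus, device, and computer-readable storage medium for detecting disease probability. The method comprises: collecting data associated with a user and performing feature processing on the collected data (S10); constructing a multi-dimensional data set from the feature-processed data (S20); randomly sampling the multi-dimensional data set to divide it into a test set and a training set (S30); building a model based on the training set to obtain a regression decision tree (S40); and testing the regression decision tree with the test set to calculate the user's disease probability (S50). Because a model is built from the collected data and the user's disease probability is calculated from that model, disease probability is detected more efficiently and at lower cost.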

Description

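Method, apparatus, device, and computer-readable storage medium for detecting disease probability
This application claims priority to Chinese patent application No. 201710095020.5, entitled "Method and apparatus for detecting disease probability", filed with the Chinese Patent Office on February 20, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of disease information processing, and in particular to a method, apparatus, device, and computer-readable storage medium for detecting disease probability.
Background
Traditional disease probability detection, such as detecting cancer prevalence, relies on complex approaches based on biology, genomics, and physical examination and laboratory test results. Such approaches require accurate data sources, and once the data has been obtained, a long time is still needed to analyze and process it before a detection result is available. In addition, the data sources are complicated to obtain, so disease detection is costly. Existing disease probability detection can therefore neither detect disease probability quickly nor do so at low cost.
Summary
The main purpose of the present application is to provide a method, apparatus, device, and computer-readable storage medium for detecting disease probability, intended to solve the technical problem that disease probability detection in the prior art is both time-consuming and costly.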
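To achieve the above purpose, the present application provides a method for detecting disease probability, the method comprising:
collecting data associated with a user, and performing feature processing on the collected data;
constructing a multi-dimensional data set from the feature-processed data;
randomly sampling the multi-dimensional data set to divide it into a test set and a training set;
building a model based on the training set to obtain a regression decision tree;
testing the regression decision tree with the test set to calculate the user's disease probability.
In addition, to achieve the above purpose, the present application further provides an apparatus for detecting disease probability, the apparatus comprising:
a processing module, configured to collect data associated with a user and perform feature processing on the collected data;
a construction module, configured to construct a multi-dimensional data set from the feature-processed data;
a partitioning module, configured to randomly sample the multi-dimensional data set to divide it into a test set and a training set;
a building module, configured to build a model based on the training set to obtain a regression decision tree;
a calculation module, configured to test the regression decision tree with the test set to calculate the user's disease probability.
In addition, to achieve the above purpose, the present application further provides a device for detecting disease probability, the device comprising a processor and a memory storing a detection program of disease probability; the processor is configured to execute the detection program to implement the steps of the method for detecting disease probability described above.
In addition, to achieve the above purpose, the present application further provides a computer-readable storage medium storing a detection program of disease probability; the detection program is executed by a processor to implement the steps of the method for detecting disease probability described above.
The method and apparatus for detecting disease probability proposed in the present application first collect the data associated with the user, perform feature processing on the collected data, construct a multi-dimensional data set from the feature-processed data, randomly sample the multi-dimensional data set to divide it into a test set and a training set, build a model based on the training set to obtain a regression decision tree, and finally test the regression decision tree with the test set to calculate the user's disease probability. This solution builds a model from the collected data and calculates the user's disease probability from that model, without detecting the disease probability through physical examination and laboratory tests, so detection is more efficient and costs less.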
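Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a first embodiment of the method for detecting disease probability of the present application;
FIG. 2 is a detailed schematic flowchart of step S10 in FIG. 1;
FIG. 3 is a detailed schematic flowchart of step S20 in FIG. 1;
FIG. 4 is a detailed schematic flowchart of step S50 in FIG. 1;
FIG. 5 is a schematic diagram of the functional modules of a first embodiment of the apparatus for detecting disease probability of the present application;
FIG. 6 is a detailed schematic diagram of the functional modules of the processing module 10 in FIG. 5;
FIG. 7 is a detailed schematic diagram of the functional modules of the construction module 20 in FIG. 5;
FIG. 8 is a detailed schematic diagram of the functional modules of the calculation module 50 in FIG. 5;
FIG. 9 is a schematic structural diagram of a device in the hardware operating environment involved in the embodiments of the present application.
The realization of the purpose, the functional features, and the advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.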
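Detailed Description
It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
The solution of the embodiments of the present application is mainly as follows: first collect the data associated with the user, perform feature processing on the collected data, construct a multi-dimensional data set from the feature-processed data, randomly sample the multi-dimensional data set to divide it into a test set and a training set, build a model based on the training set to obtain a regression decision tree, and finally test the regression decision tree with the test set to calculate the user's disease probability. This solves the problem that existing disease probability detection requires physical examinations and laboratory tests, cannot detect disease probability quickly, and is costly.
It should be understood that in disease detection in the traditional sense the data sources are complicated to obtain, so disease probability cannot be detected quickly for each ordinary user, and the traditional approach is also difficult to implement in the insurance industry.
In view of these problems in the prior art, the present application provides a method for detecting disease probability.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a first embodiment of the method for detecting disease probability of the present application.
In this embodiment, the method for detecting disease probability comprises:
collecting data associated with a user and performing feature processing on the collected data; constructing a multi-dimensional data set from the feature-processed data; randomly sampling the multi-dimensional data set to divide it into a test set and a training set; building a model based on the training set to obtain a regression decision tree; and testing the regression decision tree with the test set to calculate the user's disease probability.
The specific steps by which this embodiment detects disease probability are as follows: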
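Step S10: collect data associated with a user, and perform feature processing on the collected data;
In this embodiment, the method for detecting disease probability is preferably applied in an insurance system. It can be understood that, before taking out insurance, a user reports data such as health information from physical examinations or information about their own behavior to the insurance system, which performs a comprehensive analysis to detect the user's disease probability; whether to proceed with the insurance is determined afterwards. Collecting the data associated with the user from a database therefore means collecting it from the database of the insurance system. In this embodiment, the data includes behavior information and health information, which represent information of different dimensions.
After the data associated with the user has been collected, feature processing is performed on the collected data. Specifically, referring to FIG. 2, step S10 includes:
Step S11: perform feature analysis on the collected data to determine the feature type of each item of data;
Step S12: when data is missing-value data, perform mean imputation or multiple imputation on the missing-value data;
Step S13: when data is outlier data, screen the outlier data to select the data whose degree of abnormality is below a preset threshold, and process the selected data as missing-value data.
That is, after the data associated with the user has been collected, feature analysis is first performed on the collected data to determine the feature type of each item of data. In this embodiment, the feature types of data include outliers, missing values, and the like. Once the feature types have been determined, if data is found to be missing-value data, mean imputation or multiple imputation is performed on it; which imputation method is used is decided according to the actual situation.
In this embodiment, mean imputation takes one of two forms: 1) imputation with the mean; 2) imputation with the mode. Specifically, the attributes of the data are first divided into interval-scaled and non-interval-scaled attributes. If the missing value belongs to an interval-scaled attribute, it is filled with the mean of the existing values of that attribute; if it belongs to a non-interval-scaled attribute, it is filled, following the statistical principle of the mode, with the mode of that attribute (that is, its most frequent value).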
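The mean-or-mode rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as `None`, and the function name `impute_column` and the `interval_scaled` flag are hypothetical names chosen for the example, not terms from the application.

```python
from collections import Counter
from statistics import mean


def impute_column(values, interval_scaled):
    """Fill None entries with the mean (interval-scaled) or the mode (otherwise)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from; leave the column as-is
    if interval_scaled:
        fill = mean(observed)  # mean of the existing values
    else:
        fill = Counter(observed).most_common(1)[0][0]  # most frequent value
    return [fill if v is None else v for v in values]


# Hypothetical example columns: age is interval-scaled, smoker is categorical.
ages = impute_column([30, None, 50, 40], interval_scaled=True)
smoker = impute_column(["no", "yes", None, "no"], interval_scaled=False)
```

Here the missing age is filled with the mean of 30, 50, and 40, and the missing smoker flag is filled with the most frequent observed value.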
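Multiple imputation (MI) treats the value to be imputed as random, drawn from the values already observed. In practice, the value to be imputed is usually estimated first and different noise is then added, forming several sets of candidate imputed values. Multiple imputation has three steps: (1) generate a set of possible imputed values for each missing value; these values reflect the uncertainty of the nonresponse model, and each of them can be used to fill in the missing values in the data set, producing several complete data sets; (2) analyze each imputed data set with the statistical methods used for complete data sets; (3) select among the results from the imputed data sets according to a scoring function to produce the final imputed value.
For example, suppose a data set contains three variables Y1, Y2, and Y3 whose joint distribution is normal, and the data is divided into three groups: group A keeps the original (complete) data, group B is missing only Y3, and group C is missing Y1 and Y2. In multiple imputation, group A is not processed; for group B, a set of estimates of Y3 is generated (by regressing Y3 on Y1 and Y2); for group C, a set of paired estimates of Y1 and Y2 is generated (by regressing Y1 and Y2 on Y3). For groups B and C, the complete samples are randomly drawn to form m groups (m being the chosen number of sets of imputed values); each group only needs enough cases to estimate the parameters effectively. The distribution of the attributes with missing values is estimated, and then, based on these m groups of observations, m sets of parameter estimates are produced, one per group, giving the corresponding predictions. The estimation method used here is maximum likelihood, which is implemented on a computer with the expectation-maximization (EM) algorithm. A set of Y3 values is estimated for group B, and for group C a set of (Y1, Y2) pairs is estimated using the premise that Y1, Y2, and Y3 are jointly normally distributed.
In this way, mean imputation or multiple imputation can be performed on missing-value data.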
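As a rough illustration of the "estimate, then add different noise" idea, the sketch below imputes a missing Y3 from a single predictor Y1. It is a deliberate simplification and not the procedure of the application: it uses ordinary least squares with one predictor instead of maximum likelihood via EM, and it pools the m candidates by averaging because the scoring function is not specified in the text. The names `fit_line` and `multiple_impute` and the sample numbers are hypothetical.

```python
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def multiple_impute(rows, m=5, seed=0):
    """rows: (y1, y3) pairs where y3 may be None. Returns m completed copies."""
    rng = random.Random(seed)
    complete = [(a, b) for a, b in rows if b is not None]
    xs, ys = zip(*complete)
    slope, intercept = fit_line(xs, ys)
    # Spread of the residuals sets the size of the added noise.
    sigma = pstdev([y - (intercept + slope * x) for x, y in complete])
    datasets = []
    for _ in range(m):
        datasets.append([
            (a, b if b is not None else intercept + slope * a + rng.gauss(0, sigma))
            for a, b in rows
        ])
    return datasets


# Hypothetical data: the fourth record is missing Y3.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, 9.8)]
datasets = multiple_impute(rows, m=5)
# Assumed pooling rule: average the m candidate values for the missing entry.
pooled_y3 = mean(ds[3][1] for ds in datasets)
```

Each of the m data sets keeps the observed values unchanged and differs only in the noisy estimate of the missing entry, which is what lets a later analysis see the uncertainty of the imputation.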
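Of course, if data is found to be outlier data, the outlier data is screened to select the data whose degree of abnormality is below a preset threshold, where the preset threshold is set according to the specific situation. The selected data can then be processed as missing-value data; that processing has been described in detail above and is not repeated here.
It should be understood that, in this embodiment, imputing data amounts to filling in the content of data with missing values. This is needed because some information in the data collected from the database may be incomplete, and a disease probability calculated from it may not be accurate enough. Filling in data with missing values therefore increases the saturation of the data and makes the subsequent disease probability calculation more accurate. Screening outliers removes the data with more severe abnormalities so that it does not affect the disease probability detection result.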
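One possible reading of the outlier screening is sketched below. The application does not say how the degree of abnormality is measured, so the score used here (distance from the median in units of the median absolute deviation) is an assumption, as are the function name `screen_outliers`, the two thresholds `flag_at` and `drop_at`, and the sample readings.

```python
from statistics import median


def screen_outliers(records, field, flag_at, drop_at):
    """Drop severe outliers; blank mild ones so they are imputed later."""
    values = [r[field] for r in records if r[field] is not None]
    center = median(values)
    mad = median([abs(v - center) for v in values])  # robust spread estimate
    kept = []
    for r in records:
        v = r[field]
        score = 0.0 if v is None or mad == 0 else abs(v - center) / mad
        if score >= drop_at:
            continue  # severe abnormality: remove the record
        if score >= flag_at:
            r = {**r, field: None}  # mild abnormality: treat as a missing value
        kept.append(r)
    return kept


# Hypothetical systolic blood pressure readings.
readings = [118, 122, 120, 125, 119, 160, 400]
records = [{"id": i, "sbp": v} for i, v in enumerate(readings)]
cleaned = screen_outliers(records, "sbp", flag_at=5.0, drop_at=50.0)
```

With these assumed thresholds the reading of 400 is discarded, while 160 is blanked and left for the imputation step; the input records themselves are not modified.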
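Step S20: construct a multi-dimensional data set from the feature-processed data;
After feature processing has been performed on the collected data, a multi-dimensional data set is constructed from the feature-processed data. As disclosed above, data with missing values is filled in, but the saturation of the filled data may still fall short of the requirement, and using such data in the subsequent calculation may still reduce the accuracy of the disease probability. Therefore, in this embodiment, to improve the accuracy of the disease probability calculation, referring to FIG. 3, step S20 includes:
Step S21: determine the feature saturation corresponding to each item of feature-processed data;
Step S22: screen the data according to feature saturation to select the data whose feature saturation reaches a preset saturation;
Step S23: construct the multi-dimensional data set from the selected data.
That is, after feature processing has been performed on the collected data, the feature saturation corresponding to each item of feature-processed data is determined first, the data is then screened according to feature saturation to select the data whose feature saturation reaches the preset saturation, and the multi-dimensional data set is finally constructed from the selected data. This amounts to cleaning the collected data so that only data meeting the requirement is kept, which helps make the subsequently calculated disease probability more accurate.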
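Steps S21 to S23 can be sketched as follows. The application does not define feature saturation, so this example assumes it is the share of records in which a feature has a non-missing value; the names `feature_saturation` and `build_dataset`, the 0.75 threshold, and the sample records are all hypothetical.

```python
def feature_saturation(records, feature):
    """Assumed definition: fraction of records with a non-missing value."""
    filled = sum(1 for r in records if r.get(feature) is not None)
    return filled / len(records)


def build_dataset(records, features, min_saturation):
    """Keep sufficiently saturated features and return them as rows."""
    kept = [f for f in features if feature_saturation(records, f) >= min_saturation]
    rows = [[r.get(f) for f in kept] for r in records]
    return kept, rows


# Hypothetical records after feature processing.
records = [
    {"age": 30, "bmi": 22.0, "steps": None},
    {"age": 41, "bmi": 27.5, "steps": 8000},
    {"age": 55, "bmi": None, "steps": None},
    {"age": 62, "bmi": 30.1, "steps": None},
    {"age": 38, "bmi": 24.3, "steps": 5000},
]
kept, rows = build_dataset(records, ["age", "bmi", "steps"], min_saturation=0.75)
```

In this example `steps` is filled in only 2 of 5 records, so it is left out of the multi-dimensional data set, while `age` and `bmi` are kept as its dimensions.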
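Step S30: randomly sample the multi-dimensional data set to divide it into a test set and a training set;
That is, after the multi-dimensional data set has been constructed, it is randomly sampled to divide the multi-dimensional data into a test set and a training set. In this embodiment, the sizes of the test set and the training set are not limited and are set according to the specific situation, but the training set must be larger than the test set; for example, 70% of the data is assigned to the training set and 30% to the test set.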
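A minimal sketch of the random split, assuming a simple shuffle-and-cut scheme; the function name `split_dataset`, the fixed seed, and the 100-row example are illustrative choices rather than details from the application.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into (training set, test set)."""
    if not 0.5 < train_fraction < 1.0:
        raise ValueError("the training set must be the larger part")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(100))
```

The check on `train_fraction` enforces the only constraint the text states, namely that the training set is larger than the test set.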
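Step S40: build a model based on the training set to obtain a regression decision tree;
A model is then built based on the training set to obtain a regression decision tree. In this embodiment, the model is built from the training set in the same way that existing methods build a model from a data set, so it is not described here.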
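Since the application leaves model construction to existing methods, the following is only a generic sketch of how a single regression tree can be grown, not the authors' procedure. It assumes a CART-style greedy split that minimizes the sum of squared errors, labels of 1 for users known to have the disease and 0 otherwise, and hypothetical names (`sse`, `build_tree`, `predict`) and sample data.

```python
from statistics import mean


def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def build_tree(X, y, max_depth=3):
    """Grow a small regression tree; each leaf stores the mean label of its rows."""
    if max_depth == 0 or len(set(y)) == 1:
        return {"value": mean(y)}
    best = None
    for j in range(len(X[0])):
        # Skip the largest value so both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:  # every feature is constant, so no split is possible
        return {"value": mean(y)}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1),
        "right": build_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf value."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Hypothetical training rows: [age, smoker flag]; label 1 means disease present.
X_train = [[25, 0], [30, 0], [45, 1], [50, 1], [60, 1], [35, 0]]
y_train = [0, 0, 1, 1, 1, 0]
tree = build_tree(X_train, y_train)
```

Because each leaf stores the mean of 0/1 labels, a leaf value can be read as the share of training users in that leaf who had the disease, which is why a regression tree can output a probability-like number.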
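Step S50: test the regression decision tree with the test set to calculate the user's disease probability.
After the regression decision tree has been obtained, it is tested with the test set to calculate the user's disease probability. Referring to FIG. 4, step S50 includes:
Step S51: input the data of the test set into the regression decision tree, so as to obtain one value per tree in the regression decision tree;
Step S52: compute the weighted average of these values using the weight values of the trees in the regression decision tree, to obtain the total value of the regression decision tree;
Step S53: take the total value as the user's disease probability.
In other words, testing the regression decision tree with the test set to calculate the user's disease probability means inputting the data of the test set into the regression decision tree and obtaining one value per tree. For example, if the regression decision tree currently contains 3000 to 5000 trees, 3000 to 5000 values are obtained. Because the weight value of each tree in the regression decision tree is preset, once these values have been obtained, their weighted average is computed using the weight values of the trees to obtain the total value of the regression decision tree. For example, if the regression decision tree has 4 trees with weights 0.3, 0.15, 0.2, and 0.35, and the values obtained from the trees are A, B, C, and D, the total value is Q = 0.3*A + 0.15*B + 0.2*C + 0.35*D. This total value is the user's disease probability.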
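The four-tree example can be checked numerically with a short sketch. The weights are the ones given in the text; the per-tree outputs standing in for A, B, C, and D are made-up numbers, and the function name `disease_probability` is hypothetical.

```python
def disease_probability(tree_outputs, tree_weights):
    """Weighted average of the per-tree values using the preset tree weights."""
    if len(tree_outputs) != len(tree_weights):
        raise ValueError("one weight is required per tree")
    total_weight = sum(tree_weights)
    weighted_sum = sum(w * v for w, v in zip(tree_weights, tree_outputs))
    return weighted_sum / total_weight


weights = [0.3, 0.15, 0.2, 0.35]   # weights from the example in the text
outputs = [0.10, 0.40, 0.20, 0.30]  # hypothetical values for A, B, C, D
q = disease_probability(outputs, weights)
```

Because the example weights sum to 1, dividing by their total leaves the result equal to Q = 0.3*A + 0.15*B + 0.2*C + 0.35*D; the division only matters if a set of weights is not normalized.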
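In effect, for a user whose disease status is unknown, this embodiment outputs the prediction of the regression decision tree model to obtain the user's probability of disease.
The method for detecting disease probability proposed in this embodiment first collects the data associated with the user, performs feature processing on the collected data, constructs a multi-dimensional data set from the feature-processed data, randomly samples the multi-dimensional data set to divide it into a test set and a training set, builds a model based on the training set to obtain a regression decision tree, and finally tests the regression decision tree with the test set to calculate the user's disease probability. This solution builds a model from the collected data and calculates the user's disease probability from that model, without detecting the disease probability through physical examination and laboratory tests, so detection is more efficient and costs less.
It should be noted that those of ordinary skill in the art will understand that all or some of the steps of the above embodiment may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.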
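The present application further provides an apparatus for detecting disease probability.
Referring to FIG. 5, FIG. 5 is a schematic diagram of the functional modules of a first embodiment of the apparatus 100 for detecting disease probability of the present application.
It should be emphasized that, for those skilled in the art, the functional module diagram shown in FIG. 5 is merely an example of a preferred embodiment, and those skilled in the art can easily add new functional modules around the functional modules of the apparatus 100 shown in FIG. 5. The names of the functional modules are custom names used only to help understand the program function blocks of the apparatus 100 and do not limit the technical solution of the present application; the core of the technical solution is the function that each custom-named module is to achieve.
In this embodiment, the apparatus 100 for detecting disease probability includes:
a processing module 10, configured to collect data associated with a user and perform feature processing on the collected data;
a construction module 20, configured to construct a multi-dimensional data set from the feature-processed data;
a partitioning module 30, configured to randomly sample the multi-dimensional data set to divide it into a test set and a training set;
a building module 40, configured to build a model based on the training set to obtain a regression decision tree;
a calculation module 50, configured to test the regression decision tree with the test set to calculate the user's disease probability.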
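The five modules form a simple pipeline in which each module consumes the output of the previous one. The sketch below shows one hypothetical way to wire them together in software; the class name `DiseaseProbabilityDetector`, its field names, and the trivial stand-in functions are assumptions made for illustration and say nothing about how the apparatus is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiseaseProbabilityDetector:
    process: Callable    # stands in for processing module 10
    construct: Callable  # stands in for construction module 20
    split: Callable      # stands in for partitioning module 30
    build: Callable      # stands in for building module 40
    calculate: Callable  # stands in for calculation module 50

    def run(self, raw_records):
        """Run the modules in order and return the calculation result."""
        processed = self.process(raw_records)
        dataset = self.construct(processed)
        train, test = self.split(dataset)
        model = self.build(train)
        return self.calculate(model, test)


# Trivial stand-ins, only to show the data flow between modules.
detector = DiseaseProbabilityDetector(
    process=lambda records: [r for r in records if r is not None],
    construct=lambda records: list(records),
    split=lambda data: (data[:3], data[3:]),
    build=lambda train: sum(train) / len(train),
    calculate=lambda model, test: [model for _ in test],
)
result = detector.run([1, None, 2, 3, 4])
```

Passing the modules in as callables keeps the sketch aligned with the statement above that the module names are only labels and that what matters is the function each module performs.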
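In this embodiment, the apparatus for detecting disease probability is preferably applied in an insurance system. It can be understood that, before taking out insurance, a user reports data such as health information from physical examinations or information about their own behavior to the insurance system, which performs a comprehensive analysis to detect the user's disease probability; whether to proceed with the insurance is determined afterwards. When the processing module 10 collects the data associated with the user from a database, it therefore collects it from the database of the insurance system. In this embodiment, the data includes behavior information and health information, which represent information of different dimensions.
After the data associated with the user has been collected, the processing module 10 performs feature processing on the collected data. Specifically, referring to FIG. 6, the processing module 10 includes:
a feature analysis unit 11, configured to perform feature analysis on the collected data to determine the feature type of each item of data;
an imputation processing unit 12, configured to perform mean imputation or multiple imputation on missing-value data when data is missing-value data;
a screening processing unit 13, configured to screen outlier data when data is outlier data, so as to select the data whose degree of abnormality is below a preset threshold, and to process the selected data as missing-value data.
That is, after the data associated with the user has been collected, the feature analysis unit 11 first performs feature analysis on the collected data to determine the feature type of each item of data. In this embodiment, the feature types of data include outliers, missing values, and the like. Once the feature types have been determined, if data is found to be missing-value data, the imputation processing unit 12 performs mean imputation or multiple imputation on it; which imputation method is used is decided according to the actual situation.
In this embodiment, mean imputation takes one of two forms: 1) imputation with the mean; 2) imputation with the mode. Specifically, the attributes of the data are first divided into interval-scaled and non-interval-scaled attributes. If the missing value belongs to an interval-scaled attribute, it is filled with the mean of the existing values of that attribute; if it belongs to a non-interval-scaled attribute, it is filled, following the statistical principle of the mode, with the mode of that attribute (that is, its most frequent value).
Multiple imputation (MI) treats the value to be imputed as random, drawn from the values already observed. In practice, the value to be imputed is usually estimated first and different noise is then added, forming several sets of candidate imputed values. Multiple imputation has three steps: (1) generate a set of possible imputed values for each missing value; these values reflect the uncertainty of the nonresponse model, and each of them can be used to fill in the missing values in the data set, producing several complete data sets; (2) analyze each imputed data set with the statistical methods used for complete data sets; (3) select among the results from the imputed data sets according to a scoring function to produce the final imputed value.
For example, suppose a data set contains three variables Y1, Y2, and Y3 whose joint distribution is normal, and the data is divided into three groups: group A keeps the original (complete) data, group B is missing only Y3, and group C is missing Y1 and Y2. In multiple imputation, group A is not processed; for group B, a set of estimates of Y3 is generated (by regressing Y3 on Y1 and Y2); for group C, a set of paired estimates of Y1 and Y2 is generated (by regressing Y1 and Y2 on Y3). For groups B and C, the complete samples are randomly drawn to form m groups (m being the chosen number of sets of imputed values); each group only needs enough cases to estimate the parameters effectively. The distribution of the attributes with missing values is estimated, and then, based on these m groups of observations, m sets of parameter estimates are produced, one per group, giving the corresponding predictions. The estimation method used here is maximum likelihood, which is implemented on a computer with the expectation-maximization (EM) algorithm. A set of Y3 values is estimated for group B, and for group C a set of (Y1, Y2) pairs is estimated using the premise that Y1, Y2, and Y3 are jointly normally distributed.
In this way, mean imputation or multiple imputation can be performed on missing-value data.
Of course, if data is found to be outlier data, the screening processing unit 13 screens the outlier data to select the data whose degree of abnormality is below a preset threshold, where the preset threshold is set according to the specific situation. The selected data can then be processed as missing-value data; that processing has been described in detail above and is not repeated here.
It should be understood that, in this embodiment, imputing data amounts to filling in the content of data with missing values. This is needed because some information in the data collected from the database may be incomplete, and a disease probability calculated from it may not be accurate enough. Filling in data with missing values therefore increases the saturation of the data and makes the subsequent disease probability calculation more accurate. Screening outliers removes the data with more severe abnormalities so that it does not affect the disease probability detection result.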
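After the processing module 10 has performed feature processing on the collected data, the construction module 20 constructs a multi-dimensional data set from the feature-processed data. As disclosed above, data with missing values is filled in, but the saturation of the filled data may still fall short of the requirement, and using such data in the subsequent calculation may still reduce the accuracy of the disease probability. Therefore, in this embodiment, to improve the accuracy of the disease probability calculation, referring to FIG. 7, the construction module 20 includes:
a determining unit 21, configured to determine the feature saturation corresponding to each item of feature-processed data;
a screening unit 22, configured to screen the data according to feature saturation to select the data whose feature saturation reaches a preset saturation;
a construction unit 23, configured to construct the multi-dimensional data set from the selected data.
That is, after the processing module 10 has performed feature processing on the collected data, the determining unit 21 first determines the feature saturation corresponding to each item of feature-processed data, the screening unit 22 then screens the data according to feature saturation to select the data whose feature saturation reaches the preset saturation, and the construction unit 23 finally constructs the multi-dimensional data set from the selected data. This amounts to cleaning the collected data so that only data meeting the requirement is kept, which helps make the subsequently calculated disease probability more accurate.
In this embodiment, after the construction module 20 has constructed the multi-dimensional data set, the partitioning module 30 randomly samples it to divide the multi-dimensional data into a test set and a training set. The sizes of the test set and the training set are not limited and are set according to the specific situation, but the training set must be larger than the test set; for example, 70% of the data is assigned to the training set and 30% to the test set.
The building module 40 then builds a model based on the training set to obtain a regression decision tree. In this embodiment, the model is built from the training set in the same way that existing methods build a model from a data set, so it is not described here.
After the regression decision tree has been obtained, the calculation module 50 tests it with the test set to calculate the user's disease probability. Referring to FIG. 8, the calculation module 50 includes:
an input unit 51, configured to input the data of the test set into the regression decision tree, so as to obtain one value per tree in the regression decision tree;
a calculating unit 52, configured to compute the weighted average of these values using the weight values of the trees in the regression decision tree, to obtain the total value of the regression decision tree;
a processing unit 53, configured to take the total value as the user's disease probability.
In other words, when the calculation module 50 tests the regression decision tree with the test set to calculate the user's disease probability, the input unit 51 inputs the data of the test set into the regression decision tree and one value is obtained per tree. For example, if the regression decision tree currently contains 3000 to 5000 trees, 3000 to 5000 values are obtained. Because the weight value of each tree in the regression decision tree is preset, once these values have been obtained, the calculating unit 52 computes their weighted average using the weight values of the trees to obtain the total value of the regression decision tree. For example, if the regression decision tree has 4 trees with weights 0.3, 0.15, 0.2, and 0.35, and the values obtained from the trees are A, B, C, and D, the total value is Q = 0.3*A + 0.15*B + 0.2*C + 0.35*D. This total value is the user's disease probability.
In effect, for a user whose disease status is unknown, this embodiment outputs the prediction of the regression decision tree model to obtain the user's probability of disease.
The apparatus for detecting disease probability proposed in this embodiment first collects the data associated with the user, performs feature processing on the collected data, constructs a multi-dimensional data set from the feature-processed data, randomly samples the multi-dimensional data set to divide it into a test set and a training set, builds a model based on the training set to obtain a regression decision tree, and finally tests the regression decision tree with the test set to calculate the user's disease probability. This solution builds a model from the collected data and calculates the user's disease probability from that model, without detecting the disease probability through physical examination and laboratory tests, so detection is more efficient and costs less.
It should be noted that, in terms of hardware implementation, the processing module 10, the construction module 20, the partitioning module 30, the building module 40, the calculation module 50, and so on may be embedded in, or independent of, the apparatus for detecting disease probability in hardware form, or may be stored in software form in the memory of the apparatus, so that a processor can call and execute the operations corresponding to each module. The processor may be a central processing unit (CPU), a microprocessor, a single-chip microcontroller, or the like.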
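Referring to FIG. 9, FIG. 9 is a schematic structural diagram of a device in the hardware operating environment involved in the embodiments of the present application.
The device for detecting disease probability in the embodiments of the present application may be a PC, or a terminal device such as a smartphone, a tablet computer, or a portable computer.
As shown in FIG. 9, the device for detecting disease probability may include a processor 1001 (for example, a CPU), a network interface 1002, a user interface 1003, and a memory 1004. These components may communicate with one another over a communication bus. The network interface 1002 may optionally include a standard wired interface (for connecting to a wired network) and a wireless interface (such as a WI-FI interface, a Bluetooth interface, or an infrared interface, for connecting to a wireless network). The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface (for example, for connecting a wired keyboard or a wired mouse) and a wireless interface (for example, for connecting a wireless keyboard or a wireless mouse). The memory 1004 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1004 may optionally also be a storage device independent of the processor 1001.
Optionally, the device for detecting disease probability may further include a camera, an RF (Radio Frequency) circuit, sensors, an audio circuit, a WiFi module, and so on.
Those skilled in the art will understand that the structure shown in FIG. 9 does not limit the device for detecting disease probability, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.
As shown in FIG. 9, the memory 1004, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a detection program of disease probability. The operating system is a program that manages and controls the hardware and software resources of the device for detecting disease probability and supports the running of the network communication module, the user interface module, the detection program of disease probability, and other programs or software; the network communication module manages and controls the network interface 1002; the user interface module manages and controls the user interface 1003.
In the device for detecting disease probability shown in FIG. 9, the processor 1001 may be used to execute the detection program of disease probability stored in the memory 1004 to implement the steps of the method for detecting disease probability described above.
The present application provides a computer-readable storage medium storing a detection program of disease probability; the detection program is executed by a processor to implement the steps of the method for detecting disease probability described above.
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate that one embodiment is better than another.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, may be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.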

Claims (20)

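  1. A method for detecting disease probability, wherein the method comprises:
    collecting data associated with a user, and performing feature processing on the collected data;
    constructing a multi-dimensional data set from the feature-processed data;
    randomly sampling the multi-dimensional data set to divide it into a test set and a training set;
    building a model based on the training set to obtain a regression decision tree;
    testing the regression decision tree with the test set to calculate the user's disease probability.
  2. The method for detecting disease probability according to claim 1, wherein the step of performing feature processing on the collected data comprises:
    performing feature analysis on the collected data to determine the feature type of each item of data;
    when data is missing-value data, performing mean imputation or multiple imputation on the missing-value data;
    when data is outlier data, screening the outlier data to select the data whose degree of abnormality is below a preset threshold, and processing the selected data as missing-value data.
  3. The method for detecting disease probability according to claim 2, wherein the mean imputation comprises: imputation with the mean, or imputation with the mode.
  4. The method for detecting disease probability according to claim 1, wherein the step of constructing a multi-dimensional data set from the feature-processed data comprises:
    determining the feature saturation corresponding to each item of feature-processed data;
    screening the data according to feature saturation to select the data whose feature saturation reaches a preset saturation;
    constructing the multi-dimensional data set from the selected data.
  5. The method for detecting disease probability according to claim 1, wherein the step of testing the regression decision tree with the test set to calculate the user's disease probability comprises:
    inputting the data of the test set into the regression decision tree, so as to obtain one value per tree in the regression decision tree;
    computing the weighted average of these values using the weight values of the trees in the regression decision tree, to obtain a total value of the regression decision tree;
    taking the total value as the user's disease probability.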
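  6. An apparatus for detecting disease probability, wherein the apparatus comprises:
    a processing module, configured to collect data associated with a user and perform feature processing on the collected data;
    a construction module, configured to construct a multi-dimensional data set from the feature-processed data;
    a partitioning module, configured to randomly sample the multi-dimensional data set to divide it into a test set and a training set;
    a building module, configured to build a model based on the training set to obtain a regression decision tree;
    a calculation module, configured to test the regression decision tree with the test set to calculate the user's disease probability.
  7. The apparatus for detecting disease probability according to claim 6, wherein the processing module comprises:
    a feature analysis unit, configured to perform feature analysis on the collected data to determine the feature type of each item of data;
    an imputation processing unit, configured to perform mean imputation or multiple imputation on missing-value data when data is missing-value data;
    a screening processing unit, configured to screen outlier data when data is outlier data, so as to select the data whose degree of abnormality is below a preset threshold, and to process the selected data as missing-value data.
  8. The apparatus for detecting disease probability according to claim 7, wherein the mean imputation comprises: imputation with the mean, or imputation with the mode.
  9. The apparatus for detecting disease probability according to claim 6, wherein the construction module comprises:
    a determining unit, configured to determine the feature saturation corresponding to each item of feature-processed data;
    a screening unit, configured to screen the data according to feature saturation to select the data whose feature saturation reaches a preset saturation;
    a construction unit, configured to construct the multi-dimensional data set from the selected data.
  10. The apparatus for detecting disease probability according to claim 6, wherein the calculation module comprises:
    an input unit, configured to input the data of the test set into the regression decision tree, so as to obtain one value per tree in the regression decision tree;
    a calculating unit, configured to compute the weighted average of these values using the weight values of the trees in the regression decision tree, to obtain a total value of the regression decision tree;
    a processing unit, configured to take the total value as the user's disease probability.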
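  11. A device for detecting disease probability, wherein the device comprises a processor and a memory storing a detection program of disease probability; the processor is configured to execute the detection program to implement the following steps:
    collecting data associated with a user, and performing feature processing on the collected data;
    constructing a multi-dimensional data set from the feature-processed data;
    randomly sampling the multi-dimensional data set to divide it into a test set and a training set;
    building a model based on the training set to obtain a regression decision tree;
    testing the regression decision tree with the test set to calculate the user's disease probability.
  12. The device for detecting disease probability according to claim 11, wherein the processor is further configured to execute the detection program to implement the step of performing feature processing on the collected data:
    performing feature analysis on the collected data to determine the feature type of each item of data;
    when data is missing-value data, performing mean imputation or multiple imputation on the missing-value data;
    when data is outlier data, screening the outlier data to select the data whose degree of abnormality is below a preset threshold, and processing the selected data as missing-value data.
  13. The device for detecting disease probability according to claim 12, wherein the mean imputation comprises: imputation with the mean, or imputation with the mode.
  14. The device for detecting disease probability according to claim 11, wherein the processor is further configured to execute the detection program to implement the step of constructing a multi-dimensional data set from the feature-processed data:
    determining the feature saturation corresponding to each item of feature-processed data;
    screening the data according to feature saturation to select the data whose feature saturation reaches a preset saturation;
    constructing the multi-dimensional data set from the selected data.
  15. The device for detecting disease probability according to claim 11, wherein the processor is further configured to execute the detection program to implement the step of testing the regression decision tree with the test set to calculate the user's disease probability:
    inputting the data of the test set into the regression decision tree, so as to obtain one value per tree in the regression decision tree;
    computing the weighted average of these values using the weight values of the trees in the regression decision tree, to obtain a total value of the regression decision tree;
    taking the total value as the user's disease probability.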
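  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a detection program of disease probability, and the detection program is executed by a processor to implement the following steps:
    collecting data associated with a user, and performing feature processing on the collected data;
    constructing a multi-dimensional data set from the feature-processed data;
    randomly sampling the multi-dimensional data set to divide it into a test set and a training set;
    building a model based on the training set to obtain a regression decision tree;
    testing the regression decision tree with the test set to calculate the user's disease probability.
  17. The computer-readable storage medium according to claim 16, wherein the detection program, when executed by the processor, further implements the step of performing feature processing on the collected data:
    performing feature analysis on the collected data to determine the feature type of each item of data;
    when data is missing-value data, performing mean imputation or multiple imputation on the missing-value data;
    when data is outlier data, screening the outlier data to select the data whose degree of abnormality is below a preset threshold, and processing the selected data as missing-value data.
  18. The computer-readable storage medium according to claim 17, wherein the mean imputation comprises: imputation with the mean, or imputation with the mode.
  19. The computer-readable storage medium according to claim 16, wherein the detection program, when executed by the processor, further implements the step of constructing a multi-dimensional data set from the feature-processed data:
    determining the feature saturation corresponding to each item of feature-processed data;
    screening the data according to feature saturation to select the data whose feature saturation reaches a preset saturation;
    constructing the multi-dimensional data set from the selected data.
  20. The computer-readable storage medium according to claim 16, wherein the detection program, when executed by the processor, further implements the step of testing the regression decision tree with the test set to calculate the user's disease probability:
    inputting the data of the test set into the regression decision tree, so as to obtain one value per tree in the regression decision tree;
    computing the weighted average of these values using the weight values of the trees in the regression decision tree, to obtain a total value of the regression decision tree;
    taking the total value as the user's disease probability.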
PCT/CN2018/074808 2017-02-20 2018-01-31 Method, apparatus, device, and computer-readable storage medium for detecting disease probability WO2018149300A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/305,884 US20200126662A1 (en) 2017-02-20 2018-01-31 Method, device, and apparatus for detecting disease probability, and computer-readable storage medium
SG11201810380VA SG11201810380VA (en) 2017-02-20 2018-01-31 Method, device, and apparatus for detecting disease probability, and computer-readable storage medium
JP2018559946A JP2019521418A (ja) 2017-02-20 2018-01-31 Method, apparatus, device, and computer-readable storage medium for detecting disease probability

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710095020.5 2017-02-20
CN201710095020.5A CN107622801A (zh) 2017-02-20 2017-02-20 Method and apparatus for detecting disease probability

Publications (1)

Publication Number Publication Date
WO2018149300A1 true WO2018149300A1 (zh) 2018-08-23

Family

ID=61087260

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074808 WO2018149300A1 (zh) 2017-02-20 2018-01-31 Method, apparatus, device, and computer-readable storage medium for detecting disease probability

Country Status (5)

Country Link
US (1) US20200126662A1 (zh)
JP (1) JP2019521418A (zh)
CN (1) CN107622801A (zh)
SG (1) SG11201810380VA (zh)
WO (1) WO2018149300A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622801A (zh) * 2017-02-20 2018-01-23 平安科技(深圳)有限公司 Method and apparatus for detecting disease probability
US11875904B2 (en) * 2017-04-27 2024-01-16 Koninklijke Philips N.V. Identification of epidemiology transmission hot spots in a medical facility
CN109035034A (zh) * 2018-06-12 2018-12-18 昆明理工大学 Health insurance actuarial system and method based on payment data
CN109147939A (zh) * 2018-09-21 2019-01-04 宜昌市疾病预防控制中心 Sampling device and sampling method for disease control
CN110827949B (zh) * 2019-10-31 2023-06-23 望海康信(北京)科技股份公司 Method, apparatus, electronic device, and readable storage medium for determining disease codes
CN111564223B (zh) * 2020-07-20 2021-01-12 医渡云(北京)技术有限公司 Method for predicting infectious disease survival probability, and method and apparatus for training the prediction model
CN112435757A (zh) * 2020-10-27 2021-03-02 深圳市利来山科技有限公司 Prediction device and system for acute hepatitis
US11378299B2 (en) * 2020-11-04 2022-07-05 Mann+Hummel Gmbh Metadata driven method and system for airborne viral infection risk and air quality analysis from networked air quality sensors
CN112750530A (zh) * 2021-01-05 2021-05-04 上海梅斯医药科技有限公司 Model training method, terminal device, and storage medium
CN115602328B (zh) * 2022-11-16 2023-05-26 深圳技术大学 Early-warning method and apparatus for acute leukemia
CN116304932B (zh) * 2023-05-19 2023-09-05 湖南工商大学 Sample generation method, apparatus, terminal device, and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
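CN103150611A (zh) * 2013-03-08 2013-06-12 北京理工大学 Hierarchical prediction method for the onset probability of type II diabetes
CN105603101A (zh) * 2016-03-03 2016-05-25 博奥颐和健康科学技术(北京)有限公司 Use of a system for detecting the expression levels of 8 miRNAs in preparing products for diagnosing or assisting in diagnosing hepatocellular carcinoma
CN105956382A (zh) * 2016-04-26 2016-09-21 北京工商大学 Optimized classification method for traditional Chinese medicine constitution based on a combined model of an improved CART decision tree and fuzzy naive Bayes
CN107622801A (zh) * 2017-02-20 2018-01-23 平安科技(深圳)有限公司 Method and apparatus for detecting disease probability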

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2008007630A1 (ja) * 2006-07-14 2009-12-10 日本電気株式会社 Protein search method and apparatus
US8604777B2 (en) * 2011-07-13 2013-12-10 Allegro Microsystems, Llc Current sensor with calibration for a current divider configuration
CN102340673B (zh) * 2011-10-25 2014-07-02 杭州藏愚科技有限公司 Camera white balance method for traffic scenes
CN102446302B (zh) * 2011-12-31 2014-07-02 浙江大学 Data preprocessing method for a water quality prediction system
EP3107061B1 (en) * 2014-02-12 2020-12-02 Shimura, Akiyoshi Disease detection system and disease detection method
TWI578262B (zh) * 2015-08-07 2017-04-11 緯創資通股份有限公司 Risk assessment system and data processing method
CN106127380A (zh) * 2016-06-22 2016-11-16 北京拓明科技有限公司 Big data risk analysis method

Also Published As

Publication number Publication date
SG11201810380VA (en) 2018-12-28
CN107622801A (zh) 2018-01-23
US20200126662A1 (en) 2020-04-23
JP2019521418A (ja) 2019-07-25

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018559946

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18754243

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18754243

Country of ref document: EP

Kind code of ref document: A1