WO2017041651A1 - Method and device for classifying user data - Google Patents

Method and device for classifying user data

Info

Publication number
WO2017041651A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
information
samples
feature information
user data
Prior art date
Application number
PCT/CN2016/097495
Other languages
English (en)
French (fr)
Inventor
白松
李禹�
武凯
潘静
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2017041651A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00 Subject matter not provided for in other main groups of this subclass

Definitions

  • The present application relates to the field of computers, and in particular to a technique for classifying user data.
  • Big data has attracted more and more attention.
  • With the rapid development of the Internet and mobile technology, the concept of big data has come to mean much more than a large amount of data (terabyte-scale data) and the technology for processing it, or the simple notion of the so-called “four Vs” (Volume, Variety, Value, and Velocity).
  • Big data is increasingly applied in various fields, including personal applications: personal information from all aspects of life is collected and organized to provide users with personalized services.
  • The purpose of the present application is to provide a method and device for classifying user data, to solve the problem of classifying user data effectively.
  • a method for classifying user data, including:
  • subdividing the to-be-determined samples into same-class samples or different-class samples by using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and the comprehensive description feature information of the user data therein;
  • classifying the user data based on the same-class samples and the different-class samples.
  • the user data is medical record data of the user
  • the distinguishing feature information includes identity feature information of the user
  • the comprehensive description feature information includes medical record feature information of the user.
  • the distinguishing feature information includes uniquely identifying distinguishing feature information and non-uniquely identifying distinguishing feature information.
  • dividing the sample into a same-class sample, a different-class sample, or a to-be-determined sample based on the distinguishing feature information of the two user data in the sample and the comparison information of the distinguishing feature information includes:
  • when both user data in the sample have uniquely identifying distinguishing feature information, dividing the sample into a same-class sample or a different-class sample based on the comparison information of the uniquely identifying distinguishing feature information;
  • when at least one of the two user data in the sample does not have uniquely identifying distinguishing feature information, dividing the sample into a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying distinguishing feature information;
  • when at least one of the two user data in the sample does not have distinguishing feature information, dividing the sample into a to-be-determined sample.
  • the predictive classification model is obtained by using a machine learning method to perform training based on the same type of samples and the different types of samples and the comprehensive description feature information of user data therein.
  • the machine learning method includes a logistic regression method
  • the process of obtaining the predictive classification model includes:
  • the logistic regression model is a relationship model between the difference information of a plurality of comprehensive description feature information of the two user data in the sample and the classification information of the sample;
  • the logistic regression model is trained based on the difference information and the classification information of the corresponding samples, to obtain the weight information of the difference information of each comprehensive description feature information in the logistic regression model.
  • the process of obtaining the predictive classification model further includes:
  • the logistic regression model is tested using a plurality of the difference information that was not used in training and the classification information of the corresponding samples.
  • subdividing the to-be-determined samples into same-class samples or different-class samples by using the predictive classification model includes:
  • the to-be-determined sample is subdivided into a same-class sample or a different-class sample based on the classification information of the sample.
  • the machine learning method includes a random forest method.
  • a classification device for user data including:
  • a comparing device configured to divide the sample into a same-class sample, a different-class sample, or a to-be-determined sample based on the distinguishing feature information of the two user data in the sample and the comparison information of the distinguishing feature information;
  • a training device configured to subdivide the to-be-determined samples into same-class samples or different-class samples by using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and the comprehensive description feature information of the user data therein;
  • a classifying device configured to classify the user data based on the same-class samples and the different-class samples.
  • the user data is medical record data of the user
  • the distinguishing feature information includes identity feature information of the user
  • the comprehensive description feature information includes medical record feature information of the user.
  • the distinguishing feature information includes uniquely identifying distinguishing feature information and non-uniquely identifying distinguishing feature information.
  • the comparing device is configured to:
  • when both user data in the sample have uniquely identifying distinguishing feature information, divide the sample into a same-class sample or a different-class sample based on the comparison information of the uniquely identifying distinguishing feature information;
  • when at least one of the two user data in the sample does not have uniquely identifying distinguishing feature information, divide the sample into a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying distinguishing feature information;
  • when at least one of the two user data in the sample does not have distinguishing feature information, divide the sample into a to-be-determined sample.
  • the predictive classification model is obtained by using a machine learning method to perform training based on the same type of samples and the different types of samples and the comprehensive description feature information of user data therein.
  • the machine learning method includes a logistic regression method
  • the process of obtaining the predictive classification model includes:
  • the logistic regression model is a relationship model between the difference information of a plurality of comprehensive description feature information of the two user data in the sample and the classification information of the sample;
  • the logistic regression model is trained based on the difference information and the classification information of the corresponding samples, to obtain the weight information of the difference information of each comprehensive description feature information in the logistic regression model.
  • the process of obtaining the predictive classification model further includes:
  • the logistic regression model is tested using a plurality of the difference information that was not used in training and the classification information of the corresponding samples.
  • the training device comprises:
  • an obtaining unit configured to acquire difference information of a plurality of comprehensive description feature information of the two user data in the to-be-determined sample;
  • an input unit configured to input the difference information into the logistic regression model to obtain classification information of the sample;
  • a sample unit configured to subdivide the to-be-determined sample into a same-class sample or a different-class sample based on the classification information of the sample.
  • the machine learning method includes a random forest method.
  • The present application generates a plurality of samples based on the user data, each of the samples including two user data having the same identification feature information; divides each sample into a same-class sample, a different-class sample, or a to-be-determined sample based on the distinguishing feature information of the two user data in the sample and the comparison information of the distinguishing feature information; then subdivides the to-be-determined samples into same-class samples or different-class samples by using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and the comprehensive description feature information of the user data therein; and classifies the user data based on the same-class samples and the different-class samples. The user data is thereby identified more accurately according to its feature information and is better assigned to the records of the corresponding user, and the comprehensive description feature information of each user is linked up so that users can be served better.
  • the present application is applicable to the medical field, for example, to association recognition of a user's medical examination record, etc.
  • the user data is medical record data of the user, such as a medical examination record of the user.
  • The present application generates a number of samples based on the user medical record data, each of the samples including two user medical record data having the same name; divides each sample into a same-class sample, a different-class sample, or a to-be-determined sample based on the distinguishing feature information of the two user medical record data in the sample and the comparison information of the distinguishing feature information; then subdivides the to-be-determined samples into same-class samples or different-class samples by using the predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and the medical comprehensive description feature information of the user data therein; and classifies the user medical record data based on the same-class samples and the different-class samples.
  • The predictive classification model is obtained by training with a machine learning method based on the same-class samples, the different-class samples, and the comprehensive description feature information of the user medical record data therein. The model obtained by this training can identify all the physical examination records, and its accuracy is high.
  • the better development and utilization of the massive user data accumulated by the medical examination institutions has great value for the users, medical institutions and society.
  • FIG. 1 is a schematic structural diagram of a classifying device for user data according to an aspect of the present application;
  • FIG. 2 shows a flow diagram of a specific scenario in accordance with a preferred embodiment of an aspect of the present application;
  • FIG. 3 shows a schematic block diagram of a training device 13 in accordance with a preferred embodiment of an aspect of the present application;
  • FIG. 4 is a flow chart showing a method for classifying user data according to still another aspect of the present application;
  • FIG. 5 is a flow chart showing the method of step S13 in accordance with a preferred embodiment of yet another aspect of the present application.
  • FIG. 1 shows a schematic structural diagram of a classifying device for user data according to an aspect of the present application; the device 1 comprises an obtaining device 11, a comparing device 12, a training device 13, and a classifying device 14.
  • The obtaining device 11 acquires a plurality of user data and generates a plurality of samples based on the user data, each of the samples comprising two user data having the same identification feature information. The comparing device 12 divides each sample into a same-class sample, a different-class sample, or a to-be-determined sample based on the distinguishing feature information of the two user data in the sample and the comparison information of the distinguishing feature information. The training device 13 subdivides the to-be-determined samples into same-class samples or different-class samples by using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and the comprehensive description feature information of the user data therein. The classifying device 14 classifies the user data based on the same-class samples and the different-class samples.
  • the device 1 includes, but is not limited to, a user equipment, or a device formed by integrating a user equipment and a network device through a network.
  • The user equipment includes, but is not limited to, any mobile electronic product that can interact with a user through a touchpad, such as a smart phone, a PDA, etc.; the mobile electronic product can adopt any operating system, such as the Android operating system, the iOS operating system, etc.
  • The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to instructions set or stored in advance, whose hardware includes but is not limited to a microprocessor, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network (Ad Hoc network), and the like.
  • The device 1 may also be a script program running on the user equipment, on the network device, or on a device formed by integrating the user equipment and the network device, or the network device and a touch terminal, through a network.
  • The above-mentioned device 1 is only an example; other existing or future devices, if applicable to the present application, are also included in the protection scope of the present application and are incorporated herein by reference.
  • The above devices work continuously.
  • Here, “continuously” means that each of the above devices works in real time, or according to a set or real-time-adjusted working mode. For example, the obtaining device 11 continuously acquires a plurality of user data and generates a plurality of samples based on the user data; the comparing device 12 continuously divides the samples based on the distinguishing feature information of the two user data in each sample and the comparison information of the distinguishing feature information; the training device 13 continuously uses the predictive classification model to subdivide the to-be-determined samples into same-class samples or different-class samples; and the classifying device 14 continuously classifies the user data based on the same-class samples and the different-class samples, until the device 1 completes its work or stops working.
  • The identification feature information is feature information capable of dividing user data into a plurality of different subsets, such as a person's name or a merchant's brand name. The distinguishing feature information is feature information that can be used to determine whether user data belong to the same user, such as a person's ID number, a person's mobile phone number, or a merchant's organization code.
  • The comprehensive description feature information is feature information that cannot by itself directly associate and identify user data, but that together gives a comprehensive description of the user, such as a person's physiological information (height, weight, blood pressure, etc.) or a merchant's business field or publicity.
  • The distinguishing feature information includes uniquely identifying distinguishing feature information and non-uniquely identifying distinguishing feature information.
  • The uniquely identifying distinguishing feature information is feature information that can directly determine whether the two user data in a sample are of the same class or of different classes, and hence whether the sample is a same-class sample or a different-class sample, such as a person's ID number or a merchant's organization code.
  • The non-uniquely identifying distinguishing feature information is feature information such that, when its values differ between the two user data in a sample, the two user data can be directly determined to be of different classes and the sample is a different-class sample; but when its values are the same, it cannot be directly determined whether the two user data are of the same class or of different classes, and hence whether the sample is a same-class sample. Examples include a person's gender, ethnicity, and nationality, and a merchant's business field.
  • The comparing device 12 is configured to: when both user data in the sample have uniquely identifying distinguishing feature information, divide the sample into a same-class sample or a different-class sample based on the comparison information of the uniquely identifying distinguishing feature information; when at least one of the two user data in the sample does not have uniquely identifying distinguishing feature information, divide the sample into a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying distinguishing feature information; when at least one of the two user data in the sample does not have distinguishing feature information, divide the sample into a to-be-determined sample.
  • the predictive classification model is obtained by using a machine learning method to perform training based on the same type of samples and the different types of samples and the comprehensive description feature information of user data therein.
  • the machine learning method includes a logistic regression method
  • The process of obtaining the predictive classification model includes: creating a logistic regression model, the logistic regression model being a relationship model between the difference information of a plurality of comprehensive description feature information of the two user data in a sample and the classification information of the sample; acquiring the difference information of the plurality of comprehensive description feature information of the two user data in the same-class samples and the different-class samples, together with the classification information of the corresponding samples; and training the logistic regression model based on the difference information and the classification information of the corresponding samples, to obtain the weight information of the difference information of each comprehensive description feature information in the logistic regression model.
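  • The training step described above might be sketched as follows. This is only an illustration under stated assumptions: the field names (`height_cm`, `weight_kg`), the use of absolute differences as the "difference information", the example records, and the plain gradient-descent trainer are all hypothetical and are not taken from the application, which does not specify how the model is fitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def difference_vector(rec_a, rec_b, fields):
    """Absolute per-field difference between two records (an assumed encoding)."""
    return [abs(rec_a[f] - rec_b[f]) for f in fields]

def train(X, y, lr=0.01, epochs=20000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, k = len(X), len(X[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, xi))) - yi
            grad_b += err
            for j, v in enumerate(xi):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

FIELDS = ("height_cm", "weight_kg")  # hypothetical level-C style fields

# Hypothetical labelled pairs: 1 = same-class sample, 0 = different-class sample.
labelled_pairs = [
    ({"height_cm": 170.0, "weight_kg": 65.0}, {"height_cm": 170.5, "weight_kg": 66.0}, 1),
    ({"height_cm": 160.0, "weight_kg": 50.0}, {"height_cm": 161.0, "weight_kg": 50.5}, 1),
    ({"height_cm": 180.0, "weight_kg": 75.0}, {"height_cm": 180.25, "weight_kg": 76.5}, 1),
    ({"height_cm": 165.0, "weight_kg": 55.0}, {"height_cm": 173.0, "weight_kg": 67.0}, 0),
    ({"height_cm": 158.0, "weight_kg": 48.0}, {"height_cm": 168.0, "weight_kg": 57.0}, 0),
    ({"height_cm": 175.0, "weight_kg": 80.0}, {"height_cm": 181.0, "weight_kg": 65.0}, 0),
]

X = [difference_vector(a, b, FIELDS) for a, b, _ in labelled_pairs]
y = [label for _, _, label in labelled_pairs]
weights, bias = train(X, y)  # one weight per difference feature

# Re-score the training pairs with the learned weights.
predicted = [
    int(sigmoid(bias + sum(w * v for w, v in zip(weights, xi))) >= 0.5) for xi in X
]
```

  • The learned `weights` play the role of the weight information of each difference feature; in practice a library implementation of logistic regression would normally be used instead of this hand-written trainer.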
  • The process of obtaining the predictive classification model further comprises: testing the logistic regression model by using a plurality of the difference information that was not used in training and the classification information of the corresponding samples.
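  • Testing on held-out difference information could look like the sketch below. The weights, bias, threshold, and held-out pairs are hypothetical values chosen for illustration; the application does not state which test metric is used, so plain accuracy is an assumption.

```python
import math

# Hypothetical weights and bias, standing in for a trained logistic regression model.
WEIGHTS = (-0.5, -0.5)
BIAS = 3.0

def predict_same(diffs, threshold=0.5):
    """True when the model scores the pair as a same-class sample."""
    z = BIAS + sum(w * d for w, d in zip(WEIGHTS, diffs))
    return 1.0 / (1.0 + math.exp(-z)) >= threshold

def accuracy(held_out):
    """Fraction of held-out (differences, label) pairs predicted correctly."""
    correct = sum(predict_same(diffs) == bool(label) for diffs, label in held_out)
    return correct / len(held_out)

# Hypothetical held-out samples that were not used for training.
held_out = [
    ([0.3, 0.8], 1),
    ([9.0, 11.0], 0),
    ([1.0, 2.0], 0),   # small differences but different people: a likely miss
    ([0.1, 0.4], 1),
]
score = accuracy(held_out)
```

  • The third pair shows why held-out testing matters: two different people can have similar measurements, so the model's error rate cannot be judged from training pairs alone.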
  • The training device 13 includes: an obtaining unit 131, which acquires difference information of the plurality of comprehensive description feature information of the two user data in the to-be-determined sample; an input unit 132, which inputs the difference information into the logistic regression model to obtain classification information of the sample; and a sample unit 133, which subdivides the to-be-determined sample into a same-class sample or a different-class sample based on the classification information of the sample.
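  • The three units can be pictured as three small functions, as in the sketch below. The function names, field names, weights, and the 0.5 decision threshold are assumptions made for illustration and do not come from the application.

```python
import math

FIELDS = ("height_cm", "weight_kg")  # hypothetical comprehensive description fields
WEIGHTS = (-0.5, -0.5)               # hypothetical trained weights
BIAS = 3.0                           # hypothetical trained bias

def obtain_differences(rec_a, rec_b):
    """Role of obtaining unit 131: per-field differences of the two records."""
    return [abs(rec_a[f] - rec_b[f]) for f in FIELDS]

def classification_info(diffs):
    """Role of input unit 132: probability that the pair is a same-class sample."""
    z = BIAS + sum(w * d for w, d in zip(WEIGHTS, diffs))
    return 1.0 / (1.0 + math.exp(-z))

def subdivide(pending_samples, threshold=0.5):
    """Role of sample unit 133: split pending pairs into same / different."""
    same, different = [], []
    for rec_a, rec_b in pending_samples:
        p = classification_info(obtain_differences(rec_a, rec_b))
        target = same if p >= threshold else different
        target.append((rec_a["id"], rec_b["id"]))
    return same, different

pending = [
    ({"id": 1, "height_cm": 170.0, "weight_kg": 65.0},
     {"id": 2, "height_cm": 170.5, "weight_kg": 66.0}),
    ({"id": 3, "height_cm": 158.0, "weight_kg": 48.0},
     {"id": 4, "height_cm": 168.0, "weight_kg": 57.0}),
]
same, different = subdivide(pending)
```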
  • the machine learning method includes a random forest method.
  • The device 1 of the present application is used for classifying user data.
  • The identification feature information is used to divide the user data into a plurality of different subsets; within each subset the user data are paired two by two into samples, and each sample is compared.
  • The distinguishing feature information is used to compare the samples, yielding same-class samples, different-class samples, and to-be-determined samples that cannot be decided directly from the distinguishing feature information.
  • A model is then trained on the comprehensive description feature information of the user data in the same-class samples and the different-class samples, that is, on the "portrait" of the user data; the trained model is used to further decide the to-be-determined samples, and the user data are then classified according to the same-class samples and the different-class samples.
  • This classification enables the user data to be identified and classified more accurately according to the feature information of the user data, thereby laying a foundation for providing personalized services for the user.
  • The device 1 of the present application is preferably applied to the medical field, and the user data is preferably user medical record data, such as previous physical examination record data or previous hospital examination record data.
  • The user medical record data usually includes the user's name information and medical comprehensive description feature information related to the user's physiological and physical condition, and may also include distinguishing feature information that differs between users, such as ID number, gender, date of birth, blood type, ethnicity, etc.
  • Medical institutions have for many years identified a user's medical examination data by name plus ID card number (or mobile phone number). Because duplicate names are common, the same name may actually correspond to different users, and distinguishing feature information such as the user's ID number and mobile phone number is often missing.
  • As a result, a medical institution can only make a diagnosis from the user's current physical examination results and provides no follow-up service after the examination. The institution is unable to link up each user's medical examination records over the years, cannot obtain the continuous changes in each user's physical indicators over the years, and is therefore unable to provide users with better personalized services.
  • The device 1 of the present application can classify user data bearing the same user name by means of the user distinguishing feature information and the medical comprehensive description feature information that may be present in the user medical record data, thereby obtaining the records of each user in the medical institution.
  • the classification of the user's medical examination records for association identification is taken as an example, and the device 1 described in the present application is specifically applied to the medical field.
  • the present application can also be applied to the classification of other user data in the medical field, such as a user's medical diagnosis record, a major illness record, a health follow-up record, and the like.
  • The distinguishing feature information is preferably the user's identity feature information, such as ethnicity, gender, age, etc.
  • The comprehensive description feature information is preferably the user's medical record feature information, such as physiological data in the medical record, past medical history records, and so on.
  • the obtaining means 11 acquires a plurality of user medical record data, and generates a plurality of samples based on the user medical record data, each of the samples including two user medical record data having the same user name information.
  • The user medical record data includes physical examination record data. All the medical examination records are divided into a plurality of subsets according to the name of the examined user; each subset includes one or more medical examination records and may be a collection of the medical examination records of several people. If a subset contains only one medical examination record, that is, the person shares a name with nobody else and has been examined only once, then that record belongs uniquely to that person; if a subset contains two or more medical examination records, any two records in the subset are taken as one sample, so that several samples are obtained.
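  • The grouping and pairing just described can be sketched as follows. The record layout (`id`, `name`) and the example names are hypothetical; only the idea of grouping by name and taking every pair within a group comes from the text.

```python
from collections import defaultdict
from itertools import combinations

def generate_samples(records):
    """Group records by name and return every unordered pair within each group."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)
    samples = []
    for group in by_name.values():
        # A subset holding a single record yields no pairs and needs no comparison.
        samples.extend(combinations(group, 2))
    return samples

# Hypothetical examination records.
records = [
    {"id": 1, "name": "Zhang Wei"},
    {"id": 2, "name": "Zhang Wei"},
    {"id": 3, "name": "Zhang Wei"},
    {"id": 4, "name": "Li Na"},
]
pairs = [(a["id"], b["id"]) for a, b in generate_samples(records)]
print(pairs)  # [(1, 2), (1, 3), (2, 3)]
```

  • A subset of n records produces n(n-1)/2 samples, so very common names dominate the number of comparisons.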
  • The comparing device 12 divides the sample into a same-class sample, a different-class sample, or a to-be-determined sample based on the identity feature information of the two user medical record data in the sample and the comparison information of the identity feature information.
  • The identity feature information of the user may be any information about the user's identity, for example, but not limited to, name, gender, ID card number, mobile phone number, marital status, nationality, employment status, total years of service, etc.
  • The identity feature information chosen by feature selection includes ID number, social security card number, gender, date of birth, blood type, and ethnicity; of course, the selected identity feature information is not limited to the items listed above.
  • The samples are classified according to a set method. For example, when both user medical record data in the sample have the uniquely identifying identity feature information, the ID number, the two ID numbers are compared: if they are the same, the sample is classified as a same-class sample; if they are different, it is classified as a different-class sample.
  • When the ID number is missing from at least one of the two user medical record data in the sample, the non-uniquely identifying identity feature information of the two users (gender, date of birth, blood type, and ethnicity) is compared: if any one of them differs, the sample is classified as a different-class sample; otherwise, the sample is classified as a to-be-determined sample.
  • When at least one of the two user medical record data has no identity feature information, the sample is also classified as a to-be-determined sample.
  • The training device 13 further subdivides the to-be-determined samples into same-class samples or different-class samples by using a predictive classification model, where the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and the record feature information of the user medical record data therein.
  • The record feature information includes medical record feature information, that is, information from the user's medical record, including but not limited to height, weight, pulse, blood glucose, systolic blood pressure, diastolic blood pressure, hemoglobin, alanine aminotransferase, the number of days between physical examinations, etc.
  • the classification device 14 classifies the user data based on the same class of samples and different types of samples.
  • After the above steps, a series of relationship pairs is obtained, each indicating whether two medical examination records belong to the same person.
  • These relationship pairs can be aggregated by ODPS or Hadoop, so that each user data item (i.e., physical examination record) is assigned to one user, and the series of medical examination records corresponding to each user is then obtained.
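  • One way to picture this aggregation on a small scale is a union-find pass over the same-class pairs, as sketched below. This is an in-memory illustration only: the application names ODPS or Hadoop for this step and does not describe the algorithm, so the union-find approach, the function name, and the assumption that "same person" is transitive are all hypothetical.

```python
def cluster_records(record_ids, same_pairs):
    """Merge records linked by same-class pairs into one group per user."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(clusters.values())

# Hypothetical record ids and same-class pairs.
groups = cluster_records([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
print(groups)  # [[1, 2, 3], [4], [5]]
```

  • Records 4 and 5 appear in no same-class pair, so each ends up as its own single-record user.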
  • ODPS is the Open Data Processing Service.
  • Hadoop is a software framework for distributed processing of large amounts of data; users can develop and run applications that process massive amounts of data on Hadoop. Those skilled in the art should understand that the classification of user data is not limited to using ODPS or Hadoop; other classification methods, existing or arising in the future, that are applicable to the user data are also included in the scope of the present application.
  • The identity feature information includes uniquely identifying identity feature information and non-uniquely identifying identity feature information.
  • The uniquely identifying identity feature information is information that uniquely represents each user's identity and does not change once assigned, such as an ID number or a social security card number. It can determine whether two user medical record data belong to the same person, and it is decisive in both directions: if two user medical record data carry the same ID number, they belong to the same person; otherwise, they do not belong to the same person.
  • The non-uniquely identifying identity feature information is information that reflects the user's identity, including information that does not change for the user, but that is not unique.
  • The comparing device 12 is configured to: when both user medical record data in the sample have uniquely identifying identity feature information, divide the sample into a same-class sample or a different-class sample based on the comparison information of the uniquely identifying identity feature information; when at least one of the two user medical record data in the sample does not have uniquely identifying identity feature information, divide the sample into a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying identity feature information; when at least one of the two user medical record data in the sample does not have identity feature information, divide the sample into a to-be-determined sample.
  • The features selected from the user's medical examination data are classified into level A, level B, and level C. Level A is uniquely identifying identity feature information, including ID number and social security card number; level B is non-uniquely identifying identity feature information, including gender, date of birth, blood type, and ethnicity; level C is record feature information, including height, weight, pulse, blood glucose, systolic blood pressure, diastolic blood pressure, hemoglobin, alanine aminotransferase, and the number of days between physical examinations.
  • Figure 2 shows the flow chart of a specific scenario in which the ID number and social security card number information is missing for most users.
  • The first step is to determine whether the level A information exists in the two physical examination records. If both records contain an ID number or a social security card number, the numbers are compared: if they are the same, the two records belong to the same person; if they are different, the two records belong to different people. If at least one of the two records is missing both the ID number and the social security card number, the decision continues with level B. If the two records contain level B information and any one of gender, date of birth, blood type, and ethnicity differs, it can be directly determined that the two records belong to different people.
  • If the two physical examination records do not contain level B information, or the level B information is the same after comparison, it cannot be determined whether the two records belong to the same person, and a further determination by level C is required.
  • After the comparison of level A and level B, if the two physical examination records belong to the same person, the sample containing them is classified as a same-class sample, which can be recorded as a positive sample; if they belong to different people, the sample is classified as a different-class sample, which can be recorded as a negative sample; if the comparison of level A and level B cannot decide, the sample is classified as a to-be-determined sample. Likewise, when at least one of the two records lacks the information in level A and level B, the sample is classified as a to-be-determined sample.
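  • The level A and level B comparison can be sketched as a small rule function. This is a minimal illustration, not the application's implementation: the field names (`id_number`, `social_security_number`, and so on) and the return labels are hypothetical, and the sketch assumes that the first level A field present in both records decides the pair.

```python
# Hypothetical field names; the text only names the information levels.
LEVEL_A = ("id_number", "social_security_number")               # uniquely identifying
LEVEL_B = ("gender", "birth_date", "blood_type", "ethnicity")   # non-uniquely identifying


def triage(rec1: dict, rec2: dict) -> str:
    """Return 'same', 'different', or 'undetermined' for a pair of records."""
    # Level A: the first field present in both records decides the pair outright.
    for field in LEVEL_A:
        a, b = rec1.get(field), rec2.get(field)
        if a is not None and b is not None:
            return "same" if a == b else "different"
    # Level B: a mismatch on any field present in both records means different people.
    for field in LEVEL_B:
        a, b = rec1.get(field), rec2.get(field)
        if a is not None and b is not None and a != b:
            return "different"
    # Matching or missing level B values cannot confirm identity; defer to level C.
    return "undetermined"
```

  • Pairs labelled `same` and `different` would serve as the positive and negative samples, while `undetermined` pairs are left for the predictive classification model.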
  • The predictive classification model is obtained by training with a machine learning method, based on the same-class samples, the different-class samples, and the medical record feature information of the user data in them.
  • The predictive classification model determines whether the two user medical record data in a to-be-determined sample belong to the same person. In the specific scenario above, a machine learning method is used to create a logistic regression model that relates the difference information of several items of medical record information of the two user medical record data in the positive and negative samples to the classification information of those samples.
  • The difference information of the several items of medical record information of the two user medical record data in the positive and negative samples, together with the classification information of the corresponding samples, is input into the created model. The logistic regression model is then trained on the difference information and the classification information of the corresponding samples, that is, on the positive and negative samples, to obtain the weight information for the difference information of each item of medical record information in the logistic regression model.
  • The machine learning method comprises a logistic regression method, and the process of obtaining the predictive classification model comprises: creating a logistic regression model, the logistic regression model being a relationship model between the difference information of several items of medical record feature information of the two user medical record data in a sample and the classification information of the sample; acquiring the difference information of the several items of medical record feature information of the two user medical record data in the same-class samples and the different-class samples, together with the classification information of the corresponding samples; and training the logistic regression model on the difference information and the classification information of the corresponding samples to obtain the weight information for the difference information of each item of medical record feature information in the logistic regression model.
  • The same-class samples are recorded as positive samples and the different-class samples as negative samples.
  • For each positive and negative sample, the differences between the level C feature information of its two physical examination records are calculated, giving a set of the following form: {height difference, weight difference, pulse difference, blood glucose difference, systolic blood pressure difference, diastolic blood pressure difference, hemoglobin difference, alanine aminotransferase difference, number of days between physical examinations}.
  • The resulting collection is divided into a training set and a test set at a ratio of 8:2, and the ratio of positive to negative samples in both the training set and the test set is 1:1.
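  • As an illustration of how the nine-element difference set might be computed for one sample, the sketch below takes two records as dictionaries. The field names and the `exam_date` key are hypothetical, and the use of absolute differences is an assumption; the text only says that the differences are calculated.

```python
from datetime import date

# Hypothetical field names for the eight level C measurements.
LEVEL_C = ("height", "weight", "pulse", "blood_glucose",
           "systolic", "diastolic", "hemoglobin", "alt")


def difference_vector(rec1: dict, rec2: dict) -> list:
    """Eight absolute measurement differences plus the examination interval in days."""
    diffs = [abs(rec1[f] - rec2[f]) for f in LEVEL_C]
    # Ninth element: number of days between the two examinations (assumed key name).
    diffs.append(abs((rec1["exam_date"] - rec2["exam_date"]).days))
    return diffs
```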
  • The logistic regression model is expressed in terms of the weight coefficients C0, C1, C2, ..., C9 and the classification result Y.
  • When Y is greater than or equal to 0.5, the two physical examination records belong to the same person; when Y is less than 0.5, the two physical examination records belong to different people.
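  • Assuming the standard logistic (sigmoid) form, a model consistent with these definitions can be written as below. The symbols x_1, ..., x_9 are introduced here for illustration and denote the nine difference values of a sample; C_0 then acts as the intercept and C_1, ..., C_9 as the weights of the individual differences.

```latex
Y = \frac{1}{1 + e^{-\left(C_0 + C_1 x_1 + C_2 x_2 + \cdots + C_9 x_9\right)}}
```

  • Under this form Y always lies strictly between 0 and 1, which is what makes the 0.5 threshold meaningful.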
  • The created logistic regression model is trained on the positive and negative samples in the training set to obtain the values of the weight coefficients C0, C1, C2, ..., C9; the magnitude of each weight coefficient indicates the influence of the corresponding feature information on the classification result.
  • the obtaining of the predictive classification model further comprises: testing the logistic regression model with a plurality of the difference information that has not been trained and the classification information of the corresponding sample.
  • After the weight coefficients are obtained by training the created logistic regression model on the positive and negative samples in the training set, the model is tested on the positive and negative samples in the test set to calculate its accuracy and AUC (area under the curve) value.
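  • The two test metrics can be sketched in plain Python as follows. The function names are hypothetical, labels are assumed to be 1 for positive and 0 for negative samples, and the AUC is computed by the pairwise-ranking definition, which is one of several equivalent ways to obtain it.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score agrees with the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels, scores):
    """Area under the ROC curve as the share of correctly ranked pos/neg pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A positive scored above a negative counts 1; a tie counts one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```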
  • FIG. 3 shows a block diagram of a training device 13 in accordance with a preferred embodiment of an aspect of the present application.
  • the apparatus includes an acquisition unit 131, an input unit 132, and a sample unit 133.
  • The acquisition unit 131 acquires the difference information of several items of medical record feature information of the two user medical record data in the to-be-determined sample; the input unit 132 inputs the difference information into the logistic regression model to obtain the classification information of the sample; the sample unit 133 further divides the to-be-determined sample into a same-class sample or a different-class sample based on the classification information of the sample.
  • Specifically, the acquisition unit 131 acquires the difference information of the several items of medical record feature information of the two user medical record data in the to-be-determined sample.
  • Here, the difference information refers to the height difference, weight difference, pulse difference, blood glucose difference, systolic blood pressure difference, diastolic blood pressure difference, hemoglobin difference, alanine aminotransferase difference, and number of days between physical examinations, each calculated from the corresponding information of the two user medical record data in the to-be-determined sample.
  • The input unit 132 inputs the difference information into the logistic regression model to obtain the classification information of the sample. That is, the calculated difference information is input into the logistic regression model with the trained weight coefficients, and the classification result value Y is calculated.
  • The sample unit 133 further divides the to-be-determined sample into a same-class sample or a different-class sample based on the classification information of the sample. The decision follows the calculated Y value: when Y is greater than or equal to 0.5, the to-be-determined sample is a same-class sample, that is, the two physical examination records in the sample belong to the same person; when Y is less than 0.5, the to-be-determined sample is a different-class sample, that is, the two records do not belong to the same person.
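  • The work of units 132 and 133 for one to-be-determined sample can be sketched as a single function. The function name is hypothetical, and the sketch assumes the standard sigmoid form of logistic regression, with `weights[0]` in the role of C0 and `weights[1:]` in the role of C1 to C9.

```python
import math


def predict_same_person(diffs, weights, threshold=0.5):
    """Return (Y, is_same) for the difference values of one sample."""
    # Linear part: C0 plus the weighted sum of the difference values.
    z = weights[0] + sum(c * x for c, x in zip(weights[1:], diffs))
    y = 1.0 / (1.0 + math.exp(-z))   # sigmoid maps z into (0, 1)
    return y, y >= threshold
```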
  • the machine learning method comprises a random forest method.
  • To obtain the predictive classification model by training, let N denote the number of same-class and different-class samples used for training and M the number of variables. A number m, smaller than M, is given and determines how many variables are used when making the decision at a node. From the N training cases, N samples are drawn with replacement to form a training set (that is, bootstrap sampling), and a tree is grown from this training set.
  • The tree is used to predict the class of the to-be-determined samples, that is, to determine whether the two records of each to-be-determined sample belong to the same person, and its error is evaluated. For each node, m variables are selected at random as the basis for the decision at that node, and the best split is calculated from these m variables. Each tree is grown fully without pruning.
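  • The two sources of randomness described above, bootstrap sampling of the N training cases and the random choice of m out of M variables at a node, can be sketched as follows. This covers only those two steps and not the tree growing itself; the function names are hypothetical.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) cases with replacement to form one tree's training set."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]


def candidate_features(num_features, m, rng):
    """Pick m distinct indices out of the M variables for a single node split."""
    if not 0 < m < num_features:
        raise ValueError("m must satisfy 0 < m < M")
    return rng.sample(range(num_features), m)
```

  • Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when comparing trees.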
  • According to still another aspect of the present application, a flow chart shows a method for classifying user data, the method including step S11, step S12, step S13, and step S14.
  • In step S11, a plurality of user data are acquired and a plurality of samples are generated from the user data, each sample including two user data having the same identification feature information. In step S12, the sample is divided into a same-class sample, a different-class sample, or a to-be-determined sample based on the distinguishing feature information of the two user data in the sample and the comparison information of that distinguishing feature information. In step S13, the to-be-determined sample is further divided into a same-class sample or a different-class sample by using a predictive classification model, where the predictive classification model is obtained by training on the same-class samples, the different-class samples, and the comprehensive description feature information of the user data in them. In step S14, the user data is classified based on the same-class samples and the different-class samples.
  • The identification feature information is feature information that can divide user data into a plurality of different subsets, such as the name of a person or the brand name of a merchant. The distinguishing feature information is feature information that can be used to determine whether user data belong to the same user, such as a person's ID number, a person's mobile phone number, or a merchant's organization code.
  • The comprehensive description feature information is feature information that cannot directly associate user data with a user, but that, taken together, can build a "portrait" of the user corresponding to the user data and so indirectly determine whether user data belong to the same user, for example a person's physiological information, including height, weight, and blood pressure.
  • The distinguishing feature information includes uniquely identifying distinguishing feature information and non-uniquely identifying distinguishing feature information.
  • The uniquely identifying distinguishing feature information is feature information that can directly determine whether the two user data in a sample are of the same class or of different classes, and hence whether the sample is a same-class sample or a different-class sample, such as a person's ID number or a merchant's organization code.
  • The non-uniquely identifying distinguishing feature information is feature information for which, when the corresponding values of the two user data in a sample differ, the two user data can be directly determined to be of different classes and the sample a different-class sample, but when the corresponding values are the same, it cannot be directly determined whether the two user data are of the same class or of different classes; examples are a person's gender, ethnicity, and nationality, and a merchant's business district.
  • In step S12, when both user data in the sample have uniquely identifying distinguishing feature information, the sample is divided into a same-class sample or a different-class sample based on the comparison information of the uniquely identifying distinguishing feature information; when at least one of the two user data in the sample does not have uniquely identifying distinguishing feature information, the sample is divided into a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying distinguishing feature information; and when at least one of the two user data in the sample does not have distinguishing feature information, the sample is divided into a to-be-determined sample.
  • the predictive classification model is obtained by using a machine learning method to perform training based on the same type of samples and the different types of samples and the comprehensive description feature information of user data therein.
  • The machine learning method includes a logistic regression method, and the process of obtaining the predictive classification model includes: creating a logistic regression model, the logistic regression model being a relationship model between the difference information of several items of comprehensive description feature information of the two user data in a sample and the classification information of the sample; acquiring the difference information of the several items of comprehensive description feature information of the two user data in the same-class samples and the different-class samples, together with the classification information of the corresponding samples; and training the logistic regression model on the difference information and the classification information of the corresponding samples to obtain the weight information for the difference information of each item of comprehensive description feature information in the logistic regression model.
  • the process of obtaining the predictive classification model further comprises: testing the logistic regression model by using a plurality of the difference information that has not been trained and the classification information of the corresponding sample.
  • step S13 includes: step S131, acquiring difference information of a plurality of comprehensive description feature information of two user data in the sample to be determined; and step S132, inputting the difference information into the logistic regression model to obtain the Classification information of the sample; step S133, subdividing the sample to be determined into the same class sample or different class samples based on the classification information of the sample.
  • the machine learning method includes a random forest method.
  • The method described in the present application is used for classifying user data. The identification feature information is first used to divide the user data into a plurality of different subsets, and samples are formed by pairing the user data within each subset. Each sample is then compared using the distinguishing feature information, which yields same-class samples, different-class samples, and to-be-determined samples that the distinguishing feature information cannot decide directly.
  • A model, in effect a "portrait" of the user data, is obtained by training on the comprehensive description feature information of the user data in the same-class and different-class samples, and the trained model is used to further decide the to-be-determined samples. The user data in each subset is then classified according to the same-class and different-class samples, so that user data can be identified and classified more accurately from its feature information, laying a foundation for providing personalized services to users.
  • A medical institution uses the name plus the ID card number (or mobile phone number) to identify the physical examination data that a user accumulates over many years. Because duplicate names are common, the same name may in fact correspond to different users, and distinguishing feature information such as the ID number and mobile phone number is often missing.
  • As a result, the medical institution can only make a diagnosis from the user's current physical examination results and provides no follow-up service after the examination. It cannot link each user's physical examination records across the years or observe the continuous change of each user's physical indicators over those years, and so cannot provide users with better personalized services.
  • With the method described in the present application, the distinguishing feature information and medical comprehensive description feature information that may be present in the user medical record data can be used to classify user data that share the same user name, thereby obtaining the series of medical record data, including physical examination records, that corresponds to each user in the medical institution. This links the physical examination records across the years and improves the accuracy of associating and identifying user data, while making better use of the massive user data accumulated by medical institutions, which is of great value to users, medical institutions, and society.
  • In the following, the classification of users' physical examination records is taken as an example to describe the method of the present application as applied to the medical field.
  • the present application can also be applied to the classification of other user data in the medical field, such as a user's medical diagnosis record, a major illness record, a health follow-up record, and the like.
  • The distinguishing feature information is preferably the user's identity feature information, such as ethnicity, gender, and age, and the comprehensive description feature information is preferably the user's medical record feature information, such as the physiological data in the medical record and the past medical history record.
  • step S11 a plurality of user medical record data are acquired, and a plurality of samples are generated based on the user medical record data, each of the samples including two user medical record data having the same user name information.
  • The user medical record data includes physical examination record data. All physical examination records are divided into a plurality of subsets according to the name of the examined user; each subset includes one or more physical examination records and may be a collection of records from several people. If a subset contains only one record, that is, the person shares a name with no one else and has been examined only once, the record uniquely belongs to that person. If a subset contains two or more records, every pair of records in the subset is taken as one sample, which yields a number of samples.
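  • Sample generation as described, grouping records by name and taking every pair within a group, can be sketched with the standard library. The function name and the `name` key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_samples(records):
    """Group records by name and return every within-group pair as one sample."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)
    samples = []
    for group in by_name.values():
        # A subset holding a single record produces no pairs.
        samples.extend(combinations(group, 2))
    return samples
```

  • A subset with k records yields k(k-1)/2 samples, so very common names dominate the sample count.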
  • In step S12, based on the distinguishing feature information of the two user medical record data in the sample and the comparison information of that distinguishing feature information, the sample is divided into a same-class sample, a different-class sample, or a to-be-determined sample.
  • The identity feature information of the user may be any information about the identity of the user, for example, but not limited to, name, gender, identity card number, mobile phone number, marital status, nationality, employment status, and total years of service. The distinguishing feature information is selected from these features and includes the ID number, social security card number, gender, date of birth, blood type, and ethnicity.
  • The selected distinguishing feature information is not limited to the items listed above.
  • The samples are classified according to a set method. For example, when both user medical record data in the sample have the uniquely identifying distinguishing feature information, here the ID number, the two ID numbers are compared: if they are the same, the sample containing the two user medical record data is classified as a same-class sample; if they are different, it is classified as a different-class sample.
  • When at least one of the two user medical record data in the sample is missing the ID number, the non-uniquely identifying distinguishing feature information of the two users, namely gender, date of birth, blood type, and ethnicity, is compared: if any one of them differs, the sample is classified as a different-class sample; if all of them are the same, the sample is classified as a to-be-determined sample.
  • When at least one of the two user medical record data has no distinguishing feature information, the sample is likewise classified as a to-be-determined sample.
  • In step S13, the to-be-determined sample is further divided into a same-class sample or a different-class sample by using a predictive classification model, where the predictive classification model is obtained by training on the same-class samples, the different-class samples, and the comprehensive description feature information of the user medical record data in them.
  • the comprehensive description feature information includes medical comprehensive description feature information, for example, the user's physical examination record information, including height, weight, pulse, blood glucose, systolic blood pressure, diastolic blood pressure, hemoglobin, alanine aminotransferase, and physical examination interval days.
  • step S14 the user data is classified based on the same class of samples and different types of samples.
  • After each sample has been determined to be a same-class sample or a different-class sample, a series of relationship pairs is obtained in which two physical examination records belong to the same person. These results can be aggregated with ODPS or Hadoop so that each user data (that is, each physical examination record) is assigned to a user, giving the series of physical examination records that corresponds to each user.
  • ODPS stands for Open Data Processing Service. Hadoop is a software framework for distributed processing of large amounts of data, on which users can develop and run applications that process massive data. Those skilled in the art should understand that the classification of user data is not limited to ODPS or Hadoop; other classification methods for user data, including those that may appear in the future, also fall within the scope of the present application.
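  • On a small scale, the aggregation of same-person pairs into per-user record groups can be illustrated with an in-memory union-find structure. This is a stand-in chosen for illustration, not the ODPS or Hadoop job the text mentions; the function name and the use of record identifiers are assumptions.

```python
def group_records(record_ids, same_pairs):
    """Merge records linked by same-person pairs; return one sorted group per user."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_pairs:
        parent[find(a)] = find(b)   # link the two groups

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())
```

  • Records that never appear in a same-person pair stay in a group of their own, which matches the case of a user with a single examination record.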
  • The distinguishing feature information includes uniquely identifying distinguishing feature information and non-uniquely identifying distinguishing feature information.
  • The uniquely identifying distinguishing feature information is information that stands for exactly one user identity and never changes once assigned, such as an identity card number or a social security card number. It can decide whether two user medical record data belong to the same person in both directions, that is, it has both a positive and a negative effect: if the two user medical record data carry the same ID number, they belong to the same person; otherwise they do not.
  • The non-uniquely identifying distinguishing feature information is information that reflects the user's distinguishing features and does not change, but is not unique to the user. It can only determine that two user medical record data do not belong to the same person, that is, it has only a negative effect. Examples are gender, date of birth, blood type, and ethnicity: if the gender in the two user medical record data differs, the two user medical record data do not belong to the same person.
  • In step S12, when both user medical record data in the sample have uniquely identifying distinguishing feature information, the sample is divided into a same-class sample or a different-class sample based on the comparison information of the uniquely identifying distinguishing feature information; when at least one of the two user medical record data in the sample does not have uniquely identifying distinguishing feature information, the sample is divided into a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying distinguishing feature information; and when at least one of the two user medical record data in the sample does not have distinguishing feature information, the sample is divided into a to-be-determined sample.
  • The feature information selected from the user's physical examination data is classified into level A, level B, and level C. Level A is the uniquely identifying distinguishing feature information, including the ID number and the social security card number; level B is the non-uniquely identifying distinguishing feature information, including gender, date of birth, blood type, and ethnicity; level C is the comprehensive description feature information, including height, weight, pulse, blood glucose, systolic blood pressure, diastolic blood pressure, hemoglobin, alanine aminotransferase, and the number of days between physical examinations.
  • FIG. 2 shows the flow chart of the specific scenario.
  • The first step is to determine whether the level A information exists in the two physical examination records. If both records contain an ID number or a social security card number, the numbers are compared: if they are the same, the two records belong to the same person; if they are different, the two records belong to different people. If at least one of the two records is missing both the ID number and the social security card number, the decision continues with level B. If the two records contain level B information and any one of gender, date of birth, blood type, and ethnicity differs, it can be directly determined that the two records belong to different people.
  • If the two physical examination records do not contain level B information, or the level B information is the same after comparison, it cannot be determined whether the two records belong to the same person, and a further determination by level C is required.
  • After the comparison of level A and level B, if the two physical examination records belong to the same person, the sample containing them is classified as a same-class sample, which can be recorded as a positive sample; if they belong to different people, the sample is classified as a different-class sample, which can be recorded as a negative sample; if the comparison of level A and level B cannot decide, the sample is classified as a to-be-determined sample. Likewise, when at least one of the two records lacks the information in level A and level B, the sample is classified as a to-be-determined sample.
  • the predictive classification model is obtained by using a machine learning method to perform training based on the same type of samples and the different types of samples and the medical comprehensive description feature information of the user data therein.
  • The predictive classification model determines whether the two user medical record data in a to-be-determined sample belong to the same person. In the specific scenario above, a machine learning method is used to create a logistic regression model that relates the difference information of several items of physical examination record information of the two user medical record data in the positive and negative samples to the classification information of those samples.
  • The difference information of the several items of physical examination record information of the two user medical record data in the positive and negative samples, together with the classification information of the corresponding samples, is input into the created model. The logistic regression model is then trained on the difference information and the classification information of the corresponding samples, that is, on the positive and negative samples, to obtain the weight information for the difference information of each item of physical examination record information in the logistic regression model.
  • The machine learning method comprises a logistic regression method, and the process of obtaining the predictive classification model comprises: creating a logistic regression model, the logistic regression model being a relationship model between the difference information of several items of medical comprehensive description feature information of the two user medical record data in a sample and the classification information of the sample; acquiring the difference information of the several items of medical comprehensive description feature information of the two user medical record data in the same-class samples and the different-class samples, together with the classification information of the corresponding samples; and training the logistic regression model on the difference information and the classification information of the corresponding samples to obtain the weight information for the difference information of each item of medical comprehensive description feature information in the logistic regression model.
  • The same-class samples are recorded as positive samples and the different-class samples as negative samples.
  • For each positive and negative sample, the differences between the level C feature information of its two physical examination records are calculated, giving a set of the following form: {height difference, weight difference, pulse difference, blood glucose difference, systolic blood pressure difference, diastolic blood pressure difference, hemoglobin difference, alanine aminotransferase difference, number of days between physical examinations}.
  • The resulting collection is divided into a training set and a test set at a ratio of 8:2, and the ratio of positive to negative samples in both the training set and the test set is 1:1.
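  • One way to obtain an 8:2 split in which both parts hold positive and negative samples at 1:1 is sketched below. The function name is hypothetical, and downsampling the larger class to reach the 1:1 ratio is an assumption; the text states the ratios but not how they are achieved.

```python
import random


def balanced_split(positives, negatives, train_ratio=0.8, seed=0):
    """Return (train, test) lists of (sample, label) pairs, each balanced 1:1."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))   # downsample the larger class
    pos = rng.sample(positives, n)
    neg = rng.sample(negatives, n)
    cut = int(n * train_ratio)
    train = [(x, 1) for x in pos[:cut]] + [(x, 0) for x in neg[:cut]]
    test = [(x, 1) for x in pos[cut:]] + [(x, 0) for x in neg[cut:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```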
  • The logistic regression model is expressed in terms of the weight coefficients C0, C1, C2, ..., C9 and the classification result Y.
  • When Y is greater than or equal to 0.5, the two physical examination records belong to the same person; when Y is less than 0.5, the two physical examination records belong to different people.
  • the created logistic regression model is trained by the positive and negative samples in the training set to obtain the values of the weight coefficients C 0 , C 1 , C 2 ... C 9 , and the magnitude of the weight coefficient indicates the influence of the corresponding feature information on the classification result. .
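The training step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the source writes Y as a plain linear combination, whereas the sketch passes that combination through the sigmoid function, as standard logistic regression does, so that Y lies between 0 and 1 and the 0.5 threshold is meaningful. The function names, the learning rate, the epoch count and the two-feature toy data are all hypothetical; a real model would use the nine difference features listed above, suitably rescaled.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, diffs):
    """Return Y for one sample; weights[0] plays the role of C0."""
    z = weights[0] + sum(c * x for c, x in zip(weights[1:], diffs))
    return sigmoid(z)

def train(samples, labels, lr=0.5, epochs=2000):
    """Fit the weight coefficients with batch gradient descent on log loss."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for diffs, label in zip(samples, labels):
            error = score(weights, diffs) - label
            grad[0] += error
            for j, x in enumerate(diffs):
                grad[j + 1] += error * x
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights

# Toy data: two rescaled difference features per sample (hypothetical values).
samples = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],   # positive: same person
           [2.0, 1.5], [1.8, 2.2], [2.5, 1.9]]   # negative: different people
labels = [1, 1, 1, 0, 0, 0]
weights = train(samples, labels)
predictions = [1 if score(weights, s) >= 0.5 else 0 for s in samples]
```

In this toy setting small differences indicate the same person, so the fitted coefficients pull Y down as the differences grow.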
  • the process of obtaining the predictive classification model further comprises: testing the logistic regression model with a plurality of the difference information that was not used in training and the classification information of the corresponding samples.
  • continuing the example above, after the weight coefficients are obtained by training the created logistic regression model with the positive and negative samples in the training set, the model is tested with the positive and negative samples in the test set, and its accuracy and AUC (area under the curve) value are calculated.
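The hold-out split and the two evaluation measures mentioned above can be sketched as follows. The helper names are hypothetical, and two choices are assumptions not stated in the source: the 1:1 ratio is reached by randomly downsampling the larger class, and the AUC is computed by comparing every positive score with every negative score (ties count as one half).

```python
import random

def balanced_split(positives, negatives, train_ratio=0.8, seed=0):
    """Downsample to a 1:1 class ratio, then split each class 8:2."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(positives, n)
    neg = rng.sample(negatives, n)
    cut = int(n * train_ratio)
    train = [(x, 1) for x in pos[:cut]] + [(x, 0) for x in neg[:cut]]
    test = [(x, 1) for x in pos[cut:]] + [(x, 0) for x in neg[cut:]]
    return train, test

def accuracy(scores, labels, threshold=0.5):
    """Share of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)

def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

train_set, test_set = balanced_split(list(range(10)), list(range(100, 115)))
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]   # hypothetical model outputs Y
labels = [1, 1, 1, 0, 0, 0]
acc = accuracy(scores, labels)
area = auc(scores, labels)
```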
  • Step S13 includes step S131, step S132, and step S133.
  • in step S131, the difference information of the plurality of medical comprehensive description feature information of the two user medical record data in a to-be-determined sample is acquired; in step S132, the difference information is input into the logistic regression model to obtain the classification information of the sample; in step S133, the to-be-determined sample is reclassified as a same-class sample or a different-class sample based on the classification information of the sample.
  • preferably, in step S131, the difference information of the plurality of medical comprehensive description feature information of the two user medical record data in the to-be-determined sample is acquired.
  • here, the difference information refers to the height difference, weight difference, pulse difference, blood glucose difference, systolic pressure difference, diastolic pressure difference, hemoglobin difference, alanine aminotransferase difference and physical examination interval in days between the two user medical record data in the to-be-determined sample
  • this difference information is calculated for each to-be-determined sample.
  • next, in step S132, the difference information is input into the logistic regression model to obtain the classification information of the sample. Continuing the earlier example, the calculated difference information is input into the logistic regression model with the weight coefficients obtained from training, and the classification information of the sample is obtained, that is, the classification result value Y is calculated.
  • subsequently, in step S133, the to-be-determined sample is reclassified as a same-class sample or a different-class sample based on the classification information of the sample. The sample is judged by the calculated Y value: when Y is greater than or equal to 0.5, the to-be-determined sample is a same-class sample, that is, the two physical examination records in the sample belong to the same person; when Y is less than 0.5, the to-be-determined sample is a different-class sample, that is, the two physical examination records in the sample do not belong to the same person.
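Steps S131 to S133 can be illustrated with the short sketch below. It assumes that each difference is taken as an absolute value and that Y is the sigmoid of the weighted sum; the field names, the sample records and the weight coefficients are invented for illustration and are not values from the source.

```python
import math
from datetime import date

FEATURES = ("height", "weight", "pulse", "blood_glucose",
            "systolic", "diastolic", "hemoglobin", "alt")

def difference_vector(rec_a, rec_b):
    """S131: eight absolute feature differences plus the interval in days."""
    diffs = [abs(rec_a[k] - rec_b[k]) for k in FEATURES]
    diffs.append(abs((rec_a["exam_date"] - rec_b["exam_date"]).days))
    return diffs

def classify_pending(rec_a, rec_b, weights, threshold=0.5):
    """S132 and S133: compute Y, then compare it with the 0.5 threshold."""
    diffs = difference_vector(rec_a, rec_b)
    z = weights[0] + sum(c * x for c, x in zip(weights[1:], diffs))
    y = 1.0 / (1.0 + math.exp(-z))
    return "same" if y >= threshold else "different"

# Hypothetical coefficients C0..C9; real values would come from training.
weights = [3.0, -0.5, -0.2, -0.05, -0.5, -0.05, -0.05, -0.05, -0.02, -0.001]
rec_a = dict(height=170.0, weight=65.0, pulse=72, blood_glucose=5.1, systolic=120,
             diastolic=80, hemoglobin=140, alt=25, exam_date=date(2014, 5, 1))
rec_b = dict(height=170.5, weight=67.0, pulse=75, blood_glucose=5.3, systolic=124,
             diastolic=82, hemoglobin=138, alt=28, exam_date=date(2015, 5, 1))
rec_c = dict(height=158.0, weight=50.0, pulse=90, blood_glucose=6.8, systolic=145,
             diastolic=95, hemoglobin=118, alt=60, exam_date=date(2015, 5, 1))
result_ab = classify_pending(rec_a, rec_b, weights)
result_ac = classify_pending(rec_a, rec_c, weights)
```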
  • preferably, the machine learning method comprises a random forest method.
  • here, the predictive classification model is obtained by training on the same-class samples, the different-class samples and the comprehensive description feature information of the user medical record data in them; N denotes the number of same-class and different-class samples used for training, and M denotes the number of variables.
  • m is given and determines how many variables are used when making a decision at a node, where m is less than M; from the N training cases, N samples are drawn with replacement to form a training set
  • (that is, bootstrap sampling)
  • the resulting tree is used to predict the class of the to-be-determined samples, that is, to determine whether the two records in each to-be-determined sample belong to the same person, and its error is evaluated.
  • for each node, m of the variables are selected at random, and the best split is calculated from those m variables; each tree is grown fully without pruning.
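The two sources of randomness in the description above, bootstrap sampling of the N training cases and the random choice of m out of M variables at a node, can be sketched as follows. This is only a partial illustration with hypothetical helper names: it does not grow the trees themselves, and treating the cases left out of a bootstrap sample (the out-of-bag cases) as the basis for the error estimate is an assumption drawn from common random forest practice, not from the source.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement (one tree's training set)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_features(n_features, m, rng):
    """Pick m distinct variable indices to consider at one node (m < M)."""
    return sorted(rng.sample(range(n_features), m))

def majority_vote(votes):
    """Combine per-tree 0/1 predictions; ties are resolved as 1."""
    return 1 if sum(votes) * 2 >= len(votes) else 0

rng = random.Random(42)
N, M, m = 10, 9, 3            # 9 variables as in the grade C feature set
indices = bootstrap_sample(N, rng)
out_of_bag = sorted(set(range(N)) - set(indices))
features = candidate_features(M, m, rng)
```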
  • the method and device for classifying user data described in the present application acquire a plurality of user data and generate a plurality of samples based on the user data, each of the samples including two user data having the same identification feature information;
  • based on the difference feature information of the two user data in a sample and the comparison information of that difference feature information, the sample is classified as a same-class sample, a different-class sample or a to-be-determined sample; a predictive classification model is then used to reclassify the to-be-determined samples as same-class samples or different-class samples, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples and the comprehensive description feature information of the user data therein
  • the user data is then classified based on the same-class samples and the different-class samples. In this way, user data is associated and identified more accurately according to its feature information, each user data is better assigned to the records of the corresponding user, and the comprehensive description feature information from a user's multiple records is linked together so that it can be used to serve the user.
  • further, the present application is applicable to the medical field, for example, to the associative identification of users' physical examination records
  • in that case the user data is the user's medical record data, such as the user's physical examination records.
  • the present application acquires medical record data of a plurality of users and generates a plurality of samples based on it, each of the samples including two user medical record data having the same name; based on the difference feature information of the two user medical record data in a sample and the comparison information of that difference feature information, the sample is classified as a same-class sample, a different-class sample or a to-be-determined sample; a predictive classification model is then used to reclassify the to-be-determined samples as same-class samples or different-class samples, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples and the medical comprehensive description feature information of the user data therein
  • the user medical record data is then classified based on the same-class samples and the different-class samples, which yields the series of physical examination records corresponding to each user of the medical institution and links each user's physical examination records across consecutive years.
  • further, the predictive classification model is obtained with a machine learning method by training on the same-class samples, the different-class samples and the comprehensive description feature information of the user medical record data therein, so that all physical examination records can be identified with high accuracy. At the same time, better development and use of the massive user medical record data accumulated by physical examination institutions creates great value for individual users, medical institutions and society.
  • the present application can be implemented in software and/or a combination of software and hardware, for example, using an application specific integrated circuit (ASIC), a general purpose computer, or any other similar hardware device.
  • the software program of the present application can be executed by a processor to implement the steps or functions described above.
  • the software programs (including related data structures) of the present application can be stored in a computer readable recording medium such as a RAM memory, a magnetic or optical drive or a floppy disk and the like.
  • some of the steps or functions of the present application may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform various steps or functions.
  • a portion of the present application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide a method and/or technical solution in accordance with the present application.
  • the program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions.
  • an embodiment in accordance with the present application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to operate based on the aforementioned methods and/or technical solutions in accordance with various embodiments of the present application.

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

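A method for classifying user data is provided. A plurality of user data is acquired and a plurality of samples is generated based on the user data, each sample including two user data having the same identification feature information (S11); based on the difference feature information of the two user data in a sample and the comparison information of that difference feature information, the sample is classified as a same-class sample, a different-class sample or a to-be-determined sample (S12); a predictive classification model is used to reclassify the to-be-determined samples as same-class samples or different-class samples, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples and the comprehensive description feature information of the user data therein (S13); and the user data is classified based on the same-class samples and the different-class samples (S14). In this way, user data is associated and identified more accurately according to its feature information, each user data is better assigned to the records of the corresponding user, and the comprehensive description feature information from a user's multiple records is linked together so that it can be used to serve the user.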

Description

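Method and device for classifying user data
This application claims priority to Chinese patent application No. 201510571182.2, filed on September 9, 2015 and entitled "Method and device for classifying user data", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computers, and in particular to a technique for classifying user data.
Background
With the advent of the cloud era, big data has attracted more and more attention. The rapid development of the Internet and of mobile technology has also taken the concept of big data far beyond large volumes of data (terabyte-scale data), the technology for processing them, or the simple notion of the so-called "four Vs" (Volume, Variety, Value and Velocity). Big data is being applied in more and more fields and is increasingly oriented toward individuals, collecting and organizing personal information from every aspect of a user's life in order to provide personal services.
Therefore, not only in industries that readily generate large amounts of user data, such as telecommunications and the Internet, but also in traditional industries such as education, mining and electric power, and especially in the medical field, a widely discussed problem is how, after a large amount of user data has been acquired, to associate and identify the user data according to its feature information and to classify it effectively by the user to which it corresponds, so that it can be used to serve users.
Summary of the invention
The object of the present application is to provide a method and device for classifying user data, so as to solve the problem of classifying user data effectively.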
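According to one aspect of the present application, a method for classifying user data is provided, comprising:
acquiring a plurality of user data, and generating a plurality of samples based on the user data, each of the samples including two user data having the same identification feature information;
classifying each sample as a same-class sample, a different-class sample or a to-be-determined sample based on the difference feature information of the two user data in the sample and the comparison information of that difference feature information;
reclassifying the to-be-determined samples as same-class samples or different-class samples using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples and the comprehensive description feature information of the user data therein;
classifying the user data based on the same-class samples and the different-class samples.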
进一步地,所述用户数据为用户的医疗记录数据,所述区别特征信息包括所述用户的身份特征信息,所述综合描述特征信息包括用户的医疗记录特征信息。
进一步地,所述区别特征信息包括唯一识别区别特征信息和不唯一识别区别特征信息。
其中,所述基于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本包括:
当所述样本中两个用户数据均具有唯一识别区别特征信息时,基于所述唯一识别区别特征信息的比较信息,将所述分为相同类样本或不同类样本;
当所述样本中两个用户数据至少一个不具有唯一识别区别特征信息时,基于所述不唯一识别区别特征信息的比较信息,将所述样本分为不同类样本或待确定类样本;
当所述样本中两个用户数据至少一个不具有区别特征信息时,则将所述样本分为待确定类样本;
进一步地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的。
进一步地,所述机器学习法包括逻辑回归法,获得所述预测分类模型的过程包括:
创建逻辑回归模型,所述逻辑回归模型为关于所述样本中两个用户数据的若干综合描述特征信息的差别信息与所述样本的分类信息的关系模型;
获取所述相同类样本和所述不同类样本中的两个用户数据的若干综合描述特征信息的差别信息及相应样本的分类信息;
基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,以获得逻辑回归模型中各综合描述特征信息的差别信息的权重信息。
进一步地,获得所述预测分类模型的过程还包括:
利用未进行过训练的若干所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行测试。
其中,所述利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本包括:
获取所述待确定类样本中两个用户数据的若干综合描述特征信息的差别信息;
将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息;
基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。
进一步地,所述机器学习法包括随机森林法。
根据本申请的另一方面,还提供了一种用户数据的分类设备,包括:
获取装置,用于获取若干用户数据,并基于所述用户数据生成若干样本,每一所述样本包括具有相同标识特征信息的两个用户数据;
比较装置,用于基于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;
训练装置,用于利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的;
分类装置,用于基于所述相同类样本和不同类样本,对所述用户数据进行分类。
进一步地,所述用户数据为用户的医疗记录数据,所述区别特征信息包括所述用户的身份特征信息,所述综合描述特征信息包括用户的医疗记录特征信息。
进一步地,所述区别特征信息包括唯一识别区别特征信息和不唯一识别区别特征信息。
其中,所述比较装置用于:
当所述样本中两个用户数据均具有唯一识别区别特征信息时,基于所述唯一识别区别特征信息的比较信息,将所述分为相同类样本或不同类样本;
当所述样本中两个用户数据至少一个不具有唯一识别区别特征信息时,基于所述不唯一识别区别特征信息的比较信息,将所述样本分为不同类样本或待确定类样本;
当所述样本中两个用户数据至少一个不具有区别特征信息时,则将所述样本分为待确定类样本。
进一步地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的。
进一步地,所述机器学习法包括逻辑回归法,获得所述预测分类模型的过程包括:
创建逻辑回归模型,所述逻辑回归模型为关于所述样本中两个用户数据的若干综合描述特征信息的差别信息与所述样本的分类信息的关系模型;
获取所述相同类样本和所述不同类样本中的两个用户数据的若干综合描述特征信息的差别信息及相应样本的分类信息;
基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,以获得逻辑回归模型中各综合描述特征信息的差别信息的权重信息。
进一步地,获得所述预测分类模型的过程还包括:
利用未进行过训练的若干所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行测试。
其中,所述训练装置包括:
获取单元,用于获取所述待确定类样本中两个用户数据的若干综合描述特征信息的差别信息;
输入单元,用于将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息;
样本单元,用于基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。
进一步地,所述机器学习法包括随机森林法。
与现有技术相比,本申请通过获取若干用户数据,并基于所述用户数据生成若干样本,每一所述样本包括具有相同标识特征信息的两个用户数据;基于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;接着,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的;基于所述相同类样本和不同类样本,对所述用户数据进行分类。从而更准确地根据用户数据的特征信息对用户数据进行关联识别,更好的对用户数据分类至相应用户的记录,打通用户多次的综合描述特征信息,以供为用户服务。
进一步地,本申请可应用于医疗领域,例如应用于对用户的体检记录的关联识别等,所述用户数据为用户的医疗记录数据,例如用户的体检记录等。本申请通过获取若干用户医疗记录数据,并基于所述用户医疗记录数据生成若干样本,每一所述样本包括具有相同姓名的两个用户医疗记录数据;基于所述样本中两个用户医疗记录数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;接着,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的医疗综合描述特征信息进行训练所获得的;基于所述相同类样本和不同类样本,对所述用户医疗记录数据进行分类。从而得到医疗机构中每个用户对应的一系列体检记录,打通用户连续多年的体检记录。进一步地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户医疗记录数据的综合描述特征信息进行训 练所获得的,能够识别到全部的体检记录,且准确率高,同时,更好地开发和利用体检机构积累的海量用户数据,对用户个人、医疗机构和社会产生巨大的价值。
附图说明
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本申请的其它特征、目的和优点将会变得更明显:
图1示出根据本申请一个方面的一种用户数据的分类设备的结构示意图;
图2示出根据本申请一个方面的一个优选实施例的具体场景的流程示意图;
图3示出根据本申请一个方面的一个优选实施例的训练装置13的结构示意图;
图4示出根据本申请又一个方面的一种用户数据的分类方法流程示意图;
图5示出根据本申请又一个方面的一个优选实施例的步骤S13的方法流程示意图。
附图中相同或相似的附图标记代表相同或相似的部件。
具体实施方式
下面结合附图对本申请作进一步详细描述。
图1示出根据本申请一个方面的一种用户数据的分类设备的结构示意图,该设备1包括获取装置11、比较装置12、训练装置13和分类装置14。
其中,获取装置11,获取若干用户数据,并基于所述用户数据生成若干样本,每一所述样本包括具有相同标识特征信息的两个用户数据;比较装置12,基于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;训练装置13,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的;分类装置14,基于所述相同类样本和不同类样本,对所述用户数据进行分类。
在此,所述设备1包括但不限于用户设备、或用户设备与网络设备通过网络相集成所构成的设备。所述用户设备其包括但不限于任何一种可与用户通过触摸板进行人机交互的移动电子产品,例如智能手机、PDA等,所述移动电子产品可以采用任意操作系统,如android操作系统、iOS操作系统等。其中,所述网络设备包括一种能够按照事先设定或存储的指令,自动进行数值计算和信息处理的电子设备,其硬件包括但不限于微处理器、专用集成电路(ASIC)、可编程门阵列(FPGA)、数字处理器(DSP)、嵌入式设 备等。所述网络包括但不限于互联网、广域网、城域网、局域网、VPN网络、无线自组织网络(Ad Hoc网络)等。优选地,转出用户设备1还可以是运行于所述用户设备、或用户设备与网络设备、触摸终端或网络设备与触摸终端通过网络相集成所构成的设备上的脚本程序。当然,本领域技术人员应能理解上述设备1仅为举例,其他现有的或今后可能出现的设备1如可适用于本申请,也应包含在本申请保护范围以内,并在此以引用方式包含于此。
上述各装置之间是持续不断工作的,在此,本领域技术人员应理解“持续”是指上述各装置分别实时地或者按照设定的或实时调整的工作模式要求,例如所述获取装置11持续获取若干用户数据,并基于所述用户数据生成若干样本;所述比较装置12持续基于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;训练装置13持续利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本;分类装置14持续基于所述相同类样本和不同类样本,对所述用户数据进行分类,直至所述设备1完成解锁工作或停止工作。
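Here, the identification feature information is feature information that can divide the user data into a plurality of different subsets, for example a person's name or a merchant's brand name; the difference feature information is feature information that can be used to determine whether user data belong to the same user, for example a person's identity card number, a person's mobile phone number or a merchant's organization code; the comprehensive description feature information is feature information that cannot directly associate and identify user data but that, taken together, can build a "portrait" of the user to whom the user data corresponds and so indirectly determine whether user data belong to the same user, for example a person's physiological information (including height, weight, blood pressure and so on) or a merchant's business field or advertising.
Specifically, the difference feature information includes uniquely identifying difference feature information and non-uniquely identifying difference feature information. Here, uniquely identifying difference feature information is feature information that can directly determine whether the two user data in a sample are of the same class or of different classes, and hence whether the sample is a same-class sample or a different-class sample, for example a person's identity card number or a merchant's organization code. Non-uniquely identifying difference feature information is feature information for which, when the corresponding values of the two user data in a sample differ, the two user data can be directly determined to be of different classes and the sample to be a different-class sample, but for which, when the corresponding values are the same, it cannot be directly determined whether the two user data are of the same class, so the sample cannot be confirmed as a same-class sample; examples are a person's gender, ethnicity or nationality, or a merchant's business field.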
具体地,比较装置12用于:当所述样本中两个用户数据均具有唯一识别区别特征信息时,基于所述唯一识别区别特征信息的比较信息,将所述分为相同类样本或不同类样本;当所述样本中两个用户数据至少一个不具有唯一识别区别特征信息时,基于所述不 唯一识别区别特征信息的比较信息,将所述样本分为不同类样本或待确定类样本;当所述样本中两个用户数据至少一个不具有区别特征信息时,则将所述样本分为待确定类样本。
具体地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的。
具体地,所述机器学习法包括逻辑回归法,获得所述预测分类模型的过程包括:创建逻辑回归模型,所述逻辑回归模型为关于所述样本中两个用户数据的若干综合描述特征信息的差别信息与所述样本的分类信息的关系模型;获取所述相同类样本和所述不同类样本中的两个用户数据的若干综合描述特征信息的差别信息及相应样本的分类信息;基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,以获得逻辑回归模型中各综合描述特征信息的差别信息的权重信息。
具体地,获得所述预测分类模型的过程还包括:利用未进行过训练的若干所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行测试。
具体地,所述训练装置13包括:获取单元131获取所述待确定类样本中两个用户数据的若干综合描述特征信息的差别信息;输入单元132将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息;样本单元133基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。
具体地,所述机器学习法包括随机森林法。
本申请所述设备1用于用户数据的分类,首先通过所述标识特征信息为能够将用户数据分为多个不同子集,并在子集中两两组成样本,对每一样本进行比较,比较过程中利用所述区别特征信息对样本进行比较,获取相同类样本、不同类样本及由于所述区别特征信息缺失或由区别特征信息不能直接确定的待确定样本,利用相同类样本和不同类样本中的用户数据的综合描述特征信息训练获得的训练模型,即用户数据的“画像”,再利用训练模型对待确定样本进行进一步比较,再根据相同类样本和不同类样本对用户数据进行子集下的分类,从而能够更准确地根据用户数据的特征信息对用户数据进行关联识别和分类,进而为用户提供个性化服务打下基础。
优选的,本申请所述设备1优选地应用于医疗领域,所述用户数据优选为用户医疗记录数据,例如:历次体检记录数据、历次医院检查记录数据等,用户医疗记录数据通常包括用户的姓名信息和与用户的生理状况、身体状况相关的医疗综合描述特征信息,还可能包括不同的用户区别特征信息,例如身份证号码、性别、出生日期、血型、民族 等。
目前,医疗机构采用姓名+身份证(或者手机号)来识别用户连续多年的体检数据,由于常见的重名情况,同一用户的姓名可能实际对应不同的用户,而用户的身份证号码、手机号等区别特征信息也常常出现缺失,医疗机构只能根据用户的当前体检结果来诊断,而且本次体检过后不会再提供后续服务,导致医疗机构未能打通每个用户多年的体检记录,不能观察到每个用户连续多年身体各项指标的变化情况,进而无法为用户提供更好地个性化服务。
因此,可利用本申请所述设备1对用户医疗记录数据中可能具有的用户区别特征信息和医疗综合描述特征信息对具有相同的用户的姓名的用户数据进行分类,从而得到医疗机构中每个用户对应的一系列医疗服务,包括体检记录,打通用户连续多年的医疗记录,提高用户数据关联识别的准确率,同时,更好地开发和利用医疗机构积累的海量用户数据,对用户个人、医疗机构和社会产生巨大的价值。
我们将在实施例中以对用户的体检记录进行关联识别的分类为例,对本申请所述设备1应用于医疗领域进行具体说明。本领域技术人员应能理解,除体检记录外,本申请还可以应用于医疗领域其他用户数据的分类,例如用户的医疗诊断记录、大病记录、健康随访记录等。其中,所述区别特征信息优选地为用户的身份特征信息,例如民族、性别、年龄等,所述综合描述特征信息优选地为用户的医疗记录特征信息,例如体检记录中的生理数据、过往病史记录等。
具体地,获取装置11,获取若干用户医疗记录数据,并基于所述用户医疗记录数据生成若干样本,每一所述样本包括具有相同用户姓名信息的两个用户医疗记录数据。
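Here, in a physical examination institution, the user medical record data includes physical examination record data. All physical examination records are divided into a plurality of subsets according to the name of the examined user; each subset contains one or more physical examination records and may be a collection of the records of several people. If a subset contains only one physical examination record, that is, the person shares a name with nobody else and has been examined only once, the record belongs uniquely to that person; if a subset contains two or more physical examination records, every two records in the subset are taken as one sample, and a plurality of samples is thereby obtained.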
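The sample generation described above (group the records by name, then take every pair within a group) can be sketched in a few lines. The record layout and the function name are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def build_samples(records):
    """Group records by name and return every pair within each group."""
    by_name = defaultdict(list)
    for record in records:
        by_name[record["name"]].append(record)
    samples = []
    for group in by_name.values():
        # A group holding a single record yields no pairs.
        samples.extend(combinations(group, 2))
    return samples

records = [
    {"id": 1, "name": "Zhang Wei"},
    {"id": 2, "name": "Zhang Wei"},
    {"id": 3, "name": "Zhang Wei"},
    {"id": 4, "name": "Li Na"},
]
samples = build_samples(records)
pair_ids = [(a["id"], b["id"]) for a, b in samples]
```

A subset with k records produces k(k-1)/2 samples, so very common names dominate the number of pairs to be compared.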
具体地,比较装置12,基于所述样本中两个用户医疗记录数据的身份特征信息及所述身份特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本。
在此,所述用户的身份特征信息可以是关于用户身份的信息,例如包括但不限于姓名、性别、身份证、手机号、婚姻状况、国籍、在职情况、总工龄等,从用户的身份信息中特征选取身份特征信息,该信息包括身份证号、社保卡号、性别、出生日期、血型 和民族,当然,特征选取的身份特征信息不限于上述所列举的内容。
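The samples are classified by a set method according to the identity feature information of the two user medical record data in the same sample and the comparison information of that identity feature information. For example, when both user medical record data in a sample have the uniquely identifying identity feature information, namely an identity card number, the two identity card numbers are compared: if they are the same, the sample containing the two compared user medical record data is classified as a same-class sample, and if they differ, it is classified as a different-class sample. When at least one of the two user medical record data in the sample lacks an identity card number, the non-uniquely identifying identity feature information of the two users is compared, namely gender, date of birth, blood type and ethnicity: if any one item differs, the sample is classified as a different-class sample, and if all of the compared items are the same, the sample is classified as a to-be-determined sample. When at least one of the two user medical record data in the sample has no identity feature information, that is, when the identity card number, date of birth, gender, blood type, ethnicity and other information representing the user's identity are missing for at least one user, the sample is classified as a to-be-determined sample.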
本领域技术人员应能理解,上述样本的分类方法仅为举例,其他现有的或今后可能出现的样本的分类方法如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
具体地,训练装置13,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户医疗记录数据的记录特征信息进行训练所获得的。
在一具体场景中,所述记录特征信息包括医疗记录特征信息,例如,用户的体检记录信息,包括但不限于身高、体重、脉搏、血糖、收缩压、舒张压、血红蛋白、谷丙转氨酶、体检间隔天数等。计算每个相同类样本和不同类样本中两条体检记录中记录特征信息的相差值,采用机器学习中的逻辑回归或随机森林方法建立用户关联和识别模型;对所建立的模型利用相同类样本和不同类样本进行训练,以获得模型中各记录特征信息的差别信息的权重信息。采用训练好的模型对待确定类样本进行预测分类,即判定待确定类样本中每个样本是否属于同一个人。
本领域技术人员应能理解,上述预测分类仅为举例,其他现有的或今后可能出现的预测分类如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
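Specifically, the classification device 14 classifies the user data based on the same-class samples and the different-class samples. Here, once every sample has been determined to be a same-class sample or a different-class sample, a series of relationship pairs is obtained, each consisting of two physical examination records that belong to the same person. ODPS or Hadoop can be used to aggregate the results and assign each user data (that is, each physical examination record) to one user, yielding the series of physical examination records that corresponds to each user.
Here, ODPS (Open Data Processing Service) is a service platform for processing and analyzing massive data. It provides distributed processing for TB/PB-scale data with modest real-time requirements and is applied in fields such as data analysis, large-scale statistics, machine learning and data mining. Hadoop is a software framework capable of distributed processing of large amounts of data; users can develop and run applications that process massive data on Hadoop. Those skilled in the art should understand that the classification of user data is not limited to ODPS or Hadoop; other approaches that may appear in the future and are applicable to the classification of user data may also fall within the scope of the idea of the present application.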
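The aggregation of same-person relationship pairs into per-user record groups can be illustrated on a single machine with a union-find (disjoint-set) structure. This is a swapped-in technique chosen for a small, self-contained example; the source names ODPS or Hadoop for this step and does not describe the grouping algorithm. The function name and record identifiers are hypothetical.

```python
def group_records(record_ids, same_pairs):
    """Merge records linked by same-class pairs into per-user groups."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in same_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for rid in record_ids:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(groups.values())

users = group_records([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
```

Note that merging is transitive: records 1 and 3 end up with the same user even though only the pairs (1, 2) and (2, 3) were given. A production system would need a policy for groups that also contain a pair classified as different-class, which this sketch does not address.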
优选地,所述身份特征信息包括唯一识别身份特征信息和不唯一识别身份特征信息。在此,唯一识别身份证信息是指代表每个用户身份具有唯一性的信息,能够判定两个用户医疗记录数据是否属于同一个人,同时具有肯定和否定的作用,例如身份证号、社保卡号,一旦有后不可更改,两个用户医疗记录数据对应的身份证号相同,则说明这两个用户医疗记录数据属于同一个人,反之,则说明两个用户医疗记录数据不属于同一个人;不唯一识别身份特征信息是指能够体现用户的身份特征信息,包含用户不变的信息,但具有不唯一性,只能判定两个用户医疗记录数据不是属于同一个人,即只具有否定作用,例如性别、出生日期、血型和民族,两个用户医疗记录数据对应的性别不同,则这两个用户医疗记录数据肯定不属于同一个人,反之,不能说明两个用户医疗记录数据属于同一个人。
更优选地,所述比较装置12用于:当所述样本中两个用户医疗记录数据均具有唯一识别身份特征信息时,基于所述唯一识别身份特征信息的比较信息,将所述分为相同类样本或不同类样本;当所述样本中两个用户医疗记录数据至少一个不具有唯一识别身份特征信息时,基于所述不唯一识别身份特征信息的比较信息,将所述样本分为不同类样本或待确定类样本;当所述样本中两个用户医疗记录数据至少一个不具有身份特征信息时,则将所述样本分为待确定类样本。
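In a specific scenario, for example, the user's physical examination data is selected and graded into grade A, grade B and grade C. Grade A is the uniquely identifying identity feature information, comprising the identity card number and the social security card number; grade B is the non-uniquely identifying identity feature information, comprising gender, date of birth, blood type and ethnicity; grade C is the record feature information, comprising height, weight, pulse, blood glucose, systolic pressure, diastolic pressure, hemoglobin, alanine aminotransferase and physical examination interval in days. Figure 2 shows a flow diagram of this specific scenario. Because the identity card number and social security card number are missing for most users, the first step is to determine whether the grade A information is present in both physical examination records. If both records contain an identity card number or a social security card number, it is further determined whether those numbers are the same: if the identity card numbers or the social security card numbers are the same, the two records belong to the same person; if both the identity card numbers and the social security card numbers differ, the two records are determined to belong to different people. If, for at least one of the two records, both the identity card number and the social security card number are missing, the determination continues with grade B. If the two records contain grade B information and any one of gender, date of birth, blood type and ethnicity differs, the two records can be directly determined to belong to different people. If the two records contain no grade B information, or all of the grade B information is the same after comparison, it cannot be determined whether the two records belong to the same person, and grade C must be used for a further determination. After the grade A and grade B determinations, if the two records belong to the same person, they are classified as a same-class sample, which may be recorded as a positive sample; if they belong to different people, they are classified as a different-class sample, which may be recorded as a negative sample; if the grade A and grade B comparisons cannot decide, the sample containing the two records is classified as a to-be-determined sample; and if at least one of the two records contains none of the grade A and grade B information, that is, the grade A and grade B information is missing, the sample containing them is classified as a to-be-determined sample.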
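The grade A and grade B decision flow described above can be sketched as one function. The field names are hypothetical. One case is not spelled out in the source and is handled here by assumption: when both records carry grade A information but in different fields (one has only an identity card number, the other only a social security card number), nothing can be compared, so the sketch falls through to grade B.

```python
LEVEL_A = ("id_number", "social_security_number")
LEVEL_B = ("gender", "birth_date", "blood_type", "ethnicity")

def triage(rec_a, rec_b):
    """Return 'same', 'different' or 'pending' for one pair of records."""
    # Grade A: compare only the identifiers present in both records.
    shared = [k for k in LEVEL_A if rec_a.get(k) and rec_b.get(k)]
    if shared:
        if any(rec_a[k] == rec_b[k] for k in shared):
            return "same"
        return "different"
    # Grade B: a mismatch rules the pair out; a match proves nothing.
    for k in LEVEL_B:
        value_a, value_b = rec_a.get(k), rec_b.get(k)
        if value_a and value_b and value_a != value_b:
            return "different"
    return "pending"

same = triage({"id_number": "110"}, {"id_number": "110"})
different_by_a = triage({"id_number": "110"}, {"id_number": "220"})
different_by_b = triage({"gender": "F"}, {"id_number": "220", "gender": "M"})
pending = triage({"gender": "F", "blood_type": "A"}, {"gender": "F"})
```

Pairs returned as "pending" are the to-be-determined samples that are later passed to the grade C model.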
本领域技术人员应能理解,上述分类样本的方法仅为举例,其他现有的或今后可能出现的分类样本的方法如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
优选地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户数据的医疗记录特征信息进行训练所获得的。
在此,预测分类即判定待确定类样本中两个用户医疗记录数据是否属于同一个人,接上述具体场景,通过利用机器学习法创建关于正负样本中两个用户医疗记录数据的若干体检记录信息的差别信息与正负样本的分类信息的逻辑回归模型,将正负样本中的两个用户医疗记录数据的若干体检记录信息的差别信息及相应样本的分类信息输入所创建的模型中;接着,基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,即训练正负样本以获得逻辑回归模型中各体检记录信息的差别信息的权重信息。
更优选地,所述机器学习法包括逻辑回归法,获得所述预测分类模型的过程包括:创建逻辑回归模型,所述逻辑回归模型为关于所述样本中两个用户医疗记录数据的若干医疗记录特征信息的差别信息与所述样本的分类信息的关系模型;获取所述相同类样本和所述不同类样本中的两个用户医疗记录数据的若干医疗记录特征信息的差别信息及相应样本的分类信息;基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型 进行训练,以获得逻辑回归模型中各医疗记录特征信息的差别信息的权重信息。
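Continuing the earlier example, same-class samples are recorded as positive samples and different-class samples as negative samples. For every positive and negative sample, the difference between the grade C feature information of its two physical examination records is calculated, giving a set of the following form: {height difference, weight difference, pulse difference, blood glucose difference, systolic pressure difference, diastolic pressure difference, hemoglobin difference, alanine aminotransferase difference, physical examination interval in days}. The resulting set is then divided into a training set and a test set in a ratio of 8:2, with a 1:1 ratio of positive to negative samples in both. The logistic regression model created with the logistic regression method has the following form:
Y = C0 + C1 × height difference + C2 × weight difference + C3 × pulse difference + C4 × blood glucose difference + C5 × systolic pressure difference + C6 × diastolic pressure difference + C7 × hemoglobin difference + C8 × alanine aminotransferase difference + C9 × physical examination interval in days
In the formula, C0, C1, C2 ... C9 represent the weight coefficients and Y represents the classification result: when Y is greater than or equal to 0.5, the two physical examination records belong to the same person; when Y is less than 0.5, the two physical examination records belong to different people.
Next, the created logistic regression model is trained with the positive and negative samples in the training set to obtain the values of the weight coefficients C0, C1, C2 ... C9; the magnitude of each weight coefficient indicates how strongly the corresponding feature information influences the classification result.
More preferably, the process of obtaining the predictive classification model further comprises: testing the logistic regression model with a plurality of the difference information that was not used in training and the classification information of the corresponding samples. Continuing the example above, after the weight coefficients are obtained by training the created logistic regression model with the positive and negative samples in the training set, the model is tested with the positive and negative samples in the test set, and its accuracy and AUC (area under the curve) value are calculated.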
本领域技术人员应能理解,上述预测分类模型的方法仅为举例,其他现有的或今后可能出现的预测分类模型的方法如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
图3示出根据本申请一个方面的一个优选实施例的训练装置13的结构示意图。该装置包括获取单元131、输入单元132和样本单元133。
其中,获取单元131,获取所述待确定类样本中两个用户医疗记录数据的若干医疗记录特征信息的差别信息;输入单元132,将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息;样本单元133,基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。
优选地,获取单元131,获取所述待确定类样本中两个用户医疗记录数据的若干医疗记录特征信息的差别信息。在此,差别信息是指待确定样本中两个用户医疗记录数据的相应的身高差、体重差、脉搏差、血糖差、收缩压差、舒张压差、血红蛋白差、谷丙转氨酶差和体检间隔天数的信息,计算所述的差别信息。
本领域技术人员应能理解,上述差别信息仅为举例,其他现有的或今后可能出现的差别信息如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
接着,输入单元132,将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息。再继续接前例,基于得到逻辑回归模型的权重系数,将计算得到的差别信息输入所得逻辑回归模型,获取所述样本的分类信息即计算出分类结果值Y,得到Y的信息。
随后,样本单元133,基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。根据计算出的Y值,判定所述待确定类样本,当Y大于等于0.5时,表示待确定类样本为相同类样本,即样本中的两条体检记录属于同一个人;当Y小于0.5时,表示待确定类样本为不同类样本,即样本中的两条体检记录不属于同一个人。
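Preferably, the machine learning method includes a random forest method. Here, the predictive classification model is obtained by training on the same-class samples, the different-class samples and the record feature information of the user medical record data in them; N denotes the number of same-class and different-class samples used for training, and M denotes the number of variables. m is given and determines how many variables are used when making a decision at a node, where m is less than M. From the N training cases, N samples are drawn with replacement to form a training set (that is, bootstrap sampling); the resulting tree is used to predict the class of the to-be-determined samples, that is, to determine whether the two records in each to-be-determined sample belong to the same person, and its error is evaluated. For each node, m of the variables are selected at random and the best split is calculated from those m variables; each tree is grown fully without pruning.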
图4示出根据本申请又一个方面的一种用户数据的分类方法流程示意图,该方法包括步骤S11、步骤S12、步骤S13和步骤S14。
其中,在步骤S11中,获取若干用户数据,并基于所述用户数据生成若干样本,每一所述样本包括具有相同标识特征信息的两个用户数据;在步骤S12中,基于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;在步骤S13中,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的;在步骤S14中,基于所述相同类样本和不同类样本,对所述用户数据进行分类。
在此,所述标识特征信息为能够将用户数据分为多个不同子集的特征信息,例如:人的姓名、商家的品牌名等;所述区别特征信息为能够用于判定是否属于相同的用户的特征信息,例如人的身份证号码、人的手机号、商家的组织机构代码等;所述综合描述 特征信息为无法直接关联识别用户数据,但通过所述综合描述特征信息的综合描述,也能够建立用户数据对应的用户的“画像”,间接地判定是否属于相同的用户的特征信息,例如人的生理信息(包括身高、体重、血压等)、商家的经营领域或宣传等。
具体地,所述区别特征信息包括唯一识别区别特征信息和不唯一识别区别特征信息。在此,所述唯一识别区别特征信息为:能够直接判定样本中两个用户数据是相同类或不同类,进而确定样本为相同类样本或不同类样本的特征信息,例如人的身份证号码、商家的组织机构代码等;所述不唯一识别区别特征信息为:当样本中两个用户数据的对应的不唯一识别区别特征信息不同时,则能够直接确定两个用户数据是不同类,进而确定样本为不同类样本,但当对应的不唯一识别区别特征信息相同时,则不能够直接确定两个用户数据是相同类或不同类,进而确定样本为相同类样本的特征信息,例如人的性别、民族、国际、商家的经营领域等。
具体地,在步骤S12中:当所述样本中两个用户数据均具有唯一识别区别特征信息时,基于所述唯一识别区别特征信息的比较信息,将所述分为相同类样本或不同类样本;当所述样本中两个用户数据至少一个不具有唯一识别区别特征信息时,基于所述不唯一识别区别特征信息的比较信息,将所述样本分为不同类样本或待确定类样本;当所述样本中两个用户数据至少一个不具有区别特征信息时,则将所述样本分为待确定类样本。
具体地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的。
具体地,所述机器学习法包括逻辑回归法,获得所述预测分类模型的过程包括:创建逻辑回归模型,所述逻辑回归模型为关于所述样本中两个用户数据的若干综合描述特征信息的差别信息与所述样本的分类信息的关系模型;获取所述相同类样本和所述不同类样本中的两个用户数据的若干综合描述特征信息的差别信息及相应样本的分类信息;基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,以获得逻辑回归模型中各综合描述特征信息的差别信息的权重信息。
具体地,获得所述预测分类模型的过程还包括:利用未进行过训练的若干所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行测试。
具体地,步骤S13包括:步骤S131,获取所述待确定类样本中两个用户数据的若干综合描述特征信息的差别信息;步骤S132,将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息;步骤S133,基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。
具体地,所述机器学习法包括随机森林法。
本申请所述方法用于用户数据的分类,首先通过所述标识特征信息为能够将用户数据分为多个不同子集,并在子集中两两组成样本,对每一样本进行比较,比较过程中利用所述区别特征信息对样本进行比较,获取相同类样本、不同类样本及由于所述区别特征信息缺失或由区别特征信息不能直接确定的待确定样本,利用相同类样本和不同类样本中的用户数据的综合描述特征信息训练获得的训练模型,即用户数据的“画像”,再利用训练模型对待确定样本进行进一步比较,再根据相同类样本和不同类样本对用户数据进行子集下的分类,从而能够更准确地根据用户数据的特征信息对用户数据进行关联识别和分类,进而为用户提供个性化服务打下基础。
目前,医疗机构采用姓名+身份证(或者手机号)来识别用户连续多年的体检数据,由于常见的重名情况,同一用户的姓名可能实际对应不同的用户,而用户的身份证号码、手机号等区别特征信息也常常出现缺失,医疗机构只能根据用户的当前体检结果来诊断,而且本次体检过后不会再提供后续服务,导致医疗机构未能打通每个用户多年的体检记录,不能观察到每个用户连续多年身体各项指标的变化情况,进而无法为用户提供更好地个性化服务。
因此,可利用本申请所述方法对用户医疗记录数据中可能具有的用户区别特征信息和医疗综合描述特征信息对具有相同的用户的姓名的用户数据进行分类,从而得到医疗机构中每个用户对应的一系列医疗服务,包括体检记录,打通用户连续多年的医疗记录,提高用户数据关联识别的准确率,同时,更好地开发和利用医疗机构积累的海量用户数据,对用户个人、医疗机构和社会产生巨大的价值。
我们将在实施例中以对用户的体检记录进行关联识别的分类为例,对本申请所述方法应用于医疗领域进行具体说明。本领域技术人员应能理解,除体检记录外,本申请还可以应用于医疗领域其他用户数据的分类,例如用户的医疗诊断记录、大病记录、健康随访记录等。其中,所述区别特征信息优选地为用户的身份特征信息,例如民族、性别、年龄等,所述综合描述特征信息优选地为用户的医疗记录特征信息,例如体检记录中的生理数据、过往病史记录等。
具体地,在步骤S11中,获取若干用户医疗记录数据,并基于所述用户医疗记录数据生成若干样本,每一所述样本包括具有相同用户姓名信息的两个用户医疗记录数据。
在此,在体检机构中,用户医疗记录数据包括体检记录数据,按照体检用户的姓名将所有的体检记录划分为多个子集,每个子集包含一条或者多条体检记录,每个子集中 可能是多个人的体检记录集合。如果一个子集中只有一条体检记录,即此人未与他人重名且只体检过一次,则此体检记录唯一的属于此人;如果一个子集中不少于两条体检记录时,将每个子集中的任意两条体检记录作为一个样本,即获取若干样本。
具体地,在步骤S12中,基于所述样本中两个用户医疗记录数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本。
在此,所述用户的身份特征信息可以是关于用户身份的信息,例如包括但不限于姓名、性别、身份证、手机号、婚姻状况、国籍、在职情况、总工龄等,从用户的身份信息中特征选取区别特征信息,该信息包括身份证号、社保卡号、性别、出生日期、血型和民族,当然,特征选取的区别特征信息不限于上述所列举的内容。
根据所在同一个样本中的两个用户医疗记录数据的区别特征信息及所述区别特征信息的比较信息,将样本按照设定的方法进行分类,例如,当所述样本中两个用户医疗记录数据均具有唯一识别区别特征信息身份证号时,比较两个身份证号,若相同,则将比较的两个用户医疗记录数据所在的样本分为相同类样本,若不同,则将比较的两个用户医疗记录数据所在的样本分为不同类样本;当所述样本中两个用户医疗记录数据至少一个缺失身份证号时,进行比较两个用户的不唯一识别区别特征信息,进行比较性别、出生日期、血型和民族,若其中有一项不同则将所述样本分为不同类样本,若比较的不唯一识别区别特征信息都相同,则将所述样本分为待确定类样本;当所述样本中两个用户医疗记录数据至少一个不具有区别特征信息时,即至少有一个用户的身份证号、出生日期、性别、血型和民族等这些代表用户身份的信息缺失时,则将所述样本分为待确定类样本。
本领域技术人员应能理解,上述样本的分类方法仅为举例,其他现有的或今后可能出现的样本的分类方法如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
具体地,在步骤S13中,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户医疗记录数据的综合描述特征信息进行训练所获得的。
在一具体场景中,综合描述特征信息包括医疗综合描述特征信息,例如,用户的体检记录信息,包括身高、体重、脉搏、血糖、收缩压、舒张压、血红蛋白、谷丙转氨酶、体检间隔天数。计算每个相同类样本和不同类样本中两条体检记录中综合描述特征信息 的相差值,采用机器学习中的逻辑回归或随机森林方法建立用户关联和识别模型;对所建立的模型利用相同类样本和不同类样本进行训练,以获得模型中各综合描述特征信息的差别信息的权重信息。采用训练好的模型对待确定类样本进行预测分类,即判定待确定类样本中每个样本是否属于同一个人。
本领域技术人员应能理解,上述预测分类仅为举例,其他现有的或今后可能出现的预测分类如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
具体地,在步骤S14中,基于所述相同类样本和不同类样本,对所述用户数据进行分类。在此,确定每个样本为相同类样本或不同样本后,得到一系列两条体检记录属于同一个人的关系对,可以采用ODPS或Hadoop对结果进行汇总,将每一个用户数据(即体检记录)分类到一个用户中,进而得到每个用户对应的一系列体检记录。
在此,所述ODPS(Open Data Processing Service)是指开放数据处理服务,海量数据处理和分析的服务平台,提供针对TB/PB级数据、实时性要求不高的分布式处理能力,应用于数据分析、海量数据统计、机器学习、数据挖掘等领域。所述Hadoop是一个能够对大量数据进行分布式处理的软件框架,用户可以在Hadoop上开发和运行处理海量数据的应用程序。本领域技术人员应能理解,并不限于采用ODPS或Hadoop对用户数据的分类方法,今后可能出现的能够使用与用户数据的分类方法的,也可以包括在本申请的思想范围之内。
优选地,所述区别特征信息包括唯一识别区别特征信息和不唯一识别区别特征信息。在此,唯一识别身份证信息是指代表每个用户身份具有唯一性的信息,能够判定两个用户医疗记录数据是否属于同一个人,同时具有肯定和否定的作用,例如身份证号、社保卡号,一旦有后不可更改,两个用户医疗记录数据对应的身份证号相同,则说明这两个用户医疗记录数据属于同一个人,反之,则说明两个用户医疗记录数据不属于同一个人;不唯一识别区别特征信息是指能够体现用户的区别特征信息,包含用户不变的信息,但具有不唯一性,只能判定两个用户医疗记录数据不是属于同一个人,即只具有否定作用,例如性别、出生日期、血型和民族,两个用户医疗记录数据对应的性别不同,则这两个用户医疗记录数据肯定不属于同一个人,反之,不能说明两个用户医疗记录数据属于同一个人。
更优选地,在步骤S12中,当所述样本中两个用户医疗记录数据均具有唯一识别区别特征信息时,基于所述唯一识别区别特征信息的比较信息,将所述分为相同类样本或 不同类样本;当所述样本中两个用户医疗记录数据至少一个不具有唯一识别区别特征信息时,基于所述不唯一识别区别特征信息的比较信息,将所述样本分为不同类样本或待确定类样本;当所述样本中两个用户医疗记录数据至少一个不具有区别特征信息时,则将所述样本分为待确定类样本。
在一具体场景中,例如,将用户的体检数据进行选取和分级,分为等级A,等级B和等级C,其中,等级A为唯一识别区别特征信息,包括:身份证号、社保卡号;等级B为不唯一识别区别特征信息,包括:性别、出生日期、血型和民族;等级C为综合描述特征信息,包括:身高、体重、脉搏、血糖、收缩压、舒张压、血红蛋白、谷丙转氨酶和体检间隔天数。如图2示出具体场景的流程示意图,由于大部分用户的身份证号、社保卡号信息缺失,因此,第一步要判断两条体检记录中的等级A信息是否存在。如果两条体检记录都包含身份证号或者社保卡号,则需要进一步判断身份证号或者社保卡号是否相同,若身份证号或者社保卡号相同,则这两条体检记录属于同一个人,若身份证或者社保卡均不同,则判定这两条体检记录属于不同的人。如果这两条体检记录中至少有一条的身份证号和社保卡号信息均缺失,则需要通过等级B继续判定。如果两条体检记录中包含等级B的信息,若性别、出生日期、血型和民族中只要有一项不相同,则能直接判定这两条记录属于不同的人。如果两条体检记录不包含等级B的信息或者等级B中的信息经过比较后均相同,则不能判定这两条体检记录是否属于同一个人,需要进一步通过等级C进行判定。通过上述等级A和等级B的判定,如果两条体检记录属于同一个人,则把这两条体检记录分为相同类样本,可记为正样本;如果两条体检记录属于不同的人,则把这两条体检记录分为不同类样本,可记为负样本;如果通过等级A和等级B比较还不能判定,则将这两条体检记录所在的样本分为待确定类样本;如果两条体检记录中至少有一条不包含等级A和等级B中的信息时,即缺失等级A和等级B中的信息,则将其所在的样本分为待确定类样本。
本领域技术人员应能理解,上述分类样本的方法仅为举例,其他现有的或今后可能出现的分类样本的方法如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
优选地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户数据的医疗综合描述特征信息进行训练所获得的。
在此,预测分类即判定待确定类样本中两个用户医疗记录数据是否属于同一个人,接上述具体场景,通过利用机器学习法创建关于正负样本中两个用户医疗记录数据的若 干体检记录信息的差别信息与正负样本的分类信息的逻辑回归模型,将正负样本中的两个用户医疗记录数据的若干体检记录信息的差别信息及相应样本的分类信息输入所创建的模型中;接着,基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,即训练正负样本以获得逻辑回归模型中各体检记录信息的差别信息的权重信息。
更优选地,所述机器学习法包括逻辑回归法,获得所述预测分类模型的过程包括:创建逻辑回归模型,所述逻辑回归模型为关于所述样本中两个用户医疗记录数据的若干医疗综合描述特征信息的差别信息与所述样本的分类信息的关系模型;获取所述相同类样本和所述不同类样本中的两个用户医疗记录数据的若干医疗综合描述特征信息的差别信息及相应样本的分类信息;基于所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行训练,以获得逻辑回归模型中各医疗综合描述特征信息的差别信息的权重信息。
继续接前例,相同类样本记为正样本,不同类样本记为负样本,计算正样本和负样本中两条体检记录中等级C的特征信息的相差值,得到如下形式集合:{身高差,体重差,脉搏差,血糖差,收缩压差,舒张压差,血红蛋白差,谷丙转氨酶差,体检间隔天数},接着,将得到的集合按照8:2的比例分为训练集和测试集,其中,训练集和测试集中正负样本比例均为1:1,利用逻辑回归法创建逻辑回归的模型形式如下:
Y=C0+C1*身高差+C2*体重差+C3*脉搏差+C4*血糖差+C5*收缩压差+C6*舒张压差+C7*血红蛋白差+C8*谷丙转氨酶差+C9*体检间隔天数
式中,C0、C1、C2……C9表示权重系数,Y表示分类结果,当Y大于等于0.5时,表示这两条体检记录属于同一个人,当Y小于0.5时,则表示这两条体检记录属于不同的人。
接着,用训练集中正负样本对创建的逻辑回归模型进行训练,得到权重系数C0、C1、C2……C9的值,权重系数的大小表明对应的特征信息对分类结果的影响大小。
更优选地,获得所述预测分类模型的过程还包括:利用未进行过训练的若干所述差别信息和相应所述样本的分类信息对所述逻辑回归模型进行测试。接上例,用训练集中正负样本对创建的逻辑回归模型进行训练得到权重系数后,将测试集中的正负样本对该模型进行测试,计算该模型的准确率和AUC(曲线下面积)值。
本领域技术人员应能理解,上述预测分类模型的方法仅为举例,其他现有的或今后可能出现的预测分类模型的方法如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
图5示出根据本申请又一个方面的一个优选实施例的步骤S13的方法流程示意图。步骤S13包括步骤S131、步骤S132和步骤S133。
其中,在步骤S131中,获取所述待确定类样本中两个用户医疗记录数据的若干医疗综合描述特征信息的差别信息;在步骤S132中,将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息;在步骤S133中,基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。
优选地,在步骤S131中,获取所述待确定类样本中两个用户医疗记录数据的若干医疗综合描述特征信息的差别信息。在此,差别信息是指待确定样本中两个用户医疗记录数据的相应的身高差、体重差、脉搏差、血糖差、收缩压差、舒张压差、血红蛋白差、谷丙转氨酶差和体检间隔天数的信息,计算所述的差别信息。
本领域技术人员应能理解,上述差别信息仅为举例,其他现有的或今后可能出现的差别信息如可适用于本发明,也应包含在本发明保护范围以内,并在此以引用方式包含于此。
接着,在步骤S132中,将所述差别信息输入所述逻辑回归模型,获得所述样本的分类信息。再继续接前例,基于得到逻辑回归模型的权重系数,将计算得到的差别信息输入所得逻辑回归模型,获取所述样本的分类信息即计算出分类结果值Y,得到Y的信息。
随后,在步骤S133中,基于所述样本的分类信息将所述待确定类样本再分为相同类样本或不同类样本。根据计算出的Y值,判定所述待确定类样本,当Y大于等于0.5时,表示待确定类样本为相同类样本,即样本中的两条体检记录属于同一个人;当Y小于0.5时,表示待确定类样本为不同类样本,即样本中的两条体检记录不属于同一个人。
优选地,所述机器学习法包括随机森林法。在此,基于相同类样本和不同类样本及其中用户医疗记录信息数据的综合描述特征信息进行训练获得预测分类模型,用N来表示训练的相同类样本和不同类样本的个数,M表示变量的数目。m为已知,被用来决定当在一个节点上做决定时,会使用到多少变量,其中,m小于M;从N个训练案例中以可重复取样的方式,取样N次,形成一组训练集(即bootstrap取样),并使用这棵树来对待确定类样本预测其类别,即判定待确定类样本中每个样本是否属于同一个人,并评估其误差。对于每一个节点,随机选择m个基于此点上的变量,根据m个变量,计算其最佳的分割方式;每棵树都会完整成长而不会剪枝(Pruning)。
本申请所述用于用户数据的分类方法和设备,通过获取若干用户数据,并基于所述用户数据生成若干样本,每一所述样本包括具有相同标识特征信息的两个用户数据;基 于所述样本中两个用户数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;接着,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的综合描述特征信息进行训练所获得的;基于所述相同类样本和不同类样本,对所述用户数据进行分类。从而更准确地根据用户数据的特征信息对用户数据进行关联识别,更好的对用户数据分类至相应用户的记录,打通用户多次的综合描述特征信息,以供为用户服务。
进一步地,本申请可应用于医疗领域,例如应用于对用户的体检记录的关联识别等,所述用户数据为用户的医疗记录数据,例如用户的体检记录等。本申请通过获取若干用户的医疗记录数据,并基于所述用户的医疗记录数据生成若干样本,每一所述样本包括具有相同姓名的两个用户医疗记录数据;基于所述样本中两个用户医疗记录数据的区别特征信息及所述区别特征信息的比较信息,将所述样本分为相同类样本、不同类样本或待确定类样本;接着,利用预测分类模型将所述待确定类样本再分为相同类样本或不同类样本,其中,所述预测分类模型为基于所述相同类样本和所述不同类样本及其中用户数据的医疗综合描述特征信息进行训练所获得的;基于所述相同类样本和不同类样本,对所述用户医疗记录数据进行分类。从而得到医疗机构中每个用户对应的一系列体检记录,打通用户连续多年的体检记录。进一步地,所述预测分类模型为利用机器学习法,基于所述相同类样本和所述不同类样本及其中用户医疗记录数据的综合描述特征信息进行训练所获得的,能够识别到全部的体检记录,且准确率高,同时,更好地开发和利用体检机构积累的海量用户医疗记录数据,对用户个人、医疗机构和社会产生巨大的价值。
需要注意的是,本申请可在软件和/或软件与硬件的组合体中被实施,例如,可采用专用集成电路(ASIC)、通用目的计算机或任何其他类似硬件设备来实现。在一个实施例中,本申请的软件程序可以通过处理器执行以实现上文所述步骤或功能。同样地,本申请的软件程序(包括相关的数据结构)可以被存储到计算机可读记录介质中,例如,RAM存储器,磁或光驱动器或软磁盘及类似设备。另外,本申请的一些步骤或功能可采用硬件来实现,例如,作为与处理器配合从而执行各个步骤或功能的电路。
另外,本申请的一部分可被应用为计算机程序产品,例如计算机程序指令,当其被计算机执行时,通过该计算机的操作,可以调用或提供根据本申请的方法和/或技术方案。而调用本申请的方法的程序指令,可能被存储在固定的或可移动的记录介质中,和/或通过广播或其他信号承载媒体中的数据流而被传输,和/或被存储在根据所述程序指令运行 的计算机设备的工作存储器中。在此,根据本申请的一个实施例包括一个装置,该装置包括用于存储计算机程序指令的存储器和用于执行程序指令的处理器,其中,当该计算机程序指令被该处理器执行时,触发该装置运行基于前述根据本申请的多个实施例的方法和/或技术方案。
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。装置权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第一,第二等词语用来表示名称,而并不表示任何特定的顺序。

Claims (18)

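  1. A method for classifying user data, wherein the method comprises:
    acquiring a plurality of user data, and generating a plurality of samples based on the user data, each of the samples including two user data having the same identification feature information;
    classifying each sample as a same-class sample, a different-class sample or a to-be-determined sample based on the difference feature information of the two user data in the sample and the comparison information of that difference feature information;
    reclassifying the to-be-determined samples as same-class samples or different-class samples using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples and the comprehensive description feature information of the user data therein;
    classifying the user data based on the same-class samples and the different-class samples.
  2. The method according to claim 1, wherein the user data is medical record data of a user, the difference feature information includes identity feature information of the user, and the comprehensive description feature information includes medical record feature information of the user.
  3. The method according to claim 1 or 2, wherein the difference feature information includes uniquely identifying difference feature information and non-uniquely identifying difference feature information.
  4. The method according to claim 3, wherein classifying each sample as a same-class sample, a different-class sample or a to-be-determined sample based on the difference feature information of the two user data in the sample and the comparison information of that difference feature information comprises:
    when both user data in the sample have uniquely identifying difference feature information, classifying the sample as a same-class sample or a different-class sample based on the comparison information of the uniquely identifying difference feature information;
    when at least one of the two user data in the sample does not have uniquely identifying difference feature information, classifying the sample as a different-class sample or a to-be-determined sample based on the comparison information of the non-uniquely identifying difference feature information;
    when at least one of the two user data in the sample has no difference feature information, classifying the sample as a to-be-determined sample.
  5. The method according to any one of claims 1 to 4, wherein the predictive classification model is obtained with a machine learning method by training based on the same-class samples, the different-class samples and the comprehensive description feature information of the user data therein.
  6. The method according to claim 5, wherein the machine learning method includes a logistic regression method, and the process of obtaining the predictive classification model comprises:
    creating a logistic regression model, the logistic regression model being a model of the relationship between the difference information of a plurality of comprehensive description feature information of the two user data in a sample and the classification information of that sample;
    acquiring the difference information of the plurality of comprehensive description feature information of the two user data in the same-class samples and the different-class samples, together with the classification information of the corresponding samples;
    training the logistic regression model based on the difference information and the classification information of the corresponding samples to obtain the weight information of the difference information of each comprehensive description feature information in the logistic regression model.
  7. The method according to claim 6, wherein the process of obtaining the predictive classification model further comprises:
    testing the logistic regression model with a plurality of the difference information that was not used in training and the classification information of the corresponding samples.
  8. The method according to any one of claims 5 to 7, wherein reclassifying the to-be-determined samples as same-class samples or different-class samples using the predictive classification model comprises:
    acquiring the difference information of the plurality of comprehensive description feature information of the two user data in a to-be-determined sample;
    inputting the difference information into the logistic regression model to obtain the classification information of the sample;
    reclassifying the to-be-determined sample as a same-class sample or a different-class sample based on the classification information of the sample.
  9. The method according to claim 8, wherein the machine learning method includes a random forest method.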
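  10. A device for classifying user data, wherein the device comprises:
    an acquisition apparatus, configured to acquire a plurality of pieces of user data and generate a plurality of samples based on the user data, each sample comprising two pieces of user data having the same identification feature information;
    a comparison apparatus, configured to classify the samples into same-class samples, different-class samples, or undetermined-class samples based on distinguishing feature information of the two pieces of user data in each sample and comparison information of the distinguishing feature information;
    a training apparatus, configured to reclassify the undetermined-class samples into same-class samples or different-class samples using a predictive classification model, wherein the predictive classification model is obtained by training based on the same-class samples, the different-class samples, and comprehensive description feature information of the user data therein;
    a classification apparatus, configured to classify the user data based on the same-class samples and the different-class samples.
  11. The device according to claim 10, wherein the user data is medical record data of users, the distinguishing feature information comprises identity feature information of the users, and the comprehensive description feature information comprises medical record feature information of the users.
  12. The device according to claim 10 or 11, wherein the distinguishing feature information comprises uniquely identifying distinguishing feature information and non-uniquely identifying distinguishing feature information.
  13. The device according to claim 12, wherein the comparison apparatus is configured to:
    when both pieces of user data in the sample have uniquely identifying distinguishing feature information, classify the sample as a same-class sample or a different-class sample based on comparison information of the uniquely identifying distinguishing feature information;
    when at least one of the two pieces of user data in the sample does not have uniquely identifying distinguishing feature information, classify the sample as a different-class sample or an undetermined-class sample based on comparison information of the non-uniquely identifying distinguishing feature information;
    when at least one of the two pieces of user data in the sample does not have distinguishing feature information, classify the sample as an undetermined-class sample.
  14. The device according to any one of claims 10 to 13, wherein the predictive classification model is obtained using a machine learning method, by training based on the same-class samples, the different-class samples, and the comprehensive description feature information of the user data therein.
  15. The device according to claim 14, wherein the machine learning method comprises a logistic regression method, and the process of obtaining the predictive classification model comprises:
    creating a logistic regression model, the logistic regression model being a model of the relationship between difference information of a plurality of items of comprehensive description feature information of the two pieces of user data in a sample and classification information of the sample;
    acquiring the difference information of the plurality of items of comprehensive description feature information of the two pieces of user data in the same-class samples and the different-class samples, and the classification information of the corresponding samples;
    training the logistic regression model based on the difference information and the classification information of the corresponding samples, to obtain weight information for the difference information of each item of comprehensive description feature information in the logistic regression model.
  16. The device according to claim 15, wherein the process of obtaining the predictive classification model further comprises:
    testing the logistic regression model using a plurality of items of the difference information that have not been used for training and the classification information of the corresponding samples.
  17. The device according to any one of claims 14 to 16, wherein the training apparatus comprises:
    an acquisition unit, configured to acquire difference information of a plurality of items of comprehensive description feature information of the two pieces of user data in the undetermined-class sample;
    an input unit, configured to input the difference information into the logistic regression model to obtain classification information of the sample;
    a sample unit, configured to reclassify the undetermined-class sample as a same-class sample or a different-class sample based on the classification information of the sample.
  18. The device according to claim 17, wherein the machine learning method comprises a random forest method.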
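
As an illustration of the logistic regression procedure recited in claims 6 to 8 (and mirrored in claims 15 to 17), the following is a minimal Python sketch under stated assumptions. The two difference features (height difference and blood type mismatch), the field names `height_cm` and `blood_type`, the toy training data, the plain batch gradient descent, and the 0.5 decision threshold are all hypothetical; the claims only require that weights be learned for the difference information of each comprehensive description feature, that the model be tested on difference information not used for training, and that undetermined-class samples be reclassified from the model output. The random forest variant is not shown.

```python
import math

def diff_vector(a, b):
    """Difference information for one sample (hypothetical features)."""
    return [
        abs(a["height_cm"] - b["height_cm"]) / 10.0,         # height gap, in decimetres
        0.0 if a["blood_type"] == b["blood_type"] else 1.0,  # blood type mismatch flag
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Model output: estimated probability that the two records are the same user."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent; returns one weight per difference feature and a bias."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_same(w, b, x, threshold=0.5):
    return score(w, b, x) >= threshold

# Difference vectors from samples already labelled by rule (1 = same, 0 = different).
# Toy values for illustration only.
X_train = [[0.0, 0.0], [0.1, 0.0], [0.05, 0.0], [0.2, 0.0],
           [1.2, 1.0], [0.8, 1.0], [1.5, 0.0], [0.9, 1.0]]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = train_logistic(X_train, y_train)

# Held-out labelled samples not used for training (the test step of claim 7).
X_test, y_test = [[0.1, 0.0], [1.1, 1.0]], [1, 0]
accuracy = sum(predict_same(w, b, x) == bool(t) for x, t in zip(X_test, y_test)) / len(X_test)

# Reclassifying one undetermined-class sample (the steps of claim 8).
rec_a = {"height_cm": 172, "blood_type": "A"}
rec_b = {"height_cm": 173, "blood_type": "A"}
is_same = predict_same(w, b, diff_vector(rec_a, rec_b))
```

The learned `w` plays the role of the weight information for each difference feature; a negative weight means that a larger difference in that feature lowers the estimated probability that the two records belong to the same user.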
PCT/CN2016/097495 2015-09-09 2016-08-31 Method and device for classifying user data WO2017041651A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510571182.2 2015-09-09
CN201510571182.2A CN106529110A (zh) 2015-09-09 2015-09-09 Method and device for classifying user data

Publications (1)

Publication Number Publication Date
WO2017041651A1 true WO2017041651A1 (zh) 2017-03-16

Family

ID=58240613

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097495 WO2017041651A1 (zh) 2015-09-09 2016-08-31 Method and device for classifying user data

Country Status (2)

Country Link
CN (1) CN106529110A (zh)
WO (1) WO2017041651A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
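CN107256231A (zh) * 2017-05-04 2017-10-17 腾讯科技(深圳)有限公司 Team member identification device, method and system
CN109961296A (zh) * 2017-12-25 2019-07-02 腾讯科技(深圳)有限公司 Merchant type identification method and device
CN112233740A (zh) * 2020-09-28 2021-01-15 广州金域医学检验中心有限公司 Patient identity recognition method, apparatus, device and medium
CN112417308A (zh) * 2020-12-17 2021-02-26 国网河北省电力有限公司营销服务中心 User portrait label generation method based on electric power big data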

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
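CN110019790B (zh) * 2017-10-09 2023-08-22 阿里巴巴集团控股有限公司 Text recognition, text monitoring, data object recognition, and data processing methods
CN108831552A (zh) * 2018-04-09 2018-11-16 平安科技(深圳)有限公司 Electronic device, nasopharyngeal carcinoma screening and analysis method, and computer-readable storage medium
CN110210884B (zh) * 2018-05-29 2023-05-05 腾讯科技(深圳)有限公司 Method, apparatus, computer device and storage medium for determining user feature data
CN109460440B (zh) * 2018-09-18 2023-10-27 平安科技(深圳)有限公司 Weight-value-based portrait processing method, apparatus and device
CN112925911B (zh) * 2021-02-25 2022-08-12 平安普惠企业管理有限公司 Complaint classification method based on multimodal data and related device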

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
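CN101739501A (zh) * 2008-11-04 2010-06-16 财团法人资讯工业策进会 Patient risk level identification system and method
CN102033965A (zh) * 2011-01-17 2011-04-27 安徽海汇金融投资集团有限公司 Data classification method and system based on a classification model
CN103200861A (zh) * 2011-11-04 2013-07-10 松下电器产业株式会社 Similar case retrieval device and similar case retrieval method
CN104778388A (zh) * 2015-05-04 2015-07-15 苏州大学 Method and system for identifying the same user on two different platforms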

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2583207B1 (en) * 2010-06-17 2018-12-19 Koninklijke Philips N.V. Identity matching of patient records
US20140317756A1 (en) * 2011-12-15 2014-10-23 Nec Corporation Anonymization apparatus, anonymization method, and computer program
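CN103678659A (zh) * 2013-12-24 2014-03-26 焦点科技股份有限公司 Method and system for identifying fraudulent users of e-commerce websites based on the random forest algorithm
CN104239490B (zh) * 2014-09-05 2017-05-10 电子科技大学 Multi-account detection method and device for UGC website platforms
CN104462318A (zh) * 2014-12-01 2015-03-25 国家电网公司 Method and device for identifying the identity of identical personal names across multiple networks
CN104537252B (zh) * 2015-01-05 2019-09-17 深圳市腾讯计算机系统有限公司 Method and device for training a single-class user state classification model
CN104537118B (zh) * 2015-01-26 2017-12-26 苏州大学 Microblog data processing method, device and system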

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
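CN107256231A (zh) * 2017-05-04 2017-10-17 腾讯科技(深圳)有限公司 Team member identification device, method and system
CN109961296A (zh) * 2017-12-25 2019-07-02 腾讯科技(深圳)有限公司 Merchant type identification method and device
CN112233740A (zh) * 2020-09-28 2021-01-15 广州金域医学检验中心有限公司 Patient identity recognition method, apparatus, device and medium
CN112233740B (zh) * 2020-09-28 2024-03-29 广州金域医学检验中心有限公司 Patient identity recognition method, apparatus, device and medium
CN112417308A (zh) * 2020-12-17 2021-02-26 国网河北省电力有限公司营销服务中心 User portrait label generation method based on electric power big data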

Also Published As

Publication number Publication date
CN106529110A (zh) 2017-03-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16843589

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16843589

Country of ref document: EP

Kind code of ref document: A1