WO2020098308A1 - Method, device and equipment for establishing crowd portrait classification model and storage medium - Google Patents


Info

Publication number
WO2020098308A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
user
factors
factor
user data
Prior art date
Application number
PCT/CN2019/097892
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020098308A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • the present application relates to the field of data processing technology, and in particular to a method, device, computer device, and storage medium for establishing a group portrait classification model.
  • Crowd portrait classification refers to the process of classifying crowd portraits on newly input user data through the crowd portrait classification model.
  • the crowd portrait classification model is constructed by using a preset model to train massive user data.
  • employee data includes: employee's position, length of service, education, gender, department and other employee attributes.
  • the preset model is trained on a large amount of employee data to construct an employee portrait classification model, and a number of employee portraits are then obtained through the employee portrait classification model to complete the classification of each employee.
  • employee's turnover situation can be used to build an employee turnover prediction model, and then the employee turnover prediction model can be used to predict the probability of an employee's turnover.
  • the preset models on which the crowd portrait classification model is built are mainly classification models and clustering models, such as SVM, neural network, k-means, etc.
  • the inventor of the present application found that the prior art has the following problem: whether a crowd portrait classification model is built on a classification model or on a clustering model, the resulting model can only be used for classification, its interpretability is poor, and it does not well reflect the correlation among the user attributes of the user data or the association between those user attributes and category attribution.
  • a method for establishing a crowd portrait classification model includes: acquiring user data to be classified for a crowd portrait, wherein each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select a factor among all the factors for correlation until all factors are correlated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the method further includes: performing data preprocessing on the user data.
  • the data preprocessing includes data cleaning and standardization processing; the data cleaning includes deleting vacant data, noise data, duplicate data, and error data in the user data; the standardization processing includes integrating multiple pieces of data corresponding to the same user.
  • using the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance among all unselected factors as the correlation factor of this factor, until all factors are selected;
  • formula one is: $$KL(P(X) \,\|\, T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
  • KL(P(X) || T(X)) represents the KL distance between this factor and any one of all unselected factors
  • P(X) represents the distribution of all factors before correlation
  • T(X) represents the distribution of all factors after correlation
  • X_i represents the i-th factor
  • H represents entropy
  • Pa(X_i) represents the parent node of X_i
  • I represents mutual information, which is calculated by formula two; formula two is: $$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$
  • p(a) represents the probability of the occurrence of the value a
  • p(b) represents the probability of the occurrence of the value b
  • p(a, b) represents the probability that the value a and the value b occur together
  • X_1 and X_2 represent any two user attributes among the plurality of user attributes
  • the value a is any value belonging to the user attribute X_1
  • the value b is any value belonging to the user attribute X_2.
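  • As an illustration of formula two, the following is a minimal sketch that estimates mutual information from observed attribute values by counting frequencies. The function name `mutual_information` and the toy attribute lists are assumptions made for this example and do not come from the application; the natural logarithm is used, so the result is in nats.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X1, X2) of two equal-length value lists."""
    n = len(xs)
    count_a = Counter(xs)            # occurrences of each value a of X1
    count_b = Counter(ys)            # occurrences of each value b of X2
    count_ab = Counter(zip(xs, ys))  # joint occurrences of (a, b)
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy attributes for four employees.
position = ["engineer", "engineer", "sales", "sales"]
department = ["R&D", "R&D", "market", "market"]
gender = ["F", "M", "F", "M"]

mi_dependent = mutual_information(position, department)
mi_independent = mutual_information(position, gender)
```

  • In this toy data, position fully determines department, so their mutual information equals log 2, while position and gender are independent, so their mutual information is zero.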
  • the user data includes labeled data and unlabeled data
  • the step of inputting the user data into the Bayesian network model for training includes: using a semi-supervised learning method to train the user data input into the Bayesian network model.
  • using a semi-supervised learning method to train the user data input into the Bayesian network model includes: using the Bayesian network model to perform label prediction on the unlabeled data; using the Bayesian network model to train on the label data; and repeating the above two steps alternately until the training process converges.
  • the method further includes: when receiving newly input user data, using the crowd portrait classification model to classify the user data to obtain a corresponding classification result.
  • an apparatus for establishing a crowd portrait classification model includes: a data acquisition unit for acquiring user data to be classified for a crowd portrait, wherein each piece of user data includes multiple user attributes corresponding to the user; a factor correlation unit for taking each user attribute as a factor of the Chow-Liu algorithm and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; and a data training unit for inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the device further includes: a pre-processing unit 802, configured to pre-process the user data.
  • when the data preprocessing includes data cleaning and standardization processing, the preprocessing unit includes a data cleaning module and a standardized processing module.
  • the data cleaning module is used to delete vacant data, noise data, duplicate data and error data in user data;
  • the standardized processing module is used to integrate multiple data corresponding to the same user.
  • the factor association unit 704 is specifically configured to perform the following steps:
  • formula one is: $$KL(P(X) \,\|\, T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
  • KL(P(X) || T(X)) represents the KL distance between this factor and any one of all unselected factors
  • P(X) represents the distribution of all factors before correlation
  • T(X) represents the distribution of all factors after correlation
  • X_i represents the i-th factor
  • H represents entropy
  • Pa(X_i) represents the parent node of X_i
  • I represents mutual information, which is calculated by formula two; formula two is: $$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$
  • p(a) represents the probability of the occurrence of the value a
  • p(b) represents the probability of the occurrence of the value b
  • p(a, b) represents the probability that the value a and the value b occur together
  • X_1 and X_2 represent any two user attributes among the plurality of user attributes
  • the value a is any value belonging to the user attribute X_1
  • the value b is any value belonging to the user attribute X_2.
  • the data training unit is specifically used to train the user data input into the Bayesian network model using a semi-supervised learning method.
  • when the user data includes label data and unlabeled data, the data training unit is specifically configured to perform the following steps: use the Bayesian network model to perform label prediction on the unlabeled data; use the Bayesian network model to train on the label data; and repeat the above two steps alternately until the training process converges.
  • the apparatus for establishing a crowd portrait classification model may further include a classification unit for using the crowd portrait classification model to perform crowd portrait classification on newly input user data when such data is received, to obtain the corresponding classification result.
  • a computer device includes a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the processor is caused to perform the steps of the method for establishing a crowd portrait classification model.
  • a non-volatile readable storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above method for establishing a crowd portrait classification model.
  • in the above method, device, computer equipment and storage medium for establishing a crowd portrait classification model, user data to be subjected to crowd portrait classification is acquired, where each piece of user data includes multiple user attributes corresponding to the user; each user attribute is taken as a factor of the Chow-Liu algorithm, and the Chow-Liu algorithm is used to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; the user data is input into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the method of establishing the above-mentioned crowd portrait classification model uses each user attribute of multiple user attributes included in the user data as a factor of the Chow-Liu algorithm.
  • the Chow-Liu algorithm is used for factor selection and correlation.
  • because the Chow-Liu algorithm can well reflect the relationship between the various factors, and at the same time reflect the relationship between the factors and category attribution, the crowd portrait classification model based on the Chow-Liu algorithm can well reflect the correlation between the user attributes of the user data and, at the same time, the association between each user attribute and category attribution.
  • FIG. 1 is an implementation environment diagram of a method for establishing a group portrait classification model provided in an embodiment
  • FIG. 2 is a block diagram of the internal structure of a computer device in an embodiment
  • FIG. 3 is a flowchart of a method for establishing a group portrait classification model in an embodiment
  • FIG. 4 is a flowchart of a method for establishing a crowd portrait classification model in an embodiment
  • FIG. 5 is a flowchart of a method for classifying a group portrait in an embodiment
  • FIG. 6 is a flowchart of a method for establishing a group portrait classification model in an embodiment
  • FIG. 7 is a structural block diagram of an apparatus for establishing a group portrait classification model in an embodiment
  • FIG. 8 is a structural block diagram of an apparatus for establishing a group portrait classification model in an embodiment
  • FIG. 9 is a structural block diagram of a preprocessing unit in an embodiment
  • FIG. 10 is a structural block diagram of an apparatus for establishing a crowd portrait classification model in an embodiment.
  • FIG. 1 is an implementation environment diagram of a method for establishing a crowd portrait classification model provided in an embodiment.
  • the implementation environment includes a computer device 110 and a database 120.
  • the database 120 stores user data to be classified for crowd portraits and newly input user data.
  • the user data to be classified for the crowd portrait stored in the database 120 includes label data and unlabeled data.
  • the computer device 110 is a device for processing user data to establish a crowd portrait classification model.
  • the computer device 110 acquires user data to be subjected to crowd portrait classification and newly input user data from the database 120.
  • the computer device 110 acquires the label data and unlabeled data from the database 120.
  • the model builder can use the computer device 110 to obtain the user data to be classified for the crowd portrait, then obtain the Bayesian network model according to the multiple user attributes included in the user data, and then input the user data to be classified into the Bayesian network model for training to obtain a crowd portrait classification model.
  • both the computer device 110 and the database 120 may be smart phones, tablet computers, notebook computers, desktop computers, servers, etc., but they are not limited thereto.
  • the computer device 110 and the database 120 may be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, and this application is not limited herein.
  • the database 120 may be independent of the computer device 110 (shown in FIG. 1), or the database 120 may be integrated with the inside of the computer device 110 (not shown in FIG. 1).
  • the computer device includes a processor, a non-volatile readable storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store user data to be classified for crowd portraits and newly input user data.
  • the user data to be classified for the crowd portrait stored in the database includes label data and unlabeled data.
  • when the computer-readable instructions are executed by the processor, they may cause the processor to implement a method for establishing a crowd portrait classification model.
  • the processor of the computer device is used to provide calculation and control capabilities, and support the operation of the entire computer device.
  • the memory of the computer device may store computer readable instructions.
  • when the computer-readable instructions stored in the memory are executed by the processor, they may cause the processor to execute a method for establishing a crowd portrait classification model.
  • the network interface of the computer device is used to communicate with the outside.
  • a method for establishing a crowd portrait classification model is proposed.
  • the method for establishing a crowd portrait classification model can be applied to the computer device 110 described above, and includes the following steps:
  • Step S302 Obtain user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user;
  • the user data includes multiple user attributes corresponding to the user.
  • employee data includes multiple employee attributes corresponding to the employee: position, length of service, education, gender, department, etc.
  • For each user attribute among the multiple user attributes there are several values belonging to the user attribute, and several numeric values correspond to several pieces of user data, that is, one-to-one correspondence with several users.
  • the employee data includes multiple employee attributes corresponding to the employee: position, seniority, education, gender, department, etc.; assuming that there are an employee A and an employee B, their employee data is as shown in Table 1.
  • the user data to be classified for crowd portraits is sample data to be input into a preset model, and the preset model is a Bayesian network model obtained according to multiple user attributes.
  • Step S304 taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated to obtain a Bayesian network model;
  • the Chow-Liu algorithm can better reflect the association between the factors and, at the same time, the association between each factor and category attribution. Furthermore, while making predictions, the influence of the various factors on the prediction results can be summarized from the model. For example, taking the prediction of the probability of employee turnover as an example, the Chow-Liu algorithm makes it possible to predict the probability that a certain employee leaves and, at the same time, to discover which employee attributes are the influencing factors behind a high turnover rate, so as to provide a reference for subsequent recruitment work. In the specific implementation process, multiple associations are performed.
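  • step S304 includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance among all unselected factors as the correlation factor of the factor, until all factors are selected;
  • formula one is: $$KL(P(X) \,\|\, T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
  • KL(P(X) || T(X)) represents the KL distance between this factor and any one of all unselected factors
  • P(X) represents the distribution of all factors before correlation
  • T(X) represents the distribution of all factors after correlation
  • X_i represents the i-th factor
  • H represents entropy
  • Pa(X_i) represents the parent node of X_i
  • I represents mutual information, which is calculated by formula two; formula two is: $$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$ where p(a) and p(b) are the probabilities of the values a and b of the user attributes X_1 and X_2, and p(a, b) is the probability that both values occur together.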
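  • The following short derivation is added as a reading aid and is not part of the original text. It restates formula one and notes a standard consequence: the two entropy terms depend only on the data distribution P(X), not on which parent Pa(X_i) is chosen for each factor, so selecting the structure with the smallest KL distance amounts to selecting the structure with the largest total mutual information.

```latex
\begin{aligned}
KL(P(X) \,\|\, T(X))
  &= -\sum_{i} I(X_i, Pa(X_i))
     + \underbrace{\sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)}_{\text{independent of the choice of } Pa(X_i)} \\
\Longrightarrow\quad
\arg\min_{T} KL(P(X) \,\|\, T(X))
  &= \arg\max_{T} \sum_{i} I(X_i, Pa(X_i))
\end{aligned}
```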
  • selecting one factor from multiple factors to be associated with another factor includes the following steps:
  • Step 1): calculate the KL distance between each of the multiple factors and the other factor according to formula one and formula two;
  • Step 2): determine, from the multiple factors, the factor with the smallest KL distance from the other factor; the criterion for the first determination is the smallest KL distance, the criterion for the second determination is the second smallest KL distance, the criterion for the third determination is the third smallest KL distance, and so on;
  • Step 3): judge whether a loop would be generated if the determined factor were associated with the other factor; if the judgment result is no, go to step 4); if the judgment result is yes, return to step 2);
  • Step 4): associate the determined factor with the other factor.
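  • The selection and loop check in steps 1) to 4) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it uses the common formulation of the Chow-Liu algorithm as a maximum spanning tree over pairwise mutual information (the smallest KL distance corresponds to the largest mutual information), with a union-find structure for the loop check. The names `chow_liu_edges` and `mutual_information` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information of two equal-length value lists (formula two)."""
    n = len(xs)
    ca, cb, cab = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((ca[a] / n) * (cb[b] / n)))
               for (a, b), c in cab.items())


def chow_liu_edges(columns):
    """columns maps an attribute name to its list of values; returns the tree edges."""
    names = list(columns)
    # Steps 1) and 2): rank candidate pairs, strongest dependence first.
    pairs = sorted(combinations(names, 2),
                   key=lambda p: mutual_information(columns[p[0]], columns[p[1]]),
                   reverse=True)
    root = {name: name for name in names}  # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path compression
            x = root[x]
        return x

    edges = []
    for u, v in pairs:
        ru, rv = find(u), find(v)
        if ru != rv:           # step 3): the association creates no loop
            root[ru] = rv      # step 4): associate the two factors
            edges.append((u, v))
    return edges


# Hypothetical toy data: position determines department, gender is unrelated.
columns = {
    "position": ["engineer", "engineer", "sales", "sales"],
    "department": ["R&D", "R&D", "market", "market"],
    "gender": ["F", "M", "F", "M"],
}
edges = chow_liu_edges(columns)
```

  • With n attributes the result always has n - 1 edges, which is what makes the associated factors a tree rather than a graph with loops.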
  • Step S306 Input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the user data to be classified for the crowd portrait is input into the Bayesian network model for training to obtain the crowd portrait classification model.
  • taking employee data as the user data and the construction of an employee turnover prediction model as an example: for employee data whose turnover probability is to be predicted, first, a Bayesian network model is obtained according to the multiple employee attributes included in the employee data (position, seniority, education, gender, department, etc.); then, the employee data is input into the obtained Bayesian network model for training to obtain the employee turnover prediction model.
  • the method of establishing the above-mentioned crowd portrait classification model uses each user attribute of multiple user attributes included in the user data as a factor of the Chow-Liu algorithm.
  • the Chow-Liu algorithm is used for factor selection and correlation. Because the Chow-Liu algorithm can well reflect the relationship between the various factors, and at the same time reflect the relationship between the factors and category attribution, the crowd portrait classification model based on the Chow-Liu algorithm can well reflect the correlation between the user attributes of the user data and, at the same time, the association between each user attribute and category attribution.
  • FIG. 4 shows an implementation flowchart of a method for establishing a crowd portrait classification model when the user data includes label data and unlabeled data in one embodiment, including the following steps: step S402, obtaining user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; step S404, taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select a factor among all factors for correlation until all the factors are correlated, to obtain the Bayesian network model; step S406, training the user data input into the Bayesian network model using a semi-supervised learning method to obtain the crowd portrait classification model.
  • step S402 and step S404 is similar to the implementation process of step S302 and step S304 respectively, which will not be repeated here.
  • some of the user data to be classified for crowd portraits can be pre-labeled. The user data to be classified therefore includes label data and unlabeled data, and both are input to the Bayesian network model for training to obtain a more accurate crowd portrait classification model.
  • for example, for employee data whose turnover probability is to be predicted, where the employee data includes multiple employee attributes corresponding to the employee (position, seniority, education, gender, department, etc.), part of the employee data is labeled according to the actual turnover of those employees: employees who have left, and employees who have not left.
  • step S406 includes the following steps: using the Bayesian network model to perform label prediction on the unlabeled data; using the Bayesian network model to train on the label data; and repeating the above two steps alternately until the training process converges.
  • the semi-supervised learning methods include E-step and M-step.
  • the user data includes label data and unlabeled data.
  • the process of inputting user data into the Bayesian network model obtained after performing step S404 is as follows: First, perform E-step, that is, use the Bayesian network model obtained after performing step S404 to perform label prediction on unlabeled data . Subsequently, M-step is performed, that is, using label data to retrain the Bayesian network model, and repeating E-step and M-step alternately until the training process converges, and finally a crowd portrait classification model is obtained.
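  • The alternation of E-step and M-step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the application: a naive Bayes classifier with Laplace smoothing stands in for the Bayesian network obtained in step S404, predicted labels are hard labels, and convergence is taken to mean that the predicted labels stop changing. All function names and the toy employee rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit(rows, labels, alpha=1.0):
    """M-step: count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, label) -> value counts
    values = defaultdict(set)    # attribute index -> values seen in training
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
            values[i].add(v)
    return class_counts, cond, values, alpha


def predict(model, row):
    """Return the most probable label for one row under the current model."""
    class_counts, cond, values, alpha = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, cy in class_counts.items():
        score = math.log(cy / total)
        for i, v in enumerate(row):
            # Laplace-smoothed estimate of P(value | label).
            score += math.log((cond[(i, y)][v] + alpha) / (cy + alpha * len(values[i])))
        if score > best_score:
            best, best_score = y, score
    return best


def semi_supervised_fit(labeled_rows, labels, unlabeled_rows, max_iter=20):
    model = fit(labeled_rows, labels)
    previous = None
    for _ in range(max_iter):
        guessed = [predict(model, r) for r in unlabeled_rows]  # E-step
        if guessed == previous:                                 # converged
            break
        model = fit(labeled_rows + unlabeled_rows, labels + guessed)  # M-step
        previous = guessed
    return model


# Hypothetical toy data: (position, seniority level) with turnover labels.
labeled_rows = [("sales", "junior"), ("sales", "junior"),
                ("engineer", "senior"), ("engineer", "senior")]
labels = ["left", "left", "stayed", "stayed"]
unlabeled_rows = [("sales", "junior"), ("engineer", "senior"), ("sales", "senior")]

model = semi_supervised_fit(labeled_rows, labels, unlabeled_rows)
```

  • Using hard labels keeps the sketch short; a soft variant would weight each unlabeled row by its predicted label probabilities in the M-step.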
  • the Bayesian network model obtained from multiple user attributes can also use such unlabeled data for training; that is to say, in addition to using labeled data as training data, unlabeled data can be added as training data to avoid the problem of too little training data, thereby improving the accuracy of the final crowd portrait classification model.
  • the method includes the following steps: step S502, obtaining user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; step S504, taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; step S506, inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model; step S508, when newly input user data is received, using the crowd portrait classification model to perform crowd portrait classification on that user data to obtain the corresponding classification result.
  • step S502-step S506 is similar to the implementation process of step S302-step S306, and will not be repeated here.
  • the crowd portrait classification model can be used to realize the crowd portrait classification.
  • the newly input user data is received, and then the user data is input to the crowd portrait classification model obtained after performing steps S502-S506, and the output of the crowd portrait classification model is the classification result.
  • taking the employee turnover prediction model as an example, the employee data of an employee is input into the employee turnover prediction model, and the model predicts the probability of that employee leaving; that is, the output of the employee turnover prediction model is the probability of the employee leaving.
  • FIG. 6 shows an implementation flowchart of a method for establishing a crowd portrait classification model in an embodiment, including the following steps: step S602, obtaining user data to be subjected to crowd portrait classification, where each piece of user data includes multiple user attributes corresponding to the user; step S604, performing data preprocessing on the user data; step S606, taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select a factor among all factors for correlation until all the factors are correlated, to obtain a Bayesian network model; step S608, inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the implementation process of step S602 is similar to the implementation process of step S302, and will not be repeated here.
  • the data preprocessing in step S604 includes: data cleaning and standardization processing; the data cleaning includes: deleting vacant data, noise data, duplicate data, and error data in user data; and the standardization processing includes : Integrate multiple data corresponding to the same user.
  • the user data originally acquired in step S602 may contain "dirty data", including vacancies, noise, inconsistencies, duplication, and errors. To ensure the accuracy of later data processing, and so that the classification results obtained from the crowd portrait classification model do not adversely affect the final decision, the originally acquired user data needs to be preprocessed; that is, the vacant data, noise data, duplicate data, and error data in the originally acquired user data are deleted.
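  • The data cleaning and standardization processing of step S604 can be sketched as follows. This is a minimal illustration under assumptions: records are plain dictionaries, the field names (`user_id`, `position`, `seniority`) and the validity rule for `seniority` are hypothetical, and "integrating multiple data corresponding to the same user" is read as merging all valid records that share a `user_id`.

```python
def preprocess(records, required=("user_id", "position", "seniority")):
    """Drop vacant, erroneous and duplicate records, then merge records per user."""
    merged = {}
    seen = set()
    for rec in records:
        # Vacant data: a required field is missing or empty.
        if any(rec.get(key) in (None, "") for key in required):
            continue
        # Error or noise data: hypothetical validity rule for this sketch.
        if not isinstance(rec["seniority"], (int, float)) or rec["seniority"] < 0:
            continue
        # Duplicate data: an identical record was already processed.
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Standardization: integrate all records of the same user into one.
        merged.setdefault(rec["user_id"], {}).update(rec)
    return list(merged.values())


# Hypothetical raw records.
records = [
    {"user_id": 1, "position": "sales", "seniority": 2},
    {"user_id": 1, "position": "sales", "seniority": 2},                           # duplicate
    {"user_id": 1, "position": "sales", "seniority": 2, "education": "bachelor"},  # same user
    {"user_id": 2, "position": "", "seniority": 5},                                # vacant field
    {"user_id": 3, "position": "engineer", "seniority": -1},                       # invalid value
    {"user_id": 4, "position": "engineer", "seniority": 7},
]
clean = preprocess(records)
```

  • Only users 1 and 4 remain in `clean`, and the two distinct records of user 1 are merged into a single record that also carries the `education` attribute.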
  • in step S606, each user attribute included in the user data preprocessed in step S604 is used as a factor of the Chow-Liu algorithm; the rest is similar to step S304.
  • in step S608, the user data preprocessed in step S604 is input into the Bayesian network model obtained in step S606 for training; the rest is similar to step S306.
  • an apparatus for establishing a crowd portrait classification model may be integrated into the computer device 110 described above, and may include a data acquisition unit 702, a factor association unit 704, and a data training unit 706.
  • the data acquisition unit 702 is used to obtain user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; the factor association unit 704 is used to take each user attribute as a factor of the Chow-Liu algorithm and to use the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; the data training unit 706 is used to input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the apparatus for establishing a crowd portrait classification model may further include a preprocessing unit 802.
  • the preprocessing unit 802 is used to perform data preprocessing on the user data. As shown in FIG.
  • when the data preprocessing includes data cleaning and standardization processing, the preprocessing unit 802 includes a data cleaning module 802A and a standardized processing module 802B.
  • the data cleaning module 802A is used to delete vacant data, noise data, duplicate data and erroneous data in user data;
  • the standardized processing module 802B is used to integrate multiple data corresponding to the same user.
  • the factor association unit 704 is specifically configured to perform the following steps: for each factor, select, according to formula one, the factor with the smallest KL distance among all unselected factors as the correlation factor of the factor, until all factors are selected;
  • formula one is: $$KL(P(X) \,\|\, T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
  • KL(P(X) || T(X)) represents the KL distance between this factor and any one of all unselected factors
  • P(X) represents the distribution of all factors before correlation
  • T(X) represents the distribution of all factors after correlation
  • X_i represents the i-th factor
  • H represents entropy
  • Pa(X_i) represents the parent node of X_i
  • I represents mutual information, which is calculated by formula two; formula two is: $$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$
  • p(a) represents the probability of the occurrence of the value a
  • p(b) represents the probability of the occurrence of the value b
  • p(a, b) represents the probability that the value a and the value b occur together
  • X_1 and X_2 represent any two user attributes among the plurality of user attributes
  • the value a is any value belonging to the user attribute X_1
  • the value b is any value belonging to the user attribute X_2.
  • the data training unit 706 is specifically configured to use a semi-supervised learning method to train the user data input into the Bayesian network model.
  • the data training unit 706 is specifically configured to perform the following steps: use the Bayesian network model to perform label prediction on the unlabeled data; use the Bayesian network model to train on the label data; and repeat the above two steps alternately until the training process converges.
  • the apparatus for establishing a crowd portrait classification model may further include a classification unit 1002.
  • the classification unit 1002 is configured to use the crowd portrait classification model to perform crowd portrait classification on the user data when receiving newly input user data, and obtain a corresponding classification result.
  • a computer device is proposed.
  • the computer device includes a non-volatile readable storage medium, a processor, and computer-readable instructions stored on the non-volatile readable storage medium and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: acquiring user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • the processor when the processor executes the computer-readable instructions, it also performs the following step: performing data preprocessing on the user data.
  • the data preprocessing includes: data cleaning and standardization processing; the step of data preprocessing performed by the processor on the user data includes: deleting vacant data and noise data in the user data, Duplicate data and wrong data; integrate multiple data corresponding to the same user.
  • the process performed by the processor using the Chow-Liu algorithm to select factors among all factors for correlation until the step of correlating all factors includes: for each factor, according to formula one, all unselected The factor with the smallest distance from its KL is selected as the correlation factor of the factor until all factors are selected; the formula one is:
  • T (X)) represents the KL distance between this factor and any one of all unselected factors
  • P (X) represents the distribution of all factors before correlation
  • T (X) Represents the distribution of all factors after correlation
  • X i represents the i-th factor
  • H represents entropy
  • Pa (X i ) represents the parent node of X i
  • I represents mutual information, which is calculated by formula two, the formula The second is:
  • p (a) represents the probability of the occurrence of the value a
  • p (b) represents the probability of the occurrence of the value b
  • p (a, b) represents the probability of the occurrence of the value b under the premise of the occurrence of the value b
  • X 1 and X 2 represent Any two user attributes among the plurality of user attributes
  • the value a is any value belonging to the user attribute X 1
  • the value b is any value belonging to the user attribute X 2
  • the user data includes labeled data and unlabeled data
  • the step of inputting the user data into the Bayesian network model and performed by the processor for training includes using semi-supervised The learning method trains the user data input into the Bayesian network model.
  • the step of training the user data input into the Bayesian network model by the processor using a semi-supervised learning method includes: using the Bayesian network model to label unlabeled data Prediction; using the Bayesian network model to train the label data; repeatedly performing the above two steps alternately until the training process converges.
  • the processor executes the computer-readable instructions, it also performs the following steps: when receiving newly input user data, the crowd portrait classification model is used to classify the user data to obtain the corresponding classification result.
  • a non-volatile readable storage medium storing computer-readable instructions is proposed.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: acquiring user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes of the corresponding user; using each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select and associate factors among all factors until all factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  • when the processor executes the computer-readable instructions, it also performs the following step: performing data preprocessing on the user data.
  • the data preprocessing includes data cleaning and standardization; the step of data preprocessing performed by the processor on the user data includes: deleting vacant data, noise data, duplicate data, and erroneous data from the user data; and integrating multiple pieces of data corresponding to the same user.
  • the step, performed by the processor, of using the Chow-Liu algorithm to select and associate factors among all factors until all factors are associated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected; formula one is: $KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$
  • $KL(P(X)\,\|\,T(X))$ represents the KL distance between the factor and any one of the unselected factors
  • $P(X)$ represents the distribution of all factors before association
  • $T(X)$ represents the distribution of all factors after association
  • $X_i$ represents the i-th factor
  • $H$ represents entropy
  • $Pa(X_i)$ represents the parent node of $X_i$
  • $I$ represents mutual information, which is calculated by formula two
  • $p(a)$ represents the probability that the value a occurs
  • $p(b)$ represents the probability that the value b occurs
  • $p(a, b)$ represents the joint probability that the values a and b occur together
  • $X_1$ and $X_2$ represent any two of the multiple user attributes
  • the value a is any value belonging to the user attribute $X_1$
  • the value b is any value belonging to the user attribute $X_2$
  • the user data includes labeled data and unlabeled data
  • the step, performed by the processor, of inputting the user data into the Bayesian network model for training includes: using a semi-supervised learning method to train on the user data input into the Bayesian network model.
  • the step, performed by the processor, of using a semi-supervised learning method to train on the user data input into the Bayesian network model includes: using the Bayesian network model to predict labels for the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the above two steps alternately until the training process converges.
  • when the processor executes the computer-readable instructions, it also performs the following step: when newly input user data is received, using the crowd portrait classification model to classify the user data into crowd portraits and obtain the corresponding classification result.
  • the computer program may be stored in a computer-readable storage medium.
  • the foregoing storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).

Abstract

The present invention relates to a method, device and computer equipment for establishing a crowd portrait classification model, and a storage medium. The method comprises: acquiring user data to be classified into crowd portraits, each piece of user data comprising a plurality of user attributes of the corresponding user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select and associate factors among all factors until all factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model. A crowd portrait classification model built by this method is more interpretable and reflects well the correlation between the user attributes of the user data.

Description

Method, device, equipment and storage medium for establishing crowd portrait classification model
This application claims priority to Chinese patent application No. 201811340717.5, filed with the Chinese Patent Office on November 12, 2018 and entitled "Method, device, equipment and storage medium for establishing crowd portrait classification model", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data processing technology, and in particular to a method, device, computer device, and storage medium for establishing a crowd portrait classification model.
Background
Crowd portrait classification refers to the process of using a crowd portrait classification model to classify newly input user data into crowd portraits. The crowd portrait classification model is built by training a preset model on massive user data.
Taking employee portrait classification as an example, employee data includes multiple employee attributes such as position, length of service, education, gender, and department. A preset model is trained on massive employee data to build an employee portrait classification model, which then yields a number of employee portraits and thereby classifies each employee. During employee portrait classification, employees' resignation records can be used to build an employee turnover prediction model, which in turn predicts the probability that a given employee will resign.
At present, the preset models on which crowd portrait classification models are built are mainly classification models and clustering models, such as SVMs, neural networks, and k-means. However, in researching and practicing the prior art, the inventors of the present application found the following problem: whether it is built on a classification model or a clustering model, the resulting crowd portrait classification model can only be used for classification; it is poorly interpretable and does not reflect well the correlation between the user attributes of the user data or the association between those user attributes and category membership.
Summary of the invention
In view of this, to address the problem that currently built crowd portrait classification models are poorly interpretable and do not reflect well the correlation between the user attributes of user data, it is necessary to provide a method, device, computer device, and storage medium for establishing a crowd portrait classification model.
A method for establishing a crowd portrait classification model, the method including: acquiring user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes of the corresponding user; using each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select and associate factors among all factors until all factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
In one of the embodiments, the method further includes: performing data preprocessing on the user data.
In one of the embodiments, the data preprocessing includes data cleaning and standardization; the data cleaning includes deleting vacant data, noise data, duplicate data, and erroneous data from the user data; the standardization includes integrating multiple pieces of data corresponding to the same user.
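The two preprocessing stages can be sketched as follows. This is only an illustration under stated assumptions: records are plain dictionaries, a "vacant" value is `None` or an empty string, and the names `clean`, `integrate`, `is_valid`, and `user_id` are hypothetical and do not come from the application. What counts as noise or erroneous data is left to a caller-supplied check.

```python
def clean(records, is_valid=lambda record: True):
    """Data cleaning: drop vacant, duplicate, and invalid records."""
    seen = set()
    cleaned = []
    for record in records:
        # Vacant data: any attribute present but empty.
        if any(value in (None, "") for value in record.values()):
            continue
        # Noise / erroneous data: rejected by the caller-supplied check.
        if not is_valid(record):
            continue
        # Duplicate data: identical attribute/value pairs seen before.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


def integrate(records, id_field="user_id"):
    """Standardization: merge all records of the same user into one."""
    merged = {}
    for record in records:
        merged.setdefault(record[id_field], {}).update(record)
    return list(merged.values())


raw = [
    {"user_id": "A", "position": "0011"},
    {"user_id": "A", "seniority": 5},
    {"user_id": "A", "seniority": 5},                        # duplicate
    {"user_id": "B", "position": "", "seniority": 2},        # vacant
    {"user_id": "B", "position": "0012", "seniority": -1},   # erroneous
    {"user_id": "B", "position": "0012", "seniority": 2},
]
users = integrate(clean(raw, is_valid=lambda r: r.get("seniority", 0) >= 0))
```

Cleaning runs before integration here so that a bad record cannot overwrite a good value for the same user; a real pipeline might order or define these stages differently.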
In one of the embodiments, using the Chow-Liu algorithm to select and associate factors among all factors until all factors are associated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected;
Formula one is: $KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$
where $KL(P(X)\,\|\,T(X))$ represents the KL distance between the factor and any one of the unselected factors, $P(X)$ represents the distribution of all factors before association, and $T(X)$ represents the distribution of all factors after association; $X_i$ represents the i-th factor, $H$ represents entropy, and $Pa(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two. Formula two is:
Figure PCTCN2019097892-appb-000001
where $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the values a and b occur together; $X_1$ and $X_2$ represent any two of the multiple user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.
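The selection step can be sketched in code. Formula two appears only as an image in this text, so the sketch assumes it is the standard mutual information $I(X_1, X_2) = \sum_{a,b} p(a,b)\log\frac{p(a,b)}{p(a)\,p(b)}$, with probabilities estimated by relative frequency. It then builds the tree in the textbook Chow-Liu way, as a maximum-weight spanning tree over pairwise mutual information, which may differ in detail from the procedure of the application. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information between two discrete attribute columns."""
    n = len(xs)
    count_a = Counter(xs)
    count_b = Counter(ys)
    count_ab = Counter(zip(xs, ys))
    total = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log(p(a,b) / (p(a) * p(b))), written with raw counts.
        total += (c / n) * math.log((c * n) / (count_a[a] * count_b[b]))
    return total


def chow_liu_edges(columns):
    """Return (parent, child) edges of a tree over the attributes.

    columns maps each attribute name to its list of values (one per user)
    and must contain at least one attribute. Only factors not yet in the
    tree are attached, so no loop can form.
    """
    names = list(columns)
    weight = {
        frozenset(pair): mutual_information(columns[pair[0]], columns[pair[1]])
        for pair in combinations(names, 2)
    }
    in_tree = [names[0]]  # the first factor is chosen arbitrarily
    edges = []
    while len(in_tree) < len(names):
        candidates = [(u, v) for u in in_tree for v in names if v not in in_tree]
        parent, child = max(candidates, key=lambda e: weight[frozenset(e)])
        edges.append((parent, child))
        in_tree.append(child)
    return edges


columns = {
    "position":   [0, 0, 1, 1, 0, 0, 1, 1],
    "department": [0, 0, 1, 1, 0, 0, 1, 1],
    "gender":     [0, 1, 0, 1, 0, 1, 0, 1],
    "seniority":  [0, 1, 0, 1, 0, 1, 1, 0],
}
edges = chow_liu_edges(columns)
```

In this toy data the department column copies the position column and seniority mostly follows gender, so those two pairs end up joined by edges.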
In one of the embodiments, the user data includes labeled data and unlabeled data, and the step of inputting the user data into the Bayesian network model for training includes: using a semi-supervised learning method to train on the user data input into the Bayesian network model.
In one of the embodiments, using a semi-supervised learning method to train on the user data input into the Bayesian network model includes: using the Bayesian network model to predict labels for the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the above two steps alternately until the training process converges.
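The alternating procedure can be sketched as a self-training loop. Several points are assumptions rather than details from the application: the stand-in model is a small naive Bayes classifier (the simplest Bayesian network, used only to keep the example short), the training step uses the labeled data together with the freshly predicted labels, and "converges" is taken to mean that the predicted labels stop changing. The names `TinyNaiveBayes` and `self_train` are hypothetical.

```python
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Stand-in model: naive Bayes over discrete attributes."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)  # (attribute index, label) -> value counts
        self.values = defaultdict(set)      # attribute index -> observed values
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.counts[(j, label)][value] += 1
                self.values[j].add(value)
        return self

    def _score(self, row, label):
        score = self.prior[label]
        for j, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the score.
            score *= (self.counts[(j, label)][value] + 1) / (
                self.prior[label] + len(self.values[j])
            )
        return score

    def predict(self, rows):
        return [max(self.classes, key=lambda c: self._score(row, c)) for row in rows]


def self_train(model, labeled_rows, labels, unlabeled_rows, max_rounds=20):
    """Alternate label prediction and training until predictions are stable."""
    model.fit(labeled_rows, labels)
    predicted = model.predict(unlabeled_rows)          # step 1: label prediction
    for _ in range(max_rounds):
        model.fit(labeled_rows + unlabeled_rows, labels + predicted)  # step 2: train
        updated = model.predict(unlabeled_rows)        # step 1 again
        if updated == predicted:
            break                                      # converged
        predicted = updated
    return model, predicted


labeled = [("0011", "high"), ("0012", "low")]
labels = ["stay", "leave"]
unlabeled = [("0011", "high"), ("0012", "low")]
model, pseudo_labels = self_train(TinyNaiveBayes(), labeled, labels, unlabeled)
```

The `max_rounds` cap is a safeguard in case the predicted labels oscillate instead of settling.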
In one of the embodiments, the method further includes: when newly input user data is received, using the crowd portrait classification model to classify the user data into crowd portraits and obtain the corresponding classification result.
An apparatus for establishing a crowd portrait classification model, the apparatus including: a data acquisition unit, configured to acquire user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes of the corresponding user; a factor association unit, configured to use each user attribute as a factor of the Chow-Liu algorithm and to use the Chow-Liu algorithm to select and associate factors among all factors until all factors are associated, to obtain a Bayesian network model; and a data training unit, configured to input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
In one of the embodiments, the apparatus further includes a preprocessing unit 802, configured to perform data preprocessing on the user data.
In one of the embodiments, when the data preprocessing includes data cleaning and standardization, the preprocessing unit includes a data cleaning module and a standardization module. The data cleaning module is configured to delete vacant data, noise data, duplicate data, and erroneous data from the user data; the standardization module is configured to integrate multiple pieces of data corresponding to the same user.
In one of the embodiments, the factor association unit 704 is specifically configured to perform the following steps:
For each factor, select, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected;
Formula one is: $KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$
where $KL(P(X)\,\|\,T(X))$ represents the KL distance between the factor and any one of the unselected factors, $P(X)$ represents the distribution of all factors before association, and $T(X)$ represents the distribution of all factors after association; $X_i$ represents the i-th factor, $H$ represents entropy, and $Pa(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two. Formula two is:
Figure PCTCN2019097892-appb-000002
where $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the values a and b occur together; $X_1$ and $X_2$ represent any two of the multiple user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.
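One way to read formula one, offered as an explanatory note rather than as part of the application: the two entropy terms depend only on the data distribution $P(X)$ and not on which parent is chosen for each factor. Minimizing the KL distance over candidate structures $T$ is therefore the same as maximizing the total mutual information between each factor and its parent:

```latex
\begin{aligned}
\arg\min_{T}\; KL(P(X)\,\|\,T(X))
  &= \arg\min_{T}\Big[-\sum_{i} I(X_i, Pa(X_i))
     + \underbrace{\sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)}_{\text{does not depend on } T}\Big] \\
  &= \arg\max_{T}\; \sum_{i} I(X_i, Pa(X_i))
\end{aligned}
```

Read this way, picking the associated factor with the smallest KL distance amounts to picking the one with the largest mutual information.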
In one of the embodiments, when the user data includes labeled data and unlabeled data, the data training unit is specifically configured to use a semi-supervised learning method to train on the user data input into the Bayesian network model.
In one of the embodiments, when the user data includes labeled data and unlabeled data, the data training unit is specifically configured to perform the following steps: use the Bayesian network model to predict labels for the unlabeled data; use the Bayesian network model to train on the labeled data; and repeat the above two steps alternately until the training process converges.
In one of the embodiments, the apparatus for establishing a crowd portrait classification model may further include a classification unit, configured to, when newly input user data is received, use the crowd portrait classification model to classify the user data into crowd portraits and obtain the corresponding classification result.
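As a rough illustration of what such a classification unit might compute, the sketch below trains a tree-structured Bayesian network by counting and then labels a new record with the class value that gives the highest joint probability. The application does not spell out these details, so the treatment of the class label as one node of the tree, the Laplace smoothing, and the names `fit_tree_model`, `joint_probability`, and `classify` are all assumptions.

```python
from collections import Counter


def fit_tree_model(columns, edges, root):
    """Estimate P(root) and each P(child | parent) by counting.

    edges is a list of (parent, child) pairs forming a tree rooted at root,
    so every non-root attribute appears exactly once as a child.
    """
    tables = []
    for parent, child in edges:
        tables.append({
            "parent": parent,
            "child": child,
            "pair_counts": Counter(zip(columns[parent], columns[child])),
            "parent_counts": Counter(columns[parent]),
            "child_values": len(set(columns[child])),
        })
    return {
        "root": root,
        "root_counts": Counter(columns[root]),
        "n": len(columns[root]),
        "tables": tables,
    }


def joint_probability(record, model):
    """P(record) = P(root) * product of P(child | parent) along the tree."""
    p = model["root_counts"][record[model["root"]]] / model["n"]
    for t in model["tables"]:
        parent_value = record[t["parent"]]
        child_value = record[t["child"]]
        # Laplace smoothing so unseen combinations keep a small probability.
        p *= (t["pair_counts"][(parent_value, child_value)] + 1) / (
            t["parent_counts"][parent_value] + t["child_values"]
        )
    return p


def classify(record, class_attribute, class_values, model):
    """Pick the class value that maximizes the joint probability."""
    return max(
        class_values,
        key=lambda c: joint_probability({**record, class_attribute: c}, model),
    )


columns = {
    "label":     ["stay", "stay", "leave", "leave"],
    "seniority": ["high", "high", "low", "low"],
    "position":  ["0011", "0011", "0012", "0012"],
}
model = fit_tree_model(columns, [("label", "seniority"), ("seniority", "position")], "label")
result = classify({"seniority": "high", "position": "0011"}, "label", ["leave", "stay"], model)
```

Because the model is a tree, the same conditional tables that drive the prediction can be inspected directly to see how one attribute relates to another, which is the interpretability benefit the text emphasizes.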
A computer device, including a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the above method for establishing a crowd portrait classification model.
A non-volatile readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above method for establishing a crowd portrait classification model.
With the above method, apparatus, computer device, and storage medium for establishing a crowd portrait classification model, user data to be classified into crowd portraits is acquired, where each piece of user data includes multiple user attributes of the corresponding user; each user attribute is used as a factor of the Chow-Liu algorithm, and the Chow-Liu algorithm is used to select and associate factors among all factors until all factors are associated, to obtain a Bayesian network model; and the user data is input into the Bayesian network model for training to obtain the crowd portrait classification model. The above method uses each of the multiple user attributes in the user data as a factor of the Chow-Liu algorithm and uses the Chow-Liu algorithm for factor selection and association. Because the Chow-Liu algorithm reflects well the association between the factors, as well as the association between the factors and category membership, a crowd portrait classification model built on the Chow-Liu algorithm reflects well the correlation between the user attributes of the user data and the association between those user attributes and category membership.
Brief description of the drawings
FIG. 1 is a diagram of the implementation environment of a method for establishing a crowd portrait classification model provided in an embodiment;
FIG. 2 is a block diagram of the internal structure of a computer device in an embodiment;
FIG. 3 is a flowchart of a method for establishing a crowd portrait classification model in an embodiment;
FIG. 4 is a flowchart of a method for establishing a crowd portrait classification model in an embodiment;
FIG. 5 is a flowchart of a crowd portrait classification method in an embodiment;
FIG. 6 is a flowchart of a method for establishing a crowd portrait classification model in an embodiment;
FIG. 7 is a structural block diagram of an apparatus for establishing a crowd portrait classification model in an embodiment;
FIG. 8 is a structural block diagram of an apparatus for establishing a crowd portrait classification model in an embodiment;
FIG. 9 is a structural block diagram of a preprocessing unit in an embodiment;
FIG. 10 is a structural block diagram of an apparatus for establishing a crowd portrait classification model in an embodiment.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
FIG. 1 is a diagram of the implementation environment of a method for establishing a crowd portrait classification model provided in an embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a database 120. The database 120 stores the user data to be classified into crowd portraits and newly input user data. When part of the user data to be classified has been labeled in advance, the user data to be classified that is stored in the database 120 includes labeled data and unlabeled data. The computer device 110 is a device that processes user data to establish the crowd portrait classification model; it obtains the user data to be classified and the newly input user data from the database 120. When the user data to be classified that is stored in the database 120 includes labeled data and unlabeled data, the computer device 110 obtains the labeled data and the unlabeled data from the database 120. When a crowd portrait classification model needs to be established, the model builder can use the computer device 110 to obtain the user data to be classified, derive a Bayesian network model from the multiple user attributes included in that user data, and then input the user data to be classified into the Bayesian network model for training to obtain the crowd portrait classification model.
It should be noted that the computer device 110 and the database 120 may each be a smartphone, tablet computer, notebook computer, desktop computer, server, or the like, but are not limited thereto. The computer device 110 and the database 120 may be connected via Bluetooth, USB (Universal Serial Bus), or another communication connection, which is not limited in this application. The database 120 may be independent of the computer device 110 (as shown in FIG. 1), or the database 120 may be integrated inside the computer device 110 (not shown in FIG. 1).
FIG. 2 is a schematic diagram of the internal structure of a computer device in an embodiment. As shown in FIG. 2, the computer device includes a processor, a non-volatile readable storage medium, a memory, and a network interface connected through a system bus. The non-volatile readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database may store the user data to be classified into crowd portraits and newly input user data. When part of the user data to be classified has been labeled in advance, the user data to be classified that is stored in the database includes labeled data and unlabeled data. When executed by the processor, the computer-readable instructions cause the processor to implement a method for establishing a crowd portrait classification model. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to perform a method for establishing a crowd portrait classification model. The network interface of the computer device is used to communicate with external devices. Those skilled in the art can understand that the structure shown in FIG. 2 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
As shown in FIG. 3, in one embodiment, a method for establishing a crowd portrait classification model is proposed. The method can be applied to the computer device 110 described above and includes the following steps:
Step S302: acquire user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes of the corresponding user;
In this embodiment, the user data includes multiple user attributes of the corresponding user. Taking employee data as an example of user data, the employee data includes multiple employee attributes of the corresponding employee: position, length of service, education, gender, department, and so on. Each user attribute has several values, which correspond one-to-one to the pieces of user data, that is, to the users. For example, suppose there are two employees, employee A and employee B, whose employee data are shown in Table 1.
Table 1. Employee data
Figure PCTCN2019097892-appb-000003
As can be seen from Table 1, the position attribute has two values, 0011 and 0012, which correspond to the positions of employee A and employee B respectively, that is, one-to-one to employee A and employee B. Similarly, the length-of-service attribute has two values, 5 and 2, which correspond to the lengths of service of employee A and employee B respectively. The user data to be classified into crowd portraits is the sample data to be input into a preset model, and the preset model is the Bayesian network model obtained from the multiple user attributes.
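As a small illustration of this one-to-one correspondence, the sketch below turns per-employee records into per-attribute value lists. Only the position and length-of-service values quoted in the text are used, because the rest of Table 1 is not reproduced here; the function name `attribute_columns` and the dictionary keys are hypothetical.

```python
def attribute_columns(records):
    """Map each attribute to its list of values, ordered by user.

    Assumes every record carries the same set of attributes, so the k-th
    value of every column belongs to the same user.
    """
    columns = {}
    for user in sorted(records):
        for attribute, value in records[user].items():
            columns.setdefault(attribute, []).append(value)
    return columns


employees = {
    "A": {"position": "0011", "length_of_service": 5},
    "B": {"position": "0012", "length_of_service": 2},
}
columns = attribute_columns(employees)
```

Each resulting column holds one value per employee, which is the per-attribute view that a factor-by-factor method such as Chow-Liu operates on.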
步骤S304,将每个用户属性作为Chow-Liu算法的一个因子,利用所述Chow-Liu算法在所有因子中选择因子进行关联,直至关联所有的因子,得到贝叶斯网络模型;Step S304, taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated to obtain a Bayesian network model;
In this embodiment, the Chow-Liu algorithm is used because it reflects both the associations among the factors and the association between each factor and the category. As a result, the influence of each factor on the prediction result can be read from the model while the prediction is made. For example, when predicting the probability that an employee leaves, the Chow-Liu algorithm makes it possible to predict that probability and, at the same time, to find which employee attributes contribute to a high turnover rate, which provides a reference for subsequent recruitment.
In a specific implementation, the association is performed multiple times. In the first association, any one of the user attributes is taken as the first factor of the Chow-Liu algorithm, the remaining user attributes are taken as the other factors, and one of these other factors is selected and associated with the first factor. In the second association, any one of the user attributes remaining after the previous association is taken as the second factor, the user attributes remaining in the current association are taken as the other factors, and one of these other factors is selected and associated with the second factor. The third association is similar to the second. Association is repeated in this way until all factors are associated. Note that no association may produce a loop.
In one implementation, step S304 includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected. Formula one is:
$$KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
where $KL(P(X)\,\|\,T(X))$ denotes the KL distance between the factor and any one of the unselected factors; $P(X)$ denotes the distribution of all factors before association; $T(X)$ denotes the distribution of all factors after association; $X_i$ denotes the i-th factor; $H$ denotes entropy; $Pa(X_i)$ denotes the parent node of $X_i$; and $I$ denotes mutual information, computed by formula two. Formula two is:
Figure PCTCN2019097892-appb-000004
where p(a) denotes the probability that value a occurs, p(b) denotes the probability that value b occurs, and p(a, b) denotes the probability that value a and value b occur together; $X_1$ and $X_2$ denote any two of the multiple user attributes, value a is any value of user attribute $X_1$, and value b is any value of user attribute $X_2$.
In a specific implementation, selecting one factor from multiple factors and associating it with another factor includes the following steps:
Step 1): according to formula one and formula two, compute the KL distance between each of the multiple factors and the other factor. Formula one and formula two are, respectively:
$$KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$ (formula one)
Figure PCTCN2019097892-appb-000005
Step 2): from the multiple factors, determine the factor with the smallest KL distance from the other factor. The first determination uses the smallest KL distance as its criterion, the second uses the second-smallest, the third uses the third-smallest, and so on.
Step 3): judge whether associating the determined factor with the other factor would produce a loop. If not, go to step 4); if so, return to step 2).
Step 4): associate the determined factor with the other factor.
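The selection-with-loop-check procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: it assumes formula two is the standard mutual information $I(X_1, X_2) = \sum_{a,b} p(a,b)\log\frac{p(a,b)}{p(a)\,p(b)}$ estimated from value frequencies, and it assumes that the smallest KL distance corresponds to the largest mutual information, as the $-\sum I$ term of formula one suggests. It uses the textbook Kruskal-style construction (strongest pairs first, skipping any pair that would close a loop) rather than a literal transcription of steps 1) to 4). The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information between two equally long value columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed (a, b) of p(a,b) * log(p(a,b) / (p(a) * p(b))).
    return sum(
        (c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def chow_liu_edges(columns):
    """Return loop-free associations between attribute names.

    `columns` maps an attribute name to its list of values (one per user).
    Pairs are tried from highest to lowest mutual information.
    """
    names = list(columns)
    scored = sorted(
        ((mutual_information(columns[u], columns[v]), u, v)
         for u, v in combinations(names, 2)),
        key=lambda t: t[0],
        reverse=True,
    )
    root = {name: name for name in names}  # union-find forest for loop checks

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    edges = []
    for _, u, v in scored:
        ru, rv = find(u), find(v)
        if ru != rv:  # u and v are not yet connected, so no loop is created
            root[ru] = rv
            edges.append((u, v))
    return edges


# Hypothetical toy data: four employees, three attributes.
columns = {
    "position": ["0011", "0012", "0011", "0012"],
    "department": ["A", "B", "A", "B"],  # mirrors position exactly
    "gender": ["F", "F", "M", "M"],      # independent of the other two
}
edges = chow_liu_edges(columns)
```

With n attributes the result always has n - 1 associations, which is what "associate all factors without producing a loop" amounts to in a tree.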
Step S306: input the user data into the Bayesian network model for training, to obtain the crowd portrait classification model.
After the Bayesian network model is obtained from the multiple user attributes, the user data to be classified into crowd portraits is input into the Bayesian network model for training, to obtain the crowd portrait classification model. For example, taking employee data as the user data and an employee turnover prediction model as the model to be built: for the employee data whose turnover probability is to be predicted, first, a Bayesian network model is obtained from the employee attributes included in the employee data (position, length of service, education, gender, department, and so on); then, the employee data is input into that Bayesian network model for training, to obtain the employee turnover prediction model.
In the above method for establishing a crowd portrait classification model, each of the user attributes included in the user data is taken as a factor of the Chow-Liu algorithm, and the Chow-Liu algorithm is used to select and associate factors. Because the Chow-Liu algorithm reflects both the associations among factors and the association between each factor and the category, a crowd portrait classification model built on it reflects the correlations among the user attributes of the user data as well as the association between each user attribute and the category.
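The text does not spell out what training the Bayesian network model involves. One common reading, assumed here, is that training estimates a conditional probability table for each factor given its parent in the tree. The sketch below illustrates that reading only; the function name `fit_cpts`, the add-alpha smoothing, the node names, and the records are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_cpts(records, parents, alpha=1.0):
    """Estimate P(node = value | parent = parent_value) with add-alpha smoothing.

    `parents` maps each node to its parent node, or to None for the root.
    """
    values = defaultdict(set)  # node -> set of values seen in the data
    for rec in records:
        for node, val in rec.items():
            values[node].add(val)

    counts = {node: defaultdict(Counter) for node in parents}
    for rec in records:
        for node, par in parents.items():
            context = rec[par] if par is not None else None
            counts[node][context][rec[node]] += 1

    cpts = {}
    for node, by_context in counts.items():
        cpts[node] = {}
        for context, counter in by_context.items():
            total = sum(counter.values()) + alpha * len(values[node])
            cpts[node][context] = {
                v: (counter[v] + alpha) / total for v in values[node]
            }
    return cpts


# Hypothetical records; "left" is the class node (whether the employee left).
records = [
    {"left": "yes", "seniority": "short", "education": "bachelor"},
    {"left": "yes", "seniority": "short", "education": "master"},
    {"left": "no", "seniority": "long", "education": "bachelor"},
    {"left": "no", "seniority": "long", "education": "bachelor"},
]
parents = {"left": None, "seniority": "left", "education": "seniority"}
cpts = fit_cpts(records, parents)
```

Smoothing keeps every probability above zero, so a value combination that never appears in the training data does not force a later prediction to zero.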
FIG. 4 shows a flowchart of a method for establishing a crowd portrait classification model in one embodiment, for the case where the user data includes labeled data and unlabeled data. The method includes the following steps:
Step S402: obtain user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user.
Step S404: take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors from among all factors and associate them, until all factors are associated, to obtain a Bayesian network model.
Step S406: train on the user data input into the Bayesian network model using a semi-supervised learning method, to obtain the crowd portrait classification model.
Steps S402 and S404 are implemented similarly to steps S302 and S304 respectively, and are not described again here.
To improve the accuracy of the resulting crowd portrait classification model, part of the user data can be labeled in advance, so that the user data to be classified includes labeled data and unlabeled data; both are then input into the Bayesian network model for training, yielding a more accurate crowd portrait classification model. For example, taking employee data as the user data and an employee turnover prediction model as the model to be built: on the one hand, a Bayesian network model is obtained from the employee attributes included in the employee data (position, length of service, education, gender, department, and so on); on the other hand, part of the employee data is labeled according to the actual turnover of those employees, as "has left" or "has not left". The labeled employee data (turnover known) and the unlabeled employee data (turnover unknown) are then input into the Bayesian network model for training, to obtain the employee turnover prediction model.
In one implementation, step S406 includes the following steps: using the Bayesian network model to predict labels for the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the two steps alternately until the training process converges.
In a specific implementation, the semi-supervised learning method includes an E-step and an M-step, and the user data includes labeled data and unlabeled data. The user data is input into the Bayesian network model obtained in step S404 and trained as follows. First, the E-step is performed: the Bayesian network model obtained in step S404 is used to predict labels for the unlabeled data. Then the M-step is performed: the Bayesian network model is retrained using the labeled data. The E-step and the M-step are repeated alternately until the training process converges, finally yielding the crowd portrait classification model.
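The alternation of E-step and M-step can be sketched as the loop below. Several points are assumptions rather than statements of this application: a smoothed categorical naive Bayes classifier stands in for the tree-structured Bayesian network to keep the example short; the M-step retrains on the labeled data together with the labels currently predicted for the unlabeled data (a hard-assignment variant); and convergence is taken to mean that the predicted labels stop changing. All names and data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(rows, labels, alpha=1.0):
    """Fit a smoothed categorical naive Bayes model (stand-in classifier)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (label, attribute) -> value counts
    vocab = defaultdict(set)     # attribute -> observed values
    for row, y in zip(rows, labels):
        for attr, val in row.items():
            cond[(y, attr)][val] += 1
            vocab[attr].add(val)
    return prior, cond, vocab, alpha


def predict(model, row):
    """Return the label with the highest smoothed log-posterior score."""
    prior, cond, vocab, alpha = model
    total = sum(prior.values())

    def score(y):
        s = log(prior[y] / total)
        for attr, val in row.items():
            denom = prior[y] + alpha * (len(vocab[attr]) + 1)
            s += log((cond[(y, attr)][val] + alpha) / denom)
        return s

    return max(sorted(prior), key=score)


def semi_supervised_fit(labeled_rows, labels, unlabeled_rows, max_iter=20):
    model = train(labeled_rows, labels)  # initial model from labeled data only
    guesses = None
    for _ in range(max_iter):
        # E-step: predict labels for the unlabeled data with the current model.
        new_guesses = [predict(model, row) for row in unlabeled_rows]
        if new_guesses == guesses:  # predictions are stable: treat as converged
            break
        guesses = new_guesses
        # M-step: retrain on labeled data plus the currently predicted labels.
        model = train(labeled_rows + unlabeled_rows, labels + guesses)
    return model, guesses


labeled_rows = [
    {"seniority": "short", "dept": "sales"},
    {"seniority": "long", "dept": "rd"},
]
labels = ["left", "stayed"]
unlabeled_rows = [
    {"seniority": "short", "dept": "sales"},
    {"seniority": "long", "dept": "rd"},
    {"seniority": "short", "dept": "sales"},
]
model, guesses = semi_supervised_fit(labeled_rows, labels, unlabeled_rows)
```

The `max_iter` cap is a safeguard for the case where the predicted labels keep oscillating instead of settling.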
In this embodiment, when the obtained user data includes unlabeled data, the Bayesian network model obtained from the multiple user attributes can also use this unlabeled data for training. That is, in addition to the labeled data, unlabeled data can be added as training data, which avoids the problem of too little training data and improves the accuracy of the final crowd portrait classification model.
FIG. 5 shows a flowchart of a method for classifying crowd portraits in one embodiment, including the following steps:
Step S502: obtain user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user.
Step S504: take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors from among all factors and associate them, until all factors are associated, to obtain a Bayesian network model.
Step S506: input the user data into the Bayesian network model for training, to obtain the crowd portrait classification model.
Step S508: when newly input user data is received, use the crowd portrait classification model to classify the user data into crowd portraits, to obtain the corresponding classification result.
Steps S502 to S506 are implemented similarly to steps S302 to S306 respectively, and are not described again here. Once the crowd portrait classification model is obtained, it can be used to classify crowd portraits. Specifically, newly input user data is received and input into the crowd portrait classification model obtained by steps S502 to S506; the output of the model is the classification result. For example, taking employee data as the user data and an employee turnover prediction model as the crowd portrait classification model: the employee data of an employee is input into the employee turnover prediction model, and the output of the model is the probability that the employee leaves.
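As an illustration of how a trained tree-structured model could turn an employee's attributes into a turnover probability, the sketch below multiplies the conditional probabilities along the tree for each class value and normalizes. The conditional probability tables, the tree structure, and the function name `class_posterior` are hypothetical, and the source does not state that classification is computed this way.

```python
def class_posterior(cpts, parents, class_node, evidence):
    """Return P(class value | evidence) for a record with all attributes observed."""
    joint = {}
    for c in cpts[class_node][None]:
        record = dict(evidence)
        record[class_node] = c
        p = 1.0
        for node, par in parents.items():
            context = record[par] if par is not None else None
            p *= cpts[node][context][record[node]]  # P(node | parent)
        joint[c] = p
    z = sum(joint.values())  # normalizing constant over class values
    return {c: p / z for c, p in joint.items()}


# Hypothetical tree: left -> seniority -> education, with made-up tables.
parents = {"left": None, "seniority": "left", "education": "seniority"}
cpts = {
    "left": {None: {"yes": 0.5, "no": 0.5}},
    "seniority": {
        "yes": {"short": 0.75, "long": 0.25},
        "no": {"short": 0.25, "long": 0.75},
    },
    "education": {
        "short": {"bachelor": 0.5, "master": 0.5},
        "long": {"bachelor": 0.75, "master": 0.25},
    },
}
posterior = class_posterior(
    cpts, parents, "left", {"seniority": "short", "education": "master"}
)
```

In this toy tree only the seniority table depends on the class node, so the education term cancels during normalization and the posterior is driven by seniority alone.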
FIG. 6 shows a flowchart of a method for establishing a crowd portrait classification model in one embodiment, including the following steps:
Step S602: obtain user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user.
Step S604: perform data preprocessing on the user data.
Step S606: take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors from among all factors and associate them, until all factors are associated, to obtain a Bayesian network model.
Step S608: input the user data into the Bayesian network model for training, to obtain the crowd portrait classification model.
Step S602 is implemented similarly to step S302 and is not described again here.
In one implementation, the data preprocessing in step S604 includes data cleaning and standardization. The data cleaning includes deleting vacant data, noise data, duplicate data, and erroneous data from the user data; the standardization includes integrating the multiple pieces of data corresponding to the same user.
In this embodiment, the user data originally obtained in step S602 may contain "dirty data", with problems such as vacancies, noise, inconsistency, duplication, and errors. To ensure the accuracy of later data processing, and to reduce the adverse effect on the final decision of the classification result produced by the crowd portrait classification model, the originally obtained user data needs to be preprocessed; that is, the vacant data, noise data, duplicate data, and erroneous data in it are deleted. In addition, building crowd portraits requires the ability to integrate multi-source data: a user may use several devices and hold several accounts on the network. The accounts of the same user therefore need to be combined, that is, the multiple pieces of data corresponding to the same user are integrated, and a unified standard is established so that the user's crowd portrait is identified completely.
After step S604, step S606 takes each user attribute included in the user data preprocessed in step S604 as a factor of the Chow-Liu algorithm; the rest is similar to step S304. Likewise, step S608 inputs the user data preprocessed in step S604 into the Bayesian network model obtained in step S606 for training; the rest is similar to step S306.
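The cleaning and integration described for step S604 might look like the following sketch. It covers only vacant and duplicate records and the merging of several accounts into one user; detecting noise or erroneous data depends on domain rules that the text does not give, so it is left out. The field names, the account-to-user mapping, and both function names are hypothetical.

```python
def clean(records, required):
    """Drop records with a missing required field and exact duplicate records."""
    seen, kept = set(), []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in required):
            continue  # vacant data
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # duplicate data
        seen.add(key)
        kept.append(rec)
    return kept


def merge_accounts(records, account_to_user):
    """Combine the records of several accounts into one record per user."""
    merged = {}
    for rec in records:
        user = account_to_user[rec["account"]]
        profile = merged.setdefault(user, {"user": user})
        for field, value in rec.items():
            if field != "account":
                profile.setdefault(field, value)  # first value seen wins
    return list(merged.values())


raw = [
    {"account": "a1", "position": "0011", "seniority": 5},
    {"account": "a1", "position": "0011", "seniority": 5},         # duplicate
    {"account": "a2", "position": "", "seniority": 2},             # vacant position
    {"account": "a3", "position": "0011", "department": "sales"},  # same person as a1
]
account_to_user = {"a1": "employee_A", "a2": "employee_B", "a3": "employee_A"}

cleaned = clean(raw, required=["account", "position"])
users = merge_accounts(cleaned, account_to_user)
```

Keeping the first value seen is only one possible rule for conflicting values across accounts; a real pipeline would need an explicit policy for such conflicts.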
As shown in FIG. 7, in one embodiment, an apparatus for establishing a crowd portrait classification model is provided. The apparatus may be integrated in the computer device 110 described above, and may include a data acquisition unit 702, a factor association unit 704, and a data training unit 706.
The data acquisition unit 702 is configured to obtain user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user. The factor association unit 704 is configured to take each user attribute as a factor of the Chow-Liu algorithm, and to use the Chow-Liu algorithm to select factors from among all factors and associate them, until all factors are associated, to obtain a Bayesian network model. The data training unit 706 is configured to input the user data into the Bayesian network model for training, to obtain the crowd portrait classification model.
As shown in FIG. 8, the apparatus for establishing a crowd portrait classification model may further include a preprocessing unit 802, configured to perform data preprocessing on the user data.
As shown in FIG. 9, in one embodiment, when the data preprocessing includes data cleaning and standardization, the preprocessing unit 802 includes a data cleaning module 802A and a standardization module 802B. The data cleaning module 802A is configured to delete vacant data, noise data, duplicate data, and erroneous data from the user data; the standardization module 802B is configured to integrate the multiple pieces of data corresponding to the same user.
In one embodiment, the factor association unit 704 is specifically configured to perform the following steps: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected. Formula one is:
$$KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
where $KL(P(X)\,\|\,T(X))$ denotes the KL distance between the factor and any one of the unselected factors; $P(X)$ denotes the distribution of all factors before association; $T(X)$ denotes the distribution of all factors after association; $X_i$ denotes the i-th factor; $H$ denotes entropy; $Pa(X_i)$ denotes the parent node of $X_i$; and $I$ denotes mutual information, computed by formula two. Formula two is:
Figure PCTCN2019097892-appb-000006
where p(a) denotes the probability that value a occurs, p(b) denotes the probability that value b occurs, and p(a, b) denotes the probability that value a and value b occur together; $X_1$ and $X_2$ denote any two of the multiple user attributes, value a is any value of user attribute $X_1$, and value b is any value of user attribute $X_2$.
In one embodiment, when the user data includes labeled data and unlabeled data, the data training unit 706 is specifically configured to train on the user data input into the Bayesian network model using a semi-supervised learning method.
In one embodiment, when the user data includes labeled data and unlabeled data, the data training unit 706 is specifically configured to perform the following steps: using the Bayesian network model to predict labels for the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the two steps alternately until the training process converges.
As shown in FIG. 10, the apparatus for establishing a crowd portrait classification model may further include a classification unit 1002. The classification unit 1002 is configured to, when newly input user data is received, use the crowd portrait classification model to classify the user data into crowd portraits, to obtain the corresponding classification result.
In one embodiment, a computer device is provided. The computer device includes a non-volatile readable storage medium, a processor, and computer-readable instructions stored on the non-volatile readable storage medium and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: obtaining user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors from among all factors and associate them, until all factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training, to obtain the crowd portrait classification model.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following step: performing data preprocessing on the user data.
In one embodiment, the data preprocessing includes data cleaning and standardization; the step, performed by the processor, of performing data preprocessing on the user data includes: deleting vacant data, noise data, duplicate data, and erroneous data from the user data; and integrating the multiple pieces of data corresponding to the same user.
In one embodiment, the step, performed by the processor, of using the Chow-Liu algorithm to select factors from among all factors and associate them until all factors are associated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected. Formula one is:
$$KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
where $KL(P(X)\,\|\,T(X))$ denotes the KL distance between the factor and any one of the unselected factors; $P(X)$ denotes the distribution of all factors before association; $T(X)$ denotes the distribution of all factors after association; $X_i$ denotes the i-th factor; $H$ denotes entropy; $Pa(X_i)$ denotes the parent node of $X_i$; and $I$ denotes mutual information, computed by formula two. Formula two is:
Figure PCTCN2019097892-appb-000007
where p(a) denotes the probability that value a occurs, p(b) denotes the probability that value b occurs, and p(a, b) denotes the probability that value a and value b occur together; $X_1$ and $X_2$ denote any two of the multiple user attributes, value a is any value of user attribute $X_1$, and value b is any value of user attribute $X_2$.
In one embodiment, the user data includes labeled data and unlabeled data, and the step, performed by the processor, of inputting the user data into the Bayesian network model for training includes: training on the user data input into the Bayesian network model using a semi-supervised learning method.
In one embodiment, the step, performed by the processor, of training on the user data input into the Bayesian network model using a semi-supervised learning method includes: using the Bayesian network model to predict labels for the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the two steps alternately until the training process converges.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following step: when newly input user data is received, using the crowd portrait classification model to classify the user data into crowd portraits, to obtain the corresponding classification result.
In one embodiment, a non-volatile readable storage medium storing computer-readable instructions is provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: obtaining user data to be classified into crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors from among all factors and associate them, until all factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training, to obtain the crowd portrait classification model.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following step: performing data preprocessing on the user data.
In one embodiment, the data preprocessing includes data cleaning and standardization; the step, performed by the processor, of performing data preprocessing on the user data includes: deleting vacant data, noise data, duplicate data, and erroneous data from the user data; and integrating the multiple pieces of data corresponding to the same user.
In one embodiment, the step, performed by the processor, of using the Chow-Liu algorithm to select factors from among all factors and associate them until all factors are associated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as its associated factor, until all factors have been selected. Formula one is:
$$KL(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, Pa(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$
where $KL(P(X)\,\|\,T(X))$ denotes the KL distance between the factor and any one of the unselected factors; $P(X)$ denotes the distribution of all factors before association; $T(X)$ denotes the distribution of all factors after association; $X_i$ denotes the i-th factor; $H$ denotes entropy; $Pa(X_i)$ denotes the parent node of $X_i$; and $I$ denotes mutual information, computed by formula two. Formula two is:
Figure PCTCN2019097892-appb-000008
其中,p(a)表示数值a出现的概率,p(b)表示数值b出现的概率,p(a,b)表示数值b出现的前提下数值b出现的概率,X 1和X 2代表所述多个用户属性中任两个用户属性,数值a为属于用户属性X 1的任一数值,数值b为属于用户属性X 2的任一数值。在一个实施例中,所述用户数据中包括标签数据以及未标签数据,所述处理器所执行的将所述用户数据输入到所述贝叶斯网络模型中进行训练的步骤包括:采用半监督学习方法对输入到贝叶斯网络模型中的用户数据进行训练。在一个实施例中,所述处理器所执行的采用半监督学习方法对输入到贝叶斯网络模型中的用户数据进行训练的步骤包括:利用所述贝叶斯网络模型对未标签数据进行标签预测;利用所述贝叶斯网络模型对标签数据进行训练;重复交替执行上述两个步骤,直至训练过程收敛。在一个实施例中,处理器执行计算机可读指令时还执行以下步骤:在接收到新输入的用户数据时,利用所述人群画像分类模型对所述用户数据进行人群画像分类,得到对应的分类结果。本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,该计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。以上所述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。 Among them, p (a) represents the probability of the occurrence of the value a, p (b) represents the probability of the occurrence of the value b, p (a, b) represents the probability of the occurrence of the value b under the premise of the occurrence of the value b, X 1 and X 2 represent Any two user attributes among the plurality of user attributes, the value a is any value belonging to the user attribute X 1 , and the value b is any value belonging to the user attribute X 2 . In one embodiment, the user data includes labeled data and unlabeled data, and the step of inputting the user data into the Bayesian network model and performed by the processor for training includes using semi-supervised The learning method trains the user data input into the Bayesian network model. 
In one embodiment, the step, performed by the processor, of training the user data input into the Bayesian network model using a semi-supervised learning method includes: performing label prediction on the unlabeled data using the Bayesian network model; training on the labeled data using the Bayesian network model; and repeating the above two steps alternately until the training process converges.
In one embodiment, when executing the computer-readable instructions, the processor further performs the following step: upon receiving newly input user data, performing crowd portrait classification on the user data using the crowd portrait classification model to obtain a corresponding classification result.
A person of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the patent scope of the present application.
It should be pointed out that, for a person of ordinary skill in the art, without departing from the concept of the present application, a number of modifications and improvements can be made, which all fall within the protection scope of the present application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.
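The alternating semi-supervised procedure described in the embodiments above (predict labels for the unlabeled data, train on labeled data, repeat until convergence) can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: a categorical naive Bayes classifier stands in for the tree-structured Bayesian network model, the pseudo-labeled records are added to the training set in each round, and "convergence" is taken to mean that the predicted labels stop changing. The names `NaiveBayes`, `self_train`, and `max_rounds` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
        self.values = defaultdict(set)            # attribute -> values seen in training
        for row, label in zip(rows, labels):
            for attr, value in row.items():
                self.value_counts[(label, attr)][value] += 1
                self.values[attr].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_label in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each attribute value.
            score = log(n_label / total)
            for attr, value in row.items():
                count = self.value_counts[(label, attr)][value]
                score += log((count + 1) / (n_label + len(self.values[attr])))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def self_train(labeled_rows, labels, unlabeled_rows, max_rounds=20):
    """Alternate label prediction and retraining until the labels stabilize."""
    model = NaiveBayes().fit(labeled_rows, labels)
    pseudo = None
    for _ in range(max_rounds):
        # Step 1: predict labels for the unlabeled data with the current model.
        new_pseudo = [model.predict(row) for row in unlabeled_rows]
        if new_pseudo == pseudo:
            break  # predictions unchanged: treat training as converged
        pseudo = new_pseudo
        # Step 2: retrain on the labeled data plus the pseudo-labeled data.
        model = NaiveBayes().fit(
            list(labeled_rows) + list(unlabeled_rows), list(labels) + pseudo
        )
    return model, pseudo


# Illustrative data (made-up attribute values and portrait labels).
labeled = [
    {"department": "R&D", "education": "master"},
    {"department": "market", "education": "bachelor"},
]
labels = ["technical", "commercial"]
unlabeled = [
    {"department": "R&D", "education": "master"},
    {"department": "market", "education": "bachelor"},
    {"department": "R&D", "education": "doctor"},
]
model, pseudo_labels = self_train(labeled, labels, unlabeled)
new_user_portrait = model.predict({"department": "market", "education": "bachelor"})
```

The final `model.predict` call corresponds to classifying newly input user data with the trained crowd portrait classification model.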

Claims (20)

  1. A method for establishing a crowd portrait classification model, comprising:
    acquiring user data to be subjected to crowd portrait classification, wherein each piece of user data includes a plurality of user attributes corresponding to the user;
    taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all the factors for association until all the factors are associated, to obtain a Bayesian network model; and
    inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  2. The method according to claim 1, wherein, after the step of acquiring the user data to be subjected to crowd portrait classification, the method further comprises: performing data preprocessing on the user data.
  3. The method according to claim 2, wherein the data preprocessing comprises data cleaning and standardization processing;
    the data cleaning comprises: deleting vacant data, noise data, duplicate data, and erroneous data from the user data; and
    the standardization processing comprises: integrating multiple pieces of data corresponding to the same user.
  4. The method according to claim 1, wherein using the Chow-Liu algorithm to select factors among all the factors for association until all the factors are associated comprises:
    for each factor, selecting, according to Formula 1, the factor with the smallest KL distance to it among all unselected factors as the associated factor of that factor, until all factors have been selected;
    Formula 1 being: $\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_i I(X_i, \mathrm{Pa}(X_i)) + \sum_i H(X_i) - H(X_1, X_2, \ldots, X_n)$
    wherein $\mathrm{KL}(P(X)\,\|\,T(X))$ denotes the KL distance between the factor and any one of the unselected factors; $P(X)$ denotes the distribution of all factors before association, and $T(X)$ denotes the distribution of all factors after association; $X_i$ denotes the i-th factor, $H$ denotes entropy, and $\mathrm{Pa}(X_i)$ denotes the parent node of $X_i$; $I$ denotes mutual information, which is calculated by Formula 2, Formula 2 being:
    Figure PCTCN2019097892-appb-100001
    wherein $p(a)$ denotes the probability that value a occurs, $p(b)$ denotes the probability that value b occurs, and $p(a, b)$ denotes the probability that value a and value b occur together; $X_1$ and $X_2$ denote any two of the plurality of user attributes, value a is any value of user attribute $X_1$, and value b is any value of user attribute $X_2$.
  5. The method according to claim 1, wherein the user data includes labeled data and unlabeled data, and the step of inputting the user data into the Bayesian network model for training comprises: training the user data input into the Bayesian network model using a semi-supervised learning method.
  6. The method according to claim 5, wherein training the user data input into the Bayesian network model using the semi-supervised learning method comprises:
    performing label prediction on the unlabeled data using the Bayesian network model;
    training on the labeled data using the Bayesian network model; and
    repeating the above two steps alternately until the training process converges.
  7. The method according to claim 1, wherein, after the crowd portrait classification model is obtained, the method further comprises:
    upon receiving newly input user data, performing crowd portrait classification on the user data using the crowd portrait classification model to obtain a corresponding classification result.
  8. An apparatus for establishing a crowd portrait classification model, comprising:
    a data acquisition unit, configured to acquire user data to be subjected to crowd portrait classification, wherein each piece of user data includes a plurality of user attributes corresponding to the user;
    a factor association unit, configured to take each user attribute as a factor of the Chow-Liu algorithm, and to use the Chow-Liu algorithm to select factors among all the factors for association until all the factors are associated, to obtain a Bayesian network model; and
    a data training unit, configured to input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  9. The apparatus according to claim 8, further comprising: a preprocessing unit, configured to perform data preprocessing on the user data.
  10. The apparatus according to claim 9, wherein, when the data preprocessing comprises data cleaning and standardization processing, the preprocessing unit comprises a data cleaning module and a standardization processing module; the data cleaning module is configured to delete vacant data, noise data, duplicate data, and erroneous data from the user data; and the standardization processing module is configured to integrate multiple pieces of data corresponding to the same user.
  11. The apparatus according to claim 8, wherein the factor association unit is specifically configured to perform the following steps:
    for each factor, selecting, according to Formula 1, the factor with the smallest KL distance to it among all unselected factors as the associated factor of that factor, until all factors have been selected;
    Formula 1 being: $\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_i I(X_i, \mathrm{Pa}(X_i)) + \sum_i H(X_i) - H(X_1, X_2, \ldots, X_n)$
    wherein $\mathrm{KL}(P(X)\,\|\,T(X))$ denotes the KL distance between the factor and any one of the unselected factors; $P(X)$ denotes the distribution of all factors before association, and $T(X)$ denotes the distribution of all factors after association; $X_i$ denotes the i-th factor, $H$ denotes entropy, and $\mathrm{Pa}(X_i)$ denotes the parent node of $X_i$; $I$ denotes mutual information, which is calculated by Formula 2, Formula 2 being:
    Figure PCTCN2019097892-appb-100002
    wherein $p(a)$ denotes the probability that value a occurs, $p(b)$ denotes the probability that value b occurs, and $p(a, b)$ denotes the probability that value a and value b occur together; $X_1$ and $X_2$ denote any two of the plurality of user attributes, value a is any value of user attribute $X_1$, and value b is any value of user attribute $X_2$.
  12. The apparatus according to claim 8, wherein, when the user data includes labeled data and unlabeled data, the data training unit is specifically configured to train the user data input into the Bayesian network model using a semi-supervised learning method.
  13. The apparatus according to claim 12, wherein, when the user data includes labeled data and unlabeled data, the data training unit is specifically configured to perform the following steps:
    performing label prediction on the unlabeled data using the Bayesian network model;
    training on the labeled data using the Bayesian network model; and
    repeating the above two steps alternately until the training process converges.
  14. The apparatus according to claim 8, further comprising: a classification unit, configured to, upon receiving newly input user data, perform crowd portrait classification on the user data using the crowd portrait classification model to obtain a corresponding classification result.
  15. A computer device, comprising a non-volatile readable storage medium and a processor, the non-volatile readable storage medium storing computer-readable instructions which, when executed by the processor, implement a method for establishing a crowd portrait classification model, the method comprising: acquiring user data to be subjected to crowd portrait classification, wherein each piece of user data includes a plurality of user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all the factors for association until all the factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  16. The computer device according to claim 15, wherein, when the processor executes the computer-readable instructions, the method further comprises, after the step of acquiring the user data to be subjected to crowd portrait classification: performing data preprocessing on the user data.
  17. The computer device according to claim 16, wherein, when the processor executes the computer-readable instructions, the data preprocessing comprises data cleaning and standardization processing; the data cleaning comprises: deleting vacant data, noise data, duplicate data, and erroneous data from the user data; and the standardization processing comprises: integrating multiple pieces of data corresponding to the same user.
  18. A non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, implement a method for establishing a crowd portrait classification model, the method comprising: acquiring user data to be subjected to crowd portrait classification, wherein each piece of user data includes a plurality of user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all the factors for association until all the factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
  19. The non-volatile readable storage medium according to claim 18, wherein, when the processor executes the computer-readable instructions, the method further comprises, after the step of acquiring the user data to be subjected to crowd portrait classification: performing data preprocessing on the user data.
  20. The non-volatile readable storage medium according to claim 19, wherein, when the processor executes the computer-readable instructions, the data preprocessing comprises data cleaning and standardization processing; the data cleaning comprises: deleting vacant data, noise data, duplicate data, and erroneous data from the user data; and the standardization processing comprises: integrating multiple pieces of data corresponding to the same user.
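The data preprocessing recited in the claims above (data cleaning followed by standardization processing) can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation: the claims do not define what counts as noise or erroneous data, so the sketch takes a caller-supplied `is_valid` rule; "integrating multiple pieces of data corresponding to the same user" is read as merging records that share a user identifier. The names `clean`, `integrate`, `is_valid`, and `user_id` are hypothetical.

```python
def clean(records, is_valid):
    """Drop records that have vacant values, fail the validity rule, or repeat."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(value is None or value == "" for value in rec.values()):
            continue  # vacant data
        if not is_valid(rec):
            continue  # noise or erroneous data, according to the caller's rule
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # duplicate data
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def integrate(records, user_key="user_id"):
    """Merge all records of the same user into one record (later values win)."""
    merged = {}
    for rec in records:
        merged.setdefault(rec[user_key], {}).update(rec)
    return list(merged.values())


# Illustrative raw records (made-up values).
raw = [
    {"user_id": 1, "position": "engineer"},
    {"user_id": 1, "position": "engineer"},  # duplicate
    {"user_id": 1, "education": "master"},
    {"user_id": 2, "position": None},        # vacant value
    {"user_id": 3, "age": -5},               # fails the validity rule below
]
users = integrate(clean(raw, is_valid=lambda r: r.get("age", 0) >= 0))
```

Each resulting record then carries the multiple user attributes that the Chow-Liu step treats as factors.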
PCT/CN2019/097892 2018-11-12 2019-07-26 Method, device and equipment for establishing crowd portrait classification medel and storage medium WO2020098308A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811340717.5A CN109740620B (en) 2018-11-12 2018-11-12 Method, device, equipment and storage medium for establishing crowd figure classification model
CN201811340717.5 2018-11-12

Publications (1)

Publication Number Publication Date
WO2020098308A1 true WO2020098308A1 (en) 2020-05-22

Family

ID=66355640

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/097892 WO2020098308A1 (en) 2018-11-12 2019-07-26 Method, device and equipment for establishing crowd portrait classification medel and storage medium

Country Status (2)

Country Link
CN (1) CN109740620B (en)
WO (1) WO2020098308A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288444A (en) * 2020-10-23 2021-01-29 翼果(深圳)科技有限公司 Cross-border SAAS client analysis method and system based on big data
CN112416488A (en) * 2020-11-03 2021-02-26 深圳依时货拉拉科技有限公司 User portrait implementation method and device, computer equipment and computer readable storage medium
CN112862582A (en) * 2021-02-18 2021-05-28 深圳无域科技技术有限公司 User portrait generation system and method based on financial wind control
CN113377760A (en) * 2021-07-06 2021-09-10 国网江苏省电力有限公司营销服务中心 Method and system for establishing low-voltage resident feature portrait based on electric power data and multivariate data
CN114119058A (en) * 2021-08-10 2022-03-01 国家电网有限公司 User portrait model construction method and device and storage medium
CN114722252A (en) * 2022-03-18 2022-07-08 深圳市小满科技有限公司 Foreign trade user classification method based on user portrait and related equipment

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740620B (en) * 2018-11-12 2023-09-26 平安科技(深圳)有限公司 Method, device, equipment and storage medium for establishing crowd figure classification model
CN110399404A (en) * 2019-07-25 2019-11-01 北京明略软件系统有限公司 A kind of the user's expression generation method and device of computer
CN110472680B (en) * 2019-08-08 2021-05-25 京东城市(北京)数字科技有限公司 Object classification method, device and computer-readable storage medium
CN111339402A (en) * 2020-02-10 2020-06-26 口碑(上海)信息技术有限公司 Service processing method and device
CN111783873B (en) * 2020-06-30 2023-08-25 中国工商银行股份有限公司 User portrait method and device based on increment naive Bayes model
CN112380104A (en) * 2020-11-19 2021-02-19 北京百度网讯科技有限公司 User attribute identification method and device, electronic equipment and storage medium
CN112434884A (en) * 2020-12-12 2021-03-02 广东电力信息科技有限公司 Method and device for establishing supplier classified portrait

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195431A1 (en) * 2007-02-12 2008-08-14 International Business Machines Corporation System and method for correlating business transformation metrics with sustained business performance
CN105893406A (en) * 2015-11-12 2016-08-24 乐视云计算有限公司 Group user profiling method and system
CN106651424A (en) * 2016-09-28 2017-05-10 国网山东省电力公司电力科学研究院 Electric power user figure establishment and analysis method based on big data technology
CN107391603A (en) * 2017-06-30 2017-11-24 北京奇虎科技有限公司 User's portrait method for building up and device for mobile terminal
CN108021700A (en) * 2017-12-25 2018-05-11 暴风集团股份有限公司 A kind of user tag generation method, device and server
CN109740620A (en) * 2018-11-12 2019-05-10 平安科技(深圳)有限公司 Method for building up, device, equipment and the storage medium of crowd portrayal disaggregated model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120078678A1 (en) * 2010-09-23 2012-03-29 Infosys Technologies Limited Method and system for estimation and analysis of operational parameters in workflow processes
WO2013093173A1 (en) * 2011-12-21 2013-06-27 Nokia Corporation A method, an apparatus and a computer software for context recognition
US20170154314A1 (en) * 2015-11-30 2017-06-01 FAMA Technologies, Inc. System for searching and correlating online activity with individual classification factors
CN107895277A (en) * 2017-09-30 2018-04-10 平安科技(深圳)有限公司 Method, electronic installation and the medium of push loan advertisement in the application
CN107895245A (en) * 2017-12-26 2018-04-10 国网宁夏电力有限公司银川供电公司 A kind of tariff recovery methods of risk assessment based on user's portrait

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN, WENJING: "Bayesian Classifier Based on Dependency Analysis and Hypothesis Testing", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE , CHINA MASTER S THESES FULL-TEXT DATABASE, 15 November 2014 (2014-11-15) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288444A (en) * 2020-10-23 2021-01-29 翼果(深圳)科技有限公司 Cross-border SAAS client analysis method and system based on big data
CN112416488A (en) * 2020-11-03 2021-02-26 深圳依时货拉拉科技有限公司 User portrait implementation method and device, computer equipment and computer readable storage medium
CN112862582A (en) * 2021-02-18 2021-05-28 深圳无域科技技术有限公司 User portrait generation system and method based on financial wind control
CN112862582B (en) * 2021-02-18 2024-03-22 深圳无域科技技术有限公司 User portrait generation system and method based on financial wind control
CN113377760A (en) * 2021-07-06 2021-09-10 国网江苏省电力有限公司营销服务中心 Method and system for establishing low-voltage resident feature portrait based on electric power data and multivariate data
CN114119058A (en) * 2021-08-10 2022-03-01 国家电网有限公司 User portrait model construction method and device and storage medium
CN114119058B (en) * 2021-08-10 2023-09-26 国家电网有限公司 User portrait model construction method, device and storage medium
CN114722252A (en) * 2022-03-18 2022-07-08 深圳市小满科技有限公司 Foreign trade user classification method based on user portrait and related equipment

Also Published As

Publication number Publication date
CN109740620A (en) 2019-05-10
CN109740620B (en) 2023-09-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19883661; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 20.08.2021))
122 Ep: pct application non-entry in european phase (Ref document number: 19883661; Country of ref document: EP; Kind code of ref document: A1)