WO2020098308A1

WO2020098308A1 - Method, device and equipment for establishing crowd portrait classification model and storage medium

Info

Publication number: WO2020098308A1
Application number: PCT/CN2019/097892
Authority: WO
Inventors: 金戈; 徐亮; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-11-12
Filing date: 2019-07-26
Publication date: 2020-05-22
Also published as: CN109740620A; CN109740620B

Abstract

The present invention relates to a method, device, computer equipment and storage medium for establishing a crowd portrait classification model. The method comprises: acquiring user data on which crowd portrait classification is to be performed, each piece of user data comprising a plurality of user attributes corresponding to a user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors and associate them until all factors are associated, to obtain a Bayesian network model; and inputting all user data into the Bayesian network model for training to obtain the crowd portrait classification model. A crowd portrait classification model established by this method is more interpretable and reflects well the correlation between the user attributes of the user data.

Description

Method, device, equipment and storage medium for establishing crowd portrait classification model

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 12, 2018, with application number 201811340717.5 and entitled "Crowd portrait classification model establishment method, device, equipment and storage medium", which is incorporated into this application by reference.

Technical field

The present application relates to the field of data processing technology, and in particular to a method, device, computer device, and storage medium for establishing a crowd portrait classification model.

Background technique

Crowd portrait classification refers to the process of classifying newly input user data into crowd portraits by means of a crowd portrait classification model. The crowd portrait classification model is constructed by training a preset model on massive user data.

Taking employee portrait classification as an example, employee data includes employee attributes such as position, length of service, education, gender and department. A preset model is trained on a large amount of employee data to construct an employee portrait classification model, and a number of employee portraits are then obtained through that model to complete the classification of each employee. In the process of employee portrait classification, employee turnover records can be used to build an employee turnover prediction model, which can then be used to predict the probability that an employee will leave.

At present, the preset models on which crowd portrait classification models are built are mainly classification models and clustering models, such as SVM, neural networks and k-means. However, in researching and practicing the prior art, the inventor of the present application found the following problem: whether the crowd portrait classification model is built on a classification model or on a clustering model, the resulting model can only be used for classification. Its interpretability is poor, and it reflects neither the correlation between the user attributes of the user data nor the association between those user attributes and the category attribution.

Summary of the invention

Based on this, and in view of the problem that currently constructed crowd portrait classification models have poor interpretability and do not reflect well the correlation between the user attributes of user data, it is necessary to provide a method, device, computer equipment and storage medium for establishing a crowd portrait classification model.

A method for establishing a crowd portrait classification model, the method including: acquiring user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for association until all factors are associated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.

In one of the embodiments, the method further includes: performing data preprocessing on the user data.

In one of the embodiments, the data preprocessing includes data cleaning and standardization processing. The data cleaning includes deleting vacant data, noise data, duplicate data, and error data from the user data; the standardization processing includes integrating multiple pieces of data corresponding to the same user.

In one of the embodiments, using the Chow-Liu algorithm to select factors among all factors for association until all factors are associated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from among all unselected factors as the associated factor of that factor, until all factors have been selected.

Formula one is:

$$\mathrm{KL}(P(X) \,\|\, T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

Here, $\mathrm{KL}(P(X) \,\|\, T(X))$ represents the KL distance between this factor and any one of the unselected factors, $P(X)$ represents the distribution of all factors before association, and $T(X)$ represents the distribution of all factors after association; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two. Formula two is:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$

Here, $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the multiple user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.
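The mutual information of formula two can be estimated directly from value counts. The following is a minimal sketch, assuming each user attribute is available as a list of discrete values with one entry per user; the function name `mutual_information` and the use of the natural logarithm are illustrative assumptions, not part of the application.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X1, X2), in nats, of two equally long value lists."""
    n = len(xs)
    count_a = Counter(xs)            # occurrences of each value a of X1
    count_b = Counter(ys)            # occurrences of each value b of X2
    count_ab = Counter(zip(xs, ys))  # joint occurrences of (a, b)
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi
```

Pairs (a, b) that never occur together contribute nothing to the sum, so iterating over the observed pairs only is sufficient.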

In one of the embodiments, the user data includes labeled data and unlabeled data, and the step of inputting the user data into the Bayesian network model for training includes: training the Bayesian network model on the input user data using a semi-supervised learning method.

In one embodiment, training the Bayesian network model on the input user data using a semi-supervised learning method includes: using the Bayesian network model to perform label prediction on the unlabeled data; training the Bayesian network model using the labeled data; and repeating the above two steps alternately until the training process converges.

In one of the embodiments, the method further includes: when receiving newly input user data, using the crowd portrait classification model to classify the user data to obtain a corresponding classification result.

An apparatus for establishing a crowd portrait classification model, the apparatus including: a data acquisition unit for acquiring user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; a factor association unit for taking each user attribute as a factor of the Chow-Liu algorithm and using the Chow-Liu algorithm to select factors among all factors for association until all factors are associated, to obtain a Bayesian network model; and a data training unit for inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.

In one of the embodiments, the device further includes: a pre-processing unit 802, configured to pre-process the user data.

In one of the embodiments, when the data preprocessing includes: data cleaning and standardization processing, the preprocessing unit includes: a data cleaning module and a standardized processing module. The data cleaning module is used to delete vacant data, noise data, duplicate data and error data in user data; the standardized processing module is used to integrate multiple data corresponding to the same user.

In one of the embodiments, the factor association unit 704 is specifically configured to perform the following steps:

For each factor, select the factor with the smallest KL distance from all unselected factors according to formula 1 as the correlation factor of the factor until all factors are selected;

In one of the embodiments, when the user data includes labeled data and unlabeled data, the data training unit is specifically used to train the user data input into the Bayesian network model using a semi-supervised learning method.

In one of the embodiments, when the user data includes labeled data and unlabeled data, the data training unit is specifically configured to perform the following steps: use the Bayesian network model to perform label prediction on the unlabeled data; train the Bayesian network model using the labeled data; and repeat the above two steps alternately until the training process converges.

In one of the embodiments, the apparatus for establishing a crowd portrait classification model may further include a classification unit for classifying newly input user data into crowd portraits using the crowd portrait classification model when such data is received, to obtain the corresponding classification result.

A computer device includes a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above method for establishing a crowd portrait classification model.

A non-volatile readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above method for establishing a crowd portrait classification model.

The above method, device, computer equipment and storage medium for establishing a crowd portrait classification model acquire user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; take each user attribute as a factor of the Chow-Liu algorithm and use the Chow-Liu algorithm to select factors among all factors for association until all factors are associated, to obtain a Bayesian network model; and input the user data into the Bayesian network model for training to obtain the crowd portrait classification model. The method uses each of the multiple user attributes included in the user data as a factor of the Chow-Liu algorithm and uses the Chow-Liu algorithm for factor selection and association. Because the Chow-Liu algorithm reflects well the association between the factors and, at the same time, the association between each factor and the category attribution, the crowd portrait classification model obtained with it reflects well both the correlation between the user attributes of the user data and the association between each user attribute and the category attribution.

Brief description of the drawings

FIG. 1 is an implementation environment diagram of a method for establishing a group portrait classification model provided in an embodiment;

FIG. 2 is a block diagram of the internal structure of a computer device in an embodiment;

FIG. 3 is a flowchart of a method for establishing a group portrait classification model in an embodiment;

FIG. 4 is a flowchart of a method for establishing a crowd portrait classification model in an embodiment;

FIG. 5 is a flowchart of a method for classifying a group portrait in an embodiment;

FIG. 6 is a flowchart of a method for establishing a group portrait classification model in an embodiment;

FIG. 7 is a structural block diagram of an apparatus for establishing a group portrait classification model in an embodiment;

FIG. 8 is a structural block diagram of an apparatus for establishing a group portrait classification model in an embodiment;

FIG. 9 is a structural block diagram of a preprocessing unit in an embodiment;

FIG. 10 is a structural block diagram of an apparatus for establishing a crowd portrait classification model in an embodiment.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application more clear, the present application will be described in further detail in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

FIG. 1 is an implementation environment diagram of a method for establishing a crowd portrait classification model provided in an embodiment. As shown in FIG. 1, the implementation environment includes a computer device 110 and a database 120. The database 120 stores user data to be classified for crowd portraits and newly input user data. Where some of the user data to be classified has been labeled in advance, the user data to be classified stored in the database 120 includes labeled data and unlabeled data. The computer device 110 is a device that processes user data to establish a crowd portrait classification model. The computer device 110 acquires the user data to be classified for crowd portraits and the newly input user data from the database 120. Where the user data to be classified stored in the database 120 includes labeled data and unlabeled data, the computer device 110 acquires both the labeled data and the unlabeled data from the database 120. When a crowd portrait classification model needs to be established, the model builder can use the computer device 110 to obtain the user data to be classified, obtain a Bayesian network model according to the multiple user attributes included in the user data, and then input the user data to be classified into the Bayesian network model for training to obtain the crowd portrait classification model.

It should be noted that both the computer device 110 and the database 120 may be smart phones, tablet computers, notebook computers, desktop computers, servers, etc., but they are not limited thereto. The computer device 110 and the database 120 may be connected via Bluetooth, USB (Universal Serial Bus), or other communication connection methods, and this application is not limited herein. The database 120 may be independent of the computer device 110 (shown in FIG. 1), or the database 120 may be integrated with the inside of the computer device 110 (not shown in FIG. 1).

FIG. 2 is a schematic diagram of the internal structure of a computer device in an embodiment. As shown in FIG. 2, the computer device includes a processor, a non-volatile readable storage medium, a memory, and a network interface connected through a system bus. The non-volatile readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions. The database may store user data to be classified for crowd portraits and newly input user data. Where some of the user data to be classified has been labeled in advance, the user data to be classified stored in the database includes labeled data and unlabeled data. When the computer-readable instructions are executed by the processor, they may cause the processor to implement a method for establishing a crowd portrait classification model. The processor of the computer device provides calculation and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions which, when executed by the processor, may cause the processor to execute a method for establishing a crowd portrait classification model. The network interface of the computer device is used to communicate with the outside. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer equipment to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different component arrangement.

As shown in FIG. 3, in one embodiment, a method for establishing a crowd portrait classification model is proposed. The method for establishing a crowd portrait classification model can be applied to the computer device 110 described above, and includes the following steps:

Step S302: Obtain user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user;

In this embodiment, the user data includes multiple user attributes corresponding to the user. Taking employee data as an example of user data, the employee data includes multiple employee attributes corresponding to the employee: position, length of service (seniority), education, gender, department, etc. Each user attribute has several values, and these values correspond to the pieces of user data, that is, they correspond one-to-one to the users. Exemplarily, suppose there are an employee A and an employee B whose employee data is as shown in Table 1.

Table 1 Schematic diagram of employee data

As can be seen from Table 1, the user attribute "position" has two values, 0011 and 0012, which correspond to the positions of employee A and employee B respectively, that is, they correspond one-to-one to employee A and employee B. Similarly, the user attribute "seniority" has two values, 5 and 2, which correspond to the seniority of employee A and the seniority of employee B respectively. The user data to be classified for crowd portraits is sample data to be input into a preset model, and the preset model is a Bayesian network model obtained according to the multiple user attributes.
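The one-to-one correspondence between attribute values and users can be made concrete by turning per-user records into one value list per attribute. This is a small sketch using only the values quoted above for employee A and employee B; the record layout and the helper name `to_columns` are assumptions made for illustration.

```python
# Employee records with the two attribute values stated in the text.
employees = [
    {"employee": "A", "position": "0011", "seniority": 5},
    {"employee": "B", "position": "0012", "seniority": 2},
]

def to_columns(records, attributes):
    """Return one list of values per attribute, ordered like the records."""
    return {attr: [rec[attr] for rec in records] for attr in attributes}

columns = to_columns(employees, ["position", "seniority"])
```

In this layout, position i of every list belongs to the same user, which is the form an attribute-by-attribute computation such as mutual information needs.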

Step S304, taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated to obtain a Bayesian network model;

In this embodiment, the Chow-Liu algorithm is chosen because it reflects the association between the factors well and, at the same time, reflects the association between each factor and the category attribution. Furthermore, while predictions are made, the influence of each factor on the prediction result can be read from the model. For example, when predicting the probability of employee turnover, the Chow-Liu algorithm makes it possible to predict the probability that a given employee leaves and, at the same time, to discover which employee attributes are the influencing factors behind a high turnover rate, providing a reference for subsequent recruitment.

In the specific implementation process, multiple associations are performed. In the first association, any one of the multiple user attributes is used as the first factor of the Chow-Liu algorithm, each of the remaining user attributes is used as one of the other factors, and one of these other factors is selected and associated with the first factor. In the second association, any one of the user attributes remaining after the previous association is used as the second factor, each user attribute remaining in this association is used as one of the other factors, and one of these other factors is selected and associated with the second factor. The third association is similar to the second. This is repeated until all factors are associated. It should be noted that loops must be avoided in every association.

In one embodiment, step S304 includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from among all unselected factors as the associated factor of that factor, until all factors have been selected. Formula one is:

$$\mathrm{KL}(P(X) \,\|\, T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

Here, $\mathrm{KL}(P(X) \,\|\, T(X))$ represents the KL distance between this factor and any one of the unselected factors, $P(X)$ represents the distribution of all factors before association, and $T(X)$ represents the distribution of all factors after association; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$

Here, $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the multiple user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.

In the specific implementation process, selecting one factor from multiple factors to be associated with another factor includes the following steps:

Step 1): Calculate the KL distance between each of the multiple factors and the other factor according to formula one and formula two above;

Step 2): Determine, from the multiple factors, the factor with the smallest KL distance to the other factor. The criterion for the first determination is the smallest KL distance, the criterion for the second determination is the second smallest KL distance, the criterion for the third determination is the third smallest KL distance, and so on;

Step 3): Determine whether associating the determined factor with the other factor would generate a loop. If not, go to step 4); if so, return to step 2);

Step 4): Associate the determined factor with the other factor.
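In formula one, only the term containing the mutual information depends on which factors are associated, so the smallest KL distance corresponds to the largest mutual information. The sketch below follows the usual textbook formulation of Chow-Liu rather than the exact per-factor order of steps 1) to 4): candidate pairs are visited from largest to smallest mutual information, and a pair is associated only if that does not create a loop. The union-find loop check, the function names, and the sample values are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (formula two) of two value lists."""
    n = len(xs)
    ca, cb, cab = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (ca[a] * cb[b]))
               for (a, b), c in cab.items())

def chow_liu_edges(columns):
    """columns: dict mapping attribute name -> list of values (one per user).
    Returns the associated pairs (tree edges) as a list of name tuples."""
    names = list(columns)
    # Largest mutual information first, i.e. smallest KL distance first.
    pairs = sorted(
        combinations(names, 2),
        key=lambda p: mutual_information(columns[p[0]], columns[p[1]]),
        reverse=True,
    )
    root_of = {name: name for name in names}  # union-find forest

    def find(x):
        while root_of[x] != x:
            root_of[x] = root_of[root_of[x]]  # path halving
            x = root_of[x]
        return x

    edges = []
    for u, v in pairs:
        ru, rv = find(u), find(v)
        if ru != rv:          # associating u and v creates no loop
            root_of[ru] = rv
            edges.append((u, v))
    return edges

# Invented sample values, for illustration only.
sample = {
    "position":  [0, 0, 1, 1, 0, 0, 1, 1],
    "seniority": [0, 0, 1, 1, 0, 0, 1, 1],
    "education": [0, 1, 0, 1, 0, 1, 0, 1],
}
tree = chow_liu_edges(sample)
```

For n attributes the result contains n - 1 associations, which is the point at which all factors are associated.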

Step S306: Input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.

After the Bayesian network model has been obtained according to the multiple user attributes, the user data to be classified for crowd portraits is input into the Bayesian network model for training to obtain the crowd portrait classification model. Exemplarily, taking employee data as the user data and the construction of an employee turnover prediction model as an example: for employee data whose turnover probability is to be predicted, first, the Bayesian network model is obtained according to the multiple employee attributes included in the employee data (position, seniority, education, gender, department, etc.); then, the employee data is input into the obtained Bayesian network model for training to obtain the employee turnover prediction model. The above method for establishing a crowd portrait classification model uses each of the multiple user attributes included in the user data as a factor of the Chow-Liu algorithm and uses the Chow-Liu algorithm for factor selection and association. Because the Chow-Liu algorithm reflects well the association between the factors and, at the same time, the association between each factor and the category attribution, the crowd portrait classification model obtained with it reflects well both the correlation between the user attributes of the user data and the association between each user attribute and the category attribution.
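The application does not spell out what training the Bayesian network model involves. One common reading, shown here purely as an assumption, is to fix the tree obtained in step S304, pick a root, and estimate each conditional probability P(attribute | parent) from value counts. The helper names, the add-alpha smoothing, and the choice of root are all illustrative.

```python
from collections import Counter, defaultdict

def orient_tree(edges, root):
    """Turn undirected tree edges into a child -> parent map, starting at root."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    parent_of = {root: None}
    stack = [root]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node]:
            if nxt not in parent_of:
                parent_of[nxt] = node
                stack.append(nxt)
    return parent_of

def fit_cpts(records, parent_of, alpha=1.0):
    """Estimate P(attr = v | parent = pv) for every attribute in the tree.
    records: list of dicts; alpha: add-alpha smoothing (an assumption)."""
    domains = {a: sorted({r[a] for r in records}) for a in parent_of}
    cpts = {}
    for attr, par in parent_of.items():
        counts = Counter(
            (r[par] if par is not None else None, r[attr]) for r in records
        )
        parent_values = domains[par] if par is not None else [None]
        table = {}
        for pv in parent_values:
            total = sum(counts[(pv, v)] for v in domains[attr])
            total += alpha * len(domains[attr])
            for v in domains[attr]:
                table[(pv, v)] = (counts[(pv, v)] + alpha) / total
        cpts[attr] = table
    return cpts
```

With the category attribute (for example, whether an employee left) included as one of the factors, the resulting tables are what a later classification step would read probabilities from.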

FIG. 4 shows an implementation flowchart of a method for establishing a crowd portrait classification model in one embodiment in which the user data includes labeled data and unlabeled data. The method includes the following steps:

Step S402: Obtain user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user;

Step S404: Take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors among all factors for association until all factors are associated, to obtain a Bayesian network model;

Step S406: Train the Bayesian network model on the input user data using a semi-supervised learning method to obtain the crowd portrait classification model.

The implementation of steps S402 and S404 is similar to that of steps S302 and S304 respectively and is not repeated here. To improve the accuracy of the resulting crowd portrait classification model, some of the user data to be classified can be labeled in advance. The user data to be classified then includes labeled data and unlabeled data, and both are input into the Bayesian network model for training to obtain a more accurate crowd portrait classification model.

Exemplarily, taking employee data as the user data and the construction of an employee turnover prediction model as an example: for employee data whose turnover probability is to be predicted, on the one hand, the Bayesian network model is obtained according to the multiple employee attributes included in the employee data (position, seniority, education, gender, department, etc.); on the other hand, part of the employee data is labeled according to the actual turnover of the employees concerned, as employees who have left or employees who have not left. Then, the labeled employee data (turnover known) and the unlabeled employee data (turnover unknown) are input into the obtained Bayesian network model for training to obtain the employee turnover prediction model.

In one embodiment, step S406 includes the following steps: using the Bayesian network model to perform label prediction on the unlabeled data; training the Bayesian network model using the labeled data; and repeating the above two steps alternately until the training process converges. In the specific implementation, the semi-supervised learning method includes an E-step and an M-step, and the user data includes labeled data and unlabeled data. The process of training the Bayesian network model obtained after step S404 on the input user data is as follows: first, the E-step is performed, that is, the Bayesian network model obtained after step S404 is used to perform label prediction on the unlabeled data. Then, the M-step is performed, that is, the labeled data is used to retrain the Bayesian network model. The E-step and the M-step are repeated alternately until the training process converges, and the crowd portrait classification model is finally obtained.
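The alternation of E-step and M-step can be sketched as a generic loop. Two points below are assumptions, since the text leaves them open: the M-step retrains on the labeled data together with the labels just predicted for the unlabeled data, and convergence means that the predicted labels stop changing. The `fit_counts` and `predict_vote` functions are a deliberately simple stand-in for training and querying the Bayesian network model, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def semi_supervised_train(labeled, unlabeled, fit, predict, max_iter=20):
    """labeled: list of (record, label) pairs; unlabeled: list of records.
    fit(pairs) -> model; predict(model, record) -> label."""
    model = fit(labeled)                  # initial model from labeled data only
    guesses = None
    for _ in range(max_iter):
        # E-step: predict a label for every unlabeled record.
        new_guesses = [predict(model, rec) for rec in unlabeled]
        if new_guesses == guesses:        # predictions stable: converged
            break
        guesses = new_guesses
        # M-step: retrain on labeled data plus currently predicted labels.
        model = fit(labeled + list(zip(unlabeled, guesses)))
    return model, guesses

def fit_counts(pairs):
    """Stand-in model: label counts per (attribute, value)."""
    model = defaultdict(Counter)
    for rec, label in pairs:
        for attr, val in rec.items():
            model[(attr, val)][label] += 1
    return model

def predict_vote(model, rec):
    """Stand-in prediction: the label seen most often with the record's values."""
    votes = Counter()
    for attr, val in rec.items():
        votes.update(model.get((attr, val), {}))
    return votes.most_common(1)[0][0] if votes else None
```

Passing the model's own training and prediction functions as `fit` and `predict` keeps the loop independent of how the Bayesian network is represented.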

In this embodiment, when the acquired user data to be classified for crowd portraits includes unlabeled data, the Bayesian network model obtained from the multiple user attributes can also be trained on this unlabeled data. That is, in addition to the labeled data, unlabeled data can be added as training data, which avoids the problem of having too little training data and thereby improves the accuracy of the final crowd portrait classification model.

FIG. 5 shows an implementation flowchart of a method for classifying crowd portraits in an embodiment, including the following steps:

Step S502: Obtain user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user;

Step S504: Take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors among all factors for association until all factors are associated, to obtain a Bayesian network model;

Step S506: Input the user data into the Bayesian network model for training to obtain the crowd portrait classification model;

Step S508: When newly input user data is received, use the crowd portrait classification model to classify that user data into crowd portraits and obtain the corresponding classification result.

The implementation of steps S502 to S506 is similar to that of steps S302 to S306 and is not repeated here. After the crowd portrait classification model is obtained, it can be used to perform crowd portrait classification. Specifically, the newly input user data is received and input into the crowd portrait classification model obtained after steps S502 to S506, and the output of the crowd portrait classification model is the classification result. Exemplarily, taking employee data as the user data and an employee turnover prediction model as the crowd portrait classification model: the employee data of an employee is input into the employee turnover prediction model, which predicts the probability that the employee will leave. That is, the output of the employee turnover prediction model is the probability of the employee leaving.
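How a tree-structured model could turn a new record into a probability is sketched below. It assumes the category attribute is one node of the tree, that all other attributes of the new record are observed, and that conditional probability tables are stored as `cpts[attr][(parent_value, value)]`; this storage layout and the function name `posterior` are assumptions. Under these conditions only the tables that mention the category attribute matter, because all other factors cancel when normalizing.

```python
def posterior(label, record, parent_of, cpts):
    """Return {label value: P(label value | observed attributes in record)}.
    parent_of: child -> parent map of the tree (root maps to None).
    cpts[attr][(parent_value, value)] = P(attr = value | parent = parent_value)."""
    par = parent_of[label]
    children = [a for a, p in parent_of.items() if p == label]
    label_values = sorted({value for (_, value) in cpts[label]})
    scores = {}
    for lv in label_values:
        parent_value = record[par] if par is not None else None
        score = cpts[label][(parent_value, lv)]        # P(label | its parent)
        for child in children:
            score *= cpts[child][(lv, record[child])]  # P(child | label)
        scores[lv] = score
    total = sum(scores.values())
    return {lv: s / total for lv, s in scores.items()}
```

For an employee turnover model, the entry for the "left" value of the category attribute would be the predicted probability of the employee leaving. The sketch divides by the sum of scores, so it presumes smoothed tables in which that sum is never zero.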

FIG. 6 shows an implementation flowchart of a method for establishing a crowd portrait classification model in one embodiment, including the following steps: step S602, obtain user data to be subjected to crowd portrait classification, where each piece of user data includes multiple user attributes corresponding to the user; step S604, perform data preprocessing on the user data; step S606, take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; step S608, input the user data into the Bayesian network model for training to obtain the crowd portrait classification model. The implementation of step S602 is similar to that of step S302 and is not repeated here.

In one embodiment, the data preprocessing in step S604 includes data cleaning and standardization processing. The data cleaning includes deleting vacant data, noise data, duplicate data, and erroneous data in the user data; the standardization processing includes integrating multiple pieces of data corresponding to the same user. In this embodiment, the user data originally acquired in step S602 may contain "dirty data", including vacant values, noise, inconsistencies, duplicates, and errors. To ensure the accuracy of later data processing, and to reduce the effect of such data on the final decisions made from the classification results of the crowd portrait classification model, the originally acquired user data needs to be preprocessed. That is, the vacant data, noise data, duplicate data, and erroneous data in the originally acquired user data are deleted. In addition, establishing crowd portraits requires the ability to integrate multi-source data. For example, a user may use multiple devices and have multiple accounts on the network. Therefore, it is necessary to combine the multiple accounts of the same user, that is, to integrate the multiple pieces of data corresponding to the same user, and then establish a unified standard so that the user's crowd portrait is identified completely.

After step S604 is executed, in step S606 each user attribute included in the user data preprocessed in step S604 is used as a factor of the Chow-Liu algorithm, and the rest is similar to step S304. Similarly, in step S608, the user data preprocessed in step S604 is input into the Bayesian network model obtained in step S606 for training, and the rest is similar to step S306.
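As an illustration of the structure-learning step S606, the following is a minimal sketch, not the implementation of this application. It assumes discrete user attributes, estimates pairwise mutual information from empirical frequencies, and builds the tree with a Prim-style maximum spanning tree, which is the standard Chow-Liu construction: because the entropy terms of the KL distance do not depend on the tree, minimizing the KL distance is equivalent to maximizing the sum of mutual information over the tree edges. The function names `mutual_information` and `chow_liu_tree`, the choice of attribute 0 as the root, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(rows, i, j):
    """Empirical mutual information between attribute columns i and j."""
    n = len(rows)
    p_i = Counter(r[i] for r in rows)
    p_j = Counter(r[j] for r in rows)
    p_ij = Counter((r[i], r[j]) for r in rows)
    return sum(
        (c / n) * log((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


def chow_liu_tree(rows):
    """Return (parent, child) edges of a maximum mutual-information spanning tree.

    Attribute 0 is used as the root (an arbitrary choice for this sketch).
    """
    k = len(rows[0])
    weight = {
        (i, j): mutual_information(rows, i, j)
        for i, j in combinations(range(k), 2)
    }
    in_tree = {0}
    edges = []
    while len(in_tree) < k:
        # Prim's step: attach the outside attribute with the largest
        # mutual information to some attribute already in the tree.
        candidates = [
            (i, j) for i in sorted(in_tree) for j in range(k) if j not in in_tree
        ]
        parent, child = max(candidates, key=lambda e: weight[(min(e), max(e))])
        edges.append((parent, child))
        in_tree.add(child)
    return edges


# Hypothetical user data: attributes 0 and 1 always move together,
# attribute 2 is independent of both.
rows = [
    ("young", "low", "a"),
    ("young", "low", "b"),
    ("old", "high", "a"),
    ("old", "high", "b"),
]
edges = chow_liu_tree(rows)
```

In this sample, attributes 0 and 1 are perfectly dependent, so the first edge selected connects them; attribute 2 carries no information about the others and is attached last.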

As shown in FIG. 7, in one embodiment, an apparatus for establishing a crowd portrait classification model is provided. The apparatus may be integrated into the computer device 110 described above, and may include a data acquisition unit 702, a factor correlation unit 704, and a data training unit 706. The data acquisition unit 702 is used to obtain user data to be subjected to crowd portrait classification, where each piece of user data includes multiple user attributes corresponding to the user; the factor correlation unit 704 is used to take each user attribute as a factor of the Chow-Liu algorithm, and to use the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; the data training unit 706 is used to input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.

As shown in FIG. 8, the apparatus for establishing a crowd portrait classification model may further include a preprocessing unit 802. The preprocessing unit 802 is used to perform data preprocessing on the user data.

As shown in FIG. 9, in one embodiment, when the data preprocessing includes data cleaning and standardization processing, the preprocessing unit 802 includes a data cleaning module 802A and a standardized processing module 802B. The data cleaning module 802A is used to delete vacant data, noise data, duplicate data, and erroneous data in the user data; the standardized processing module 802B is used to integrate multiple pieces of data corresponding to the same user.
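The two preprocessing modules described above might be sketched as follows. This is only an illustration under stated assumptions: records are dictionaries, erroneous or noisy values are detected by caller-supplied validity checks, and an account-to-user mapping is already available. The names `clean_records`, `merge_accounts`, `validators`, and `account_to_user`, as well as the sample records, are hypothetical and do not come from this application.

```python
def clean_records(records, required, validators):
    """Data cleaning: drop vacant, erroneous, and duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Vacant data: a required field is missing or empty.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        # Erroneous or noisy data: a present field fails its validity check.
        if any(field in rec and not ok(rec[field]) for field, ok in validators.items()):
            continue
        # Duplicate data: an identical record was already kept.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def merge_accounts(records, account_to_user):
    """Standardization: integrate records of the same user into one profile."""
    profiles = {}
    for rec in records:
        # Accounts without a known mapping are treated as their own user.
        user = account_to_user.get(rec["account"], rec["account"])
        profile = profiles.setdefault(user, {})
        for field, value in rec.items():
            if field != "account":
                # Assumption: the first value seen for a field is kept.
                profile.setdefault(field, value)
    return profiles


raw = [
    {"account": "a1", "age": 30, "city": "Shenzhen"},
    {"account": "a1", "age": 30, "city": "Shenzhen"},   # duplicate
    {"account": "a2", "age": None, "city": "Beijing"},  # vacant value
    {"account": "a3", "age": -5, "city": "Shanghai"},   # erroneous value
    {"account": "a4", "age": 30, "device": "phone"},    # second account of the same user
]
cleaned = clean_records(raw, required=["account", "age"],
                        validators={"age": lambda v: 0 <= v <= 120})
profiles = merge_accounts(cleaned, {"a1": "u1", "a4": "u1"})
```

Here the two surviving records belong to the same hypothetical user `u1`, so they are merged into a single profile that carries the age, city, and device attributes.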
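In one embodiment, the factor correlation unit 704 is specifically configured to perform the following steps: for each factor, select, according to formula one, the factor with the smallest KL distance from it among all unselected factors as the correlation factor of the factor, until all factors have been selected. Formula one is:

$$\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

where $\mathrm{KL}(P(X)\,\|\,T(X))$ represents the KL distance between this factor and any one of all unselected factors, $P(X)$ represents the distribution of all factors before correlation, and $T(X)$ represents the distribution of all factors after correlation; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$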

Here, $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the plurality of user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.

In one embodiment, when the user data includes labeled data and unlabeled data, the data training unit 706 is specifically configured to use a semi-supervised learning method to train on the user data input into the Bayesian network model. In one embodiment, when the user data includes labeled data and unlabeled data, the data training unit 706 is specifically configured to perform the following steps: use the Bayesian network model to perform label prediction on the unlabeled data; use the Bayesian network model to train on the labeled data; repeat the above two steps alternately until the training process converges.

As shown in FIG. 10, the apparatus for establishing a crowd portrait classification model may further include a classification unit 1002. The classification unit 1002 is configured to, when newly input user data is received, use the crowd portrait classification model to perform crowd portrait classification on the user data and obtain a corresponding classification result.

In one embodiment, a computer device is proposed. The computer device includes a non-volatile readable storage medium, a processor, and computer-readable instructions that are stored on the non-volatile readable storage medium and can run on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring user data to be subjected to crowd portrait classification, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model. In one embodiment, when the processor executes the computer-readable instructions, it also performs the following step: performing data preprocessing on the user data. In one embodiment, the data preprocessing includes data cleaning and standardization processing; the step of performing data preprocessing on the user data, as performed by the processor, includes: deleting vacant data, noise data, duplicate data, and erroneous data in the user data; and integrating multiple pieces of data corresponding to the same user. In one embodiment, the step, performed by the processor, of using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as the correlation factor of the factor, until all factors have been selected. Formula one is:

$$\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

where $\mathrm{KL}(P(X)\,\|\,T(X))$ represents the KL distance between this factor and any one of all unselected factors, $P(X)$ represents the distribution of all factors before correlation, and $T(X)$ represents the distribution of all factors after correlation; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$

Here, $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the plurality of user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$. In one embodiment, the user data includes labeled data and unlabeled data, and the step, performed by the processor, of inputting the user data into the Bayesian network model for training includes: using a semi-supervised learning method to train on the user data input into the Bayesian network model. In one embodiment, the step, performed by the processor, of using a semi-supervised learning method to train on the user data input into the Bayesian network model includes: using the Bayesian network model to perform label prediction on the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the above two steps alternately until the training process converges. In one embodiment, when the processor executes the computer-readable instructions, it also performs the following step: when newly input user data is received, using the crowd portrait classification model to perform crowd portrait classification on the user data to obtain a corresponding classification result.
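The alternating semi-supervised procedure described above can be illustrated with a small self-training loop. This is a sketch under explicit assumptions, not the training procedure of this application: a categorical naive Bayes classifier with add-one smoothing stands in for the tree-structured Bayesian network, the model is refit on the labeled data together with the currently predicted labels, and convergence is taken to mean that the predicted labels stop changing. The class `NaiveBayes`, the function `self_train`, the `max_rounds` parameter, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Stand-in model (assumption): categorical naive Bayes, add-one smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)  # (attribute index, class) -> value counts
        self.values = defaultdict(set)      # attribute index -> values seen
        for row, y in zip(rows, labels):
            for i, v in enumerate(row):
                self.counts[(i, y)][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        def score(y):
            s = log(self.prior[y])
            for i, v in enumerate(row):
                s += log((self.counts[(i, y)][v] + 1)
                         / (self.prior[y] + len(self.values[i])))
            return s
        return max(self.classes, key=score)


def self_train(labeled_rows, labels, unlabeled_rows, max_rounds=20):
    """Alternate label prediction and retraining until predictions stop changing."""
    model = NaiveBayes().fit(labeled_rows, labels)
    guesses = None
    for _ in range(max_rounds):
        # Step 1: predict labels for the unlabeled data with the current model.
        new_guesses = [model.predict(r) for r in unlabeled_rows]
        if new_guesses == guesses:
            break  # converged: the predicted labels are stable
        guesses = new_guesses
        # Step 2: retrain on labeled data plus the predicted labels.
        model = NaiveBayes().fit(labeled_rows + unlabeled_rows, labels + guesses)
    return model, guesses


# Hypothetical employee data: (performance, tenure) -> stay / leave.
labeled_rows = [("high", "long"), ("high", "long"), ("low", "short"), ("low", "short")]
labels = ["stay", "stay", "leave", "leave"]
unlabeled_rows = [("high", "long"), ("low", "short")]
model, guesses = self_train(labeled_rows, labels, unlabeled_rows)
```

After convergence, `model.predict` can be applied to newly input user data in the same way as in the classification step described above.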

In one embodiment, a non-volatile readable storage medium storing computer-readable instructions is provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps: obtaining user data to be subjected to crowd portrait classification, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model. In one embodiment, when the processor executes the computer-readable instructions, it also performs the following step: performing data preprocessing on the user data. In one embodiment, the data preprocessing includes data cleaning and standardization processing; the step of performing data preprocessing on the user data, as performed by the processor, includes: deleting vacant data, noise data, duplicate data, and erroneous data in the user data; and integrating multiple pieces of data corresponding to the same user. In one embodiment, the step, performed by the processor, of using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated includes: for each factor, selecting, according to formula one, the factor with the smallest KL distance from it among all unselected factors as the correlation factor of the factor, until all factors have been selected. Formula one is:

$$\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

where $\mathrm{KL}(P(X)\,\|\,T(X))$ represents the KL distance between this factor and any one of all unselected factors, $P(X)$ represents the distribution of all factors before correlation, and $T(X)$ represents the distribution of all factors after correlation; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$

Here, $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the plurality of user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$. In one embodiment, the user data includes labeled data and unlabeled data, and the step, performed by the processor, of inputting the user data into the Bayesian network model for training includes: using a semi-supervised learning method to train on the user data input into the Bayesian network model. In one embodiment, the step, performed by the processor, of using a semi-supervised learning method to train on the user data input into the Bayesian network model includes: using the Bayesian network model to perform label prediction on the unlabeled data; using the Bayesian network model to train on the labeled data; and repeating the above two steps alternately until the training process converges. In one embodiment, when the processor executes the computer-readable instructions, it also performs the following step: when newly input user data is received, using the crowd portrait classification model to perform crowd portrait classification on the user data to obtain a corresponding classification result.

A person of ordinary skill in the art may understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or may be a random access memory (RAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this description.

The above embodiments express only several implementations of the present application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the patent scope of the present application. It should be pointed out that a person of ordinary skill in the art can make a number of modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

Claims

A method for establishing a crowd portrait classification model, which is characterized by including:

Obtain user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user;

Take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated to obtain a Bayesian network model;

Input the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
The method according to claim 1, characterized in that after the step of acquiring user data to be classified for crowd portraits, the method further comprises: performing data preprocessing on the user data.
The method according to claim 2, wherein the data preprocessing includes: data cleaning and standardization processing;

The data cleaning includes: deleting vacant data, noise data, duplicate data, and erroneous data in user data;

The standardization process includes: integrating multiple data corresponding to the same user.
The method according to claim 1, wherein using the Chow-Liu algorithm to select a factor among all factors for correlation until all factors are correlated includes:

For each factor, select, according to formula one, the factor with the smallest KL distance from it among all unselected factors as the correlation factor of the factor, until all factors have been selected;

Formula one is:

$$\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

where $\mathrm{KL}(P(X)\,\|\,T(X))$ represents the KL distance between this factor and any one of all unselected factors, $P(X)$ represents the distribution of all factors before correlation, and $T(X)$ represents the distribution of all factors after correlation; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two. Formula two is:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$

where $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the plurality of user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.
The method according to claim 1, wherein the user data includes labeled data and unlabeled data, and the step of inputting the user data into the Bayesian network model for training includes: using a semi-supervised learning method to train on the user data input into the Bayesian network model.
The method according to claim 5, wherein the training of user data input to the Bayesian network model using a semi-supervised learning method includes:

Using the Bayesian network model to perform label prediction on unlabeled data;

Use the Bayesian network model to train the label data;

Repeat the above two steps alternately until the training process converges.
The method according to claim 1, wherein after the crowd portrait classification model is obtained, the method further comprises:

When newly input user data is received, the crowd portrait classification model is used to perform crowd portrait classification on the user data to obtain the corresponding classification result.
An apparatus for establishing a crowd portrait classification model, the apparatus for establishing a crowd portrait classification model includes:

A data obtaining unit, configured to obtain user data to be classified for crowd portraits, wherein each piece of user data includes multiple user attributes corresponding to the user;

Factor correlation unit, used to take each user attribute as a factor of the Chow-Liu algorithm, and use the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated to obtain a Bayesian network model;

The data training unit is used for inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
The apparatus according to claim 8, characterized in that the apparatus further comprises: a preprocessing unit, configured to perform data preprocessing on the user data.
The apparatus according to claim 9, wherein when the data preprocessing includes data cleaning and standardization processing, the preprocessing unit includes a data cleaning module and a standardized processing module; the data cleaning module is used to delete vacant data, noise data, duplicate data, and erroneous data in the user data; and the standardized processing module is used to integrate multiple pieces of data corresponding to the same user.
The apparatus according to claim 8, wherein the factor correlation unit is specifically configured to perform the following steps:

For each factor, select, according to formula one, the factor with the smallest KL distance from it among all unselected factors as the correlation factor of the factor, until all factors have been selected;

Formula one is:

$$\mathrm{KL}(P(X)\,\|\,T(X)) = -\sum_{i} I(X_i, \mathrm{Pa}(X_i)) + \sum_{i} H(X_i) - H(X_1, X_2, \ldots, X_n)$$

where $\mathrm{KL}(P(X)\,\|\,T(X))$ represents the KL distance between this factor and any one of all unselected factors, $P(X)$ represents the distribution of all factors before correlation, and $T(X)$ represents the distribution of all factors after correlation; $X_i$ represents the i-th factor, $H$ represents entropy, and $\mathrm{Pa}(X_i)$ represents the parent node of $X_i$; $I$ represents mutual information, which is calculated by formula two. Formula two is:

$$I(X_1, X_2) = \sum_{a, b} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}$$

where $p(a)$ represents the probability that the value a occurs, $p(b)$ represents the probability that the value b occurs, and $p(a, b)$ represents the joint probability that the value a and the value b occur together; $X_1$ and $X_2$ represent any two user attributes among the plurality of user attributes, the value a is any value belonging to the user attribute $X_1$, and the value b is any value belonging to the user attribute $X_2$.
The apparatus according to claim 8, characterized in that, when the user data includes labeled data and unlabeled data, the data training unit is specifically configured to use a semi-supervised learning method to train on the user data input into the Bayesian network model.
The apparatus according to claim 12, wherein when the user data includes label data and unlabeled data, the data training unit is specifically configured to perform the following steps:

Using the Bayesian network model to perform label prediction on unlabeled data;

Use the Bayesian network model to train the label data;

Repeat the above two steps alternately until the training process converges.
The apparatus according to claim 8, wherein the apparatus further comprises: a classification unit, configured to, when newly input user data is received, use the crowd portrait classification model to perform crowd portrait classification on the user data to obtain the corresponding classification result.
A computer device, comprising a non-volatile readable storage medium and a processor, the non-volatile readable storage medium storing computer-readable instructions, wherein when the processor executes the computer-readable instructions, a method for establishing a crowd portrait classification model is implemented, the method including: obtaining user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
The computer device according to claim 15, characterized in that, when the processor executes the computer-readable instructions, after the step of acquiring user data to be classified for crowd portraits, the method further includes: performing data preprocessing on the user data.
The computer device according to claim 16, characterized in that the data preprocessing implemented when the processor executes the computer-readable instructions includes data cleaning and standardization processing; the data cleaning includes: deleting vacant data, noise data, duplicate data, and erroneous data in the user data; and the standardization processing includes: integrating multiple pieces of data corresponding to the same user.
A non-volatile readable storage medium storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, a method for establishing a crowd portrait classification model is implemented, the method including: acquiring user data to be classified for crowd portraits, where each piece of user data includes multiple user attributes corresponding to the user; taking each user attribute as a factor of the Chow-Liu algorithm, and using the Chow-Liu algorithm to select factors among all factors for correlation until all factors are correlated, to obtain a Bayesian network model; and inputting the user data into the Bayesian network model for training to obtain the crowd portrait classification model.
The non-volatile readable storage medium according to claim 18, characterized in that, when the processor executes the computer-readable instructions, after the step of acquiring user data to be classified for crowd portraits, the method further includes: performing data preprocessing on the user data.
The non-volatile readable storage medium according to claim 19, wherein the data preprocessing implemented when the processor executes the computer-readable instructions includes data cleaning and standardization processing; the data cleaning includes: deleting vacant data, noise data, duplicate data, and erroneous data in the user data; and the standardization processing includes: integrating multiple pieces of data corresponding to the same user.