CN111325280A

CN111325280A - Label generation method and system

Info

Publication number: CN111325280A
Application number: CN202010125081.3A
Authority: CN
Inventors: 吴雨
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Suning Cloud Computing Co Ltd
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-06-23

Abstract

The embodiment of the invention discloses a label generation method and system. The method comprises: acquiring basic information and behavior characteristic data of a user, and cleaning the basic information and behavior characteristic data; selecting the features of a logistic regression classification model according to the cleaned basic information and behavior characteristic data; training the logistic regression classification model according to the selected features; and predicting with the trained logistic regression classification model to generate a label. This makes it convenient to predict and optimize the basic information labels of e-commerce users, and greatly improves the accuracy and completeness of the labels.

Description

Label generation method and system

Technical Field

The invention relates to the field of computers, in particular to a label generation method and a label generation system.

Background

In an e-commerce membership system, members' basic information is used frequently, typically through a tag system in which each user's information is decomposed into various tags and stored in a database. Taking user gender as an example, a traditional e-commerce platform records the gender manually or identifies it from registration information; the gender tag is then kept as the initial result, written to an offline table, and hardly ever changed.

This brings various inconveniences. Because the data are entered manually, the risk of errors is greatly increased; if the member's registration information is used, privacy concerns lead to many missing values; and the gender reflected by shopping behavior on the e-commerce platform is not necessarily consistent with the actual gender, which can also cause errors. Therefore, a solution is urgently needed to predict and optimize the basic information labels of e-commerce users and to improve the completeness and accuracy of the labels.

Disclosure of Invention

The embodiment of the invention provides a label generation method and a label generation system, which make it convenient to predict and optimize the basic information labels of e-commerce users.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

In a first aspect, an embodiment of the present invention provides a method for generating a tag, where the method includes:

acquiring basic information and behavior characteristic data of a user, and cleaning the basic information and behavior characteristic data of the user;

selecting the characteristics of a logistic regression classification model according to the cleaned basic information and behavior characteristic data of the user;

training a logistic regression classification model according to the characteristics of the logistic regression classification model;

and predicting by using the trained logistic regression classification model to generate a label.

With reference to the first aspect, as a first implementation scheme of the embodiment of the present invention, the acquiring basic information and behavior feature data of a user, and cleaning the basic information and behavior feature data of the user specifically includes:

acquiring user behavior and log information, and filtering and processing the user behavior and log information into basic information and behavior characteristic data of the user;

cleaning the basic information and the behavior characteristic data of the user, and removing null values, repeated values and abnormal values;

and preprocessing the basic information and the behavior characteristic data of the user.
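
The cleaning step above (removing null values, repeated values and abnormal values) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the Spark-based cleaning used in the embodiment; the field names `user_id` and `orders`, the function name `clean_records`, and the median-based outlier rule with its `max_dev` cutoff are all assumptions made for this example.

```python
from statistics import median


def clean_records(records, numeric_fields, key="user_id", max_dev=5.0):
    """Drop rows with nulls, repeated user IDs, and outlying numeric values."""
    # 1. Remove rows that contain any null value.
    complete = [r for r in records if all(v is not None for v in r.values())]

    # 2. Remove repeated values, keeping the first row seen for each user ID.
    seen, unique = set(), []
    for r in complete:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)

    # 3. Remove abnormal values: rows whose numeric field lies more than
    #    max_dev median absolute deviations from the median (assumed rule).
    kept = unique
    for field in numeric_fields:
        if not kept:
            break
        values = [r[field] for r in kept]
        med = median(values)
        mad = median(abs(v - med) for v in values) or 1.0
        kept = [r for r in kept if abs(r[field] - med) / mad <= max_dev]
    return kept


rows = [
    {"user_id": 1, "orders": 3},
    {"user_id": 1, "orders": 3},     # repeated user ID
    {"user_id": 2, "orders": None},  # null value
    {"user_id": 3, "orders": 4},
    {"user_id": 4, "orders": 5},
    {"user_id": 5, "orders": 4},
    {"user_id": 6, "orders": 900},   # abnormal value
]
cleaned = clean_records(rows, numeric_fields=["orders"])
```

In this sketch the user ID acts as the deduplication key, mirroring the use of the user ID as the primary key described later in the detailed description.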

With reference to the first implementation scheme of the first aspect, as a second implementation scheme of the embodiment of the present invention, the selecting of features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically includes:

and selecting the characteristics used by the logistic regression classification model in a mode of presetting the characteristics.

With reference to the first implementation scheme of the first aspect, as a third implementation scheme of the embodiment of the present invention, the selecting of features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically includes:

calculating the feature importance of each feature by using a GBDT algorithm according to the basic information and the behavior feature data of the user;

and selecting the features with high feature importance as the features used by the logistic regression classification model according to the feature importance.
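
To make the idea of GBDT-based feature importance concrete, the sketch below uses a deliberately small stand-in: gradient boosting with depth-1 regression trees (stumps) and squared-error loss, where a feature's importance is the share of total error reduction contributed by splits on it. This is a simplification chosen for illustration, not the GBDT implementation used in the embodiment; the function names, the number of rounds and the learning rate are assumptions.

```python
def best_stump(X, residuals):
    """Return (gain, feature, threshold, left_value, right_value) of the best split."""
    n = len(X)
    base = sum(r * r for r in residuals) - sum(residuals) ** 2 / n  # SSE around the mean
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so that both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            sse = (sum(r * r for r in left) - sum(left) ** 2 / len(left)
                   + sum(r * r for r in right) - sum(right) ** 2 / len(right))
            gain = base - sse
            if best is None or gain > best[0]:
                best = (gain, j, t, sum(left) / len(left), sum(right) / len(right))
    return best


def gbdt_feature_importance(X, y, n_rounds=20, learning_rate=0.3):
    """Boost stumps on the residuals and accumulate split gain per feature."""
    pred = [sum(y) / len(y)] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = best_stump(X, residuals)
        if stump is None or stump[0] <= 1e-12:
            break  # nothing left to explain
        gain, j, t, left_value, right_value = stump
        importance[j] += gain
        pred = [p + learning_rate * (left_value if row[j] <= t else right_value)
                for p, row in zip(pred, X)]
    total = sum(importance) or 1.0
    return [g / total for g in importance]


def select_top_features(importance, k):
    """Indices of the k features with the highest importance."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 3], [3, 5], [4, 3], [5, 5], [6, 3], [7, 4], [8, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
importance = gbdt_feature_importance(X, y)
selected = select_top_features(importance, k=1)
```

A production system would more likely rely on an existing GBDT library; the point here is only that importance scores rank the features, and the highest-ranked ones are passed on to the logistic regression model.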

With reference to the first implementation scheme of the first aspect, as a fourth implementation scheme of the embodiment of the present invention, the training of the logistic regression classification model according to the features of the logistic regression classification model specifically includes:

presetting at least one hyper-parameter of the logistic regression classification model;

according to the selected features and the logistic regression classification model, taking the features as parameters of the logistic regression classification model, combining the preprocessed user basic information, and substituting the preset hyper-parameters into training one by one to obtain the trained logistic regression classification model;

and comparing the logistic regression classification models after different hyper-parameter training, and selecting the optimal model and the hyper-parameter to obtain the optimal logistic regression classification model.

In a second aspect, an embodiment of the present invention further provides a system for generating a tag, where the system includes:

the cleaning module is used for acquiring the basic information and the behavior characteristic data of the user and cleaning the basic information and the behavior characteristic data of the user;

the selection module is used for selecting the characteristics of the logistic regression classification model according to the cleaned basic information and behavior characteristic data of the user;

the training module is used for training the logistic regression classification model according to the characteristics of the logistic regression classification model;

and the generating module is used for predicting by using the trained logistic regression classification model to generate a label.

With reference to the second aspect, as a first implementation scheme of the embodiment of the present invention, the cleaning module specifically includes:

the filtering unit is used for acquiring user behavior and log information and filtering the user behavior and log information into basic information and behavior characteristic data of the user;

the cleaning unit is used for cleaning the basic information and the behavior characteristic data of the user and removing null values, repeated values and abnormal values;

and the preprocessing unit is used for preprocessing the basic information and the behavior characteristic data of the user.

With reference to the first implementation scheme of the second aspect, as a second implementation scheme of the embodiment of the present invention, the selecting module further includes:

and the presetting unit is used for selecting the characteristics used by the logistic regression classification model in a characteristic presetting mode.

With reference to the first implementation scheme of the second aspect, as a third implementation scheme of the embodiment of the present invention, the selecting module further includes:

the computing unit is used for computing the feature importance of each feature by using a GBDT algorithm according to the basic information and the behavior feature data of the user;

and the selecting unit is used for selecting the features with high feature importance as the features used by the logistic regression classification model according to the feature importance.

With reference to the first implementation scheme of the second aspect, as a fourth implementation scheme of the embodiment of the present invention, the training module specifically includes:

the setting unit is used for presetting at least one hyper-parameter of the logistic regression classification model;

the training unit is used for taking the selected features as parameters of the logistic regression classification model according to the selected features and the logistic regression classification model, combining the preprocessed user basic information, and substituting the preset hyper-parameters into training one by one to obtain the trained logistic regression classification model;

and the tuning unit is used for comparing the logistic regression classification models after different hyper-parameter training, and selecting the optimal model and the hyper-parameter to obtain the optimal logistic regression classification model.

The label generation method and system provided by the embodiment of the invention make it convenient to predict and optimize the basic information labels of e-commerce users. Compared with the prior art, the data are cleaned and preprocessed, then fed into the logistic regression classification model for training, and finally the trained model is used for prediction. Newly registered users can thus be calculated and predicted every day, and erroneous information of existing users can be corrected and updated. In addition, when a user's shopping gender changes while the device and account remain the same, the user's gender label can be updated in time, so that the accuracy and completeness of the labels are greatly improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a tag generation method according to an embodiment of the present invention;

FIG. 2 is a block diagram of the flow of step S130 in FIG. 1;

FIG. 3 is a block diagram of a tag generation system according to an embodiment of the present invention;

FIG. 4 is a block diagram of a tag generation system according to another embodiment of the present invention.

Detailed Description

In order that the technical solutions of the present invention may be better understood, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

In existing e-commerce membership systems, members' basic information is managed through a tag system: each user's information is decomposed into various tags and stored in a database. Taking user gender as an example, a traditional e-commerce platform records the gender manually or identifies it from registration information; the gender tag is then kept as the initial result, written to an offline table, and hardly ever changed.

Such prior art has significant disadvantages. Because the data are often entered manually, manual entry errors are possible; if the member's registration information is used, privacy concerns mean that many values are missing; and the gender indicated by a user's shopping behavior does not necessarily match the registered gender, which may cause errors in marketing activities such as recommendation. Therefore, a solution is urgently needed to predict and optimize the basic information labels of e-commerce users and to improve the completeness and accuracy of the labels.

The embodiment of the invention provides a label generation method and a label generation system, which make it convenient to predict and optimize the basic information labels of e-commerce users. They enable daily calculation and prediction for newly registered users, correction and updating of erroneous information of existing users, and timely updating of a user's gender label when the shopping gender changes because a different person is using the same device and account, so that the accuracy and completeness of the labels are greatly improved.

Fig. 1 shows a flow chart of a method of generating a tag according to an embodiment of the invention. Referring to fig. 1, the method for generating a tag of the present embodiment includes steps S110 to S140.

Step S110, acquiring basic information and behavior characteristic data of a user, and cleaning the basic information and behavior characteristic data of the user;

Step S120, selecting the characteristics of a logistic regression classification model according to the cleaned basic information and behavior characteristic data of the user;

Step S130, training a logistic regression classification model according to the characteristics of the logistic regression classification model;

Step S140, predicting by using the trained logistic regression classification model to generate a label.
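
As a rough illustration of steps S130 and S140, the following pure-Python sketch trains a logistic regression classifier by batch gradient descent and maps the predicted probability to a label. The two-feature layout, the label names, the learning rate, the epoch count and the 0.5 threshold are assumptions made for this example; the embodiment does not specify them.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - target
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - learning_rate * g / n for wj, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b


def generate_labels(model, X, labels=("female", "male"), threshold=0.5):
    """Turn predicted probabilities into labels (label names are illustrative)."""
    w, b = model
    result = []
    for row in X:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)
        result.append(labels[1] if p >= threshold else labels[0])
    return result


# Hypothetical normalized features, e.g. shares of browsing in two categories.
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_train = [0, 0, 1, 1]
model = train_logistic_regression(X_train, y_train)
labels = generate_labels(model, [[0.15, 0.85], [0.85, 0.15]])
```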

It should be noted that the logistic regression classification model obtained by the embodiment of the present invention may be retrained according to different user information and requirements provided daily in an actual working environment, so as to achieve an optimal prediction efficiency and obtain a prediction result.

In this embodiment, the label generation method makes it convenient to predict and optimize the basic information labels of e-commerce users. It enables daily calculation and prediction for newly registered users, correction and updating of erroneous information of existing users, and timely updating of a user's gender label when the shopping gender changes because a different person is using the same device and account, thereby greatly improving the accuracy and completeness of the labels.

Wherein, step S110 specifically includes:

S1101, acquiring user behavior and log information, and filtering to obtain basic information and behavior characteristic data of the user;

S1102, cleaning the basic information and the behavior characteristic data of the user, and removing null values, repeated values and abnormal values;

S1103, preprocessing the basic information and behavior characteristic data of the user.

In actual operation, data acquisition mainly uses Hive, a Hadoop component, to process offline data, which include the user's basic information and shopping behavior data; for example, the user's shopping interest preference data are needed when predicting the user's gender. Hive extracts behavior or log data on a daily schedule from the upstream traffic table, search table, and add-to-cart and order tables, and the common feature data are stored in Hive tables. Part of the real-time feature data is processed with Spark and Kafka, and the resulting real-time feature data are merged with the offline data for downstream data cleaning.

Data cleaning is performed on the acquired offline and real-time data. For example, cleaning methods written in Spark, including missing value filling and abnormal value deletion, convert the data into a DataFrame format that the model can recognize, and duplicate entries are removed using the user ID as the primary key.

Preprocessing, i.e. data conversion, is then carried out, including feature normalization, feature selection, feature combination, one-hot encoding and the like. In addition to these common feature conversion methods, a GBDT algorithm is used to convert continuous features into discrete features, and the weights of all converted features can be calculated and provided as a reference for feature selection. The data are thereby organized and fully prepared for the subsequent feature selection and model building.
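
Two of the conversions named above, feature normalization and one-hot encoding, can be illustrated with a short sketch. It uses plain Python lists rather than Spark DataFrames, min-max scaling is assumed as the normalization method, and the example values and category names are hypothetical.

```python
def min_max_normalize(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categorical values as 0/1 indicator vectors.

    Returns the sorted category list and one indicator row per input value.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


order_counts = min_max_normalize([10, 20, 30])
channels, encoded = one_hot(["app", "web", "app"])
```

Normalization keeps features on comparable scales for the logistic regression model, while one-hot encoding lets categorical attributes enter a linear model without implying an order among categories.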

Wherein, the step S120 specifically includes:

In this embodiment, the features required for building the model may be selected manually, based on features chosen in the past or features of particular interest, without referring to the feature weights provided at the end of preprocessing in step S110.

Preferably, the step S120 further includes:

S1201, according to the basic information and the behavior characteristic data of the user, calculating the feature importance of each feature by using a GBDT algorithm;

S1202, according to the feature importance, selecting the features with high feature importance as the features used by the logistic regression classification model.

In this embodiment, the feature weights provided at the end of preprocessing in step S110 are used as a reference: the features are screened according to the feature weights calculated by the GBDT model, features whose correlation coefficients are too small are deleted, and the remaining features are used to train the logistic regression classification model in the subsequent step. Alternatively, the correlation between features may be calculated with the Pearson correlation coefficient and similar features removed, or PCA may be used to reduce the feature dimension. The purpose of feature selection is to make model training faster and more accurate.
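
The Pearson-based alternative mentioned above, removing features that are nearly duplicates of one another, might look like the following sketch. The 0.95 threshold, the greedy keep-first policy and the feature names are assumptions for illustration only.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_a * var_b)


def drop_similar_features(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept.

    columns maps a feature name to its list of values.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


features = {
    "orders": [1, 2, 3, 4, 5],
    "order_amount": [10, 20, 30, 40, 50],  # perfectly correlated with orders
    "night_visits": [5, 1, 4, 2, 3],
}
kept = drop_similar_features(features)
```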

Wherein, the step S130 specifically includes:

S1301, presetting at least one hyper-parameter of the logistic regression classification model;

S1302, according to the selected features and the logistic regression classification model, taking the features as parameters of the logistic regression classification model, combining the preprocessed user basic information, and substituting the preset hyper-parameters into training one by one to obtain the trained logistic regression classification model;

S1303, comparing the logistic regression classification models after different hyper-parameter training, and selecting the optimal model and the hyper-parameter to obtain the optimal logistic regression classification model.

In actual operation, training the logistic regression classification model uses the processed data and features as training data and parameters. The logistic regression (LR) classification model is trained locally multiple times, and the resulting models and parameters are compared to obtain the model with the highest efficiency and accuracy. During these training runs, at least one candidate value is preset for each hyper-parameter to be tuned; the model is wrapped and invoked through a pipeline; cross validation (CV) is used to compare the hyper-parameter settings and their training results; and finally the best-performing model and the values of all its parameters are output for predicting each label in the actual production environment.
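
The hyper-parameter comparison described above can be sketched as a small grid search with k-fold cross validation. This pure-Python version is a stand-in for the pipeline and CV tooling the embodiment refers to, not a reproduction of it; the choice of the L2 regularization strength as the tuned hyper-parameter, the candidate grid, the strided fold assignment and all function names are assumptions.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def fit(X, y, l2, learning_rate=0.5, epochs=300):
    """Logistic regression by gradient descent with L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - t
                for row, t in zip(X, y)]
        w = [wj - learning_rate * (sum(e * row[j] for e, row in zip(errs, X)) / n + l2 * wj)
             for j, wj in enumerate(w)]
        b -= learning_rate * sum(errs) / n
    return w, b


def accuracy(model, X, y):
    w, b = model
    hits = sum((sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) >= 0.5) == bool(t)
               for row, t in zip(X, y))
    return hits / len(y)


def cross_validate(X, y, l2, k=4):
    """Mean held-out accuracy over k folds (fold i holds rows i, i+k, i+2k, ...)."""
    scores = []
    for fold in range(k):
        held_out = sorted(range(fold, len(X), k))
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], l2)
        scores.append(accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return sum(scores) / k


def grid_search(X, y, l2_grid, k=4):
    """Try each preset value, keep the best, and refit on all data."""
    results = {l2: cross_validate(X, y, l2, k) for l2 in l2_grid}
    best_l2 = max(results, key=results.get)
    return best_l2, fit(X, y, best_l2), results


# Hypothetical one-feature data set.
X = [[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
best_l2, best_model, results = grid_search(X, y, l2_grid=[0.0, 0.01, 0.1])
```

Scoring each candidate on held-out folds rather than on the training data is what allows the comparison to favor settings that generalize to unseen users.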

Taking prediction of a user's gender label as a specific example: first, Hive, a Hadoop component, extracts behavior or log data on a daily schedule from the upstream traffic table, search table, and add-to-cart and order tables, and the common feature data are stored in Hive tables; part of the real-time feature data is processed with Spark and Kafka, and the resulting real-time feature data are merged with the offline data for downstream data cleaning. Next, the acquired offline and real-time data undergo simple cleaning: cleaning methods written in Spark, mainly missing value filling and abnormal value deletion, convert the data into a DataFrame format that the model can recognize, and duplicate entries are removed using the user ID as the primary key. Feature conversion such as normalization, feature combination and one-hot encoding is then applied, while a GBDT algorithm converts continuous features into discrete features and the weights of all converted features are calculated. Features are then selected: according to the weight of each feature calculated by the GBDT algorithm, features whose correlation coefficients are too small are deleted, and the selected features are used as parameters of the logistic regression classification model.

Then the processed data and features are used as training data to train the LR classification model. At least one candidate value is preset for each hyper-parameter to be tuned, the model is wrapped and invoked through a pipeline, cross validation (CV) is used to compare the hyper-parameter settings and their training results, and the best-performing model and the values of all its parameters are output to form a usable prediction model. Finally, the methods and model of the whole process are packaged into a JAR and released to the production environment. Training runs on a schedule every day, and the trained model is saved to an HDFS path in the production environment. Every day, the shopping-gender-related data of newly acquired users and of users whose labels need updating are processed and predicted to obtain new shopping gender predictions, and the results are written to the shopping gender label.
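
The final daily update of the shopping gender label can be sketched as a merge of new predictions into an existing tag store. The in-memory dictionary standing in for the tag table, the function name, and in particular the `min_confidence` rule (keeping the old label when the model is unsure) are assumptions added for this example and are not described in the embodiment.

```python
def update_gender_labels(tag_store, predictions, min_confidence=0.6):
    """Write new or changed predictions into the tag store.

    tag_store:   dict mapping user ID to the current label.
    predictions: dict mapping user ID to (predicted label, probability).
    Returns the subset of users whose label was added or changed.
    """
    changed = {}
    for user_id, (label, probability) in predictions.items():
        if probability < min_confidence:
            continue  # assumed policy: low-confidence predictions change nothing
        if tag_store.get(user_id) != label:
            tag_store[user_id] = label
            changed[user_id] = label
    return changed


tags = {"u1": "male", "u2": "female"}
daily_predictions = {
    "u1": ("female", 0.9),   # existing user whose shopping gender changed
    "u2": ("female", 0.8),   # unchanged
    "u3": ("male", 0.7),     # newly registered user
    "u4": ("male", 0.55),    # below the confidence cutoff
}
changed = update_gender_labels(tags, daily_predictions)
```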

Based on the same inventive concept, an embodiment of the present invention further provides a tag generation system, and FIG. 3 shows a system framework diagram of the tag generation system according to an embodiment of the present invention. As shown in FIG. 3, the system includes:

the cleaning module 100 is configured to acquire basic information and behavior feature data of a user, and clean the basic information and behavior feature data of the user;

a selecting module 200, configured to select a feature of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user;

a training module 300, configured to train a logistic regression classification model according to the features of the logistic regression classification model;

and a generating module 400, configured to perform prediction by using the trained logistic regression classification model to generate a label.

In this embodiment, the label generation method and system make it convenient to predict and optimize the basic information labels of e-commerce users. They enable daily calculation and prediction for newly registered users, correction and updating of erroneous information of existing users, and timely updating of a user's gender label when the shopping gender changes because a different person is using the same device and account, thereby greatly improving the accuracy and completeness of the labels.

As shown in fig. 4, the cleaning module 100 specifically includes:

the filtering unit 101 is configured to obtain user behavior and log information, and filter the user behavior and log information into basic information and behavior feature data of the user;

a cleaning unit 102, configured to clean the basic information and the behavior feature data of the user, and remove null values, repeated values, and abnormal values;

and the preprocessing unit 103 is used for preprocessing the basic information and the behavior characteristic data of the user.

The selecting module 200 specifically includes:

the preset unit 201 is configured to select a feature used by the logistic regression classification model in a manner of presetting the feature.

Preferably, the selecting module 200 further includes:

a calculating unit 202, configured to calculate, according to the basic information and the behavior feature data of the user, a feature importance of each feature by using a GBDT algorithm;

and a selecting unit 203, configured to select, according to the feature importance, a feature with a high feature importance as a feature used by the logistic regression classification model.

Wherein, the training module 300 specifically includes:

a setting unit 301, configured to preset at least one hyper-parameter of the logistic regression classification model;

a training unit 302, configured to use the selected features as parameters of a logistic regression classification model according to the selected features and the logistic regression classification model, and substitute the preset hyper-parameters into training one by one in combination with the preprocessed user basic information to obtain a trained logistic regression classification model;

and the tuning unit 303 is configured to compare the logistic regression classification models after different hyper-parameter training, and select an optimal model and a hyper-parameter to obtain an optimal logistic regression classification model.

The label generation system provided by the embodiment of the invention makes it convenient to predict and optimize the basic information labels of e-commerce users. It enables daily calculation and prediction for newly registered users, correction and updating of erroneous information of existing users, and timely updating of a user's gender label when the shopping gender changes because a different person is using the same device and account, thereby greatly improving the accuracy and completeness of the labels.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. Those skilled in the art will appreciate that the modules in the devices in the embodiments may be adaptively changed and arranged in one or more devices different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for generating a label, comprising:

2. The method according to claim 1, wherein the acquiring the basic information and the behavior feature data of the user and cleaning the basic information and the behavior feature data of the user specifically comprises:

3. The method according to claim 2, wherein selecting features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically comprises:

4. The method according to claim 2, wherein selecting features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically comprises:

5. The method of claim 2, wherein training the logistic regression classification model based on the features of the logistic regression classification model comprises:

6. A label generation system, comprising:

7. The system of claim 6, wherein the cleaning module specifically comprises:

8. The system of claim 7, wherein the selection module further comprises:

9. The system of claim 7, wherein the selection module further comprises:

10. The system according to claim 7, wherein the training module specifically comprises: