CN111325280A - Label generation method and system - Google Patents

Publication number
CN111325280A
Authority
CN
China
Prior art keywords
logistic regression
user
classification model
regression classification
basic information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010125081.3A
Other languages
Chinese (zh)
Inventor
吴雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Application filed by Suning Cloud Computing Co Ltd
Priority to CN202010125081.3A
Publication of CN111325280A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity

Abstract

An embodiment of the invention discloses a label generation method and system. The method comprises: acquiring basic information and behavior feature data of a user, and cleaning the basic information and behavior feature data; selecting features for a logistic regression classification model according to the cleaned basic information and behavior feature data; training the logistic regression classification model according to the selected features; and performing prediction with the trained logistic regression classification model to generate a label. The method makes it convenient to predict and optimize the basic information labels of e-commerce users, and greatly improves the accuracy and completeness of the labels.

Description

Label generation method and system
Technical Field
The invention relates to the field of computers, in particular to a label generation method and a label generation system.
Background
In an e-commerce membership system, the basic information of members is used frequently, typically through a label system in which user information is decomposed into various labels and stored in a database. Taking user gender as an example, a traditional e-commerce platform records the gender manually or reads it from the registration information, then maintains the gender label as an initial result written to an offline table, after which it is hardly ever changed.
This causes several problems. Because the data is entered manually, the risk of errors is high. If the member's registration information is used, privacy concerns lead to many missing values. Moreover, the gender suggested by a user's shopping behavior on the platform is not necessarily consistent with the user's actual gender, which can also cause errors. A solution is therefore urgently needed to predict and optimize the basic information labels of e-commerce users and to improve the completeness and accuracy of the labels.
Disclosure of Invention
Embodiments of the invention provide a label generation method and system that make it convenient to predict and optimize the basic information labels of e-commerce users.
To achieve the above purpose, the embodiments of the invention adopt the following technical solutions:
In a first aspect, an embodiment of the present invention provides a label generation method, which includes:
acquiring basic information and behavior feature data of a user, and cleaning the basic information and behavior feature data of the user;
selecting features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user;
training the logistic regression classification model according to the features of the logistic regression classification model;
and performing prediction with the trained logistic regression classification model to generate a label.
With reference to the first aspect, as a first implementation of the embodiment of the present invention, acquiring the basic information and behavior feature data of the user and cleaning them specifically includes:
acquiring user behavior and log information, and filtering and processing it into the basic information and behavior feature data of the user;
cleaning the basic information and behavior feature data of the user to remove null values, duplicate values and abnormal values;
and preprocessing the basic information and behavior feature data of the user.
With reference to the first implementation of the first aspect, as a second implementation of the embodiment of the present invention, selecting the features of the logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically includes:
selecting the features used by the logistic regression classification model by presetting the features.
With reference to the first implementation of the first aspect, as a third implementation of the embodiment of the present invention, selecting the features of the logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically includes:
calculating the feature importance of each feature with a gradient boosting decision tree (GBDT) algorithm according to the basic information and behavior feature data of the user;
and selecting, according to the feature importance, the features with high feature importance as the features used by the logistic regression classification model.
With reference to the first implementation of the first aspect, as a fourth implementation of the embodiment of the present invention, training the logistic regression classification model according to the features of the logistic regression classification model specifically includes:
presetting at least one hyper-parameter of the logistic regression classification model;
taking the selected features as parameters of the logistic regression classification model and, in combination with the preprocessed user basic information, substituting the preset hyper-parameters into training one by one to obtain trained logistic regression classification models;
and comparing the logistic regression classification models trained with different hyper-parameters, and selecting the best model and hyper-parameters to obtain the optimal logistic regression classification model.
In a second aspect, an embodiment of the present invention further provides a label generation system, which includes:
a cleaning module, configured to acquire basic information and behavior feature data of a user, and clean the basic information and behavior feature data of the user;
a selecting module, configured to select features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user;
a training module, configured to train the logistic regression classification model according to the features of the logistic regression classification model;
and a generating module, configured to perform prediction with the trained logistic regression classification model to generate a label.
With reference to the second aspect, as a first implementation of the embodiment of the present invention, the cleaning module specifically includes:
a filtering unit, configured to acquire user behavior and log information, and filter it into the basic information and behavior feature data of the user;
a cleaning unit, configured to clean the basic information and behavior feature data of the user to remove null values, duplicate values and abnormal values;
and a preprocessing unit, configured to preprocess the basic information and behavior feature data of the user.
With reference to the first implementation of the second aspect, as a second implementation of the embodiment of the present invention, the selecting module further includes:
a presetting unit, configured to select the features used by the logistic regression classification model by presetting the features.
With reference to the first implementation of the second aspect, as a third implementation of the embodiment of the present invention, the selecting module further includes:
a calculating unit, configured to calculate the feature importance of each feature with a GBDT algorithm according to the basic information and behavior feature data of the user;
and a selecting unit, configured to select, according to the feature importance, the features with high feature importance as the features used by the logistic regression classification model.
With reference to the first implementation of the second aspect, as a fourth implementation of the embodiment of the present invention, the training module specifically includes:
a setting unit, configured to preset at least one hyper-parameter of the logistic regression classification model;
a training unit, configured to take the selected features as parameters of the logistic regression classification model and, in combination with the preprocessed user basic information, substitute the preset hyper-parameters into training one by one to obtain trained logistic regression classification models;
and a tuning unit, configured to compare the logistic regression classification models trained with different hyper-parameters, and select the best model and hyper-parameters to obtain the optimal logistic regression classification model.
The label generation method and system provided by the embodiments of the invention make it convenient to predict and optimize the basic information labels of e-commerce users. Compared with the prior art, the data is cleaned and preprocessed, then used to train the logistic regression classification model, and the trained model is finally used for prediction. In this way, newly registered users can be scored and predicted every day, and erroneous information of existing users can be corrected and updated. In addition, when a user's shopping gender changes while the device and account remain the same, the user's gender label can be updated in time, which greatly improves the accuracy and completeness of the labels.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a label generation method according to an embodiment of the present invention;
Fig. 2 is a flow block diagram of step S130 in Fig. 1;
Fig. 3 is a block diagram of a label generation system according to an embodiment of the present invention;
Fig. 4 is a block diagram of a label generation system according to another embodiment of the present invention.
Detailed Description
In order that the technical solutions of the present invention may be better understood, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In existing e-commerce membership systems, the basic information of members is typically managed through a label system, in which user information is decomposed into various labels and stored in a database. Taking user gender as an example, a traditional e-commerce platform records the gender manually or reads it from the registration information, then maintains the gender label as an initial result written to an offline table, after which it is hardly ever changed.
This prior art has significant drawbacks. Because the data is often entered manually, manual entry errors are possible. If the member's registration information is used, privacy concerns leave many values missing. Moreover, the gender suggested by a user's shopping behavior does not necessarily match the registered gender, which can cause errors in marketing activities such as recommendation. A solution is therefore urgently needed to predict and optimize the basic information labels of e-commerce users and to improve the completeness and accuracy of the labels.
Embodiments of the invention provide a label generation method and system that make it convenient to predict and optimize the basic information labels of e-commerce users. Newly registered users are scored and predicted every day, erroneous information of existing users is corrected and updated, and a user's gender label can be updated in time when the shopping gender changes because a different person is using the account, which greatly improves the accuracy and completeness of the labels.
Fig. 1 shows a flow chart of a label generation method according to an embodiment of the invention. Referring to Fig. 1, the label generation method of this embodiment includes steps S110 to S140.
Step S110: acquiring basic information and behavior feature data of a user, and cleaning the basic information and behavior feature data of the user;
Step S120: selecting features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user;
Step S130: training the logistic regression classification model according to the features of the logistic regression classification model;
Step S140: performing prediction with the trained logistic regression classification model to generate a label.
It should be noted that the logistic regression classification model obtained in the embodiment of the present invention may be retrained in the actual working environment according to the different user information and requirements provided each day, so as to achieve the best prediction performance and obtain prediction results.
In this embodiment, the label generation method makes it convenient to predict and optimize the basic information labels of e-commerce users. Newly registered users are scored and predicted every day, erroneous information of existing users is corrected and updated, and a user's gender label can be updated in time when the shopping gender changes because a different person is using the account, which greatly improves the accuracy and completeness of the labels.
Step S110 specifically includes:
S1101: acquiring user behavior and log information, and filtering it to obtain the basic information and behavior feature data of the user;
S1102: cleaning the basic information and behavior feature data of the user to remove null values, duplicate values and abnormal values;
S1103: preprocessing the basic information and behavior feature data of the user.
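The cleaning in S1102 can be illustrated with a minimal standard-library sketch. The field names (`user_id`, `age`, `orders_30d`), the outlier bound and the median fill are assumptions made for illustration only; the embodiment itself describes cleaning routines written in Spark, and whether nulls are dropped or filled is a design choice (they are filled here).

```python
from statistics import median

# Hypothetical raw records; the field names are illustrative, not from the patent.
raw_records = [
    {"user_id": "u1", "age": 25, "orders_30d": 3},
    {"user_id": "u1", "age": 25, "orders_30d": 3},      # duplicate value
    {"user_id": "u2", "age": None, "orders_30d": 1},    # null value
    {"user_id": "u3", "age": 31, "orders_30d": 9999},   # abnormal value
    {"user_id": "u4", "age": 40, "orders_30d": 0},
]

def clean(records, max_orders=1000):
    """Remove abnormal rows, fill missing ages, de-duplicate by user_id."""
    # Abnormal values: an assumed upper bound on the 30-day order count.
    kept = [r for r in records
            if r["orders_30d"] is not None and r["orders_30d"] <= max_orders]
    # Null values: fill missing ages with the median of the known ages.
    fill_age = median(r["age"] for r in kept if r["age"] is not None)
    # Duplicate values: the user ID acts as the primary key, first row wins.
    deduped = {}
    for r in kept:
        row = dict(r)
        if row["age"] is None:
            row["age"] = fill_age
        deduped.setdefault(row["user_id"], row)
    return list(deduped.values())

cleaned = clean(raw_records)
print([r["user_id"] for r in cleaned])
```

Removing outliers before computing the fill value keeps abnormal rows from distorting the median; a production job would apply the same order of operations on a distributed DataFrame.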
In actual operation, data acquisition mainly uses Hive, a Hadoop component, to process offline data. The offline data includes the basic information and shopping behavior data of users; for example, a user's shopping interest preference data is needed when predicting the user's gender. Hive extracts behavior or log data daily from upstream traffic tables, search tables, and add-to-cart and order tables. Common feature data is stored in Hive tables, while some real-time feature data is processed with Spark and Kafka; the resulting real-time feature data is combined with the offline data and passed on to the downstream data cleaning.
Data cleaning is applied to the acquired offline and real-time data. For example, cleaning routines written in Spark, including missing-value filling and abnormal-value removal, convert the data into a DataFrame format that the model can consume, and duplicates are removed using the user ID as the primary key.
Preprocessing, that is, data transformation, is then carried out, including feature normalization, feature selection, feature combination and one-hot encoding. In addition to these common feature transformations, the GBDT algorithm is used to process continuous features and convert them into discrete features; at the same time, the weights of all transformed features can be calculated and provided as a reference for feature selection. The data is thereby put in order, fully prepared for the subsequent feature selection and model building.
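Two of the transformations named above, feature normalization and one-hot encoding, can be sketched as follows. This is a plain-Python illustration under assumed inputs (an `ages` column and a `channels` column are hypothetical); it does not reproduce the Spark-based implementation or the GBDT discretization.

```python
def min_max_normalize(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode categorical values as 0/1 indicator vectors."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

# Hypothetical columns for three users.
ages = [20, 30, 40]
channels = ["app", "web", "app"]

scaled_ages = min_max_normalize(ages)
categories, channel_vectors = one_hot(channels)

# Feature combination: concatenate the transformed columns row by row.
features = [[a] + c for a, c in zip(scaled_ages, channel_vectors)]
print(categories, features)
```

Scaling numeric features to a common range keeps gradient-based training of a logistic regression model well conditioned, and one-hot encoding lets a linear model assign an independent weight to each category.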
Step S120 specifically includes:
selecting the features used by the logistic regression classification model by presetting the features.
In this embodiment, the features required for building the model may be set manually, based on features selected in the past or features of particular interest, without referring to the feature weights produced at the end of preprocessing in step S110.
Preferably, step S120 further includes:
S1201: calculating the feature importance of each feature with a GBDT algorithm according to the basic information and behavior feature data of the user;
S1202: selecting, according to the feature importance, the features with high feature importance as the features used by the logistic regression classification model.
In this embodiment, the feature weights produced at the end of preprocessing in step S110 are used as a reference: the features are screened according to the weights calculated by the GBDT model, features whose correlation coefficients are too small are deleted, and the remaining features are used to train the logistic regression classification model in the subsequent step. Alternatively, the Pearson correlation coefficient may be used to compute the correlation between features and remove similar features, or PCA may be used to reduce the feature dimensionality for the purpose of feature selection. The purpose of feature selection is to make model training faster and more accurate.
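The screening described above can be sketched by combining an importance threshold with a Pearson-correlation filter. In this sketch the importance scores are hypothetical numbers standing in for the output of a GBDT model, and the thresholds `min_importance` and `max_corr` as well as the feature names are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:            # constant column: correlation undefined
        return 0.0
    return cov / (sx * sy)

def select_features(columns, importance, min_importance=0.05, max_corr=0.95):
    """Keep important features, then drop near-duplicates of kept features."""
    ranked = sorted(
        (name for name in columns if importance.get(name, 0.0) >= min_importance),
        key=lambda name: importance[name],
        reverse=True,
    )
    selected = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[s])) <= max_corr for s in selected):
            selected.append(name)
    return selected

# Hypothetical feature columns and importance scores (stand-ins for GBDT output).
columns = {
    "beauty_views": [1.0, 2.0, 3.0, 4.0],
    "beauty_clicks": [2.0, 4.0, 6.0, 8.0],   # linearly dependent on beauty_views
    "sports_views": [4.0, 1.0, 3.0, 2.0],
    "page_height": [5.0, 5.0, 5.0, 5.0],
}
importance = {"beauty_views": 0.50, "beauty_clicks": 0.30,
              "sports_views": 0.19, "page_height": 0.01}

selected = select_features(columns, importance)
print(selected)
```

Processing features in descending order of importance means that, of two highly correlated features, the one the tree model found more useful is the one that survives.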
Step S130 specifically includes:
S1301: presetting at least one hyper-parameter of the logistic regression classification model;
S1302: taking the selected features as parameters of the logistic regression classification model and, in combination with the preprocessed user basic information, substituting the preset hyper-parameters into training one by one to obtain trained logistic regression classification models;
S1303: comparing the logistic regression classification models trained with different hyper-parameters, and selecting the best model and hyper-parameters to obtain the optimal logistic regression classification model.
In actual operation, training the logistic regression classification model uses the processed data and the selected features as training data and parameters. The LR (logistic regression) classification model is trained locally multiple times, and the resulting models and parameters are compared to obtain the model with the highest efficiency and accuracy. During these training runs, at least one candidate value is preset for each hyper-parameter to be tuned, a pipeline is used to package and invoke the model, and cross-validation (CV) is used to compare the hyper-parameter settings and their training results. Finally, the best-performing model and the values of all its parameters are output for predicting each label in the actual production environment.
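The loop of presetting hyper-parameter values, training one model per value and comparing them by cross-validation can be sketched in plain Python. Everything concrete here is an assumption for illustration: the tuned hyper-parameter is an L2 penalty `l2`, the model is fitted by batch gradient descent, the data is a toy set, and the folds are taken by index modulo `k`. In a Spark-based implementation the same roles would typically be filled by the library's pipeline and cross-validation utilities rather than hand-written code.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_lr(X, y, l2, lr=0.5, epochs=200):
    """Fit binary logistic regression by batch gradient descent with L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, xi):
    w, b = model
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

def cv_accuracy(X, y, l2, k=4):
    """Mean held-out accuracy over k folds (fold i holds samples with index % k == i)."""
    scores = []
    for i in range(k):
        train_idx = [j for j in range(len(X)) if j % k != i]
        test_idx = [j for j in range(len(X)) if j % k == i]
        model = train_lr([X[j] for j in train_idx], [y[j] for j in train_idx], l2)
        hits = sum(predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Toy data: two normalized behavior features per user, binary label.
X = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.9],
     [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

candidates = [0.0, 0.01, 1.0]                      # preset hyper-parameter values
results = {l2: cv_accuracy(X, y, l2) for l2 in candidates}
best_l2 = max(results, key=results.get)            # compare and pick the best
best_model = train_lr(X, y, best_l2)               # refit on all data
print(results, best_l2)
```

Refitting on the full data set after selecting the hyper-parameter mirrors the final step of the description, where the best model and its parameter values are output for use in production.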
Taking the prediction of a user's gender label as a specific example, Hive, a Hadoop component, is first used to extract behavior or log data from upstream traffic tables, search tables, and add-to-cart and order tables on a daily schedule. Common feature data is stored in Hive tables, while some real-time feature data is processed with Spark and Kafka; the resulting real-time feature data is combined with the offline data and passed on to the downstream data cleaning. The acquired offline and real-time data then undergo simple cleaning: cleaning routines written in Spark, mainly missing-value filling and abnormal-value removal, convert the data into a DataFrame format that the model can consume, and duplicates are removed using the user ID as the primary key. Next, feature transformations such as normalization, feature combination and one-hot encoding are applied; at the same time, the GBDT algorithm processes continuous features to convert them into discrete features, and the weights of all transformed features are calculated. Features are then selected: according to the weight of each feature calculated by the GBDT algorithm, features whose correlation coefficients are too small are deleted, and the selected features are used as parameters of the logistic regression classification model.
The processed data and features are then used as training data to train the LR classification model. At least one candidate value is preset for each hyper-parameter to be tuned, a pipeline is used to package and invoke the model, cross-validation (CV) is used to compare the hyper-parameter settings and their training results, and the best-performing model together with the values of all its parameters is output as a usable prediction model. Finally, the methods and model of the whole process are packaged as a JAR and released to the production environment. Training runs on a schedule every day and the trained model is saved to an HDFS path in the production environment. Each day, the shopping-gender-related data of newly acquired users and of users that need updating is processed and predicted to obtain new shopping gender predictions, which are written to the shopping gender label.
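The daily prediction and label update at the end of this example can be sketched as follows. The weights, bias, feature names and the in-memory `label_table` are hypothetical stand-ins for a trained model loaded from storage and a label table in a database; only the control flow reflects the description.

```python
import math

# Hypothetical parameters of an already trained logistic regression model.
WEIGHTS = {"beauty_views": 1.2, "sports_views": -0.8}
BIAS = 0.1

def predict_shopping_gender(features):
    """Map a feature dict to a shopping gender label using the logistic function."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    prob_female = 1.0 / (1.0 + math.exp(-z))
    return "female" if prob_female >= 0.5 else "male"

def daily_update(label_table, todays_features):
    """Predict for every user seen today; return IDs whose label was added or changed."""
    changed = []
    for user_id, features in todays_features.items():
        new_label = predict_shopping_gender(features)
        if label_table.get(user_id) != new_label:
            label_table[user_id] = new_label
            changed.append(user_id)
    return changed

# Existing labels and today's processed features (all values are illustrative).
label_table = {"u1": "male", "u2": "female"}
todays_features = {
    "u1": {"beauty_views": 2.0, "sports_views": 0.5},   # existing user, behavior shifted
    "u2": {"beauty_views": 1.0, "sports_views": 0.0},   # existing user, label unchanged
    "u3": {"beauty_views": 0.0, "sports_views": 3.0},   # newly registered user
}

changed = daily_update(label_table, todays_features)
print(changed, label_table)
```

Writing only the labels that differ from the stored value keeps the daily job's footprint small while still covering both newly registered users and existing users whose predicted shopping gender has changed.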
Based on the same inventive concept, an embodiment of the present invention further provides a label generation system. Fig. 3 shows a block diagram of the label generation system according to an embodiment of the present invention. As shown in Fig. 3, the system includes:
the cleaning module 100 is configured to acquire basic information and behavior feature data of a user, and clean the basic information and behavior feature data of the user;
a selecting module 200, configured to select a feature of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user;
a training module 300, configured to train a logistic regression classification model according to the features of the logistic regression classification model;
and a generating module 400, configured to perform prediction by using the trained logistic regression classification model to generate a label.
In this embodiment, the label generation method and system make it convenient to predict and optimize the basic information labels of e-commerce users. Newly registered users are scored and predicted every day, erroneous information of existing users is corrected and updated, and a user's gender label can be updated in time when the shopping gender changes because a different person is using the account, which greatly improves the accuracy and completeness of the labels.
As shown in Fig. 4, the cleaning module 100 specifically includes:
the filtering unit 101 is configured to obtain user behavior and log information, and filter the user behavior and log information into basic information and behavior feature data of the user;
a cleaning unit 102, configured to clean the basic information and the behavior feature data of the user, and remove null values, repeated values, and abnormal values;
and the preprocessing unit 103 is used for preprocessing the basic information and the behavior characteristic data of the user.
In actual operation, the filtering unit 101, the cleaning unit 102 and the preprocessing unit 103 carry out data acquisition, data cleaning and preprocessing in the same way as described above for step S110, which is not repeated here.
The selecting module 200 specifically includes:
the preset unit 201 is configured to select a feature used by the logistic regression classification model in a manner of presetting the feature.
In this embodiment, the features required for building the model may be set manually, based on features selected in the past or features of particular interest, without referring to the feature weights produced at the end of preprocessing in step S110.
Preferably, the selecting module 200 further includes:
a calculating unit 202, configured to calculate, according to the basic information and the behavior feature data of the user, a feature importance of each feature by using a GBDT algorithm;
and a selecting unit 203, configured to select, according to the feature importance, a feature with a high feature importance as a feature used by the logistic regression classification model.
In this embodiment, feature selection refers to the feature weights provided at the end of the preprocessing in step S110: the features are screened according to the feature weights calculated by the GBDT model, features whose correlation coefficients are too small are deleted, and the remaining features are used as the features required for training the logistic regression classification model in the subsequent step. Alternatively, the correlation between features may be calculated with the Pearson correlation coefficient and similar features removed, or PCA may be used to reduce the feature dimensionality and thereby achieve feature selection. The purpose of feature selection is to make model training faster and more accurate.
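As a rough illustration of two of the options above, the following Python sketch first drops features whose importance falls below a threshold and then removes features that are highly Pearson-correlated with a feature already kept. The importance values stand in for weights that a GBDT model would supply; they, the feature names, and both thresholds are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)

def select_features(columns, importances, min_importance=0.05, max_corr=0.9):
    """Keep important features, skipping any that duplicate a kept feature."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    selected = []
    for name in ranked:
        if importances[name] < min_importance:
            continue  # weight too small to be worth keeping
        redundant = any(
            abs(pearson(columns[name], columns[kept])) > max_corr
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected

# Hypothetical feature columns (one value per user).
columns = {
    "cosmetics_views": [5, 1, 4, 0, 6, 2],
    "cosmetics_clicks": [10, 2, 8, 0, 12, 4],  # exactly twice the views column
    "sports_orders": [1, 3, 0, 2, 2, 4],
    "login_hour": [9, 21, 13, 8, 22, 10],
}
# Hypothetical weights, standing in for GBDT feature importances.
importances = {
    "cosmetics_views": 0.45,
    "cosmetics_clicks": 0.30,
    "sports_orders": 0.23,
    "login_hour": 0.02,
}
selected = select_features(columns, importances)
print(selected)
```

Processing features in descending order of importance means that, of two near-duplicate features, the one with the larger weight is the one retained.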
Wherein, the training module 300 specifically includes:
a setting unit 301, configured to preset at least one hyper-parameter of the logistic regression classification model;
a training unit 302, configured to use the selected features as parameters of a logistic regression classification model according to the selected features and the logistic regression classification model, and substitute the preset hyper-parameters into training one by one in combination with the preprocessed user basic information to obtain a trained logistic regression classification model;
and the tuning unit 303 is configured to compare the logistic regression classification models after different hyper-parameter training, and select an optimal model and a hyper-parameter to obtain an optimal logistic regression classification model.
In actual operation, training the logistic regression classification model uses the processed data and the selected features as training data and parameters. The LR (logistic regression) classification model is trained locally multiple times, and the resulting models and parameters are compared to obtain the model with the highest efficiency and accuracy. During these training runs, at least one candidate value may be preset for each hyper-parameter to be tuned; a pipeline is then used to wrap and invoke the model, cross-validation (CV) is used to compare the hyper-parameters involved in the model, and the training results are compared. Finally, the best-performing model and the values of all its parameters are output and used to predict each label in the actual production environment.
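The tuning loop described above can be sketched as a grid search with k-fold cross-validation. The version below is plain Python and stands in for the pipeline and CV tooling the text mentions, which is not specified further. The hyper-parameter names (`learning_rate`, `l2`), the candidate values, the use of held-out log loss as the comparison metric, and the synthetic two-feature data are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(model, x):
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train_lr(X, y, learning_rate, l2, epochs=200):
    """Fit logistic regression by batch gradient descent with an L2 penalty."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            error = predict_proba((weights, bias), x) - label
            grad_b += error
            for j in range(d):
                grad_w[j] += error * x[j]
        weights = [w - learning_rate * (g / n + l2 * w)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def log_loss(model, X, y):
    eps = 1e-12
    total = 0.0
    for x, label in zip(X, y):
        p = min(1 - eps, max(eps, predict_proba(model, x)))
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(y)

def cross_validate(X, y, params, k=5):
    """Mean held-out log loss over k folds (fold i holds rows i, i+k, ...)."""
    losses = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        model = train_lr([X[i] for i in train], [y[i] for i in train], **params)
        losses.append(log_loss(model, [X[i] for i in held_out],
                               [y[i] for i in held_out]))
    return sum(losses) / k

def grid_search(X, y, grid, k=5):
    """Try every preset combination and keep the one with the lowest CV loss."""
    best_params, best_loss = None, float("inf")
    for learning_rate in grid["learning_rate"]:
        for l2 in grid["l2"]:
            params = {"learning_rate": learning_rate, "l2": l2}
            loss = cross_validate(X, y, params, k)
            if loss < best_loss:
                best_params, best_loss = params, loss
    # Refit on all rows with the winning setting.
    return train_lr(X, y, **best_params), best_params, best_loss

# Synthetic data: label 1 when the first feature is high and the second is low.
rng = random.Random(0)
X, y = [], []
for i in range(40):
    label = i % 2
    high, low = rng.uniform(0.7, 1.0), rng.uniform(0.0, 0.3)
    X.append([high, low] if label == 1 else [low, high])
    y.append(label)

grid = {"learning_rate": [0.01, 0.1, 1.0], "l2": [0.0, 0.1]}
best_model, best_params, best_loss = grid_search(X, y, grid)
accuracy = sum((predict_proba(best_model, x) >= 0.5) == (label == 1)
               for x, label in zip(X, y)) / len(y)
print(best_params, round(best_loss, 4), accuracy)
```

Refitting on the full data set after the search mirrors the final step in the text, where the best model and its parameter values are output for use in production.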
The label generation system provided by the embodiment of the invention makes it convenient to predict and optimize the basic information labels of e-commerce users. It computes predictions for newly registered users every day, corrects and updates erroneous information for existing users, and can promptly update a user's gender label when the shopping gender associated with the account changes because the person using it has changed, thereby greatly improving the accuracy and completeness of the labels.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. Those skilled in the art will appreciate that the modules in the devices in the embodiments may be adaptively changed and arranged in one or more devices different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a label, comprising:
acquiring basic information and behavior characteristic data of a user, and cleaning the basic information and behavior characteristic data of the user;
selecting the characteristics of a logistic regression classification model according to the cleaned basic information and behavior characteristic data of the user;
training a logistic regression classification model according to the characteristics of the logistic regression classification model;
and predicting by using the trained logistic regression classification model to generate a label.
2. The method according to claim 1, wherein the acquiring the basic information and the behavior feature data of the user and cleaning the basic information and the behavior feature data of the user specifically comprises:
acquiring user behavior and log information, and filtering and processing the user behavior and log information into basic information and behavior characteristic data of the user;
cleaning the basic information and the behavior characteristic data of the user, and removing null values, repeated values and abnormal values;
and preprocessing the basic information and the behavior characteristic data of the user.
3. The method according to claim 2, wherein selecting features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically comprises:
and selecting the characteristics used by the logistic regression classification model in a mode of presetting the characteristics.
4. The method according to claim 2, wherein selecting features of a logistic regression classification model according to the cleaned basic information and behavior feature data of the user specifically comprises:
calculating the feature importance of each feature by using a GBDT algorithm according to the basic information and the behavior feature data of the user;
and selecting the features with high feature importance as the features used by the logistic regression classification model according to the feature importance.
5. The method of claim 2, wherein training the logistic regression classification model based on the features of the logistic regression classification model comprises:
presetting at least one hyper-parameter of the logistic regression classification model;
according to the selected features and the logistic regression classification model, taking the features as parameters of the logistic regression classification model, combining the preprocessed user basic information, and substituting the preset hyper-parameters into training one by one to obtain the trained logistic regression classification model;
and comparing the logistic regression classification models after different hyper-parameter training, and selecting the optimal model and the hyper-parameter to obtain the optimal logistic regression classification model.
6. A label generation system, comprising:
the cleaning module is used for acquiring the basic information and the behavior characteristic data of the user and cleaning the basic information and the behavior characteristic data of the user;
the selection module is used for selecting the characteristics of the logistic regression classification model according to the cleaned basic information and behavior characteristic data of the user;
the training module is used for training the logistic regression classification model according to the characteristics of the logistic regression classification model;
and the generating module is used for predicting by using the trained logistic regression classification model to generate a label.
7. The system of claim 6, wherein the cleaning module specifically comprises:
the filtering unit is used for acquiring user behavior and log information and filtering the user behavior and log information into basic information and behavior characteristic data of the user;
the cleaning unit is used for cleaning the basic information and the behavior characteristic data of the user and removing null values, repeated values and abnormal values;
and the preprocessing unit is used for preprocessing the basic information and the behavior characteristic data of the user.
8. The system of claim 7, wherein the selection module further comprises:
and the presetting unit is used for selecting the characteristics used by the logistic regression classification model in a characteristic presetting mode.
9. The system of claim 7, wherein the selection module further comprises:
the computing unit is used for computing the feature importance of each feature by using a GBDT algorithm according to the basic information and the behavior feature data of the user;
and the selecting unit is used for selecting the features with high feature importance as the features used by the logistic regression classification model according to the feature importance.
10. The system according to claim 7, wherein the training module specifically comprises:
the setting unit is used for presetting at least one hyper-parameter of the logistic regression classification model;
the training unit is used for taking the selected features as parameters of the logistic regression classification model according to the selected features and the logistic regression classification model, combining the preprocessed user basic information, and substituting the preset hyper-parameters into training one by one to obtain the trained logistic regression classification model;
and the tuning unit is used for comparing the logistic regression classification models after different hyper-parameter training, and selecting the optimal model and the hyper-parameter to obtain the optimal logistic regression classification model.
CN202010125081.3A 2020-02-27 2020-02-27 Label generation method and system Pending CN111325280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125081.3A CN111325280A (en) 2020-02-27 2020-02-27 Label generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010125081.3A CN111325280A (en) 2020-02-27 2020-02-27 Label generation method and system

Publications (1)

Publication Number Publication Date
CN111325280A (en) 2020-06-23

Family

ID=71167379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010125081.3A Pending CN111325280A (en) 2020-02-27 2020-02-27 Label generation method and system

Country Status (1)

Country Link
CN (1) CN111325280A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469645A (en) * 2021-06-21 2021-10-01 广州政企互联科技有限公司 Intelligent storage method for policy data
CN116383029A (en) * 2023-06-06 2023-07-04 和元达信息科技有限公司 User behavior label generation method and device based on small program
CN116383029B (en) * 2023-06-06 2024-04-26 和元达信息科技有限公司 User behavior label generation method and device based on small program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7461063B1 (en) * 2004-05-26 2008-12-02 Proofpoint, Inc. Updating logistic regression models using coherent gradient
CN106127525A (en) * 2016-06-27 2016-11-16 浙江大学 A kind of TV shopping Method of Commodity Recommendation based on sorting algorithm
CN107220217A (en) * 2017-05-31 2017-09-29 北京京东尚科信息技术有限公司 Characteristic coefficient training method and device that logic-based is returned
CN107330445A (en) * 2017-05-31 2017-11-07 北京京东尚科信息技术有限公司 The Forecasting Methodology and device of user property
CN109800884A (en) * 2017-11-14 2019-05-24 阿里巴巴集团控股有限公司 Processing method, device, equipment and the computer storage medium of model parameter

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陶皖 等: "《云计算与大数据》", pages: 128 - 133 *

Similar Documents

Publication Publication Date Title
CN110956224B (en) Evaluation model generation and evaluation data processing method, device, equipment and medium
CN103744928B (en) A kind of network video classification method based on history access record
JP2020500420A (en) Yield prediction system and method for machine learning based semiconductor manufacturing
US10671926B2 (en) Method and system for generating predictive models for scoring and prioritizing opportunities
US10706359B2 (en) Method and system for generating predictive models for scoring and prioritizing leads
KR102065780B1 (en) Electronic apparatus performing prediction of time series data using big data and method for predicting thereof
CN110880124A (en) Conversion rate evaluation method and device
KR101435096B1 (en) Apparatus and method for prediction of merchandise demand using social network service data
CN110956278A (en) Method and system for retraining machine learning models
CN111325280A (en) Label generation method and system
CN116452212B (en) Intelligent customer service commodity knowledge base information management method and system
CN116934380A (en) E-commerce material supply and demand combined prediction method under abnormal event
CN116501979A (en) Information recommendation method, information recommendation device, computer equipment and computer readable storage medium
CN116823496A (en) Intelligent insurance risk assessment and pricing system based on artificial intelligence
CN116579640A (en) Power marketing service channel user experience assessment method and system
EP3493082A1 (en) A method of exploring databases of time-stamped data in order to discover dependencies between the data and predict future trends
CN114240318A (en) Target object oriented information processing method and device and computer equipment
US20210073840A1 (en) Multi-layered system for heterogeneous pricing decisions by continuously learning market and hotel dynamics
CN111091410B (en) Node embedding and user behavior characteristic combined net point sales prediction method
CN113313615A (en) Method and device for quantitatively grading and grading enterprise judicial risks
Khan et al. Privacy Preserved and Decentralized Smartphone Recommendation System
CN112765451A (en) Client intelligent screening method and system based on ensemble learning algorithm
CN112686448A (en) Loss early warning method and system based on attribute data
Ortega-Bastida et al. Regional gross domestic product prediction using twitter deep learning representations
CN117668205B (en) Smart logistics customer service processing method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination