CN115186759A - Model training method and user classification method - Google Patents

Info

Publication number
CN115186759A
Authority
CN
China
Prior art keywords
user
data
data set
classification model
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210858459.XA
Other languages
Chinese (zh)
Inventor
周鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avatr Technology Chongqing Co Ltd
Original Assignee
Avatr Technology Chongqing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avatr Technology Chongqing Co Ltd
Priority to CN202210858459.XA
Publication of CN115186759A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a model training method and a user classification method. The model training method comprises the following steps: obtaining a first data set of each of at least one first user; screening all second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user; training at least one user classification model according to a set number of the second data sets; wherein the user classification model is used to determine a user type of the user.

Description

Model training method and user classification method
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a model training method and a user classification method.
Background
At different stages of the industry chain of the intelligent automobile industry, the characteristics of user behavior data need to be analyzed so that follow-up services can be provided to users in a targeted way. In this industry, however, the characteristics of user behavior data are complex and diverse, so users cannot be classified accurately on the basis of that data.
Disclosure of Invention
In view of this, embodiments of the present application provide a model training method and a user classification method, so as to at least solve the problem that the user classification cannot be accurately performed based on user behavior data in the related art.
The embodiment of the application provides a model training method, which comprises the following steps:
obtaining a first data set of each of at least one first user; wherein the first data set contains at least one first data of the first user; said first data characterizing a user characteristic of said first user;
converting the first data into second data; the second data represents a characteristic value associated with the first label; the first label is used for marking the user type of the first user;
screening all second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user; the second data set comprises second data for user classification model training; the screening rules include at least one of the following rules: performing feature screening according to the correlation coefficient between the second data, performing feature screening according to the influence of the second data on the detection capability of a screening model, and performing feature screening according to prior information;
training at least one user classification model according to a set number of the second data sets; wherein the user classification model is used to determine a user type of the user.
The embodiment of the application further provides a user classification method, which comprises the following steps:
acquiring a fourth data set; the fourth data set characterizes at least one user data of a second user;
inputting the fourth data set to a user classification model, and determining a user type of the second user based on the user classification model; the user classification model is trained by any of the model training methods described above.
In the embodiments of the application, a first data set is obtained for each of at least one first user; the first data is converted into second data; all second data corresponding to each first data set is screened according to a set screening rule to obtain the second data set corresponding to each first user; and at least one user classification model is trained on a set number of second data sets. A user classification model applicable to the intelligent automobile industry can thus be trained, and automobile users can be classified accurately according to their differing characteristics.
Drawings
FIG. 1 is a schematic flow chart illustrating an implementation of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an implementation of a model training method according to another embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating an implementation of a model training method according to another embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating an implementation of a model training method according to another embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating an implementation of a model training method according to another embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating an implementation of a user classification method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating an implementation of a user classification method according to another embodiment of the present application;
FIG. 8 is a schematic diagram of a user data collection prompt provided herein;
FIG. 9 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a user classification apparatus according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and specific embodiments.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The technical means described in the embodiments of the present application may be arbitrarily combined without conflict.
In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a particular order or sequence.
The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the group consisting of A, B, and C.
The embodiment of the application provides a model training method, and fig. 1 is a schematic flow chart of the model training method in the embodiment of the application. As shown in fig. 1, the method includes:
S101: Obtain a first data set for each of at least one first user.
In a smart car scenario, the first data set may be obtained from different data sources.
The first data source is the business data of the first user, which includes the first user's gender, age, province, and city, the channel from which the business data was obtained (for example, an applet, an application, a public network, and the like), points, growth value, number of followers, number of lottery draws, number of wins, and the like.
The second data source is advertising media data about the first user, which includes advertisement click events, advertisement exposure events, and the like.
The third data source is software development kit (SDK) embedded tracking data (buried-point data), including applet, official website, and application tracking data. In practical applications, the first data set may record 300-500 user characteristics of the first user.
In practical applications, the data from the three data sources that corresponds to the first user's mobile phone number can be collected over a network by means of that number, and the first data set is obtained from it.
In practical applications, the first data set may further include user features collected offline. These offline features of the first user can be imported, so that data analysis is performed on a first data set that contains both the user features collected online and those collected offline.
In practical applications, to distinguish the first data sets of different first users, each first data set may carry a user identifier of its first user. The identifier may be the first user's mobile phone number; because first users correspond one-to-one with their mobile phone numbers, the first user to whom a first data set belongs can be determined from the number it carries.
In an embodiment, after the at least one first data set is obtained, processing such as data cleaning may be performed on the first data in each first data set, which removes redundant data and improves the efficiency of processing the first data set.
In this embodiment, the first data set may be processed according to a set data format, a first list corresponding to the first data set is generated, and the first list is stored, where the first data (that is, the user characteristics of the first user), the characteristic group where the first data is located, and the first tag of the first user are recorded in the first list.
In this embodiment, feature groups may be distinguished by the channel through which the user features are obtained. Specifically, the feature groups may be advertisements, applets, official websites, application programs, and communities. Advertisements come from the public domain, where the user's mobile phone number does not need to be registered; applets, official websites, application programs, and communities come from the private domain, where the user's mobile phone number must be registered.
In this embodiment, the first tag represents the user type of the first user. It may be assigned manually, or automatically by formulating relevant business rules. The user classes differ between application scenarios. For example, the first tag may distinguish store-visiting users from non-store-visiting users, that is, first users are classified by whether the vehicle user has visited a vehicle store.
In another application scenario, the first tag may divide first users into attrition users and non-attrition users, where an attrition user is a user who logged in to the application program during a first time period but did not log in during a second time period.
In yet another application scenario, the first tag may distinguish abnormal users from non-abnormal users. In this scenario, service personnel can manually check, from report records, whether a reported vehicle user is an abnormal user. Alternatively, suspected abnormal vehicle users can be screened out by relevant business rules and then checked manually by service personnel. The business rules may test whether a vehicle user refreshes pages more than 100 times per minute and/or whether the vehicle user only takes part in lottery activities without any browsing behavior; if either condition is met, the vehicle user is judged to be a suspected abnormal vehicle user.
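The two business rules described above can be sketched as a simple predicate. This is a minimal illustration only: the `UserActivity` record and its field names are hypothetical and do not come from the application, and a flagged user would still go to manual review.

```python
from dataclasses import dataclass


@dataclass
class UserActivity:
    # Hypothetical per-user activity summary; field names are illustrative.
    refreshes_per_minute: float
    lottery_participations: int
    page_views: int


def is_suspected_abnormal(activity: UserActivity, max_refreshes: float = 100) -> bool:
    """Return True if at least one of the two business rules is met."""
    refreshes_too_often = activity.refreshes_per_minute > max_refreshes
    lottery_only = activity.lottery_participations > 0 and activity.page_views == 0
    return refreshes_too_often or lottery_only


flagged = is_suspected_abnormal(UserActivity(120, 0, 35))
```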
In practical applications, the first data is essentially a feature value of the first user. For example, if the advertisement data shows that the first user's mobile terminal is of brand A, the corresponding first data is brand A. When there are many pieces of first data, the feature that each one represents cannot be reliably told apart in the first list, so the feature name corresponding to each piece of first data may also be recorded in the first list.
Table 1 shows a first list containing a feature group, a feature name, a feature (that is, the first data), and a first tag. The value carried by the first tag determines the user type of the first user, and in practical applications the user type corresponding to each value can be set as required. In one feasible arrangement, a first tag value of 0 denotes a non-abnormal user and a value of 1 denotes an abnormal user; under this arrangement, the first user in Table 1 is marked as a non-abnormal user.
In practical applications, the first list also carries the user identifier of the first user.
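TABLE 1
[Table image BDA0003755084240000041 not reproduced. Columns per the description: feature group, feature name, feature (first data), first tag.]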
S102: Convert the first data into second data.
Here, all the first data in the first data sets of all first users is converted into second data. The first data describes user characteristics with values of different types: for example, the gender of a user may be described by the Chinese character for "male", and other user characteristics may be described by English characters and the like. Because the model can only recognize numeric characters, the first data must be converted into second data that the model can recognize. The converted second data can also be more closely related to the user type, which benefits model training.
In an embodiment, as shown in fig. 2, the converting the first data into the second data includes:
S201: Determine a first conversion mode according to the data type of the first data.
S202: Convert the first data into the second data according to the determined first conversion mode.
In this embodiment, different conversion modes are set for different data types. The data types include discrete data and continuous data. Discrete data means that the feature values can be enumerated one by one, that is, they come from a set of predefined values. For example, for the gender feature of a user, the feature values can be enumerated as male and female: the value is either male or female and never a value outside these two. Continuous data means that the feature value can be any value; for example, for the consumption amount feature of a user, the feature value can be any amount.
The data type of the first data can be determined from the nature of the user feature it describes or from its value. The first data is converted into second data according to the first conversion mode corresponding to its data type. Common conversion modes include mapping encoding, one-hot encoding, discretization conversion, and logarithmic conversion.
In practical applications, the common conversion modes for discrete data are mapping encoding and one-hot encoding, and the common conversion modes for continuous data are discretization conversion, logarithmic conversion, and mapping encoding. The conversion modes are described below:
one-hot encoding: n is used as a state register to encode N states, each state has a corresponding independent register bit, and only one bit is valid. For example, after the operating system of the mobile terminal of the user is subjected to unique hot coding, the operating system a may be coded as 0, the operating system B may be coded as 1, and in the case that the first data is the operating system a, the converted second data is 0.
Discretization conversion: to mine deeper correlations among the data, reduce the interference of abnormally deviating data on the model as a whole, and improve prediction accuracy, continuous data is discretized, which introduces nonlinear features into the model. For example, if the first data characterizes the number of times the first user accesses the application, the access count can be divided into ranges, such as fewer than 10 accesses, 10-50 accesses, 50-100 accesses, and more than 100 accesses, and the first data is finally converted into one of these four ranges.
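The four-range example can be sketched with a sorted list of bin edges. The ranges given in the text share the endpoints 50 and 100, so the boundary handling below (integer counts, with 50 and 100 both falling in the third range) is an assumption made for illustration.

```python
from bisect import bisect_right

# Assumed integer bins: <10, 10-49, 50-100, >100.
BIN_EDGES = [10, 50, 101]
BIN_LABELS = ["fewer than 10", "10-50", "50-100", "more than 100"]


def discretize_access_count(count: int) -> int:
    """Return the index (0-3) of the range that an access count falls into."""
    return bisect_right(BIN_EDGES, count)


bin_index = discretize_access_count(37)
bin_label = BIN_LABELS[bin_index]
```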
Logarithmic transformation: a log operation is applied to the first data, so that the distribution of the converted second data is smoother.
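As a small sketch of this conversion, the snippet below applies a natural logarithm to a list of amounts. Using `log(1 + x)` rather than `log(x)` is an assumption made here so that a value of 0 remains defined; the text only states that a log operation is performed, and the sample amounts are hypothetical.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to each non-negative value to compress large magnitudes."""
    return [math.log1p(value) for value in values]


consumption_amounts = [0, 9, 99, 9999]  # hypothetical first data
second_data = log_transform(consumption_amounts)
```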
Mapping encoding: mapping encoding establishes a nonlinear relation between different first data and different first labels. When the first data is discrete, the second data is the average value of the first labels corresponding to that first data. When the first data is continuous, it is first discretized, and the second data is the average value of the first labels corresponding to the discretized first data.
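For discrete first data, the label-average mapping can be sketched as follows. The function name and the sample brands and labels are hypothetical; continuous first data would be discretized before being passed in.

```python
from collections import defaultdict


def fit_mapping_encoding(values, labels):
    """Map each discrete value to the mean of the first labels observed with it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        totals[value] += label
        counts[value] += 1
    return {value: totals[value] / counts[value] for value in totals}


brands = ["A", "A", "B", "B", "B"]   # hypothetical first data
first_labels = [1, 0, 0, 0, 1]       # hypothetical first labels (0 or 1)
mapping = fit_mapping_encoding(brands, first_labels)
second_data = [mapping[brand] for brand in brands]
```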
S103: Screen all the second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user.
In practical applications, each first data set yields a large amount of second data. If all of it were used for model training, the prediction capability of the model would be poor, so the second data is screened first and only the resulting second data set is used for model training.
In this embodiment, feature screening may follow at least one of three screening rules: screening according to the correlation coefficient between second data, screening according to the influence of the second data on the detection capability of a screening model, and screening according to prior information. Screening by correlation coefficient formulates the rule from the correlation between pairs of second data, which avoids the loss of prediction capability and accuracy that the user classification model suffers when second data is too strongly correlated. Screening by influence on the detection capability of the screening model removes, iteratively or in a single pass, second data that is weakly correlated with the first label or has little influence on prediction capability, so that second data with a certain significance and predictive power is selected. The screening model may be an XGBoost model; XGBoost is a scalable machine learning system based on decision trees that automatically ranks the importance of all features after it is run, after which a number of the lowest-ranked features are removed. Screening by prior information means judging the actual business data in light of the technical knowledge and experience relevant to the specific business scenario, and adding or removing part of the second data on that basis.
In practical applications, the second data may also be screened according to actual requirements. For example, when the user classification model is used to separate abnormal users from non-abnormal users, abnormal users typically show behaviors such as malicious comments and reported violations, so the second data derived from the buried-point data may be retained, for example the types of content browsed (including lottery, invitation, check-in, and the like), or the count, number of days, duration, and frequency of specific behaviors (including refreshing pages, clicking buttons, entering sensitive words, and publishing sensitive content).
In a scenario where the user classification model is applied to attrition users and non-attrition users, the second data about the application program may be selected to be retained, and in addition, the second data generated from a questionnaire filled by the user about the application program usage may be retained.
In the scenario that the user classification model is applied to the visiting store user and the non-visiting store user, the second data generated from the questionnaire filled by the user about the vehicle purchasing intention and the visiting store intention can be retained.
In an embodiment, as shown in fig. 3, when the second data set corresponding to each first user is filtered from all the second data corresponding to each first data set according to the set filtering rule, the method includes:
S301: Determine target second data in the first data set according to the first parameter.
Feature screening according to the correlation coefficient between second data can be subdivided into screening by the correlation coefficient between second data in the same feature group and screening by the correlation coefficient between second data in different feature groups. In this embodiment, the first parameter is used to determine the correlation between second data in the same feature group, and all the second data corresponding to the first data set is screened according to the first parameter. The target second data is the second data retained in the first data set, that is, second data for which the correlation between any two items in the same feature group is not high.
One method of determining target second data in a first data set is detailed:
A first parameter corresponding to the first data set is determined. In this embodiment, the first parameter is the Pearson correlation coefficient, whose value lies between -1 and 1 and which determines the correlation between any two second data in the same feature group. The first parameter can be calculated as
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$
where X corresponds to one second data, Y corresponds to the other second data, $\mu$ and $\sigma$ denote the mean and the standard deviation, and $\rho_{X,Y}$ corresponds to the first parameter.
According to the first parameters, the n second data with the highest first parameters are removed from the first data set. In practical applications, suppose N second data are stored in the first data set; after the n second data are removed, N-n second data remain in the first data set, and the first parameter corresponding to the first data set is determined again.
After this iterative processing, the first parameter of every pair of second data in the first data set is smaller than the first set threshold, and each second data remaining in the first data set can then be regarded as target second data.
This screening method analyzes the correlation of each second data more accurately, so highly correlated data can be removed precisely, but the repeated iterations over the first data set make the processing time long.
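The iterative method can be sketched as below for one feature group. Several choices are assumptions made for illustration and are not stated in the application: one feature is removed per iteration (n = 1), the absolute value of the coefficient is compared with the threshold, the later feature of the most correlated pair is the one dropped, and no feature column is constant. The feature names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def filter_correlated(features, threshold):
    """Drop one feature of the most correlated pair, recompute, and repeat
    until every pairwise |correlation| is below `threshold`."""
    kept = dict(features)
    while len(kept) > 1:
        (name_a, name_b), strongest = max(
            (((a, b), abs(pearson(kept[a], kept[b]))) for a, b in combinations(kept, 2)),
            key=lambda item: item[1],
        )
        if strongest < threshold:
            break
        del kept[name_b]  # assumption: drop the later feature of the pair
    return kept


group = {
    "app_visits": [1, 2, 3, 4, 5],
    "app_minutes": [2, 4, 6, 8, 10.5],  # nearly collinear with app_visits
    "lottery_draws": [5, 1, 4, 2, 3],
}
target_second_data = filter_correlated(group, threshold=0.8)
```

Recomputing every pairwise coefficient on each pass is what makes this approach accurate but slow as the number of features grows.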
Another method of determining target second data in a first data set is detailed:
In this way, the processing speed of the first data set can be increased, but because the correlation of each second data is not accurately analyzed, some characteristics of strong predictability may exist in the target second data.
S302: Determine a second data set from all the target second data corresponding to the first data set according to the second parameter.
After the correlation of the second data within the same feature group in the first data set has been analyzed, the correlation between every two target second data from different feature groups is analyzed on the basis of the target second data in the first data set. The second parameter is the Pearson correlation coefficient between two second data from different feature groups, and it determines the correlation between them.
According to the second parameters, target second data from different feature groups whose second parameter is greater than or equal to a second set threshold is removed from the first data set, and the target second data whose second parameter is smaller than the second set threshold is determined as the second data set. This completes the screening of second data both within the same feature group and across different feature groups.
In an embodiment, as shown in fig. 4, when the second data set corresponding to each first user is obtained by filtering all the second data corresponding to each first data set according to the set filtering rule, the method includes:
S401: Freely combine all the second data corresponding to the first data set to generate at least one second data combination.
For example, if the second data corresponding to the first data set is second data A, second data B, and second data C, then the first kind of second data combination may consist of second data A, the second kind of second data A + second data B, and the third kind of second data A + second data B + second data C.
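One way to enumerate second data combinations is sketched below. Reading "freely combining" as "every non-empty subset" is an assumption; the three combinations named in the text (A, A + B, A + B + C) are among the results.

```python
from itertools import combinations


def second_data_combinations(names):
    """Yield every non-empty combination of the given second data names."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


combos = list(second_data_combinations(["A", "B", "C"]))
```

The number of subsets grows as 2^N - 1, so with hundreds of features a practical implementation would likely restrict the search, for example by adding one feature at a time.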
S402: Determine a third parameter corresponding to each second data combination according to the screening model.
The third parameter is the index value of a set index of the screening model when a second data combination is used as its input data. The screening model serves a different purpose from the user classification model in step S104: the screening model is used to screen the second data and may be a logistic regression model and/or an XGBoost model, whereas the user classification model in step S104 is used for user classification.
Each second data combination is used as input data of the screening model to obtain the third parameter corresponding to that combination. In this embodiment, the third parameter may be the Akaike information criterion (AIC), where AIC = 2k - 2ln(L), k is the number of parameters, and L is the likelihood function.
In one case, the third parameter may further include the Bayesian information criterion (BIC), where BIC = ln(n)k - 2ln(L), k is the number of parameters, n is the number of samples, and L is the likelihood function.
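The two criteria can be computed directly from a fitted model's log-likelihood, as in the following sketch. The functions take ln(L) rather than L to avoid numerical underflow, and the numbers in the usage lines are hypothetical rather than results from the application.

```python
import math


def aic(k: int, log_likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(k: int, n: int, log_likelihood: float) -> float:
    """Bayesian information criterion: ln(n) k - 2 ln(L)."""
    return math.log(n) * k - 2 * log_likelihood


# Hypothetical screening-model fit: 3 parameters, 200 samples, ln(L) = -120.
aic_value = aic(k=3, log_likelihood=-120.0)
bic_value = bic(k=3, n=200, log_likelihood=-120.0)
```

For both criteria a lower value indicates a better trade-off between fit and model size, and BIC penalizes extra parameters more heavily than AIC once n exceeds about 7.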
In yet another case, the third parameter may further include a p-value, wherein the p-value is obtained by inputting the second data into the screening model, and automatically calculating the p-value of each second data after the screening model runs.
S403: Determine a second data set according to the third parameter and a third set threshold.
Rule one: the second data to be removed is determined by comparing how the third parameter changes across second data combinations. For example, third parameter A (the index value of the set performance index of the screening model when second data A is the input data) is compared with third parameter B (the index value of the set performance index of the screening model when second data A and second data B are the input data). If third parameter A is better than third parameter B, second data B can be rejected and second data A retained. In rule one, the third parameter is the AIC and/or the BIC, for which a lower value indicates a better model.
Rule two: second data whose third parameter is smaller than a third set threshold is eliminated. In rule two, the third parameter is the p-value.
Rule three: XGBoost is a scalable machine learning system based on decision trees. After the model is run, the importance of each second data is ranked automatically; a number of the lowest-ranked second data can then be removed, and the remaining second data is screened to form a second data set so as to generate input data of the screening model.
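Rule three amounts to sorting by importance and cutting the tail, which can be sketched without any model library. The importance scores below are hypothetical stand-ins for the values a trained screening model would report, and the function name is illustrative.

```python
def drop_lowest_ranked(importances, n_drop):
    """Rank features by importance (highest first) and drop the last `n_drop`."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    n_keep = max(0, len(ranked) - n_drop)
    return ranked[:n_keep]


# Hypothetical importance scores for four second data features.
importance_scores = {
    "refresh_count": 0.42,
    "lottery_draws": 0.31,
    "province": 0.02,
    "age": 0.25,
}
retained = drop_lowest_ranked(importance_scores, n_drop=1)
```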
The first rule and the second rule may be applied to feature screening using a logistic regression model, and the first rule, the second rule and the third rule may be applied to feature screening using an XGBoost model.
S104: Train at least one user classification model according to a set number of the second data sets.
The second data set contains positive sample data and negative sample data. When store-visiting users are being classified, the second data of first users who visited the store is the positive sample data and the second data of first users who did not visit the store is the negative sample data; whether a user is a store-visiting user or a non-store-visiting user is determined by the first tag. In practical applications, the set number of second data sets may be 80% of the total number of second data sets. For example, if 10 second data sets are generated, 8 of them may be randomly selected as input data for model training.
A set number of second data sets is input into at least one user classification model to train it. In practical applications, an XGBoost model can be trained with the second data sets; XGBoost is an optimized distributed gradient boosting library. Alternatively, a logistic regression model, which is a statistical method for binary classification, may be trained with the second data sets. The XGBoost model and the logistic regression model may also be trained simultaneously.
The XGBoost model and the logistic regression model each output a probability value between 0 and 1, which is obtained through an S-shaped (sigmoid) function, i.e.

$$y = \frac{1}{1 + e^{-(\omega^{T} x + b)}}$$

where y represents the output probability value, ω is the weight of the model, b represents the bias value, and x is the feature value of the second data set. In practical application, the threshold for binary classification on the probability value is generally 0.5: when the calculated probability value is greater than 0.5, the corresponding output is 1, and when the calculated probability value is less than 0.5, the corresponding output is 0.
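The S-shaped function and the 0.5 threshold can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and assigning class 0 when the probability equals exactly 0.5 is an assumption, since the text does not specify that case.

```python
import math


def sigmoid(z):
    """S-shaped function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(weights, bias, features):
    """Probability y for feature values x, weights omega and bias b."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Return 1 above the threshold, otherwise 0 (tie -> 0 is an assumption)."""
    return 1 if probability > threshold else 0
```

For instance, `classify(predict_probability([0.8, -0.2], 0.1, [1.0, 2.0]))` evaluates the sigmoid at 0.8 - 0.4 + 0.1 = 0.5, which is above 0.5 in probability and therefore maps to class 1.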
In one embodiment, as shown in fig. 5, the method further comprises:
S501: constructing a third data set according to the second data set corresponding to each first user.
In the case where the first user classification model and the second user classification model are trained simultaneously, one of the two models is selected for use by separately verifying the detection capability of the first user classification model and the detection capability of the second user classification model.
The third data set is used for verifying the detection capability of the model. The third data set is derived from 20% of all the second data sets and also comprises positive sample data and negative sample data. For example, assuming that 10 second data sets are generated, 8 of the 10 second data sets are extracted as input data for model training, and the 2 second data sets that are not used as input data for model training can be determined as the third data set.
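The 80%/20% division described above can be sketched as a random split. The function name, the fixed seed and the use of integers as stand-ins for second data sets are illustrative assumptions.

```python
import random


def split_data_sets(data_sets, train_fraction=0.8, seed=0):
    """Randomly split second data sets into training input and a held-out
    third data set. The seed is fixed only to make the sketch repeatable."""
    rng = random.Random(seed)
    shuffled = list(data_sets)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Ten placeholder second data sets, identified here by index.
training_sets, third_data_set = split_data_sets(range(10))
```

With 10 second data sets, 8 are used for training and the remaining 2 form the third data set.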
S502: acquiring a first output result and a second output result.
The third data set is input into the first user classification model to obtain the first output result, wherein the first output result represents the user type of the first user in the third data set as determined by the first user classification model.
The third data set is input into the second user classification model to obtain the second output result, wherein the second output result represents the user type of the first user in the third data set as determined by the second user classification model.
S503: determining a first performance indicator value based on the first output result, and determining a second performance indicator value based on the second output result.
Here, the first performance index value and the second performance index value each include a precision rate, a recall rate, an F1 score and an AUC. The precision rate reflects the proportion of correct predictions in the first positive sample data, where the first positive sample data is the data predicted as positive samples in the output result. The recall rate reflects the proportion of correct predictions in the second positive sample data, where the second positive sample data is the positive sample data determined according to the first tag. The meanings of the first positive sample data and the second positive sample data are explained by taking users visiting a store as an example: when the data sets are input into the user classification model and the output result predicts that a first user is a store-visiting user, that first user belongs to the first positive sample data; when the first tag of a first user marks that the first user has a store-visiting behavior, that first user belongs to the second positive sample data. In practical applications, the recall rate generally tends to decrease as the precision rate increases. The F1 score is the harmonic mean of the precision rate and the recall rate, and a high F1 score can only be obtained when both the precision rate and the recall rate are high. The AUC is defined as the area under the receiver operating characteristic (ROC) curve; its value is not greater than 1, and a higher AUC indicates better prediction capability of the user classification model.
The degree of the detection capability of the first user classification model can be determined through the first performance index value, and the degree of the detection capability of the second user classification model can be determined through the second performance index value. The first performance index value is obtained from analyzing the first output result, and the second performance index value is obtained from analyzing the second output result.
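The four performance indexes can be computed directly from the labels given by the first tag and the model outputs. The sketch below uses small made-up labels and probabilities; the AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive sample receives the higher probability, ties counting one half), which equals the area under the ROC curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive sample)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auc(y_true, probabilities):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, probabilities) if t == 1]
    neg = [s for t, s in zip(y_true, probabilities) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (from the first tag) and model probabilities.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if p > 0.5 else 0 for p in probs]
precision, recall, f1 = precision_recall_f1(labels, preds)
area = auc(labels, probs)
```

In this toy data two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall and F1 all equal 2/3, while 8 of the 9 positive/negative pairs are ranked correctly.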
S504: selecting, from the first user classification model and the second user classification model, a model meeting the set performance index value for use according to the first performance index value and the second performance index value.
The first performance index value is compared with the second performance index value, and the model with the better prediction capability is selected for use, so that the capability of identifying different automobile users can be improved.
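One possible way to carry out this selection is sketched below. The text does not state which index decides between the models, so filtering by minimum values and then ranking by AUC is an assumption, and all metric values are hypothetical.

```python
def select_model(metrics_by_model, minimums, rank_by="auc"):
    """Return the name of the model that meets every minimum index value
    and has the highest value of rank_by, or None if no model qualifies."""
    eligible = {
        name: metrics
        for name, metrics in metrics_by_model.items()
        if all(metrics[key] >= value for key, value in minimums.items())
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name][rank_by])


# Hypothetical performance index values for the two trained models.
metrics = {
    "xgboost": {"precision": 0.82, "recall": 0.74, "f1": 0.78, "auc": 0.88},
    "logistic_regression": {"precision": 0.79, "recall": 0.70, "f1": 0.74, "auc": 0.84},
}
chosen = select_model(metrics, {"f1": 0.70})
```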
In the embodiment of the application, three application scenarios of the model training method in the intelligent automobile industry are provided.
Scene one
In the first scenario, a target automobile user is identified through the trained model. A target automobile user refers to an automobile user who has a high probability of visiting a store or purchasing an automobile, so that subsequent services can be provided for different automobile users in a targeted manner.
Step 1: A first data set of at least one automobile user is obtained from three data sources, in which at least one user characteristic of the automobile user is recorded. In addition, user data collected offline can be imported, so that the user data collected offline and the user data collected online can be analyzed together. For example, questions about purchase intention or store-visit intention are set in a questionnaire, the questionnaire is issued offline, and the data about purchase intention or store-visit intention filled in by the automobile user is imported from the collected questionnaire feedback as a user characteristic of the automobile user.
Step 2: The first data set is stored in a first list according to a set data format. The first data, the feature group where the first data is located, and the first tag of the automobile user are recorded in the first list respectively, where the first tag is used for distinguishing whether the automobile user is a store-visiting user or a non-visiting user. In practical application, the first tag may be marked manually or automatically according to a certain rule.
Step 3: The first data is converted into second data. Here, the first data may be processed by selecting an appropriate conversion method from mapping encoding, one-hot encoding, discretization conversion, and logarithmic conversion according to whether the first data is discrete data or continuous data, so as to obtain the corresponding second data.
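The four conversion methods named in this step can be sketched as follows. The category lists, bin edges and the use of log(1 + x) for the logarithmic conversion are illustrative assumptions; the text does not fix these details.

```python
import math


def mapping_encode(value, mapping):
    """Mapping encoding: replace a discrete value by a preset integer code."""
    return mapping[value]


def one_hot_encode(value, categories):
    """One-hot encoding: one indicator per category, exactly one set to 1."""
    return [1 if value == category else 0 for category in categories]


def discretize(value, bin_edges):
    """Discretization: index of the interval a continuous value falls into.
    bin_edges must be sorted ascending; values >= the last edge get the
    final index."""
    for index, edge in enumerate(bin_edges):
        if value < edge:
            return index
    return len(bin_edges)


def log_transform(value):
    """Logarithmic conversion; log(1 + x) is assumed so that 0 maps to 0."""
    return math.log1p(value)


# Hypothetical first data of one automobile user.
sex_code = mapping_encode("female", {"male": 0, "female": 1})
channel_vector = one_hot_encode("application", ["applet", "application", "website"])
age_bucket = discretize(27, [18, 30, 45])
points_value = log_transform(0)
```

Discrete data such as sex or channel would typically use mapping or one-hot encoding, while continuous data such as age or points would use discretization or logarithmic conversion.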
Step 4: Feature screening is performed on all second data corresponding to each first data set to obtain a second data set corresponding to each first user, wherein the feature screening includes feature screening according to correlation coefficients among the second data, feature screening according to the influence of the second data on the detection capability of the screening model, and feature screening according to prior information.
The characteristic screening according to the correlation coefficient between the second data comprises the step of screening the second data in the first data set by utilizing the Pearson correlation coefficient of every two second data in the same characteristic grouping and the Pearson correlation coefficient of every two second data in different characteristic groupings to obtain a second data set.
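Screening by the Pearson correlation coefficient can be sketched as a greedy filter that keeps a feature only if it is not too strongly correlated with any feature already kept. The threshold of 0.9, the greedy order and the feature names are assumptions for illustration; the same filter could be applied first within each feature group and then across groups.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear correlation
    return cov / math.sqrt(var_x * var_y)


def screen_by_correlation(columns, threshold=0.9):
    """Keep features whose |r| with every already-kept feature is below
    the threshold. columns: dict of feature name -> list of values."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical second data: login_count is perfectly correlated with login_days.
columns = {
    "login_days": [1, 2, 3, 4, 5],
    "login_count": [2, 4, 6, 8, 10],
    "ad_clicks": [5, 1, 4, 2, 3],
}
kept = screen_by_correlation(columns)
```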
The characteristic screening according to the influence of the second data on the detection capability of the screening model is to screen the second data in the first data set to obtain a second data set by using index values of set indexes of the screening model under different second data combinations.
In the target automobile user identification scenario, features related to the car-purchase intention and the store-visit intention are of primary concern, so the related user features can be retained; a feature such as the number of times the automobile user browses the application is relatively unimportant and can be eliminated.
Step 5: The first user classification model and the second user classification model are trained according to the set number of second data sets.
Step 6: A third data set is generated based on the second data set corresponding to each first user; a first output result generated by the first user classification model for the third data set and a second output result generated by the second user classification model for the third data set are obtained, wherein the third data set is used for verifying the detection capability of the models.
Step 7: A first performance index value of the first user classification model is determined based on the first output result, and a second performance index value of the second user classification model is determined based on the second output result.
Step 8: A model with better performance is selected for use from the first user classification model and the second user classification model according to the first performance index value and the second performance index value.
In practical application, the probability output by the user classification model is normalized and converted into a score of the automobile user; the higher the score, the more likely the automobile user is a target user, that is, the higher the intention to visit a store or to buy a car. Specifically, since the probability value ranges from 0 to 1, it can be divided into 10 levels corresponding respectively to the 10 integer values from 1 to 10, so that the score of the automobile user can be determined according to the probability value; for example, a probability value falling within the range of 0 to 0.1 corresponds to a score of 1. In practical application, business personnel can provide subsequent services for different automobile users according to their scores; for example, various incentive tasks can be issued to automobile users with high scores, while relevant advertisements or information need not be delivered to automobile users with low scores, so that promotion efficiency can be improved.
Scene two
In the second scenario, abnormal users among the automobile users are identified through the trained model, so that whether an automobile user is an abnormal user or a non-abnormal user can be accurately identified.
Step 1: a first data set of at least one vehicle user is obtained from three data sources, in which at least one user characteristic of the vehicle user is recorded.
Step 2: The first data set is stored in a first list according to a set data format. In practical application, the first label may be marked manually or automatically according to a certain rule. For example, an automobile user who has been reported and recorded may be marked as an abnormal user, an automobile user whose browsing frequency per minute is greater than a set threshold may be marked as an abnormal user, and an automobile user who only participates in lottery activities without any browsing behavior may also be marked as an abnormal user.
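The three example rules for automatic marking can be sketched as a simple rule function. The field names and the threshold of 60 browses per minute are hypothetical; the text only states that a set threshold is used.

```python
def label_abnormal(user, max_browse_per_minute=60):
    """Return 1 (abnormal user) or 0 (non-abnormal user) for one record.

    user is a dict with hypothetical keys: reported, browse_per_minute,
    lottery_count, browse_count. Missing keys count as False / 0.
    """
    if user.get("reported", False):
        return 1  # rule 1: reported and recorded
    if user.get("browse_per_minute", 0) > max_browse_per_minute:
        return 1  # rule 2: browsing frequency above the set threshold
    if user.get("lottery_count", 0) > 0 and user.get("browse_count", 0) == 0:
        return 1  # rule 3: lottery participation without any browsing
    return 0


normal_user = {"browse_per_minute": 3, "lottery_count": 1, "browse_count": 12}
lottery_only_user = {"lottery_count": 5, "browse_count": 0}
first_labels = [label_abnormal(normal_user), label_abnormal(lottery_only_user)]
```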
Step 3: The first data is converted into second data. Here, the first data may be processed by selecting an appropriate conversion method from mapping encoding, one-hot encoding, discretization conversion, and logarithmic conversion according to whether the first data is discrete data or continuous data, so as to obtain the corresponding second data.
Step 4: Feature screening is performed on all second data corresponding to each first data set to obtain a second data set corresponding to each first user, wherein the feature screening includes feature screening according to correlation coefficients among the second data, feature screening according to the influence of the second data on the detection capability of the screening model, and feature screening according to prior information.
The characteristic screening according to the correlation coefficient between the second data comprises the step of screening the second data in the first data set by utilizing the Pearson correlation coefficient of every two second data in the same characteristic grouping and the Pearson correlation coefficient of every two second data in different characteristic groupings to obtain a second data set.
The characteristic screening according to the influence of the second data on the detection capability of the screening model is to screen the second data in the first data set to obtain a second data set by using index values of set indexes of the screening model under different second data combinations.
In the abnormal user identification scenario, whether the automobile user has abnormal behavior is mainly analyzed through online buried-point data. Therefore, second data about browsing content types (including lottery drawing, invitation, check-in and the like) and second data about the number of times, days, duration, frequency and the like of specific behaviors (including refreshing pages, clicking buttons, inputting sensitive words, and publishing sensitive content) need to be retained.
Step 5: The first user classification model and the second user classification model are trained according to a set number of second data sets.
Step 6: the method includes generating a third data set based on a second data set corresponding to each first user, obtaining a first output result generated by a first user classification model and related to the third data set, and obtaining a second output result generated by a second user classification model and related to the third data set, wherein the third data set is used for verifying the detection capability of the model.
Step 7: A first performance index value of the first user classification model is determined based on the first output result, and a second performance index value of the second user classification model is determined based on the second output result.
Step 8: A model with better performance is selected for use from the first user classification model and the second user classification model according to the first performance index value and the second performance index value.
In practical application, the probability output by the user classification model is normalized and converted into a score of the automobile user; the higher the score, the more likely the automobile user is an abnormal user. Specifically, since the probability value ranges from 0 to 1, it can be divided into 10 levels corresponding respectively to the 10 integer values from 1 to 10, so that the score of the automobile user can be determined according to the probability value; for example, a probability value falling within the range of 0 to 0.1 corresponds to a score of 1. In practical application, business personnel can reduce the rights of, or block, automobile users with high scores according to the scores of different automobile users.
Scene three
In the third scenario, attrition users among the automobile users are identified through the trained model.
Step 1: a first data set of at least one vehicle user is obtained from three data sources, in which at least one user characteristic of the vehicle user is recorded. In addition, since the attrition users are identified by the application login behavior, and the application login behavior belongs to the internet behavior, the user characteristics can be obtained online, for example, the options voted/filled by the automobile users can be captured as one of the user characteristics from a questionnaire filled by the automobile users about the use feedback of the application.
Step 2: The first data set is stored in a first list according to a set data format. The first data, the feature group where the first data is located, and the first label of the automobile user are recorded in the first list respectively, where the first label is used to distinguish whether the automobile user is an attrition user or a non-attrition user. An attrition user is defined as a user for whom a login behavior of the application exists in a first time period and no login behavior of the application exists in a second time period. In practical applications, the first label may be marked manually or automatically according to a certain rule.
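The definition of an attrition user lends itself to automatic marking from login dates. The sketch below assumes each time period is given as an inclusive (start, end) pair of dates; the function name and the example dates are hypothetical.

```python
from datetime import date


def label_attrition(login_dates, first_period, second_period):
    """Return 1 if the user logged in during first_period but not during
    second_period, otherwise 0. Periods are inclusive (start, end) dates."""
    in_first = any(first_period[0] <= d <= first_period[1] for d in login_dates)
    in_second = any(second_period[0] <= d <= second_period[1] for d in login_dates)
    return 1 if in_first and not in_second else 0


# Hypothetical periods: two consecutive weeks.
week_one = (date(2022, 7, 4), date(2022, 7, 10))
week_two = (date(2022, 7, 11), date(2022, 7, 17))

lost = label_attrition([date(2022, 7, 5)], week_one, week_two)
active = label_attrition([date(2022, 7, 5), date(2022, 7, 12)], week_one, week_two)
```

A user who never logged in during the first period is labeled 0 here, because the definition requires a login behavior in the first period.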
Step 3: The first data is converted into second data. Here, the first data may be processed by selecting an appropriate conversion method from mapping encoding, one-hot encoding, discretization conversion, and logarithmic conversion according to whether the first data is discrete data or continuous data, so as to obtain the corresponding second data.
Step 4: Feature screening is performed on the second data corresponding to each first data set to obtain a second data set corresponding to each first user, wherein the feature screening includes feature screening according to correlation coefficients among the second data, feature screening according to the influence of the second data on the detection capability of the screening model, and feature screening according to prior information.
The characteristic screening according to the correlation coefficient between the second data comprises the step of screening the second data in the first data set by utilizing the Pearson correlation coefficient of every two second data in the same characteristic grouping and the Pearson correlation coefficient of every two second data in different characteristic groupings to obtain a second data set.
The characteristic screening according to the influence of the second data on the detection capability of the screening model is to screen the second data in the first data set to obtain a second data set by using index values of set indexes of the screening model under different second data combinations.
In the lost user identification scenario, the determination is mainly made by the application login behavior of the automobile user, and therefore, it is necessary to retain second data about the usage habit of the application of the automobile user.
Step 5: The first user classification model and the second user classification model are trained according to a set number of second data sets.
Step 6: A third data set is generated based on the second data set corresponding to each first user; a first output result generated by the first user classification model for the third data set and a second output result generated by the second user classification model for the third data set are obtained, wherein the third data set is used for verifying the detection capability of the models.
Step 7: A first performance index value of the first user classification model is determined based on the first output result, and a second performance index value of the second user classification model is determined based on the second output result.
Step 8: A model with better performance is selected for use from the first user classification model and the second user classification model according to the first performance index value and the second performance index value.
In practical application, the probability output by the user classification model is normalized and converted into a score of the automobile user; the higher the score, the more likely the automobile user is an attrition user. Specifically, since the probability value ranges from 0 to 1, it can be divided into 10 levels corresponding respectively to the 10 integer values from 1 to 10, so that the score of the automobile user can be determined according to the probability value; for example, a probability value falling within the range of 0 to 0.1 corresponds to a score of 1. In practical application, business personnel can mine the preferences of automobile users with high scores according to the scores of different automobile users, and formulate strategies for preventing attrition and for winning back lost users.
In the embodiment, a first data set of each first user in at least one first user is obtained, the first data is converted into second data, a second data set corresponding to each first user is obtained by screening all second data corresponding to each first data set according to a set screening rule, at least one user classification model is trained based on a set number of second data sets, and the trained user classification model can process complex and diverse feature data of automobile users, so that batch automobile users can be accurately and automatically identified, and corresponding strategies can be formulated for different automobile users in a subsequent targeted manner.
An embodiment of the present application further provides a user classification method, as shown in fig. 6, including:
s601: a fourth data set is acquired.
The fourth data set refers to at least one user data of a second user needing user classification, wherein the fourth data set can be collected from user data in different channels, and the first data source is as follows: the business data of the first user includes sex, age, province, city of the first user, a source of a business data channel of the first user (for example, whether the business data of the first user is obtained from a small program or an application, a public network and the like), points, growth value, fan number, lottery number, winning number and the like. The second data source is: and the advertisement media data about the first user comprises advertisement click events, advertisement exposure events and the like. The third data source is: the SDK data burying comprises applet data burying, official website data burying and application data burying, and can also lead in the data of the second user collected offline and process and analyze the data of the second user collected offline and the data of the second user collected online.
S602: the fourth data set is input to a user classification model, and a user type of the second user is determined based on the user classification model.
The user classification model is obtained by training based on the model training method, and the user classification model can output the user type of the second user by analyzing and processing the fourth data set.
In an embodiment, as shown in fig. 7, the inputting the fourth data set to a user classification model, and the determining the user type of the second user based on the user classification model includes:
S701: inputting the fourth data set into the user classification model to obtain a first probability.
The fourth data set is input into the user classification model, and the user classification model outputs the first probability by processing the fourth data set. The first probability indicates the probability that the second user belongs to one of the user types in the user classification; for example, whether the second user is a user who intends to buy a car can be determined through the first probability.
S702: determining the user score corresponding to the first probability according to the set mapping relation.
In practical application, a mapping relationship between the probability value and the user score may be preset, for example, the value range of the probability value is 0 to 1, and the corresponding user score is 1 to 10, so that the probability value may be divided into 10 layers, where each probability layer corresponds to one user score, and for example, the corresponding user score is 1 when the probability value falls within the range of 0 to 0.1.
The user score of the second user is determined according to the set mapping relation. In practical application, business personnel can formulate relevant strategies and services suited to second users with different scores according to the scores of the second users.
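The mapping relation between the probability value and the user score can be sketched as ten equal-width layers. Whether a boundary value such as 0.1 belongs to the lower or the upper layer is not stated; the sketch assumes half-open layers [0, 0.1), [0.1, 0.2), and so on, with a probability of exactly 1 assigned to score 10.

```python
def probability_to_score(probability, levels=10):
    """Map a probability in [0, 1] to an integer score in 1..levels.

    Layers are assumed half-open on the right, so [0, 0.1) -> 1 when
    levels is 10; a probability of 1.0 maps to the top score.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return min(int(probability * levels) + 1, levels)


user_scores = [probability_to_score(p) for p in (0.05, 0.55, 0.95)]
```

A probability of 0.05 falls within the range of 0 to 0.1 and therefore receives a user score of 1, matching the example in the text.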
In an embodiment, the user classification model is used to determine a target user among the second users, where the target user represents a user having an intention toward a set object; for example, in an automobile scenario, the target user refers to an automobile user who has an intention to buy an automobile or to visit a store. In practical application, whether the second user is the target user can be determined according to a third output result generated by the user classification model. In a feasible manner, the third output result is a user score generated by the user classification model, and when the score of the second user is high, the second user can be considered a target user.
When the second user is the target user, the set resource task is released to the second user.
In the case where the second user is the target user, the service person may issue the set resource task to the second user, for example, issue motivational information to the second user so that the second user may visit a store or purchase a set object. In practical application, when the second user is not the target user, the second user can be stopped from being put with relevant advertisements or information, so that the popularization efficiency can be improved.
In one embodiment, the user classification model is used to determine attrition users among the second users, where an attrition user refers to a user having access behavior during a first time period and no access behavior during a second time period, the first time period being earlier than the second time period. In practical applications, the access behavior may be a behavior of the user visiting a store, an access behavior of the user to a set application, or an access behavior of the user to information of a set object. For example, if a second user has a store-visiting behavior in a first week and has no access behavior in a second week, that is, no store-visiting behavior, no access behavior to the set application, and no access behavior to the information of the set object, the second user may be determined to be an attrition user. In a feasible manner, the third output result is a user score generated by the user classification model, and when the score of the second user is high, the second user can be considered an attrition user.
When the second user is a lost user, the interest or preference of the second user can be mined, the resource task is configured for the second user according to the interest or preference of the second user, namely the resource task contains the content interested by the second user, after the resource task related to the second user is configured, the content interested by the second user is mainly delivered to the second user through an online channel, and the second user can generate an access behavior to a set application and/or an access behavior to information of a set object according to the interested content, so that the lost user can be saved.
In one embodiment, the user classification model is used for determining abnormal users among the second users, where an abnormal user refers to a user having abnormal access behavior. For example, the frequency of the access behavior exceeds a set threshold within a set time period, where the access behavior includes browsing behavior occurring after the user logs in to the set application; or the user only participates in lottery activities and does not initiate any browsing behavior.
In a feasible manner, the third output result is a user score generated by the user classification model, and when the score of the second user is higher, the second user can be considered as an abnormal user.
In the case that the second user is determined to be an abnormal user, business personnel may restrict the access behavior of the second user, for example, by reducing the rights of the second user or blocking the second user. In addition, since an abnormal user is less likely to become a target user, resource delivery to the second user can also be stopped.
In the above embodiment, the fourth data set is obtained, the fourth data set is input to the user classification model, the user type of the second user is determined based on the user classification model, the user type of the second user can be accurately identified, and a targeted subsequent service is provided for the second user.
It should be noted that the user-related data (e.g., the service data of the user such as sex, age, province, city, point, number of fans, etc.) related to the present application is obtained after obtaining the permission or approval of the user; that is, when the present application is applied to a specific product or technology, user permission needs to be obtained to achieve the acquisition and processing of the relevant data, and the processing of the relevant data needs to comply with relevant laws and regulations and regulatory standards of relevant countries and regions.
For example, as shown in fig. 8, when the city where the user is located needs to be acquired, a location acquisition prompt may be displayed in the terminal of the user, and after receiving a confirmation operation of the user for the location acquisition prompt, the terminal may determine the city where the user is located according to the acquired current location of the user.
In order to implement the model training method according to the embodiment of the present application, an embodiment of the present application further provides a model training apparatus, as shown in fig. 9, the apparatus includes:
a first obtaining unit 901, configured to obtain a first data set of each of at least one first user; wherein the first data set contains at least one first data of the first user; said first data characterizing a user characteristic of said first user;
a conversion unit 902, configured to convert the first data into second data; the second data represents a characteristic value associated with the first label; the first label is used for marking the user type of the first user;
a screening unit 903, configured to screen all second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user; the screening rules at least comprise the following rules: performing feature screening according to the correlation coefficient between the second data, performing feature screening according to the influence of the second data on the detection capability of the screening model, and performing feature screening according to the prior information;
a training unit 904, configured to train at least one user classification model according to a set number of the second data sets; wherein the user classification model is used to determine a user type of the user.
In an embodiment, after the first acquiring unit 901 acquires at least one first data set, the apparatus further includes:
a storage unit, configured to store a first list corresponding to each first data set in the at least one first data set; wherein
the first list records, according to a set data format, the first data in the first data set, the feature group to which each first data belongs, and the first label of the first user.
In an embodiment, when converting the first data into the second data, the converting unit 902 is further configured to:
determining a first conversion mode according to the data type of the first data; the data type comprises a discrete data type or a continuous data type;
converting the first data into the second data according to the determined first conversion mode; wherein
the first conversion mode comprises at least one of one-hot encoding, mapping encoding, discretization conversion and logarithmic conversion.
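The four conversion modes named above can be illustrated with a minimal Python sketch. The function names, the `convert` dispatcher, and the rule used to pick a mode within each data type are assumptions made for illustration; the application only states that the mode is determined by whether the first data is discrete or continuous.

```python
import bisect
import math


def one_hot(value, categories):
    """One-hot encode a discrete value against a fixed list of categories."""
    return [1 if value == c else 0 for c in categories]


def mapping_encode(value, mapping, default=-1):
    """Map a discrete value to an integer code; unknown values get `default`."""
    return mapping.get(value, default)


def discretize(value, bin_edges):
    """Return the index of the bin that a continuous value falls into.

    `bin_edges` must be sorted in ascending order.
    """
    return bisect.bisect_right(bin_edges, value)


def log_transform(value):
    """Compress a non-negative continuous value with log(1 + x)."""
    return math.log1p(value)


def convert(value, data_type, *, categories=None, mapping=None, bin_edges=None):
    """Pick a conversion mode from the data type (hypothetical dispatch rule)."""
    if data_type == "discrete":
        if mapping is not None:
            return [mapping_encode(value, mapping)]
        if categories is None:
            raise ValueError("discrete data needs `categories` or `mapping`")
        return one_hot(value, categories)
    if data_type == "continuous":
        if bin_edges is not None:
            return [discretize(value, bin_edges)]
        return [log_transform(value)]
    raise ValueError(f"unknown data type: {data_type}")
```

For instance, `convert(25, "continuous", bin_edges=[18, 30, 45])` places an age of 25 in the second bin, while `convert("b", "discrete", categories=["a", "b", "c"])` yields a three-element one-hot vector.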
In an embodiment, when the screening unit 903 screens all the second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user, the screening unit is further configured to:
determining target second data in the first data set according to the first parameter; the first parameter characterizes a correlation between any two second data in the first data set in the same feature group;
determining a second data set in all target second data corresponding to the first data set according to the second parameter; the second parameter characterizes a correlation between any two target second data of different feature groupings.
In an embodiment, the screening unit 903, when determining the target second data in the first data set according to the first parameter, is further configured to:
iterating the first data set to perform the following: determining a first parameter corresponding to the first data set; removing the n second data with the highest first parameters from the first data set; wherein n is an integer greater than 0;
after the iteration is completed, the first parameter of every two second data remaining in the first data set is smaller than a first set threshold; the second data remaining in the first data set are the target second data.
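The iterative removal described above can be sketched as follows. This is an illustrative reading, not the applicant's implementation: it assumes the first parameter is the absolute Pearson correlation coefficient, ranks each feature by its strongest correlation with any other remaining feature, and breaks ties by feature name. The names `pearson` and `prune_correlated` are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def prune_correlated(features, threshold, n=1):
    """Drop up to n features per round until all pairwise |r| fall below `threshold`.

    `features` maps a feature name to its list of values across users.
    """
    kept = dict(features)
    while len(kept) > 1:
        # Strongest absolute correlation of each feature with any other kept feature.
        strongest = {name: 0.0 for name in kept}
        for a, b in combinations(kept, 2):
            r = abs(pearson(kept[a], kept[b]))
            strongest[a] = max(strongest[a], r)
            strongest[b] = max(strongest[b], r)
        if max(strongest.values()) < threshold:
            break
        # Assumed tie-break: highest correlation first, then alphabetical order.
        ranked = sorted(strongest, key=lambda k: (-strongest[k], k))
        for name in ranked[:n]:
            if strongest[name] >= threshold:
                del kept[name]
    return kept
```

Under these assumptions, the same function could be applied first within each feature group and then across the surviving features of different groups, mirroring the first-parameter and second-parameter stages.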
In an embodiment, the screening unit 903, when determining the target second data in the first data set according to the first parameter, is further configured to:
determining a first parameter corresponding to the first data set;
and determining, according to the first parameter, the second data in the first data set whose first parameter is smaller than a first set threshold as the target second data.
In an embodiment, when the screening unit 903 screens all the second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user, the screening unit is further configured to:
all second data corresponding to the first data set are freely combined to generate at least one second data combination;
determining a third parameter corresponding to each second data combination according to the screening model; the third parameter represents an index value of a set index of the screening model under the condition that the second data combination is the input data of the screening model;
and determining the second data set according to the third parameter and a third set threshold.
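A minimal sketch of this combination-based screening is shown below. The screening model is represented by a caller-supplied `evaluate` function that returns the index value (for example, an AUC) for a given combination; that function, the helper names, and the rule of keeping the best-scoring combination that reaches the threshold are all assumptions for illustration.

```python
from itertools import combinations


def all_combinations(feature_names):
    """Yield every non-empty combination of the given feature names."""
    for size in range(1, len(feature_names) + 1):
        yield from combinations(feature_names, size)


def select_feature_set(feature_names, evaluate, threshold):
    """Return the best-scoring combination whose index value reaches `threshold`.

    `evaluate(combo)` stands in for training and scoring the screening model
    with `combo` as its input features. Returns (None, None) if no combination
    reaches the threshold.
    """
    best, best_score = None, None
    for combo in all_combinations(feature_names):
        score = evaluate(combo)
        if score >= threshold and (best_score is None or score > best_score):
            best, best_score = combo, score
    return best, best_score
```

Note that k features produce 2^k - 1 combinations, so exhaustive enumeration of this kind is only practical for a small number of candidate features.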
In an embodiment, the at least one user classification model comprises a first user classification model and a second user classification model, the apparatus further comprising:
a construction unit, configured to construct a third data set according to the second data set corresponding to each first user; the third data set is used for verifying the detection capability of the user classification model;
a third obtaining unit, configured to obtain the first output result and the second output result; the first output result represents the user type of the first user in the third data set determined based on the first user classification model; the second output result represents the user type of the first user in the third data set determined based on the second user classification model;
a determining unit, configured to determine a first performance index value according to the first output result, and determine a second performance index value according to the second output result; the first performance index value represents an index value of a performance index for measuring the detection capability of the first user classification model; the second performance index value represents an index value of a performance index for measuring the detection capability of the second user classification model;
and the processing unit is used for selecting a model meeting the set performance index value from the first user classification model and the second user classification model according to the first performance index value and the second performance index value.
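The comparison of the two user classification models can be sketched as follows. The application does not fix the performance index, so this example assumes an F1 score on binary labels; `f1_score` and `select_model` are hypothetical names, and picking the higher-scoring model among those that meet the set value is one possible selection rule.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for the `positive` class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_model(outputs, y_true, required):
    """Pick the model whose index value meets `required`, preferring the highest.

    `outputs` maps a model name to the user types it predicted on the
    third (validation) data set. Returns (None, scores) if no model qualifies.
    """
    scores = {name: f1_score(y_true, pred) for name, pred in outputs.items()}
    qualifying = {name: s for name, s in scores.items() if s >= required}
    if not qualifying:
        return None, scores
    return max(qualifying, key=qualifying.get), scores
```

Here the first and second output results correspond to the two entries of `outputs`, and the returned `scores` dictionary holds the first and second performance index values.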
In practical applications, the first obtaining unit 901, the conversion unit 902, the screening unit 903, and the training unit 904 may be implemented by a processor in the model training apparatus. Of course, the processor needs to run the program stored in the memory to realize the functions of the above-described program modules.
It should be noted that, when performing model training, the model training apparatus provided in the embodiment of fig. 9 is only illustrated by dividing the program modules, and in practical applications, the above processing may be distributed and completed by different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the above-described processing. In addition, the model training device and the model training method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
In order to implement the user classification method provided in the embodiment of the present application, an embodiment of the present application further provides a user classification apparatus, as shown in fig. 10, the apparatus includes:
a second acquisition unit 1001 configured to acquire a fourth data set; the fourth data set characterizes at least one user data of a second user;
a determining unit 1002, configured to input the fourth data set to a user classification model, and determine a user type of a second user based on the user classification model; the user classification model is trained based on the model training method.
In an embodiment, the determining unit 1002, when inputting the fourth data set to a user classification model, determines the user type of the second user based on the user classification model, is configured to:
inputting the fourth data set to the user classification model, resulting in a first probability; said first probability characterizing a probability value of said second user being one of a class of users;
determining a user score corresponding to the first probability according to a set mapping relation; and the set mapping relation represents user scores corresponding to different probability intervals.
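The set mapping relation between probability intervals and user scores can be illustrated with a small lookup. The interval boundaries and score values below are invented for the example; the application does not specify them.

```python
import bisect

# Hypothetical mapping: intervals [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1.0]
INTERVAL_UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]
SCORES = [1, 2, 3, 4, 5]


def probability_to_score(probability):
    """Map the first probability output by the model to a user score."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # bisect_right returns the index of the interval containing the probability.
    return SCORES[bisect.bisect_right(INTERVAL_UPPER_BOUNDS, probability)]
```

With this mapping, a probability of 0.95 falls in the last interval and receives the highest score, while a value exactly on a boundary such as 0.2 is assigned to the upper interval.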
In an embodiment, the determining unit 1002, when determining the user type of the second user based on the user classification model, is further configured to:
determining whether the second user is a target user based on a third output result generated by the user classification model; the target user is a user who has an intention toward a set object.
In an embodiment, the determining unit 1002, when determining the user type of the second user based on the user classification model, is further configured to:
determining whether the second user is an attrition user based on a third output result generated by the user classification model; the attrition user is a user who has access behavior during a first time period and does not have access behavior during a second time period; the first time period is earlier than the second time period.
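The definition of an attrition user given above can be expressed as a simple rule over access dates, which is one way such a label might be derived for training data. The function name, the inclusive date ranges, and the requirement that the two periods do not overlap are assumptions for illustration; in the described apparatus the decision itself comes from the user classification model.

```python
from datetime import date


def is_attrition_user(access_dates, first_period, second_period):
    """Return True if the user accessed in the first period but not in the second.

    Each period is an inclusive (start, end) pair of dates, and the first
    period must end before the second one starts.
    """
    first_start, first_end = first_period
    second_start, second_end = second_period
    if first_end >= second_start:
        raise ValueError("the first period must be earlier than the second period")
    active_first = any(first_start <= d <= first_end for d in access_dates)
    active_second = any(second_start <= d <= second_end for d in access_dates)
    return active_first and not active_second
```

For example, a user whose only recorded access falls within the first period would be labeled as an attrition user under this rule.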
In an embodiment, the determining unit 1002, when determining the user type of the second user based on the user classification model, is further configured to:
determining whether the second user is an abnormal user based on a third output result generated by the user classification model; the abnormal user characterizes a user having abnormal access behavior.
In practice, the second obtaining unit 1001 and the determining unit 1002 may be implemented by a processor in the user classification apparatus. Of course, the processor needs to run the program stored in the memory to realize the functions of the above-described program modules.
It should be noted that, when the user classification apparatus provided in the embodiment of fig. 10 performs user classification, the division of each program module is merely exemplified, and in practical applications, the above processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the above-described processing. In addition, the user classification apparatus provided by the above embodiment and the user classification method embodiment belong to the same concept, and the specific implementation process thereof is described in the method embodiment, which is not described herein again.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 11 is a schematic diagram of a hardware composition structure of the electronic device according to the embodiment of the present application, and as shown in fig. 11, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
a processor 2, connected with the communication interface 1 to implement information interaction with other devices, and configured to execute, when running a computer program, the model training method or the user classification method provided by one or more of the above technical solutions; the computer program is stored in the memory 3.
In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 4 in fig. 11.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the embodiment of the present application may be applied to the processor 2, or may be implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by instructions in the form of hardware integrated logic circuits or software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.
When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.
In an exemplary embodiment, the present application further provides a computer-readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal and method may be implemented in other manners. The above-described device embodiments are only illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned computer-readable storage media comprise: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a computer-readable storage medium, which includes several instructions for causing an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned computer-readable storage media comprise: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method of model training, comprising:
obtaining a first data set of each of at least one first user; wherein the first data set contains at least one first data of the first user; said first data characterizing a user characteristic of said first user;
converting the first data into second data; the second data represents a characteristic value associated with the first label; the first label is used for marking the user type of the first user;
screening all second data corresponding to each first data set according to a set screening rule to obtain a second data set corresponding to each first user; the second data set comprises second data for user classification model training; the screening rules include at least one of the following rules: performing feature screening according to the correlation coefficient between the second data, performing feature screening according to the influence of the second data on the detection capability of a screening model, and performing feature screening according to prior information;
training at least one user classification model according to a set number of the second data sets; wherein the user classification model is used to determine a user type of the user.
2. The model training method of claim 1, wherein after acquiring at least one first data set, the method further comprises:
storing a first list corresponding to each first data set in the at least one first data set; wherein
the first list records first data in the first data set, characteristic groups where the first data are located and first labels of the first users according to a set data format.
3. The model training method of claim 1 or 2, wherein said converting the first data into second data comprises:
determining a first conversion mode according to the data type of the first data; the data type comprises a discrete data type or a continuous data type;
converting the first data into the second data according to the determined first conversion mode; wherein
the first conversion manner includes at least one of one-hot encoding, mapping encoding, discretization conversion, and logarithmic conversion.
4. The model training method according to claim 1, wherein the obtaining, according to the set filtering rule, the second data set corresponding to each first user from all the second data corresponding to each first data set by filtering comprises:
determining target second data in the first data set according to the first parameter; the first parameter characterizes a correlation between any two second data in the first data set in the same feature group;
determining a second data set in all target second data corresponding to the first data set according to the second parameter; the second parameter characterizes a correlation between any two target second data of different feature groupings.
5. The model training method of claim 4, wherein the determining the target second data in the first data set according to the first parameter comprises:
iterating the first data set to perform the following: determining a first parameter corresponding to the first data set; n second data with the highest first parameters in the first data set are removed; wherein n is an integer greater than 0;
after the iteration is completed, the first parameter of every two second data in the first data set is smaller than a first set threshold value; every two second data in the first data set are the target second data.
6. The model training method of claim 4, wherein the determining the target second data in the first data set according to the first parameter comprises:
determining a first parameter corresponding to the first data set;
and according to the first parameter, determining second data of the first data set, of which the first parameter is smaller than a first set threshold value, as the target second data.
7. The model training method according to claim 1, wherein the obtaining, by filtering according to the set filtering rule, the second data set corresponding to each first user from all the second data corresponding to each first data set includes:
all second data corresponding to the first data set are freely combined to generate at least one second data combination;
determining a third parameter corresponding to each second data combination according to the screening model; the third parameter represents an index value of a set index of the screening model under the condition that the second data combination is the input data of the screening model;
and determining the second data set according to the third parameter and a third set threshold.
8. The model training method of claim 1, wherein the at least one user classification model comprises a first user classification model and a second user classification model, the method further comprising:
constructing a third data set according to the second data set corresponding to each first user; the third data set is used for verifying the detection capability of the user classification model;
acquiring a first output result and a second output result; the first output result represents the user type of the first user in the third data set determined based on the first user classification model; the second output result represents the user type of the first user in the third data set determined based on the second user classification model;
determining a first performance index value according to the first output result, and determining a second performance index value according to the second output result; the first performance index value represents an index value of a performance index for measuring the detection capability of the first user classification model; the second performance index value represents an index value of a performance index for measuring the detection capability of the second user classification model;
and selecting a user classification model meeting the set performance index value from the first user classification model and the second user classification model according to the first performance index value and the second performance index value.
9. A method for classifying a user, comprising:
acquiring a fourth data set; the fourth data set characterizes at least one user data of a second user;
inputting the fourth data set to a user classification model, and determining a user type of a second user based on the user classification model; the user classification model is trained based on the method of any one of claims 1 to 7.
10. The user classification method of claim 9, wherein the inputting the fourth data set to a user classification model and determining a user type of a second user based on the user classification model comprises:
inputting the fourth data set to the user classification model, resulting in a first probability; said first probability characterizing a probability value of said second user being one of a class of users;
determining a user score corresponding to the first probability according to a set mapping relation; and the set mapping relation represents user scores corresponding to different probability intervals.
11. The method of claim 9, wherein the determining the user type of the second user based on the user classification model comprises:
determining whether the second user is a target user based on a third output result generated by the user classification model; the target user is a user who has an intention toward a set object.
12. The method of claim 9, wherein determining the user type of the second user based on the user classification model comprises:
determining whether the second user is an attrition user based on a third output result generated by the user classification model; the attrition user is a user who has access behavior during a first time period and does not have access behavior during a second time period; the first time period is earlier than the second time period.
13. The method of claim 9, wherein the determining the user type of the second user based on the user classification model comprises:
determining whether the second user is an abnormal user based on a third output result generated by the user classification model; the abnormal user characterizes a user having abnormal access behavior.
CN202210858459.XA 2022-07-20 2022-07-20 Model training method and user classification method Pending CN115186759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210858459.XA CN115186759A (en) 2022-07-20 2022-07-20 Model training method and user classification method

Publications (1)

Publication Number Publication Date
CN115186759A true CN115186759A (en) 2022-10-14

Family

ID=83518398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210858459.XA Pending CN115186759A (en) 2022-07-20 2022-07-20 Model training method and user classification method

Country Status (1)

Country Link
CN (1) CN115186759A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116703651A (en) * 2023-08-08 2023-09-05 成都秦川物联网科技股份有限公司 Intelligent gas data center operation management method, internet of things system and medium
CN116703651B (en) * 2023-08-08 2023-11-14 成都秦川物联网科技股份有限公司 Intelligent gas data center operation management method, internet of things system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination