CN111291798A

CN111291798A - User basic attribute prediction method based on ensemble learning

Info

Publication number: CN111291798A
Application number: CN202010070270.5A
Authority: CN
Inventors: 曹倩; 王曼; 刘立红; 左敏; 李海生
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2020-06-16
Anticipated expiration: 2040-01-21
Also published as: CN111291798B

Abstract

The invention relates to a user basic attribute prediction method based on ensemble learning. First, the multi-class classification problem is converted into several binary classification problems, and a fused LightGBM and FM model serves as the binary classifier for each of them. The binary prediction results are then combined with the original features to build a multi-class classification model. Experimental results show that the proposed fusion method improves the accuracy of user attribute prediction.

Description

User basic attribute prediction method based on ensemble learning

Technical Field

The invention relates to the technical field of ensemble learning, and in particular to a method for predicting basic user attributes from the installation and usage data of smartphone Apps (applications).

Background

With the development of the mobile internet, smartphones have become the most widely used mobile devices. The number of Apps currently available in application stores has exceeded four million, and the Apps that people install and use may be closely related to basic attributes such as gender and age. This data can reflect personal information such as a user's basic attributes, interest preferences and living habits. Mining user attributes in depth helps an application store understand the behavior of its users and recommend products in a targeted way; it also helps enterprises place internet advertisements accurately, thereby saving advertising cost.

Existing research has two limitations. On the one hand, it mainly predicts basic attributes from the Apps a user has installed; it mines the features of App usage data only sparsely and coarsely, and does not analyze in depth how often and for how long a user uses Apps or the order in which they are used. On the other hand, it mainly adopts traditional machine learning methods such as SVM and Bayesian classifiers, although ensemble learning, an important branch of machine learning, is gradually being applied to user attribute prediction. Existing ensemble learning algorithms also have disadvantages: a certain amount of information is lost when the problem is divided, and the final model fusion requires a time-consuming and complex parameter tuning process.

Disclosure of Invention

To address the problems that existing methods make little use of App usage data and predict basic attributes with low accuracy, the invention mines and predicts basic user attributes from both App installation data and App usage data. The technical scheme of the invention is as follows: a user basic attribute prediction method based on ensemble learning, which predicts the gender and age of a mobile phone user from the user's App installation list and App usage data, and comprises the following steps:

step 1: collecting data recorded from users' App installation and usage behavior, the data comprising user IDs, the list of installed Apps and the usage time of each App in the list; preprocessing the collected data and filtering out abnormal and missing data to obtain the preprocessed original data;

step 2: dividing the preprocessed original data into 12 binary classification data sets, comprising 1 gender binary classification and 11 age binary classifications, where the data sets differ only in their labels;

step 3: extracting features from the binary classification data sets, the features comprising basic statistical features (the number of Apps installed by the user; the number of App categories installed by the user; the number of Apps installed by the user in each category; the user's usage duration in each of the 24 hourly periods of the day; the user's maximum, minimum and average daily App usage duration; the user's average number of App openings per day; the earliest and latest times at which the user uses Apps), the user App usage preference features and the Applist2vec features;

step 4: fusing LightGBM with a factorization machine (FM) model to construct the binary classifiers, using LightGBM to extract high-dimensional combination features, inputting those features into the FM classifier, and training to obtain the prediction probability of each binary classifier;

step 5: concatenating the prediction probabilities with the features from step 3 to obtain the new training features {basic attribute features, Applist2vec features, combined user App usage preference features, probability 1, probability 2, …, probability 12}; combining gender and age to convert the problem into a multi-class classification problem, inputting the new training features into a multi-class classifier for training, and outputting the prediction result.

Further, the data recorded from users' App installation and usage behavior collected in step 1 specifically includes:

App usage behavior records of a plurality of users, comprising user IDs, user gender and age, the list of Apps installed by each user, and the opening and closing times of each App in the list.

Further, preprocessing the collected data and filtering out abnormal and missing data in step 1 specifically includes:

(1) user App usage time: based on the parsed timestamps of App usage, data in which the user opened or closed an App in an abnormal year, including 1970, 1975 and 2025, is removed;

(2) the user's daily usage duration: user behavior records with a usage duration of less than 0.5 hour are removed;

(3) number of App installations: Apps installed by fewer than 5 users are removed.

Further, the 1 gender binary classification and the 11 age binary classifications in step 2 are specifically:

the user's gender is male or female, coded as 1 and 2 respectively; the user's age is divided into 11 intervals; gender and age are combined into 22 classes for multi-class prediction, and the combined gender-age group is denoted sex_age, with sex_age = (sex - 1) · 11 + age,

where sex denotes the user's gender and age denotes the user's age group.

Further, dividing the preprocessed original data into 12 binary classification data sets in step 2 includes:

converting the basic attribute labels into the binary labels of the corresponding binary classifiers: the data set labels of binary classifier 1 are converted into age interval 1 and not age interval 1; the data set labels of binary classifier 2 are converted into age interval 2 and not age interval 2; and so on, until the data set labels of binary classifier 11 are converted into age interval 11 and not age interval 11; the data set labels of binary classifier 12 are converted into male and female. The binary classification data sets differ only in their labels; all other data is unchanged.

Further, extracting features from the binary classification data sets in step 3 includes:

extracting features from each binary classification data set according to the user's App installation and usage data and the App category information, the features comprising basic attribute features, user App usage preference features and Applist2vec features; the feature sets of the binary classification data sets differ only in the user App usage preference features;

(1) the basic attribute features include: the number of Apps installed by the user, the number of App categories installed by the user, the number of Apps installed by the user in each category, the user's usage duration in each of the 24 hourly periods of the day, the user's maximum, minimum and average daily App usage duration, the user's average number of App openings per day, and the earliest and latest times at which the user uses Apps;

(2) the user App usage preference features: the important Apps under each attribute are first selected based on information gain, and features are then extracted by applying TF-IDF (term frequency-inverse document frequency) to the usage sets of those important Apps;

(3) the Applist2vec features: each App is treated as a word, and the sequence of Apps each user uses over a period of time is treated as a document; the Embedding layer of word2vec is used to extract features, yielding a dimensionality-reduced App vector matrix; a word vector model is built with the CBOW network structure, and 20-dimensional continuous features are extracted from the user's App usage behavior data.

Further, constructing the binary classifiers in step 4 includes:

inputting the feature set into a LightGBM model for training, performing five-fold cross prediction on the training samples in the LightGBM model, determining which leaf node of each decision tree each sample falls into, recording that leaf node as 1 and all other leaf nodes as 0, and thereby extracting a high-dimensional combination 0-1 feature vector:

$$x'_i = g(x_i, \theta)_{\text{num\_tree} \times \text{num\_leaves}}$$

where $x_i$ denotes the feature vector of the i-th training sample, $x'_i$ denotes the high-dimensional combination 0-1 feature vector of the i-th training sample, and $g(\cdot)$ denotes the leaf-node mapping of the LightGBM classifier, which takes the value 1 when the i-th sample falls into a given leaf node and 0 otherwise; num_tree denotes the number of decision trees in the LightGBM model, and num_leaves denotes the number of leaf nodes on each tree.

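The FM model is expressed, in its standard second-order form, as:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

where $n$ denotes the number of features of the sample and $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors of size $k$:

$$\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}$$

The FM model is trained by stochastic gradient descent (SGD) to obtain the weight parameters of the model, and binary prediction is then performed on the input feature set.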

Further, step 5 comprises:

combining the user App usage preference features of the 12 classifiers and appending the prediction results of the 12 binary classifiers to the feature set, giving the new feature set {basic attribute features, Applist2vec features, combined user App usage preference features, probability 1, probability 2, …, probability 12}; inputting the new feature set into a multi-class classifier for training, and outputting the gender-age multi-class prediction result. The output of the multi-class classifier both retains the original feature information and integrates the prediction results of the 12 sub-classifiers.

Advantageous Effects

Compared with the prior art, the embodiments of the invention have the following beneficial effects: the gender and age of a user can be predicted by analyzing the installation and usage behavior of mobile Apps, and the attribute prediction algorithm based on ensemble learning effectively improves prediction accuracy.

Drawings

FIG. 1: a flow diagram of the user basic attribute prediction method based on ensemble learning provided by the invention;

FIG. 2: the classifier based on the fusion of LightGBM and FM provided by the invention.

Detailed Description

Embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention.

In this embodiment, the algorithm flow of the proposed method is shown in FIG. 1:

Step 1: Data collection and preprocessing

App usage behavior records of a large number of users are collected, comprising user IDs, user gender and age, the list of Apps installed by each user, and the opening and closing times of each App in the list. The user's gender is male or female, coded as 1 and 2 respectively; the user's age is divided into 11 intervals, coded by age as 0 (<15), 1 (15-20), 2 (20-25), …, and 10 (>60). Since gender and age are cross attributes that influence and are coupled with each other, the invention combines gender and age into 22 classes for multi-class prediction. The combined gender-age group is denoted sex_age, with sex_age = (sex - 1) · 11 + age, where sex denotes the user's gender and age denotes the user's age group.
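The label combination above can be illustrated with a minimal Python sketch. The function names `encode_sex_age` and `decode_sex_age` are hypothetical and are not part of the described method; the sketch only assumes the coding stated in the text (gender 1 or 2, age group 0 to 10).

```python
from typing import Tuple

NUM_AGE_GROUPS = 11  # age groups are coded 0..10 in the text


def encode_sex_age(sex: int, age: int) -> int:
    """Combine gender (1 = male, 2 = female) and age group (0..10) into one of 22 classes."""
    if sex not in (1, 2):
        raise ValueError("sex must be 1 (male) or 2 (female)")
    if not 0 <= age < NUM_AGE_GROUPS:
        raise ValueError("age group must be in 0..10")
    return (sex - 1) * NUM_AGE_GROUPS + age


def decode_sex_age(sex_age: int) -> Tuple[int, int]:
    """Recover (sex, age group) from a combined label in 0..21."""
    sex_index, age = divmod(sex_age, NUM_AGE_GROUPS)
    return sex_index + 1, age
```

Under this coding, male users occupy classes 0 to 10 and female users occupy classes 11 to 21.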

First, abnormal data in the data set is filtered out. The filtering scheme is as follows:

(1) user App usage time: the parsed timestamps show that the times at which users opened and closed Apps include a small amount of data from abnormal years such as 1970, 1975 and 2025, while most of the data comes from February and March 2017; therefore only the data from February and March is retained, and data from other months is treated as noise;

(2) the user's daily usage duration: too few hours of usage cause problems when sampling user data, so user behavior records with a usage duration of less than 0.5 hour are deleted;

(3) number of App installations: some niche Apps are installed by only a few users and contribute little to user classification, so Apps installed by fewer than 5 users are removed.
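The three filters above might be sketched as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with `user`, `app`, `open` and `close` keys), the function name `filter_usage`, and the reading of filter (2) as a total per user per day are all hypothetical choices, since the text does not specify them.

```python
from collections import defaultdict
from datetime import datetime


def filter_usage(records, installs, min_daily_hours=0.5, min_installers=5):
    """records: list of dicts with keys user, app, open, close (datetime values).
    installs: dict mapping each user to the set of Apps that user installed."""

    def in_window(ts):
        # (1) Keep only February and March 2017; other timestamps are noise.
        return ts.year == 2017 and ts.month in (2, 3)

    valid = [r for r in records
             if in_window(r["open"]) and in_window(r["close"])
             and r["close"] >= r["open"]]

    # (2) Assumption: the 0.5 hour threshold applies to a user's total per day.
    daily_hours = defaultdict(float)
    for r in valid:
        key = (r["user"], r["open"].date())
        daily_hours[key] += (r["close"] - r["open"]).total_seconds() / 3600.0
    valid = [r for r in valid
             if daily_hours[(r["user"], r["open"].date())] >= min_daily_hours]

    # (3) Drop Apps installed by fewer than min_installers users.
    installers = defaultdict(set)
    for user, apps in installs.items():
        for app in apps:
            installers[app].add(user)
    return [r for r in valid if len(installers[r["app"]]) >= min_installers]
```

The filters are applied in the order listed in the text; applying them in a different order could change which user-days pass the duration threshold.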

Step 2: partitioning into multiple binary datasets

Following the idea of stacking in ensemble learning, the original multi-class problem is divided into several binary classification problems, and a binary classifier is built for each of them, so that each binary classifier has a different training focus.

This process requires converting the multi-class data set labels into the binary labels of the corresponding problem. In the original data, gender prediction is a binary classification problem and age prediction is a multi-class problem. The basic attribute labels are converted into the binary labels of the corresponding binary classifiers: the data set labels of binary classifier 1 are converted into age interval 1 and not age interval 1; the data set labels of binary classifier 2 are converted into age interval 2 and not age interval 2; and so on, until the data set labels of binary classifier 11 are converted into age interval 11 and not age interval 11; the data set labels of binary classifier 12 are converted into male and female. The binary classification data sets differ only in their labels; all other data is unchanged.
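The label conversion can be sketched as a one-vs-rest relabelling. The function name `make_binary_label_sets` and the dictionary keys are hypothetical; the sketch indexes the age tasks by the age codes 0 to 10 from step 1 rather than by classifier numbers 1 to 11, and assumes that "male" is the positive class of the gender task.

```python
from typing import Dict, List

NUM_AGE_GROUPS = 11


def make_binary_label_sets(sex: List[int], age: List[int]) -> Dict[str, List[int]]:
    """Build 12 binary label vectors: 11 age one-vs-rest tasks plus 1 gender task.

    sex[i] is 1 (male) or 2 (female); age[i] is the age group code 0..10.
    Only the labels differ between tasks; the feature rows stay the same.
    """
    if len(sex) != len(age):
        raise ValueError("sex and age must have the same length")
    label_sets = {}
    for group in range(NUM_AGE_GROUPS):
        # 1 if the user is in this age interval, 0 otherwise.
        label_sets["age_%d" % group] = [1 if a == group else 0 for a in age]
    # Assumption: male (code 1) is the positive class of the gender task.
    label_sets["gender"] = [1 if s == 1 else 0 for s in sex]
    return label_sets
```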

Step 3: Feature set extraction

Features are extracted from each binary classification data set according to the user's App installation and usage data and the App category information, and comprise basic attribute features, user App usage preference features and Applist2vec features. The feature sets of the binary classification data sets differ only in the user App usage preference features.

(1) Basic attribute features

The basic attribute features include: the number of Apps installed by the user, the number of App categories installed by the user, the number of Apps installed by the user in each category, the user's usage duration in each of the 24 hourly periods of the day, the user's maximum, minimum and average daily App usage duration, the user's average number of App openings per day, and the earliest and latest times at which the user uses Apps.
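Two of these statistics are sketched below for illustration: the install counts and the usage duration per hour of day. The helper names `install_counts` and `hourly_usage_hours`, the `app_category` mapping and the `"unknown"` fallback category are hypothetical, and sessions are assumed to be `(open, close)` datetime pairs.

```python
from collections import Counter
from datetime import datetime, timedelta


def install_counts(apps, app_category):
    """Return (number of Apps, number of categories, Apps per category)."""
    per_category = Counter(app_category.get(app, "unknown") for app in apps)
    return len(apps), len(per_category), per_category


def hourly_usage_hours(sessions):
    """sessions: list of (open, close) datetimes for one user.
    Returns 24 totals in hours, one per hour of the day."""
    buckets = [0.0] * 24
    for start, end in sessions:
        cur = start
        while cur < end:
            # Split the session at each hour boundary it crosses.
            next_hour = (cur.replace(minute=0, second=0, microsecond=0)
                         + timedelta(hours=1))
            stop = min(end, next_hour)
            buckets[cur.hour] += (stop - cur).total_seconds() / 3600.0
            cur = stop
    return buckets
```

Splitting each session at hour boundaries means a session that spans several hours contributes to every hourly period it overlaps, rather than only to the hour in which it started.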

(2) User App usage preference feature

To build the user App usage preference features, the important Apps under each attribute are first selected based on information gain, and features are then extracted by applying TF-IDF to the usage sets of those important Apps.

From the users' App installation data, the information gain of each App is calculated and ranked for each user attribute. The information gain of a mobile phone App A with respect to a specific user attribute Φ can be expressed as:

$$IG(\Phi, A) = H(\Phi) - H(\Phi \mid A)$$

where $H(\Phi)$ denotes the information entropy of the user attribute and $H(\Phi \mid A)$ denotes the conditional entropy of the attribute given App A. From the list of Apps installed by each user and the attribute information, the 100 Apps with the highest information gain for attribute Φ can be obtained, together with the corresponding set of information gains $IG(\Phi) = (IG_1, \ldots, IG_{100})$.
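A minimal sketch of the information gain ranking follows. It assumes that the condition on App A is binary (the user has or has not installed the App) and uses base-2 logarithms; neither detail is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of attribute values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, installed):
    """IG = H(attribute) - H(attribute | App installed or not).

    labels[i] is the attribute value of user i; installed[i] is True if
    user i has installed the App.
    """
    n = len(labels)
    with_app = [y for y, flag in zip(labels, installed) if flag]
    without_app = [y for y, flag in zip(labels, installed) if not flag]
    conditional = ((len(with_app) / n) * entropy(with_app)
                   + (len(without_app) / n) * entropy(without_app))
    return entropy(labels) - conditional


def top_apps_by_gain(labels, user_apps, top_n=100):
    """user_apps[i] is the set of Apps installed by user i.
    Returns the top_n (app, gain) pairs, highest gain first."""
    all_apps = set().union(*user_apps) if user_apps else set()
    gains = {app: information_gain(labels, [app in apps for apps in user_apps])
             for app in all_apps}
    return sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

An App installed only by users of one class carries the maximum gain, while an App installed equally often in every class carries none.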

From the user's App usage data, the set of Apps a user uses over a period of time is treated as a document and each App is treated as a word in that document. The TF-IDF value of each important App is calculated as:

$$\mathrm{TFIDF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{j : App_i \in d_j\}|}$$

where $n_{i,j}$ is the number of occurrences of $App_i$ in the App usage set $d_j$ of user $j$, $\sum_k n_{k,j}$ is the total number of App occurrences in the usage set $d_j$, $|D|$ is the total number of user App usage sets, i.e. the total number of users, and $|\{j : App_i \in d_j\}|$ is the number of App usage sets that contain $App_i$.

The TF-IDF values of the 100 important Apps are therefore $\mathrm{TFIDF} = (\mathrm{TFIDF}_1, \ldots, \mathrm{TFIDF}_{100})$. Multiplying each by the corresponding information gain gives the 100-dimensional TF-IDF-weighted information gain, which is the user App usage preference feature, denoted TFIDF_IG:

$$\mathrm{TFIDF\_IG}_i = \mathrm{TFIDF}_i \cdot IG_i \quad (i = 1, 2, \ldots, 100)$$
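The weighting can be sketched as below. The function name `tfidf_ig_features` is hypothetical, the natural logarithm is an assumption (the text does not state the base), and each usage set is assumed to be a list of App identifiers in which repeats stand for repeated use.

```python
import math
from collections import Counter


def tfidf_ig_features(usage_sets, important_apps, gains):
    """usage_sets: one list of App ids per user (repeats = repeated use).
    important_apps: ordered list of the selected Apps.
    gains: information gain of each selected App, in the same order.
    Returns one feature row per user, one column per important App."""
    num_users = len(usage_sets)
    doc_freq = Counter()
    for apps in usage_sets:
        doc_freq.update(set(apps))  # count each App once per user

    features = []
    for apps in usage_sets:
        counts = Counter(apps)
        total = sum(counts.values())
        row = []
        for app, gain in zip(important_apps, gains):
            tf = counts[app] / total if total else 0.0
            idf = math.log(num_users / doc_freq[app]) if doc_freq[app] else 0.0
            row.append(tf * idf * gain)  # TF-IDF weighted by information gain
        features.append(row)
    return features
```

With 100 important Apps this produces the 100-dimensional usage preference vector per user; an App used by every user gets an IDF of zero and therefore contributes nothing.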

(3) Applist2vec feature

By analogy with word and text modeling, each App is treated as a word and the sequence of Apps each user uses over a period of time is treated as a document. Features are extracted with the Embedding layer of word2vec, which yields a dimensionality-reduced App vector matrix. The invention builds the word vector model with the CBOW network structure and extracts 20-dimensional continuous features from the user's App usage behavior data. This takes the order in which the user uses Apps into account, and it also reduces the sparsity of the matrix and improves computational efficiency.
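To make the word analogy concrete, the sketch below shows only how CBOW training examples could be formed from one user's App sequence: each App is predicted from the Apps used just before and after it. It does not train the embedding itself. The function name `cbow_pairs` and the window size are hypothetical; the text does not state the window used.

```python
def cbow_pairs(app_sequence, window=2):
    """Build (context Apps, target App) training pairs as used by CBOW.

    app_sequence: the Apps one user opened, in chronological order.
    window: how many neighbouring Apps on each side form the context.
    """
    pairs = []
    for i, target in enumerate(app_sequence):
        left = app_sequence[max(0, i - window):i]
        right = app_sequence[i + 1:i + 1 + window]
        context = left + right
        if context:  # a single-App sequence has no context to learn from
            pairs.append((context, target))
    return pairs
```

Because the context is taken from neighbouring positions, Apps that tend to be used close together in time end up with similar vectors after training.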

Step 4: Binary classifier construction

In contrast to the GBDT + LR model, the invention performs binary classification with a LightGBM + FM (light gradient boosting machine + factorization machine) model, which selects and combines features automatically. As shown in FIG. 2, the feature set is input into a LightGBM model for training, five-fold cross prediction is performed on the training samples in the LightGBM model, the leaf node of each decision tree that each sample falls into is determined, that leaf node is recorded as 1 and all other leaf nodes as 0, and a high-dimensional combination 0-1 feature vector is extracted:

$$x'_i = g(x_i, \theta)_{\text{num\_tree} \times \text{num\_leaves}}$$

where $x_i$ denotes the feature vector of the i-th training sample, $x'_i$ denotes the high-dimensional combination 0-1 feature vector of the i-th training sample, and $g(\cdot)$ denotes the leaf-node mapping of the LightGBM classifier, which takes the value 1 when the i-th sample falls into a given leaf node and 0 otherwise; num_tree denotes the number of decision trees in the LightGBM model, and num_leaves denotes the number of leaf nodes on each tree.
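The leaf encoding can be sketched independently of the tree model. The function below assumes the leaf index of every sample in every tree is already available; with LightGBM these indices can be obtained by calling `predict` with `pred_leaf=True`, which is left out here so the sketch stays self-contained. The function name `one_hot_leaves` is hypothetical.

```python
def one_hot_leaves(leaf_indices, num_leaves):
    """Turn per-tree leaf indices into a 0-1 vector of length num_tree * num_leaves.

    leaf_indices[i][t] is the index (0-based) of the leaf that sample i
    reaches in tree t. Exactly one position per tree is set to 1.
    """
    encoded = []
    for sample in leaf_indices:
        row = [0] * (len(sample) * num_leaves)
        for tree, leaf in enumerate(sample):
            if not 0 <= leaf < num_leaves:
                raise ValueError("leaf index out of range")
            row[tree * num_leaves + leaf] = 1
        encoded.append(row)
    return encoded
```

Each row contains exactly num_tree ones, so the resulting vectors are sparse, which is the setting the FM model in this step is designed for.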

FM is designed to address the problem that the parameters of feature combinations cannot be learned adequately from high-dimensional sparse data. The model is expressed, in its standard second-order form, as:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

where $n$ denotes the number of features of the sample and $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors of size $k$:

$$\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}$$

The FM model is trained by stochastic gradient descent (SGD) to obtain the weight parameters of the model, after which binary prediction can be performed on the input feature set.
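A minimal sketch of a second-order factorization machine trained by SGD is given below. It is an illustration, not the implementation used in the invention: the class name, the learning rate, the initialisation and the use of log loss with a sigmoid output are assumptions. The pairwise term is computed with the well-known identity that reduces the double sum over feature pairs to a single pass over the non-zero features.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class FactorizationMachine:
    def __init__(self, n_features, k, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.lr = lr
        self.w0 = 0.0
        self.w = [0.0] * n_features
        # One latent vector of size k per feature, small random init.
        self.v = [[rng.gauss(0.0, 0.01) for _ in range(k)]
                  for _ in range(n_features)]

    def _factor_sums(self, x):
        # x is sparse: dict feature_index -> value.
        return [sum(self.v[i][f] * xi for i, xi in x.items())
                for f in range(self.k)]

    def score(self, x):
        sums = self._factor_sums(x)
        squares = [sum((self.v[i][f] * xi) ** 2 for i, xi in x.items())
                   for f in range(self.k)]
        linear = self.w0 + sum(self.w[i] * xi for i, xi in x.items())
        # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f (sums_f^2 - squares_f)
        pairwise = 0.5 * sum(s * s - q for s, q in zip(sums, squares))
        return linear + pairwise

    def predict_proba(self, x):
        return _sigmoid(self.score(x))

    def sgd_step(self, x, y):
        """One SGD update on log loss for a sample with label y in {0, 1}."""
        sums = self._factor_sums(x)
        g = self.predict_proba(x) - y  # d(loss)/d(score)
        self.w0 -= self.lr * g
        for i, xi in x.items():
            self.w[i] -= self.lr * g * xi
            for f in range(self.k):
                grad = xi * (sums[f] - self.v[i][f] * xi)
                self.v[i][f] -= self.lr * g * grad
```

In the pipeline described here, `x` would be the sparse 0-1 leaf vector produced by LightGBM, so only num_tree entries are non-zero per sample and each update touches only those features.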

Step 5: Multi-class prediction

In step 4, each binary classifier builds its model from a different set of binary labels, so each binary model achieves high accuracy only on its specific attribute. In addition, because different App usage data correlates differently with each attribute, the App usage data is screened separately for the training set of each binary model: each binary model selects the top 100 Apps with the highest information gain with respect to its own labels, and the user App usage preference features are built from the user's usage duration of those 100 Apps. The differences in labels and features make the binary models differ considerably from one another, which benefits the subsequent model fusion.

The user App usage preference features of the 12 classifiers are combined, and the prediction results of the 12 binary classifiers are appended to the feature set, giving the new feature set {basic attribute features, Applist2vec features, combined user App usage preference features, probability 1, probability 2, …, probability 12}. The new feature set is input into a multi-class classifier for training, which outputs the gender-age multi-class prediction result. The output of the multi-class classifier both retains the original feature information and integrates the prediction results of the 12 sub-classifiers, so the accuracy of the algorithm is improved while the risk of over-fitting is reduced.
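The construction of the new feature set can be sketched as a row-wise concatenation. The text does not say how the 12 task-specific preference feature sets are combined, so plain concatenation is an assumption here, and the function name `build_stacked_features` is hypothetical.

```python
def build_stacked_features(basic, applist2vec, preference_by_task, prob_by_task):
    """Concatenate per-user features for the multi-class classifier.

    basic[u], applist2vec[u]: feature rows of user u.
    preference_by_task[t][u]: usage preference row of user u for binary task t.
    prob_by_task[t][u]: predicted probability of binary task t for user u.
    """
    rows = []
    for u in range(len(basic)):
        row = list(basic[u]) + list(applist2vec[u])
        for preference in preference_by_task:
            row.extend(preference[u])  # assumption: preferences are concatenated
        row.extend(probs[u] for probs in prob_by_task)
        rows.append(row)
    return rows
```

With 12 binary tasks, the last 12 columns of each row are the probabilities referred to as probability 1 to probability 12 in the text.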

Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions that make use of the inventive concept are protected, provided they remain within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A user basic attribute prediction method based on ensemble learning is characterized in that: based on an App installation list and App use data of a mobile phone user, the gender and age of the user are predicted, and the method comprises the following steps:

2. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein: the data of the user App installation and use behavior record collected in the step 1 specifically include:

3. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein: preprocessing the collected data of the user App installation and use behavior records in the step 1, wherein the abnormal and missing data filtering specifically comprises the following steps:

4. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein the 1 gender binary classification and the 11 age binary classifications in step 2 are specifically:

the user's gender is male or female, coded as 1 and 2 respectively; the user's age is divided into 11 intervals; gender and age are combined into 22 classes for multi-class prediction, and the combined gender-age group is denoted sex_age, with sex_age = (sex - 1) · 11 + age, where sex denotes the user's gender and age denotes the user's age group.

5. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein the step 2 of dividing the preprocessed raw data into 12 binary data sets comprises:

6. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein: the step 3: extracting features from the binary data set includes:

(3) the Applist2vec features: each App is treated as a word, and the sequence of Apps each user uses over a period of time is treated as a document; the Embedding layer of word2vec is used to extract features, yielding a dimensionality-reduced App vector matrix; a word vector model is built with the CBOW network structure, and 20-dimensional continuous features are extracted from the user's App usage behavior data.

7. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein constructing the binary classifiers in step 4 includes:

$$x'_i = g(x_i, \theta)_{\text{num\_tree} \times \text{num\_leaves}}$$

where $x_i$ denotes the feature vector of the i-th training sample, $x'_i$ denotes the high-dimensional combination 0-1 feature vector of the i-th training sample, and $g(\cdot)$ denotes the leaf-node mapping of the LightGBM classifier, which takes the value 1 when the i-th sample falls into a given leaf node and 0 otherwise; num_tree denotes the number of decision trees in the LightGBM model, and num_leaves denotes the number of leaf nodes on each tree;

the FM model is expressed, in its standard second-order form, as:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

where $n$ denotes the number of features of the sample and $\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors of size $k$:

$$\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}$$

8. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein: the step 5 comprises the following steps: