CN111291798A - User basic attribute prediction method based on ensemble learning - Google Patents

Publication number
CN111291798A
Authority
CN
China
Prior art keywords
user
app
data
age
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010070270.5A
Other languages
Chinese (zh)
Other versions
CN111291798B (en)
Inventor
曹倩
王曼
刘立红
左敏
李海生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University
Priority to CN202010070270.5A
Publication of CN111291798A
Application granted
Publication of CN111291798B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3438 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment; monitoring of user actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a user basic attribute prediction method based on ensemble learning. First, the multi-class problem is converted into several binary classification problems, and a fused LightGBM and FM model serves as the binary classifier for each of them. The binary prediction results are then combined with the original features to build a multi-class model. Experimental results show that the proposed fusion method improves user attribute prediction.

Description

User basic attribute prediction method based on ensemble learning
Technical Field
The invention relates to the technical field of ensemble learning, and in particular to a method for predicting users' basic attributes from smartphone App (application) installation and usage data.
Background
With the development of the mobile internet, smartphones have become people's most commonly used mobile devices. The number of Apps available in application stores has exceeded four million, and the Apps that people install and use may be closely related to basic attributes such as gender and age. This information can reflect a user's personal profile, including basic attributes, interest preferences and living habits. Mining user attributes in depth can help an application store understand users' behavioral characteristics and recommend products in a targeted way; it can also help enterprises place internet advertisements accurately, thereby saving advertising cost.
Existing research mainly predicts basic attributes from the Apps a user has installed. It mines the features of App usage data rarely and only coarsely, without analyzing in depth the frequency and duration of App use or the order in which Apps are used. In addition, existing research mainly adopts traditional machine learning methods such as SVM and Bayes, while ensemble learning, an important branch of machine learning, is gradually being applied to user attribute prediction. However, existing ensemble-learning algorithms also have shortcomings: some information is lost when the problem is divided, and the final model fusion involves time-consuming and complex parameter tuning.
Disclosure of Invention
To address the problems that existing methods make little use of users' App usage data and predict basic attributes with low accuracy, the invention mines App installation and usage data to predict users' basic attributes. The technical scheme of the invention is as follows: a user basic attribute prediction method based on ensemble learning predicts the gender and age of a mobile phone user from the user's App installation list and App usage data, and comprises the following steps:
step 1, collecting records of users' App installation and usage behavior, including the user ID, the list of installed Apps and the usage time of each App in the list; preprocessing the collected records to filter out abnormal and missing data; and obtaining the preprocessed raw data;
step 2, dividing the preprocessed raw data into 12 binary classification data sets, comprising 1 gender binary classification and 11 age binary classifications, where the data sets differ only in their labels;
step 3, extracting features from the binary classification data sets, the features comprising basic statistical features (the number of Apps installed by the user; the number of App categories installed by the user; the number of installed Apps in each category; the usage duration in each of the 24 hours of the day; the maximum, minimum and average daily App usage duration; the average number of App launches per day; and the earliest and latest times at which the user uses an App), the user App usage preference feature and the Applist2vec feature;
step 4, fusing LightGBM with a factorization machine (FM) model to construct the binary classifiers: LightGBM extracts high-dimensional combined features, which are input into the FM classifier, and training yields the prediction probability of each binary classifier;
step 5, concatenating the prediction probabilities with the features from step 3 to obtain the new training features {basic attribute features, Applist2vec features, merged user App usage preference features, probability 1, probability 2, ..., probability 12}; combining gender and age to turn the task into a multi-class problem, inputting the new training features into a multi-class classifier for training, and outputting the prediction result.
Further, the records of users' App installation and usage behavior collected in step 1 specifically include:
App usage behavior records of a plurality of users, comprising the user ID, the user's gender and age, the list of Apps installed by the user, and the opening and closing time of each App in the list.
Further, preprocessing the collected records in step 1 and filtering abnormal and missing data specifically includes:
(1) user App usage time: based on the parsed timestamps at which users open and close Apps, removing data from abnormal years, including 1970, 1975 and 2025;
(2) user daily usage duration: removing user behavior records with a usage duration of less than 0.5 hour;
(3) App installation count: removing Apps installed by fewer than 5 users.
Further, the 1 gender binary classification and 11 age binary classifications in step 2 are specifically:
the user's gender is male or female, labeled 1 and 2 respectively; the user's age is divided into 11 intervals; gender and age are combined into 22 classes for multi-class prediction, and the combined gender-age group is denoted sex_age, where sex_age = (sex - 1) · 11 + age.
Here sex denotes the user's gender and age denotes the user's age bracket.
Further, dividing the preprocessed raw data into 12 binary classification data sets in step 2 includes:
converting the basic attribute labels into the binary labels of the corresponding binary classifiers, namely: the data set labels of binary classifier 1 become age interval 1 and not age interval 1; those of binary classifier 2 become age interval 2 and not age interval 2; and so on, until those of binary classifier 11 become age interval 11 and not age interval 11; the data set labels of binary classifier 12 become male and female. The data sets differ only in their labels; all other data are unchanged.
Further, extracting features from the binary classification data sets in step 3 includes:
extracting features from each binary classification data set according to the user's App installation and usage data and the App category information, the features comprising basic attribute features, user App usage preference features and Applist2vec features; the feature sets of the binary classification data sets differ only in the user App usage preference features;
(1) the basic attribute features include: the number of Apps installed by the user; the number of App categories installed by the user; the number of installed Apps in each category; the usage duration in each of the 24 hours of the day; the maximum, minimum and average daily App usage duration; the average number of App launches per day; and the earliest and latest times at which the user uses an App;
(2) the user App usage preference features are obtained by first extracting the important Apps under each attribute based on information gain, and then applying TF-IDF (term frequency-inverse document frequency) to the usage set of the important Apps;
(3) the Applist2vec feature treats each App as a word and the sequence of Apps each user uses within a period of time as a document set, and extracts features with the Embedding layer of word2vec to obtain a dimensionality-reduced App vector matrix; a word vector model is built with the CBOW network structure, and 20-dimensional continuous features are extracted from the user's App usage behavior data.
Further, constructing the binary classifiers in step 4 includes:
inputting the feature set into a LightGBM model for training, performing five-fold cross prediction on the training samples, determining which leaf node of each decision tree each sample falls into, marking that leaf node as 1 and all other leaf nodes as 0, and thereby extracting a high-dimensional combined 0-1 feature vector:
$$x'_i = g(x_i, \theta)_{\text{num\_tree} \times \text{num\_leaves}}$$
where $x_i$ is the feature vector of the i-th training sample, $x'_i$ is the high-dimensional combined 0-1 feature vector of the i-th training sample, g(·) denotes a leaf node of the LightGBM classifier and takes 1 when the i-th sample falls into that leaf node and 0 otherwise, num_tree is the number of decision trees in the LightGBM model, and num_leaves is the number of leaf nodes on each tree.
The FM model is expressed as:
$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$
where n is the number of features of the sample,
$$w_0 \in \mathbb{R}, \quad w \in \mathbb{R}^{n}, \quad V \in \mathbb{R}^{n \times k},$$
$\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors of size k,
$$\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}$$
The FM model is trained by stochastic gradient descent (SGD) to obtain its weight parameters, after which binary prediction is performed on the input feature set.
Further, step 5 comprises:
merging the user App usage preference features of the 12 binary classifiers and concatenating the prediction results of the 12 binary classifiers to the feature set, giving the new feature set {basic attribute features, Applist2vec features, merged user App usage preference features, probability 1, probability 2, ..., probability 12}; inputting the new feature set into a multi-class classifier for training; and outputting the gender-age multi-class prediction result. The output of the multi-class classifier retains the original feature information and integrates the prediction results of the 12 sub-classifiers.
Advantageous effects:
Compared with the prior art, the embodiments of the invention have the following beneficial effects: the user's gender and age can be predicted by analyzing mobile users' App installation and usage behavior, and the attribute prediction algorithm based on ensemble learning effectively improves prediction accuracy.
Drawings
FIG. 1: flow diagram of the user basic attribute prediction method based on ensemble learning provided by the invention;
FIG. 2: the binary classifier based on the fusion of LightGBM and FM provided by the invention.
Detailed Description
The embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention.
In this embodiment, FIG. 1 shows the algorithm flow of the proposed method:
Step 1: Data collection and preprocessing
App usage behavior records of a large number of users are collected, comprising the user ID, the user's gender and age, the list of Apps installed by the user, and the opening and closing time of each App in the list. Gender is male or female, labeled 1 and 2 respectively. Age is divided into 11 intervals, labeled 0 (<15), 1 (15-20), 2 (20-25), ..., 10 (>60). Because gender and age are cross attributes that influence and couple with each other, the invention combines them into 22 classes for multi-class prediction. The combined gender-age group is denoted sex_age, where sex_age = (sex - 1) · 11 + age, sex denotes the user's gender and age denotes the user's age bracket.
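The joint label can be illustrated with a short sketch. The function names `encode_sex_age` and `decode_sex_age` are hypothetical and not part of the patent; the arithmetic follows the formula sex_age = (sex - 1) · 11 + age with sex in {1, 2} and age bracket in 0..10.

```python
def encode_sex_age(sex, age):
    """Combine gender (1 = male, 2 = female) and age bracket (0..10) into one of 22 classes."""
    if sex not in (1, 2):
        raise ValueError("sex must be 1 (male) or 2 (female)")
    if not 0 <= age <= 10:
        raise ValueError("age bracket must be in 0..10")
    return (sex - 1) * 11 + age


def decode_sex_age(sex_age):
    """Invert encode_sex_age: recover (sex, age) from the joint class index 0..21."""
    sex_minus_one, age = divmod(sex_age, 11)
    return sex_minus_one + 1, age
```

Under this encoding, classes 0 to 10 are the male age brackets and classes 11 to 21 the female ones.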
First, abnormal data in the data set are filtered out as follows:
(1) user App usage time: the parsed timestamps at which users open and close Apps include a small amount of data from abnormal years such as 1970, 1975 and 2025, while most of the data come from February and March 2017; therefore only the data from February and March 2017 are retained, and data from other months are treated as noise;
(2) user daily usage duration: too few hours of usage causes problems when sampling user data, so user behavior records with a usage duration of less than 0.5 hour are deleted;
(3) App installation count: niche Apps installed by only a few users contribute little to user classification, so Apps installed by fewer than 5 users are removed.
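The three filters might be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the record layout, the function name `filter_sessions`, and the choice to apply the 0.5-hour threshold per user per calendar day are all assumptions, as is the extra check that an App is not closed before it is opened.

```python
from collections import defaultdict
from datetime import datetime


def filter_sessions(sessions, installs, min_daily_hours=0.5, min_app_users=5):
    """sessions: iterable of (user_id, app_id, open_dt, close_dt) tuples.
    installs: dict mapping user_id -> set of installed app_ids.
    Returns the sessions that survive all three filters."""
    # (3) Keep only Apps installed by at least `min_app_users` users.
    app_users = defaultdict(set)
    for user, apps in installs.items():
        for app in apps:
            app_users[app].add(user)
    kept_apps = {app for app, users in app_users.items() if len(users) >= min_app_users}

    # (1) Keep only sessions opened in February or March 2017.
    # Assumption: sessions whose close time precedes the open time are also dropped.
    in_window = [
        s for s in sessions
        if s[1] in kept_apps
        and s[2].year == 2017 and s[2].month in (2, 3)
        and s[3] >= s[2]
    ]

    # (2) Drop user-days whose total usage is below the threshold (assumed granularity).
    daily_hours = defaultdict(float)
    for user, _app, start, end in in_window:
        daily_hours[(user, start.date())] += (end - start).total_seconds() / 3600.0
    return [s for s in in_window if daily_hours[(s[0], s[2].date())] >= min_daily_hours]
```

Filtering rare Apps first means their sessions do not count toward a user's daily total; the patent does not state an order, so this ordering is a design choice of the sketch.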
Step 2: Partitioning into multiple binary datasets
Borrowing the idea of stacking from ensemble learning, the original multi-class problem is divided into several binary classification problems, and a binary classifier is built for each; each binary classifier has a different training focus.
This process requires converting the multi-class data labels into the binary labels of the corresponding problem. In the original data, gender prediction is a binary classification problem and age prediction is a multi-class problem. The basic attribute labels are converted into the binary labels of the corresponding binary classifiers, namely: the data set labels of binary classifier 1 become age interval 1 and not age interval 1; those of binary classifier 2 become age interval 2 and not age interval 2; and so on, until those of binary classifier 11 become age interval 11 and not age interval 11; the data set labels of binary classifier 12 become male and female. The data sets differ only in their labels; all other data are unchanged.
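The label conversion amounts to a one-vs-rest split for age plus one gender split. The sketch below is illustrative only: `make_binary_label_sets` is a hypothetical name, and it assumes that binary classifier k (1-based) corresponds to age bracket k - 1 in the 0..10 labeling of step 1.

```python
def make_binary_label_sets(sex, age, n_age_bins=11):
    """sex: list of gender labels (1 = male, 2 = female), one per user.
    age: list of age bracket indices (0..n_age_bins-1), one per user.
    Returns 12 label vectors: 11 one-vs-rest age vectors followed by one gender vector."""
    label_sets = []
    for bracket in range(n_age_bins):
        # Binary classifier (bracket + 1): in this age interval vs. not in it.
        label_sets.append([1 if a == bracket else 0 for a in age])
    # Binary classifier 12: male (1) vs. female (0).
    label_sets.append([1 if s == 1 else 0 for s in sex])
    return label_sets
```

Only the label vectors differ between the 12 data sets; the feature rows for the users would be shared.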
Step 3: Feature set extraction
Features are extracted from each binary classification data set according to the user's App installation and usage data and the App category information; they comprise basic attribute features, user App usage preference features and Applist2vec features. The feature sets of the binary classification data sets differ only in the user App usage preference features.
(1) Basic attribute features
The basic attribute features include: the number of Apps installed by the user; the number of App categories installed by the user; the number of installed Apps in each category; the usage duration in each of the 24 hours of the day; the maximum, minimum and average daily App usage duration; the average number of App launches per day; and the earliest and latest times at which the user uses an App.
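As one example of these statistics, the usage duration per hour of day could be computed as sketched below. The function name `hourly_usage_hours` and the input layout are assumptions; the sketch splits each session at clock-hour boundaries so that a session spanning several hours contributes to each of them.

```python
from datetime import datetime, timedelta


def hourly_usage_hours(sessions):
    """sessions: iterable of (open_dt, close_dt) pairs for one user.
    Returns a list of 24 floats: total hours of App use falling in each hour of the day."""
    buckets = [0.0] * 24
    for start, end in sessions:
        cursor = start
        while cursor < end:
            # End of the clock hour that contains `cursor`.
            next_hour = cursor.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
            segment_end = min(end, next_hour)
            buckets[cursor.hour] += (segment_end - cursor).total_seconds() / 3600.0
            cursor = segment_end
    return buckets
```

The daily maximum, minimum and average durations and the earliest and latest usage times can be derived from the same session list in a similar way.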
(2) User App usage preference feature
The user App usage preference feature is built by first extracting the important Apps under each attribute based on information gain, and then applying TF-IDF to the usage set of those important Apps.
From the user App installation data, the information gain of each App is computed and ranked for each user attribute. The information gain of a mobile phone App A for a specific user attribute Φ can be expressed as:
$$IG(\Phi, A) = H(\Phi) - H(\Phi \mid A)$$
where H(Φ) is the information entropy of the user attribute and H(Φ | A) is the conditional entropy given App A. Based on the users' installed App lists and attribute information, the 100 Apps with the highest information gain for attribute Φ are obtained, together with the corresponding set of information gains $IG(\Phi) = (IG_1, \ldots, IG_{100})$.
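A minimal sketch of the information gain ranking follows. It assumes that App A is treated as a binary variable (installed or not installed) and that entropy uses log base 2; neither detail is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of attribute values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, installed):
    """IG(attribute, App) = H(attribute) - H(attribute | App installed or not).
    labels: attribute value per user; installed: bool per user for one App."""
    n = len(labels)
    with_app = [y for y, has in zip(labels, installed) if has]
    without_app = [y for y, has in zip(labels, installed) if not has]
    conditional = (len(with_app) / n) * entropy(with_app) \
        + (len(without_app) / n) * entropy(without_app)
    return entropy(labels) - conditional


def top_apps_by_gain(labels, install_matrix, top_k=100):
    """install_matrix: dict app_id -> list of bools (one per user).
    Returns the top_k (app_id, gain) pairs, highest gain first."""
    gains = {app: information_gain(labels, col) for app, col in install_matrix.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

An App installed only by one gender would score the maximum gain for the gender attribute, while an App installed equally across groups would score zero.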
From the user App usage data, the set of Apps a user uses within a period of time is treated as a document and each App as a word in that document. The TF-IDF value of each important App is computed as:
$$TFIDF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{|D|}{|\{ j : w_i \in d_j \}|}$$
where $n_{i,j}$ is the number of occurrences of App i in the user App usage set $d_j$, $\sum_k n_{k,j}$ is the total number of App occurrences in the usage set $d_j$, |D| is the total number of user App usage sets (that is, the total number of users), and $|\{ j : w_i \in d_j \}|$ is the number of App usage sets that contain App i.
The TF-IDF values of the 100 important Apps are therefore $TFIDF = (TFIDF_1, \ldots, TFIDF_{100})$. Multiplying each by the corresponding information gain gives the 100-dimensional TF-IDF weighted by information gain, which is the user App usage preference feature, denoted TFIDF_IG:
$$TFIDF\_IG = TFIDF_i \cdot IG_i \quad (i = 1, 2, \ldots, 100)$$
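The weighting of TF-IDF by information gain could look like the sketch below. It uses the common TF-IDF form (term frequency times the log of total documents over document frequency) with a natural logarithm and no smoothing, which is an assumption; the function name and input layout are likewise hypothetical.

```python
import math


def tfidf_ig_features(usage_counts, important_apps, gains):
    """usage_counts: dict user -> dict app -> usage count (the user's 'document').
    important_apps: ordered list of selected Apps; gains: dict app -> information gain.
    Returns dict user -> list with one TF-IDF x IG value per important App."""
    n_users = len(usage_counts)
    # Document frequency: number of users whose usage set contains the App.
    doc_freq = {
        app: sum(1 for counts in usage_counts.values() if counts.get(app, 0) > 0)
        for app in important_apps
    }
    features = {}
    for user, counts in usage_counts.items():
        total = sum(counts.values())
        row = []
        for app in important_apps:
            tf = counts.get(app, 0) / total if total else 0.0
            idf = math.log(n_users / doc_freq[app]) if doc_freq[app] else 0.0
            row.append(tf * idf * gains[app])
        features[user] = row
    return features
```

With 100 important Apps per attribute, each user would receive a 100-dimensional preference vector for each binary classifier.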
(3) Applist2vec feature
By analogy with word and text modeling, each App is treated as a word and the sequence of Apps each user uses within a period of time as a document set; features are extracted with the Embedding layer of word2vec to obtain a dimensionality-reduced App vector matrix. The method builds the word vector model with the CBOW network structure and extracts 20-dimensional continuous features from the user's App usage behavior data. This takes the order in which the user uses Apps into account, and it also reduces the sparsity of the matrix and improves computational efficiency.
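CBOW learns embeddings by predicting a target word from its surrounding context words. The sketch below only shows how (context, target) training pairs might be formed from one user's App usage sequence; the window size is a hypothetical parameter that the text does not specify, and the actual embedding training would normally be delegated to a word2vec library.

```python
def cbow_pairs(app_sequence, window=2):
    """app_sequence: list of App identifiers in the order the user used them.
    Returns a list of (context_apps, target_app) pairs, where the context holds
    up to `window` Apps on each side of the target."""
    pairs = []
    for idx, target in enumerate(app_sequence):
        left = app_sequence[max(0, idx - window):idx]
        right = app_sequence[idx + 1:idx + 1 + window]
        context = left + right
        if context:  # Skip sequences of length 1, which have no context.
            pairs.append((context, target))
    return pairs
```

Because the context is drawn from neighbouring positions in the usage sequence, Apps that tend to be used close together in time end up with similar vectors.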
Step 4: Binary classifier construction
Compared with the GBDT + LR model, the invention performs binary classification with a LightGBM + FM (light gradient boosting machine + factorization machine) model, which selects and combines features automatically. As shown in FIG. 2, the feature set is input into a LightGBM model for training, five-fold cross prediction is performed on the training samples, and the leaf node of each decision tree into which each sample falls is determined; that leaf node is marked as 1 and all other leaf nodes as 0, giving a high-dimensional combined 0-1 feature vector:
$$x'_i = g(x_i, \theta)_{\text{num\_tree} \times \text{num\_leaves}}$$
where $x_i$ is the feature vector of the i-th training sample, $x'_i$ is the high-dimensional combined 0-1 feature vector of the i-th training sample, g(·) denotes a leaf node of the LightGBM classifier and takes 1 when the i-th sample falls into that leaf node and 0 otherwise, num_tree is the number of decision trees in the LightGBM model, and num_leaves is the number of leaf nodes on each tree.
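The leaf encoding can be sketched in plain Python, given the per-tree leaf index of each sample. In LightGBM's Python API, calling `predict` with `pred_leaf=True` returns such per-tree leaf indices; the helper `leaves_to_one_hot` below is a hypothetical name and assumes 0-based leaf indices and the same num_leaves for every tree.

```python
def leaves_to_one_hot(leaf_indices, num_leaves):
    """leaf_indices: list of samples, each a list with one 0-based leaf index per tree.
    Returns one 0-1 vector of length num_tree * num_leaves per sample, with exactly
    one 1 per tree marking the leaf the sample falls into."""
    encoded = []
    for sample in leaf_indices:
        row = [0] * (len(sample) * num_leaves)
        for tree_idx, leaf in enumerate(sample):
            row[tree_idx * num_leaves + leaf] = 1
        encoded.append(row)
    return encoded
```

Each resulting vector is very sparse (num_tree ones out of num_tree × num_leaves entries), which is the kind of input a factorization machine is designed to handle.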
FM aims to solve the problem of insufficient parameter learning for feature combinations on high-dimensional sparse data, and the model is expressed as:
$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$
where n is the number of features of the sample,
$$w_0 \in \mathbb{R}, \quad w \in \mathbb{R}^{n}, \quad V \in \mathbb{R}^{n \times k},$$
$\langle \cdot, \cdot \rangle$ denotes the dot product of two vectors of size k,
$$\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}$$
The FM model is trained by stochastic gradient descent (SGD) to obtain its weight parameters, after which binary prediction can be performed on the input feature set.
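A second-order factorization machine and one SGD update might be sketched as follows. This assumes the standard FM form (a bias, linear weights, and pairwise interactions factorized through k-dimensional vectors), a sigmoid output with log loss, and no regularization; the loss, learning rate and function names are assumptions not given in the text. The pairwise term is computed with the usual linear-time identity rather than an explicit double loop.

```python
import math


def fm_score(x, w0, w, V):
    """x: list of n feature values; w0: bias; w: list of n weights;
    V: n x k list of lists holding the factor vectors."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)  # equals sum over i<j of v_if * v_jf * x_i * x_j
    return linear + pairwise


def fm_probability(x, w0, w, V):
    """Sigmoid of the FM score, read as the probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w, V)))


def sgd_step(x, y, w0, w, V, lr=0.01):
    """One SGD update for log loss with label y in {0, 1}. Returns new (w0, w, V)."""
    n, k = len(x), len(V[0])
    err = fm_probability(x, w0, w, V) - y  # derivative of log loss w.r.t. the score
    sums = [sum(V[i][f] * x[i] for i in range(n)) for f in range(k)]
    new_w0 = w0 - lr * err
    new_w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    new_V = [
        [V[i][f] - lr * err * (x[i] * sums[f] - V[i][f] * x[i] ** 2) for f in range(k)]
        for i in range(n)
    ]
    return new_w0, new_w, new_V
```

For the 0-1 leaf vectors produced by LightGBM, most x values are zero, so a practical implementation would iterate only over the non-zero entries.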
Step 5: Multi-class prediction
In step 4, each binary classifier builds its model from different binary labels, so each binary model is highly accurate only for one specific attribute. When the feature sets of the binary models are built, the relevance between App usage data and attributes differs from attribute to attribute; the App usage data are therefore screened separately for each binary model's training set. Each binary model selects the top 100 Apps with the largest information gain according to its own labels, and the user App usage preference features are built from the user's usage duration of those 100 Apps. The differences in labels and features make the binary models diverse, which benefits the subsequent model fusion.
The user App usage preference features of the 12 binary classifiers are merged, and the prediction results of the 12 binary classifiers are concatenated to the feature set, giving the new feature set {basic attribute features, Applist2vec features, merged user App usage preference features, probability 1, probability 2, ..., probability 12}. The new feature set is input into a multi-class classifier for training, which outputs the gender-age multi-class prediction result. The output of the multi-class classifier retains the original feature information and integrates the prediction results of the 12 sub-classifiers, so accuracy improves while the risk of overfitting is reduced.
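Assembling the stacked feature vector for one user could be sketched as below. The text does not say how the 12 preference feature sets are merged; plain concatenation is assumed here, and `build_stacked_features` is a hypothetical helper name.

```python
def build_stacked_features(basic, applist2vec, preference_by_classifier, probabilities):
    """All arguments describe a single user.
    basic: list of basic attribute features.
    applist2vec: list of Applist2vec features (20 values in the described embodiment).
    preference_by_classifier: list of 12 preference vectors, one per binary classifier.
    probabilities: list of 12 predicted probabilities, one per binary classifier.
    Returns the concatenated feature vector for the multi-class classifier."""
    # Assumption: 'merging' the preference features means concatenating them.
    merged_preference = [v for pref in preference_by_classifier for v in pref]
    return list(basic) + list(applist2vec) + merged_preference + list(probabilities)
```

The resulting vector would be paired with the joint sex_age label (22 classes) to train the final multi-class model.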
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and all inventions that make use of the inventive concept are protected, provided they do not depart from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A user basic attribute prediction method based on ensemble learning, characterized in that the gender and age of a mobile phone user are predicted from the user's App installation list and App usage data, the method comprising the following steps:
step 1, collecting records of users' App installation and usage behavior, including the user ID, the list of installed Apps and the usage time of each App in the list; preprocessing the collected records to filter out abnormal and missing data; and obtaining the preprocessed raw data;
step 2, dividing the preprocessed raw data into 12 binary classification data sets, comprising 1 gender binary classification and 11 age binary classifications, where the data sets differ only in their labels;
step 3, extracting features from the binary classification data sets, the features comprising basic statistical features (the number of Apps installed by the user; the number of App categories installed by the user; the number of installed Apps in each category; the usage duration in each of the 24 hours of the day; the maximum, minimum and average daily App usage duration; the average number of App launches per day; and the earliest and latest times at which the user uses an App), the user App usage preference feature and the Applist2vec feature;
step 4, fusing LightGBM with a factorization machine (FM) model to construct the binary classifiers: LightGBM extracts high-dimensional combined features, which are input into the FM classifier, and training yields the prediction probability of each binary classifier;
step 5, concatenating the prediction probabilities with the features from step 3 to obtain the new training features {basic attribute features, Applist2vec features, merged user App usage preference features, probability 1, probability 2, ..., probability 12}; combining gender and age to turn the task into a multi-class problem, inputting the new training features into a multi-class classifier for training, and outputting the prediction result.
2. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein the records of users' App installation and usage behavior collected in step 1 specifically include:
App usage behavior records of a plurality of users, comprising the user ID, the user's gender and age, the list of Apps installed by the user, and the opening and closing time of each App in the list.
3. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein preprocessing the collected records in step 1 and filtering abnormal and missing data specifically includes:
(1) user App usage time: based on the parsed timestamps at which users open and close Apps, removing data from abnormal years, including 1970, 1975 and 2025;
(2) user daily usage duration: removing user behavior records with a usage duration of less than 0.5 hour;
(3) App installation count: removing Apps installed by fewer than 5 users.
4. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein the 1 gender binary classification and 11 age binary classifications in step 2 are specifically:
the user's gender is male or female, labeled 1 and 2 respectively; the user's age is divided into 11 intervals; gender and age are combined into 22 classes for multi-class prediction, and the combined gender-age group is denoted sex_age, where sex_age = (sex - 1) · 11 + age, sex denotes the user's gender and age denotes the user's age bracket.
5. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein dividing the preprocessed raw data into 12 binary classification data sets in step 2 comprises:
converting the basic attribute labels into the binary labels of the corresponding binary classifiers, namely: the data set labels of binary classifier 1 become age interval 1 and not age interval 1; those of binary classifier 2 become age interval 2 and not age interval 2; and so on, until those of binary classifier 11 become age interval 11 and not age interval 11; the data set labels of binary classifier 12 become male and female; the data sets differ only in their labels, and all other data are unchanged.
6. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein extracting features from the binary classification data sets in step 3 comprises:
extracting features from each binary classification data set according to the user's App installation and usage data and the App category information, wherein the features comprise basic attribute features, user App usage preference features, and Applist2vec features; the feature sets of the binary classification data sets differ only in the user App usage preference features;
(1) the basic attribute features include: the number of Apps installed by the user, the number of App categories installed by the user, the number of Apps installed by the user in each category, statistics of the user's usage duration in each of the 24 hours of the day, the maximum, minimum, and average daily App usage duration of the user, the average number of times the user opens Apps each day, and the earliest and latest times at which the user uses Apps;
(2) the user App usage preference features: the important Apps under each attribute are first extracted based on information gain, and features of these important Apps are then extracted from the usage set using TF-IDF (term frequency-inverse document frequency);
(3) the Applist2vec features: each App is regarded as a word, and the App usage sequences of the users within a period of time are regarded as a document set; features are extracted using the Embedding layer of word2vec to obtain a dimensionality-reduced App vector matrix; a word vector model is constructed with the CBOW network structure, and 20-dimensional continuous features are extracted from the user's App usage behavior data.
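The TF-IDF step in item (2) can be illustrated with a minimal sketch. It assumes that the list of important Apps has already been selected (for example by information gain), that each user's usage log is treated as one document, and that the unsmoothed form idf = log(N / df) is used; the claim does not specify the exact TF-IDF variant, and the name `tfidf_features` is hypothetical.

```python
import math
from collections import Counter


def tfidf_features(usage_by_user, important_apps):
    """Return {user: [tf-idf weight per important App]}.

    usage_by_user: {user_id: [app, app, ...]}, one entry per usage event.
    important_apps: ordered list of Apps kept after feature selection.
    """
    n_users = len(usage_by_user)
    important = set(important_apps)
    # Document frequency: number of users who used each important App.
    doc_freq = Counter()
    for apps in usage_by_user.values():
        doc_freq.update(set(apps) & important)

    features = {}
    for user, apps in usage_by_user.items():
        counts = Counter(apps)
        total = len(apps)
        row = []
        for app in important_apps:
            tf = counts[app] / total if total else 0.0
            idf = math.log(n_users / doc_freq[app]) if doc_freq[app] else 0.0
            row.append(tf * idf)
        features[user] = row
    return features
```

Under this variant, an App used by every user receives weight 0, so the resulting features emphasize Apps that distinguish one group of users from another.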
7. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein constructing the binary classifiers in step 4 comprises:
inputting the feature set into a LightGBM model for training, performing five-fold cross prediction on the training samples in the LightGBM model, determining which leaf node of each decision tree each sample falls into, marking that leaf node as 1 and the other leaf nodes as 0, and extracting a high-dimensional combined 0-1 feature vector:
$x'_i = g(x_i, \theta)_{\mathrm{num\_tree} \times \mathrm{num\_leaves}}$
where $x_i$ denotes the feature vector of the i-th training sample, $x'_i$ denotes the high-dimensional combined 0-1 feature vector of the i-th training sample, g(·) denotes the leaf-node indicator of the LightGBM classifier, taking 1 when the i-th sample falls into the leaf node and 0 otherwise, num_tree denotes the number of decision trees in the LightGBM model, and num_leaves denotes the number of leaf nodes on each tree;
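The leaf-to-vector conversion can be sketched as follows. The sketch assumes that the per-tree leaf indices are already available as 0-based integers (in the LightGBM Python package they can be obtained by calling `predict` with `pred_leaf=True`); the function name `leaves_to_onehot` is hypothetical.

```python
def leaves_to_onehot(leaf_indices, num_leaves):
    """Turn per-tree leaf indices into a 0-1 vector of length num_tree * num_leaves.

    leaf_indices: one row per sample, one 0-based leaf index per tree.
    Position tree * num_leaves + leaf is set to 1; all other positions stay 0.
    """
    encoded = []
    for row in leaf_indices:
        vector = [0] * (len(row) * num_leaves)
        for tree, leaf in enumerate(row):
            if not 0 <= leaf < num_leaves:
                raise ValueError("leaf index out of range")
            vector[tree * num_leaves + leaf] = 1
        encoded.append(vector)
    return encoded
```

Each encoded row contains exactly num_tree ones, one per tree, which produces the sparse high-dimensional input on which a factorization machine is then trained.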
the FM model is expressed as:
Figure FDA0002377121490000031
wherein n represents the number of features of the sample,
Figure FDA0002377121490000032
<·,·> represents the dot product of two vectors of size k,
Figure FDA0002377121490000033
and training the FM model by stochastic gradient descent (SGD) to obtain the weight parameters of the model, and then performing binary classification prediction according to the input feature set.
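The following sketch illustrates FM scoring and one SGD update. The equations of this claim are available only as figure references, so the sketch assumes the standard second-order factorization machine, $\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$, with a logistic loss for the binary task. The loss, the learning rate, and the names `fm_score` and `sgd_step` are assumptions and are not taken from the patent.

```python
import math


def fm_score(x, w0, w, v):
    """Second-order FM score, using the O(k*n) form of the pairwise term."""
    n, k = len(x), len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - sq)
    return w0 + linear + pairwise


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(x, y, w0, w, v, lr=0.1):
    """One SGD update on a single sample with label y in {0, 1}.

    Updates w and v in place and returns the new bias w0.
    """
    n, k = len(x), len(v[0])
    err = sigmoid(fm_score(x, w0, w, v)) - y  # d(log loss) / d(score)
    sums = [sum(v[i][f] * x[i] for i in range(n)) for f in range(k)]
    w0 -= lr * err
    for i in range(n):
        if x[i] == 0:
            continue  # zero features have zero gradient
        w[i] -= lr * err * x[i]
        for f in range(k):
            v[i][f] -= lr * err * x[i] * (sums[f] - v[i][f] * x[i])
    return w0
```

Skipping zero-valued features keeps each update proportional to the number of active features, which suits the sparse 0-1 leaf vectors produced by the LightGBM stage.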
8. The ensemble learning-based user basic attribute prediction method according to claim 1, wherein: the step 5 comprises the following steps:
merging the user App usage preference features of the 12 binary classifiers, and splicing the prediction results of the 12 binary classifiers into the feature set, to obtain a new feature set {basic attribute features, Applist2vec features, merged user App usage preference features, prediction results of the 12 binary classifiers}; inputting the new feature set into a multi-class classifier for training, and outputting the gender-age multi-class prediction result; the result output by the multi-class classifier both retains the original feature information and integrates the prediction results of the 12 binary classifiers.
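The feature assembly for the multi-class stage can be sketched as follows. The claim does not state how the preference features of the 12 classifiers are merged, so plain concatenation is assumed here; the function name `build_stacked_features` and its argument layout are hypothetical.

```python
def build_stacked_features(basic, applist2vec, preference_by_classifier,
                           binary_predictions):
    """Assemble one user's input row for the multi-class classifier.

    basic: basic attribute features.
    applist2vec: the user's Applist2vec embedding features.
    preference_by_classifier: 12 lists of App usage preference features,
        one per binary classifier (merged here by concatenation, an assumption).
    binary_predictions: the 12 binary classifier outputs for this user.
    """
    if len(preference_by_classifier) != 12 or len(binary_predictions) != 12:
        raise ValueError("expected features and predictions for 12 classifiers")
    merged_preferences = [value
                          for block in preference_by_classifier
                          for value in block]
    return (list(basic) + list(applist2vec)
            + merged_preferences + list(binary_predictions))
```

Rows built this way keep the original features alongside the 12 first-stage predictions, so the multi-class classifier can weigh both sources when predicting the 22 gender-age classes.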
CN202010070270.5A 2020-01-21 2020-01-21 User basic attribute prediction method based on ensemble learning Active CN111291798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070270.5A CN111291798B (en) 2020-01-21 2020-01-21 User basic attribute prediction method based on ensemble learning

Publications (2)

Publication Number Publication Date
CN111291798A (en) 2020-06-16
CN111291798B (en) 2021-04-20

Family

ID=71028443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070270.5A Active CN111291798B (en) 2020-01-21 2020-01-21 User basic attribute prediction method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN111291798B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573048A (en) * 2015-01-20 2015-04-29 电子科技大学 User basic attribute predicting method based on flow data of smart phone
CN106651057A (en) * 2017-01-03 2017-05-10 有米科技股份有限公司 Mobile terminal user age prediction method based on installation package sequence table
CN108256052A (en) * 2018-01-15 2018-07-06 成都初联创智软件有限公司 Automobile industry potential customers' recognition methods based on tri-training
CN108256537A (en) * 2016-12-28 2018-07-06 北京酷我科技有限公司 A kind of user gender prediction method and system
CN109255506A (en) * 2018-11-22 2019-01-22 重庆邮电大学 A kind of internet finance user's overdue loan prediction technique based on big data
CN109635118A (en) * 2019-01-10 2019-04-16 博拉网络股份有限公司 A kind of user's searching and matching method based on big data
CN109885834A (en) * 2019-02-18 2019-06-14 中国联合网络通信集团有限公司 A kind of prediction technique and device of age of user gender
CN110009030A (en) * 2019-03-29 2019-07-12 华南理工大学 Sewage treatment method for diagnosing faults based on stacking meta learning strategy
CN110414716A (en) * 2019-07-03 2019-11-05 北京科技大学 A kind of enterprise based on LightGBM breaks one's promise probability forecasting method and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KUNG-HSIANG, HUANG等: "A-HA: A Hybrid Approach for Hotel Recommendation", 《PROCESSING OF THE WORKSHOP ON ACM RECOMMENDER SSTEMS CHALLENGE》 *
彭赞等: "基于集成模型的移动应用广告转化率预测", 《计算机系统应用》 *
李雄飞等: "基于混合模型的广告转化率问题研究", 《东北大学学报(自然科学版)》 *
李雯: "基于参数服务器ps_lite的大规模Embedding系统的研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
陶竹林等: "点击预测的关键技术研究", 《中国传媒大学学报(自然科学版)》 *
高洁等: "一种基于LightGBM机器学习算法的用户年龄及性别预测方法", 《邮电设计技术》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084402A (en) * 2020-08-24 2020-12-15 浙江云合数据科技有限责任公司 Method for predicting user attribute by analyzing application program use data
CN111988327A (en) * 2020-08-25 2020-11-24 北京天融信网络安全技术有限公司 Threat behavior detection and model establishment method and device, electronic equipment and storage medium
CN111988327B (en) * 2020-08-25 2022-07-12 北京天融信网络安全技术有限公司 Threat behavior detection and model establishment method and device, electronic equipment and storage medium
CN112036572A (en) * 2020-08-28 2020-12-04 上海冰鉴信息科技有限公司 Text list-based user feature extraction method and device
CN112036572B (en) * 2020-08-28 2024-03-12 上海冰鉴信息科技有限公司 Text list-based user feature extraction method and device
CN112561500A (en) * 2021-02-25 2021-03-26 深圳平安智汇企业信息管理有限公司 Salary data generation method, device, equipment and medium based on user data
CN113706040A (en) * 2021-09-01 2021-11-26 深圳前海微众银行股份有限公司 Risk identification method, device, equipment and storage medium
CN114372698A (en) * 2022-01-07 2022-04-19 武大吉奥信息技术有限公司 Social risk index classification model construction method, system, device and storage medium

Also Published As

Publication number Publication date
CN111291798B (en) 2021-04-20

Similar Documents

Publication Publication Date Title
CN111291798B (en) User basic attribute prediction method based on ensemble learning
CN111444236B (en) Mobile terminal user portrait construction method and system based on big data
CN109949936B (en) Re-hospitalization risk prediction method based on deep learning mixed model
CN111209386B (en) Personalized text recommendation method based on deep learning
US11829855B2 (en) Time-factored performance prediction
Lenz et al. Measuring the diffusion of innovations with paragraph vector topic models
CN109145180B (en) Enterprise hot event mining method based on incremental clustering
Hu et al. Latent topic model for audio retrieval
CN109271527A (en) A kind of appellative function point intelligent identification Method
CN105225135B (en) Potential customer identification method and device
CN111859010B (en) Semi-supervised audio event identification method based on depth mutual information maximization
CN109299266B (en) A kind of text classification and abstracting method for Chinese news emergency event
CN110727864B (en) User portrait method based on mobile phone App installation list
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN115048464A (en) User operation behavior data detection method and device and electronic equipment
CN110704738B (en) Service information pushing method, device, terminal and storage medium based on legal image
CN113010705B (en) Label prediction method, device, equipment and storage medium
CN114722810A (en) Real estate customer portrait method and system based on information extraction and multi-attribute decision
Bouguila On multivariate binary data clustering and feature weighting
CN111859955A (en) Public opinion data analysis model based on deep learning
CN115018207B (en) Upstream and downstream based supply chain management method, system and equipment
Leng et al. Audio scene recognition based on audio events and topic model
CN116452353A (en) Financial data management method and system
CN112541080B (en) New media account label intelligent verification method based on deep learning
Stankevičius et al. Lithuanian news clustering using document embeddings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant