CN111125486A - Microblog user attribute analysis method based on multiple features

Info

Publication number: CN111125486A
Application number: CN201911340531.4A
Authority: CN
Inventors: 程克非; 单凤池
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-05-08
Anticipated expiration: 2039-12-23
Also published as: CN111125486B

Abstract

The invention relates to a multi-feature-based method for analyzing microblog user attributes, and belongs to the technical field of intelligent media computing and big data analysis. The method comprises the following steps: S1, crawling users' microblog posts with crawler software, then cleaning and labeling the data; S2, constructing word vectors for the microblog posts with a word2vec model and, on that basis, obtaining the user's microblog text features through an ensemble learning combination strategy; S3, constructing a multi-feature system for microblog attribute analysis from the user's microblog data, and building composite features suited to user attribute analysis from the basic features; S4, fusing multiple base classifiers with the Stacking model fusion technique to construct a microblog user attribute analysis model, and inputting the data to be analyzed to obtain the final microblog user attribute analysis result. The invention improves the accuracy of microblog user attribute classification and provides technical support for merchants to offer users more efficient personalized recommendations.

Description

Microblog user attribute analysis method based on multiple features

Technical Field

The invention belongs to the technical field of intelligent media computing and big data analysis, and relates to a multi-feature-based microblog user attribute analysis method.

Background

With the growing popularity of online social media, network information has become voluminous and disorderly. Using computer technology to gain a deep understanding of the basic information of individuals and groups, to mine social psychology and behavior patterns, to provide fast, accurate, personalized and multi-faceted decision support, and to help solve real social problems has become an important topic of shared interest in academia and industry. A deep understanding of user information and user behavior is one of its core elements. Because personal attribute data often involves privacy, users frequently hide their personal information by leaving fields blank or entering false information, so basic user information often cannot be obtained directly. User attribute analysis can address this problem.

At present, research on user attribute analysis at home and abroad generally follows one of three approaches: supervised learning, semi-supervised learning, and unsupervised learning. Semi-supervised learning suffers from data sparseness and unsupervised learning yields lower accuracy; by comparison, supervised learning that builds a multi-feature system and combines it with novel composite features is better suited to user attribute analysis. In addition, the features considered by existing microblog user attribute analysis methods are incomplete, so the accuracy of their results is low.

Disclosure of Invention

In view of this, the invention aims to provide a multi-feature-based microblog user attribute analysis method that improves the accuracy of microblog user attribute classification, so that merchants can provide users with more efficient personalized recommendations.

To achieve this purpose, the invention provides the following technical solution:

A microblog user attribute analysis method based on multiple features specifically comprises the following steps:

S1: crawling users' microblog posts with crawler software, then cleaning and labeling the data;

S2: constructing word vectors for the microblog posts with a word2vec model and, on that basis, obtaining the user's microblog text features through an ensemble learning combination strategy;

S3: constructing a multi-feature system for microblog attribute analysis from the user's microblog data, and building composite features suited to user attribute analysis from the basic features;

S4: fusing multiple base classifiers with the Stacking model fusion technique to construct a microblog user attribute analysis model, and inputting the data to be analyzed to obtain the final microblog user attribute analysis result.

Further, in step S2, constructing the user's microblog text features specifically comprises the following steps:

S21: performing word segmentation on the samples with the Jieba word segmentation tool, removing stop words, and merging the microblogs of each user to obtain the user blog collection [formula not reproduced], where $m_i$ denotes the set of microblogs of the user with ID $i$, [symbol not reproduced] denotes the set of microblogs of a single user, and $w_t$ denotes a word in a single microblog;

S22: training a Skip-Gram model on the users' microblogs to obtain 300-dimensional word vectors for the words in the microblogs, and calculating the microblog vector of each user by the following formula:

[formula not reproduced]

where $u_i$ denotes the user with ID $i$, $K$ denotes the number of words in the microblogs of user $u_i$, and $Wvec_k$ denotes the word vector of the $k$-th word;

S23: using a Stacking model as the ensemble learning combination strategy, with a support vector machine (SVM), a decision tree, logistic regression, a light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost) as the primary classifiers, and logistic regression as the second-layer classifier that combines the predictions of the primary classifiers, finally obtaining the user's microblog text features.
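Steps S21 and S22 can be illustrated with a minimal sketch. The formula for the user vector is not reproduced in this text, so the sketch assumes the most direct reading of the symbol definitions: the user vector is the mean of the word vectors of the user's words. The stop-word list, the toy 3-dimensional vectors and the function names are hypothetical; the posts are assumed to be already segmented (the method uses Jieba for that), and skipping words without a vector is also an assumption.

```python
from collections import defaultdict

# Hypothetical stop-word list and toy 3-dimensional word vectors; the method
# itself uses 300-dimensional Skip-Gram vectors trained on the crawled corpus.
STOP_WORDS = {"the", "a"}
WORD_VECTORS = {
    "coffee": [1.0, 0.0, 2.0],
    "match": [0.0, 2.0, 0.0],
    "goal": [2.0, 4.0, 1.0],
}

def merge_user_tokens(posts):
    """Merge already-segmented posts into one token list per user, dropping stop words.

    `posts` is an iterable of (user_id, tokens) pairs.
    """
    merged = defaultdict(list)
    for user_id, tokens in posts:
        merged[user_id].extend(t for t in tokens if t not in STOP_WORDS)
    return dict(merged)

def user_vector(tokens, vectors):
    """Average the vectors of the tokens that have one (assumed combination rule)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

posts = [
    ("u1", ["the", "match"]),
    ("u1", ["a", "goal"]),
    ("u2", ["coffee"]),
]
tokens_by_user = merge_user_tokens(posts)
vec_u1 = user_vector(tokens_by_user["u1"], WORD_VECTORS)
```

Averaging gives every user a vector of the same dimension regardless of how many microblogs the user has posted, which is what the downstream classifiers require.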

Further, in step S3, the constructed composite features comprise user activity, user microblog time distribution, and user behavior habits.

The user activity feature $f_{useractive}(u_i)$ is calculated as follows:

[formula not reproduced]

where $u_i$ denotes the user with ID $i$, $f_{sum}(u_i)$ denotes the total number of microblogs of user $u_i$, $f_{transpond}(u_i)$ denotes the number of microblogs forwarded by user $u_i$, and $f_{time}(u_i)$ denotes the time interval between the first and the last microblog of user $u_i$.

The user microblog time distribution is calculated as follows:

[formula not reproduced]

where the three symbols of the formula, which are not reproduced, denote respectively the user with ID $i$ in time period $j$, the number of microblogs posted by user $u_i$ at time $j$, and the number of microblogs forwarded by user $u_i$ at time $j$.

The user behavior habit feature $f_{userBehavior}(u_i)$ is calculated as follows:

$$f_{userBehavior}(u_i) = f_{textBehavior}(u_i) + f_{textSource}(u_i) + f_{inforIntegrity}(u_i)$$

where $f_{textBehavior}(u_i)$ denotes the text behavior habit of user $u_i$, $f_{textSource}(u_i)$ denotes the blog post source information of user $u_i$, and $f_{inforIntegrity}(u_i)$ denotes the basic information integrity of user $u_i$.

Further, the user's text behavior habit is obtained by calculating the proportion of emoticons and pictures in the user's microblogs, by the following formula:

[formula not reproduced]

where $f_{emoticons}(text_n)$ denotes the number of emoticons in the $n$-th microblog, $f_{picture}(text_n)$ denotes the number of pictures in the $n$-th microblog, and $N$ denotes the number of microblogs of user $u_i$.
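As an illustration only, the text behavior habit can be sketched as the average number of emoticons and pictures per microblog. This is an assumed reading of "proportion", chosen because the definitions mention only the per-microblog counts and the number of microblogs $N$; the function name and the input layout are hypothetical.

```python
def text_behavior(posts):
    """Mean number of emoticons plus pictures per microblog (assumed formula).

    `posts` is a list of (emoticon_count, picture_count) pairs, one per microblog.
    """
    if not posts:
        return 0.0
    return sum(e + p for e, p in posts) / len(posts)

# Three microblogs: 2 emoticons + 1 picture, none, 1 emoticon + 2 pictures.
score = text_behavior([(2, 1), (0, 0), (1, 2)])
```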

Further, the user's blog post source information is calculated from the male habitual text source $f_{mSource}(u_i)$ and the female habitual text source $f_{fSource}(u_i)$ as follows:

$$f_{textSource}(u_i) = f_{mSource}(u_i) - f_{fSource}(u_i)$$

Further, the male habitual text source $f_{mSource}(u_i)$ is calculated as follows:

[formula not reproduced]

where $N$ denotes the number of microblogs of user $u_i$, $f_{mSourceNum}(text_j)$ indicates that the source of the $j$-th microblog is a male text source, and $sourceNum$ is the total number of text sources.

Further, the female habitual text source $f_{fSource}(u_i)$ is calculated as follows:

[formula not reproduced]

where $N$ denotes the number of microblogs of user $u_i$, $f_{fSourceNum}(text_j)$ indicates that the source of the $j$-th microblog is a female text source, and $sourceNum$ is the total number of text sources.
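The source feature can be sketched as follows. The difference $f_{mSource}(u_i) - f_{fSource}(u_i)$ is taken from the text; the two individual formulas are not reproduced, so the sketch assumes each is the count of the user's microblogs posted from a gender-typical source divided by `source_num`. The source names, the two sets of gender-typical sources and the value of `source_num` are hypothetical.

```python
# Hypothetical sets of posting clients treated as typical of each gender.
MALE_SOURCES = {"client_a"}
FEMALE_SOURCES = {"client_b"}

def source_score(sources, typical_sources, source_num):
    """Number of microblogs posted from `typical_sources`, divided by `source_num` (assumed)."""
    hits = sum(1 for s in sources if s in typical_sources)
    return hits / source_num

def text_source(sources, source_num):
    """f_textSource = f_mSource - f_fSource, as stated in the text."""
    f_m = source_score(sources, MALE_SOURCES, source_num)
    f_f = source_score(sources, FEMALE_SOURCES, source_num)
    return f_m - f_f

user_sources = ["client_a", "client_a", "client_b", "web"]
f_text_source = text_source(user_sources, source_num=4)
```

Under this reading a positive value means the user posts more often from sources typical of male users, and a negative value the opposite.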

Further, the user information integrity is defined as follows: $f_{inforIntegrity}$ denotes the integrity of the user's basic information, which comprises the user's nickname, registered location, gender, birthday, profile, education information and avatar information, and is calculated by the following formula:

[formula not reproduced]

where $f_{name}$ indicates whether there is a nickname, $f_{location}$ indicates whether there is a registered location, $f_{birthday}$ indicates whether there is birthday information, $f_{introduction}$ indicates whether there is a profile, $f_{education}$ indicates whether there is education information, $f_{headPhoto}$ indicates whether there is avatar information, and $m$ denotes the total number of basic information items.
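A minimal sketch of the integrity feature, assuming it is the fraction of basic information items that are filled in (the sum of the presence indicators divided by $m$). The formula itself is not reproduced in this text; the sketch uses only the six indicators that the definitions name, and the profile dictionary layout is hypothetical.

```python
# The six presence indicators named in the definitions.
FIELDS = ("name", "location", "birthday", "introduction", "education", "headPhoto")

def info_integrity(profile, fields=FIELDS):
    """Fraction of the listed profile fields that are present and non-empty (assumed formula)."""
    present = sum(1 for f in fields if profile.get(f))
    return present / len(fields)

profile = {
    "name": "some_user",
    "location": "",          # empty string counts as missing
    "birthday": None,        # None counts as missing
    "introduction": "hello",
    "headPhoto": "avatar.jpg",
}
integrity = info_integrity(profile)
```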

Further, in step S4, fusing the multiple base classifiers with the Stacking model fusion technique to construct the microblog user attribute analysis model specifically comprises: constructing the microblog user attribute analysis model with a support vector machine (SVM), a decision tree, logistic regression, a light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost) as the primary classifiers, and logistic regression as the second-layer classifier.

The beneficial effects of the invention are as follows: the invention fully considers the various features of a user's microblogs and trains the constructed microblog user attribute analysis model to obtain various personalized data about microblog users, thereby improving the accuracy of microblog user attribute classification and providing technical support for merchants to offer users more efficient personalized recommendations.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a general flow chart of microblog user attribute analysis according to the present invention;

FIG. 2 is a flow chart of construction and extraction of microblog user attribute analysis text features in the invention;

FIG. 3 is a flow chart of microblog user attribute analysis non-text feature construction and extraction in the present invention;

FIG. 4 is a flowchart of microblog user attribute analysis model construction in the invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

Referring to fig. 1 to 4, fig. 1 is a general flowchart of a microblog user attribute analysis method according to a preferred embodiment of the present invention, where the microblog user attribute analysis method according to this embodiment may be executed as a computer program or as a plug-in executed in other programs, and the specific execution process includes:

Step S1: data preprocessing.

Data preprocessing comprises two stages, data cleaning and labeling. In the cleaning stage, abnormal values and null values in the data are handled to ensure the integrity of the sample data. In the labeling stage, the acquired data are labeled manually according to prior knowledge and divided into a male class and a female class, where 0 denotes male and 1 denotes female.
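The two preprocessing stages can be sketched as below. The text says only that abnormal and null values are "processed"; dropping such records is an assumption made here for simplicity. The record layout and the `gender` key, which stands for the manually assigned annotation, are hypothetical. The 0/1 encoding follows the text.

```python
# Label encoding stated in the text: 0 denotes male, 1 denotes female.
LABELS = {"male": 0, "female": 1}

def clean_and_label(records):
    """Drop records with empty text or a missing annotation, then encode the label.

    Dropping is an assumed cleaning policy, not one stated in the source.
    """
    cleaned = []
    for r in records:
        text = (r.get("text") or "").strip()
        gender = r.get("gender")
        if not text or gender not in LABELS:
            continue
        cleaned.append({"user_id": r["user_id"], "text": text, "label": LABELS[gender]})
    return cleaned

records = [
    {"user_id": "u1", "text": " hello ", "gender": "male"},
    {"user_id": "u2", "text": None, "gender": "female"},   # null text: dropped
    {"user_id": "u3", "text": "hi", "gender": "female"},
    {"user_id": "u4", "text": "hey", "gender": None},      # no annotation: dropped
]
cleaned = clean_and_label(records)
```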

Step S2: constructing word vectors for the microblog posts with word2vec and, on that basis, obtaining the microblog text features through an ensemble learning combination strategy.

Step S3: constructing a multi-feature system for microblog attribute analysis from the user's microblog data, and building composite features suited to user attribute analysis from the basic features.

Step S4: fusing multiple base classifiers with the Stacking model fusion technique to obtain the final microblog user attribute analysis result.

Specifically, as shown in FIG. 2, step S2 comprises the following steps:

Step S21: performing word segmentation on each of the users' microblogs and removing stop words, then merging the microblogs of each user to obtain the user blog collection [formula not reproduced], where $m_i$ denotes the set of microblogs of the user with ID $i$, [symbol not reproduced] denotes the set of microblogs of a single user, and $w_t$ denotes a word in a single microblog.

Step S22: training a Skip-Gram model on the crawled users' microblogs to obtain 300-dimensional word vectors for the words in the microblogs, and calculating the microblog vector of each user by the following formula:

[formula not reproduced]

where $u_i$ denotes the user with ID $i$, $K$ denotes the number of words in the microblogs of user $u_i$, and $Wvec_k$ denotes the word vector of the $k$-th word.

Step S23: adopting a Stacking model as the ensemble learning strategy, with a support vector machine (SVM), a decision tree, logistic regression, a light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost) as the base classifiers and logistic regression as the meta-classifier, to construct the microblog user attribute analysis model.

Step S24: inputting the training set into the model for fitting, and tuning the parameters by grid search to obtain the optimal model.

Step S25: inputting the training set into the model obtained in step S24 to obtain the text features.
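The Stacking strategy of steps S23 to S25 can be sketched in plain Python. To stay dependency-free, the sketch replaces the five base classifiers named in the text (SVM, decision tree, logistic regression, LightGBM, XGBoost) with three toy stand-ins: two one-feature threshold rules and a nearest-centroid rule. Only the meta-classifier, logistic regression, matches the text. The 2-dimensional toy data stand in for user vectors, and all function names are hypothetical.

```python
import math

def fit_stump(X, y, feature):
    """Toy base learner: threshold on one feature, midway between the class means."""
    mean1 = sum(x[feature] for x, t in zip(X, y) if t == 1) / y.count(1)
    mean0 = sum(x[feature] for x, t in zip(X, y) if t == 0) / y.count(0)
    threshold = (mean0 + mean1) / 2
    if mean1 >= mean0:
        return lambda x: 1.0 if x[feature] > threshold else 0.0
    return lambda x: 1.0 if x[feature] < threshold else 0.0

def fit_centroid(X, y):
    """Toy base learner: predict the class whose centroid is nearer."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    c0, c1 = centroid(0), centroid(1)
    return lambda x: 1.0 if sq_dist(x, c1) < sq_dist(x, c0) else 0.0

def fit_logistic(Z, y, lr=0.5, epochs=300):
    """Meta learner: logistic regression trained by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    def prob(z):
        return 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
    for _ in range(epochs):
        errors = [prob(z) - t for z, t in zip(Z, y)]
        grad_w = [sum(e * z[i] for e, z in zip(errors, Z)) / len(Z) for i in range(len(w))]
        grad_b = sum(errors) / len(Z)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return prob

BASE_FITTERS = [
    lambda X, y: fit_stump(X, y, 0),
    lambda X, y: fit_stump(X, y, 1),
    fit_centroid,
]

def fit_stacking(X, y, n_folds=2):
    """Collect out-of-fold base predictions, then fit the meta learner on them."""
    oof = [[0.0] * len(BASE_FITTERS) for _ in X]
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        held = [i for i in range(len(X)) if i % n_folds == fold]
        for j, fit in enumerate(BASE_FITTERS):
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held:
                oof[i][j] = model(X[i])
    meta = fit_logistic(oof, y)
    bases = [fit(X, y) for fit in BASE_FITTERS]  # refit on all data for inference
    return lambda x: meta([base(x) for base in bases])

X = [(0.0, 0.2), (0.3, 0.1), (0.1, 0.4), (0.4, 0.3),
     (3.0, 3.2), (3.3, 2.9), (2.8, 3.1), (3.1, 3.4)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
text_feature = fit_stacking(X, y)  # returns the meta learner's probability of class 1
```

Training the meta learner on out-of-fold predictions, rather than on predictions the base learners make for their own training samples, keeps it from simply learning to trust overfitted base outputs. The probability returned by `text_feature` plays the role of the text feature obtained in step S25.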

As shown in FIG. 3, step S3 comprises the following steps:

Step S31: constructing a multi-feature system for microblog attribute analysis from the user's microblog data, comprising text features, time features, statistical features, numerical features and content features, as shown in Table 1:

Table 1. Multi-feature system [table contents not reproduced]

Step S32: constructing three composite features, namely user activity, microblog time distribution and user behavior habits, on the basis of the extracted multi-feature system.

Specifically, the user activity feature is calculated as follows:

[formula not reproduced]

where $u_i$ denotes the user with ID $i$, $f_{sum}(u_i)$ denotes the total number of microblogs of user $u_i$, $f_{transpond}(u_i)$ denotes the number of microblogs forwarded by user $u_i$, and $f_{time}(u_i)$ denotes the time interval between the first and the last microblog posted by user $u_i$.
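Because the activity formula is not reproduced in this text, the sketch below computes only its three inputs, $f_{sum}$, $f_{transpond}$ and $f_{time}$, from a user's post history and does not guess how they are combined. Measuring $f_{time}$ in whole days and representing each post as a `(timestamp, is_forward)` pair are assumptions.

```python
from datetime import datetime

def activity_inputs(posts):
    """Return (f_sum, f_transpond, f_time) for one user.

    `posts` is a list of (timestamp, is_forward) pairs; f_time is measured
    in whole days, which is an assumed unit.
    """
    f_sum = len(posts)
    f_transpond = sum(1 for _, is_forward in posts if is_forward)
    times = [ts for ts, _ in posts]
    f_time = (max(times) - min(times)).days if posts else 0
    return f_sum, f_transpond, f_time

posts = [
    (datetime(2019, 1, 1, 8), False),
    (datetime(2019, 1, 5, 21), True),
    (datetime(2019, 1, 11, 9), False),
]
inputs = activity_inputs(posts)
```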

The user's microblog time distribution feature is calculated as follows:

[formula not reproduced]

where the three symbols of the formula, which are not reproduced, denote respectively the user with ID $i$ in time period $j$ ($0 \le j \le 23$), the number of microblogs posted by the user with ID $i$ at time $j$, and the number of microblogs forwarded by the user with ID $i$ at time $j$.
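The hourly counts that feed the time distribution can be sketched as a 24-bin histogram, one bin per hour $j$ with $0 \le j \le 23$. How the posted and forwarded counts are combined is not recoverable from this text; the sketch assumes the feature is the share of the user's total activity falling in each hour, and treats original posts and forwards as disjoint. Function names and the input layout are hypothetical.

```python
from datetime import datetime

def hourly_counts(posts):
    """Count original posts and forwards per hour of day (0..23)."""
    posted = [0] * 24
    forwarded = [0] * 24
    for ts, is_forward in posts:
        if is_forward:
            forwarded[ts.hour] += 1
        else:
            posted[ts.hour] += 1
    return posted, forwarded

def time_distribution(posts):
    """Share of posts plus forwards in each hour (assumed combination rule)."""
    posted, forwarded = hourly_counts(posts)
    total = len(posts)
    if total == 0:
        return [0.0] * 24
    return [(posted[j] + forwarded[j]) / total for j in range(24)]

posts = [
    (datetime(2019, 3, 1, 8, 15), False),
    (datetime(2019, 3, 2, 8, 40), True),
    (datetime(2019, 3, 2, 21, 5), False),
    (datetime(2019, 3, 3, 21, 30), False),
]
distribution = time_distribution(posts)
```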

The user behavior habit feature is calculated from the user's text behavior habit $f_{textBehavior}$, the user's blog post source information $f_{textSource}$ and the user information integrity $f_{inforIntegrity}$, by the following formula:

$$f_{userBehavior}(u_i) = f_{textBehavior}(u_i) + f_{textSource}(u_i) + f_{inforIntegrity}(u_i)$$

User text behavior habit: obtained by calculating the proportion of emoticons and pictures in the user's microblogs, by the following formula:

[formula not reproduced]

where $f_{textBehavior}(u_i)$ denotes the posting habit of user $u_i$, $u_i$ denotes the user with ID $i$, $N$ denotes the number of microblogs of user $u_i$, $f_{emoticons}(text_n)$ denotes the number of emoticons in the $n$-th microblog, and $f_{picture}(text_n)$ denotes the number of pictures in the $n$-th microblog.

User blog post source information: calculated from the male habitual text source $f_{mSource}(u_i)$ and the female habitual text source $f_{fSource}(u_i)$, by the following formula:

$$f_{textSource}(u_i) = f_{mSource}(u_i) - f_{fSource}(u_i)$$

Male habitual text source: the male habitual text source $f_{mSource}(u_i)$ is obtained from the user's microblogs whose source is a male text source and the total number of text sources, by the following formula:

[formula not reproduced]

where $f_{mSourceNum}(text_j)$ indicates that the source of the $j$-th microblog is a male text source, and $sourceNum$ is the total number of text sources.

Female habitual text source: the female habitual text source $f_{fSource}(u_i)$ is obtained from the user's microblogs whose source is a female text source and the total number of text sources, by the following formula:

[formula not reproduced]

where $f_{fSourceNum}(text_j)$ indicates that the source of the $j$-th microblog is a female text source, and $sourceNum$ is the total number of text sources.

User information integrity: $f_{inforIntegrity}$ denotes the integrity of the user's basic information, which comprises the user's nickname, location, gender, birthday, profile, education information and avatar information, and is calculated by the following formula:

[formula not reproduced]

where $f_{name}$ indicates whether there is a nickname, $f_{location}$ indicates whether there is a registered location, $f_{birthday}$ indicates whether there is birthday information, $f_{introduction}$ indicates whether there is a profile, $f_{education}$ indicates whether there is education information, $f_{headPhoto}$ indicates whether there is avatar information, and $m$ denotes the total number of basic information items.

As shown in FIG. 4, step S4 comprises:

Step S41: adopting the Stacking method as the ensemble learning combination strategy to construct the microblog user attribute analysis model, with a support vector machine (SVM), a decision tree, logistic regression, a light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost) as the primary classifiers of the Stacking model, and a logistic regression model as the second-layer classifier.

Step S42: inputting the training set into the model for fitting, and tuning the parameters by grid search to obtain the optimal model.

Step S43: inputting the test set into the fitted model to obtain the final user attribute analysis result.
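Parameter tuning by grid search, as in step S42, amounts to scoring every combination of candidate parameter values and keeping the best one. The sketch below is generic: the text does not say which parameters are tuned or how candidates are scored, so the two parameters (`threshold`, `invert`), the toy threshold rule and the use of accuracy on a small validation set are all hypothetical stand-ins for fitting and scoring the real model.

```python
from itertools import product

def accuracy(params, xs, ys):
    """Score a one-feature threshold rule; stands in for a fitted model's validation score."""
    correct = 0
    for x, t in zip(xs, ys):
        pred = int(x > params["threshold"])
        if params["invert"]:
            pred = 1 - pred
        correct += int(pred == t)
    return correct / len(xs)

def grid_search(grid, score):
    """Evaluate every combination in `grid` and return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Toy validation data: one feature value per sample, with 0/1 labels.
xs = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
ys = [0, 0, 0, 1, 1, 1]
grid = {"threshold": [0.2, 0.5, 0.75], "invert": [False, True]}
best_params, best_score = grid_search(grid, lambda p: accuracy(p, xs, ys))
```

The cost grows with the product of the candidate list lengths, so in practice the grid for each classifier is kept small.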

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A microblog user attribute analysis method based on multiple features is characterized by specifically comprising the following steps of:

2. The method for analyzing the attributes of the microblog users based on the multiple features according to claim 1, wherein in the step S2, the construction of the microblog text features of the users specifically comprises the following steps:

where $m_i$ denotes the set of microblogs of the user with ID $i$, [symbol not reproduced] denotes the set of microblogs of a single user, and $w_t$ denotes a word in a single microblog;

S23: using a Stacking model as the ensemble learning combination strategy, with a support vector machine, a decision tree, logistic regression, a light gradient boosting machine and extreme gradient boosting as the primary classifiers, and logistic regression as the second-layer classifier that combines their predictions, finally obtaining the user's microblog text features.

3. The method for analyzing the attributes of microblog users based on multiple features according to claim 1, wherein in step S3 the constructed composite features comprise: user activity, user microblog time distribution, and user behavior habits;

the user microblog time distribution is calculated as follows:

[formula not reproduced]

where the three symbols of the formula, which are not reproduced, denote respectively the user with ID $i$ in time period $j$, the number of microblogs posted by user $u_i$ at time $j$, and the number of microblogs forwarded by user $u_i$ at time $j$;

4. The multi-feature-based microblog user attribute analysis method according to claim 3, wherein the text behavior habit of the user is calculated according to the proportion of emoticons and pictures in the microblog of the user, and the specific calculation formula is as follows:

5. The method for analyzing the attributes of microblog users based on multiple features according to claim 3, wherein the user's blog post source information is calculated from the male habitual text source $f_{mSource}(u_i)$ and the female habitual text source $f_{fSource}(u_i)$ as follows: $f_{textSource}(u_i) = f_{mSource}(u_i) - f_{fSource}(u_i)$.

6. The method for analyzing the attributes of microblog users based on multiple features according to claim 5, wherein the male habitual text source $f_{mSource}(u_i)$ is calculated as follows:

7. The method for analyzing the attributes of the microblog users based on the multi-feature of claim 5, wherein the calculation formula of the female familiar text source is as follows:

8. The multi-feature-based microblog user attribute analysis method according to claim 3, wherein the user information integrity is defined as follows: $f_{inforIntegrity}$ denotes the integrity of the user's basic information, which comprises the user's nickname, registered location, gender, birthday, profile, education information and avatar information, and is calculated by the following formula:

9. The method for analyzing the attributes of microblog users based on multiple features according to claim 1, wherein in step S4, fusing the multiple base classifiers with the Stacking model fusion technique to construct the microblog user attribute analysis model specifically comprises: constructing the microblog user attribute analysis model with a support vector machine, a decision tree, logistic regression, a light gradient boosting machine and extreme gradient boosting as the primary classifiers, and logistic regression as the second-layer classifier.