CN116842478A

CN116842478A - User attribute prediction method based on Twitter content

Info

Publication number: CN116842478A
Application number: CN202310882146.2A
Authority: CN
Inventors: 樊静; 郭玮; 陈伟; 方楚喻; 李亦非; 庄福振
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-07-18
Filing date: 2023-07-18
Publication date: 2023-10-03

Abstract

The application relates to the technical field of artificial intelligence recommendation systems and discloses a user attribute prediction method based on Twitter content. The method comprises constructing a data set, preprocessing and cleaning the data, modeling users with a vector space model that incorporates time sequence information, training a classifier, testing the performance of the trained classifier on a test set, optimizing the performance of the model on a validation set, and performing user attribute completion. The application uses user-generated content in an online social network to predict demographic characteristics of users, including gender, age, and occupation. This alleviates the feature sparsity problem of traditional basic attribute prediction methods, predicts the basic attributes of social network users, and helps address fake account identification, personalized recommendation, and cold start of recommendation systems in social networks.

Description

User attribute prediction method based on Twitter content

Technical Field

The application relates to the technical field of artificial intelligence recommendation systems, and in particular to a user attribute prediction method based on Twitter content.

Background

With the development of the internet, social network platforms represented by Twitter have gradually become a new core channel for information dissemination by virtue of their huge user bases and considerable traffic. People are increasingly willing to express their personal opinions, attitudes, and emotions on social networks, and this data lays a foundation for user portrait construction. Personalized recommendation systems are increasingly important in marketing and e-commerce, and research shows that personalized recommendation technology can significantly improve the sales of e-commerce platforms. For example, when an e-commerce platform determines that a user is female, it recommends products of interest to women, such as cosmetics and clothing; when the platform knows that a user is under 20 years old, it recommends trendy brands of interest to teenagers; and if a user likes basketball, it recommends basketball shoes, sports equipment, and other related goods. User portrait construction is the key to the successful application of a personalized recommendation system. Research on user portrait construction for online social networks therefore has important application value.

A user portrait includes user interests and basic attributes. The interest portrait describes the user's interest features, while the basic attributes describe the user's demographic characteristics. Basic attributes such as gender, age, and occupation are indispensable when creating a user profile. However, social networks typically do not require users to provide these basic attributes when registering an account, so most users may choose to withhold them or to provide false information to protect their privacy. Relying only on user-submitted basic attributes for calculation and research causes serious bias, so accurate prediction of users' basic attributes is very important for personalized recommendation systems and marketing.

With the advent of the big data era, predicting basic attributes such as gender, age, and occupation from the content a user has generated in the past, further mining the user's interests, and building user portraits are vital to precision marketing and personalized recommendation, and have become a focus of attention for major companies. To solve the problem that the basic attributes of users in a social network cannot be determined, the application provides a user attribute prediction method based on Twitter content.

Disclosure of Invention

To overcome the above drawbacks of the prior art, an embodiment of the present application provides a user attribute prediction method based on Twitter content. The method uses user-generated content in an online social network to predict demographic characteristics of users, including gender, age, and occupation, and uses a vector space model to predict the basic attributes of social network users. This helps address fake account identification, personalized recommendation, and cold start of recommendation systems in social networks, thereby solving the problems presented in the background.

In order to achieve the above purpose, the present application provides the following technical solutions:

A user attribute prediction method based on Twitter content comprises the following steps:

step S1, constructing a data set: concatenating all Twitter data of a user to form a text document, and dividing the text documents into a training set, a validation set, and a test set, wherein the training set and the validation set both contain all attribute labels;

step S2, data processing: preprocessing the text and cleaning the data to filter out noise in the text;

step S3, text representation: fusing all tweets posted by the user into a text document, arranging the tweets in order according to time sequence information, representing the text with a vector space model, and taking the text representation as the input of a text classifier;

step S4, constructing a text classifier: training a support vector machine classification algorithm with the class labels in the training set to obtain an optimal classifier, evaluating and optimizing the classifier model with the class labels in the validation set, and testing the trained classifier model on the test set;

step S5, attribute completion: performing attribute completion by using the classifier.

As a further aspect of the present application, in step S1, all Twitter data of the user refers to the Twitter attributes of the user, which include text data, basic attribute information, and social network attributes.

As a further aspect of the application, the basic attribute information comprises structured user characteristics: name, alias, gender, nationality, ethnicity, age, date of birth, place of birth, place of residence, educational background, school of graduation, field of study, occupation, workplace, and job position.

As a further aspect of the application, the social network attributes include account ID, account name, home page link, IP location, account creation time, number of accounts followed, number of followers, number of posts, number of retweets, posting frequency, posting device, profile description, hobbies, personality assessment, active region, network communities joined, hot topics participated in, topics followed, interacting accounts, structure of the social circle to which the account belongs, follower groups, communities followed, key opinion leaders (KOLs) followed, and media followed.

As a further aspect of the application, in step S2, the noise contained in the original text features comes from "@" mentions of other users, emoticons, and URL addresses in the tweet content, and this noise is removed from the text using regular expressions.

As a further aspect of the application, in step S3, the vector space model represents a document as a document vector, each component of which is the weight of one feature word in the document. For each category, the chi-square statistic (CHI) is used for feature extraction to select feature words that represent the category, and the weight of each feature word is then calculated by term frequency-inverse document frequency (TF-IDF). Feature extraction and feature value calculation are carried out for each prediction task to construct a feature dictionary, and user modeling is carried out for each prediction task with the feature dictionary, thereby constructing a vector space model of the Twitter user. The user is expressed as:

$$U = (K_1, W_1;\ K_2, W_2;\ \ldots;\ K_n, W_n)$$

wherein U is the representation of the user, K is the feature word, W is the weight of the feature word, and n is the number of the feature words.
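The construction of U can be illustrated with a minimal pure-Python sketch. It assumes tokenised user documents, a 2x2 contingency-table form of the CHI statistic, and a smoothed IDF; the function names, the `top_k` parameter, and the toy data are hypothetical and are not taken from the application.

```python
import math
from collections import Counter

def chi_square(docs, labels, term, category):
    """CHI statistic of `term` for `category` from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document outside category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def build_feature_dictionary(docs, labels, top_k):
    """Keep the top_k terms per category, ranked by CHI score."""
    vocab = sorted({t for tokens in docs for t in tokens})
    selected = set()
    for category in sorted(set(labels)):
        ranked = sorted(
            vocab, key=lambda t: (-chi_square(docs, labels, t, category), t)
        )
        selected.update(ranked[:top_k])
    return sorted(selected)

def tfidf_user_vector(tokens, feature_words, docs):
    """Return U as a list of (K_i, W_i) pairs with TF-IDF weights."""
    counts = Counter(tokens)
    n_docs = len(docs)
    pairs = []
    for word in feature_words:
        tf = counts[word] / len(tokens) if tokens else 0.0
        df = sum(1 for d in docs if word in d)
        # Smoothed IDF; the exact IDF variant is an assumption.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        pairs.append((word, tf * idf))
    return pairs

# Toy corpus: one tokenised document per user, with a gender label.
docs = [
    ["makeup", "dress", "sale"],
    ["dress", "makeup", "coffee"],
    ["basketball", "game", "coffee"],
    ["game", "basketball", "sale"],
]
labels = ["female", "female", "male", "male"]
features = build_feature_dictionary(docs, labels, top_k=4)
user = tfidf_user_vector(docs[0], features, docs)
```

In this toy corpus the words "coffee" and "sale" occur equally in both classes, so their CHI score is zero and they are left out of the feature dictionary. Note that CHI also scores a word highly when its absence is characteristic of a category.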

As a further aspect of the present application, in step S3, feature extraction and feature value calculation are performed for each prediction task to construct a feature dictionary, user modeling is performed for each prediction task using the feature dictionary to construct a vector space model of the Twitter user, and the gender, age, and occupation labels of the user are predicted.

As a further aspect of the present application, in step S3, the text information is arranged in chronological order to generate a text stream. In actual Twitter scenarios there are two time formats, absolute posting time and relative posting time: the absolute posting time is a specific hour, minute, and second, while the relative posting time is a timestamp expressed as a specified duration before the present. The specified relative duration is subtracted from the current timestamp to convert the relative posting time into an absolute posting time.

The technical effects and advantages of the user attribute prediction method based on Twitter content are as follows:

The application uses user-generated content in an online social network to predict demographic characteristics of users, including gender, age, and occupation. This alleviates the feature sparsity problem of traditional basic attribute prediction methods, predicts the basic attributes of social network users, and helps address fake account identification, personalized recommendation, and cold start of recommendation systems in social networks. It meets the needs of core algorithms such as social-network-based user portrait construction and application. User attributes are predicted by combining tweet text content with time information, which addresses the difficulty of determining users' basic attributes. Further features and algorithms can be added as needed to optimize and extend the method and to improve prediction accuracy and practicality.

Drawings

FIG. 1 is a flowchart of the user attribute prediction method based on Twitter content according to the present application;

FIG. 2 is a schematic structural diagram of the user attribute prediction method based on Twitter content according to the present application.

Detailed Description

The embodiments of the present application are described in detail below with reference to the accompanying drawings. Based on the embodiments described herein, all other embodiments that one of ordinary skill in the art can obtain without creative effort fall within the scope of protection of the present application.

As shown in fig. 1, the user attribute prediction method based on the twitter content provided by the application specifically comprises the following steps:

and S5, performing attribute completion by using a classifier.

The application uses user-generated content in an online social network to predict demographic characteristics of users, including gender, age, and occupation. This alleviates the feature sparsity problem of traditional basic attribute prediction methods, predicts the basic attributes of social network users, and helps address fake account identification, personalized recommendation, and cold start of recommendation systems in social networks.

The application aims to meet the needs of core algorithms such as social-network-based user portrait construction and application and, drawing on social science theory, addresses the difficulty of determining users' basic attributes.

It should be noted that the application takes Twitter users as the research object. Demographic characteristics of a user, such as gender, age, and occupation, can be predicted from the content the user generates on Twitter, including text data, structured user characteristics, and social network attributes. Most of the data users generate on Twitter is text, so the problem can be converted into a text classification problem in machine learning.

The text documents are divided into a training set, a validation set, and a test set, wherein the training set and the validation set both carry attribute labels. After the text documents are converted to a text representation, a classification algorithm is fitted to the training set to obtain a classifier. Finally, the trained classifier predicts the labels of the documents in the test set, namely the basic attributes of the users, and the accuracy of the predictions is evaluated.
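The three-way division can be sketched as follows. The application does not state split proportions, so the 80/10/10 ratios, the fixed seed, and the `(user_id, document, label)` tuple layout are assumptions made purely for illustration.

```python
import random

def split_dataset(user_docs, train_ratio=0.8, val_ratio=0.1, seed=42):
    """Shuffle user documents and split them into train/validation/test lists."""
    items = list(user_docs)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * train_ratio)
    n_val = int(len(items) * val_ratio)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder becomes the test set
    return train, val, test

# Hypothetical records: (user id, concatenated tweets, gender label).
users = [
    (f"user_{i}", f"tweets of user {i}", "female" if i % 2 else "male")
    for i in range(20)
]
train, val, test = split_dataset(users)
```

Splitting at the user level, as here, keeps all tweets of one user in a single subset, which matches the one-document-per-user representation described above.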

To predict the basic attributes of online social network users, the application adopts a text classification method from machine learning. The key technologies are text preprocessing, feature extraction, text representation, and classification. Tweets are open-ended: a user can post according to personal preference and can add "@" mentions of other users, emoticons, and URL addresses to the tweet content. Extracting features directly from the raw text introduces a great deal of noise and strongly affects experimental results. It is therefore necessary to preprocess the tweet text and filter out the noise.

Specifically, three kinds of noise are filtered with regular expressions. First, when a user retweets another user's tweet, a mark of the form @XXX appears in the tweet text, and the same mark appears when the user mentions other users in a tweet; these marks are removed. Second, URLs often appear in tweet text. A URL supplements the tweet content by linking to another website but carries no useful information itself, so it needs to be filtered; because URL addresses in tweets start with http, they can be matched and removed. Third, tweet text typically contains emoticons, which reflect the user's mood and attitude when posting but also introduce considerable noise. For example, users sometimes use emoticons with the opposite meaning to express an emotion, which is difficult for a computer to interpret. To reduce the impact of this noise, emoticons in tweets are also filtered.
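A minimal sketch of this filtering step is given below. The application does not disclose its regular expressions, so the patterns here are assumptions: mentions are matched as `@` followed by word characters, URLs as anything starting with `http` or `https`, and emoticons as a few common Unicode emoji ranges (text emoticons such as `:)` are not covered).

```python
import re

MENTION_RE = re.compile(r"@\w+")            # "@XXX" marks from retweets and mentions
URL_RE = re.compile(r"https?://\S+")        # URL addresses starting with http
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)                                           # assumed emoji code-point ranges
WHITESPACE_RE = re.compile(r"\s+")

def clean_tweet(text):
    """Remove mentions, URLs, and emoji, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()

cleaned = clean_tweet("Great game tonight @bob 😍 check http://t.co/abc123")
```

URLs are removed first so that an `@` inside a link is not treated as a mention.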

As a further aspect of the application, after the text document has undergone simple Chinese-language processing, in step S3, the vector space model represents the document as a document vector, each component of which is the weight of one feature word in the document. For each category, the chi-square statistic (CHI) is used for feature extraction to select feature words that represent the category, and the weight of each feature word is then calculated by term frequency-inverse document frequency (TF-IDF). Feature extraction and feature value calculation are carried out for each prediction task to construct a feature dictionary, and user modeling is carried out for each prediction task with the feature dictionary, thereby constructing a vector space model of the user. The user is expressed as:

$$U = (K_1, W_1;\ K_2, W_2;\ \ldots;\ K_n, W_n)$$

Further, in step S3, the text information is arranged in chronological order to generate a text stream. In actual Twitter scenarios there are two time formats, absolute posting time and relative posting time: the absolute posting time is a specific hour, minute, and second, while the relative posting time is a timestamp expressed as a specified duration before the present. The specified relative duration is subtracted from the current timestamp to convert the relative posting time into an absolute posting time.

It should be noted that the collected raw user information lacks time sequence information, whereas the application arranges the text information in posting order to generate a text stream, so that earlier tweets precede later ones. In actual Twitter scenarios there are two common time formats: absolute posting time and relative posting time. Absolute posting time refers to a specific hour, minute, and second, while relative posting time indicates how long ago the tweet was posted, for example a few minutes ago. To convert a relative posting time to an absolute posting time, the relative interval is subtracted from the current timestamp. For example, when the relative posting time of a tweet is "5 minutes ago" and the current time is 12:00 PM, the absolute posting time is obtained by subtracting the 5-minute interval from the current time, giving 11:55 AM.
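The conversion and the subsequent ordering can be sketched as follows. The string formats are assumptions: relative times are taken to look like "5 minutes ago" and absolute times like "2023-07-18 09:30:00"; the function name `to_absolute` and the sample tweets are hypothetical.

```python
import re
from datetime import datetime, timedelta

RELATIVE_RE = re.compile(r"(\d+)\s*(second|minute|hour|day)s?\s+ago")
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def to_absolute(post_time, now):
    """Convert a relative or absolute posting-time string to a datetime."""
    match = RELATIVE_RE.fullmatch(post_time.strip())
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        # Relative time: subtract the stated interval from the current time.
        return now - timedelta(seconds=amount * UNIT_SECONDS[unit])
    # Otherwise treat the string as an absolute timestamp (assumed format).
    return datetime.strptime(post_time.strip(), "%Y-%m-%d %H:%M:%S")

now = datetime(2023, 7, 18, 12, 0, 0)       # current time: 12:00 PM
absolute = to_absolute("5 minutes ago", now)

# Order a user's tweets by absolute posting time to form the text stream.
tweets = [
    ("5 minutes ago", "second tweet"),
    ("2023-07-18 09:30:00", "first tweet"),
]
stream = [text for _, text in sorted(tweets, key=lambda t: to_absolute(t[0], now))]
document = " ".join(stream)
```

With the current time fixed at 12:00 PM, the tweet posted "5 minutes ago" resolves to 11:55 AM, in line with the worked example above.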

Specifically, in the vector space model (VSM), a common way to represent text is the one-hot model. A feature dictionary is formed by extracting feature words, and the text is represented as a vector consisting only of 0s and 1s, where 1 indicates that the feature word appears in the text and 0 indicates that it does not. This model is simple and efficient, but it has weak expressive power, produces sparse vectors, and neglects high-frequency words. The application aims to predict the gender, age, and occupation labels of a user, so feature extraction and feature value calculation are performed for each prediction task to construct a feature dictionary. User modeling is then performed for each prediction task using the feature dictionary, thereby constructing a vector space model of the Twitter user.
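The one-hot representation can be shown in a few lines. The feature dictionary and tokens below are made-up examples, not data from the application.

```python
def one_hot_vector(tokens, feature_dictionary):
    """Return a 0/1 vector marking which feature words occur in the text."""
    present = set(tokens)
    return [1 if word in present else 0 for word in feature_dictionary]

feature_dictionary = ["basketball", "dress", "game", "makeup"]
# "game" occurs three times, but the vector only records that it occurs.
vector = one_hot_vector(["game", "game", "game", "dress"], feature_dictionary)
```

The repeated word contributes a single 1, which illustrates why a binary vector loses frequency information that a weighted representation retains.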

Further, for the time series feature representation, the VSM-vectorized text features have two notable characteristics. First, the number of tweets differs from user to user, so the vectorized text features differ in length. Second, each user's tweets are strongly dependent on one another in time. To better capture the relationship between earlier and later tweet content, the application adopts a recurrent neural network (RNN) to model the tweet sequence. However, because the correlation between earlier and later tweets is strong, a unidirectional RNN alone cannot fully capture the influence of later tweets on earlier ones, so a bidirectional RNN is used to model the information flow in both directions simultaneously.

The bidirectional RNN consists of two independent RNNs: one processes the input sequence in forward chronological order and the other in reverse chronological order. The forward RNN models the input state up to the current moment, and the reverse RNN models the input state after the current moment. Concatenating the outputs of the two RNNs yields a comprehensive feature representation containing both past and future context. This better captures long-term dependencies and context in the tweet sequence, so that the model can understand and encode the content of the user's tweets more comprehensively and, by learning the correlation between earlier and later tweets, predict the user's posting trends and behavior patterns more accurately.

Finally, bidirectional learning yields the output vectors h_l and h_r of the last time step of the forward and reverse RNNs, respectively, and the two vectors are concatenated to obtain the feature vector h of the user's tweets.

It should be noted that the method predicts basic attributes of a user from user-generated content in an online social network, which is essentially a text classification task in machine learning. The key technologies are text preprocessing, text feature extraction, text representation, and text classification. All tweets posted by one user are fused together and treated as one text document, so that the user modeling problem is converted into the problem of representing the tweet text document.
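The concatenation of h_l and h_r can be sketched with a small pure-Python forward pass. This assumes a plain Elman-style cell with tanh activation and randomly initialised, untrained weights; the application does not specify the cell type, sizes, or training procedure, and all names below are hypothetical.

```python
import math
import random

def make_rnn(input_size, hidden_size, rng):
    """Randomly initialised RNN parameters (untrained, for illustration only)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_x": matrix(hidden_size, input_size),   # input-to-hidden weights
        "W_h": matrix(hidden_size, hidden_size),  # hidden-to-hidden weights
        "b": [0.0] * hidden_size,
    }

def run_rnn(params, sequence):
    """Return the hidden state after the last time step of `sequence`."""
    hidden = [0.0] * len(params["b"])
    for x in sequence:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(params["W_x"][j], x))
                + sum(w * hi for w, hi in zip(params["W_h"][j], hidden))
                + params["b"][j]
            )
            for j in range(len(hidden))
        ]
    return hidden

def encode_user(sequence, forward, backward):
    """Concatenate the final states of the forward and reverse passes."""
    h_l = run_rnn(forward, sequence)                    # chronological order
    h_r = run_rnn(backward, list(reversed(sequence)))   # reverse order
    return h_l + h_r                                    # feature vector h

rng = random.Random(0)
forward = make_rnn(3, 4, rng)
backward = make_rnn(3, 4, rng)

# Two users with different numbers of tweets (each tweet is a 3-dim vector).
short_user = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0]]
long_user = short_user + [[0.5, 0.1, 0.0], [0.0, 0.0, 0.4], [0.2, 0.2, 0.2]]
h_short = encode_user(short_user, forward, backward)
h_long = encode_user(long_user, forward, backward)
```

Both users end up with a vector of the same length (twice the hidden size) regardless of how many tweets they posted, which addresses the variable-length issue noted above.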

Specifically, the vector space model is a document representation model proposed by Salton et al. in the 1970s. The model represents a document as a vector, each component of which is the weight of a feature word in the document. The key to implementing the vector space model is extracting the feature words of the document and calculating their weights.

In the VSM, text can be represented with a one-hot model: a feature dictionary is formed by extracting feature words, and the text is represented as a vector consisting only of 0s and 1s, where 1 indicates that the feature word appears in the text and 0 indicates that it does not. The model is simple and efficient, but it has weak expressive power, produces sparse vectors, and neglects high-frequency words.

In text classifier construction, each user is represented as a text document using the text representation, and a machine learning classification algorithm is trained with the class labels in the training set to obtain an optimal classifier. The classifier can then predict the user category labels in the test set, thereby obtaining the basic attributes of the users. The text classification algorithm used in the application is the support vector machine (SVM). The SVM improves the generalization ability of the learner by minimizing structural risk, that is, by minimizing both the empirical risk and the confidence interval. Even when the number of training samples is small, the SVM can reflect the real data distribution well and yield a classifier with good generalization ability. The SVM is essentially a binary classification model whose basic form is the linear classifier with the largest margin in feature space. Its learning strategy is margin maximization, which is ultimately converted into a convex quadratic programming problem to be solved.
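The margin-maximization problem can be written out explicitly. The following is the standard hard-margin formulation for linearly separable data, given as a sketch; the symbols w, b, x_i, y_i, and m are notation introduced here and do not appear in the application, which does not state which SVM variant or solver it uses.

```latex
\min_{w,\,b}\ \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \ldots, m
```

Here x_i is the feature vector of the i-th training user, y_i is its class label in {-1, +1}, and the margin between the two classes is 2 / ||w||, so minimizing ||w||^2 maximizes the margin. The objective is quadratic and the constraints are linear, which is why the problem is a convex quadratic program. Because this model separates two classes, multi-class attributes such as age group or occupation would presumably need a scheme such as one-vs-rest; that is an assumption, as the application does not specify one.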

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Finally: the foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims

1. A user attribute prediction method based on twitter content is characterized by comprising the following steps:

step S2, data processing: preprocessing the text and cleaning the data, and filtering out noise;

step S3, text representation: fusing all tweets posted by the user into a text document, arranging the tweets in order according to time sequence information, representing the text document with a vector space model, and taking the text representation as the input of a text classifier;

and S5, performing attribute completion by using a classifier.

2. The method according to claim 1, wherein in step S1, all Twitter data of the user refers to the Twitter attributes of the user, which include text data, basic attribute information, and social network attributes.

3. The method of claim 2, wherein the basic attribute information comprises structured user characteristics: name, alias, gender, nationality, ethnicity, age, date of birth, place of birth, place of residence, educational background, school of graduation, field of study, occupation, workplace, and job position.

4. The method according to claim 1, wherein the social network attributes include account ID, account name, home page link, IP location, account creation time, number of accounts followed, number of followers, number of posts, number of retweets, posting frequency, posting device, profile description, hobbies, personality assessment, active region, network communities joined, hot topics participated in, topics followed, interacting accounts, structure of the social circle to which the account belongs, follower groups, communities followed, key opinion leaders (KOLs) followed, and media followed.

5. The method according to claim 1, wherein in step S2, the noise contained in the original text features comes from "@" mentions of other users, emoticons, and URL addresses in the tweet content, and this noise is removed from the text by means of regular expressions.

6. The method of claim 1, wherein in step S3, all tweets posted by the user are fused together and arranged according to time sequence information, and the vector space model represents the resulting document as a document vector, each component of which is the weight of one feature word in the document; for each category, the chi-square statistic (CHI) is used for feature extraction to select feature words that represent the category, and the weight of each feature word is calculated by term frequency-inverse document frequency (TF-IDF); feature extraction and feature value calculation are performed for each prediction task to construct a feature dictionary, and the user is modeled for each prediction task using the feature dictionary, thereby constructing the vector space model of the Twitter user, the user being expressed as:

$$U = (K_1, W_1;\ K_2, W_2;\ \ldots;\ K_n, W_n)$$

7. The method according to claim 1, wherein in step S3, feature extraction and feature value calculation are performed for each prediction task to construct a feature dictionary, user modeling is performed for each prediction task using the feature dictionary to construct a vector space model of the Twitter user, and the gender, age, and occupation labels of the user are predicted.

8. The method according to claim 1, wherein in step S3, the text information is arranged in chronological order to generate a text stream; in actual Twitter scenarios there are two time formats, absolute posting time and relative posting time, the absolute posting time being a specific hour, minute, and second and the relative posting time being a timestamp expressed as a specified duration before the present; and the specified relative duration is subtracted from the current timestamp to convert the relative posting time into an absolute posting time.