CN106354818B

CN106354818B - Social media-based dynamic user attribute extraction method

Info

Publication number: CN106354818B
Application number: CN201610767430.5A
Authority: CN
Inventors: 杨阳; 黄秀; 胡玥; 沈复民; 邵杰
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-08-30
Filing date: 2016-08-30
Publication date: 2020-01-10
Anticipated expiration: 2036-08-30
Also published as: CN106354818A

Abstract

The invention discloses a method for extracting dynamic user attributes from social media. The method first performs text preprocessing on an acquired training sample set and then extracts topic words, obtaining K topics and m topic words for each topic. Short texts of the user to be processed are then extracted and divided into time sub-segments, and the data are filled in through a sliding time window to obtain the text data of each time sub-segment. After text preprocessing, the occurrence frequency of the topic words of each topic is counted to obtain the attribute weight information of each topic. A time attenuation coefficient is introduced, user attribute features associated with time are obtained in chronological order, and the user attribute features of the latest time sub-segment are output as the user's current attribute features. Without using external knowledge, the method achieves semantic expansion of short social media texts through unordered words in the text, and can extract a user's dynamic attributes from the microblog texts the user publishes or forwards.

Description

Social media-based dynamic user attribute extraction method

Technical Field

The invention belongs to the field of computers, and particularly relates to a method for extracting dynamic user attributes based on social media.

Background

Social media services define a new way for users to communicate, express themselves, and share with one another over a network. As social media continues to develop, more and more people publish and share instant messages on social media platforms; common platforms include Sina Weibo, Twitter, Facebook, and LinkedIn. For example, on the Sina Weibo platform a user can publish a microblog of up to 140 characters, which may contain Chinese and English text, custom characters, external links, and the like. Effectively analyzing a user's microblog short-text stream to detect the user's dynamic attributes is therefore of great significance for research and applications in related fields, such as social recommendation, personalized retrieval, and online promotion.

A user portrait, that is, the tagging of user information, is a basic way for enterprises to apply big data technology: after collecting and analyzing data on consumers' social attributes, living habits, consumption behaviors, and other key information, an enterprise abstracts an overall business view of the user. User portraits give enterprises a sufficient information base and help them quickly identify precise user groups, user needs, and other broader feedback. Given the development of current social networks, more and more users express themselves on social platforms in various forms, so studying user portraits in a social media environment is meaningful. At present, social-media-based user portraits still have much room for improvement: the description of user attributes is neither deep nor complete, and it is not updated in a timely manner. Further in-depth research is needed to solve these problems and to construct deep and comprehensive user portraits that provide comprehensive and detailed information for services such as personalized recommendation systems and information retrieval.

Because of the particular nature of short texts on social media platforms, their sparsity must be addressed when profiling user interests. The conventional approach is to augment short-text semantics with external knowledge, for example: expanding the content of short texts by linking what a user posts on social media to related news articles, so as to analyze the user's social media activity more effectively; automatically classifying user interests on social media by using Wikipedia together with a method of weighting related interests; or extracting interest tags from users' self-descriptions on social media to expand the information in the short texts. These three approaches, which solve the short-text sparsity problem through external knowledge, depend heavily on the availability of external data and on its correlation with the original data. If the external data are wrong or insufficient, the interests obtained may not match the user's real interests.

In addition, there is currently a large body of work on cross-platform user portrait modeling, in which more accurate modeling and analysis of a user is achieved through data from two or more social media platforms. Examples include portraying user interests with the basic information the user fills in when registering on a social media platform and with the user tags that the platform sets for the user, or studying and analyzing user behavior and interests with the data of users who hold associated accounts on different social media platforms at the same time. However, these methods still address short-text sparsity by extending semantics, and the user attributes they finally obtain are all static; they do not consider that user attributes change over time.

Disclosure of Invention

The purpose of the invention is to solve the sparsity problem of short texts and to overcome defects of the prior art such as inaccurate user attribute mining and the inability to update attributes in time. The method is based on a newly constructed dynamic user attribute model, which automatically mines the dynamic attributes of a user from text and shows how the user attributes change. Without using external knowledge, it achieves semantic extension of short social media texts through unordered words in the text, smooths the data through a time window, and introduces a decay function to express the influence of past attributes on current attributes.

The invention discloses a social media-based dynamic user attribute extraction method, which comprises the following steps:

Step 1: topic extraction:

101: collecting a training sample set:

extracting the short texts published by users on social media, and selecting as sample users those users whose number of short texts is greater than or equal to a threshold θ1 (for example, 200);

forming a training sample set from the short texts of the different sample users, and performing text preprocessing on the training samples (namely the short texts): removing links, non-Chinese characters and custom words from each short text, then performing word segmentation on the short text and filtering out stop words and meaningless high-frequency words. Custom words can be removed by matching the short text against a preset custom lexicon and deleting the matched words. Stop words and meaningless high-frequency words can be filtered in the same way: lexicons of stop words and of meaningless high-frequency words are constructed in advance, the words obtained from word segmentation are matched against these lexicons, and the matched words are filtered out;
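The preprocessing in step 101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the regular expressions, and the toy forward-maximum-matching segmenter are all introduced here for illustration, and a real system would substitute a proper Chinese word segmenter and real lexicons.

```python
import re

# Assumed patterns: http(s) links, and any run of characters outside the
# basic CJK Unified Ideographs block (treated here as "non-Chinese").
URL_RE = re.compile(r"https?://\S+")
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fff]+")


def max_match_segment(run, vocab, max_len=4):
    """Toy forward-maximum-matching segmenter (stand-in for a real one)."""
    words, i = [], 0
    while i < len(run):
        for size in range(min(max_len, len(run) - i), 0, -1):
            piece = run[i:i + size]
            # Fall back to a single character when nothing in vocab matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def preprocess(text, custom_words, stop_words, vocab):
    """Remove links, custom words and non-Chinese characters, then segment
    the text and drop stop words / meaningless high-frequency words."""
    text = URL_RE.sub(" ", text)
    for word in custom_words:          # match against the custom lexicon
        text = text.replace(word, " ")
    tokens = []
    for run in NON_CHINESE_RE.split(text):
        if run:
            tokens.extend(max_match_segment(run, vocab))
    return [token for token in tokens if token not in stop_words]
```

Passing the lexicons in as arguments keeps the sketch independent of any particular stop-word list or segmentation dictionary.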

102: performing text topic extraction on the training sample set to obtain K topics. In this step, a BTM (Biterm Topic Model) is used to extract the topics. Its advantage is that it expands the semantics of short texts by using unordered co-occurring word pairs in the corpus, thereby alleviating the text sparsity problem. After topic extraction, K topics are obtained; each topic comprises a series of keywords, and the weight information of each keyword can be obtained at the same time, for example from the document-topic distribution matrix of the BTM model. The m keywords with the largest weights are selected from the keywords of each topic as its topic words, and the weight of each topic word is recorded. Table 1 gives an example with 10 topics, each comprising 5 topic words; the value in parentheses after each topic word is its weight.
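Two pieces of step 102 lend themselves to a short sketch: forming the unordered co-occurring word pairs (biterms) that BTM models, and picking the m highest-weighted keywords of each topic. The code below does not perform BTM inference; the function names and the dict-per-topic layout of the keyword weights are hypothetical choices made for illustration.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of words co-occurring in one short text."""
    # Sorting each pair makes ("a", "b") and ("b", "a") the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def select_topic_words(keyword_weights, m):
    """Keep the m highest-weighted keywords of each topic.

    keyword_weights: one dict per topic mapping keyword -> weight
    (assumed layout; a BTM implementation would supply these weights).
    """
    selected = []
    for weights in keyword_weights:
        ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
        selected.append(dict(ranked[:m]))
    return selected
```

Because a short text with n words yields n(n-1)/2 biterms, even very short posts contribute several co-occurrence observations, which is the sense in which the word pairs expand the sparse text.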

TABLE 1

Step 2: extracting dynamic attributes of the user:

201: extracting the short texts published on social media by the user to be processed within a time period T (for example, the last year), and dividing the time period T into q time sub-segments to obtain the short texts published in each time sub-segment; performing text preprocessing on the short texts to obtain the text data corresponding to each time sub-segment;

expanding the text data of the current time sub-segment with the text data of the p time sub-segments nearest to it through a sliding time window;
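The sliding-window expansion in step 201 might look like the sketch below. It assumes that the "nearest" p sub-segments are the p sub-segments immediately preceding the current one, which the text does not state explicitly; the function name and the list-of-token-lists layout are likewise illustrative.

```python
def expand_with_window(segments, p):
    """Fill each time sub-segment with the data of its p predecessors.

    segments: list of token lists, index t = time sub-segment, oldest first.
    Returns a new list in which entry t holds the tokens of sub-segments
    max(0, t - p) .. t (assumption: the window looks backwards only).
    """
    expanded = []
    for t in range(len(segments)):
        merged = []
        for chunk in segments[max(0, t - p):t + 1]:
            merged.extend(chunk)
        expanded.append(merged)
    return expanded
```

The effect is a smoothing of the data: a sub-segment in which the user posted little still inherits text from its recent neighbours.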

Step 202: based on the m topic words of each topic obtained in step 102, performing word frequency statistics of the topic words on the text data of each time sub-segment, and calculating the weight of each topic [formula image not reproduced],

where $n_{ki}$ denotes the word frequency of the i-th topic word of topic k, $w_{ki}$ denotes the weight of the i-th topic word of topic k under that topic, $k = 1, 2, \dots, K$, and the subscript $t = 0, 1, \dots, q$ is the time sub-segment identifier. The K topic weights of the same time sub-segment together give the topic weight information $A_t$ of each time sub-segment;
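A sketch of step 202 follows. The exact formula is not reproduced in this text, so the code assumes the simplest combination consistent with the definitions given: the weight of topic k is the sum over its topic words of word frequency times topic-word weight. That weighted sum, the function name, and the data layout are assumptions.

```python
from collections import Counter


def topic_weight_vector(tokens, topic_words):
    """Compute A_t for one time sub-segment.

    tokens: preprocessed words of the sub-segment.
    topic_words: one dict per topic mapping topic word -> weight w_ki.
    Assumed formula: weight of topic k = sum_i n_ki * w_ki, where n_ki is
    the number of times the i-th topic word of topic k occurs in tokens.
    """
    counts = Counter(tokens)
    return [
        sum(counts[word] * weight for word, weight in words.items())
        for words in topic_words
    ]
```

For instance, with topic words {"run": 0.5, "gym": 0.25} and a sub-segment containing "run" twice and "gym" once, the assumed weight of that topic is 2 × 0.5 + 1 × 0.25 = 1.25.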

Step 203: calculating the user attribute features according to the formula

$$A'_t = \sum_{j=0}^{t} \lambda(T_j)\, A_j$$

where the attenuation coefficient is $\lambda(T_j) = 1 - \mu T_j^{v}$, $T_j$ denotes the time interval of time sub-segment j, $0 < \mu < 1$, and $v > 0$. The values of the parameters μ and v need to be determined by experimental tuning.

The weight information $A_t$ obtained in step 202 corresponds to the static attributes of the user in each time sub-segment. However, those attributes are limited to that time sub-segment, and attributes that the user has but that are not mentioned in the current time sub-segment are ignored. User attributes change in a continuous process: earlier attributes are still held and change with a trend over time. The invention therefore introduces the attenuation coefficient to attenuate the user's earlier attributes to a certain degree. That is, the topic weight information $\{A_0, A_1, \dots, A_t\}$ of the current time sub-segment t and of the preceding time sub-segments is multiplied by the respective attenuation coefficients $\{\lambda(T_0), \lambda(T_1), \dots, \lambda(T_t)\}$ and then summed to obtain the user attribute feature $A'_t$ of the current time sub-segment. In this way the user attribute features of the current time sub-segment incorporate the user's earlier attribute features and follow the user's real trend of attribute change.

Step 204: taking the user attribute feature $A'_q$ of the q-th time sub-segment as the user's current attribute feature and outputting it.
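The decay-weighted accumulation of steps 203 and 204 can be sketched as below. The default values μ = 0.56 and v = 0.06 are the ones stated in the claims. The text does not say how the time interval T_j is measured, so the sketch assumes T_j = t − j, the number of sub-segments between sub-segment j and the current one; the function names are also illustrative.

```python
def decay(interval, mu=0.56, v=0.06):
    """Attenuation coefficient lambda(T) = 1 - mu * T**v."""
    return 1.0 - mu * interval ** v


def dynamic_attributes(history, mu=0.56, v=0.06):
    """Compute A'_t from the topic weight vectors A_0 .. A_t.

    history: list of equal-length vectors, oldest first; the last entry
    is the current time sub-segment. Assumption: T_j = t - j.
    """
    t = len(history) - 1
    result = [0.0] * len(history[0])
    for j, a_j in enumerate(history):
        coefficient = decay(t - j, mu, v)
        for k, value in enumerate(a_j):
            result[k] += coefficient * value
    return result
```

Under this assumption the current sub-segment has interval 0 and is kept at full weight, while the immediately preceding one is scaled by 1 − 0.56 = 0.44. Calling `dynamic_attributes` on the full history A_0 .. A_q yields the vector output in step 204.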

By adopting the above technical solution, the invention has the following beneficial effects: the dynamic attributes of a user can be mined from the microblog texts the user publishes or forwards, and the trend of change in the user attributes can be displayed, so that the user's attributes over a future period can be predicted. The results show that the method performs better than methods that discover static user attributes and better matches the user's current interests.

Drawings

FIG. 1 is a diagram of an implementation model framework of an embodiment.

Fig. 2 is a diagram showing 10 attribute changes of a certain user in the embodiment.

Fig. 3 is a distribution diagram of attributes of 3 users in the embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

Referring to fig. 1, the method for extracting the social media-based dynamic user attribute mainly involves three parts: text data preprocessing (text preprocessing for short), theme extraction and user dynamic attribute mining.

Short texts of Sina Weibo users are obtained with a crawler. Because these texts contain a great deal of noise, preprocessing such as word segmentation and removal of meaningless characters is applied to obtain text with less noise. The BTM topic model is used to extract 10 topics (fitness, food, digital, sports, makeup, travel, military, music, favorite, and games) together with the 20 highest-weighted high-frequency keywords of each topic, and the 5 highest-weighted of these keywords are then taken as the topic words of each topic, as shown in Table 1.

The short texts published by a single user to be processed within one year are extracted and divided into several subsets according to the time sub-segments. A time window of 3 months is set, and by sliding the time window the short texts of the several (for example, 3) time sub-segments nearest to the current time sub-segment are added to the current time sub-segment.
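Dividing one year of posts into time sub-segments could be done as in the sketch below. It assumes calendar-month sub-segments, which matches the 3-month window and the 11th sub-segment mentioned in this embodiment but is not stated outright; the function name and the (date, tokens) input layout are hypothetical.

```python
from datetime import date


def split_by_month(posts, start, n_segments):
    """Group posts into monthly time sub-segments.

    posts: iterable of (date, tokens) pairs.
    start: any date in the first month of the period.
    Returns n_segments token lists, index 0 = first month; posts outside
    the period are dropped. Monthly granularity is an assumption.
    """
    segments = [[] for _ in range(n_segments)]
    for day, tokens in posts:
        index = (day.year - start.year) * 12 + (day.month - start.month)
        if 0 <= index < n_segments:
            segments[index].extend(tokens)
    return segments
```

With `n_segments=12` the sub-segments are indexed 0 to 11, so the last one corresponds to the 11th time sub-segment used for the final output.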

After the short texts of each time sub-segment are preprocessed, the weight of each topic is obtained based on the 10 topics (that is, K = 10) extracted by the BTM topic model [formula image not reproduced].

The 10 topic weights of the same time sub-segment then give the topic weight information $A_t$ of each time sub-segment, where the subscript t ($t = 0, 1, \dots, q$) is the time sub-segment identifier. Fig. 2 shows how the topic weight information of a randomly selected user varies across the time sub-segments; the trend of each attribute over time can be seen.

Finally, the user attribute feature $A'_{11}$ of the 11th time sub-segment is taken as the user's current attribute feature and output. The current attribute distributions of three randomly selected users, shown in Fig. 3, reveal the different preferences of each user.

Furthermore, the user's current attribute feature ($A'_{11}$, corresponding to the latest time sub-segment) can be normalized, taking the ratio of a single topic's weight to the sum over the 10 topics as the normalized result, and the presence of each topic attribute is then judged against a preset threshold θ2: if the normalized result of the current topic is greater than or equal to θ2, the topic is judged to be present; otherwise it is judged to be absent. Each topic is represented by 1 if present and 0 otherwise, giving a vector $L_t$ from which the user's attribute distribution can be seen more intuitively.
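The normalization and thresholding described here can be sketched as follows. The function name is illustrative, and returning all zeros when every topic weight is zero is an assumption made to avoid division by zero; the text does not cover that case.

```python
def attribute_indicator(current, theta2):
    """Turn the current attribute feature into a 0/1 presence vector.

    current: topic weights of the latest time sub-segment.
    theta2: preset threshold applied to each topic's share of the total.
    """
    total = sum(current)
    if total == 0:
        # Assumed behaviour: no topic is present when nothing was observed.
        return [0] * len(current)
    return [1 if value / total >= theta2 else 0 for value in current]
```

For example, weights [2.0, 1.0, 1.0] normalize to shares 0.5, 0.25 and 0.25, so a threshold of 0.3 marks only the first topic as present.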

While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims

1. The method for extracting the dynamic user attribute based on the social media is characterized by comprising the following steps of:

step 1: theme extraction:

101: collecting a training sample set:

extracting short texts published on social media by users, and screening the users with the number of the short texts being greater than or equal to a threshold value θ1 as sample users;

forming a training sample set by short texts of different sample users, and performing text preprocessing on the training samples: removing links, non-Chinese characters and user-defined words in the short text, performing word segmentation operation on the short text, and filtering stop words and nonsense high-frequency words;

102: performing text topic extraction on the training sample set by using a BTM model to obtain K topics, wherein each topic comprises a series of keywords and the weight information of each keyword is obtained at the same time, and BTM denotes the Biterm Topic Model;

selecting the m keywords with the largest weights from the keywords of each topic as the topic words, and simultaneously recording the weight information of each topic word;

step 2: extracting dynamic attributes of the user:

201: extracting short texts published on social media by a user to be processed in a time period T, and dividing the time period T into q time subsections to obtain the short texts published by each time subsection; performing text preprocessing on the short text to obtain text data corresponding to each time subsection;

wherein $n_{ki}$ denotes the word frequency of the i-th topic word of topic k, $w_{ki}$ denotes the weight of the i-th topic word of topic k under that topic, and $k = 1, 2, \dots, K$; the K topic weights of the same time sub-segment give the topic weight information $A_t$ of each time sub-segment, where the subscript $t = 0, 1, \dots, q$ is the time sub-segment identifier;

step 203: calculating user attribute features according to the formula

$$A'_t = \sum_{j=0}^{t} \lambda(T_j)\, A_j$$

wherein $A_j$ denotes the topic weight information of time sub-segment j, $j = 0, 1, \dots, t$, the attenuation coefficient is $\lambda(T_j) = 1 - \mu T_j^{v}$, $T_j$ denotes the time interval of time sub-segment j, the attenuation coefficient parameter μ takes the value 0.56, and the parameter v takes the value 0.06;

step 204: taking the user attribute feature $A'_q$ of the q-th time sub-segment as the current attribute feature of the user and outputting it;

and normalizing the current attribute features of the user: taking the ratio of the topic weight of a single topic in the user's current attribute features to the sum of the topic weights of the K topics as the attribute normalization result of that topic, and then judging whether the attribute of each topic exists based on a preset threshold θ2: if the attribute normalization result is greater than or equal to the threshold θ2, the current topic is judged to exist; otherwise, it is judged not to exist; and representing each topic by 1 if it exists and by 0 otherwise, thereby obtaining and outputting the attribute distribution result of the user.