CN106354818B - Social media-based dynamic user attribute extraction method - Google Patents

Info

Publication number
CN106354818B
Authority
CN
China
Prior art keywords
time
user
theme
text
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610767430.5A
Other languages
Chinese (zh)
Other versions
CN106354818A (en)
Inventor
杨阳
黄秀
胡玥
沈复民
邵杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201610767430.5A
Publication of CN106354818A
Application granted
Publication of CN106354818B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Abstract

The invention discloses a method for extracting dynamic user attributes from social media. First, an acquired training sample set undergoes text preprocessing and topic-word extraction, yielding K topics and m topic words for each topic. Next, the short texts of the user to be processed are extracted and divided into time sub-segments, and data is filled in through a sliding time window to obtain the text data of each time sub-segment. After text preprocessing, the occurrence frequency of each topic's topic words is counted to obtain the attribute weight information of each topic. A time decay coefficient is then introduced to obtain, in chronological order, user attribute features that take time into account, and the user attribute features of the latest time sub-segment are output as the user's current attribute features. Without using external knowledge, the method achieves semantic expansion of short social media texts through unordered words in the text, and can extract a user's dynamic attributes from the microblog texts that the user publishes or forwards.

Description

Social media-based dynamic user attribute extraction method
Technical Field
The invention belongs to the field of computers, and particularly relates to a method for extracting dynamic user attributes based on social media.
Background
Social media services define a new way for users to communicate, express themselves, and share with each other over a network. With the continuous development of social media, more and more people publish and share instant messages on social media platforms; common platforms include Sina Weibo, Twitter, Facebook and LinkedIn. For example, on the Sina Weibo platform a user can publish microblog posts of up to 140 characters, and a post can consist of Chinese and English text, custom characters, external links and the like. Effectively analyzing microblog short-text streams to detect the dynamic attributes of users is therefore of great significance for research and applications in related fields, such as social recommendation, personalized retrieval and online promotion.
A user portrait, namely the tagging of user information, is a basic way for enterprises to apply big data technologies: after collecting and analyzing data on consumers' social attributes, living habits, consumption behaviors and other key information, an enterprise abstracts a complete business picture of the user. The user portrait provides enterprises with a sufficient information basis and helps them quickly identify precise user groups, user requirements and other broader feedback. Given the development of current social networks, in which more and more users express themselves on social platforms in various forms, it is meaningful to study user portraits in social media environments. At present, user portraits based on social media still leave much to be improved: the description of user attributes is neither deep nor complete, and it is not updated in a timely manner. Further research is needed to solve these problems, to construct deep and comprehensive user portraits, and to provide comprehensive and detailed information for services such as personalized recommendation systems and information retrieval.
Because of the particular nature of short texts on social media platforms, the sparsity of short texts must be addressed when profiling user interests. The conventional approach is to augment short-text semantics with external knowledge, for example: augmenting the content of short texts by linking the content a user posts on social media with related news articles, so as to analyze the user's activities on social media more effectively; automatically classifying user interests on social media using Wikipedia and a method of weighting related interests; or extracting interest tags from the user's self-description on social media to expand the information in the short texts. All three methods, which solve short-text sparsity through external knowledge, depend heavily on the availability of external data and on its correlation with the original data. If the external data is wrong or insufficient, the interests obtained may not match the user's real interests.
In addition, there is currently a large body of work on modeling user portraits across platforms, in which data from two or more social media platforms is used to model and analyze the user more accurately. Examples include portraying user interests from the basic information the user fills in when registering on a social media platform and from the user tags the platform sets for the user, or studying the user's behavior and interests from the data of users who hold associated accounts on different social media platforms. However, these methods only solve the short-text sparsity problem by extending semantics; the user attributes finally obtained are all static, and the fact that user attributes change over time is not considered.
Disclosure of Invention
The aim of the invention is to solve the sparsity problem of short texts and to overcome the defects of the prior art, in which user attribute mining is inaccurate and cannot be updated in time. The method is based on a newly constructed dynamic user attribute model that automatically mines a user's dynamic attributes from text and shows how those attributes change. Without using external knowledge, it achieves semantic expansion of short social media texts through unordered words in the text, smooths the data through a time window, and introduces a decay function to express the influence of past attributes on current attributes.
The invention discloses a social media-based dynamic user attribute extraction method, which comprises the following steps:
Step 1: topic extraction:
101: collecting a training sample set:
extracting short texts published on social media by users, and selecting users whose number of short texts is greater than or equal to a threshold θ1 (for example, 200) as sample users;
forming a training sample set from the short texts of the different sample users, and performing text preprocessing on the training samples (namely the short texts): removing links, non-Chinese characters and custom words from each short text, then performing word segmentation on the short text, and filtering stop words and meaningless high-frequency words. Custom words can be removed by matching the short text against a preset custom-word lexicon and deleting the matches. Stop words and meaningless high-frequency words can be filtered in the same way: lexicons of stop words and meaningless high-frequency words are constructed in advance, the words obtained after word segmentation are matched against them, and the matched words are filtered out;
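The preprocessing described in step 101 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `preprocess`, the regular expressions, and the pluggable `segment` callable are assumptions introduced here. In practice a Chinese word segmenter (for instance `jieba.lcut`, if that library is installed) could be passed as `segment`.

```python
import re

# Assumed patterns: ASCII-only URLs, and anything outside the basic CJK block.
URL_RE = re.compile(r"https?://[\x21-\x7e]+")
NON_CHINESE_RE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text, segment, custom_words=(), stop_words=()):
    """Clean one short text and return its filtered tokens.

    segment: callable mapping a string to a list of words (hypothetical hook).
    custom_words: lexicon of platform-specific phrases to delete.
    stop_words: lexicon of stop words and meaningless high-frequency words.
    """
    text = URL_RE.sub(" ", text)              # remove links
    for word in custom_words:                 # remove custom words by lexicon match
        text = text.replace(word, " ")
    text = NON_CHINESE_RE.sub(" ", text)      # remove non-Chinese characters
    tokens = []
    for chunk in text.split():                # segment each remaining run of text
        tokens.extend(segment(chunk))
    stop = set(stop_words)
    return [tok for tok in tokens if tok and tok not in stop]


def toy_segment(chunk):
    """Toy segmenter used only for illustration: fixed two-character words."""
    return [chunk[i:i + 2] for i in range(0, len(chunk), 2)]


tokens = preprocess(
    "今天健身 http://t.cn/abc 转发微博 good的",
    segment=toy_segment,
    custom_words=("转发微博",),
    stop_words=("的",),
)
print(tokens)
```

Custom words are removed before the non-Chinese filter so that lexicon entries containing punctuation or Latin characters can still match.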
102: performing text topic extraction on the training sample set to obtain K topics. In this step, a BTM (Biterm Topic Model) is used to extract the topics. Its advantage is that it expands the semantics of short texts using the unordered co-occurring word pairs in the corpus, which addresses the text sparsity problem. After topic extraction, K topics are obtained; each topic comprises a series of keywords, and the weight information of each keyword is obtained at the same time, for example from the topic-word distribution matrix of the BTM. From the keywords of each topic, the m keywords with the largest weights are selected as topic words, and the weight of each topic word is recorded. Table 1 shows an example with 10 topics, each comprising 5 topic words; the value in parentheses after each topic word is its weight.
TABLE 1 [table image: 10 topics, each with 5 topic words and their weights]
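To make step 102 more concrete, the sketch below shows two small pieces around the topic model: forming the unordered co-occurring word pairs (biterms) that a BTM consumes, and picking the m highest-weighted keywords per topic once per-topic keyword weights are available. It does not implement BTM inference itself. The function names and the dictionary layout of the weights are assumptions made for illustration.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def top_topic_words(topic_word_weights, m):
    """Keep the m highest-weighted keywords of each topic.

    topic_word_weights: list with one dict per topic, mapping word -> weight
    (assumed to come from an already trained topic model).
    """
    selected = []
    for weights in topic_word_weights:
        # Sort by descending weight; break ties alphabetically for determinism.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.append(dict(ranked[:m]))
    return selected


biterms = extract_biterms(["b", "a", "c"])
topics = top_topic_words([{"run": 0.3, "gym": 0.5, "diet": 0.1}], m=2)
print(biterms, topics)
```

Sorting each pair makes the biterm order-independent, which mirrors the "unordered" property the description relies on.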
Step 2: extracting the dynamic attributes of the user:
201: extracting the short texts published on social media by the user to be processed within a time period T (for example, the last year), and dividing the time period T into q time sub-segments to obtain the short texts published in each time sub-segment; performing text preprocessing on the short texts to obtain the text data corresponding to each time sub-segment;
through a sliding time window, adding the text data of the p time sub-segments nearest to the current time sub-segment to the text data of the current time sub-segment;
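Step 201 can be illustrated with the following sketch, which buckets timestamped posts into equal-length time sub-segments and then expands each sub-segment with the text of its neighbors. Two points are assumptions rather than statements from the source: the sub-segments are taken to be of equal length, and "nearest" is read as the p immediately preceding sub-segments. All function names are hypothetical.

```python
from datetime import datetime


def split_into_segments(posts, start, end, num_segments):
    """Bucket (timestamp, tokens) pairs into equal-length time sub-segments."""
    span = (end - start) / num_segments
    segments = [[] for _ in range(num_segments)]
    for timestamp, tokens in posts:
        if start <= timestamp < end:
            index = min(int((timestamp - start) / span), num_segments - 1)
            segments[index].extend(tokens)
    return segments


def expand_with_window(segments, p):
    """Add the tokens of the p preceding sub-segments to each sub-segment."""
    expanded = []
    for t in range(len(segments)):
        window = segments[max(0, t - p): t + 1]
        expanded.append([tok for seg in window for tok in seg])
    return expanded


posts = [
    (datetime(2016, 1, 1, 12), ["x"]),
    (datetime(2016, 1, 4), ["y"]),
]
segments = split_into_segments(
    posts, datetime(2016, 1, 1), datetime(2016, 1, 5), num_segments=4
)
expanded = expand_with_window([["a"], ["b"], ["c"], ["d"]], p=2)
print(segments, expanded)
```

The window smooths sparse sub-segments: a period with few posts still inherits vocabulary from the periods just before it.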
step 202: based on the m topic words of each topic obtained in step 102, performing word-frequency statistics of the topic words on the text data of each time sub-segment and calculating the weight of each topic
$$a_k^{t} = \sum_{i=1}^{m} n_{ki}\, w_{ki}$$
where $n_{ki}$ is the word frequency of the i-th topic word of topic k, $w_{ki}$ is the weight of the i-th topic word under topic k, $k = 1, 2, \dots, K$, and the index $t = 0, 1, \dots, q$ is the time sub-segment identifier; from the K topic weights $a_k^{t}$ of the same time sub-segment, the topic weight information of each time sub-segment is obtained as $A_t = (a_1^{t}, a_2^{t}, \dots, a_K^{t})$.
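A small sketch of step 202 follows. The formula image is not reproduced in the text, so the code assumes that the weight of a topic is the sum, over its m topic words, of word frequency times topic-word weight; that reading, the function name `topic_weights`, and the toy vocabulary are all illustrative assumptions.

```python
from collections import Counter


def topic_weights(tokens, topic_words):
    """Compute the topic weight vector of one time sub-segment.

    tokens: preprocessed words of the sub-segment.
    topic_words: list with one dict per topic, mapping topic word -> weight.
    Assumes each topic weight is sum(n_ki * w_ki) over the topic's words.
    """
    counts = Counter(tokens)  # n_ki: frequency of each word in the sub-segment
    return [
        sum(counts[word] * weight for word, weight in words.items())
        for words in topic_words
    ]


weights = topic_weights(
    ["run", "run", "gym", "rice"],
    [{"run": 0.5, "gym": 0.25}, {"rice": 0.5, "noodle": 0.25}],
)
print(weights)
```

Applying this function to every time sub-segment yields the sequence of topic weight vectors that step 203 then combines.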
Step 203: calculating the user attribute features according to the formula
$$A'_t = \sum_{j=0}^{t} \lambda(T_j)\, A_j$$
where the decay coefficient is $\lambda(T_j) = 1 - \mu T_j^{v}$, $T_j$ represents the time interval of each time sub-segment, $0 < \mu < 1$, and $v > 0$. The weight information $A_t$ obtained in step 202 corresponds to the static attributes of the user in each time sub-segment. However, these per-sub-segment attributes are limited to their own time sub-segment: attributes that the user has but that are not mentioned in the current time sub-segment are ignored. User attributes change continuously; earlier attributes are still held and evolve gradually over time. The invention therefore introduces the decay coefficient to attenuate the user's earlier attributes, that is, the user attribute features are calculated according to the formula above. The values of the parameters $\mu$ and $v$ need to be determined experimentally. In other words, the topic weight information $\{A_0, A_1, \dots, A_t\}$ of the current time sub-segment t and the previous time sub-segments is each multiplied by its corresponding decay coefficient $\{\lambda(T_0), \lambda(T_1), \dots, \lambda(T_t)\}$, and the products are summed to obtain the user attribute feature $A'_t$ of the current time sub-segment. The user attribute features of the current time sub-segment thus incorporate the user's earlier attribute features and match the real trend of the user's attribute changes.
Step 204: outputting the user attribute feature $A'_q$ of the q-th time sub-segment as the user's current attribute features.
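The decayed aggregation of steps 203 and 204 might be sketched as below. The decay coefficient 1 - μ·T^v comes from the description, and the default values μ = 0.56 and v = 0.06 are those stated in the claim. Taking the interval T_j to be the number of sub-segments between sub-segment j and the current one (t - j) is an assumption made here, as are the function names.

```python
def decay(interval, mu=0.56, v=0.06):
    """Decay coefficient lambda(T) = 1 - mu * T**v."""
    return 1.0 - mu * interval ** v


def dynamic_attributes(history, mu=0.56, v=0.06):
    """Combine topic weight vectors A_0..A_t into the feature A'_t.

    history: list of equal-length weight vectors, oldest first.
    Assumes T_j = t - j, i.e. the distance in sub-segments to the current one.
    """
    t = len(history) - 1
    result = [0.0] * len(history[0])
    for j, weights in enumerate(history):
        coefficient = decay(t - j, mu, v)
        for k, value in enumerate(weights):
            result[k] += coefficient * value
    return result


current = dynamic_attributes([[2.0], [4.0]], mu=0.5, v=1)
print(current)
```

Under this reading the current sub-segment has interval 0 and therefore coefficient 1, while older sub-segments contribute with a smaller, but with the claimed parameters still positive, coefficient over a one-year horizon.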
Owing to the above technical solution, the invention has the following beneficial effects: the dynamic attributes of a user can be mined from the microblog texts that the user publishes or forwards, and the trend of the user's attributes can be shown, so that the user's attributes over a coming period can be predicted. The results show that the method performs better than methods that discover static user attributes and better matches the user's current interests.
Drawings
FIG. 1 is a diagram of an implementation model framework of an embodiment.
Fig. 2 is a diagram showing 10 attribute changes of a certain user in the embodiment.
Fig. 3 is a distribution diagram of attributes of 3 users in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Referring to fig. 1, the social media-based dynamic user attribute extraction method mainly involves three parts: text data preprocessing (text preprocessing for short), topic extraction, and user dynamic attribute mining.
Short texts of Sina Weibo users are obtained with a crawler. Because they contain a lot of noise, preprocessing such as word segmentation and removal of meaningless characters is applied to obtain text with less noise. The BTM topic model is used to extract 10 topics (fitness, food, digital, sports, makeup, travel, military, music, favorite and games) together with the 20 highest-weighted high-frequency keywords of each topic, and the 5 highest-weighted of these are then taken as the topic words of each topic, as shown in table 1.
The short texts published within one year by a single user to be processed are extracted and divided into subsets by time sub-segment. A time window of 3 months is set, and by sliding the time window the short texts of several (for example, 3) time sub-segments nearest to the current time sub-segment are added to the current time sub-segment.
After the short texts of each time sub-segment are preprocessed, the weight of each topic is obtained based on the 10 topics (that is, K = 10) extracted by the BTM topic model:
$$a_k^{t} = \sum_{i=1}^{m} n_{ki}\, w_{ki}$$
The 10 topic weights of the same time sub-segment then give the topic weight information $A_t$ of each time sub-segment, namely $A_t = (a_1^{t}, a_2^{t}, \dots, a_{10}^{t})$, where the index t (t = 0, 1, …, q) is the time sub-segment identifier. Fig. 2 shows how the topic weight information of a randomly selected user changes over the time sub-segments; the trend of each attribute over time can be seen.
Finally, the user attribute feature $A'_{11}$ of the 11th time sub-segment is output as the user's current attribute features. Fig. 3 shows the current attribute distributions of three randomly selected users, which reveal the different preferences of each user.
Further, the user's current attribute features (the $A'_{11}$ of the latest time sub-segment) can be normalized by taking the ratio of each single topic to the sum of the 10 topics as the normalization result, and the presence of each topic's attribute is then judged against a preset threshold θ2: if the normalized value of the current topic is greater than or equal to θ2, the topic is judged to be present; otherwise it is judged to be absent. Each topic is represented by 1 if present and 0 otherwise, giving a vector $L_t$ from which the user's attribute distribution can be read more intuitively.
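The normalization and thresholding described above can be sketched in a few lines. The function name `attribute_labels` and the example threshold of 0.25 are assumptions for illustration; the source does not give a value for θ2.

```python
def attribute_labels(features, theta2):
    """Normalize topic weights and mark each topic as present (1) or absent (0).

    features: current attribute features, one weight per topic.
    theta2: presence threshold applied to the normalized weights (hypothetical value).
    """
    total = sum(features)
    if total == 0:
        # No topic words observed at all: treat every topic as absent.
        return [0] * len(features)
    return [1 if value / total >= theta2 else 0 for value in features]


labels = attribute_labels([3.0, 1.0, 0.0], theta2=0.25)
print(labels)
```

Because the normalized weights sum to 1, θ2 acts as a minimum share of the user's overall activity that a topic needs in order to count as one of the user's attributes.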
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (1)

1. A social media-based dynamic user attribute extraction method, characterized by comprising the following steps:
step 1: topic extraction:
101: collecting a training sample set:
extracting short texts published on social media by users, and selecting users whose number of short texts is greater than or equal to a threshold θ1 as sample users;
forming a training sample set from the short texts of the different sample users, and performing text preprocessing on the training samples: removing links, non-Chinese characters and custom words from the short text, performing word segmentation on the short text, and filtering stop words and meaningless high-frequency words;
102: performing text topic extraction on the training sample set using a BTM to obtain K topics, wherein each topic comprises a series of keywords and the weight information of each keyword is obtained at the same time, and wherein BTM denotes the Biterm Topic Model;
selecting, from the keywords of each topic, the m keywords with the largest weights as topic words, and recording the weight information of each topic word;
step 2: extracting the dynamic attributes of the user:
201: extracting the short texts published on social media by the user to be processed within a time period T, and dividing the time period T into q time sub-segments to obtain the short texts published in each time sub-segment; performing text preprocessing on the short texts to obtain the text data corresponding to each time sub-segment;
through a sliding time window, adding the text data of the p time sub-segments nearest to the current time sub-segment to the text data of the current time sub-segment;
step 202: based on the m topic words of each topic obtained in step 102, performing word-frequency statistics of the topic words on the text data of each time sub-segment and calculating the weight of each topic
$$a_k^{t} = \sum_{i=1}^{m} n_{ki}\, w_{ki}$$
wherein $n_{ki}$ represents the word frequency of the i-th topic word of topic k, $w_{ki}$ represents the weight of the i-th topic word under topic k, and $k = 1, 2, \dots, K$; obtaining, from the K topic weights $a_k^{t}$ of the same time sub-segment, the topic weight information $A_t$ of each time sub-segment
$$A_t = (a_1^{t}, a_2^{t}, \dots, a_K^{t})$$
wherein the index $t = 0, 1, \dots, q$ is the time sub-segment identifier;
step 203: calculating the user attribute features according to the formula
$$A'_t = \sum_{j=0}^{t} \lambda(T_j)\, A_j$$
wherein $A_j$ represents the topic weight information of time sub-segment j, $j = 0, 1, \dots, t$, the decay coefficient is $\lambda(T_j) = 1 - \mu T_j^{v}$, $T_j$ represents the time interval of time sub-segment j, the decay coefficient parameter $\mu$ takes the value 0.56, and the parameter $v$ takes the value 0.06;
step 204: outputting the user attribute feature $A'_q$ of the q-th time sub-segment as the current attribute features of the user;
and normalizing the current attribute features of the user: taking the ratio of the topic weight of a single topic in the current attribute features of the user to the sum of the topic weights of the K topics as the attribute normalization result of each topic, and then judging whether each topic is present based on a preset threshold θ2: if the attribute normalization result is greater than or equal to the threshold θ2, the current topic is judged to be present; otherwise it is judged to be absent; and representing each topic by 1 if it is present and by 0 otherwise, thereby obtaining and outputting the attribute distribution result of the user.
CN201610767430.5A 2016-08-30 2016-08-30 Social media-based dynamic user attribute extraction method Active CN106354818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610767430.5A CN106354818B (en) 2016-08-30 2016-08-30 Social media-based dynamic user attribute extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610767430.5A CN106354818B (en) 2016-08-30 2016-08-30 Social media-based dynamic user attribute extraction method

Publications (2)

Publication Number Publication Date
CN106354818A CN106354818A (en) 2017-01-25
CN106354818B true CN106354818B (en) 2020-01-10

Family

ID=57856620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610767430.5A Active CN106354818B (en) 2016-08-30 2016-08-30 Social media-based dynamic user attribute extraction method

Country Status (1)

Country Link
CN (1) CN106354818B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984555B (en) * 2017-06-01 2021-09-28 腾讯科技(深圳)有限公司 User state mining and information recommendation method, device and equipment
CN109377401B (en) * 2018-08-24 2022-02-18 腾讯科技(武汉)有限公司 Data processing method, device, system, server and storage medium
CN109993570B (en) * 2019-01-14 2023-09-01 深圳市东信时代信息技术有限公司 Method and system for directionally delivering mobile advertisement
CN111694949B (en) * 2019-03-14 2023-12-05 京东科技控股股份有限公司 Multi-text classification method and device
CN110209316B (en) * 2019-06-11 2021-03-16 北京达佳互联信息技术有限公司 Category label display method, device, terminal and storage medium
CN110297887B (en) * 2019-06-26 2021-07-27 山东大学 Service robot personalized dialogue system and method based on cloud platform
CN112541792A (en) * 2020-12-22 2021-03-23 作业帮教育科技(北京)有限公司 Data processing method and device for mining user requirements and electronic equipment
CN116541527B (en) * 2023-07-05 2023-09-29 国网北京市电力公司 Document classification method based on model integration and data expansion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data
CN104102648A (en) * 2013-04-07 2014-10-15 腾讯科技(深圳)有限公司 User behavior data based interest recommending method and device
CN105608192A (en) * 2015-12-23 2016-05-25 南京大学 Short text recommendation method for user-based biterm topic model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102648A (en) * 2013-04-07 2014-10-15 腾讯科技(深圳)有限公司 User behavior data based interest recommending method and device
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data
CN105608192A (en) * 2015-12-23 2016-05-25 南京大学 Short text recommendation method for user-based biterm topic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Profiling vs. Time vs. Content: What does Matter for Top-k Publication Recommendation based on Twitter Profiles?;Chifumi Nishioka等;《Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries》;20160623;第4.2节 *

Also Published As

Publication number Publication date
CN106354818A (en) 2017-01-25


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yang Yang

Inventor after: Huang Xiu

Inventor after: Hu Yue

Inventor after: Shen Fumin

Inventor after: Shao Jie

Inventor before: Huang Xiu

Inventor before: Yang Yang

Inventor before: Hu Yue

Inventor before: Shen Fumin

Inventor before: Shao Jie

CB03 Change of inventor or designer information
GR01 Patent grant