CN104268214A - Micro-blog user relationship based user gender identification method and system - Google Patents

Micro-blog user relationship based user gender identification method and system

Info

Publication number
CN104268214A
Authority
CN
China
Prior art keywords
user
userid
module
users
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410494539.7A
Other languages
Chinese (zh)
Other versions
CN104268214B (en)
Inventor
李寿山
黄磊
周国栋
孔芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-09-24
Filing date
2014-09-24
Publication date
2015-01-07
Application filed by Suzhou University
Priority to CN201410494539.7A
Publication of CN104268214A
Application granted
Publication of CN104268214B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a user gender identification method and system based on micro-blog user relationships. The method includes the following steps: S1, collecting micro-blog user information through an API (application program interface) provided by a micro-blog website and classifying the users; S2, obtaining the userids of the followers and fans of each classified user according to that user's userid, and organizing the userids of the followers and fans into text; S3, performing feature extraction on training samples through information gain and classifying the samples to be classified through a maximum entropy classifier. Compared with user gender identification methods and systems that use micro-blog text, the method and system based on micro-blog user relationships achieve a better gender classification effect for micro-blog users.

Description

User gender identification method and system based on microblog user relationships
Technical field
The invention belongs to the technical field of natural language processing, and specifically relates to a user gender identification method and system based on microblog user relationships.
Background art
Microblogging is an integrated, open Internet social networking service that emerged in the Web 2.0 era. It bridges mobile communication networks and the Internet: users can publish texts of up to 140 characters in real time through mobile phones, IM software, external API interfaces, and other channels, so it has become increasingly popular with Internet users. Data show that by the end of May 2011, Twitter alone had reached 300 million registered microblog users. Sina Weibo, launched in August 2009, reached 142 million registered users by April 2011, in only 20 months. After Sina Weibo went online, Tencent, NetEase, Sohu, and others launched microblog services one after another. Microblogging has become one of the main online activities of Chinese Internet users, and against this background microblog analysis technology has gradually attracted the attention of many researchers.
Automatic analysis of microblogs generally concentrates on two basic tasks: microblog user analysis and microblog content analysis. Of these, user analysis is the basis of content analysis. For gender identification of microblog users, existing research mainly targets foreign-language websites such as Twitter, and most of it classifies gender by analyzing and processing text information in various ways, that is, mainly through microblog content analysis. Microblog messages differ from traditional text: they are short, highly colloquial, and often contain emoticons, so traditional text classification methods do not achieve good classification results on them.
In view of this, the present invention proposes a user gender identification method and system based on microblog user relationships to solve the above problem.
Summary of the invention
The invention provides a user gender identification method based on microblog user relationships, comprising the following steps.
S1: using the API interface provided by a microblog website, collect the user information of microblog users and classify the users.
S2: according to the userid of each classified user, obtain the userids of that user's followers and fans, and organize the userids of the followers and fans into text.
S3: use information gain to perform feature extraction on the training samples, and use a maximum entropy classifier to classify the samples to be classified.
Preferably, in step S1, the user information comprises the userids of the user's followers and fans and a gender field, and the users are classified according to the gender field.
Preferably, in step S1, collecting the user information of microblog users comprises the following steps:
S101: randomly select a user as the seed user, and use the API interface provided by the microblog website to crawl that user's information;
S102: according to the userids of the followers and fans in the crawled user information, continue to crawl the user information of those followers and fans until the number of crawled users reaches the required scale.
Preferably, in step S1, users are classified according to the value of the gender field in the crawled user information, where the gender field takes the values m, f, and n: m denotes male, f denotes female, and n denotes unknown.
Preferably, step S2 further comprises: after the userids of the followers and fans are organized into text, storing them on two separate lines of a file, choosing equal numbers of male and female user texts to form the training samples, and choosing a further equal number of male and female user texts to form the test samples.
Preferably, step S3 further comprises building the maximum entropy classifier from the training samples, where the maximum entropy implementation used is that of the MALLET machine learning toolkit.
Preferably, the information gain in step S3 is computed as:
$$G(t) = \left\{ -\sum_{i=1}^{m} P(c_i) \log P(c_i) \right\} + \left\{ P(t) \left[ \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) \right] + P(\bar{t}) \left[ \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \right] \right\}$$
where $P(c_i)$ is the probability that a document of class $c_i$ occurs in the corpus, $P(t)$ is the probability that a document in the corpus contains the feature $t$, $P(c_i \mid t)$ is the conditional probability that a document belongs to class $c_i$ given that it contains the feature $t$, $P(\bar{t})$ is the probability that a document in the corpus does not contain the feature $t$, $P(c_i \mid \bar{t})$ is the conditional probability that a document belongs to class $c_i$ given that it does not contain the feature $t$, and $m$ is the number of classes.
Preferably, after the information gain is computed, the 4000 userids with the highest information gain values are selected.
The invention also provides a user gender identification system based on microblog user relationships, comprising a corpus acquisition and preprocessing module, a user information processing module, a classifier training module, and a to-be-tested user classification module. The corpus acquisition and preprocessing module is connected to the user information processing module, the user information processing module is connected to the classifier training module, and the classifier training module is connected to the to-be-tested user classification module. The corpus acquisition and preprocessing module obtains the user information of microblog users through the API interface. The user information processing module classifies users according to the value of their gender field, then organizes the user relationships into text of the corresponding format according to the userids, and randomly selects training samples and test samples from that text. The classifier training module builds the maximum entropy classifier. The to-be-tested user classification module classifies the data to be tested with the maximum entropy classifier.
With the user gender identification method and system based on microblog user relationships provided by the invention, the user information of microblog users is collected and the users are classified without any complex processing of microblog text. The userids of each classified user's followers and fans are obtained from that user's userid and organized into text; information gain is then used to perform feature extraction on the training samples, and a maximum entropy classifier is used to classify the samples to be classified. Compared with approaches that use microblog text, this gives a better gender classification effect for microblog users.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the user gender identification method based on microblog user relationships provided by a preferred embodiment of the invention;
Fig. 2 is a schematic diagram of the user gender identification system based on microblog user relationships provided by a preferred embodiment of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, provided they do not conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
Fig. 1 is a flowchart of the user gender identification method based on microblog user relationships provided by a preferred embodiment of the invention. As shown in Fig. 1, the method comprises steps S1 to S3.
Step S1: using the API interface provided by a microblog website, collect the user information of microblog users and classify the users.
Specifically, the microblog website in this embodiment is Sina Weibo. In other embodiments a different website may be chosen as required, and the invention places no limitation on this.
In this embodiment, collecting the user information of microblog users comprises the following steps.
S101: randomly select a user as the seed user, and use the API interface provided by the microblog website to crawl that user's information.
Here, the user information comprises the userids of the user's followers and fans and a gender field, and users are classified according to the gender field. The gender field takes the values m, f, and n, where m denotes male, f denotes female, and n denotes unknown, so users are divided into these three classes by gender.
S102: according to the userids of the followers and fans in the crawled user information, continue to crawl the user information of those followers and fans until the number of crawled users reaches the required scale.
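Steps S101 and S102 amount to a breadth-first crawl from a seed user. The following is a minimal sketch under stated assumptions, not the patent's implementation: `fetch_profile` is a hypothetical callable standing in for the microblog API, the dictionary keys `gender`, `followers`, and `fans` are illustrative names, and a small in-memory graph replaces real API calls.

```python
from collections import deque


def crawl_profiles(seed_id, fetch_profile, target_count):
    """Breadth-first crawl starting from a seed user (steps S101 and S102).

    fetch_profile is a hypothetical stand-in for the microblog API. It is
    assumed to return a dict with keys 'gender', 'followers' and 'fans',
    or None when the account cannot be read.
    """
    profiles = {}
    queue = deque([seed_id])
    seen = {seed_id}
    while queue and len(profiles) < target_count:
        uid = queue.popleft()
        profile = fetch_profile(uid)
        if profile is None:
            continue
        profiles[uid] = profile
        # Queue every follower and fan that has not been seen yet.
        for neighbour in profile["followers"] + profile["fans"]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return profiles


def group_by_gender(profiles):
    """Split crawled users into the three classes m, f and n."""
    groups = {"m": [], "f": [], "n": []}
    for uid, profile in profiles.items():
        groups[profile["gender"]].append(uid)
    return groups


# Toy in-memory graph used in place of a real API (illustrative data only).
TOY_GRAPH = {
    "u1": {"gender": "m", "followers": ["u2", "u3"], "fans": ["u4"]},
    "u2": {"gender": "f", "followers": ["u1"], "fans": []},
    "u3": {"gender": "n", "followers": [], "fans": ["u1"]},
    "u4": {"gender": "f", "followers": ["u1"], "fans": ["u2"]},
}

profiles = crawl_profiles("u1", TOY_GRAPH.get, target_count=3)
groups = group_by_gender(profiles)
```

The `seen` set keeps a user from being queued twice, which matters because follower and fan lists usually overlap heavily. A real crawler would also need rate limiting and error handling, which are left out here.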
Step S2: according to the userid of each classified user, obtain the userids of that user's followers and fans, and organize the userids of the followers and fans into text.
Specifically, in this step, after the userids of the followers and fans are organized into text, they are stored on two separate lines of a file. A special symbol may be used to separate the userids; this embodiment uses a space. If a user has no followers or no fans, the corresponding line is left empty.
In addition, equal numbers of male and female user texts are chosen to form the training samples, and a further equal number of male and female user texts are chosen to form the test samples, which serve as the samples to be classified. This embodiment chooses 1000 male and 1000 female users for the training samples, and 1000 male and 1000 female users for the test samples.
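The text layout and the balanced sampling described in step S2 can be sketched as follows. This is an illustration under assumptions: the function names, the fixed random seed, and the toy userids are not from the patent, and the sample sizes are shrunk from 1000 per gender to keep the example small.

```python
import random


def relation_lines(followers, fans):
    """Return the two text lines for one user: follower ids, then fan ids.

    Ids are separated by single spaces; an empty list yields an empty line.
    """
    return " ".join(followers), " ".join(fans)


def balanced_split(male_ids, female_ids, n_train, n_test, seed=0):
    """Draw disjoint, gender-balanced training and test sets."""
    rng = random.Random(seed)
    train, test = [], []
    for label, ids in (("m", male_ids), ("f", female_ids)):
        if len(ids) < n_train + n_test:
            raise ValueError(f"not enough users with gender {label!r}")
        # One draw without replacement keeps train and test disjoint.
        picked = rng.sample(ids, n_train + n_test)
        train += [(uid, label) for uid in picked[:n_train]]
        test += [(uid, label) for uid in picked[n_train:]]
    return train, test


lines = relation_lines(["1001", "1002"], [])
males = [f"m{i}" for i in range(10)]
females = [f"f{i}" for i in range(10)]
train, test = balanced_split(males, females, n_train=3, n_test=2)
```

With the embodiment's settings the call would be `balanced_split(males, females, n_train=1000, n_test=1000)`, which requires at least 2000 crawled users of each gender.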
Step S3: use information gain to perform feature extraction on the training samples, and use a maximum entropy classifier to classify the samples to be classified.
Specifically, this embodiment builds the maximum entropy classifier from the training samples, using the maximum entropy implementation of the MALLET machine learning toolkit.
The information gain is computed as:
$$G(t) = \left\{ -\sum_{i=1}^{m} P(c_i) \log P(c_i) \right\} + \left\{ P(t) \left[ \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) \right] + P(\bar{t}) \left[ \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \right] \right\}$$
where $P(c_i)$ is the probability that a document of class $c_i$ occurs in the corpus, $P(t)$ is the probability that a document in the corpus contains the feature $t$, $P(c_i \mid t)$ is the conditional probability that a document belongs to class $c_i$ given that it contains the feature $t$, $P(\bar{t})$ is the probability that a document in the corpus does not contain the feature $t$, $P(c_i \mid \bar{t})$ is the conditional probability that a document belongs to class $c_i$ given that it does not contain the feature $t$, and $m$ is the number of classes.
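To make the information gain formula concrete, the sketch below evaluates G(t) for a single userid feature, estimating every probability by counting over a labeled corpus. Several points are assumptions rather than statements from the patent: each document is represented as a set of userids, the logarithm is the natural logarithm (the source does not state the base), and terms with zero probability are treated as contributing 0.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """G(t) for one feature, with probabilities estimated by counting.

    docs is a list of sets of userids and labels the matching class labels.
    Zero-probability terms contribute nothing (0 * log 0 is taken as 0).
    """
    n = len(docs)
    class_counts = Counter(labels)
    # Class counts restricted to the documents that contain the term.
    with_t = Counter(c for d, c in zip(docs, labels) if term in d)
    n_t = sum(with_t.values())
    n_not_t = n - n_t

    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    # First brace: entropy of the class distribution.
    gain = -sum(plogp(k / n) for k in class_counts.values())
    # Second brace: P(t) and P(not t) weighted sums of P(c|.) log P(c|.).
    for c, k in class_counts.items():
        if n_t:
            gain += (n_t / n) * plogp(with_t[c] / n_t)
        if n_not_t:
            gain += (n_not_t / n) * plogp((k - with_t[c]) / n_not_t)
    return gain


docs = [{"a", "x"}, {"a"}, {"b", "x"}, {"b"}]
labels = ["m", "m", "f", "f"]
gain_a = information_gain(docs, labels, "a")  # "a" occurs only in class m
gain_x = information_gain(docs, labels, "x")  # "x" is spread evenly
```

In this toy corpus the feature `a` separates the two classes perfectly, so its gain equals the full class entropy, while `x` occurs equally often in both classes and its gain is zero.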
Under the maximum entropy model, the conditional probability $P(c \mid d)$ is predicted by:
$$P(c \mid d) = \frac{1}{Z(d)} \exp\left( \sum_{i} \lambda_{i,c} F_{i,c}(d, c) \right)$$
where $Z(d)$ is the normalization factor and $F_{i,c}$ is the feature function, defined as:
$$F_{i,c}(d, c') = \begin{cases} 1, & n_i(d) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases}$$
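The prediction formula can be evaluated directly once the weights are known. The sketch below assumes that n_i(d) > 0 means feature i (a userid) occurs in the relation text of user d, which the source does not spell out. The weights are invented for illustration; in the embodiment they would come from training with MALLET, whose API is not shown here.

```python
import math


def maxent_probability(doc_features, weights, classes):
    """P(c|d) under the model above with binary presence features.

    doc_features: set of features i assumed to satisfy n_i(d) > 0.
    weights: dict mapping (feature, class) to lambda_{i,c}; pairs that are
    missing from the dict count as 0.
    """
    scores = {
        c: sum(weights.get((f, c), 0.0) for f in doc_features)
        for c in classes
    }
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels when dividing by the normalization factor.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


# Illustrative weights only; real values come from training.
weights = {("1001", "m"): 1.0, ("1002", "f"): 1.0}
probs = maxent_probability({"1001"}, weights, ["m", "f"])
predicted = max(probs, key=probs.get)
```

Because each feature function F_{i,c} fires only for its own class c, the score of a class is simply the sum of that class's weights over the userids present in the document.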
After the information gain has been computed as described above, this embodiment selects the 4000 userids with the highest information gain values.
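Selecting the highest-scoring userids is a plain top-k operation over the information gain scores. In this sketch the scores and the helper names are illustrative assumptions, and k is reduced from 4000 to 2 so the result is easy to check by hand.

```python
import heapq


def select_top_features(gain_by_userid, k=4000):
    """Return the k userids with the highest information gain."""
    return heapq.nlargest(k, gain_by_userid, key=gain_by_userid.get)


def project(doc, selected):
    """Keep only the selected userids of one user's relation text."""
    return [uid for uid in doc if uid in selected]


# Illustrative scores only.
gains = {"1001": 0.40, "1002": 0.05, "1003": 0.31, "1004": 0.12}
selected = set(select_top_features(gains, k=2))
reduced = project(["1002", "1003", "1001", "1005"], selected)
```

Projecting each user's relation text onto the selected userids is what reduces the feature dimensionality before the classifier is trained.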
As described above, with experimental data of 1000 male and 1000 female users in the training samples and 1000 male and 1000 female users in the test samples, the method of the invention classifies microblog users with an accuracy of 0.843.
Fig. 2 is a schematic diagram of the user gender identification system based on microblog user relationships provided by a preferred embodiment of the invention. As shown in Fig. 2, the system comprises a corpus acquisition and preprocessing module 1, a user information processing module 2, a classifier training module 3, and a to-be-tested user classification module 4. The corpus acquisition and preprocessing module 1 is connected to the user information processing module 2, the user information processing module 2 is connected to the classifier training module 3, and the classifier training module 3 is connected to the to-be-tested user classification module 4. The corpus acquisition and preprocessing module 1 obtains the user information of microblog users through the API interface. The user information processing module 2 classifies users according to the value of their gender field, then organizes the user relationships into text of the corresponding format according to the userids, and randomly selects training samples and test samples from that text. The classifier training module 3 builds the maximum entropy classifier. The to-be-tested user classification module 4 classifies the data to be tested with the maximum entropy classifier. The operation of the system is similar to that of the method of the invention and is therefore not repeated here.
The user gender identification method and system based on microblog user relationships provided by the preferred embodiments of the invention compose the text from the user relationships of microblog users (that is, the userids of each user's followers and fans), taking full account of the importance of user relationships in microblogging. In addition, feature extraction on the training samples with information gain greatly reduces the feature dimensionality. The approach thereby avoids the adverse effects of using microblog messages in the classification process and achieves a better classification effect.
The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A user gender identification method based on microblog user relationships, characterized by comprising the following steps:
S1: using the API interface provided by a microblog website, collecting the user information of microblog users and classifying the users;
S2: according to the userid of each classified user, obtaining the userids of that user's followers and fans, and organizing the userids of the followers and fans into text;
S3: using information gain to perform feature extraction on the training samples, and using a maximum entropy classifier to classify the samples to be classified.
2. The method according to claim 1, characterized in that, in step S1, the user information comprises the userids of the user's followers and fans and a gender field, and the users are classified according to the gender field.
3. The method according to claim 1 or 2, characterized in that, in step S1, collecting the user information of microblog users comprises the following steps:
S101: randomly selecting a user as the seed user, and using the API interface provided by the microblog website to crawl that user's information;
S102: according to the userids of the followers and fans in the crawled user information, continuing to crawl the user information of those followers and fans until the number of crawled users reaches the required scale.
4. The method according to claim 1 or 2, characterized in that, in step S1, users are classified according to the value of the gender field in the crawled user information, where the gender field takes the values m, f, and n: m denotes male, f denotes female, and n denotes unknown.
5. The method according to claim 1, characterized in that step S2 further comprises: after the userids of the followers and fans are organized into text, storing them on two separate lines of a file, choosing equal numbers of male and female user texts to form the training samples, and choosing a further equal number of male and female user texts to form the test samples.
6. The method according to claim 1, characterized in that step S3 further comprises building the maximum entropy classifier from the training samples, where the maximum entropy implementation used is that of the MALLET machine learning toolkit.
7. The method according to claim 1, characterized in that the information gain in step S3 is computed as:
$$G(t) = \left\{ -\sum_{i=1}^{m} P(c_i) \log P(c_i) \right\} + \left\{ P(t) \left[ \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) \right] + P(\bar{t}) \left[ \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \right] \right\}$$
where $P(c_i)$ is the probability that a document of class $c_i$ occurs in the corpus, $P(t)$ is the probability that a document in the corpus contains the feature $t$, $P(c_i \mid t)$ is the conditional probability that a document belongs to class $c_i$ given that it contains the feature $t$, $P(\bar{t})$ is the probability that a document in the corpus does not contain the feature $t$, $P(c_i \mid \bar{t})$ is the conditional probability that a document belongs to class $c_i$ given that it does not contain the feature $t$, and $m$ is the number of classes.
8. The method according to claim 7, characterized in that, after the information gain is computed, the 4000 userids with the highest information gain values are selected.
9. A user gender identification system based on microblog user relationships, characterized by comprising a corpus acquisition and preprocessing module, a user information processing module, a classifier training module, and a to-be-tested user classification module, wherein the corpus acquisition and preprocessing module is connected to the user information processing module, the user information processing module is connected to the classifier training module, and the classifier training module is connected to the to-be-tested user classification module;
the corpus acquisition and preprocessing module is configured to obtain the user information of microblog users through the API interface;
the user information processing module is configured to classify users according to the value of their gender field, then organize the user relationships into text of the corresponding format according to the userids, and randomly select training samples and test samples from that text;
the classifier training module is configured to build the maximum entropy classifier;
the to-be-tested user classification module is configured to classify the data to be tested with the maximum entropy classifier.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410494539.7A CN104268214B (en) 2014-09-24 2014-09-24 User gender identification method and system based on microblog user relationships

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410494539.7A CN104268214B (en) 2014-09-24 2014-09-24 User gender identification method and system based on microblog user relationships

Publications (2)

Publication Number Publication Date
CN104268214A true CN104268214A (en) 2015-01-07
CN104268214B CN104268214B (en) 2018-01-19

Family

ID=52159736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410494539.7A Active CN104268214B (en) 2014-09-24 2014-09-24 User gender identification method and system based on microblog user relationships

Country Status (1)

Country Link
CN (1) CN104268214B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126607A (en) * 2016-06-21 2016-11-16 重庆邮电大学 A user relationship analysis method for social networks
CN106327341A (en) * 2016-08-15 2017-01-11 首都师范大学 Weibo user gender deduction method and system based on combined theme
CN106682118A (en) * 2016-12-08 2017-05-17 华中科技大学 Social network site false fan detection method achieved on basis of network crawler by means of machine learning
EP3188094A1 (en) * 2015-12-30 2017-07-05 Xiaomi Inc. Method and device for classification model training

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120083255A1 (en) * 2010-10-04 2012-04-05 Telefonica, S.A. Method for gender identification of a cell-phone subscriber
CN102663027A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Method for predicting attributes of webpage crowd

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI S et al.: "A Framework of Feature Selection Methods for Text Categorization", In Proceedings of ACL-IJCNLP-09 *
MILLER Z et al.: "Gender Prediction on Twitter Using Stream Algorithms with N-Gram Character Features", International Journal of Intelligence Science *
廉营: "Microblog person relation extraction based on semantic role labeling" (《基于语义角色标注的微博人物关系抽取》), China Master's Theses Full-text Database (Information Science and Technology) *
杨芹: "Design and implementation of a Chinese web page classifier based on the maximum entropy model" (《基于最大熵模型的中文网页分类器设计和实现》), China Master's Theses Full-text Database (Information Science and Technology) *

Also Published As

Publication number Publication date
CN104268214B (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN108885623B (en) Semantic analysis system and method based on knowledge graph
Wu et al. Twitter spam detection based on deep learning
JP5759228B2 (en) A method for calculating semantic similarity between messages and conversations based on extended entity extraction
US20180124193A1 (en) System and method for displaying contextual activity streams
CN103336766A (en) Short text garbage identification and modeling method and device
CN104239539A (en) Microblog information filtering method based on multi-information fusion
CN104933113A (en) Expression input method and device based on semantic understanding
CN103458042A (en) Microblog advertisement user detection method
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
CN104298665A (en) Identification method and device of evaluation objects of Chinese texts
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN105893484A (en) Microblog Spammer recognition method based on text characteristics and behavior characteristics
CN104268214A (en) Micro-blog user relationship based user gender identification method and system
CN102708164A (en) Method and system for calculating movie expectation
US11243907B2 (en) Digital file recognition and deposit system
CN104281694A (en) Analysis system of emotional tendency of text
CN104598648A (en) Interactive gender identification method and device for microblog user
CN105243095A (en) Microblog text based emotion classification method and system
US9165053B2 (en) Multi-source contextual information item grouping for document analysis
CN106296249A (en) A kind of user classification method based on LBS and interest and system
Alshaikh et al. Sentiment Analysis for Smartphone Operating System: Privacy and Security on Twitter Data
CN104572767B (en) A kind of method and system of website languages classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant