CN104268214A - Micro-blog user relationship based user gender identification method and system - Google Patents
- Publication number
- CN104268214A (application CN201410494539.7A)
- Authority
- CN
- China
- Prior art keywords
- user
- userid
- module
- users
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a user gender identification method and system based on microblog user relationships. The method includes the following steps: S1, collecting microblog user information through an API (application program interface) provided by a microblog website and classifying the users; S2, obtaining the userids of each classified user's followers and fans according to that user's userid, and organizing those userids into text; S3, performing feature extraction on training samples using information gain and classifying the samples to be classified with a maximum entropy classifier. Compared with gender identification methods and systems that use microblog text, the method and system achieve a better microblog user gender classification effect.
Description
Technical field
The invention belongs to the field of natural language processing, and specifically relates to a user gender identification method and system based on microblog user relationships.
Background art
Microblogging is an integrated, open Internet social service that emerged in the Web 2.0 era. It bridges mobile communication networks and the Internet: users can instantly publish texts of up to 140 characters through mobile phones, IM software, external API interfaces and similar channels, and it has therefore become increasingly popular among Internet users. Data show that by the end of May 2011, Twitter alone had reached 300 million registered users. Taking Sina Weibo as an example, from its launch in August 2009 to April 2011, a span of only 20 months, its registered users reached 142 million. After Sina Weibo went online, Tencent, NetEase, Sohu and others launched microblog services one after another. Microblogging has become one of the main online activities of Chinese netizens, and under these circumstances microblog analysis technology has gradually attracted the attention of many researchers.
Automatic analysis of microblogs generally focuses on two basic tasks: microblog user analysis and microblog content analysis. Of the two, user analysis is the basis of content analysis. For microblog user gender identification, existing research mainly targets foreign-language websites such as Twitter, and most of it classifies gender by analyzing and processing text information, that is, through microblog content analysis. Unlike traditional text, microblog messages are short and colloquial and often contain emoticons, so traditional text classification methods do not achieve good classification results on them.
In view of this, the present invention proposes a user gender identification method and system based on microblog user relationships to solve the above problem.
Summary of the invention
The invention provides a user gender identification method based on microblog user relationships, comprising the following steps.
S1: collect the user information of microblog users through the API interface provided by the microblog website, and classify the users.
S2: obtain the userids of each classified user's followers and fans according to that user's userid, and organize the userids of the followers and fans into text.
S3: use information gain to perform feature extraction on the training samples, and use a maximum entropy classifier to classify the samples to be classified.
Preferably, in step S1, the user information includes the userids of the user's followers and fans and a gender field, and the users are classified according to the gender field.
Preferably, in step S1, the process of collecting the user information of microblog users comprises the following steps:
S101: randomly select a user as the seed user, and use the API interface provided by the microblog to crawl that user's information;
S102: according to the userids of the followers and fans in the crawled user information, continue to crawl the user information of those followers and fans until the number of crawled users reaches the required scale.
Preferably, in step S1, users are classified according to the gender field value in the crawled user information, where the gender field value is one of m, f and n: m denotes male, f denotes female, and n denotes unknown.
Preferably, step S2 further comprises: after the userids of the followers and fans are organized into text, storing them on two separate lines of a file; selecting equal numbers of male and female user texts to form the training samples; and selecting a further equal number of male and female user texts to form the test samples.
Preferably, step S3 further comprises building the maximum entropy classifier from the training samples, where the maximum entropy implementation used is the one in the MALLET machine learning toolkit.
Preferably, the information gain described in step S3 is computed as:

$$IG(t_i) = -\sum_{j=1}^{M} P(c_j)\log P(c_j) + P(t_i)\sum_{j=1}^{M} P(c_j \mid t_i)\log P(c_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{M} P(c_j \mid \bar{t}_i)\log P(c_j \mid \bar{t}_i)$$

where $P(c_j)$ is the probability that a document of class $c_j$ occurs in the corpus, $P(t_i)$ is the probability that a document in the corpus contains the feature term $t_i$, $P(c_j \mid t_i)$ is the conditional probability that a document belongs to class $c_j$ given that it contains $t_i$, $P(\bar{t}_i)$ is the probability that a document in the corpus does not contain $t_i$, $P(c_j \mid \bar{t}_i)$ is the conditional probability that a document belongs to class $c_j$ given that it does not contain $t_i$, and $M$ is the number of classes.
Preferably, after the information gain is computed, the 4000 userids with the highest information gain values are selected.
The present invention also provides a user gender identification system based on microblog user relationships, comprising a corpus acquisition and preprocessing module, a user information processing module, a classifier training module and a to-be-tested user classification module. The corpus acquisition and preprocessing module is connected to the user information processing module, the user information processing module is connected to the classifier training module, and the classifier training module is connected to the to-be-tested user classification module. The corpus acquisition and preprocessing module obtains the user information of microblog users through the API interface. The user information processing module classifies users according to their gender field value, then organizes the user relationships into text of the corresponding format according to the userids, and randomly selects training samples and test samples from that text. The classifier training module builds the maximum entropy classifier. The to-be-tested user classification module classifies the data to be tested with that maximum entropy classifier.
With the user gender identification method and system based on microblog user relationships provided by the invention, the user information of microblog users is collected and the users are classified without any complex processing of microblog text. The userids of each classified user's followers and fans are obtained according to that user's userid and organized into text; information gain is then used to perform feature extraction on the training samples, and a maximum entropy classifier is used to classify the samples to be classified. Compared with using microblog text, this yields a better microblog user gender classification effect.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the user gender identification method based on microblog user relationships provided by a preferred embodiment of the present invention;
Fig. 2 is a schematic diagram of the user gender identification system based on microblog user relationships provided by a preferred embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. Where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
Fig. 1 is a flowchart of the user gender identification method based on microblog user relationships provided by a preferred embodiment of the present invention. As shown in Fig. 1, the method comprises steps S1 to S3.
Step S1: collect the user information of microblog users through the API interface provided by the microblog website, and classify the users.
Specifically, the microblog website in this embodiment is Sina Weibo. In other embodiments a different website may be chosen as needed; the invention is not limited in this respect.
In this embodiment, the process of collecting the user information of microblog users comprises the following steps.
S101: randomly select a user as the seed user, and use the API interface provided by the microblog to crawl that user's information.
Here, the user information includes the userids of the user's followers and fans and a gender field, and users are classified according to the gender field. The gender field value is one of m, f and n, where m denotes male, f denotes female, and n denotes unknown; users are accordingly divided into these three classes by gender.
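The grouping by gender field can be sketched as follows. This is a minimal illustration, assuming each crawled profile is a dict with the hypothetical keys `userid` and `gender`; the record layout actually returned by the microblog API is not specified in the text.

```python
from collections import defaultdict

def group_by_gender(profiles):
    """Group userids by gender field value: 'm', 'f', or 'n' (unknown)."""
    groups = defaultdict(list)
    for profile in profiles:
        # Assumption: a missing or unexpected value is treated as 'n' (unknown).
        gender = profile.get("gender", "n")
        if gender not in ("m", "f"):
            gender = "n"
        groups[gender].append(profile["userid"])
    return dict(groups)

profiles = [
    {"userid": "1001", "gender": "m"},
    {"userid": "1002", "gender": "f"},
    {"userid": "1003", "gender": "n"},
    {"userid": "1004"},
]
groups = group_by_gender(profiles)
```

Only the `m` and `f` groups would feed the later training and test sets, since users marked `n` carry no usable label.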
S102: according to the userids of the followers and fans in the crawled user information, continue to crawl the user information of those followers and fans until the number of crawled users reaches the required scale.
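Steps S101 and S102 amount to a crawl that expands outward from a seed user. The sketch below uses a breadth-first order, which is an assumption (the text does not fix the traversal order), and `fetch_profile` is a hypothetical stand-in for the microblog API call, not a real endpoint.

```python
from collections import deque

def crawl(seed_userid, fetch_profile, target_count):
    """Breadth-first crawl starting from a seed user.

    fetch_profile(userid) is a hypothetical callable standing in for the
    microblog API; it is assumed to return a dict with the keys 'userid',
    'gender', 'followers' and 'fans' (the last two are lists of userids),
    or None if the user cannot be fetched.
    """
    collected = {}
    queue = deque([seed_userid])
    seen = {seed_userid}
    while queue and len(collected) < target_count:
        userid = queue.popleft()
        profile = fetch_profile(userid)
        if profile is None:
            continue  # skip users that could not be fetched
        collected[userid] = profile
        for neighbor in profile["followers"] + profile["fans"]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return collected

# Tiny in-memory stand-in for the API, for illustration only.
FAKE_NETWORK = {
    "a": {"userid": "a", "gender": "m", "followers": ["b", "c"], "fans": ["d"]},
    "b": {"userid": "b", "gender": "f", "followers": ["a"], "fans": []},
    "c": {"userid": "c", "gender": "n", "followers": [], "fans": ["a"]},
    "d": {"userid": "d", "gender": "f", "followers": ["a"], "fans": ["b"]},
}
result = crawl("a", FAKE_NETWORK.get, target_count=3)
```

The `seen` set keeps a user from being queued twice, which matters because follower and fan lists overlap heavily in practice.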
Step S2: obtain the userids of each classified user's followers and fans according to that user's userid, and organize the userids of the followers and fans into text.
Specifically, in this step, after the userids of the followers and fans are organized into text, they are stored on two separate lines of a file. A special symbol can be used to separate the userids; this embodiment uses a space. If a user has no followers or no fans, the corresponding line is empty.
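The two-line layout can be sketched as below. Putting followers on the first line and fans on the second is an assumption, as is the function name; the text only states that the two lists occupy two lines and are space-separated.

```python
def relations_to_text(followers, fans):
    """Return a two-line text: followers' userids on line 1, fans' on line 2.

    Userids are separated by spaces; an empty list yields an empty line.
    """
    return " ".join(followers) + "\n" + " ".join(fans) + "\n"

text = relations_to_text(["1002", "1003"], [])
```

Treating each userid as a whitespace-separated token lets an ordinary text classification toolchain consume the relationship data unchanged.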
In addition, equal numbers of male and female user texts are selected to form the training samples, and a further equal number of male and female user texts are selected to form the test samples, that is, the samples to be classified. This embodiment uses 1000 male and 1000 female users as training samples, and 1000 male and 1000 female users as test samples.
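A balanced, non-overlapping selection of this kind might look like the following sketch. The function name and the fixed random seed are illustrative assumptions; the embodiment would call it with 1000 training and 1000 test users per gender.

```python
import random

def balanced_split(male_docs, female_docs, n_train, n_test, seed=0):
    """Randomly pick n_train + n_test users per gender, without overlap."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    train, test = [], []
    for label, docs in (("m", male_docs), ("f", female_docs)):
        picked = rng.sample(docs, n_train + n_test)
        train += [(doc, label) for doc in picked[:n_train]]
        test += [(doc, label) for doc in picked[n_train:]]
    return train, test

males = [f"m_user_{i}" for i in range(10)]
females = [f"f_user_{i}" for i in range(10)]
train, test = balanced_split(males, females, n_train=3, n_test=2)
```

Sampling both sets from one draw per gender guarantees that no user appears in both the training and the test set.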
Step S3: use information gain to perform feature extraction on the training samples, and use a maximum entropy classifier to classify the samples to be classified.
Specifically, this embodiment builds the maximum entropy classifier from the training samples, using the maximum entropy implementation in the MALLET machine learning toolkit.
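To make the classifier concrete, here is a small pure-Python sketch of a maximum entropy model (equivalent to multinomial logistic regression) over binary userid features. It is not MALLET and does not reproduce MALLET's training procedure; the stochastic gradient ascent, the learning rate and the function names are all assumptions chosen for brevity.

```python
import math

def predict_proba(weights, feats, classes):
    """Return P(c | d) for each class c, normalized over all classes."""
    scores = {c: sum(weights.get((f, c), 0.0) for f in feats) for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    samples: list of sets of feature strings (here: userids).
    labels: list of class labels, one per sample.
    Returns (weights, classes), where weights maps (feature, class) to a float.
    """
    classes = sorted(set(labels))
    weights = {}
    for _ in range(epochs):
        for feats, gold in zip(samples, labels):
            probs = predict_proba(weights, feats, classes)
            for c in classes:
                # Gradient term: observed indicator minus predicted probability.
                delta = (1.0 if c == gold else 0.0) - probs[c]
                for f in feats:
                    weights[(f, c)] = weights.get((f, c), 0.0) + lr * delta
    return weights, classes

samples = [{"u1", "u2"}, {"u1", "u3"}, {"u4", "u5"}, {"u4", "u6"}]
labels = ["m", "m", "f", "f"]
weights, classes = train_maxent(samples, labels)
probs = predict_proba(weights, {"u1"}, classes)
predicted = max(probs, key=probs.get)
```

In this toy data `u1` is followed only by users labeled `m`, so a user whose relationship text contains `u1` is pushed toward that class.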
The information gain is computed as:

$$IG(t_i) = -\sum_{j=1}^{M} P(c_j)\log P(c_j) + P(t_i)\sum_{j=1}^{M} P(c_j \mid t_i)\log P(c_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{M} P(c_j \mid \bar{t}_i)\log P(c_j \mid \bar{t}_i)$$

where $P(c_j)$ is the probability that a document of class $c_j$ occurs in the corpus, $P(t_i)$ is the probability that a document in the corpus contains the feature term $t_i$, $P(c_j \mid t_i)$ is the conditional probability that a document belongs to class $c_j$ given that it contains $t_i$, $P(\bar{t}_i)$ is the probability that a document in the corpus does not contain $t_i$, $P(c_j \mid \bar{t}_i)$ is the conditional probability that a document belongs to class $c_j$ given that it does not contain $t_i$, and $M$ is the number of classes.
Under the maximum entropy model, the conditional probability $P(c \mid d)$ is predicted from feature functions $F_{i,c}$, with $Z(d)$ as the normalization factor.
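For reference, a common way to write the maximum entropy model with the symbols named above is sketched below. The weights $\lambda_{i,c}$ and the count $n_i(d)$ (how often feature $i$ occurs in document $d$) are notation assumed here for illustration; they do not appear in the text, and the binary form of the feature function is the usual text-classification choice rather than something the text confirms.

```latex
P(c \mid d) = \frac{1}{Z(d)} \exp\Big( \sum_{i} \lambda_{i,c} \, F_{i,c}(d, c) \Big),
\qquad
Z(d) = \sum_{c'} \exp\Big( \sum_{i} \lambda_{i,c'} \, F_{i,c'}(d, c') \Big)

F_{i,c}(d, c') =
\begin{cases}
1, & n_i(d) > 0 \text{ and } c' = c \\
0, & \text{otherwise}
\end{cases}
```

Under this reading, each feature $i$ is a userid and $d$ is a user's relationship text, so $F_{i,c}$ fires when that userid appears among the user's followers or fans and the candidate class is $c$.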
After the information gain has been computed as above, this embodiment selects the 4000 userids with the highest information gain values.
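The information gain computation and the top-ranked selection can be sketched as follows, assuming the standard information gain definition over the presence or absence of a userid in a user's relationship text. The function names and the toy data are illustrative; the embodiment would use k = 4000.

```python
import math
from collections import Counter

def information_gain(docs, labels):
    """Compute the information gain of every feature.

    docs: list of sets of features (userids); labels: list of class labels.
    Returns a dict mapping each feature to its information gain.
    """
    n = len(docs)
    class_count = Counter(labels)
    feat_count = Counter()   # feature -> number of docs containing it
    feat_class = Counter()   # (feature, class) -> number of docs containing it
    for feats, label in zip(docs, labels):
        for f in feats:
            feat_count[f] += 1
            feat_class[(f, label)] += 1

    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0

    # Class entropy term: -sum_j P(c_j) log P(c_j)
    base = -sum(plogp(cnt / n) for cnt in class_count.values())
    gains = {}
    for f, with_f in feat_count.items():
        without_f = n - with_f
        gain = base
        for c, c_total in class_count.items():
            in_c = feat_class[(f, c)]
            gain += (with_f / n) * plogp(in_c / with_f)
            if without_f:
                gain += (without_f / n) * plogp((c_total - in_c) / without_f)
        gains[f] = gain
    return gains

def top_k_features(gains, k):
    """Return the k features with the highest information gain."""
    return sorted(gains, key=gains.get, reverse=True)[:k]

docs = [{"u1", "u9"}, {"u1"}, {"u2", "u9"}, {"u2"}]
labels = ["m", "m", "f", "f"]
gains = information_gain(docs, labels)
selected = top_k_features(gains, 2)
```

In the toy data, `u1` and `u2` each separate the two classes perfectly, while `u9` appears equally in both and carries no information, so it is dropped.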
As described above, with 1000 male and 1000 female users as training samples and 1000 male and 1000 female users as test samples, the method of the invention classifies microblog users with an accuracy of 0.843.
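Accuracy here is taken to mean the fraction of test users whose predicted gender matches the recorded gender field. On 2000 test users, an accuracy of 0.843 would correspond to 1686 correct predictions. A minimal sketch of the measure, with an illustrative function name:

```python
def accuracy(predicted, gold):
    """Fraction of test samples whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

acc = accuracy(["m", "f", "f", "m"], ["m", "f", "m", "m"])
```

Because the test set is balanced between the two genders, a classifier that always guessed one gender would score 0.5 on this measure.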
Fig. 2 is a schematic diagram of the user gender identification system based on microblog user relationships provided by a preferred embodiment of the present invention. As shown in Fig. 2, the system comprises a corpus acquisition and preprocessing module 1, a user information processing module 2, a classifier training module 3 and a to-be-tested user classification module 4. The corpus acquisition and preprocessing module 1 is connected to the user information processing module 2, the user information processing module 2 is connected to the classifier training module 3, and the classifier training module 3 is connected to the to-be-tested user classification module 4. The corpus acquisition and preprocessing module 1 obtains the user information of microblog users through the API interface. The user information processing module 2 classifies users according to their gender field value, then organizes the user relationships into text of the corresponding format according to the userids, and randomly selects training samples and test samples from that text. The classifier training module 3 builds the maximum entropy classifier. The to-be-tested user classification module 4 classifies the data to be tested with that maximum entropy classifier. The operation of the system is similar to that of the method of the invention and is not repeated here.
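The chain of four modules can be pictured as a simple pipeline. The sketch below only mirrors the connection order 1, 2, 3, 4 described for Fig. 2; the class name, field names and stub callables are hypothetical and stand in for the real modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GenderPipeline:
    """Chains the four modules in the order described: 1 -> 2 -> 3 -> 4."""
    acquire: Callable[[], List[dict]]                     # module 1
    process: Callable[[List[dict]], Tuple[list, list]]    # module 2
    train: Callable[[list], Callable[[object], str]]      # module 3

    def run(self) -> List[str]:
        profiles = self.acquire()
        train_set, test_set = self.process(profiles)
        classifier = self.train(train_set)
        # Module 4: apply the trained classifier to the data to be tested.
        return [classifier(sample) for sample in test_set]

# Stub modules, for illustration only.
pipeline = GenderPipeline(
    acquire=lambda: [{"userid": "1", "gender": "m"}, {"userid": "2", "gender": "f"}],
    process=lambda profiles: (profiles[:1], profiles[1:]),
    train=lambda train_set: (lambda sample: train_set[0]["gender"]),
)
predictions = pipeline.run()
```

Keeping each module behind a plain callable makes it possible to swap, for example, the crawler or the classifier without touching the other stages.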
The user gender identification method and system based on microblog user relationships provided by the preferred embodiment of the present invention build text from the user relationships of microblog users (that is, the userids of a user's followers and fans), taking full account of the importance of user relationships in microblogging. In addition, information gain is used to perform feature extraction on the training samples, which greatly reduces the feature dimensionality. The method thereby avoids the adverse effects of using microblog messages in the classification process and achieves a better classification effect.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. A user gender identification method based on microblog user relationships, characterized by comprising the following steps:
S1, collecting the user information of microblog users through the API interface provided by a microblog website, and classifying the users;
S2, obtaining the userids of each classified user's followers and fans according to that user's userid, and organizing the userids of the followers and fans into text;
S3, using information gain to perform feature extraction on training samples, and using a maximum entropy classifier to classify the samples to be classified.
2. The method according to claim 1, characterized in that, in step S1, the user information includes the userids of the user's followers and fans and a gender field, and the users are classified according to the gender field.
3. The method according to claim 1 or 2, characterized in that, in step S1, the process of collecting the user information of microblog users comprises the following steps:
S101, randomly selecting a user as the seed user, and using the API interface provided by the microblog to crawl that user's information;
S102, according to the userids of the followers and fans in the crawled user information, continuing to crawl the user information of those followers and fans until the number of crawled users reaches the required scale.
4. The method according to claim 1 or 2, characterized in that, in step S1, users are classified according to the gender field value in the crawled user information, wherein the gender field value is one of m, f and n, m denoting male, f denoting female, and n denoting unknown.
5. The method according to claim 1, characterized in that step S2 further comprises: after the userids of the followers and fans are organized into text, storing them on two separate lines of a file; selecting equal numbers of male and female user texts to form the training samples; and selecting a further equal number of male and female user texts to form the test samples.
6. The method according to claim 1, characterized in that step S3 further comprises building the maximum entropy classifier from the training samples, wherein the maximum entropy implementation used is the one in the MALLET machine learning toolkit.
7. The method according to claim 1, characterized in that the information gain described in step S3 is computed as:

$$IG(t_i) = -\sum_{j=1}^{M} P(c_j)\log P(c_j) + P(t_i)\sum_{j=1}^{M} P(c_j \mid t_i)\log P(c_j \mid t_i) + P(\bar{t}_i)\sum_{j=1}^{M} P(c_j \mid \bar{t}_i)\log P(c_j \mid \bar{t}_i)$$

where $P(c_j)$ is the probability that a document of class $c_j$ occurs in the corpus, $P(t_i)$ is the probability that a document in the corpus contains the feature term $t_i$, $P(c_j \mid t_i)$ is the conditional probability that a document belongs to class $c_j$ given that it contains $t_i$, $P(\bar{t}_i)$ is the probability that a document in the corpus does not contain $t_i$, $P(c_j \mid \bar{t}_i)$ is the conditional probability that a document belongs to class $c_j$ given that it does not contain $t_i$, and $M$ is the number of classes.
8. The method according to claim 7, characterized in that, after the information gain is computed, the 4000 userids with the highest information gain values are selected.
9. A user gender identification system based on microblog user relationships, characterized by comprising a corpus acquisition and preprocessing module, a user information processing module, a classifier training module and a to-be-tested user classification module, wherein the corpus acquisition and preprocessing module is connected to the user information processing module, the user information processing module is connected to the classifier training module, and the classifier training module is connected to the to-be-tested user classification module;
the corpus acquisition and preprocessing module is configured to obtain the user information of microblog users through the API interface;
the user information processing module is configured to classify users according to their gender field value, then organize the user relationships into text of the corresponding format according to the userids, and randomly select training samples and test samples from that text;
the classifier training module is configured to build the maximum entropy classifier;
the to-be-tested user classification module is configured to classify the data to be tested with the maximum entropy classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410494539.7A CN104268214B (en) | 2014-09-24 | 2014-09-24 | A kind of user's gender identification method and system based on microblog users relation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410494539.7A CN104268214B (en) | 2014-09-24 | 2014-09-24 | A kind of user's gender identification method and system based on microblog users relation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104268214A (en) | 2015-01-07 |
CN104268214B (en) | 2018-01-19 |
Family
ID=52159736
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410494539.7A (Active, CN104268214B) | 2014-09-24 | 2014-09-24 | A kind of user's gender identification method and system based on microblog users relation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104268214B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126607A (en) * | 2016-06-21 | 2016-11-16 | 重庆邮电大学 | A kind of customer relationship towards social networks analyzes method |
CN106327341A (en) * | 2016-08-15 | 2017-01-11 | 首都师范大学 | Weibo user gender deduction method and system based on combined theme |
CN106682118A (en) * | 2016-12-08 | 2017-05-17 | 华中科技大学 | Social network site false fan detection method achieved on basis of network crawler by means of machine learning |
EP3188094A1 (en) * | 2015-12-30 | 2017-07-05 | Xiaomi Inc. | Method and device for classification model training |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120083255A1 (en) * | 2010-10-04 | 2012-04-05 | Telefonica, S.A. | Method for gender identification of a cell-phone subscriber |
CN102663027A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Method for predicting attributes of webpage crowd |
Non-Patent Citations (4)
Title |
---|
LI S et al.: "《A Framework of Feature Selection Methods for Text Categorization》", 《IN PROCEEDINGS OF ACL-IJCNLP-09》 * |
MILLER Z et al.: "《Gender prediction on twitter using stream algorithms with N-gram character features》", 《INTERNATIONAL JOURNAL OF INTELLIGENCE SCIENCE》 * |
廉营: "《基于语义角色标注的微博人物关系抽取》", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 * |
杨芹: "《基于最大熵模型的中文网页分类器设计和实现》", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 * |
Also Published As
Publication number | Publication date |
---|---|
CN104268214B (en) | 2018-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108885623B (en) | Semantic analysis system and method based on knowledge graph | |
Wu et al. | Twitter spam detection based on deep learning | |
JP5759228B2 (en) | A method for calculating semantic similarity between messages and conversations based on extended entity extraction | |
US20180124193A1 (en) | System and method for displaying contextual activity streams | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN104239539A (en) | Microblog information filtering method based on multi-information fusion | |
CN104933113A (en) | Expression input method and device based on semantic understanding | |
CN103458042A (en) | Microblog advertisement user detection method | |
CN102279890A (en) | Sentiment word extracting and collecting method based on micro blog | |
CN108199951A (en) | A kind of rubbish mail filtering method based on more algorithm fusion models | |
CN104298665A (en) | Identification method and device of evaluation objects of Chinese texts | |
CN107688576B (en) | Construction and tendency classification method of CNN-SVM model | |
CN104317784A (en) | Cross-platform user identification method and cross-platform user identification system | |
CN102169496A (en) | Anchor text analysis-based automatic domain term generating method | |
CN105893484A (en) | Microblog Spammer recognition method based on text characteristics and behavior characteristics | |
CN104268214A (en) | Micro-blog user relationship based user gender identification method and system | |
CN102708164A (en) | Method and system for calculating movie expectation | |
US11243907B2 (en) | Digital file recognition and deposit system | |
CN104281694A (en) | Analysis system of emotional tendency of text | |
CN104598648A (en) | Interactive gender identification method and device for microblog user | |
CN105243095A (en) | Microblog text based emotion classification method and system | |
US9165053B2 (en) | Multi-source contextual information item grouping for document analysis | |
CN106296249A (en) | A kind of user classification method based on LBS and interest and system | |
Alshaikh et al. | Sentiment Analysis for Smartphone Operating System: Privacy and Security on Twitter Data | |
CN104572767B (en) | A kind of method and system of website languages classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||