CN107330081A - An information feature extraction method - Google Patents

An information feature extraction method

Info

Publication number
CN107330081A
Authority
CN
China
Prior art keywords
user
content
microblog
feature
property
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710531273.2A
Other languages
Chinese (zh)
Inventor
裴炜平
万里
黄娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen City Street Technology Media Ltd
Original Assignee
Shenzhen City Street Technology Media Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen City Street Technology Media Ltd
Priority to CN201710531273.2A
Publication of CN107330081A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an information feature extraction method. The method includes: obtaining, according to a user identifier, the user attributes corresponding to that identifier; obtaining, according to the user identifier, the corresponding microblog content, and determining the corresponding user behavior from that content; determining, from the user attributes and the user behavior, the user attribute feature corresponding to the user identifier; and establishing and storing the correspondence between user attribute features and class labels. Verification on collected real data shows that the proposed feature selection and processing method achieves high accuracy in identifying "water army" (paid-poster) accounts with machine learning classifiers.

Description

An information feature extraction method
Technical field
This application relates to the field of information technology, and in particular to an information feature extraction method.
Background
Microblogging, as the most popular, most widely used and most influential product of its kind in China, has become an important social platform in people's daily lives. Unlike other social platforms, a microblog is also an important platform for news distribution and public opinion. As the influence of microblogs has grown, a group of "network water army" accounts has emerged on microblog platforms, threatening the order of online social platforms. The "network water army", also called the microblog water army, generally refers to a batch of microblog accounts controlled by public relations firms. These accounts spread messages and steer public opinion by forwarding, commenting and similar means, and are commonly used for information promotion, advertising and crisis handling.
Like normal accounts, microblog water-army accounts have their own independent accounts and independent user profiles, and, like normal users, they can post, forward and comment on messages on the microblog platform. As Sina Weibo strengthens its monitoring of abnormal accounts, water-army accounts have evolved to look more and more like normal users in order to evade Sina Weibo's anomaly detection, which makes water-army identification increasingly difficult.
In existing research on microblog water-army identification, the identification methods fall mainly into rule-based methods and machine-learning-based methods. Early rule-based methods relied on manually finding boundary criteria between water-army accounts and normal accounts, but these criteria are fixed and are therefore unsuited to identifying water-army accounts that keep evolving. The effectiveness of machine-learning-based identification depends mainly on the choice of features and the choice of model. Feature selection that achieves good results is mainly carried out in the following two ways:
(1) Feature selection based on the user relationship graph. This approach builds a user relationship graph from the users that a target user follows and the users who follow the target user, and uses it to measure how the target user communicates on the microblog platform.
(2) Feature selection based on text content. This approach extracts features mainly from the repetitiveness of the text and from sentiment analysis of the text.
Conventional machine-learning-based feature selection and processing for microblog water-army identification has the following main shortcomings:
(1) Feature selection based on user relationships requires building microblog user relationship groups in order to establish the users' social network. Although such features can improve identification accuracy, considerable storage is needed to hold the user relationship graph and considerable time is needed to derive relationship features from it.
(2) Text-content information is obtained mainly from two aspects: text repetitiveness and text sentiment analysis. Feature selection for text repetitiveness requires building a text corpus or an online search capability, which is too costly. Text sentiment analysis requires prior sentiment learning and annotation, with effort spent on building dictionaries and on sentiment learning, and the accuracy of sentiment analysis is still not high.
Summary of the invention
The embodiments of the present invention provide an information feature extraction method to solve the problem in the prior art that user feature extraction is not sufficiently accurate.
The specific technical solution is as follows:
An information feature extraction method, the method including:
obtaining, according to a user identifier, the user attributes corresponding to the user identifier;
obtaining, according to the user identifier, the microblog content corresponding to the user identifier, and determining the corresponding user behavior from the microblog content;
determining, according to the user attributes and the user behavior, the user attribute feature corresponding to the user identifier;
establishing and storing the correspondence between the user attribute feature and a class label.
Optionally, obtaining the user attributes corresponding to the user identifier according to the user identifier specifically includes:
obtaining, according to the user identifier, at least the user level, user verification, follower ratio and profile information corresponding to the user identifier;
using the user level, the user verification, the follower ratio and the profile information as the user attributes.
Optionally, obtaining the microblog content corresponding to the user identifier according to the user identifier, and determining the corresponding user behavior from the microblog content, includes:
obtaining the specified text characters, the text content length and the microblog posting time corresponding to the microblog content;
using the specified text characters, the text content length and the microblog posting time as the user behavior.
Optionally, determining the user attribute feature corresponding to the user identifier according to the user attributes and the user behavior includes:
combining the parameters of the obtained user attributes with the parameters of the user behavior to obtain a combined result;
using the combined result as the user attribute feature.
Optionally, after establishing and storing the correspondence between the user attribute feature and the class label, the method further includes:
obtaining a specified user identifier, and obtaining the corresponding specified user attribute feature according to the specified user identifier;
determining, according to the correspondence between user attribute features and class labels, the class label corresponding to the specified user attribute feature.
The above technical solution has at least the following technical effects:
(1) Verification on collected real microblog data shows that the proposed feature selection and processing method achieves high water-army identification accuracy with machine learning classifiers.
(2) The proposed feature selection and processing method obtains its information mainly from the user's home page, and the resulting features depend very little on other microblog users, so real-time microblog water-army identification can be achieved.
Brief description of the drawings
Fig. 1 is a flowchart of an information feature extraction method in an embodiment of the present invention.
Detailed description
The technical solution of the present invention is described in detail below with reference to the drawing and specific embodiments. It should be understood that the embodiments of the present invention and the specific technical features in them illustrate the technical solution rather than limit it, and that, where no conflict arises, the embodiments and their specific technical features may be combined with one another.
Fig. 1 shows a flowchart of an information feature extraction method in an embodiment of the present invention. The method includes:
S101: according to a user identifier, obtain the user attributes corresponding to the user identifier;
For a given user identifier, a web crawler fetches the user's microblog home page and parses its content to obtain the user's attributes and the microblog posts shown on the home page.
S102: according to the user identifier, obtain the microblog content corresponding to the user identifier, and determine the corresponding user behavior from the microblog content;
S103: according to the user attributes and the user behavior, determine the user attribute feature corresponding to the user identifier;
S104: establish and store the correspondence between the user attribute feature and a class label.
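As a rough illustration of step S104 and of the later lookup of a class label for a given user, the following minimal sketch keeps (feature vector, class label) pairs in memory and returns the label of the nearest stored vector. The nearest-neighbour rule, the function names and the label strings are all assumptions made for illustration; the source does not name a specific classifier or storage format.

```python
import math

# Hypothetical in-memory store of (feature vector, class label) pairs.
labeled_features = []


def store_correspondence(feature, label):
    """Record the correspondence between one feature vector and its class label."""
    labeled_features.append((tuple(feature), label))


def lookup_label(feature):
    """Return the label of the closest stored vector (1-nearest-neighbour).

    The nearest-neighbour rule is an assumption, not the source's method.
    """
    if not labeled_features:
        raise ValueError("no stored correspondences")
    nearest = min(labeled_features, key=lambda pair: math.dist(pair[0], feature))
    return nearest[1]


# Toy 2-dimensional vectors and label names, purely for illustration.
store_correspondence([0.9, 0.1], "label_a")
store_correspondence([0.1, 0.8], "label_b")
predicted = lookup_label([0.8, 0.2])
```

In practice any trained classifier could play the role of `lookup_label`; the sketch only shows the store-then-look-up flow.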
Based on the method in step S101, the user's home page can be obtained, and from it the microblog user attributes: user level, follower ratio, user verification and presence of a profile description. The user attributes are obtained from the microblog user's home page as follows:
1. User level: the highest user level is 48, so the actual user level is divided by 48 for normalization.
2. Follower ratio: the user's home page shows the user's "followers" and "following", whose values are the follower count and the following count respectively. The follower ratio is defined as the follower count divided by the sum of the follower count and the following count.
3. User verification: microblog user verification generally takes one of four forms: "ordinary personal user", "personal verified (V) user", "Weibo expert user" and "enterprise user". A 4-dimensional vector is used here to represent the user verification feature.
4. Profile description: the user's home page shows whether the user has a profile description. "1" and "0" are used here to denote having and not having a profile description, respectively.
For example, suppose a microblog user has user level 24, 200 followers and 300 accounts followed, is verified as an "ordinary personal user", and has no profile description. The features are then processed as follows:
(1) User level: 24/48=0.5
(2) Follower ratio: 200/(200+300)=0.4
(3) User verification: (1,0,0,0)
(4) Profile description: 0
The feature vector of this user's attributes is therefore (0.5,0.4,1,0,0,0,0).
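The attribute processing above can be sketched in a few lines of Python. This is a minimal sketch, not the patent's implementation: the function name, the category keys and the handling of a user with zero followers and zero followed accounts are assumptions, and the order of the verification categories is inferred from the worked example, where an ordinary personal user maps to (1,0,0,0).

```python
MAX_LEVEL = 48  # highest user level stated in the text

# Hypothetical keys for the four verification categories, in the order
# implied by the worked example (ordinary personal user -> (1, 0, 0, 0)).
VERIFICATION_TYPES = ("ordinary", "verified_v", "expert", "enterprise")


def user_attribute_features(level, followers, following, verification, has_profile):
    """Build the 7-dimensional user attribute vector."""
    total = followers + following
    # Assumption: the ratio is 0 when both counts are 0 (not specified in the source).
    follower_ratio = followers / total if total else 0.0
    one_hot = [1 if verification == v else 0 for v in VERIFICATION_TYPES]
    return [level / MAX_LEVEL, follower_ratio, *one_hot, 1 if has_profile else 0]


# Level 24, 200 followers, 300 followed, ordinary personal user, no profile text.
features = user_attribute_features(24, 200, 300, "ordinary", False)
print(features)
```

Normalizing the level and using a ratio keeps every component in the range 0 to 1, which matches the scale of the one-hot and flag components.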
Further, in addition to the user attributes, the user behavior must also be obtained. The user behavior consists of user interaction, special text characters, text content length and microblog posting time, and is processed as follows:
1. User interaction: the forward, comment and like information under the post is obtained. "1" and "0" indicate that the corresponding behavior is present or absent, respectively.
2. Special text characters: it is determined which of the specified special characters the microblog content contains. The present invention uses 6 special characters: "I", "@", "#", "//@", "web link" and "Miaopai video". "1" and "0" indicate that the corresponding character is present or absent, respectively.
3. Text content length: statistics are computed on the text of the post, giving 4 features: total word count, non-repeated word count, non-stop-word count, and non-stop non-repeated word count. The total word count is the number of words in the posted text. Each text post on the Sina Weibo platform is limited to 140 words, so the total word count is divided by 140 for normalization. The non-repeated word count is the proportion of non-repeated words in the total word count. The non-stop-word count is the proportion of words that are not stop words. The non-stop non-repeated word count is the proportion of words that are neither repeated nor stop words.
4. Microblog posting time: two features are extracted from the posting time, the time of day and the day of the week, where the time of day is the point within the day at which the post was published. Both features are first discretized and then one-hot encoded. The time of day is discretized in 3-hour intervals.
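The interaction flags and the four text-length features can be sketched as follows. This is a minimal sketch under stated assumptions: the input is an already segmented list of words, "non-repeated" is read as "distinct", and empty posts map to zeros; none of these choices is fixed by the source. Under this reading a post with no repeated words scores 1.0 on the second component, whereas the worked example in the text lists 0/7 for that component, so the source may intend a different convention.

```python
MAX_POST_LENGTH = 140  # per-post word limit stated in the text


def interaction_flags(forwards, comments, likes):
    """1 if the post has at least one forward / comment / like, else 0."""
    return [int(forwards > 0), int(comments > 0), int(likes > 0)]


def text_length_features(tokens, stop_words):
    """Return [total/140, distinct ratio, non-stop ratio, distinct non-stop ratio].

    Assumption: "non-repeated words" means distinct words.
    """
    total = len(tokens)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]  # assumption: empty posts map to zeros
    non_stop = [t for t in tokens if t not in stop_words]
    return [
        total / MAX_POST_LENGTH,
        len(set(tokens)) / total,
        len(non_stop) / total,
        len(set(non_stop)) / total,
    ]


# Hypothetical 7-word segmentation with two stop words and no repeats.
tokens = ["w1", "w2", "w3", "w4", "w5", "very", "@"]
length_features = text_length_features(tokens, {"very", "@"})
flags = interaction_flags(0, 5, 8)
```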
For example, suppose a user's latest microblog post is "I am very happy today. @Friend". Word segmentation with Jieba yields 7 words with no repeated words, and two of them are stop words ("very" and "@"). The post has 0 forwards, 5 comments and 8 likes, and was posted at 16:30 on April 17, 2017. The features are then processed as follows:
(1) User interaction: (0,1,1)
(2) Special characters: (1,1,0,0,0,0)
(3) Text length: (7/140,0/7,5/7,5/7)
(4) Posting time:
The time of day is discretized in 3-hour intervals: "0" denotes 0:00 to 3:00, "1" denotes 3:00 to 6:00, and so on, so 16:30 is represented by "5". For the day of the week, April 17, 2017 is a Monday, which is represented by "0". The posting time can therefore be expressed as (5,0), and one-hot encoding gives the feature vector (0,0,0,0,0,1,0,0,1,0,0,0,0,0,0), in which the first 8 positions are the one-hot encoding of the time of day and the last 7 positions are the one-hot encoding of the day of the week.
The behavior feature vector for this post is therefore (0,1,1,1,1,0,0,0,0,0.05,0,0.714,0.714,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0). This is a 28-dimensional feature.
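The posting-time encoding in the example can be reproduced with the standard library. This is a minimal sketch; the function name and the `bin_hours` parameter are illustrative assumptions, while the 3-hour bins and the Monday-as-0 convention follow the text.

```python
from datetime import datetime


def posting_time_features(posted_at, bin_hours=3):
    """One-hot encode the time-of-day bin (8 slots) and the weekday (7 slots)."""
    n_bins = 24 // bin_hours
    time_one_hot = [0] * n_bins
    time_one_hot[posted_at.hour // bin_hours] = 1  # 16:30 falls in bin 5
    day_one_hot = [0] * 7
    day_one_hot[posted_at.weekday()] = 1  # datetime.weekday(): Monday == 0
    return time_one_hot + day_one_hot


# The post from the example: 16:30 on April 17, 2017 (a Monday).
encoded = posting_time_features(datetime(2017, 4, 17, 16, 30))
print(encoded)
```

One-hot encoding avoids implying an order or distance between time bins, which a single integer such as 5 would do.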
The behavior features of each user's first 10 posts are obtained in this way, so each user has behavior features of 28*10=280 dimensions.
Based on the above method, a feature vector of 280+7 dimensions is obtained for each microblog user and is input as features to a machine learning classifier for identification.
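Assembling the final per-user vector is a plain concatenation, sketched below. This is a minimal sketch: the ordering (behavior features first, then attributes) and the zero-padding for users with fewer than 10 posts are assumptions, since the source only states the 280+7 total.

```python
POSTS_PER_USER = 10
BEHAVIOR_DIM = 28
ATTRIBUTE_DIM = 7


def user_feature_vector(attribute_features, post_features):
    """Concatenate up to 10 per-post behavior vectors with the attribute vector."""
    if len(attribute_features) != ATTRIBUTE_DIM:
        raise ValueError("expected a 7-dimensional attribute vector")
    posts = [list(p) for p in post_features][:POSTS_PER_USER]
    for post in posts:
        if len(post) != BEHAVIOR_DIM:
            raise ValueError("expected 28-dimensional behavior vectors")
    # Assumption: users with fewer than 10 posts are padded with zero vectors.
    while len(posts) < POSTS_PER_USER:
        posts.append([0.0] * BEHAVIOR_DIM)
    vector = []
    for post in posts:
        vector.extend(post)
    vector.extend(attribute_features)
    return vector


# Three placeholder posts and a placeholder attribute vector.
example_vector = user_feature_vector([0.5, 0.4, 1, 0, 0, 0, 0], [[0.0] * 28] * 3)
print(len(example_vector))
```

A fixed length matters because most classifiers expect every sample to have the same number of features.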
The above technical solution has at least the following technical effects:
(1) Verification on collected real microblog data shows that the proposed feature selection and processing method achieves high water-army identification accuracy with machine learning classifiers.
(2) The proposed feature selection and processing method obtains its information mainly from the user's home page, and the resulting features depend very little on other microblog users, so real-time microblog water-army identification can be achieved.
Although preferred embodiments of this application have been described, those of ordinary skill in the art can make further changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of this application.
Obviously, those skilled in the art can make various changes and variations to this application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is intended to include them as well.

Claims (5)

1. An information feature extraction method, characterized in that the method includes:
obtaining, according to a user identifier, the user attributes corresponding to the user identifier;
obtaining, according to the user identifier, the microblog content corresponding to the user identifier, and determining the corresponding user behavior from the microblog content;
determining, according to the user attributes and the user behavior, the user attribute feature corresponding to the user identifier;
establishing and storing the correspondence between the user attribute feature and a class label.
2. The method of claim 1, characterized in that obtaining the user attributes corresponding to the user identifier according to the user identifier specifically includes:
obtaining, according to the user identifier, at least the user level, user verification, follower ratio and profile information corresponding to the user identifier;
using the user level, the user verification, the follower ratio and the profile information as the user attributes.
3. The method of claim 1, characterized in that obtaining the microblog content corresponding to the user identifier according to the user identifier, and determining the corresponding user behavior from the microblog content, includes:
obtaining the specified text characters, the text content length and the microblog posting time corresponding to the microblog content;
using the specified text characters, the text content length and the microblog posting time as the user behavior.
4. The method of claim 1, characterized in that determining the user attribute feature corresponding to the user identifier according to the user attributes and the user behavior includes:
combining the parameters of the obtained user attributes with the parameters of the user behavior to obtain a combined result;
using the combined result as the user attribute feature.
5. The method of claim 1, characterized in that, after establishing and storing the correspondence between the user attribute feature and the class label, the method further includes:
obtaining a specified user identifier, and obtaining the corresponding specified user attribute feature according to the specified user identifier;
determining, according to the correspondence between user attribute features and class labels, the class label corresponding to the specified user attribute feature.
CN201710531273.2A 2017-07-03 2017-07-03 A kind of information characteristics extracting method Pending CN107330081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710531273.2A CN107330081A (en) 2017-07-03 2017-07-03 A kind of information characteristics extracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710531273.2A CN107330081A (en) 2017-07-03 2017-07-03 A kind of information characteristics extracting method

Publications (1)

Publication Number Publication Date
CN107330081A true CN107330081A (en) 2017-11-07

Family

ID=60198640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710531273.2A Pending CN107330081A (en) 2017-07-03 2017-07-03 A kind of information characteristics extracting method

Country Status (1)

Country Link
CN (1) CN107330081A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109071A (en) * 2017-12-29 2018-06-01 长威信息科技发展股份有限公司 The monitoring method and electronic equipment dynamically associated based on personnel's social relationships
CN109947526A (en) * 2019-03-29 2019-06-28 北京百度网讯科技有限公司 Method and apparatus for output information
CN111105117A (en) * 2018-10-29 2020-05-05 微梦创科网络科技(中国)有限公司 Method and device for determining user information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571484A (en) * 2011-12-14 2012-07-11 上海交通大学 Method for detecting and finding online water army
CN103077240A (en) * 2013-01-10 2013-05-01 北京工商大学 Microblog water army identifying method based on probabilistic graphical model
CN105893484A (en) * 2016-03-29 2016-08-24 西安交通大学 Microblog Spammer recognition method based on text characteristics and behavior characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
袁旭萍 等: "基于综合指数和熵值法的微博水军自动识别", 《情报杂志》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109071A (en) * 2017-12-29 2018-06-01 长威信息科技发展股份有限公司 The monitoring method and electronic equipment dynamically associated based on personnel's social relationships
CN111105117A (en) * 2018-10-29 2020-05-05 微梦创科网络科技(中国)有限公司 Method and device for determining user information
CN109947526A (en) * 2019-03-29 2019-06-28 北京百度网讯科技有限公司 Method and apparatus for output information
CN109947526B (en) * 2019-03-29 2023-04-11 北京百度网讯科技有限公司 Method and apparatus for outputting information

Similar Documents

Publication Publication Date Title
CN106980692B (en) Influence calculation method based on microblog specific events
CN104008106B (en) A kind of method and device obtaining much-talked-about topic
CN105005594B (en) Abnormal microblog users recognition methods
CN105488092B (en) A kind of time-sensitive and adaptive sub-topic online test method and system
CN102722709B (en) Method and device for identifying garbage pictures
CN101071418B (en) Chat method and system
CN103514238B (en) Sensitive word identifying processing method based on classification searching
CN104866478B (en) Malicious text detection and identification method and device
CN104133817A (en) Online community interaction method and device and online community platform
CN103970891B (en) A kind of user interest information querying method based on situation
CN103049440A (en) Recommendation processing method and processing system for related articles
CN101510856A (en) Method and apparatus for extracting member relation loop in SNS network
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN106844786A (en) A kind of public sentiment region focus based on text similarity finds method
CN103425649A (en) Method and device for adding friend information
CN107330081A (en) A kind of information characteristics extracting method
CN101963972A (en) Method and system for extracting emotional keywords
CN110956210A (en) Semi-supervised network water force identification method and system based on AP clustering
CN106789572A (en) A kind of instant communicating system and instant communication method for realizing self adaptation message screening
CN106886296A (en) The treating method and apparatus of the dictionary of input method
CN103186555A (en) Evaluation information generation method and system
CN111242218A (en) Cross-social network user identity recognition method fusing user multi-attribute information
CN104915388B (en) It is a kind of that method is recommended based on spectral clustering and the book labels of mass-rent technology
CN103929499B (en) A kind of Internet of Things isomery index identification method and system
CN108415971B (en) Method and device for recommending supply and demand information by using knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20171107)