CN107330081A - A kind of information characteristics extracting method - Google Patents
A kind of information characteristics extracting method
- Publication number
- CN107330081A (application CN201710531273.2A)
- Authority
- CN
- China
- Prior art keywords
- user
- content
- microblog
- feature
- property
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Primary Health Care (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- General Health & Medical Sciences (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an information feature extraction method. The method comprises: obtaining, according to a user identifier, the user attributes corresponding to the user identifier; obtaining, according to the user identifier, the corresponding microblog content, and determining the corresponding user behavior from that content; determining, from the user attributes and the user behavior, the user attribute feature corresponding to the user identifier; and establishing and storing the correspondence between the user attribute feature and a class label. Verification on collected real data shows that the proposed feature selection and processing method achieves high accuracy in identifying water-army accounts (coordinated paid posters) with machine learning classifiers.
Description
Technical field
The present application relates to the field of information technology, and in particular to an information feature extraction method.
Background art
As the most popular, most widely used, and most influential product of its kind in China, the microblog has become an important social platform in people's daily lives. Unlike other social platforms, the microblog is also an important platform for news distribution and public opinion. As its influence has grown, a "network water army" has emerged on microblog platforms and threatens the order of online social platforms. The term usually refers to a group of microblog accounts controlled by public relations firms, also called the microblog water army, which spread messages and steer public opinion by reposting and commenting. Such accounts are commonly used for information promotion, advertising, and crisis management.
Like normal accounts, water-army accounts have their own independent accounts and user profiles, and they can post, repost, and comment on the microblog platform just as normal users do. As Sina Weibo has strengthened its monitoring of abnormal accounts, water-army accounts have evolved to look more and more like normal users in order to evade Sina Weibo's anomaly detection, which makes identifying them harder.
Existing research on microblog water-army identification mainly uses rule-based methods and machine-learning-based methods. Early rule-based methods relied on manually determined criteria that separate water-army accounts from normal accounts, but those criteria are fixed and are therefore ill-suited to identifying water-army accounts that keep evolving. The effectiveness of machine-learning-based identification depends mainly on the choice of features and the choice of model. The feature selection approaches that achieve better results fall mainly into the following two categories:
(1) Feature selection based on the user relationship graph. This approach builds a user relationship graph from the users that a user follows and the user's followers, and uses it to measure how the target user's content spreads on the microblog platform.
(2) Feature selection based on text content. This approach extracts features mainly from the repetition rate of the text and from text sentiment analysis.
Conventional feature selection and processing for machine-learning-based microblog water-army identification has the following main shortcomings:
(1) Feature selection based on user relationships requires building groups of related microblog users in order to establish their social network. Although such features can improve identification accuracy, they require considerable storage space for user data and considerable time to extract relationship features from the user relationship graph.
(2) Text content features are obtained mainly from two aspects: text repetition and text sentiment analysis. Feature selection based on text repetition requires building a text library or providing an online search function, which is too costly. Text sentiment analysis requires up-front sentiment learning and labeling as well as effort to build dictionaries, and the accuracy of sentiment analysis is still not high.
Summary of the invention
The embodiments of the invention provide an information feature extraction method to solve the problem of low accuracy of user feature extraction in the prior art.
The specific technical solution is as follows:
An information feature extraction method, the method comprising:
obtaining, according to a user identifier, the user attributes corresponding to the user identifier;
obtaining, according to the user identifier, the microblog content corresponding to the user identifier, and determining the corresponding user behavior from the microblog content;
determining, from the user attributes and the user behavior, the user attribute feature corresponding to the user identifier;
establishing and storing the correspondence between the user attribute feature and a class label.
Optionally, obtaining, according to the user identifier, the user attributes corresponding to the user identifier specifically comprises:
obtaining, according to the user identifier, at least the user level, user verification, follower ratio, and profile information corresponding to the user identifier;
taking the user level, the user verification, the follower ratio, and the profile information as the user attributes.
Optionally, obtaining, according to the user identifier, the microblog content corresponding to the user identifier and determining the corresponding user behavior from the microblog content comprises:
obtaining the specified text characters, text content length, and microblog posting time corresponding to the microblog content;
taking the specified text characters, the text content length, and the microblog posting time as the user behavior.
Optionally, determining, from the user attributes and the user behavior, the user attribute feature corresponding to the user identifier comprises:
combining the parameters of the obtained user attributes with the parameters of the user behavior to obtain a combined result;
taking the combined result as the user attribute feature.
Optionally, after establishing and storing the correspondence between the user attribute feature and the class label, the method further comprises:
obtaining a specified user identifier, and obtaining the corresponding specified user attribute feature according to the specified user identifier;
determining, according to the correspondence between the user attribute feature and the class label, the class label corresponding to the specified user attribute feature.
The above technical solution has at least the following technical effects:
(1) Verification on collected real microblog data shows that the proposed feature selection and processing method achieves high water-army identification accuracy with machine learning classifiers.
(2) The proposed feature selection and processing method obtains information mainly from the user's home page, and the resulting features depend very little on other microblog users, so real-time microblog water-army identification is possible.
Brief description of the drawings
Fig. 1 is a flow chart of an information feature extraction method in an embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawing and specific embodiments. It should be understood that the embodiments of the present invention and the specific technical features in them explain the technical solution of the present invention and do not limit it. Where no conflict arises, the embodiments of the present invention and the specific technical features in them may be combined with one another.
Fig. 1 shows a flow chart of an information feature extraction method in an embodiment of the present invention. The method comprises:
S101: according to a user identifier, obtain the user attributes corresponding to the user identifier.
For a given user's identifier, a web crawler fetches the user's microblog home page and parses its content to obtain the user's attributes and the content of the microblog posts shown on that home page.
S102: according to the user identifier, obtain the microblog content corresponding to the user identifier, and determine the corresponding user behavior from the microblog content.
S103: from the user attributes and the user behavior, determine the user attribute feature corresponding to the user identifier.
S104: establish and store the correspondence between the user attribute feature and a class label.
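Step S104, together with the later lookup of a class label for a specified user, can be sketched roughly as follows. This is an illustration only: the in-memory list `feature_store`, the function names, the toy two-dimensional vectors, the labels, and the nearest-neighbour rule are all assumptions made for the example. The description itself only states that the features are fed to a machine learning classifier and does not name one.

```python
import math

# Hypothetical in-memory store of (feature vector, class label) pairs.
feature_store = []


def store_correspondence(features, label):
    """Record the correspondence between a feature vector and a class label."""
    feature_store.append((list(features), label))


def lookup_label(features):
    """Return the label of the closest stored vector.

    A 1-nearest-neighbour rule is used purely as a stand-in for the
    unspecified classifier.
    """
    if not feature_store:
        raise ValueError("no stored correspondences")
    nearest = min(feature_store, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


# Toy, made-up data (not taken from the patent).
store_correspondence([0.5, 0.4], "normal")
store_correspondence([0.02, 0.01], "water_army")
print(lookup_label([0.04, 0.05]))
```

A real system would replace the lookup with a trained classifier; the sketch only shows how stored feature-label pairs can be queried for a new user.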
Based on step S101, the user's home page can be obtained, and from it the microblog user attributes: user level, follower ratio, user verification, and whether a profile is present. These attributes are obtained from the user's home page as follows:
1. User level: the highest user level is 48, so the user's actual level is divided by 48 to normalize it.
2. Follower ratio: the home page shows the user's "followers" and "following" entries, whose values are the follower count and the following count. The follower ratio is defined as the follower count divided by the sum of the follower count and the following count.
3. User verification: microblog verification types are generally "ordinary individual user", "individual V-verified user", "microblog expert user", and "enterprise user". A 4-dimensional vector is used here to represent the verification feature.
4. Profile present: the home page shows whether the user has a profile. "1" and "0" denote having a profile and having no profile, respectively.
For example, suppose a microblog user has user level 24, 200 followers, and 300 followed accounts, is verified as an "ordinary individual user", and has no profile. The features are then processed as follows:
(1) User level: 24/48 = 0.5
(2) Follower ratio: 200/(200+300) = 0.4
(3) User verification: (1, 0, 0, 0)
(4) Profile present: 0
The user attribute feature vector for this user is therefore (0.5, 0.4, 1, 0, 0, 0, 0).
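The worked example above can be reproduced with a short sketch. The function name, the English category keys, and the handling of a user with zero followers and zero followed accounts are assumptions made for illustration; the level cap of 48 and the order of the verification categories follow the text.

```python
# Verification categories in the order listed in the text
# (the English keys are illustrative translations).
VERIFICATION_TYPES = ["ordinary_individual", "individual_v", "expert", "enterprise"]
MAX_LEVEL = 48  # highest user level stated in the text


def user_attribute_vector(level, followers, following, verification, has_profile):
    """Build the 7-dimensional user attribute vector."""
    level_norm = level / MAX_LEVEL
    total = followers + following
    # Assumption: the ratio is 0 when both counts are 0 (not covered by the text).
    follower_ratio = followers / total if total else 0.0
    verification_one_hot = [1 if verification == t else 0 for t in VERIFICATION_TYPES]
    return [level_norm, follower_ratio, *verification_one_hot, 1 if has_profile else 0]


print(user_attribute_vector(24, 200, 300, "ordinary_individual", False))
# → [0.5, 0.4, 1, 0, 0, 0, 0]
```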
Further, in addition to the user attributes, the user behavior must also be obtained. The user behavior consists of user interaction, special text characters, text content length, and microblog posting time, which are processed as follows:
1. User interaction: the repost, comment, and like information under the microblog post is obtained. "1" and "0" denote that the corresponding behavior is present or absent, respectively.
2. Special text characters: the specified special characters contained in the microblog content are detected. The present invention uses 6 kinds of special characters: "I", "@", "#", "//@", "web link", and "Miaopai video". "1" and "0" denote that the corresponding character is present or absent, respectively.
3. Text content length: statistics are computed over the microblog text, giving 4 features: total word count, non-repeated word count, non-stop-word count, and non-repeated non-stop-word count. The total word count is the number of words in the posted text. On Sina Weibo the maximum length of a text post is 140 characters, so the total word count is divided by 140 to normalize it. The non-repeated word count is the proportion of non-repeated words in the text relative to the total word count. The non-stop-word count is the proportion of words that are not stop words relative to the total word count. The non-repeated non-stop-word count is the proportion of words that are neither repeated nor stop words relative to the total word count.
4. Microblog posting time: two features are extracted from the posting time, the time of day and the day of the week. The time of day is the point within the day at which the post was published. Both features are first discretized and then one-hot encoded, and the time of day is discretized into 3-hour intervals.
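Items 1 to 3 above can be sketched as three small helper functions. The function names and the toy inputs are hypothetical. The marker list is supplied by the caller because the text does not say how links or videos are detected, and "non-repeated words" is interpreted here as the number of distinct words, which is an assumption.

```python
MAX_POST_WORDS = 140  # normalization constant stated in the text


def interaction_flags(reposts, comments, likes):
    """1 if the post has at least one repost / comment / like, else 0."""
    return [int(reposts > 0), int(comments > 0), int(likes > 0)]


def special_character_flags(text, markers):
    """1 for each caller-supplied marker that occurs in the text, else 0."""
    return [int(marker in text) for marker in markers]


def text_length_features(tokens, stop_words):
    """Return the four length features for a tokenized post.

    Assumption: "non-repeated words" means distinct words.
    """
    total = len(tokens)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    non_stop = [t for t in tokens if t not in stop_words]
    return [
        total / MAX_POST_WORDS,      # normalized total word count
        len(set(tokens)) / total,    # distinct words / total
        len(non_stop) / total,       # non-stop words / total
        len(set(non_stop)) / total,  # distinct non-stop words / total
    ]


# Toy inputs (not taken from the patent).
print(interaction_flags(0, 5, 8))
print(special_character_flags("hi @bob #tag", ["@", "#", "//@"]))
print(text_length_features(["a", "b", "a", "the"], {"the"}))
```

Tokenization itself is left outside the sketch; any word segmenter that returns a list of tokens could feed `text_length_features`.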
For example, suppose the user's most recent microblog post is "I am very happy today. Friend@". Segmenting it with the jieba word segmenter yields 7 words, none of them repeated, and two of them are stop words ("very" and "@"). The post has 0 reposts, 5 comments, and 8 likes, and it was published at 16:30 on April 17, 2017. The features are then processed as follows:
(1) User interaction: (0, 1, 1)
(2) Special characters: (1, 1, 0, 0, 0, 0)
(3) Text length: (7/140, 0/7, 5/7, 5/7)
(4) Posting time:
The time of day is discretized into 3-hour intervals: "0" denotes 0:00 to 3:00, "1" denotes 3:00 to 6:00, and so on, so 16:30 is represented by "5". For the day of the week, April 17, 2017 is a Monday, which is represented by "0". The posting time can therefore be expressed as (5, 0), and one-hot encoding gives the feature vector (0,0,0,0,0,1,0,0,1,0,0,0,0,0,0), in which the first 8 entries are the one-hot encoding of the time of day and the last 7 entries are the one-hot encoding of the day of the week.
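The posting-time encoding in item (4) can be checked with a minimal sketch. The function and constant names are illustrative; the 3-hour bin width, the 8 + 7 layout, and Monday being index 0 follow the text.

```python
from datetime import datetime

HOURS_PER_BIN = 3                    # bin width stated in the text
NUM_TIME_BINS = 24 // HOURS_PER_BIN  # 8 time-of-day bins
NUM_WEEKDAYS = 7


def one_hot(index, size):
    """Return a list of `size` zeros with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def posting_time_features(posted_at):
    """One-hot time-of-day bin (8 entries) followed by one-hot weekday (7 entries)."""
    time_bin = posted_at.hour // HOURS_PER_BIN
    weekday = posted_at.weekday()  # Monday is 0 in Python's datetime
    return one_hot(time_bin, NUM_TIME_BINS) + one_hot(weekday, NUM_WEEKDAYS)


print(posting_time_features(datetime(2017, 4, 17, 16, 30)))
# → [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```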
The behavior feature vector for this post is therefore (0,1,1,1,1,0,0,0,0,0.05,0,0.714,0.714,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0). This is a 28-dimensional feature: 3 interaction flags, 6 special-character flags, 4 text-length values, and 15 posting-time entries.
For each user, the behavior features of the first 10 microblog posts are obtained in this way, so each user has 28*10 = 280 dimensions of behavior features.
With the above method, a feature vector of 280+7 = 287 dimensions is obtained for each microblog user and is input as features to a machine learning classifier for identification.
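Assembling the per-user vector can be sketched as below. The function name is hypothetical, and two points are assumptions because the text does not address them: users with fewer than 10 posts are zero-padded, and the behavior features are placed before the attribute features.

```python
POST_DIM = 28        # per-post behavior features
POSTS_PER_USER = 10  # number of posts used per user
ATTR_DIM = 7         # user attribute features


def user_feature_vector(attribute_vec, post_vecs):
    """Concatenate up to 10 per-post vectors and the attribute vector (287 values)."""
    if len(attribute_vec) != ATTR_DIM:
        raise ValueError("attribute vector must have 7 entries")
    if any(len(vec) != POST_DIM for vec in post_vecs):
        raise ValueError("each post vector must have 28 entries")
    selected = [list(vec) for vec in post_vecs[:POSTS_PER_USER]]
    # Assumption: missing posts are filled with zero vectors.
    while len(selected) < POSTS_PER_USER:
        selected.append([0] * POST_DIM)
    flat = [value for vec in selected for value in vec]
    # Assumption: behavior features first, attribute features last.
    return flat + list(attribute_vec)


vector = user_feature_vector([0.5, 0.4, 1, 0, 0, 0, 0], [[0] * POST_DIM] * 3)
print(len(vector))  # → 287
```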
The above technical solution has at least the following technical effects:
(1) Verification on collected real microblog data shows that the proposed feature selection and processing method achieves high water-army identification accuracy with machine learning classifiers.
(2) The proposed feature selection and processing method obtains information mainly from the user's home page, and the resulting features depend very little on other microblog users, so real-time microblog water-army identification is possible.
Although preferred embodiments of the present application have been described, those of ordinary skill in the art can make further changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.
Claims (5)
1. An information feature extraction method, characterized in that the method comprises:
obtaining, according to a user identifier, the user attributes corresponding to the user identifier;
obtaining, according to the user identifier, the microblog content corresponding to the user identifier, and determining the corresponding user behavior from the microblog content;
determining, from the user attributes and the user behavior, the user attribute feature corresponding to the user identifier;
establishing and storing the correspondence between the user attribute feature and a class label.
2. The method according to claim 1, characterized in that obtaining, according to the user identifier, the user attributes corresponding to the user identifier specifically comprises:
obtaining, according to the user identifier, at least the user level, user verification, follower ratio, and profile information corresponding to the user identifier;
taking the user level, the user verification, the follower ratio, and the profile information as the user attributes.
3. The method according to claim 1, characterized in that obtaining, according to the user identifier, the microblog content corresponding to the user identifier and determining the corresponding user behavior from the microblog content comprises:
obtaining the specified text characters, text content length, and microblog posting time corresponding to the microblog content;
taking the specified text characters, the text content length, and the microblog posting time as the user behavior.
4. The method according to claim 1, characterized in that determining, from the user attributes and the user behavior, the user attribute feature corresponding to the user identifier comprises:
combining the parameters of the obtained user attributes with the parameters of the user behavior to obtain a combined result;
taking the combined result as the user attribute feature.
5. The method according to claim 1, characterized in that, after establishing and storing the correspondence between the user attribute feature and the class label, the method further comprises:
obtaining a specified user identifier, and obtaining the corresponding specified user attribute feature according to the specified user identifier;
determining, according to the correspondence between the user attribute feature and the class label, the class label corresponding to the specified user attribute feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710531273.2A CN107330081A (en) | 2017-07-03 | 2017-07-03 | A kind of information characteristics extracting method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710531273.2A CN107330081A (en) | 2017-07-03 | 2017-07-03 | A kind of information characteristics extracting method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107330081A true CN107330081A (en) | 2017-11-07 |
Family
ID=60198640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710531273.2A Pending CN107330081A (en) | 2017-07-03 | 2017-07-03 | A kind of information characteristics extracting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330081A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102571484A (en) * | 2011-12-14 | 2012-07-11 | 上海交通大学 | Method for detecting and finding online water army |
CN103077240A (en) * | 2013-01-10 | 2013-05-01 | 北京工商大学 | Microblog water army identifying method based on probabilistic graphical model |
CN105893484A (en) * | 2016-03-29 | 2016-08-24 | 西安交通大学 | Microblog Spammer recognition method based on text characteristics and behavior characteristics |
Non-Patent Citations (1)
Title |
---|
袁旭萍 等: "基于综合指数和熵值法的微博水军自动识别", 《情报杂志》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108109071A (en) * | 2017-12-29 | 2018-06-01 | 长威信息科技发展股份有限公司 | The monitoring method and electronic equipment dynamically associated based on personnel's social relationships |
CN111105117A (en) * | 2018-10-29 | 2020-05-05 | 微梦创科网络科技(中国)有限公司 | Method and device for determining user information |
CN109947526A (en) * | 2019-03-29 | 2019-06-28 | 北京百度网讯科技有限公司 | Method and apparatus for output information |
CN109947526B (en) * | 2019-03-29 | 2023-04-11 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106980692B (en) | Influence calculation method based on microblog specific events | |
CN104008106B (en) | A kind of method and device obtaining much-talked-about topic | |
CN105005594B (en) | Abnormal microblog users recognition methods | |
CN105488092B (en) | A kind of time-sensitive and adaptive sub-topic online test method and system | |
CN102722709B (en) | Method and device for identifying garbage pictures | |
CN101071418B (en) | Chat method and system | |
CN103514238B (en) | Sensitive word identifying processing method based on classification searching | |
CN104866478B (en) | Malicious text detection and identification method and device | |
CN104133817A (en) | Online community interaction method and device and online community platform | |
CN103970891B (en) | A kind of user interest information querying method based on situation | |
CN103049440A (en) | Recommendation processing method and processing system for related articles | |
CN101510856A (en) | Method and apparatus for extracting member relation loop in SNS network | |
CN103324666A (en) | Topic tracing method and device based on micro-blog data | |
CN106844786A (en) | A kind of public sentiment region focus based on text similarity finds method | |
CN103425649A (en) | Method and device for adding friend information | |
CN107330081A (en) | A kind of information characteristics extracting method | |
CN101963972A (en) | Method and system for extracting emotional keywords | |
CN110956210A (en) | Semi-supervised network water force identification method and system based on AP clustering | |
CN106789572A (en) | A kind of instant communicating system and instant communication method for realizing self adaptation message screening | |
CN106886296A (en) | The treating method and apparatus of the dictionary of input method | |
CN103186555A (en) | Evaluation information generation method and system | |
CN111242218A (en) | Cross-social network user identity recognition method fusing user multi-attribute information | |
CN104915388B (en) | It is a kind of that method is recommended based on spectral clustering and the book labels of mass-rent technology | |
CN103929499B (en) | A kind of Internet of Things isomery index identification method and system | |
CN108415971B (en) | Method and device for recommending supply and demand information by using knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171107 |