CN103593359A - Text negative tendency judgment method based on industries - Google Patents

Publication number
CN103593359A
Authority
CN
China
Prior art keywords: text, negative, word, rule, weights
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210290556.XA
Other languages
Chinese (zh)
Inventor
陈国华
陈宗华
陈永江
仲兆满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Original Assignee
JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-08-16
Filing date
2012-08-16
Publication date
2014-02-19
Application filed by JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention relates to an industry-based method for judging the negative tendency of text. The method first collects negative texts from an industry to form a text corpus L, then extracts a representative negative rule set S1 and a negative word set S2 from L. A text T to be recognized is matched against every rule in S1, and the weights of the matched rules are accumulated to judge whether the text is negative. A text that is not judged negative is segmented into words, giving a word set S3 and a word count N. Finally, every word in S3 is matched against S2, and the proportion of negative words in the text and their accumulated weight are computed to judge whether the text is negative. The method is stated to reach an accuracy above 90% and to be widely applicable across industries.

Description

An industry-based method for judging the negative tendency of text
Technical field
The invention belongs to the field of internet information processing, and specifically relates to an industry-based method for judging the negative tendency of text.
Background art
With the rapid development of the information society, the internet has become an important place for people to express views and post comments. In events that touch society's sensitive nerves, how the situation develops often depends on the attitudes that the media and netizens take toward the event, and this forms online public opinion. For the parties to an event, quickly filtering negative public-opinion information out of massive amounts of information by hand is very difficult. No practical industry-based method for judging the negative tendency of text has yet been disclosed in the prior art.
Summary of the invention
The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing a new industry-based method for judging the negative tendency of text that is reasonably designed, easy to operate and fast.
This technical problem is solved by the following technical scheme. The invention is an industry-based method for judging the negative tendency of text, characterized by the following steps:
(1) Collect news, posts and comments that describe negative information about the industry from the internet as corpus material, and build an industry corpus L;
(2) Extract a representative negative rule set S1 and a negative word set S2 from the corpus L. The specific steps are as follows:
(2-1) According to the characteristics of the industry, extract negative rules from each item in the corpus and assign a weight to each rule;
(2-2) According to the characteristics of the industry, extract negative words from each item in the corpus and assign a weight to each word;
(3) Match the text T to be recognized against each rule in the negative rule set S1, accumulate the weights of the matched rules, and judge whether the text is negative. The specific steps are as follows:
(3-1) Set the industry text rule threshold V1;
(3-2) Match each rule in the rule set S1 against the text to be recognized, add the weight of each matched rule to the accumulated value Vt1, and compare whether Vt1 is greater than or equal to V1;
(3-3) If the comparison in step (3-2) shows that Vt1 is greater than or equal to V1, mark the text as a negative text and exit;
(3-4) If the comparison in step (3-2) shows that Vt1 is less than V1, return to step (3-2) and continue with the next rule;
(3-5) If all rules have been traversed and Vt1 is still less than V1, mark the text as an unidentified text;
(4) Segment the unidentified text with a word segmentation tool and remove stop words to form a word set S3 containing N words;
(5) Match each word in S3 against S2, compute the proportion of negative words in the text and their accumulated weight, and judge whether the text is negative. The specific steps are as follows:
(5-1) Set the negative word weight threshold V2 and the threshold P1 for the proportion of negative words among the counted words, and set the negative word count Nt to 0;
(5-2) Match each word in S3 against the negative word set S2; on a match, add the weight of that word to the accumulated value Vt2 and increase the count Nt by 1;
(5-3) After all words in S3 have been traversed, if Vt2 is greater than or equal to V2 and the ratio of Nt to N is greater than or equal to P1, mark the text as a negative text; otherwise mark it as an unrecognizable text.
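The two decision rules in steps (3) and (5) can be written compactly. The notation here is introduced only for illustration and is not in the original text: w_r denotes the weight of rule r, w_s the weight of negative word s, and [c] equals 1 when condition c holds and 0 otherwise.

```latex
% Stage 1 (step 3): accumulated weight of the rules in S1 that match T
\[
  V_{t1} = \sum_{r \in S_1} w_r \,[\, r \text{ matches } T \,],
  \qquad T \text{ is negative if } V_{t1} \ge V_1 .
\]
% Stage 2 (step 5): accumulated weight and count of the words of S3 found in S2
\[
  V_{t2} = \sum_{s \in S_3} w_s \,[\, s \in S_2 \,],
  \qquad
  N_t = \sum_{s \in S_3} [\, s \in S_2 \,],
\]
\[
  T \text{ is negative if } V_{t2} \ge V_2 \ \text{and}\ \frac{N_t}{N} \ge P_1 .
\]
```

Step (3-3) tests the running sum after each matched rule and exits early. If all rule weights are non-negative (an assumption; the text does not say so), this gives the same result as testing the full sum once at the end.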
The method can quickly judge whether information from all kinds of media (news, forums, post bars (tieba), blogs, etc.) is negative. On the one hand, it can be applied in public opinion analysis systems, so that government bodies such as government departments, public security organs and procuratorates can quickly obtain negative public opinion about themselves from the network. On the other hand, it can be applied in product word-of-mouth analysis systems, so that enterprises can quickly obtain negative word-of-mouth about their products from the network and monitor their brand image.
Compared with the prior art, the method has the following technical effects:
1. The accuracy of industry-based negative text judgment is high and can exceed 90%.
2. The method can be widely applied across industries and is highly general.
3. The method is fast to operate.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the invention;
Fig. 2 is a flow chart of step 101 in Fig. 1, collecting negative texts in the industry as the corpus L;
Fig. 3 is a flow chart of step 102 in Fig. 1, extracting the representative negative rule set S1 and negative word set S2 from the corpus L;
Fig. 4 is a flow chart of step 103 in Fig. 1, matching the text T to be recognized against each rule in the negative rule set S1 and accumulating the rule weights to judge whether the text is negative;
Fig. 5 is a flow chart of step 104 in Fig. 1, segmenting the text that was not judged negative to obtain the word set S3 and the word count N;
Fig. 6 is a flow chart of step 105 in Fig. 1, matching each word in S3 against S2 and computing the proportion of negative words in the text and their accumulated weight to judge whether the text is negative.
Detailed description
The specific technical scheme of the invention is further described below with reference to the drawings, so that those skilled in the art can better understand the invention. This description does not limit the scope of the claims.
Embodiment 1. An industry-based method for judging the negative tendency of text, with the following steps:
(1) Collect news, posts and comments that describe negative information about the industry from the internet as corpus material, and build an industry corpus L;
(2) Extract a representative negative rule set S1 and a negative word set S2 from the corpus L. The specific steps are as follows:
(2-1) According to the characteristics of the industry, extract negative rules from each item in the corpus and assign a weight to each rule;
(2-2) According to the characteristics of the industry, extract negative words from each item in the corpus and assign a weight to each word;
(3) Match the text T to be recognized against each rule in the negative rule set S1, accumulate the weights of the matched rules, and judge whether the text is negative. The specific steps are as follows:
(3-1) Set the industry text rule threshold V1;
(3-2) Match each rule in the rule set S1 against the text to be recognized, add the weight of each matched rule to the accumulated value Vt1, and compare whether Vt1 is greater than or equal to V1;
(3-3) If the comparison in step (3-2) shows that Vt1 is greater than or equal to V1, mark the text as a negative text and exit;
(3-4) If the comparison in step (3-2) shows that Vt1 is less than V1, return to step (3-2) and continue with the next rule;
(3-5) If all rules have been traversed and Vt1 is still less than V1, mark the text as an unidentified text;
(4) Segment the unidentified text with a word segmentation tool and remove stop words to form a word set S3 containing N words;
(5) Match each word in S3 against S2, compute the proportion of negative words in the text and their accumulated weight, and judge whether the text is negative. The specific steps are as follows:
(5-1) Set the negative word weight threshold V2 and the threshold P1 for the proportion of negative words among the counted words, and set the negative word count Nt to 0;
(5-2) Match each word in S3 against the negative word set S2; on a match, add the weight of that word to the accumulated value Vt2 and increase the count Nt by 1;
(5-3) After all words in S3 have been traversed, if Vt2 is greater than or equal to V2 and the ratio of Nt to N is greater than or equal to P1, mark the text as a negative text; otherwise mark it as an unrecognizable text.
Embodiment 2. With reference to Figs. 1-6, an operational experiment carried out with the industry-based method of the invention, with the following steps:
Step 101: collect negative texts in the industry as the corpus L. With reference to Fig. 2, this comprises the following steps:
Step 201: collect a large amount of corpus material from internet media including news, forums, post bars (tieba), blogs and microblogs;
Step 202: take one corpus item;
Step 203: judge whether this item is industry material; if so, go to step 204, otherwise take the next item;
Step 204: judge whether this item is negative material; if so, go to step 205, otherwise take the next item;
Step 205: add this item to the corpus and take the next item.
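The filtering loop of steps 201-205 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the text does not say how an item is judged to be industry-related or negative, so `is_industry` and `is_negative` are hypothetical caller-supplied predicates (they could equally stand for manual labeling).

```python
from typing import Callable, Iterable, List


def build_corpus(
    candidates: Iterable[str],
    is_industry: Callable[[str], bool],   # hypothetical check for step 203
    is_negative: Callable[[str], bool],   # hypothetical check for step 204
) -> List[str]:
    """Keep only the items that are both industry-related and negative."""
    corpus: List[str] = []
    for item in candidates:               # step 202: take one item
        if not is_industry(item):         # step 203: not industry material
            continue                      # take the next item
        if not is_negative(item):         # step 204: not negative material
            continue
        corpus.append(item)               # step 205: add to corpus L
    return corpus


# Toy usage with keyword predicates chosen purely for illustration.
docs = ["bank accused of fraud", "bank opens new branch", "storm damages road"]
L = build_corpus(
    docs,
    is_industry=lambda d: "bank" in d,
    is_negative=lambda d: "fraud" in d or "damages" in d,
)
```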
Step 102: extract a representative negative rule set S1 and negative word set S2 from the corpus L. With reference to Fig. 3, this comprises the following steps:
Step 301: judge whether any item remains in the corpus L; if so, go to step 302, otherwise finish;
Step 302: take one item from L and search it for negative rules;
Step 303: if a negative rule is found, go to step 304; if none is found, go to step 305;
Step 304: add the negative rule to the rule set S1 and go to step 305;
Step 305: check the item for negative words; if there are any, go to step 306, otherwise go to step 301;
Step 306: add the negative words to the negative word set S2 and go to step 301.
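A sketch of the extraction loop in steps 301-306, under stated assumptions. The text does not describe how rules or words are found in an item or how weights are chosen, so `find_rules` and `find_words` are hypothetical callables that return a mapping from rule (or word) to weight. Keeping the first weight seen for a repeated entry is also an assumption.

```python
from typing import Callable, Dict, Iterable, Tuple

Weighted = Dict[str, float]  # rule or word -> weight


def extract_sets(
    corpus: Iterable[str],
    find_rules: Callable[[str], Weighted],  # hypothetical, steps 302-303
    find_words: Callable[[str], Weighted],  # hypothetical, step 305
) -> Tuple[Weighted, Weighted]:
    """Build the negative rule set S1 and negative word set S2 from corpus L."""
    s1: Weighted = {}
    s2: Weighted = {}
    for item in corpus:                       # step 301: while items remain
        for rule, weight in find_rules(item).items():
            s1.setdefault(rule, weight)       # step 304: add rule to S1
        for word, weight in find_words(item).items():
            s2.setdefault(word, weight)       # step 306: add word to S2
    return s1, s2


# Toy finders: a fixed rule and a fixed word list, for illustration only.
def toy_rules(item: str) -> Weighted:
    return {"accused of": 0.6} if "accused of" in item else {}


def toy_words(item: str) -> Weighted:
    return {w: 0.5 for w in item.split() if w in {"fraud", "scandal"}}


S1, S2 = extract_sets(["bank accused of fraud", "new scandal"], toy_rules, toy_words)
```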
Step 103: match the text T to be judged against the negative tendency rule set S1; a text that matches is judged to be a negative text. With reference to Fig. 4, this comprises the following steps:
Step 401: set the industry text rule threshold V1;
Step 402: judge whether any rule remains in the rule set S1; if so, go to step 403; if not, go to step 408;
Step 403: take one rule from S1 and search for it in the text T;
Step 404: if the rule is found, go to step 405; if not, go to step 402;
Step 405: add the weight of this rule to Vt1;
Step 406: compare Vt1 with V1; if Vt1 is greater than or equal to V1, go to step 407, otherwise go to step 402;
Step 407: mark the text as a negative tendency text;
Step 408: mark the text as an unidentified text.
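The rule stage of steps 401-408 can be sketched as below. The text does not define what form a rule takes, so this example assumes each rule is a regular expression; the function name, the return labels and the sample rules are hypothetical.

```python
import re
from typing import Dict


def match_rules(text: str, s1: Dict[str, float], v1: float) -> str:
    """Accumulate weights of matching rules; exit early once V1 is reached."""
    vt1 = 0.0
    for pattern, weight in s1.items():    # steps 402-403: take the next rule
        if re.search(pattern, text):      # step 404: rule found in T
            vt1 += weight                 # step 405: accumulate into Vt1
            if vt1 >= v1:                 # step 406: compare with V1
                return "negative"         # step 407
    return "unidentified"                 # step 408: all rules traversed


# Hypothetical rule set with weights, and a threshold of 1.0.
rules = {r"recall.*defect": 0.6, r"lawsuit": 0.5}
first = match_rules("product recall after defect and lawsuit", rules, 1.0)
second = match_rules("lawsuit filed", rules, 1.0)
```

A text labeled "unidentified" here is not declared non-negative; it is passed on to the word-based stage.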
Step 104: segment the text that was not judged negative into words to obtain the word set S3 and the word count N. With reference to Fig. 5, this comprises the following steps:
Step 501: segment the unidentified text with a word segmentation tool;
Step 502: remove useless words;
Step 503: store the segmentation result in the set S3 and record the number of words as N.
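A minimal sketch of steps 501-503. The text does not name a segmentation tool, so whitespace splitting stands in for one here; for Chinese text a real word segmenter would be passed as `tokenize`. Keeping repeated words in S3 and counting N after stop-word removal are assumptions, since the text leaves both open.

```python
from typing import Callable, List, Set, Tuple


def segment(
    text: str,
    stop_words: Set[str],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in segmenter
) -> Tuple[List[str], int]:
    """Return the word list S3 and its size N after removing stop words."""
    words = [w for w in tokenize(text) if w not in stop_words]  # steps 501-502
    return words, len(words)                                    # step 503


S3, N = segment("the service is slow and rude", {"the", "is", "and"})
```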
Step 105: match each word in S3 against S2, compute the proportion and accumulated weight of negative words in the text, and judge whether the text is negative. With reference to Fig. 6, this comprises the following steps:
Step 601: set the negative word weight threshold V2 and the threshold P1 for the proportion of negative words among the counted words, and set the negative word count Nt to 0;
Step 602: judge whether any unprocessed word remains in S3; if so, go to step 603; if not, go to step 606;
Step 603: take one word from S3 and look it up in S2;
Step 604: if the word is found in S2, go to step 605, otherwise go to step 602;
Step 605: add the weight of this word to Vt2 and increase the count Nt by 1; go to step 602;
Step 606: judge whether Vt2 is greater than or equal to V2; if so, go to step 607, otherwise go to step 609;
Step 607: judge whether Nt/N is greater than or equal to P1; if so, go to step 608, otherwise go to step 609;
Step 608: mark the text as a negative tendency text;
Step 609: mark the text as an unrecognizable text.
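The word stage of steps 601-609 can be sketched as follows. The function name, return labels and sample weights are hypothetical, and the guard against an empty word list (N = 0) is an addition not mentioned in the text.

```python
from typing import Dict, List


def judge_by_words(s3: List[str], s2: Dict[str, float], v2: float, p1: float) -> str:
    """Apply both tests: accumulated weight Vt2 >= V2 and ratio Nt/N >= P1."""
    n = len(s3)
    vt2, nt = 0.0, 0                      # step 601: initialize accumulators
    for word in s3:                       # steps 602-603: take the next word
        if word in s2:                    # step 604: word is in S2
            vt2 += s2[word]               # step 605: accumulate weight
            nt += 1                       #           and count the match
    # Steps 606-607: both conditions must hold. n > 0 avoids dividing by zero.
    if n > 0 and vt2 >= v2 and nt / n >= p1:
        return "negative"                 # step 608
    return "unrecognizable"               # step 609


negative_words = {"slow": 0.4, "rude": 0.7}  # hypothetical S2 with weights
a = judge_by_words(["service", "slow", "rude"], negative_words, 1.0, 0.5)
b = judge_by_words(["service", "slow", "fine", "good"], negative_words, 1.0, 0.5)
c = judge_by_words(["a", "b", "c", "d", "slow", "rude"], negative_words, 1.0, 0.5)
```

In the third call the weight test passes but only 2 of 6 words are negative, so the ratio test fails. Requiring both conditions keeps a long text with a few heavily weighted words from being marked negative.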

Claims (1)

1. An industry-based method for judging the negative tendency of text, characterized in that its steps are as follows:
(1) Collect news, posts and comments that describe negative information about the industry from the internet as corpus material, and build an industry corpus L;
(2) Extract a representative negative rule set S1 and a negative word set S2 from the corpus L. The specific steps are as follows:
(2-1) According to the characteristics of the industry, extract negative rules from each item in the corpus and assign a weight to each rule;
(2-2) According to the characteristics of the industry, extract negative words from each item in the corpus and assign a weight to each word;
(3) Match the text T to be recognized against each rule in the negative rule set S1, accumulate the weights of the matched rules, and judge whether the text is negative. The specific steps are as follows:
(3-1) Set the industry text rule threshold V1;
(3-2) Match each rule in the rule set S1 against the text to be recognized, add the weight of each matched rule to the accumulated value Vt1, and compare whether Vt1 is greater than or equal to V1;
(3-3) If the comparison in step (3-2) shows that Vt1 is greater than or equal to V1, mark the text as a negative text and exit;
(3-4) If the comparison in step (3-2) shows that Vt1 is less than V1, return to step (3-2) and continue with the next rule;
(3-5) If all rules have been traversed and Vt1 is still less than V1, mark the text as an unidentified text;
(4) Segment the unidentified text with a word segmentation tool and remove stop words to form a word set S3 containing N words;
(5) Match each word in S3 against S2, compute the proportion of negative words in the text and their accumulated weight, and judge whether the text is negative. The specific steps are as follows:
(5-1) Set the negative word weight threshold V2 and the threshold P1 for the proportion of negative words among the counted words, and set the negative word count Nt to 0;
(5-2) Match each word in S3 against the negative word set S2; on a match, add the weight of that word to the accumulated value Vt2 and increase the count Nt by 1;
(5-3) After all words in S3 have been traversed, if Vt2 is greater than or equal to V2 and the ratio of Nt to N is greater than or equal to P1, mark the text as a negative text; otherwise mark it as an unrecognizable text.
CN201210290556.XA 2012-08-16 2012-08-16 Text negative tendency judgment method based on industries Pending CN103593359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210290556.XA CN103593359A (en) 2012-08-16 2012-08-16 Text negative tendency judgment method based on industries

Publications (1)

Publication Number Publication Date
CN103593359A true CN103593359A (en) 2014-02-19

Family

ID=50083508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210290556.XA Pending CN103593359A (en) 2012-08-16 2012-08-16 Text negative tendency judgment method based on industries

Country Status (1)

Country Link
CN (1) CN103593359A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040049498A1 (en) * 2002-07-03 2004-03-11 Dehlinger Peter J. Text-classification code, system and method
CN101876974A (en) * 2009-04-30 2010-11-03 日电(中国)有限公司 System and method for classifying text feeling polarities
CN102541840A (en) * 2011-12-23 2012-07-04 中科鼎富(北京)科技发展有限公司 System and method for analyzing tendency of short text

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
俞飞: "基于网络信息文本倾向性分析的领域应用研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
潘文彬: "基于情感词词典的中文句子情感倾向分析", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
石振梁: "中文新闻情感分类系统的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
罗亚平: "面向网络舆情的中文评论文本情感倾向分析研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978332A (en) * 2014-04-04 2015-10-14 腾讯科技(深圳)有限公司 UGC label data generating method, UGC label data generating device, relevant method and relevant device
CN104978332B (en) * 2014-04-04 2019-06-14 腾讯科技(深圳)有限公司 User-generated content label data generation method, device and correlation technique and device
CN105989081A (en) * 2015-02-11 2016-10-05 联想(北京)有限公司 Corpus processing method and apparatus
CN105989081B (en) * 2015-02-11 2019-09-24 联想(北京)有限公司 A kind of corpus treating method and apparatus
CN109614551A (en) * 2018-12-12 2019-04-12 上海优扬新媒信息技术有限公司 A kind of negative public sentiment judgment method and device

Similar Documents

Publication Publication Date Title
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
Mishra et al. Sentiment analysis of Twitter data: Case study on digital India
Dadvar et al. Improving cyberbullying detection with user context
CN103020159A (en) Method and device for news presentation facing events
CN103606097A (en) Method and system based on credibility evaluation for product information recommendation
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN103164427A (en) Method and device of news aggregation
CN102760142A (en) Method and device for extracting subject label in search result aiming at searching query
CN103226576A (en) Comment spam filtering method based on semantic similarity
CN102542061B (en) Intelligent product classification method
WO2010036013A3 (en) Apparatus and method for extracting and analyzing opinions in web documents
CN104504081A (en) Intelligent analysis system for all-media detection and monitoring big data behaviors
CN104424308A (en) Web page classification standard acquisition method and device and web page classification method and device
WO2013185601A1 (en) Method and device for obtaining product information and computer storage medium
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN111324801B (en) Hot event discovery method in judicial field based on hot words
CN103455613A (en) Interest aware service recommendation method based on MapReduce model
CN104348871B (en) A kind of similar account extended method and device
CN105843796A (en) Microblog emotional tendency analysis method and device
CN104951430A (en) Product feature tag extraction method and device
CN104504024A (en) Method and system for mining keywords based on microblog content
CN103593359A (en) Text negative tendency judgment method based on industries
CN103744918A (en) Vertical domain based micro blog searching ranking method and system
CN102567494A (en) Website classification method and device
CN103365879A (en) Method and device for obtaining page similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140219
