CN103593359A

CN103593359A - Text negative tendency judgment method based on industries

Info

Publication number: CN103593359A
Application number: CN201210290556.XA
Authority: CN
Inventors: 陈国华; 陈宗华; 陈永江; 仲兆满
Original assignee: JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Current assignee: JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Priority date: 2012-08-16
Filing date: 2012-08-16
Publication date: 2014-02-19

Abstract

The invention relates to an industry-based method for judging the negative tendency of text. The method comprises: first collecting negative texts from an industry to serve as a text corpus L; extracting a representative negative rule set S1 and a negative word set S2 from the corpus L; matching a text T to be recognized against every rule in the negative rule set S1 and accumulating the weights of the matched rules to judge whether the text is negative; performing word segmentation on any text not judged to be negative to obtain a word set S3 and a word count N; and finally matching every word in S3 against S2, calculating the proportion of negative words in the text and their accumulated weight, and judging whether the text is negative. The method achieves an accuracy rate of over 90%, can be applied to a wide range of industries, and is highly general.

Description

An industry-based method for judging the negative tendency of text

Technical field

The invention belongs to the field of Internet information processing, and specifically relates to an industry-based method for judging the negative tendency of text.

Background art

With the rapid development of social informatization, the Internet has become an important place for people to express opinions and post comments. In events that touch society's sensitive nerves, how the situation develops often depends on the attitudes of the media and netizens toward the event, and this forms online public opinion. For a party to such an event, quickly filtering negative public-opinion information out of a massive volume of information by manual means is very difficult. Practical industry-based methods for judging the negative tendency of text are also disclosed in the prior art.

Summary of the invention

The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing a new industry-based method for judging the negative tendency of text that is reasonably designed, easy to operate, and fast.

This technical problem is addressed by the following technical solution. The invention is an industry-based method for judging the negative tendency of text, characterized in that its steps are as follows:

(1) collect from the Internet news items, posts, and comments that describe negative information about the industry, and use them as corpus material to build an industry corpus L;

(2) extract a representative negative rule set S1 and a negative word set S2 from the corpus L; the specific operations are as follows:

(2-1) according to the characteristics of the industry, extract negative rules from each item in the corpus and assign a weight to each rule;

(2-2) according to the characteristics of the industry, extract negative words from each item in the corpus and assign a weight to each word;

(3) match the text T to be identified against each rule in the negative rule set S1, accumulate the weights of the matched negative rules, and judge whether the text is negative; the specific operations are as follows:

(3-1) set the industry text rule statistical threshold V1;

(3-2) match each rule in rule set S1 against the text to be identified, accumulate the weights of the matched rules into an accumulated value Vt1, and compare whether Vt1 is greater than or equal to V1;

(3-3) if the comparison in step (3-2) shows that Vt1 is greater than or equal to V1, mark the text as negative text and exit;

(3-4) if the comparison in step (3-2) shows that Vt1 is less than V1, continue with step (3-2);

(3-5) if all rules have been traversed and Vt1 is less than V1, mark the text as unidentified text;

(4) segment the unidentified text into words with a word segmentation tool and remove stop words to form a word set S3 containing N words;

(5) match each word in set S3 against set S2, compute the proportion of negative words in the text and their accumulated weight, and judge whether the text is negative; the specific operations are as follows:

(5-1) set the negative word statistical threshold V2 and the negative word proportion threshold P1, and set the word count Nt to 0;

(5-2) match each word in set S3 against the negative word set S2; if a word matches, add its weight to the accumulated value Vt2 and increase the word count Nt by 1;

(5-3) after all words in S3 have been traversed, if Vt2 is greater than or equal to V2 and the ratio of Nt to N is also greater than or equal to P1, mark the text as negative text; otherwise mark it as unidentifiable text.
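The two decision criteria in steps (3) and (5) can be written compactly as follows. This is only a restatement under stated assumptions: w(·) denotes the weight assigned in step (2), m(r, T) is a hypothetical indicator equal to 1 when rule r matches text T and 0 otherwise, and the weights are assumed to be non-negative, so that the early exit in step (3-3) gives the same result as comparing the full sum with V1.

```latex
% Stage 1 (step 3): accumulated weight of matched rules
V_{t1} = \sum_{r \in S_1} w(r)\, m(r, T),
\qquad T \text{ is negative if } V_{t1} \ge V_1 .

% Stage 2 (steps 4-5), applied only when V_{t1} < V_1
V_{t2} = \sum_{s \in S_3,\ s \in S_2} w(s),
\qquad
N_t = \bigl|\{\, s \in S_3 : s \in S_2 \,\}\bigr|,
\qquad
T \text{ is negative if } V_{t2} \ge V_2 \ \text{and}\ \frac{N_t}{N} \ge P_1 .
```

Both conditions of stage 2 must hold at once: a single heavily weighted negative word in a long text can satisfy the weight threshold V2 and still fail the proportion threshold P1.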

The method of the invention can quickly judge whether information from all kinds of media (news, forums, tieba post bars, blogs, etc.) is negative. On the one hand, the method can be applied in public-opinion analysis systems, so that bodies such as government departments, public security organs, and procuratorates can quickly obtain from the network negative public opinion about their own units. On the other hand, it can be applied in product reputation analysis systems, so that enterprises can quickly obtain negative word-of-mouth about their products from the network and monitor their brand image.

Compared with the prior art, the method of the invention has the following technical effects:

1. The industry-based judgment of negative text achieves high accuracy, which can exceed 90%.

2. The method can be widely applied across industries and is highly general.

3. The method is fast to operate.

Description of the drawings

Fig. 1 is a flow block diagram of the method of the invention;

Fig. 2 is a flowchart of step 101 in Fig. 1, collecting negative texts in the industry as corpus L;

Fig. 3 is a flowchart of step 102 in Fig. 1, extracting the representative negative rule set S1 and the negative word set S2 from corpus L;

Fig. 4 is a flowchart of step 103 in Fig. 1, matching the text T to be identified against each rule in the negative rule set S1 and accumulating the negative rule weights to judge whether the text is negative;

Fig. 5 is a flowchart of step 104 in Fig. 1, performing word segmentation on text not judged to be negative to obtain the word set S3 and the word count N;

Fig. 6 is a flowchart of step 105 in Fig. 1, matching each word in set S3 against set S2 and computing the proportion of negative words in the text and their accumulated weight to judge whether the text is negative.

Embodiments

The specific technical solution of the invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand the invention; this description does not limit the scope of its claims.

Embodiment 1: an industry-based method for judging the negative tendency of text, whose steps are as follows:

(3-1) set the industry text rule statistical threshold V1;

Embodiment 2: with reference to Figs. 1-6, an operational experiment carried out using the industry-based method for judging the negative tendency of text of the invention, whose steps are as follows:

Step 101: collect negative texts in the industry as corpus L. With reference to Fig. 2, this comprises the following steps:

Step 201: collect a large amount of candidate material from Internet media, including news, forums, tieba post bars, blogs, and microblogs;

Step 202: take one item of material;

Step 203: judge whether the item is industry material; if so, go to step 204; otherwise take the next item;

Step 204: judge whether the item is negative material; if so, go to step 205; otherwise take the next item;

Step 205: add the item to the corpus and take the next item.
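The filtering loop of steps 202 to 205 can be sketched as below. The source does not say how an item is judged to be industry-related or negative, so `is_industry` and `is_negative` are hypothetical callables standing in for that judgment (which may well be manual), and the sample texts are invented for illustration.

```python
from typing import Callable, Iterable, List


def build_corpus(
    candidates: Iterable[str],
    is_industry: Callable[[str], bool],
    is_negative: Callable[[str], bool],
) -> List[str]:
    """Keep only items that are both industry-related and negative."""
    corpus: List[str] = []
    for item in candidates:            # step 202: take one item
        if not is_industry(item):      # step 203: not industry material, skip
            continue
        if not is_negative(item):      # step 204: not negative material, skip
            continue
        corpus.append(item)            # step 205: add to the corpus
    return corpus


# Toy predicates and data, purely illustrative.
candidates = [
    "bank fined for hidden fees",
    "bank opens new branch",
    "restaurant closed after food poisoning",
]
corpus_l = build_corpus(
    candidates,
    is_industry=lambda text: "bank" in text,
    is_negative=lambda text: "fined" in text or "poisoning" in text,
)
```

Passing the two judgments in as callables keeps the loop independent of how the judgment is actually made.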

Step 102: extract the representative negative rule set S1 and the negative word set S2 from corpus L. With reference to Fig. 3, this comprises the following steps:

Step 301: judge whether any material remains in corpus L; if so, go to step 302; otherwise finish;

Step 302: take one item of material from L and search it for negative rules;

Step 303: if a negative rule is found, go to step 304; if not, go to step 305;

Step 304: add the negative rule to rule set S1 and go to step 305;

Step 305: check the item for negative words; if there are negative words, go to step 306; if not, go to step 301;

Step 306: add the negative words to the negative word set S2 and go to step 301.
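A minimal sketch of the loop in steps 301 to 306 follows. The source does not specify what form a rule takes or how rules, words, and their weights are found, so `find_rules` and `find_words` are hypothetical callbacks that return (item, weight) pairs, and keeping the first weight seen for a repeated rule or word is an assumption made only for this example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Weighted = List[Tuple[str, float]]  # (rule or word, weight) pairs found in one item


def extract_sets(
    corpus: Iterable[str],
    find_rules: Callable[[str], Weighted],
    find_words: Callable[[str], Weighted],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Build S1 (rule -> weight) and S2 (word -> weight) from the corpus."""
    s1: Dict[str, float] = {}
    s2: Dict[str, float] = {}
    for item in corpus:                          # steps 301-302
        for rule, weight in find_rules(item):    # steps 303-304
            s1.setdefault(rule, weight)          # assumption: first weight wins
        for word, weight in find_words(item):    # steps 305-306
            s2.setdefault(word, weight)
    return s1, s2


# Toy finders, purely illustrative.
def toy_rules(text: str) -> Weighted:
    return [("fined for", 3.0)] if "fined for" in text else []


def toy_words(text: str) -> Weighted:
    lexicon = {"fined": 1.0, "fraud": 2.0}
    return [(w, lexicon[w]) for w in text.split() if w in lexicon]


s1, s2 = extract_sets(
    ["bank fined for fraud", "broker fined for fraud"], toy_rules, toy_words
)
```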

Step 103: match the text T to be judged against the negative rule set S1; a text that matches is a negative text. With reference to Fig. 4, this comprises the following steps:

Step 401: set the industry text rule statistical threshold V1;

Step 402: judge whether any rule remains in rule set S1; if so, go to step 403; if not, go to step 408;

Step 403: take one rule from S1 and search for it in text T;

Step 404: if it is found, go to step 405; if not, go to step 402;

Step 405: add the weight of this rule to Vt1;

Step 406: compare Vt1 with V1; if Vt1 is greater than or equal to V1, go to step 407; otherwise go to step 402;

Step 407: mark the text as negative-tendency text;

Step 408: mark the text as unidentified text;
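The flow of steps 401 to 408 might look like the sketch below. The source does not define the form of a rule; representing each rule as a regular expression is an assumption of this example, as are the sample patterns, weights, and threshold.

```python
import re
from typing import Dict


def match_rules(text: str, s1: Dict[str, float], v1: float) -> str:
    """Accumulate weights of matched rules; stop as soon as V1 is reached."""
    vt1 = 0.0
    for pattern, weight in s1.items():   # steps 402-403: take the next rule
        if re.search(pattern, text):     # step 404: rule found in the text?
            vt1 += weight                # step 405: add its weight to Vt1
            if vt1 >= v1:                # step 406: compare Vt1 with V1
                return "negative"        # step 407
    return "unidentified"                # step 408: all rules tried, Vt1 < V1


# Illustrative rule set: regex pattern -> weight (assumed representation).
s1 = {r"fined\s+for": 3.0, r"recall(ed)?": 2.0, r"lawsuit": 2.5}

label_a = match_rules("bank fined for fraud, lawsuit pending", s1, v1=5.0)
label_b = match_rules("product recalled last week", s1, v1=5.0)
```

In the first call two rules match and their weights (3.0 and 2.5) together reach the threshold; in the second only one rule matches, so the text is passed on to the word-level stage.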

Step 104: perform word segmentation on text not judged to be negative to obtain the word set S3 and the word count N. With reference to Fig. 5, this comprises the following steps:

Step 501: segment the unidentified text into words with a word segmentation tool;

Step 502: remove useless words;

Step 503: store the segmentation result in set S3 and record the number of words as N;
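Steps 501 to 503 can be illustrated as follows. The source names no particular segmentation tool, so a simple regular-expression tokenizer stands in for it here; real Chinese text would need a proper word segmenter. Keeping S3 as a list, so that repeated words count toward N, is also an assumption, since the source does not say whether duplicates are kept.

```python
import re
from typing import List, Set, Tuple


def segment(text: str, stop_words: Set[str]) -> Tuple[List[str], int]:
    """Split text into words, drop stop words, and return (S3, N)."""
    tokens = re.findall(r"\w+", text.lower())            # step 501 (stand-in tokenizer)
    s3 = [t for t in tokens if t not in stop_words]      # step 502: remove useless words
    return s3, len(s3)                                   # step 503: S3 and its size N


s3, n = segment("The bank was accused of fraud", {"the", "was", "of"})
```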

Step 105: match each word in set S3 against set S2, compute the proportion and weight of negative words in the text, and judge whether the text is negative. With reference to Fig. 6, this comprises the following steps:

Step 601: set the negative word statistical threshold V2 and the negative word proportion threshold P1, and set the word count Nt to 0;

Step 602: judge whether any word remains in set S3; if so, go to step 603; if not, go to step 606;

Step 603: take one word from set S3 and look for it in set S2;

Step 604: if the word is found in set S2, go to step 605; otherwise go to step 602;

Step 605: add the weight of this word to Vt2 and increase the word count Nt by 1; go to step 602;

Step 606: judge whether Vt2 is greater than or equal to V2; if so, go to step 607; otherwise go to step 609;

Step 607: judge whether Nt/N is greater than or equal to P1; if so, go to step 608; otherwise go to step 609;

Step 608: mark the text as negative-tendency text;

Step 609: mark the text as unidentifiable text.
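A minimal sketch of steps 601 to 609 is given below. The negative word set S2 is assumed to be a mapping from word to weight, the thresholds and sample words are invented, and the guard for an empty S3 is an addition of this example (the source does not discuss that case).

```python
from typing import Dict, List


def match_words(s3: List[str], s2: Dict[str, float], v2: float, p1: float) -> str:
    """Judge a segmented text by negative-word weight and proportion."""
    n = len(s3)
    if n == 0:
        return "unidentifiable"        # guard not in the source: avoids dividing by zero
    vt2 = 0.0                          # accumulated weight of matched words
    nt = 0                             # step 601: matched-word count starts at 0
    for word in s3:                    # steps 602-603: take the next word
        if word in s2:                 # step 604: is it a negative word?
            vt2 += s2[word]            # step 605: accumulate weight, count the word
            nt += 1
    if vt2 >= v2 and nt / n >= p1:     # steps 606-607: both thresholds must hold
        return "negative"              # step 608
    return "unidentifiable"            # step 609


s2 = {"accused": 1.5, "fraud": 2.0}
label_neg = match_words(["bank", "accused", "fraud"], s2, v2=3.0, p1=0.5)
label_unk = match_words(["bank", "accused", "branch", "opens"], s2, v2=3.0, p1=0.5)
```

In the first call the accumulated weight is 3.5 and two of three words match, so both thresholds are met; in the second the weight is only 1.5, so the text stays unidentifiable.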

Claims

1. An industry-based method for judging the negative tendency of text, characterized in that its steps are as follows:

(3-1) set the industry text rule statistical threshold V1;