CN104331394A - Text classification method based on viewpoint - Google Patents

Text classification method based on viewpoint

Info

Publication number
CN104331394A
Authority
CN
China
Prior art keywords
sentence
word
viewpoint
subjective
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410434035.6A
Other languages
Chinese (zh)
Inventor
程实
何海棠
沈学华
程显毅
施佺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN201410434035.6A priority Critical patent/CN104331394A/en
Publication of CN104331394A publication Critical patent/CN104331394A/en
Pending legal-status Critical Current


Abstract

The invention discloses a text classification method based on viewpoint. The method comprises the following specific steps: (100) dividing topic paragraphs; (200) discriminating sentence subjectivity; (300) identifying viewpoint sentences; (400) calculating the similarity of viewpoint sentences; and (500) clustering viewpoint sentences. In this way, the text classification method based on viewpoint can realize dynamic, semantic, low-dimensional and efficient text classification, so that the processing of network text information better conforms to human cognitive processes and better meets the demands of practical application.

Description

A text classification method based on viewpoint
Technical field
The present invention relates to the technical fields of text mining and affective computing, and in particular to a text classification method based on viewpoint.
Background technology
With the development of Web 2.0 technology, web communities, blogs and forums provide network users with an ever broader platform for exchanging information and expressing opinions. Commercial organizations can learn customers' opinions by surveying online product reviews, carry out marketing research and analysis, track products online, continuously improve product performance and after-sales service, and cultivate potential consumer groups; consumers can decide whether to buy a product by browsing other users' evaluations of it; and government departments can learn from online forums the public's views on policies, regulations and current events, grasp popular attitudes toward social governance in time, and make scientific and reasonable decisions. Therefore, how to process and analyze these subjective review texts quickly and effectively, and to understand other people's ideas, viewpoints and attitudes toward things, is one of the major problems to be solved in the field of network text information processing.
A so-called viewpoint refers to a person's idea of and understanding about something. A viewpoint is not a fact, because a viewpoint has been neither verified nor confirmed; once a viewpoint is later proven and confirmed, it is no longer a viewpoint but becomes a fact. According to Kim and Hovy's definition, a viewpoint consists of four elements: topic, holder, statement and emotion. These four elements are inherently connected: the holder of the viewpoint delivers a statement with emotion about a certain topic.
As an emerging research field, opinion mining has attracted wide attention in the NLP research community. In recent years, NLP-related international conferences have set up special sessions to discuss opinion mining, and the numerous research results can be divided into two broad classes: document-level (coarse-grained) opinion mining and sentence-level (medium-grained) opinion mining.
Coarse-grained opinion mining divides evaluation texts into three major classes: support, opposition and neutral. Although coarse-grained opinion mining can be regarded as text classification, it differs greatly from traditional topic-oriented text classification: in traditional topic-oriented text classification, words related to the topic are extremely important, whereas in coarse-grained opinion mining, emotion words expressing commendatory or derogatory viewpoints are the most useful.
Coarse-grained opinion mining cannot discover the details that a user likes and dislikes. For example, a user may be satisfied with the configuration and design of a digital camera but dissatisfied with the service life of its battery; in many cases such an overall judgment alone is not enough, because when people express viewpoints and attitudes they usually not only give an overall evaluation of a topic but often also evaluate certain parts or characteristics of it.
Medium-grained opinion mining is mainly applied to extracting the viewpoints delivered on the features of an object. The method goes down to the sentence level and can extract the details of a viewpoint; the object here can be a product, a service, a person, an organization, an event, and so on. For example, in the sentence "the battery life of this camera is too short", the product feature evaluated by the user is the "battery life" of the camera, and the conclusion (viewpoint) given by the user is negative.
Whether coarse-grained or medium-grained, opinion mining will classify two different viewpoints such as "the United States attacked Iraq first" and "Iraq attacked the United States first" into the same class, because these methods take words as the basic features and do not use semantic features (viewpoints). Fine-grained opinion mining classifies texts or sentences by viewpoint, and the number of classes is dynamic, because different people hold different views on the same thing, not merely approval, opposition and neutrality. Since fine-grained opinion mining cannot obtain a universal corpus, text classification based on viewpoint is carried out as viewpoint clustering.
Another motivation for proposing viewpoint-based text classification is that, over the past decades, semantic computing and affective computing have made significant progress, while dynamic text classification, semantics-based text classification, text classification integrating multiple techniques, and low-dimensional, efficient text classification have urgent application demands.
Summary of the invention
The technical problem mainly solved by the present invention is to provide a text classification method based on viewpoint. The method can realize dynamic, semantic, low-dimensional and efficient text classification, so that network text information processing better conforms to human cognitive processes and better meets the demands of practical application.
To solve the above technical problem, the technical solution adopted by the present invention is a text classification method based on viewpoint, the specific steps of which comprise:
(100) Division of topic paragraphs: first a text is input, and the semantic similarity between every two paragraphs P_i and P_j (1 ≤ i, j ≤ n) in the text is calculated; then the candidate points P_k1, P_k2, ..., P_kr at which the topic changes are found one by one. If P_kr satisfies the division condition, it is determined to be a topic-paragraph division candidate point and the next candidate point is processed; when all topic-paragraph division candidate points have been processed, the procedure ends. If the condition is not satisfied, it is judged whether a second condition holds; if it does, P_kr is still regarded as a topic-paragraph division candidate point and the next candidate point is processed; if it does not, it is further judged whether the paragraph following the candidate point satisfies the condition, in which case P_kr is not taken as a division point. This continues until all topic-paragraph division candidate points have been processed and the topic-paragraph division points in the text are determined; all natural paragraphs in the text are then merged into several topic paragraphs, so that the text can be expressed as D = S_1 ⊕ S_2 ⊕ ... ⊕ S_n, where S_i denotes a topic paragraph;
Wherein F(P_i) = (W_i1, W_i2, ..., W_ij, ..., W_ik) is the paragraph feature vector, W_ij denotes the weight of the j-th element of the text feature-word list in paragraph i, the weight being computed from the frequency with which the word occurs in that paragraph, and k is the number of elements of the feature vector; the text feature vector is F(D) = (W_1, W_2, ..., W_l), where W_l denotes the weight of the l-th element of the text feature-word list in the text, the weight being computed from the frequency with which the word occurs in the text; k_r is the subscript of the paragraph at the r-th topic-paragraph division candidate point;
(200) Discrimination of sentence subjectivity: the CHI statistical method is used to extract 2-POS subjective patterns from subjective texts and objective texts respectively; first, word segmentation and part-of-speech tagging are performed on the sentences in the training corpus, then a 2-POS statistical language model is constructed, and finally the CHI statistic of each 2-POS pattern in the subjective and objective pattern sets is calculated according to formula (1), and the patterns are ranked by CHI value:
χ²(t_k, c_i) = N × (A·D − B·C)² / [(A + B) × (C + D) × (A + C) × (B + D)]    (1)
Wherein t_k denotes the k-th 2-POS pattern, A denotes the number of sentences belonging to class c_i that contain t_k, B denotes the number of sentences not belonging to class c_i that contain t_k, C denotes the number of sentences belonging to class c_i that do not contain t_k, D denotes the number of sentences that neither belong to class c_i nor contain t_k, and N denotes the total number of sentences in the corpus;
Subjective rules are obtained by applying statistical methods to the movie-review data set provided by Cornell University;
The subjective patterns and the subjective rules are collectively called subjective clues; the weight of a subjective clue Clue is first calculated according to formula (2):
Weight(Clue) = Max(CHI value / maximum CHI value, confidence × flag)    (2)
Wherein flag = 1 if the subjective clue carries a subjective-sentence confidence (i.e. it comes from a subjective rule), and flag = 0 otherwise;
Then the subjective clue density is calculated according to the subjective clue density defined in formula (3):
(3)
Wherein the subjective clue words contained in the sentence total n, the number of non-clue words between two adjacent subjective clue words w_i and w_i+1 is denoted distance(w_i, w_i+1), and the weight of the keyword w_i+1 in the sentence is denoted score(w_i+1);
The weight of each subjective clue word is calculated by the tf-idf method according to formula (4):
score(w_i, s_j) = tf(w_i, s_j) × log(|S| / df(w_i))    (4)
Wherein df(w_i) denotes the number of sentences containing the word w_i, |S| is the total number of sentences, and tf(w_i, s_j) denotes the number of times w_i occurs in sentence s_j;
The likelihood that a sentence is a subjective sentence is proportional to the value of SD(S);
(300) Viewpoint sentence identification: viewpoint sentences differ from subjective sentences and form a subset of the subjective sentences; to identify viewpoint sentences, a viewpoint word dictionary is first constructed, the dictionary is then used to count the viewpoint words occurring in each sentence, and from the counting results a decision tree is generated with the ID3 algorithm and used for viewpoint sentence identification;
(400) Viewpoint sentence similarity calculation: viewpoint extraction is first carried out: topics are clustered according to step (100), then for each topic the attributes describing the topic are extracted, a viewpoint word being the commendatory/derogatory word class that evaluates a given attribute; finally the weight of each word is calculated according to formula (5):
(5)
Wherein k denotes the number of parts of speech occurring in the sentence, n_i denotes the number of words of class i in the sentence, and g_i denotes the weight of the i-th class of viewpoint words.
Suppose the viewpoint word weight set of sentence A is WordSet(A) = {W_1, W_2, ..., W_n} and the viewpoint word weight set of sentence B is WordSet(B) = {W_1, W_2, ..., W_m}. If WordSet(B) contains the i-th word of WordSet(A) (1 ≤ i ≤ n), i.e. W_i ∈ WordSet(A) ∩ WordSet(B), then the i-th word co-occurs and the contribution of W_i to the similarity of sentences A and B is S_i. Likewise, if W_i does not occur in WordSet(B) but W_j (W_j ∈ WordSet(A), 1 ≤ j ≤ n) occurs in WordSet(B), i.e. W_j ∈ WordSet(A) ∩ WordSet(B), then the j-th word co-occurs and the contribution of W_j to the similarity of sentences A and B is S_j. If the i-th word and the j-th word both occur in sentence A and sentence B, then the joint contribution of W_i and W_j to the similarity of A and B is S_ij, with S_ij > S_i + S_j; the contribution of the close word pair W_i and W_j to the similarity of A and B is then S_ij − (S_i + S_j), and the similarity of W_i and W_j is inversely proportional to the value of S_ij − (S_i + S_j): the smaller S_ij − (S_i + S_j), the more similar W_i and W_j;
(500) Viewpoint sentence clustering: combining steps (100) to (400), viewpoint clustering is carried out according to formula (6),
(6)
Wherein the total contribution weight of the close word pairs of viewpoint sentence 1 with respect to viewpoint sentence 2 is used, n is the number of close word pairs, and W_i is the priority weight; not all features contribute to the similarity, and an effective pairing refers to a feature match that satisfies the priority rules; PairCount_1 is the number of words of viewpoint sentence 1, and PairCount_2 is the number of words of viewpoint sentence 2.
In a preferred embodiment of the present invention, the subjective rules in said step (200) comprise:
Rule 1: degree adverb (definitely, very, quite) → subjective sentence (0.75)
Rule 2: first-person pronoun (I, we, personally) → subjective sentence (0.85)
Rule 3: interrogative word (…, why) → subjective sentence (0.90)
Rule 4: demonstrative word (this, that, some) → subjective sentence (0.72)
Rule 5: conjunction (otherwise, and, on the contrary) → subjective sentence (0.64)
Rule 6: quotation verb (he says, he thinks) → objective sentence (1.0)
Rule 7: concept-definition core verb (is, comprises, is called, is named, is defined as) → objective sentence (0.99)
Rule 8: assertion core verb (is described as, reports, tells about) → objective sentence (0.98)
Rule 9: advocacy-class viewpoint word (think, should, determine, wish, believe) → subjective sentence (0.77)
Wherein the viewpoint words are divided into 18 classes, different classes contributing differently to the discrimination of subjective sentences, and the number in parentheses after each subjective rule is the confidence of that subjective rule.
The beneficial effects of the invention are as follows: the text classification method based on viewpoint of the present invention belongs to text classification methods at the semantic level; through a fusion model of event, viewpoint and emotion, it annotates the semantics of a text as a whole, and, by viewpoint clustering, avoids the problems of "semantic deficiency", the "curse of dimensionality" and "excessive dependence on corpora" that occur in traditional topic-based text classification.
Embodiment
Preferred embodiments of the present invention are described in detail below, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the protection scope of the present invention can be defined more clearly.
An embodiment of the present invention comprises a text classification method based on viewpoint, the specific steps of which comprise:
(100) When a topic is elaborated, the content words used are often confined to a relatively narrow range of the content involved in that topic and show a certain repetitiveness. If the words contained in two paragraphs, particularly the high-frequency nouns, repeat to a certain extent, the two paragraphs are rather similar and can be preliminarily regarded as discussing the same topic, so they should be drawn into the same topic paragraph; the words, especially high-frequency nouns, contained in paragraphs on different topics are generally not very similar, so the similarity between such paragraphs is lower and their difference larger. The division of topic paragraphs is therefore carried out first: a text is input, and the semantic similarity between every two paragraphs P_i and P_j (1 ≤ i, j ≤ n) in the text is calculated; then the candidate points P_k1, P_k2, ..., P_kr at which the topic changes are found one by one. If P_kr satisfies the division condition, it is determined to be a topic-paragraph division candidate point and the next candidate point is processed; when all topic-paragraph division candidate points have been processed, the procedure ends. If the condition is not satisfied, it is judged whether a second condition holds; if it does, P_kr is still regarded as a topic-paragraph division candidate point and the next candidate point is processed; if it does not, it is further judged whether the paragraph following the candidate point satisfies the condition, in which case P_kr is not taken as a division point. This continues until all topic-paragraph division candidate points have been processed and the topic-paragraph division points in the text are determined; all natural paragraphs in the text are then merged into several topic paragraphs, so that the text can be expressed as D = S_1 ⊕ S_2 ⊕ ... ⊕ S_n, where S_i denotes a topic paragraph,
Wherein F(P_i) = (W_i1, W_i2, ..., W_ij, ..., W_ik) is the paragraph feature vector, W_ij denotes the weight of the j-th element of the text feature-word list in paragraph i, the weight being computed from the frequency with which the word occurs in that paragraph, and k is the number of elements of the feature vector; the text feature vector is F(D) = (W_1, W_2, ..., W_l), where W_l denotes the weight of the l-th element of the text feature-word list in the text, the weight being computed from the frequency with which the word occurs in the text; k_r is the subscript of the paragraph at the r-th topic-paragraph division candidate point.
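The exact similarity formula and the division thresholds correspond to the formulas that are omitted in the source text; purely as an illustration, the following sketch assumes cosine similarity between paragraph term-frequency vectors and a single similarity threshold for proposing topic boundaries. All names (paragraph_vector, cosine_sim, split_topics) and the threshold value are hypothetical and not taken from the patent.

```python
# Hypothetical sketch of step (100): topic-paragraph division.
# Assumption: semantic similarity is the cosine of paragraph
# term-frequency vectors; a boundary is proposed wherever adjacent
# paragraphs fall below a threshold. The patent's actual formulas
# and conditions are not reproduced here.
from collections import Counter
from math import sqrt

def paragraph_vector(paragraph, feature_words):
    """Weight of each feature word = its frequency in the paragraph."""
    counts = Counter(paragraph.split())
    return [counts[w] for w in feature_words]

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_topics(paragraphs, feature_words, threshold=0.3):
    """Merge natural paragraphs into topic paragraphs S_1 ... S_n."""
    if not paragraphs:
        return []
    vecs = [paragraph_vector(p, feature_words) for p in paragraphs]
    topics, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        if cosine_sim(vecs[i - 1], vecs[i]) >= threshold:
            current.append(paragraphs[i])      # still the same topic
        else:
            topics.append(" ".join(current))   # topic boundary proposed here
            current = [paragraphs[i]]
    topics.append(" ".join(current))
    return topics
```

In the patent, candidate boundary points are further checked against additional conditions before being accepted; the single-threshold test above merely stands in for those omitted conditions.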
(200) Evaluation texts are usually mixed with a small amount of objective information, and this information affects the accuracy and quality of opinion mining to varying degrees, so separating objective information from evaluation texts becomes very important; it helps reduce the complexity of the opinion-mining problem and thus greatly improves the analysis efficiency and performance of the system. The discrimination of sentence subjectivity is therefore performed: the CHI statistical method is used to extract 2-POS subjective patterns from subjective texts and objective texts respectively. The CHI statistic measures the degree of correlation between a feature t and a class c: the higher the CHI value of feature t for a class, the greater its correlation with that class and the more class information it carries. First, word segmentation and part-of-speech tagging are performed on the sentences in the training corpus; then a 2-POS statistical language model is constructed; finally, the CHI statistic of each 2-POS pattern in the subjective and objective pattern sets is calculated according to formula (1), and the patterns are ranked by CHI value,
χ²(t_k, c_i) = N × (A·D − B·C)² / [(A + B) × (C + D) × (A + C) × (B + D)]    (1)
Wherein t_k denotes the k-th 2-POS pattern, A denotes the number of sentences belonging to class c_i that contain t_k, B denotes the number of sentences not belonging to class c_i that contain t_k, C denotes the number of sentences belonging to class c_i that do not contain t_k, D denotes the number of sentences that neither belong to class c_i nor contain t_k, and N denotes the total number of sentences in the corpus;
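As a purely illustrative sketch, the chi-square form of formula (1) reconstructed above can be computed as follows; word segmentation, POS tagging and 2-POS pattern extraction are assumed to have been done already, and the function name chi_square_2pos is hypothetical.

```python
# Hypothetical sketch of formula (1): CHI statistic of a 2-POS pattern
# for a class (subjective or objective). The counts follow the
# definitions above over the N sentences of the corpus.
def chi_square_2pos(sentence_patterns, labels, pattern, target_class):
    """sentence_patterns: one set of 2-POS patterns per sentence;
    labels: class of each sentence, e.g. 'subjective' or 'objective'."""
    N = len(sentence_patterns)
    A = B = C = D = 0
    for patterns, label in zip(sentence_patterns, labels):
        contains = pattern in patterns
        if label == target_class and contains:
            A += 1          # in class c_i, contains the k-th 2-POS pattern
        elif label != target_class and contains:
            B += 1          # not in c_i, contains the pattern
        elif label == target_class:
            C += 1          # in c_i, does not contain the pattern
        else:
            D += 1          # neither in c_i nor contains the pattern
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denominator if denominator else 0.0
```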
Subjective rules are obtained by applying statistical methods to the movie-review data set provided by Cornell University. This data set consists of film reviews, including 1000 documents each holding positive and negative attitudes, plus 5331 sentences each annotated with commendatory or derogatory polarity and 5000 sentences each annotated with subjective or objective labels; the movie-review corpus is widely used in sentiment analysis research at various granularities, such as the word, sentence and discourse levels. The subjective rules comprise:
Rule 1: degree adverb (definitely, very, quite) → subjective sentence (0.75)
Rule 2: first-person pronoun (I, we, personally) → subjective sentence (0.85)
Rule 3: interrogative word (…, why) → subjective sentence (0.90)
Rule 4: demonstrative word (this, that, some) → subjective sentence (0.72)
Rule 5: conjunction (otherwise, and, on the contrary) → subjective sentence (0.64)
Rule 6: quotation verb (he says, he thinks) → objective sentence (1.0)
Rule 7: concept-definition core verb (is, comprises, is called, is named, is defined as) → objective sentence (0.99)
Rule 8: assertion core verb (is described as, reports, tells about) → objective sentence (0.98)
Rule 9: advocacy-class viewpoint word (think, should, determine, wish, believe) → subjective sentence (0.77)
Wherein the viewpoint words are divided into 18 classes, different classes contributing differently to the discrimination of subjective sentences, and the number in parentheses after each subjective rule is the confidence of that subjective rule.
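Purely as an illustration of how rules of this kind might be applied, the sketch below encodes a few of rules 1-9 as cue-word lists with confidences; the cue lists, the substring matching and the function name apply_subjective_rules are hypothetical stand-ins, and the real method would use the original Chinese lexicon.

```python
# Hypothetical sketch: applying subjective/objective rules with confidences.
# Each rule is (cue words, label, confidence), mirroring the rule list
# above; the English cue words are illustrative placeholders.
RULES = [
    ({"definitely", "very", "quite"}, "subjective", 0.75),   # rule 1: degree adverbs
    ({"I", "we", "personally"},       "subjective", 0.85),   # rule 2: first-person pronouns
    ({"why"},                         "subjective", 0.90),   # rule 3: interrogatives
    ({"he says", "he thinks"},        "objective",  1.00),   # rule 6: quotation verbs
    ({"is defined as", "is called"},  "objective",  0.99),   # rule 7: definition core verbs
]

def apply_subjective_rules(sentence):
    """Return (label, confidence) of the strongest matching rule, or None.
    Uses crude substring matching for the sake of the sketch."""
    best = None
    for cues, label, confidence in RULES:
        if any(cue in sentence for cue in cues):
            if best is None or confidence > best[1]:
                best = (label, confidence)
    return best
```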
The subjective patterns and the subjective rules are collectively called subjective clues; the weight of a subjective clue Clue is first calculated according to formula (2):
Weight(Clue) = Max(CHI value / maximum CHI value, confidence × flag)    (2)
Wherein flag = 1 if the subjective clue carries a subjective-sentence confidence (i.e. it comes from a subjective rule), and flag = 0 otherwise;
Then the subjective clue density is calculated according to the subjective clue density defined in formula (3):
(3)
Wherein the subjective clue words contained in the sentence total n, the number of non-clue words between two adjacent subjective clue words w_i and w_i+1 is denoted distance(w_i, w_i+1), and the weight of the keyword w_i+1 in the sentence is denoted score(w_i+1);
The weight of each subjective clue word is calculated by the tf-idf method according to formula (4):
score(w_i, s_j) = tf(w_i, s_j) × log(|S| / df(w_i))    (4)
Wherein df(w_i) denotes the number of sentences containing the word w_i, |S| is the total number of sentences, and tf(w_i, s_j) denotes the number of times w_i occurs in sentence s_j;
The likelihood that a sentence is a subjective sentence is proportional to the value of SD(S);
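Formula (3) for the subjective clue density is not reproduced in the source, so the sketch below only illustrates one plausible reading under stated assumptions: the clue weight of formula (2), the tf-idf weight of formula (4) as reconstructed above, and an assumed density that rewards clue words that are both heavily weighted and close together. The density expression and all function names are assumptions, not the patent's formulas.

```python
# Hypothetical sketch of formulas (2)-(4): subjective clue weight,
# tf-idf weight of clue words, and one *assumed* form of the subjective
# clue density SD(S) (formula (3) is not given in the source text).
from math import log

def clue_weight(chi_value, max_chi, confidence, from_rule):
    """Formula (2): Weight(Clue) = Max(CHI / max CHI, confidence * flag)."""
    flag = 1 if from_rule else 0
    return max(chi_value / max_chi, confidence * flag)

def tfidf(word, sentence_tokens, all_sentence_tokens):
    """Formula (4) as reconstructed above; sentences are token lists."""
    tf = sentence_tokens.count(word)
    df = sum(1 for tokens in all_sentence_tokens if word in tokens)
    return tf * log(len(all_sentence_tokens) / df) if df else 0.0

def clue_density(clue_positions, clue_scores):
    """ASSUMED form of SD(S): the weight of each next clue word divided by
    (1 + number of non-clue words separating it from the previous one)."""
    sd = 0.0
    for i in range(1, len(clue_positions)):
        distance = clue_positions[i] - clue_positions[i - 1] - 1
        sd += clue_scores[i] / (1 + distance)
    return sd
```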
(300) Viewpoint sentence identification: viewpoint sentences differ from subjective sentences and form a subset of the subjective sentences; to identify viewpoint sentences, a viewpoint word dictionary is first constructed, the dictionary is then used to count the viewpoint words occurring in each sentence, and from the counting results a decision tree is generated with the ID3 algorithm and used for viewpoint sentence identification. Table 1 gives part of the viewpoint word dictionary;
Table 1 Part of the viewpoint word dictionary
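The patent names the ID3 algorithm for this step; as a minimal stand-in sketch, the example below uses scikit-learn's DecisionTreeClassifier with the entropy criterion (information gain), which approximates ID3 on viewpoint-word count features. The feature layout and all names are assumptions, not the patent's own implementation.

```python
# Hypothetical sketch of step (300): viewpoint-sentence identification.
# Features: counts of viewpoint words (per dictionary class) in each
# subjective sentence; an entropy-based decision tree stands in for ID3.
from sklearn.tree import DecisionTreeClassifier

def count_features(sentence_tokens, viewpoint_dictionary):
    """viewpoint_dictionary: {word_class: set_of_words}; returns one
    viewpoint-word count per word class for the given sentence."""
    return [sum(token in words for token in sentence_tokens)
            for words in viewpoint_dictionary.values()]

def train_viewpoint_classifier(train_sentences, labels, viewpoint_dictionary):
    """labels: 1 if the (subjective) sentence is a viewpoint sentence, else 0."""
    X = [count_features(s, viewpoint_dictionary) for s in train_sentences]
    clf = DecisionTreeClassifier(criterion="entropy")  # information gain, ID3-like
    clf.fit(X, labels)
    return clf
```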
(400) Viewpoint sentence similarity calculation: subjective sentences and viewpoint sentences differ in essence. A subjective sentence is a sentence describing an idea, suggestion, view or evaluation and expresses the speaker's attitude, while a viewpoint sentence is a subjective judgment on a thing or event; as can be seen from the lineage of speaker attitudes in Fig. 1, a viewpoint sentence must be a subjective sentence, but not vice versa. Since the present invention discloses a text classification method based on viewpoint, the identification of viewpoint sentences is an important link, and the key technique of viewpoint sentence identification is the careful construction of the viewpoint word dictionary. The present invention mainly uses pairs of close viewpoint words to calculate viewpoint sentence similarity. Viewpoint extraction is carried out first: topics are clustered according to step (100), then for each topic the attributes describing the topic are extracted. Attribute extraction relies on dependency-tree analysis of the viewpoint sentences; the present invention adopts the LTP platform released by the NLP group of Harbin Institute of Technology and determines the object of a viewpoint sentence, i.e. the attribute, from the modification relations of the dependency tree. A viewpoint word is the commendatory/derogatory word class that evaluates a given attribute, and the commendatory/derogatory word classes are summarized into 16 classes on the basis of a sentiment dictionary; the three elements on which a viewpoint word depends are therefore the topic, the attribute and the commendatory/derogatory word class,
After the viewpoint words have been extracted, the weight of each word is calculated according to formula (5),
(5)
Wherein k denotes the number of parts of speech occurring in the sentence, n_i denotes the number of words of class i in the sentence, and g_i denotes the weight of the i-th class of viewpoint words.
Suppose the viewpoint word weight set of sentence A is WordSet(A) = {W_1, W_2, ..., W_n} and the viewpoint word weight set of sentence B is WordSet(B) = {W_1, W_2, ..., W_m}. If WordSet(B) contains the i-th word of WordSet(A) (1 ≤ i ≤ n), i.e. W_i ∈ WordSet(A) ∩ WordSet(B), then the i-th word co-occurs and the contribution of W_i to the similarity of sentences A and B is S_i. Likewise, if W_i does not occur in WordSet(B) but W_j (W_j ∈ WordSet(A), 1 ≤ j ≤ n) occurs in WordSet(B), i.e. W_j ∈ WordSet(A) ∩ WordSet(B), then the j-th word co-occurs and the contribution of W_j to the similarity of sentences A and B is S_j. If the i-th word and the j-th word both occur in sentence A and sentence B, then the joint contribution of W_i and W_j to the similarity of A and B is S_ij, with S_ij > S_i + S_j; the contribution of the close word pair W_i and W_j to the similarity of A and B is then S_ij − (S_i + S_j), and the similarity of W_i and W_j is inversely proportional to the value of S_ij − (S_i + S_j): the smaller S_ij − (S_i + S_j), the more similar W_i and W_j;
Extraction of close word pairs: the weight of a word refers to the degree to which the word represents the concept of the sentence, i.e. a measure of the word's importance to the meaning the sentence intends to express. A word's ability to express the meaning of a sentence is related not only to the characteristics of the word itself, but also to factors such as the sentence structure, the sentence length and the grammatical function of the word; for example, the subject and a modifier differ in importance for the meaning the sentence intends to express;
Ideally, the grammatical function of each word should be analyzed and the weight of the word computed accordingly, but at the present stage a complete syntactic analysis of a sentence and a complete identification of the grammatical roles of words are impossible. On the other hand, the part of speech of a word has a certain correspondence with its grammatical role in the sentence: nouns and pronouns generally serve as subject and object, verbs generally serve as predicate, and adjectives, numerals and classifiers generally serve as attribute or adverbial, so the grammatical function information of a word can be used indirectly through its part-of-speech information to calculate the weight of the word. In Chinese text, function words generally only provide grammatical connection and do not express actual concepts, so function words are not considered in the calculation. Content words comprise nouns (N), verbs (V), adjectives (A), numerals (M), classifiers (Q), pronouns (R) and so on; classifiers are weaker at expressing concepts and pronouns generally repeat other concepts, so classifiers and pronouns are given smaller weights. According to the parts of speech, their corresponding grammatical roles and experience, the weights of the different parts of speech are given as: noun weight g_1, verb weight g_2, adjective weight g_3, numeral weight g_4, and other content-word weight g_5. In practice, nouns, verbs, adjectives and numerals do not necessarily all appear in one sentence, so in the calculation the weight of a word class is taken as the ratio of the weight of that class to the sum of the weights of all the word classes occurring in the sentence, words of the same class being considered to have the same ability to express the meaning of the sentence. The importance of a word to the meaning a sentence intends to express is also affected by sentence length: the longer the sentence and the more words it contains, the smaller the role each word plays in expressing the sentence's meaning,
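Formula (5) itself is not reproduced in the source; the following sketch only illustrates the normalization described in the preceding paragraph, i.e. dividing the base weight of a part-of-speech class by the sum of the weights of the classes that actually occur in the sentence. The base weights g_1..g_5 and the function name are assumptions.

```python
# Hypothetical sketch of the part-of-speech word weighting used in the
# viewpoint sentence similarity calculation. Base weights g1..g5 are
# illustrative; the normalization follows the description above.
BASE_WEIGHTS = {"N": 1.0, "V": 0.8, "A": 0.7, "M": 0.5, "OTHER": 0.3}  # assumed g1..g5

def pos_word_weights(tagged_sentence):
    """tagged_sentence: list of (word, pos_class); returns {word: weight},
    each weight being the class weight divided by the sum of the weights
    of the part-of-speech classes that occur in this sentence."""
    classes = [pos if pos in BASE_WEIGHTS else "OTHER" for _, pos in tagged_sentence]
    total = sum(BASE_WEIGHTS[c] for c in set(classes)) or 1.0
    return {word: BASE_WEIGHTS[c] / total
            for (word, _), c in zip(tagged_sentence, classes)}
```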
(500) Viewpoint sentence clustering: combining steps (100) to (400), the semantic granularity of the text is transformed from coarse to fine: topic paragraph → subjective sentence → viewpoint sentence → close word pair; on this basis, viewpoint clustering is carried out according to formula (6),
(6)
Wherein the total contribution weight of the close word pairs of viewpoint sentence 1 with respect to viewpoint sentence 2 is used, n is the number of close word pairs, and W_i is the priority weight; not all features contribute to the similarity, and an effective pairing refers to a feature match that satisfies the priority rules; PairCount_1 is the number of words of viewpoint sentence 1, and PairCount_2 is the number of words of viewpoint sentence 2.
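Formula (6) is likewise not reproduced in the source. Purely as an illustration, the sketch below assumes that the similarity of two viewpoint sentences is the priority-weighted sum of the contributions of their effectively paired close words, normalized by the word counts of the two sentences, and then clusters sentences greedily by a similarity threshold; every name and the threshold value are assumptions.

```python
# Hypothetical sketch of step (500): viewpoint sentence clustering.
# Assumed similarity: weighted contributions of effective close-word
# pairs, normalized by the word counts of the two sentences; formula
# (6) itself is not given in the source text.
def sentence_similarity(pair_contributions, count1, count2):
    """pair_contributions: list of (priority_weight, contribution) for each
    effective close-word pair between the two viewpoint sentences."""
    total = sum(w * c for w, c in pair_contributions)
    return 2.0 * total / (count1 + count2) if (count1 + count2) else 0.0

def cluster_viewpoint_sentences(sentences, pair_fn, threshold=0.5):
    """Greedy clustering: each sentence joins the first cluster whose
    representative sentence is similar enough, otherwise starts a new cluster.
    pair_fn(s, t) must return (pair_contributions, word_count_s, word_count_t)."""
    clusters = []                      # each cluster is a list of sentence indices
    for i, sentence in enumerate(sentences):
        placed = False
        for cluster in clusters:
            representative = sentences[cluster[0]]
            pairs, c1, c2 = pair_fn(sentence, representative)
            if sentence_similarity(pairs, c1, c2) >= threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters
```

Because the number of clusters is not fixed in advance, this greedy scheme reflects the patent's point that the number of viewpoint classes is dynamic rather than limited to approval, opposition and neutrality.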
Compared with the prior art, the text classification method based on viewpoint of the present invention belongs to text classification methods at the semantic level; through a fusion model of event, viewpoint and emotion, it annotates the semantics of a text as a whole, and, by viewpoint sentence clustering, avoids the problems of "semantic deficiency", the "curse of dimensionality" and "excessive dependence on corpora" that occur in traditional topic-based text classification. Table 2 gives a comparison of three classification methods.
Table 2 Comparison of three classification methods
The above are only embodiments of the present invention and do not thereby limit the scope of the claims of the present invention; any equivalent structural or process transformation made using the contents of the specification of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (3)

1. A text classification method based on viewpoint, characterized in that the specific steps comprise:
(100) Division of topic paragraphs: first a text is input, and the semantic similarity between every two paragraphs P_i and P_j (1 ≤ i, j ≤ n) in the text is calculated; then the candidate points P_k1, P_k2, ..., P_kr at which the topic changes are found one by one. If P_kr satisfies the division condition, it is determined to be a topic-paragraph division candidate point and the next candidate point is processed; when all topic-paragraph division candidate points have been processed, the procedure ends. If the condition is not satisfied, it is judged whether a second condition holds; if it does, P_kr is still regarded as a topic-paragraph division candidate point and the next candidate point is processed; if it does not, it is further judged whether the paragraph following the candidate point satisfies the condition, in which case P_kr is not taken as a division point. This continues until all topic-paragraph division candidate points have been processed and the topic-paragraph division points in the text are determined; all natural paragraphs in the text are then merged into several topic paragraphs, so that the text can be expressed as D = S_1 ⊕ S_2 ⊕ ... ⊕ S_n, where S_i denotes a topic paragraph;
Wherein F(P_i) = (W_i1, W_i2, ..., W_ij, ..., W_ik) is the paragraph feature vector, W_ij denotes the weight of the j-th element of the text feature-word list in paragraph i, the weight being computed from the frequency with which the word occurs in that paragraph, and k is the number of elements of the feature vector; the text feature vector is F(D) = (W_1, W_2, ..., W_l), where W_l denotes the weight of the l-th element of the text feature-word list in the text, the weight being computed from the frequency with which the word occurs in the text; k_r is the subscript of the paragraph at the r-th topic-paragraph division candidate point;
(200) Discrimination of sentence subjectivity: the CHI statistical method is used to extract 2-POS subjective patterns from subjective texts and objective texts respectively; first, word segmentation and part-of-speech tagging are performed on the sentences in the training corpus, then a 2-POS statistical language model is constructed, and finally the CHI statistic of each 2-POS pattern in the subjective and objective pattern sets is calculated according to formula (1), and the patterns are ranked by CHI value:
χ²(t_k, c_i) = N × (A·D − B·C)² / [(A + B) × (C + D) × (A + C) × (B + D)]    (1)
Wherein t_k denotes the k-th 2-POS pattern, A denotes the number of sentences belonging to class c_i that contain t_k, B denotes the number of sentences not belonging to class c_i that contain t_k, C denotes the number of sentences belonging to class c_i that do not contain t_k, D denotes the number of sentences that neither belong to class c_i nor contain t_k, and N denotes the total number of sentences in the corpus;
Subjective rules are obtained by applying statistical methods to the movie-review data set provided by Cornell University;
The subjective patterns and the subjective rules are collectively called subjective clues; the weight of a subjective clue Clue is first calculated according to formula (2):
Weight(Clue) = Max(CHI value / maximum CHI value, confidence × flag)    (2)
Wherein flag = 1 if the subjective clue carries a subjective-sentence confidence (i.e. it comes from a subjective rule), and flag = 0 otherwise;
Then the subjective clue density is calculated according to the subjective clue density defined in formula (3):
(3)
Wherein the subjective clue words contained in the sentence total n, the number of non-clue words between two adjacent subjective clue words w_i and w_i+1 is denoted distance(w_i, w_i+1), and the weight of the keyword w_i+1 in the sentence is denoted score(w_i+1);
The weight of each subjective clue word is calculated by the tf-idf method according to formula (4):
score(w_i, s_j) = tf(w_i, s_j) × log(|S| / df(w_i))    (4)
Wherein df(w_i) denotes the number of sentences containing the word w_i, |S| is the total number of sentences, and tf(w_i, s_j) denotes the number of times w_i occurs in sentence s_j;
The likelihood that a sentence is a subjective sentence is proportional to the value of SD(S);
(300) Viewpoint sentence identification: viewpoint sentences differ from subjective sentences and form a subset of the subjective sentences; to identify viewpoint sentences, a viewpoint word dictionary is first constructed, the dictionary is then used to count the viewpoint words occurring in each sentence, and from the counting results a decision tree is generated with the ID3 algorithm and used for viewpoint sentence identification;
(400) Viewpoint sentence similarity calculation: viewpoint extraction is first carried out: topics are clustered according to step (100), then for each topic the attributes describing the topic are extracted, a viewpoint word being the commendatory/derogatory word class that evaluates a given attribute; finally the weight of each word is calculated according to formula (5):
(5)
Wherein k denotes the number of parts of speech occurring in the sentence, n_i denotes the number of words of class i in the sentence, and g_i denotes the weight of the i-th class of viewpoint words.
2. Suppose the viewpoint word weight set of sentence A is WordSet(A) = {W_1, W_2, ..., W_n} and the viewpoint word weight set of sentence B is WordSet(B) = {W_1, W_2, ..., W_m}. If WordSet(B) contains the i-th word of WordSet(A) (1 ≤ i ≤ n), i.e. W_i ∈ WordSet(A) ∩ WordSet(B), then the i-th word co-occurs and the contribution of W_i to the similarity of sentences A and B is S_i. Likewise, if W_i does not occur in WordSet(B) but W_j (W_j ∈ WordSet(A), 1 ≤ j ≤ n) occurs in WordSet(B), i.e. W_j ∈ WordSet(A) ∩ WordSet(B), then the j-th word co-occurs and the contribution of W_j to the similarity of sentences A and B is S_j. If the i-th word and the j-th word both occur in sentence A and sentence B, then the joint contribution of W_i and W_j to the similarity of A and B is S_ij, with S_ij > S_i + S_j; the contribution of the close word pair W_i and W_j to the similarity of A and B is then S_ij − (S_i + S_j), and the similarity of W_i and W_j is inversely proportional to the value of S_ij − (S_i + S_j): the smaller S_ij − (S_i + S_j), the more similar W_i and W_j;
(500) Viewpoint sentence clustering: combining steps (100) to (400), viewpoint clustering is carried out according to formula (6),
(6)
Wherein the total contribution weight of the close word pairs of viewpoint sentence 1 with respect to viewpoint sentence 2 is used, n is the number of close word pairs, and W_i is the priority weight; not all features contribute to the similarity, and an effective pairing refers to a feature match that satisfies the priority rules; PairCount_1 is the number of words of viewpoint sentence 1, and PairCount_2 is the number of words of viewpoint sentence 2.
3. The text classification method based on viewpoint according to claim 1, characterized in that the subjective rules in said step (200) comprise:
Rule 1: degree adverb (definitely, very, quite) → subjective sentence (0.75)
Rule 2: first-person pronoun (I, we, personally) → subjective sentence (0.85)
Rule 3: interrogative word (…, why) → subjective sentence (0.90)
Rule 4: demonstrative word (this, that, some) → subjective sentence (0.72)
Rule 5: conjunction (otherwise, and, on the contrary) → subjective sentence (0.64)
Rule 6: quotation verb (he says, he thinks) → objective sentence (1.0)
Rule 7: concept-definition core verb (is, comprises, is called, is named, is defined as) → objective sentence (0.99)
Rule 8: assertion core verb (is described as, reports, tells about) → objective sentence (0.98)
Rule 9: advocacy-class viewpoint word (think, should, determine, wish, believe) → subjective sentence (0.77)
Wherein the viewpoint words are divided into 18 classes, different classes contributing differently to the discrimination of subjective sentences, and the number in parentheses after each subjective rule is the confidence of that subjective rule.
CN201410434035.6A 2014-08-29 2014-08-29 Text classification method based on viewpoint Pending CN104331394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410434035.6A CN104331394A (en) 2014-08-29 2014-08-29 Text classification method based on viewpoint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410434035.6A CN104331394A (en) 2014-08-29 2014-08-29 Text classification method based on viewpoint

Publications (1)

Publication Number Publication Date
CN104331394A true CN104331394A (en) 2015-02-04

Family

ID=52406123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410434035.6A Pending CN104331394A (en) 2014-08-29 2014-08-29 Text classification method based on viewpoint

Country Status (1)

Country Link
CN (1) CN104331394A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408883A (en) * 2008-11-24 2009-04-15 电子科技大学 Method for collecting network public feelings viewpoint
CN103116644A (en) * 2013-02-26 2013-05-22 华南理工大学 Method for mining orientation of Web themes and supporting decisions

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
VIRENDRA KUMAR GUPTA等: "Multi-Document Summarization Using Sentence Clustering", 《IEEE PROCEEDINGS OF 4TH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN COMPUTER INTERACTION》 *
XIN WANG 等: "Chinese Subjectivity Detection using a Sentiment Density-Based Naive Bayesian Classifier", 《PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON MACHINE AND CYBERNETICS》 *
倪茂树: "基于语义理解的观点评论挖掘研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
傅间莲 等: "基于连续段落相似度的主题划分算法", 《计算机应用》 *
刘亚亮等: "一种优化的AP-CAPSA中文文本结构分析算法", 《计算机应用研究》 *
张玉娟: "基于《知网》的句子相似度计算的研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
杨武等: "中文微博情感分析中主客观句分类方法", 《重庆理工大学学报》 *
陈旻等: "观点挖掘综述", 《浙江大学学报(工学版)》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608068A (en) * 2014-11-17 2016-05-25 三星电子株式会社 Display apparatus and method for summarizing of document
CN106202200A (en) * 2016-06-28 2016-12-07 昆明理工大学 A kind of emotion tendentiousness of text sorting technique based on fixing theme
CN106202200B (en) * 2016-06-28 2019-09-27 昆明理工大学 A kind of emotion tendentiousness of text classification method based on fixed theme
CN106294636A (en) * 2016-08-01 2017-01-04 中国电子科技集团公司第二十八研究所 A kind of search rank algorithm based on database data
CN106294636B (en) * 2016-08-01 2019-03-19 中国电子科技集团公司第二十八研究所 A kind of search rank method based on database data
CN106407999A (en) * 2016-08-25 2017-02-15 北京物思创想科技有限公司 Rule combined machine learning method and system
CN108334513A (en) * 2017-01-20 2018-07-27 阿里巴巴集团控股有限公司 A kind of identification processing method of Similar Text, apparatus and system
CN107247868A (en) * 2017-05-18 2017-10-13 深思考人工智能机器人科技(北京)有限公司 A kind of artificial intelligence aids in interrogation system
CN107247868B (en) * 2017-05-18 2020-05-12 深思考人工智能机器人科技(北京)有限公司 Artificial intelligence auxiliary inquiry system
CN109033041A (en) * 2017-06-09 2018-12-18 北京国双科技有限公司 The treating method and apparatus of document similarity
CN109871856B (en) * 2017-12-04 2022-03-04 北京京东尚科信息技术有限公司 Method and device for optimizing training sample
CN109871856A (en) * 2017-12-04 2019-06-11 北京京东尚科信息技术有限公司 A kind of method and apparatus optimizing training sample
CN110738046B (en) * 2018-07-03 2023-06-06 百度在线网络技术(北京)有限公司 Viewpoint extraction method and apparatus
CN110738046A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Viewpoint extraction method and device
CN109241297A (en) * 2018-07-09 2019-01-18 广州品唯软件有限公司 A kind of classifying content polymerization, electronic equipment, storage medium and engine
CN110162781A (en) * 2019-04-09 2019-08-23 国金涌富资产管理有限公司 A kind of finance text subjectivity sentence automatic identifying method
CN109977418A (en) * 2019-04-09 2019-07-05 南瑞集团有限公司 A kind of short text method for measuring similarity based on semantic vector
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 A kind of text cutting method based on subject information
CN110399489A (en) * 2019-07-08 2019-11-01 厦门市美亚柏科信息股份有限公司 A kind of chat data segmentation method, device and storage medium
CN110399489B (en) * 2019-07-08 2022-06-17 厦门市美亚柏科信息股份有限公司 Chat data segmentation method, device and storage medium
CN111046282A (en) * 2019-12-06 2020-04-21 贝壳技术有限公司 Text label setting method, device, medium and electronic equipment
CN111046282B (en) * 2019-12-06 2021-04-16 北京房江湖科技有限公司 Text label setting method, device, medium and electronic equipment
CN111178043A (en) * 2019-12-31 2020-05-19 武汉优聘科技有限公司 Method and system for recognizing academic viewpoint sentence
CN113326411A (en) * 2020-02-28 2021-08-31 中国移动通信集团福建有限公司 Network behavior knowledge enhancement method and device and electronic equipment
CN112131863A (en) * 2020-08-04 2020-12-25 中科天玑数据科技股份有限公司 Comment opinion theme extraction method, electronic equipment and storage medium
CN112464646A (en) * 2020-11-23 2021-03-09 中国船舶工业综合技术经济研究院 Text emotion analysis method for defense intelligence library in national defense field
CN112905766A (en) * 2021-02-09 2021-06-04 长沙冉星信息科技有限公司 Method for extracting core viewpoints from subjective answer text

Similar Documents

Publication Publication Date Title
CN104331394A (en) Text classification method based on viewpoint
Ghosh et al. Fracking sarcasm using neural network
CN106598944B (en) A kind of civil aviaton's security public sentiment sentiment analysis method
Xue et al. A study on sentiment computing and classification of sina weibo with word2vec
Roberts et al. Investigating the emotional responses of individuals to urban green space using twitter data: A critical comparison of three different methods of sentiment analysis
Mukherjee et al. Effect of negation in sentences on sentiment analysis and polarity detection
Nagy et al. Crowd sentiment detection during disasters and crises.
Shi et al. Sentiment analysis of Chinese microblogging based on sentiment ontology: a case study of ‘7.23 Wenzhou Train Collision’
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN107305539A (en) A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN103544246A (en) Method and system for constructing multi-emotion dictionary for internet
CN103729456B (en) Microblog multi-modal sentiment analysis method based on microblog group environment
CN103995803A (en) Fine granularity text sentiment analysis method
Shahheidari et al. Twitter sentiment mining: A multi domain analysis
Ramírez-Tinoco et al. A brief review on the use of sentiment analysis approaches in social networks
Ahmed et al. A novel approach for Sentimental Analysis and Opinion Mining based on SentiWordNet using web data
CN105869058B (en) A kind of method that multilayer latent variable model user portrait extracts
Mozafari et al. Emotion detection by using similarity techniques
Chen et al. Sentiment classification of tourism based on rules and LDA topic model
KR101326313B1 (en) Method of classifying emotion from multi sentence using context information
Karampatsis et al. AUEB: Two stage sentiment analysis of social network messages
Fiarni et al. Implementing rule-based and naive bayes algorithm on incremental sentiment analysis system for Indonesian online transportation services review
Quan et al. Automatic Annotation of Word Emotion in Sentences Based on Ren-CECps.
Keyan et al. Multi-document and multi-lingual summarization using neural networks
Wen et al. A new analysis method for user reviews of mobile fitness apps

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150204