CN104331394A - Text classification method based on viewpoint - Google Patents

Text classification method based on viewpoint

Info

Publication number
CN104331394A
Authority
CN
China
Prior art keywords
sentence
word
viewpoint
subjective
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410434035.6A
Other languages
Chinese (zh)
Inventor
程实
何海棠
沈学华
程显毅
施佺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN201410434035.6A priority Critical patent/CN104331394A/en
Publication of CN104331394A publication Critical patent/CN104331394A/en
Pending legal-status Critical Current


Abstract

The invention discloses a text classification method based on viewpoint. The method comprises the following specific steps: (100) dividing topic paragraphs; (200) discriminating sentence subjectivity; (300) identifying viewpoint sentences; (400) calculating the similarity of viewpoint sentences; and (500) clustering viewpoint sentences. In this way, the text classification method based on viewpoint can realize dynamic, semantic, low-dimensional and efficient text classification, so that the processing of network text information better conforms to human cognitive processes and better meets the demands of practical application.

Description

A text classification method based on viewpoint
Technical field
The present invention relates to the technical fields of text mining and affective computing, and in particular to a text classification method based on viewpoint.
Background technology
With the development of Web 2.0 technology, web communities, blogs and forums provide network users with an ever broader platform for exchanging information and expressing opinions. Commercial organizations can learn customers' opinions by surveying online product reviews, carry out marketing research and analysis, track products online, continuously improve product performance and after-sales service, and cultivate potential consumer groups; consumers can decide whether to buy a product by browsing other users' evaluations of it; and government departments can learn from online forums the public's views on policies, regulations and current events, grasp popular attitudes toward social governance in time, and make scientific and reasonable decisions. Therefore, how to process and analyze these subjective review texts quickly and effectively, and to understand other people's ideas, viewpoints and attitudes toward things, is one of the major problems to be solved in the field of network text information processing.
A so-called viewpoint refers to a person's idea of and understanding about something. A viewpoint is not a fact, because a viewpoint has been neither verified nor confirmed; once a viewpoint is later proven and confirmed, it is no longer a viewpoint but becomes a fact. According to Kim and Hovy's definition, a viewpoint consists of four elements: topic, holder, statement and emotion. These four elements are inherently connected: the holder of the viewpoint delivers a statement with emotion about a certain topic.
As an emerging research field, opinion mining has attracted wide attention in the NLP research community. In recent years, NLP-related international conferences have set up special sessions to discuss opinion mining, and the numerous research results can be divided into two broad classes: document-level (coarse-grained) opinion mining and sentence-level (medium-grained) opinion mining.
Coarse-grained opinion mining divides evaluation texts into three major classes: support, opposition and neutral. Although coarse-grained opinion mining can be regarded as text classification, it differs greatly from traditional topic-oriented text classification: in traditional topic-oriented text classification, words related to the topic are extremely important, whereas in coarse-grained opinion mining, emotion words expressing commendatory or derogatory viewpoints are the most useful.
Coarse-grained opinion mining cannot discover the details that a user likes and dislikes. For example, a user may be satisfied with the configuration and design of a digital camera but dissatisfied with the service life of its battery; in many cases such an overall judgment alone is not enough, because when people express viewpoints and attitudes they usually not only give an overall evaluation of a topic but often also evaluate certain parts or characteristics of it.
Medium-grained opinion mining is mainly applied to extracting the viewpoints delivered on the features of an object. The method goes down to the sentence level and can extract the details of a viewpoint; the object here can be a product, a service, a person, an organization, an event, and so on. For example, in the sentence "the battery life of this camera is too short", the product feature evaluated by the user is the "battery life" of the camera, and the conclusion (viewpoint) given by the user is negative.
Whether coarse-grained or medium-grained, opinion mining will classify two different viewpoints such as "the United States attacked Iraq first" and "Iraq attacked the United States first" into the same class, because these methods take words as the basic features and do not use semantic features (viewpoints). Fine-grained opinion mining classifies texts or sentences by viewpoint, and the number of classes is dynamic, because different people hold different views on the same thing, not merely approval, opposition and neutrality. Since fine-grained opinion mining cannot obtain a universal corpus, text classification based on viewpoint is carried out as viewpoint clustering.
Another motivation for proposing viewpoint-based text classification is that, over the past decades, semantic computing and affective computing have made significant progress, while dynamic text classification, semantics-based text classification, text classification integrating multiple techniques, and low-dimensional, efficient text classification have urgent application demands.
Summary of the invention
The technical problem mainly solved by the present invention is to provide a text classification method based on viewpoint. The method can realize dynamic, semantic, low-dimensional and efficient text classification, so that network text information processing better conforms to human cognitive processes and better meets the demands of practical application.
To solve the above technical problem, the technical solution adopted by the present invention is a text classification method based on viewpoint, the specific steps of which comprise:
(100) Division of topic paragraphs: first a text is input, and the semantic similarity between every two paragraphs P_i and P_j (1 ≤ i, j ≤ n) in the text is calculated; then the candidate points P_k1, P_k2, ..., P_kr at which the topic changes are found one by one. If P_kr satisfies the division condition, it is determined to be a topic-paragraph division candidate point and the next candidate point is processed; when all topic-paragraph division candidate points have been processed, the procedure ends. If the condition is not satisfied, it is judged whether a second condition holds; if it does, P_kr is still regarded as a topic-paragraph division candidate point and the next candidate point is processed; if it does not, it is further judged whether the paragraph following the candidate point satisfies the condition, in which case P_kr is not taken as a division point. This continues until all topic-paragraph division candidate points have been processed and the topic-paragraph division points in the text are determined; all natural paragraphs in the text are then merged into several topic paragraphs, so that the text can be expressed as D = S_1 ⊕ S_2 ⊕ ... ⊕ S_n, where S_i denotes a topic paragraph;
Wherein F(P_i) = (W_i1, W_i2, ..., W_ij, ..., W_ik) is the paragraph feature vector, W_ij denotes the weight of the j-th element of the text feature-word list in paragraph i, the weight being computed from the frequency with which the word occurs in that paragraph, and k is the number of elements of the feature vector; the text feature vector is F(D) = (W_1, W_2, ..., W_l), where W_l denotes the weight of the l-th element of the text feature-word list in the text, the weight being computed from the frequency with which the word occurs in the text; k_r is the subscript of the paragraph at the r-th topic-paragraph division candidate point;
(200) Discrimination of sentence subjectivity: the CHI statistical method is used to extract 2-POS subjective patterns from subjective texts and objective texts respectively; first, word segmentation and part-of-speech tagging are performed on the sentences in the training corpus, then a 2-POS statistical language model is constructed, and finally the CHI statistic of each 2-POS pattern in the subjective and objective pattern sets is calculated according to formula (1), and the patterns are ranked by CHI value:
χ²(t_k, c_i) = N × (A·D − B·C)² / [(A + B) × (C + D) × (A + C) × (B + D)]    (1)
Wherein t_k denotes the k-th 2-POS pattern, A denotes the number of sentences belonging to class c_i that contain t_k, B denotes the number of sentences not belonging to class c_i that contain t_k, C denotes the number of sentences belonging to class c_i that do not contain t_k, D denotes the number of sentences that neither belong to class c_i nor contain t_k, and N denotes the total number of sentences in the corpus;
Subjective rules are obtained by applying statistical methods to the movie-review data set provided by Cornell University;
The subjective patterns and the subjective rules are collectively called subjective clues; the weight of a subjective clue Clue is first calculated according to formula (2):
Weight(Clue) = Max(CHI value / maximum CHI value, confidence × flag)    (2)
Wherein flag = 1 if the subjective clue carries a subjective-sentence confidence (i.e. it comes from a subjective rule), and flag = 0 otherwise;
Then the subjective clue density is calculated according to the subjective clue density defined in formula (3):
(3)
Wherein the subjective clue words contained in the sentence total n, the number of non-clue words between two adjacent subjective clue words w_i and w_i+1 is denoted distance(w_i, w_i+1), and the weight of the keyword w_i+1 in the sentence is denoted score(w_i+1);
The weight of each subjective clue word is calculated by the tf-idf method according to formula (4):
score(w_i, s_j) = tf(w_i, s_j) × log(|S| / df(w_i))    (4)
Wherein df(w_i) denotes the number of sentences containing the word w_i, |S| is the total number of sentences, and tf(w_i, s_j) denotes the number of times w_i occurs in sentence s_j;
The likelihood that a sentence is a subjective sentence is proportional to the value of SD(S);
(300) Viewpoint sentence identification: viewpoint sentences differ from subjective sentences and form a subset of the subjective sentences; to identify viewpoint sentences, a viewpoint word dictionary is first constructed, the dictionary is then used to count the viewpoint words occurring in each sentence, and from the counting results a decision tree is generated with the ID3 algorithm and used for viewpoint sentence identification;
(400) Viewpoint sentence similarity calculation: viewpoint extraction is first carried out: topics are clustered according to step (100), then for each topic the attributes describing the topic are extracted, a viewpoint word being the commendatory/derogatory word class that evaluates a given attribute; finally the weight of each word is calculated according to formula (5):
(5)
Wherein k denotes the number of parts of speech occurring in the sentence, n_i denotes the number of words of class i in the sentence, and g_i denotes the weight of the i-th class of viewpoint words.
Suppose the viewpoint word weight set of sentence A is WordSet(A) = {W_1, W_2, ..., W_n} and the viewpoint word weight set of sentence B is WordSet(B) = {W_1, W_2, ..., W_m}. If WordSet(B) contains the i-th word of WordSet(A) (1 ≤ i ≤ n), i.e. W_i ∈ WordSet(A) ∩ WordSet(B), then the i-th word co-occurs and the contribution of W_i to the similarity of sentences A and B is S_i. Likewise, if W_i does not occur in WordSet(B) but W_j (W_j ∈ WordSet(A), 1 ≤ j ≤ n) occurs in WordSet(B), i.e. W_j ∈ WordSet(A) ∩ WordSet(B), then the j-th word co-occurs and the contribution of W_j to the similarity of sentences A and B is S_j. If the i-th word and the j-th word both occur in sentence A and sentence B, then the joint contribution of W_i and W_j to the similarity of A and B is S_ij, with S_ij > S_i + S_j; the contribution of the close word pair W_i and W_j to the similarity of A and B is then S_ij − (S_i + S_j), and the similarity of W_i and W_j is inversely proportional to the value of S_ij − (S_i + S_j): the smaller S_ij − (S_i + S_j), the more similar W_i and W_j;
(500) Viewpoint sentence clustering: combining steps (100) to (400), viewpoint clustering is carried out according to formula (6),
(6)
Wherein the total contribution weight of the close word pairs of viewpoint sentence 1 with respect to viewpoint sentence 2 is used, n is the number of close word pairs, and W_i is the priority weight; not all features contribute to the similarity, and an effective pairing refers to a feature match that satisfies the priority rules; PairCount_1 is the number of words of viewpoint sentence 1, and PairCount_2 is the number of words of viewpoint sentence 2.
In a preferred embodiment of the present invention, the subjective rules in said step (200) comprise:
Rule 1: degree adverb (definitely, very, quite) → subjective sentence (0.75)
Rule 2: first-person pronoun (I, we, personally) → subjective sentence (0.85)
Rule 3: interrogative word (…, why) → subjective sentence (0.90)
Rule 4: demonstrative word (this, that, some) → subjective sentence (0.72)
Rule 5: conjunction (otherwise, and, on the contrary) → subjective sentence (0.64)
Rule 6: quotation verb (he says, he thinks) → objective sentence (1.0)
Rule 7: concept-definition core verb (is, comprises, is called, is named, is defined as) → objective sentence (0.99)
Rule 8: assertion core verb (is described as, reports, tells about) → objective sentence (0.98)
Rule 9: advocacy-class viewpoint word (think, should, determine, wish, believe) → subjective sentence (0.77)
Wherein the viewpoint words are divided into 18 classes, different classes contributing differently to the discrimination of subjective sentences, and the number in parentheses after each subjective rule is the confidence of that subjective rule.
The beneficial effects of the invention are as follows: the text classification method based on viewpoint of the present invention belongs to text classification methods at the semantic level; through a fusion model of event, viewpoint and emotion, it annotates the semantics of a text as a whole, and, by viewpoint clustering, avoids the problems of "semantic deficiency", the "curse of dimensionality" and "excessive dependence on corpora" that occur in traditional topic-based text classification.
Embodiment
Preferred embodiments of the present invention are described in detail below, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the protection scope of the present invention can be defined more clearly.
An embodiment of the present invention comprises a text classification method based on viewpoint, the specific steps of which comprise:
(100) When a topic is elaborated, the content words used are often confined to a relatively narrow range of the content involved in that topic and show a certain repetitiveness. If the words contained in two paragraphs, particularly the high-frequency nouns, repeat to a certain extent, the two paragraphs are rather similar and can be preliminarily regarded as discussing the same topic, so they should be drawn into the same topic paragraph; the words, especially high-frequency nouns, contained in paragraphs on different topics are generally not very similar, so the similarity between such paragraphs is lower and their difference larger. The division of topic paragraphs is therefore carried out first: a text is input, and the semantic similarity between every two paragraphs P_i and P_j (1 ≤ i, j ≤ n) in the text is calculated; then the candidate points P_k1, P_k2, ..., P_kr at which the topic changes are found one by one. If P_kr satisfies the division condition, it is determined to be a topic-paragraph division candidate point and the next candidate point is processed; when all topic-paragraph division candidate points have been processed, the procedure ends. If the condition is not satisfied, it is judged whether a second condition holds; if it does, P_kr is still regarded as a topic-paragraph division candidate point and the next candidate point is processed; if it does not, it is further judged whether the paragraph following the candidate point satisfies the condition, in which case P_kr is not taken as a division point. This continues until all topic-paragraph division candidate points have been processed and the topic-paragraph division points in the text are determined; all natural paragraphs in the text are then merged into several topic paragraphs, so that the text can be expressed as D = S_1 ⊕ S_2 ⊕ ... ⊕ S_n, where S_i denotes a topic paragraph,
Wherein F(P_i) = (W_i1, W_i2, ..., W_ij, ..., W_ik) is the paragraph feature vector, W_ij denotes the weight of the j-th element of the text feature-word list in paragraph i, the weight being computed from the frequency with which the word occurs in that paragraph, and k is the number of elements of the feature vector; the text feature vector is F(D) = (W_1, W_2, ..., W_l), where W_l denotes the weight of the l-th element of the text feature-word list in the text, the weight being computed from the frequency with which the word occurs in the text; k_r is the subscript of the paragraph at the r-th topic-paragraph division candidate point.
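The exact similarity formula and the division thresholds correspond to the formulas that are omitted in the source text; purely as an illustration, the following sketch assumes cosine similarity between paragraph term-frequency vectors and a single similarity threshold for proposing topic boundaries. All names (paragraph_vector, cosine_sim, split_topics) and the threshold value are hypothetical and not taken from the patent.

```python
# Hypothetical sketch of step (100): topic-paragraph division.
# Assumption: semantic similarity is the cosine of paragraph
# term-frequency vectors; a boundary is proposed wherever adjacent
# paragraphs fall below a threshold. The patent's actual formulas
# and conditions are not reproduced here.
from collections import Counter
from math import sqrt

def paragraph_vector(paragraph, feature_words):
    """Weight of each feature word = its frequency in the paragraph."""
    counts = Counter(paragraph.split())
    return [counts[w] for w in feature_words]

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_topics(paragraphs, feature_words, threshold=0.3):
    """Merge natural paragraphs into topic paragraphs S_1 ... S_n."""
    if not paragraphs:
        return []
    vecs = [paragraph_vector(p, feature_words) for p in paragraphs]
    topics, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        if cosine_sim(vecs[i - 1], vecs[i]) >= threshold:
            current.append(paragraphs[i])      # still the same topic
        else:
            topics.append(" ".join(current))   # topic boundary proposed here
            current = [paragraphs[i]]
    topics.append(" ".join(current))
    return topics
```

In the patent, candidate boundary points are further checked against additional conditions before being accepted; the single-threshold test above merely stands in for those omitted conditions.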
(200) Evaluation texts are usually mixed with a small amount of objective information, and this information affects the accuracy and quality of opinion mining to varying degrees, so separating objective information from evaluation texts becomes very important; it helps reduce the complexity of the opinion-mining problem and thus greatly improves the analysis efficiency and performance of the system. The discrimination of sentence subjectivity is therefore performed: the CHI statistical method is used to extract 2-POS subjective patterns from subjective texts and objective texts respectively. The CHI statistic measures the degree of correlation between a feature t and a class c: the higher the CHI value of feature t for a class, the greater its correlation with that class and the more class information it carries. First, word segmentation and part-of-speech tagging are performed on the sentences in the training corpus; then a 2-POS statistical language model is constructed; finally, the CHI statistic of each 2-POS pattern in the subjective and objective pattern sets is calculated according to formula (1), and the patterns are ranked by CHI value,
χ²(t_k, c_i) = N × (A·D − B·C)² / [(A + B) × (C + D) × (A + C) × (B + D)]    (1)
Wherein t_k denotes the k-th 2-POS pattern, A denotes the number of sentences belonging to class c_i that contain t_k, B denotes the number of sentences not belonging to class c_i that contain t_k, C denotes the number of sentences belonging to class c_i that do not contain t_k, D denotes the number of sentences that neither belong to class c_i nor contain t_k, and N denotes the total number of sentences in the corpus;
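As a purely illustrative sketch, the chi-square form of formula (1) reconstructed above can be computed as follows; word segmentation, POS tagging and 2-POS pattern extraction are assumed to have been done already, and the function name chi_square_2pos is hypothetical.

```python
# Hypothetical sketch of formula (1): CHI statistic of a 2-POS pattern
# for a class (subjective or objective). The counts follow the
# definitions above over the N sentences of the corpus.
def chi_square_2pos(sentence_patterns, labels, pattern, target_class):
    """sentence_patterns: one set of 2-POS patterns per sentence;
    labels: class of each sentence, e.g. 'subjective' or 'objective'."""
    N = len(sentence_patterns)
    A = B = C = D = 0
    for patterns, label in zip(sentence_patterns, labels):
        contains = pattern in patterns
        if label == target_class and contains:
            A += 1          # in class c_i, contains the k-th 2-POS pattern
        elif label != target_class and contains:
            B += 1          # not in c_i, contains the pattern
        elif label == target_class:
            C += 1          # in c_i, does not contain the pattern
        else:
            D += 1          # neither in c_i nor contains the pattern
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denominator if denominator else 0.0
```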
Subjective rules are obtained by applying statistical methods to the movie-review data set provided by Cornell University. This data set consists of film reviews, including 1000 documents each holding positive and negative attitudes, plus 5331 sentences each annotated with commendatory or derogatory polarity and 5000 sentences each annotated with subjective or objective labels; the movie-review corpus is widely used in sentiment analysis research at various granularities, such as the word, sentence and discourse levels. The subjective rules comprise:
Rule 1: degree adverb (definitely, very, quite) → subjective sentence (0.75)
Rule 2: first-person pronoun (I, we, personally) → subjective sentence (0.85)
Rule 3: interrogative word (…, why) → subjective sentence (0.90)
Rule 4: demonstrative word (this, that, some) → subjective sentence (0.72)
Rule 5: conjunction (otherwise, and, on the contrary) → subjective sentence (0.64)
Rule 6: quotation verb (he says, he thinks) → objective sentence (1.0)
Rule 7: concept-definition core verb (is, comprises, is called, is named, is defined as) → objective sentence (0.99)
Rule 8: assertion core verb (is described as, reports, tells about) → objective sentence (0.98)
Rule 9: advocacy-class viewpoint word (think, should, determine, wish, believe) → subjective sentence (0.77)
Wherein the viewpoint words are divided into 18 classes, different classes contributing differently to the discrimination of subjective sentences, and the number in parentheses after each subjective rule is the confidence of that subjective rule.
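Purely as an illustration of how rules of this kind might be applied, the sketch below encodes a few of rules 1-9 as cue-word lists with confidences; the cue lists, the substring matching and the function name apply_subjective_rules are hypothetical stand-ins, and the real method would use the original Chinese lexicon.

```python
# Hypothetical sketch: applying subjective/objective rules with confidences.
# Each rule is (cue words, label, confidence), mirroring the rule list
# above; the English cue words are illustrative placeholders.
RULES = [
    ({"definitely", "very", "quite"}, "subjective", 0.75),   # rule 1: degree adverbs
    ({"I", "we", "personally"},       "subjective", 0.85),   # rule 2: first-person pronouns
    ({"why"},                         "subjective", 0.90),   # rule 3: interrogatives
    ({"he says", "he thinks"},        "objective",  1.00),   # rule 6: quotation verbs
    ({"is defined as", "is called"},  "objective",  0.99),   # rule 7: definition core verbs
]

def apply_subjective_rules(sentence):
    """Return (label, confidence) of the strongest matching rule, or None.
    Uses crude substring matching for the sake of the sketch."""
    best = None
    for cues, label, confidence in RULES:
        if any(cue in sentence for cue in cues):
            if best is None or confidence > best[1]:
                best = (label, confidence)
    return best
```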
The subjective patterns and the subjective rules are collectively called subjective clues; the weight of a subjective clue Clue is first calculated according to formula (2):
Weight(Clue) = Max(CHI value / maximum CHI value, confidence × flag)    (2)
Wherein flag = 1 if the subjective clue carries a subjective-sentence confidence (i.e. it comes from a subjective rule), and flag = 0 otherwise;
Then the subjective clue density is calculated according to the subjective clue density defined in formula (3):
(3)
Wherein the subjective clue words contained in the sentence total n, the number of non-clue words between two adjacent subjective clue words w_i and w_i+1 is denoted distance(w_i, w_i+1), and the weight of the keyword w_i+1 in the sentence is denoted score(w_i+1);
The weight of each subjective clue word is calculated by the tf-idf method according to formula (4):
score(w_i, s_j) = tf(w_i, s_j) × log(|S| / df(w_i))    (4)
Wherein df(w_i) denotes the number of sentences containing the word w_i, |S| is the total number of sentences, and tf(w_i, s_j) denotes the number of times w_i occurs in sentence s_j;
The likelihood that a sentence is a subjective sentence is proportional to the value of SD(S);
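Formula (3) for the subjective clue density is not reproduced in the source, so the sketch below only illustrates one plausible reading under stated assumptions: the clue weight of formula (2), the tf-idf weight of formula (4) as reconstructed above, and an assumed density that rewards clue words that are both heavily weighted and close together. The density expression and all function names are assumptions, not the patent's formulas.

```python
# Hypothetical sketch of formulas (2)-(4): subjective clue weight,
# tf-idf weight of clue words, and one *assumed* form of the subjective
# clue density SD(S) (formula (3) is not given in the source text).
from math import log

def clue_weight(chi_value, max_chi, confidence, from_rule):
    """Formula (2): Weight(Clue) = Max(CHI / max CHI, confidence * flag)."""
    flag = 1 if from_rule else 0
    return max(chi_value / max_chi, confidence * flag)

def tfidf(word, sentence_tokens, all_sentence_tokens):
    """Formula (4) as reconstructed above; sentences are token lists."""
    tf = sentence_tokens.count(word)
    df = sum(1 for tokens in all_sentence_tokens if word in tokens)
    return tf * log(len(all_sentence_tokens) / df) if df else 0.0

def clue_density(clue_positions, clue_scores):
    """ASSUMED form of SD(S): the weight of each next clue word divided by
    (1 + number of non-clue words separating it from the previous one)."""
    sd = 0.0
    for i in range(1, len(clue_positions)):
        distance = clue_positions[i] - clue_positions[i - 1] - 1
        sd += clue_scores[i] / (1 + distance)
    return sd
```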
(300) Viewpoint sentence identification: viewpoint sentences differ from subjective sentences and form a subset of the subjective sentences; to identify viewpoint sentences, a viewpoint word dictionary is first constructed, the dictionary is then used to count the viewpoint words occurring in each sentence, and from the counting results a decision tree is generated with the ID3 algorithm and used for viewpoint sentence identification. Table 1 gives part of the viewpoint word dictionary;
Table 1 Part of the viewpoint word dictionary
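The patent names the ID3 algorithm for this step; as a minimal stand-in sketch, the example below uses scikit-learn's DecisionTreeClassifier with the entropy criterion (information gain), which approximates ID3 on viewpoint-word count features. The feature layout and all names are assumptions, not the patent's own implementation.

```python
# Hypothetical sketch of step (300): viewpoint-sentence identification.
# Features: counts of viewpoint words (per dictionary class) in each
# subjective sentence; an entropy-based decision tree stands in for ID3.
from sklearn.tree import DecisionTreeClassifier

def count_features(sentence_tokens, viewpoint_dictionary):
    """viewpoint_dictionary: {word_class: set_of_words}; returns one
    viewpoint-word count per word class for the given sentence."""
    return [sum(token in words for token in sentence_tokens)
            for words in viewpoint_dictionary.values()]

def train_viewpoint_classifier(train_sentences, labels, viewpoint_dictionary):
    """labels: 1 if the (subjective) sentence is a viewpoint sentence, else 0."""
    X = [count_features(s, viewpoint_dictionary) for s in train_sentences]
    clf = DecisionTreeClassifier(criterion="entropy")  # information gain, ID3-like
    clf.fit(X, labels)
    return clf
```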
(400) Viewpoint sentence similarity calculation: subjective sentences and viewpoint sentences differ in essence. A subjective sentence is a sentence describing an idea, suggestion, view or evaluation and expresses the speaker's attitude, while a viewpoint sentence is a subjective judgment on a thing or event; as can be seen from the lineage of speaker attitudes in Fig. 1, a viewpoint sentence must be a subjective sentence, but not vice versa. Since the present invention discloses a text classification method based on viewpoint, the identification of viewpoint sentences is an important link, and the key technique of viewpoint sentence identification is the careful construction of the viewpoint word dictionary. The present invention mainly uses pairs of close viewpoint words to calculate viewpoint sentence similarity. Viewpoint extraction is carried out first: topics are clustered according to step (100), then for each topic the attributes describing the topic are extracted. Attribute extraction relies on dependency-tree analysis of the viewpoint sentences; the present invention adopts the LTP platform released by the NLP group of Harbin Institute of Technology and determines the object of a viewpoint sentence, i.e. the attribute, from the modification relations of the dependency tree. A viewpoint word is the commendatory/derogatory word class that evaluates a given attribute, and the commendatory/derogatory word classes are summarized into 16 classes on the basis of a sentiment dictionary; the three elements on which a viewpoint word depends are therefore the topic, the attribute and the commendatory/derogatory word class,
After the viewpoint words have been extracted, the weight of each word is calculated according to formula (5),
(5)
Wherein k denotes the number of parts of speech occurring in the sentence, n_i denotes the number of words of class i in the sentence, and g_i denotes the weight of the i-th class of viewpoint words.
Suppose the viewpoint word weight set of sentence A is WordSet(A) = {W_1, W_2, ..., W_n} and the viewpoint word weight set of sentence B is WordSet(B) = {W_1, W_2, ..., W_m}. If WordSet(B) contains the i-th word of WordSet(A) (1 ≤ i ≤ n), i.e. W_i ∈ WordSet(A) ∩ WordSet(B), then the i-th word co-occurs and the contribution of W_i to the similarity of sentences A and B is S_i. Likewise, if W_i does not occur in WordSet(B) but W_j (W_j ∈ WordSet(A), 1 ≤ j ≤ n) occurs in WordSet(B), i.e. W_j ∈ WordSet(A) ∩ WordSet(B), then the j-th word co-occurs and the contribution of W_j to the similarity of sentences A and B is S_j. If the i-th word and the j-th word both occur in sentence A and sentence B, then the joint contribution of W_i and W_j to the similarity of A and B is S_ij, with S_ij > S_i + S_j; the contribution of the close word pair W_i and W_j to the similarity of A and B is then S_ij − (S_i + S_j), and the similarity of W_i and W_j is inversely proportional to the value of S_ij − (S_i + S_j): the smaller S_ij − (S_i + S_j), the more similar W_i and W_j;
Extraction of close word pairs: the weight of a word refers to the degree to which the word represents the concept of the sentence, i.e. a measure of the word's importance to the meaning the sentence intends to express. A word's ability to express the meaning of a sentence is related not only to the characteristics of the word itself, but also to factors such as the sentence structure, the sentence length and the grammatical function of the word; for example, the subject and a modifier differ in importance for the meaning the sentence intends to express;
Ideally, the grammatical function of each word should be analyzed and the weight of the word computed accordingly, but at the present stage a complete syntactic analysis of a sentence and a complete identification of the grammatical roles of words are impossible. On the other hand, the part of speech of a word has a certain correspondence with its grammatical role in the sentence: nouns and pronouns generally serve as subject and object, verbs generally serve as predicate, and adjectives, numerals and classifiers generally serve as attribute or adverbial, so the grammatical function information of a word can be used indirectly through its part-of-speech information to calculate the weight of the word. In Chinese text, function words generally only provide grammatical connection and do not express actual concepts, so function words are not considered in the calculation. Content words comprise nouns (N), verbs (V), adjectives (A), numerals (M), classifiers (Q), pronouns (R) and so on; classifiers are weaker at expressing concepts and pronouns generally repeat other concepts, so classifiers and pronouns are given smaller weights. According to the parts of speech, their corresponding grammatical roles and experience, the weights of the different parts of speech are given as: noun weight g_1, verb weight g_2, adjective weight g_3, numeral weight g_4, and other content-word weight g_5. In practice, nouns, verbs, adjectives and numerals do not necessarily all appear in one sentence, so in the calculation the weight of a word class is taken as the ratio of the weight of that class to the sum of the weights of all the word classes occurring in the sentence, words of the same class being considered to have the same ability to express the meaning of the sentence. The importance of a word to the meaning a sentence intends to express is also affected by sentence length: the longer the sentence and the more words it contains, the smaller the role each word plays in expressing the sentence's meaning,
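Formula (5) itself is not reproduced in the source; the following sketch only illustrates the normalization described in the preceding paragraph, i.e. dividing the base weight of a part-of-speech class by the sum of the weights of the classes that actually occur in the sentence. The base weights g_1..g_5 and the function name are assumptions.

```python
# Hypothetical sketch of the part-of-speech word weighting used in the
# viewpoint sentence similarity calculation. Base weights g1..g5 are
# illustrative; the normalization follows the description above.
BASE_WEIGHTS = {"N": 1.0, "V": 0.8, "A": 0.7, "M": 0.5, "OTHER": 0.3}  # assumed g1..g5

def pos_word_weights(tagged_sentence):
    """tagged_sentence: list of (word, pos_class); returns {word: weight},
    each weight being the class weight divided by the sum of the weights
    of the part-of-speech classes that occur in this sentence."""
    classes = [pos if pos in BASE_WEIGHTS else "OTHER" for _, pos in tagged_sentence]
    total = sum(BASE_WEIGHTS[c] for c in set(classes)) or 1.0
    return {word: BASE_WEIGHTS[c] / total
            for (word, _), c in zip(tagged_sentence, classes)}
```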
(500) Viewpoint sentence clustering: combining steps (100) to (400), the semantic granularity of the text is transformed from coarse to fine: topic paragraph → subjective sentence → viewpoint sentence → close word pair; on this basis, viewpoint clustering is carried out according to formula (6),
(6)
Wherein the total contribution weight of the close word pairs of viewpoint sentence 1 with respect to viewpoint sentence 2 is used, n is the number of close word pairs, and W_i is the priority weight; not all features contribute to the similarity, and an effective pairing refers to a feature match that satisfies the priority rules; PairCount_1 is the number of words of viewpoint sentence 1, and PairCount_2 is the number of words of viewpoint sentence 2.
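Formula (6) is likewise not reproduced in the source. Purely as an illustration, the sketch below assumes that the similarity of two viewpoint sentences is the priority-weighted sum of the contributions of their effectively paired close words, normalized by the word counts of the two sentences, and then clusters sentences greedily by a similarity threshold; every name and the threshold value are assumptions.

```python
# Hypothetical sketch of step (500): viewpoint sentence clustering.
# Assumed similarity: weighted contributions of effective close-word
# pairs, normalized by the word counts of the two sentences; formula
# (6) itself is not given in the source text.
def sentence_similarity(pair_contributions, count1, count2):
    """pair_contributions: list of (priority_weight, contribution) for each
    effective close-word pair between the two viewpoint sentences."""
    total = sum(w * c for w, c in pair_contributions)
    return 2.0 * total / (count1 + count2) if (count1 + count2) else 0.0

def cluster_viewpoint_sentences(sentences, pair_fn, threshold=0.5):
    """Greedy clustering: each sentence joins the first cluster whose
    representative sentence is similar enough, otherwise starts a new cluster.
    pair_fn(s, t) must return (pair_contributions, word_count_s, word_count_t)."""
    clusters = []                      # each cluster is a list of sentence indices
    for i, sentence in enumerate(sentences):
        placed = False
        for cluster in clusters:
            representative = sentences[cluster[0]]
            pairs, c1, c2 = pair_fn(sentence, representative)
            if sentence_similarity(pairs, c1, c2) >= threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters
```

Because the number of clusters is not fixed in advance, this greedy scheme reflects the patent's point that the number of viewpoint classes is dynamic rather than limited to approval, opposition and neutrality.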
Compared with the prior art, the text classification method based on viewpoint of the present invention belongs to text classification methods at the semantic level; through a fusion model of event, viewpoint and emotion, it annotates the semantics of a text as a whole, and, by viewpoint sentence clustering, avoids the problems of "semantic deficiency", the "curse of dimensionality" and "excessive dependence on corpora" that occur in traditional topic-based text classification. Table 2 gives a comparison of three classification methods.
Table 2 Comparison of three classification methods
The above are only embodiments of the present invention and do not thereby limit the scope of the claims of the present invention; any equivalent structural or process transformation made using the contents of the specification of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (3)

1. A text classification method based on viewpoint, characterized in that the specific steps comprise:
(100) Division of topic paragraphs: first a text is input, and the semantic similarity between every two paragraphs P_i and P_j (1 ≤ i, j ≤ n) in the text is calculated; then the candidate points P_k1, P_k2, ..., P_kr at which the topic changes are found one by one. If P_kr satisfies the division condition, it is determined to be a topic-paragraph division candidate point and the next candidate point is processed; when all topic-paragraph division candidate points have been processed, the procedure ends. If the condition is not satisfied, it is judged whether a second condition holds; if it does, P_kr is still regarded as a topic-paragraph division candidate point and the next candidate point is processed; if it does not, it is further judged whether the paragraph following the candidate point satisfies the condition, in which case P_kr is not taken as a division point. This continues until all topic-paragraph division candidate points have been processed and the topic-paragraph division points in the text are determined; all natural paragraphs in the text are then merged into several topic paragraphs, so that the text can be expressed as D = S_1 ⊕ S_2 ⊕ ... ⊕ S_n, where S_i denotes a topic paragraph;
Wherein F(P_i) = (W_i1, W_i2, ..., W_ij, ..., W_ik) is the paragraph feature vector, W_ij denotes the weight of the j-th element of the text feature-word list in paragraph i, the weight being computed from the frequency with which the word occurs in that paragraph, and k is the number of elements of the feature vector; the text feature vector is F(D) = (W_1, W_2, ..., W_l), where W_l denotes the weight of the l-th element of the text feature-word list in the text, the weight being computed from the frequency with which the word occurs in the text; k_r is the subscript of the paragraph at the r-th topic-paragraph division candidate point;
(200) Discrimination of sentence subjectivity: the CHI statistical method is used to extract 2-POS subjective patterns from subjective texts and objective texts respectively; first, word segmentation and part-of-speech tagging are performed on the sentences in the training corpus, then a 2-POS statistical language model is constructed, and finally the CHI statistic of each 2-POS pattern in the subjective and objective pattern sets is calculated according to formula (1), and the patterns are ranked by CHI value:
χ²(t_k, c_i) = N × (A·D − B·C)² / [(A + B) × (C + D) × (A + C) × (B + D)]    (1)
Wherein t_k denotes the k-th 2-POS pattern, A denotes the number of sentences belonging to class c_i that contain t_k, B denotes the number of sentences not belonging to class c_i that contain t_k, C denotes the number of sentences belonging to class c_i that do not contain t_k, D denotes the number of sentences that neither belong to class c_i nor contain t_k, and N denotes the total number of sentences in the corpus;
Subjective rules are obtained by applying statistical methods to the movie-review data set provided by Cornell University;
The subjective patterns and the subjective rules are collectively called subjective clues; the weight of a subjective clue Clue is first calculated according to formula (2):
Weight(Clue) = Max(CHI value / maximum CHI value, confidence × flag)    (2)
Wherein flag = 1 if the subjective clue carries a subjective-sentence confidence (i.e. it comes from a subjective rule), and flag = 0 otherwise;
Then the subjective clue density is calculated according to the subjective clue density defined in formula (3):
(3)
Wherein the subjective clue words contained in the sentence total n, the number of non-clue words between two adjacent subjective clue words w_i and w_i+1 is denoted distance(w_i, w_i+1), and the weight of the keyword w_i+1 in the sentence is denoted score(w_i+1);
The weight of each subjective clue word is calculated by the tf-idf method according to formula (4):
score(w_i, s_j) = tf(w_i, s_j) × log(|S| / df(w_i))    (4)
Wherein df(w_i) denotes the number of sentences containing the word w_i, |S| is the total number of sentences, and tf(w_i, s_j) denotes the number of times w_i occurs in sentence s_j;
The likelihood that a sentence is a subjective sentence is proportional to the value of SD(S);
(300) Viewpoint sentence identification: viewpoint sentences differ from subjective sentences and form a subset of the subjective sentences; to identify viewpoint sentences, a viewpoint word dictionary is first constructed, the dictionary is then used to count the viewpoint words occurring in each sentence, and from the counting results a decision tree is generated with the ID3 algorithm and used for viewpoint sentence identification;
(400) Viewpoint sentence similarity calculation: viewpoint extraction is first carried out: topics are clustered according to step (100), then for each topic the attributes describing the topic are extracted, a viewpoint word being the commendatory/derogatory word class that evaluates a given attribute; finally the weight of each word is calculated according to formula (5):
(5)
Wherein k denotes the number of parts of speech occurring in the sentence, n_i denotes the number of words of class i in the sentence, and g_i denotes the weight of the i-th class of viewpoint words.
2. Suppose the viewpoint word weight set of sentence A is WordSet(A) = {W_1, W_2, ..., W_n} and the viewpoint word weight set of sentence B is WordSet(B) = {W_1, W_2, ..., W_m}. If WordSet(B) contains the i-th word of WordSet(A) (1 ≤ i ≤ n), i.e. W_i ∈ WordSet(A) ∩ WordSet(B), then the i-th word co-occurs and the contribution of W_i to the similarity of sentences A and B is S_i. Likewise, if W_i does not occur in WordSet(B) but W_j (W_j ∈ WordSet(A), 1 ≤ j ≤ n) occurs in WordSet(B), i.e. W_j ∈ WordSet(A) ∩ WordSet(B), then the j-th word co-occurs and the contribution of W_j to the similarity of sentences A and B is S_j. If the i-th word and the j-th word both occur in sentence A and sentence B, then the joint contribution of W_i and W_j to the similarity of A and B is S_ij, with S_ij > S_i + S_j; the contribution of the close word pair W_i and W_j to the similarity of A and B is then S_ij − (S_i + S_j), and the similarity of W_i and W_j is inversely proportional to the value of S_ij − (S_i + S_j): the smaller S_ij − (S_i + S_j), the more similar W_i and W_j;
(500) Viewpoint sentence clustering: combining steps (100) to (400), viewpoint clustering is carried out according to formula (6),
(6)
Wherein the total contribution weight of the close word pairs of viewpoint sentence 1 with respect to viewpoint sentence 2 is used, n is the number of close word pairs, and W_i is the priority weight; not all features contribute to the similarity, and an effective pairing refers to a feature match that satisfies the priority rules; PairCount_1 is the number of words of viewpoint sentence 1, and PairCount_2 is the number of words of viewpoint sentence 2.
3. The text classification method based on viewpoint according to claim 1, characterized in that the subjective rules in said step (200) comprise:
Rule 1: degree adverb (definitely, very, quite) → subjective sentence (0.75)
Rule 2: first-person pronoun (I, we, personally) → subjective sentence (0.85)
Rule 3: interrogative word (…, why) → subjective sentence (0.90)
Rule 4: demonstrative word (this, that, some) → subjective sentence (0.72)
Rule 5: conjunction (otherwise, and, on the contrary) → subjective sentence (0.64)
Rule 6: quotation verb (he says, he thinks) → objective sentence (1.0)
Rule 7: concept-definition core verb (is, comprises, is called, is named, is defined as) → objective sentence (0.99)
Rule 8: assertion core verb (is described as, reports, tells about) → objective sentence (0.98)
Rule 9: advocacy-class viewpoint word (think, should, determine, wish, believe) → subjective sentence (0.77)
Wherein the viewpoint words are divided into 18 classes, different classes contributing differently to the discrimination of subjective sentences, and the number in parentheses after each subjective rule is the confidence of that subjective rule.
CN201410434035.6A 2014-08-29 2014-08-29 Text classification method based on viewpoint Pending CN104331394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410434035.6A CN104331394A (en) 2014-08-29 2014-08-29 Text classification method based on viewpoint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410434035.6A CN104331394A (en) 2014-08-29 2014-08-29 Text classification method based on viewpoint

Publications (1)

Publication Number Publication Date
CN104331394A true CN104331394A (en) 2015-02-04

Family

ID=52406123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410434035.6A Pending CN104331394A (en) 2014-08-29 2014-08-29 Text classification method based on viewpoint

Country Status (1)

Country Link
CN (1) CN104331394A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408883A (en) * 2008-11-24 2009-04-15 电子科技大学 Method for collecting network public feelings viewpoint
CN103116644A (en) * 2013-02-26 2013-05-22 华南理工大学 Method for mining orientation of Web themes and supporting decisions

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
VIRENDRA KUMAR GUPTA等: "Multi-Document Summarization Using Sentence Clustering", 《IEEE PROCEEDINGS OF 4TH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN COMPUTER INTERACTION》 *
XIN WANG 等: "Chinese Subjectivity Detection using a Sentiment Density-Based Naive Bayesian Classifier", 《PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON MACHINE AND CYBERNETICS》 *
倪茂树: "基于语义理解的观点评论挖掘研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
傅间莲 等: "基于连续段落相似度的主题划分算法", 《计算机应用》 *
刘亚亮等: "一种优化的AP-CAPSA中文文本结构分析算法", 《计算机应用研究》 *
张玉娟: "基于《知网》的句子相似度计算的研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
杨武等: "中文微博情感分析中主客观句分类方法", 《重庆理工大学学报》 *
陈旻等: "观点挖掘综述", 《浙江大学学报(工学版)》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608068A (en) * 2014-11-17 2016-05-25 三星电子株式会社 Display apparatus and method for summarizing of document
CN106202200A (en) * 2016-06-28 2016-12-07 昆明理工大学 A kind of emotion tendentiousness of text sorting technique based on fixing theme
CN106202200B (en) * 2016-06-28 2019-09-27 昆明理工大学 A kind of emotion tendentiousness of text classification method based on fixed theme
CN106294636A (en) * 2016-08-01 2017-01-04 中国电子科技集团公司第二十八研究所 A kind of search rank algorithm based on database data
CN106294636B (en) * 2016-08-01 2019-03-19 中国电子科技集团公司第二十八研究所 A kind of search rank method based on database data
CN106407999A (en) * 2016-08-25 2017-02-15 北京物思创想科技有限公司 Rule combined machine learning method and system
CN108334513A (en) * 2017-01-20 2018-07-27 阿里巴巴集团控股有限公司 A kind of identification processing method of Similar Text, apparatus and system
CN107247868A (en) * 2017-05-18 2017-10-13 深思考人工智能机器人科技(北京)有限公司 A kind of artificial intelligence aids in interrogation system
CN107247868B (en) * 2017-05-18 2020-05-12 深思考人工智能机器人科技(北京)有限公司 Artificial intelligence auxiliary inquiry system
CN109033041A (en) * 2017-06-09 2018-12-18 北京国双科技有限公司 The treating method and apparatus of document similarity
CN109871856B (en) * 2017-12-04 2022-03-04 北京京东尚科信息技术有限公司 Method and device for optimizing training sample
CN109871856A (en) * 2017-12-04 2019-06-11 北京京东尚科信息技术有限公司 A kind of method and apparatus optimizing training sample
CN110738046B (en) * 2018-07-03 2023-06-06 百度在线网络技术(北京)有限公司 Viewpoint extraction method and apparatus
CN110738046A (en) * 2018-07-03 2020-01-31 百度在线网络技术(北京)有限公司 Viewpoint extraction method and device
CN109241297A (en) * 2018-07-09 2019-01-18 广州品唯软件有限公司 A kind of classifying content polymerization, electronic equipment, storage medium and engine
CN110162781A (en) * 2019-04-09 2019-08-23 国金涌富资产管理有限公司 A kind of finance text subjectivity sentence automatic identifying method
CN109977418A (en) * 2019-04-09 2019-07-05 南瑞集团有限公司 A kind of short text method for measuring similarity based on semantic vector
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 A kind of text cutting method based on subject information
CN110399489A (en) * 2019-07-08 2019-11-01 厦门市美亚柏科信息股份有限公司 A kind of chat data segmentation method, device and storage medium
CN110399489B (en) * 2019-07-08 2022-06-17 厦门市美亚柏科信息股份有限公司 Chat data segmentation method, device and storage medium
CN111046282A (en) * 2019-12-06 2020-04-21 贝壳技术有限公司 Text label setting method, device, medium and electronic equipment
CN111046282B (en) * 2019-12-06 2021-04-16 北京房江湖科技有限公司 Text label setting method, device, medium and electronic equipment
CN111178043A (en) * 2019-12-31 2020-05-19 武汉优聘科技有限公司 Method and system for recognizing academic viewpoint sentence
CN113326411A (en) * 2020-02-28 2021-08-31 中国移动通信集团福建有限公司 Network behavior knowledge enhancement method and device and electronic equipment
CN112131863A (en) * 2020-08-04 2020-12-25 中科天玑数据科技股份有限公司 Comment opinion theme extraction method, electronic equipment and storage medium
CN112464646A (en) * 2020-11-23 2021-03-09 中国船舶工业综合技术经济研究院 Text emotion analysis method for defense intelligence library in national defense field
CN112905766A (en) * 2021-02-09 2021-06-04 长沙冉星信息科技有限公司 Method for extracting core viewpoints from subjective answer text

Similar Documents

Publication Publication Date Title
CN104331394A (en) Text classification method based on viewpoint
Ghosh et al. Fracking sarcasm using neural network
CN106598944B (en) A kind of civil aviaton's security public sentiment sentiment analysis method
Xue et al. A study on sentiment computing and classification of sina weibo with word2vec
Roberts et al. Investigating the emotional responses of individuals to urban green space using twitter data: A critical comparison of three different methods of sentiment analysis
Mukherjee et al. Effect of negation in sentences on sentiment analysis and polarity detection
Nagy et al. Crowd sentiment detection during disasters and crises.
Shi et al. Sentiment analysis of Chinese microblogging based on sentiment ontology: a case study of ‘7.23 Wenzhou Train Collision’
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN107305539A (en) A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN103544246A (en) Method and system for constructing multi-emotion dictionary for internet
CN103729456B (en) Microblog multi-modal sentiment analysis method based on microblog group environment
CN103995803A (en) Fine granularity text sentiment analysis method
Shahheidari et al. Twitter sentiment mining: A multi domain analysis
Ramírez-Tinoco et al. A brief review on the use of sentiment analysis approaches in social networks
Ahmed et al. A novel approach for Sentimental Analysis and Opinion Mining based on SentiWordNet using web data
CN105869058B (en) A kind of method that multilayer latent variable model user portrait extracts
Mozafari et al. Emotion detection by using similarity techniques
Chen et al. Sentiment classification of tourism based on rules and LDA topic model
KR101326313B1 (en) Method of classifying emotion from multi sentence using context information
Karampatsis et al. AUEB: Two stage sentiment analysis of social network messages
Fiarni et al. Implementing rule-based and naive bayes algorithm on incremental sentiment analysis system for Indonesian online transportation services review
Quan et al. Automatic Annotation of Word Emotion in Sentences Based on Ren-CECps.
Keyan et al. Multi-document and multi-lingual summarization using neural networks
Wen et al. A new analysis method for user reviews of mobile fitness apps

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150204