CN108268668B - Topic diversity-based text data viewpoint abstract mining method

Info

Publication number: CN108268668B
Application number: CN201810166896.9A
Authority: CN
Inventors: 廖祥文; 陈国龙; 赵楠; 杨定达
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2018-02-28
Filing date: 2018-02-28
Publication date: 2022-01-18
Anticipated expiration: 2038-02-28
Also published as: CN108268668A

Abstract

The invention provides a topic diversity-based method for mining viewpoint summaries from text data, comprising the following steps: step S1: preprocessing the topic texts; step S2: inputting a topic corpus and a background corpus; step S3: extracting the topic attributes of the topic corpus; step S4: attaching sentiment polarities to the extracted topic attributes for use in sentence vectorization; step S5: taking the extracted topic attributes as evaluation objects, analyzing the sentiment polarity of each evaluation object contained in a sentence with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the sentiment attribute features contained in the sentence, and converting each sentence into a feature vector; step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5. The method obtains viewpoint summaries of topic texts efficiently and accurately and can be applied to larger-scale data sets.

Description

Topic diversity-based text data viewpoint abstract mining method

Technical Field

The invention relates to the fields of text summarization and sentiment analysis, and in particular to a method that generates, for massive topic text data from Chinese microblog corpora, a brief viewpoint summary rich in user sentiment information. The viewpoint summary accurately covers the key content discussed in the texts and can be applied in practical scenarios such as news summaries and product review summaries.

Background

Many technical approaches are currently available in the field of opinion summarization. Conventional viewpoint summarization models include graph models and ranking models. Representative graph-model methods include TextRank, PageRank and LexRank: sentences serve as nodes, some relation between sentences serves as the edge weight, sentence scores are updated iteratively by a random walk model, and a certain number of high-scoring sentences are combined into the viewpoint summary. Ranking models either construct a sentence scoring function from factors such as the diversity and redundancy of the viewpoint summary, or rank sentences by relative score using KL divergence or the MMR (maximal marginal relevance) method, and then obtain the viewpoint summary from the ranking. Both kinds of model ignore the finer-grained topic attributes of the text: they approximate the diversity of the text's subject matter by the diversity of all words in the text and do not consider the influence of the subject keywords on the viewpoint summary, which limits follow-up research on these models to some extent.

At present, researchers in China and abroad also study viewpoint summarization models based on generative models and on submodular functions. Generative approaches work well, but the time complexity of solving them is too high: even on a small data set they take several times as long as other methods, so they cannot be applied to practical big-data scenarios. Viewpoint summarization methods based on submodular functions rely on the properties of submodularity to guarantee that the solution found by a greedy algorithm is at least 63% of the optimal solution, and the greedy algorithm weighs several factors when selecting sentences. Although the experimental results are relatively good, the manually constructed corpus tree that these methods rely on is not suitable for wider application scenarios.
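The 63% figure presumably corresponds to the classical approximation guarantee for greedy maximization of a submodular function. As a hedged restatement (the symbols $f$, $S_{\text{greedy}}$, $S^{*}$ and $k$ are introduced here for illustration and do not appear in the source): for a non-negative, monotone, submodular set function $f$ maximized under a cardinality constraint of $k$ elements, the greedy solution satisfies

```latex
f\left(S_{\text{greedy}}\right) \;\ge\; \left(1 - \frac{1}{e}\right) f\left(S^{*}\right) \;\approx\; 0.632\, f\left(S^{*}\right),
\qquad |S_{\text{greedy}}| \le k,\quad S^{*} = \arg\max_{|S| \le k} f(S).
```

The bound depends on monotonicity and submodularity; it does not hold for arbitrary scoring functions.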

In general, a viewpoint summary has two basic properties: 1) the summary should cover the subject matter of the topic texts; 2) the subject matter it covers should carry rich sentiment. Most existing models try to make the viewpoint summary cover the text's subject matter through the diversity of all the words in the text sentences. Word diversity, however, cannot guarantee that the viewpoint summary covers the subject matter of the source text, and words unrelated to the subject affect the generated summary. In addition, existing methods describe the sentiment of the summary by the sentiment of whole sentences, which also takes into account sentiment directed at content unrelated to the text's subject matter. As a result, the final summary contains much content and sentiment information unrelated to the text's subject matter.

A more efficient and accurate viewpoint summarization method is therefore urgently needed: one that extracts topic attribute words from the source text by an entity extraction method as the keywords of the text's subject matter, uses sentiment analysis to determine the sentiment expressed in each sentence toward the topic attributes as evaluation objects, and selects sentences for the viewpoint summary by a topic attribute diversity method that also incorporates sentence importance, so that the viewpoint summary as a whole covers as much sentiment-bearing subject matter as possible.

Disclosure of Invention

The invention aims to solve the problem of compressing massive viewpoint text data. It provides a topic diversity-based viewpoint summarization method that addresses the problems of current methods by working from topic attributes and their sentiment information; the method obtains the viewpoint summary of topic texts efficiently and accurately and can be applied to larger-scale data sets.

To achieve this purpose, the invention adopts the following technical scheme: a topic diversity-based method for mining viewpoint summaries from text data, comprising the following steps: step S1: preprocessing the topic texts, filtering out irrelevant texts that have no substantive content or significance, as well as common stop words; step S2: inputting a topic corpus and a background corpus; step S3: extracting the topic attributes of the topic corpus; step S4: attaching sentiment polarities, namely positive sentiment and negative sentiment, to the topic attributes obtained in step S3, so that the positive topic attributes and the negative topic attributes serve as sentiment attribute features for vectorizing sentences; step S5: taking the topic attributes obtained in step S3 as evaluation objects, analyzing the sentiment polarity of each evaluation object contained in a sentence with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the sentiment attribute features contained in the sentence, where a feature value is 1 if the sentence contains the corresponding sentiment attribute feature and 0 otherwise, so that each sentence is converted into a feature vector by means of the topic attributes and the sentiment analysis method; step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5.

In one embodiment of the present invention, the filtering rules in step S1 are as follows: (1) remove web page links from the comment sentences; (2) remove comment sentences shorter than 3 characters; (3) remove common irrelevant words from the comment sentences; (4) convert all English letters uniformly to lowercase or uppercase.

In an embodiment of the present invention, step S2 comprises: for the preprocessed texts, setting the texts of the current topic as the topic corpus and the texts of other topics as the background corpus.

In an embodiment of the present invention, in step S3 the log-likelihood ratio of each word in the topic corpus is calculated by the log-likelihood ratio method, and the words are filtered by a threshold to extract the topic attributes of the topic corpus; each word must be a noun, adjective, verb or numeral.
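The text does not give the exact log-likelihood ratio formula, so the following is only a minimal sketch under stated assumptions: it uses the common two-corpus form of the statistic (observed versus expected counts of a word in the topic and background corpora), assumes tokens arrive as (word, part-of-speech) pairs, and uses the hypothetical tag set `n`, `a`, `v`, `m` for noun, adjective, verb and numeral. The default threshold of 10.83 is an illustrative value, not one taken from the source.

```python
import math
from collections import Counter

# Hypothetical POS tags for noun, adjective, verb, numeral (an assumption).
ALLOWED_POS = {"n", "a", "v", "m"}


def log_likelihood_ratio(a, b, c, d):
    """Two-corpus log-likelihood ratio.

    a: count of the word in the topic corpus
    b: count of the word in the background corpus
    c: total number of tokens in the topic corpus
    d: total number of tokens in the background corpus
    """
    expected_topic = c * (a + b) / (c + d)
    expected_background = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_topic)
    if b > 0:
        total += b * math.log(b / expected_background)
    return 2.0 * total


def extract_topic_attributes(topic_tokens, background_tokens, threshold=10.83):
    """Return {word: score} for words that pass the POS filter and the threshold."""
    topic_counts = Counter(w for w, pos in topic_tokens if pos in ALLOWED_POS)
    background_counts = Counter(w for w, _ in background_tokens)
    c, d = len(topic_tokens), len(background_tokens)
    attributes = {}
    for word, a in topic_counts.items():
        b = background_counts.get(word, 0)
        # Keep only words that are over-represented in the topic corpus.
        if a / c <= b / d:
            continue
        score = log_likelihood_ratio(a, b, c, d)
        if score >= threshold:
            attributes[word] = score
    return attributes


topic = [("phone", "n")] * 20 + [("the", "u")] * 30
background = [("the", "u")] * 50 + [("phone", "n")]
attributes = extract_topic_attributes(topic, background)
```

In this toy input, `phone` is far more frequent in the topic corpus than in the background corpus and is kept, while `the` is rejected by the part-of-speech filter.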

In an embodiment of the present invention, step S4 specifically comprises: taking the obtained topic attributes as evaluation objects, analyzing the sentiment polarity of each evaluation object in the sentences with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, and attaching positive and negative sentiment polarities to the topic attributes to obtain positive topic attributes and negative topic attributes respectively.

Further, the sentiment-bearing topic attributes obtained in step S4 are taken as sentiment topic attributes and serve as the features for sentence feature vectorization in step S6. The multi-evaluation-object-oriented dynamic word sequence sentiment analysis method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, the word sequence of the sentence is cut by the dynamic word sequence method to obtain the word sequence of each evaluation object contained in the sentence; second, the sentiment words in the word sequence of each evaluation object are matched against the sentiment dictionary, the sentiment tendency of the evaluation object is calculated from the polarity and weight of the matched sentiment words, and the sentence feature vector is obtained from the topic attributes in the sentence and their sentiment polarities. The dynamic word sequence procedure is briefly as follows:

Step S41: determine the positions of the evaluation objects in the sentence. For each sentence, the topic attributes in the topic attribute set are taken as evaluation objects, and their positions are determined by scanning the sentence from front to back, so that the positions are in ascending order.

Step S42: taking the position of an evaluation object in the sentence as the center, expand to the left and to the right until a punctuation mark or another evaluation object is encountered. If a punctuation mark is encountered on the left or on the right, the words between the punctuation mark and the evaluation object are taken as the left word sequence or the right word sequence. If another evaluation object is encountered on the left or on the right, the midpoint of the position coordinates of the two evaluation objects is taken, and the words between the midpoint and the evaluation object are taken as the left word sequence or the right word sequence.

Step S43: the above steps yield a left word sequence and a right word sequence for each evaluation object; combining the two gives the complete word sequence of that evaluation object. The evaluation object itself is removed from the sequence during sentiment analysis. The two parameters that bound the range of the sequence have no fixed values: they change dynamically and differ from one word sequence to another. In addition, any two dynamic word sequences must satisfy the condition that they do not overlap, i.e. the two word sequences share no words.
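Steps S41 to S43 can be sketched as follows. This is only an illustration under stated assumptions: the sentence is already tokenized, the punctuation set is a small hypothetical sample, and the handling of the midpoint (the midpoint token is assigned to the evaluation object on the left) is one possible reading that keeps neighbouring sequences from overlapping; the source does not specify these details.

```python
# Hypothetical punctuation set; a real system would use a fuller list.
PUNCTUATION = set("，。！？；,.!?;")


def dynamic_word_sequences(tokens, evaluation_objects):
    """Return (evaluation object, position, word sequence) triples for one sentence."""
    result = []
    # S41: scan front to back so that positions come out in ascending order.
    for p, token in enumerate(tokens):
        if token not in evaluation_objects:
            continue
        # S42, left side: stop at punctuation, or cut at the midpoint
        # when another evaluation object is met.
        i = p - 1
        while i >= 0:
            if tokens[i] in PUNCTUATION:
                break
            if tokens[i] in evaluation_objects:
                i = (i + p) // 2
                break
            i -= 1
        left = tokens[i + 1:p]
        # S42, right side: same rule, mirrored.
        j = p + 1
        while j < len(tokens):
            if tokens[j] in PUNCTUATION:
                break
            if tokens[j] in evaluation_objects:
                j = (p + j) // 2 + 1
                break
            j += 1
        right = tokens[p + 1:j]
        # S43: combine both sides; the evaluation object itself is left out.
        result.append((token, p, left + right))
    return result


tokens = ["screen", "is", "clear", "but", "battery", "is", "weak", ",",
          "price", "is", "fair"]
sequences = dynamic_word_sequences(tokens, {"screen", "battery", "price"})
# sequences == [("screen", 0, ["is", "clear"]),
#               ("battery", 4, ["but", "is", "weak"]),
#               ("price", 8, ["is", "fair"])]
```

The boundaries are recomputed for every evaluation object, which is what makes the sequences "dynamic": their extent depends on where punctuation and neighbouring evaluation objects fall in each sentence.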

In an embodiment of the present invention, a diversity objective function is constructed in step S6. The objective function ensures that the selected set of a given number of sentences contains the largest number of sentiment topic attributes; this sentence set is taken as the viewpoint summary, which ensures that the final viewpoint summary has the best diversity.

In an embodiment of the present invention, the objective function is constructed as follows. First, a viewpoint sentence scoring function that fuses topic diversity with sentence importance is constructed: the scoring function considers the difference in topic attributes between a sentence and the current summary set and incorporates the importance of the sentence, which is computed from the topic attribute weights and the topic attributes the sentence contains. At each selection, the sentence that most increases the topic diversity of the viewpoint summary set is added to the viewpoint summary, where the number of sentences is limited to at most 20 or is limited by a given compression ratio.

Compared with the prior art, the invention has the following beneficial effects:

1. The data are preprocessed, which widens the method's applicability: the raw data are cleaned and irrelevant texts are filtered out, so that the topic attributes extracted by the topic attribute extraction method are more accurate. The method can be applied not only to Chinese microblogs but also to website news and product reviews.

2. Only the positive and negative sentiments of topic attributes are considered; neutral sentiment is not discussed. By working from topic attributes and their sentiment information, the method addresses the problems of current methods, obtains the viewpoint summary of topic texts efficiently and accurately, and can be applied to larger-scale data sets.

Drawings

FIG. 1 is a schematic view of the main process of the present invention.

Detailed Description

The invention is further explained below with reference to the figures and the specific embodiments.

The invention provides a topic diversity-based method for mining viewpoint summaries from text data, comprising the following steps: step S1: preprocessing the topic texts, filtering out irrelevant texts that have no substantive content or significance, as well as common stop words; step S2: inputting a topic corpus and a background corpus; step S3: extracting the topic attributes of the topic corpus; step S4: attaching sentiment polarities, namely positive sentiment and negative sentiment, to the topic attributes obtained in step S3, so that the positive topic attributes and the negative topic attributes serve as sentiment attribute features for vectorizing sentences; step S5: taking the topic attributes obtained in step S3 as evaluation objects, analyzing the sentiment polarity of each evaluation object contained in a sentence with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the sentiment attribute features contained in the sentence, where a feature value is 1 if the sentence contains the corresponding sentiment attribute feature and 0 otherwise, so that each sentence is converted into a feature vector by means of the topic attributes and the sentiment analysis method; step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5.

In one embodiment of the present invention, the filtering rule in step S1 is as follows:

(1) Web page links in the comment sentences, such as "http://t.cn/RcwWYQZ", are removed.

(2) Comment sentences shorter than 3 characters are removed; such sentences carry too little information, most of them consist of emoticons, and they contain no other useful information.

(3) Common irrelevant words in the comment sentences, such as "group pictures" and "original text forwarding", are removed.

(4) All English letters are converted uniformly to lowercase.
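The four filtering rules can be sketched as a small preprocessing function. This is a minimal sketch, not the patented implementation: the URL pattern, the phrase list (taken from the translated examples above, whereas real microblog data would use the original Chinese phrases) and the order in which the rules are applied are all assumptions.

```python
import re

# Assumed pattern for web links such as "http://t.cn/RcwWYQZ".
URL_PATTERN = re.compile(r"https?://\S+")
# Example phrases from the text above; hypothetical stand-ins for the real list.
IRRELEVANT_PHRASES = ("group pictures", "original text forwarding")
MIN_LENGTH = 3


def preprocess(comments):
    """Apply rules (1) to (4) and return the surviving comment sentences."""
    cleaned = []
    for text in comments:
        text = URL_PATTERN.sub("", text)       # rule (1): remove web links
        text = text.lower()                    # rule (4): unify letter case
        for phrase in IRRELEVANT_PHRASES:      # rule (3): remove irrelevant words
            text = text.replace(phrase, "")
        text = " ".join(text.split())          # collapse leftover whitespace
        if len(text) >= MIN_LENGTH:            # rule (2): drop very short sentences
            cleaned.append(text)
    return cleaned


sample = [
    "Great PHONE http://t.cn/RcwWYQZ",
    "ok",
    "original text forwarding nice Screen",
]
result = preprocess(sample)
# result == ["great phone", "nice screen"]
```

The length check is applied last here so that a sentence consisting only of a link or a forwarding marker is also discarded; the source lists the rules without fixing their order.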

In an embodiment of the present invention, step S4 specifically comprises: taking the obtained topic attributes as evaluation objects, analyzing the sentiment polarity of each evaluation object in the sentences with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, and attaching positive and negative sentiment polarities to the topic attributes to obtain positive topic attributes and negative topic attributes respectively. All topic attributes with positive sentiment form one set, and all topic attributes with negative sentiment form another set. The topic attributes carrying sentiment are taken as sentiment topic attributes and used as the features for sentence feature vectorization in the next step.

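Further, the sentiment-bearing topic attributes obtained in step S4 are taken as sentiment topic attributes and serve as the features for sentence feature vectorization in step S6. The multi-evaluation-object-oriented dynamic word sequence sentiment analysis method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, the word sequence of the sentence is cut by the dynamic word sequence method to obtain the word sequence of each evaluation object contained in the sentence; second, the sentiment words in the word sequence of each evaluation object are matched against the sentiment dictionary, the sentiment tendency of the evaluation object is calculated from the polarity and weight of the matched sentiment words, and the sentence feature vector is obtained from the topic attributes in the sentence and their sentiment polarities. The dynamic word sequence procedure is briefly as follows:

Step S41: determine the positions of the evaluation objects in the sentence. For each sentence, the topic attributes in the topic attribute set are taken as evaluation objects, and their positions are determined by scanning the sentence from front to back, so that the positions are in ascending order.

Step S42: taking the position of an evaluation object in the sentence as the center, expand to the left and to the right until a punctuation mark or another evaluation object is encountered. If a punctuation mark is encountered on the left or on the right, the words between the punctuation mark and the evaluation object are taken as the left word sequence or the right word sequence. If another evaluation object is encountered on the left or on the right, the midpoint of the position coordinates of the two evaluation objects is taken, and the words between the midpoint and the evaluation object are taken as the left word sequence or the right word sequence.

Step S43: the above steps yield a left word sequence and a right word sequence for each evaluation object; combining the two gives the complete word sequence of that evaluation object. Any two dynamic word sequences must satisfy the condition that they do not overlap, i.e. the two word sequences share no words.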

The method rests on two basic assumptions: 1) the topic attributes of a text are its central idea and main theme; 2) the same topic attribute carrying different sentiments counts as different attributes, and each of them belongs to the subject matter and main discussion content of the text. The method comprises: a topic attribute extraction method based on the log-likelihood ratio; a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, in which sentences are cut according to the topic attributes by the dynamic word sequence method to obtain the subsequence of each topic attribute; a sentence vectorization method based on sentiment topic attributes, in which sentences are vectorized using the sentiment-bearing topic attributes as features; and a topic diversity-based viewpoint sentence selection method, which maximizes the topic diversity of the resulting viewpoint summary by considering both topic diversity and sentence importance. FIG. 1 is a schematic main flow chart of an embodiment of the present invention.

Step S1: preprocess the microblog corpus to remove irrelevant words so that they do not affect the extraction of topic attributes.

Step S2: input the topic corpus and the background corpus, where the background corpus consists of the corpora of other topics.

Step S3: extract the topic attributes of the topic corpus by the log-likelihood ratio method.

Step S4: attach sentiment polarities, namely positive sentiment and negative sentiment, to the topic attributes obtained in step S3, so that the positive topic attributes and the negative topic attributes serve as sentiment attribute features for vectorizing sentences.

Step S5: take the topic attributes obtained in step S3 as evaluation objects and analyze the sentiment polarity of each evaluation object contained in a sentence with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the sentiment attribute features contained in the sentence. A feature value is 1 if the sentence contains the corresponding sentiment attribute feature and 0 otherwise, so that each sentence is converted into a feature vector by means of the topic attributes and the sentiment analysis method. The multi-evaluation-object-oriented dynamic word sequence method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, the word sequence of the sentence is cut by the dynamic word sequence method to obtain the word sequence of each evaluation object contained in the sentence; second, the sentiment words in the word sequence of each evaluation object are matched against the sentiment dictionary, the sentiment tendency of the evaluation object is calculated from the polarity and weight of the matched sentiment words, and the sentence feature vector is obtained from the topic attributes in the sentence and their sentiment polarities.
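The dictionary matching and the 0/1 feature vectorization of step S5 can be sketched as below. The sketch makes several assumptions that are not in the source: the sentiment dictionary is a plain mapping from word to a signed weight (polarity times weight), the sentiment tendency is the sum of the matched weights, a zero sum yields no feature (neutral sentiment is not modeled), and the feature order is positive then negative for each topic attribute.

```python
def polarity(word_sequence, lexicon):
    """Return "pos", "neg" or None from the summed signed weights of matched words."""
    score = sum(lexicon.get(word, 0.0) for word in word_sequence)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None  # no sentiment word matched, or the weights cancel out


def sentence_vector(object_sequences, topic_attributes, lexicon):
    """Build the binary feature vector of one sentence.

    object_sequences: list of (evaluation object, word sequence) pairs,
    e.g. the output of a dynamic word sequence segmentation step.
    """
    features = [(attr, p) for attr in topic_attributes for p in ("pos", "neg")]
    index = {feature: i for i, feature in enumerate(features)}
    vector = [0] * len(features)
    for obj, sequence in object_sequences:
        p = polarity(sequence, lexicon)
        if p is not None and (obj, p) in index:
            vector[index[(obj, p)]] = 1
    return vector


# Hypothetical toy dictionary and sentence.
lexicon = {"clear": 1.0, "weak": -1.0, "fair": 0.5}
attributes = ["screen", "battery", "price"]
sequences = [
    ("screen", ["is", "clear"]),
    ("battery", ["but", "is", "weak"]),
    ("price", ["is", "fair"]),
]
vector = sentence_vector(sequences, attributes, lexicon)
# vector == [1, 0, 0, 1, 1, 0]
```

Each topic attribute contributes two features, so a corpus with n topic attributes gives sentence vectors of length 2n.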

Step S6: construct a diversity objective function from the sentence feature vectors obtained in step S5. The objective function ensures that the selected set of a given number of sentences contains the largest number of sentiment topic attributes; this sentence set is taken as the viewpoint summary, which ensures that the final viewpoint summary has the best diversity. To optimize the objective function, sentences are selected for the viewpoint summary by a topic diversity-based viewpoint sentence selection method: first, a viewpoint sentence scoring function that fuses topic diversity with sentence importance is constructed; then, at each selection, the sentence that most increases the topic diversity of the viewpoint summary set is added to the viewpoint summary, where the number of sentences is limited to at most 20 or is limited by a given compression ratio.
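The source does not state the scoring function explicitly, so the following greedy selection is only a sketch under assumptions: diversity is measured as the total weight of sentiment topic attributes that a sentence adds to the summary, importance as the total weight of all attributes the sentence contains, and the two are combined with a hypothetical mixing parameter `alpha`. Selection stops at the sentence limit or when no remaining sentence adds a new attribute.

```python
def select_summary(vectors, weights, max_sentences=20, alpha=0.5):
    """Greedily pick sentence indices that add the most new sentiment topic attributes.

    vectors: binary feature vectors, one per sentence.
    weights: positive weight of each feature (topic attribute weight).
    alpha:   hypothetical weight of sentence importance in the score.
    """
    covered = [False] * len(weights)
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < max_sentences:
        best, best_score = None, -1.0
        for i in candidates:
            # Diversity gain: weight of attributes not yet in the summary.
            gain = sum(w for x, c, w in zip(vectors[i], covered, weights)
                       if x and not c)
            if gain == 0:
                continue  # the sentence adds nothing new
            # Importance: weight of every attribute the sentence contains.
            importance = sum(w for x, w in zip(vectors[i], weights) if x)
            score = gain + alpha * importance
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        selected.append(best)
        candidates.remove(best)
        covered = [c or bool(x) for c, x in zip(covered, vectors[best])]
    return selected


vectors = [
    [1, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
]
summary = select_summary(vectors, weights=[1.0] * 6)
# summary == [0, 2]
```

A compression-ratio limit could be expressed by setting `max_sentences` to a fraction of the number of input sentences.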

The above are preferred embodiments of the present invention. Any change made according to the technical scheme of the present invention whose functional effect does not go beyond the scope of that technical scheme falls within the protection scope of the present invention.

Claims

1. A topic diversity-based text data viewpoint summary mining method, characterized by comprising the following steps:

step S1: preprocessing the topic texts, filtering out irrelevant texts that have no substantive content or significance, as well as common stop words;

step S2: inputting a topic corpus and a background corpus;

step S3: extracting topic attributes of the topic corpus;

step S4: attaching sentiment polarities, namely positive sentiment and negative sentiment, to the topic attributes obtained in step S3, so that the positive topic attributes and the negative topic attributes serve as sentiment attribute features for vectorizing sentences;

step S4 comprises the following steps:

taking the obtained topic attributes as evaluation objects, analyzing the sentiment polarity of each evaluation object in the sentences with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, and attaching a positive or negative sentiment polarity to the topic attributes to obtain positive topic attributes and negative topic attributes respectively;

step S5: taking the topic attributes obtained in step S3 as evaluation objects, analyzing the sentiment polarity of each evaluation object contained in a sentence with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the sentiment attribute features contained in the sentence, where a feature value is 1 if the sentence contains the corresponding sentiment attribute feature and 0 otherwise, so that each sentence is converted into a feature vector by means of the topic attributes and the sentiment analysis method;

the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, cutting the word sequence of the sentence by the dynamic word sequence method to obtain the word sequence of each evaluation object contained in the sentence; second, matching the sentiment words in the word sequence of each evaluation object against the sentiment dictionary, calculating the sentiment tendency of the evaluation object from the polarity and weight of the matched sentiment words, and obtaining the sentence feature vector from the topic attributes in the sentence and their sentiment polarities;

step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5;

in step S6 a diversity objective function is constructed, wherein the objective function ensures that the selected set of a given number of sentences contains the largest number of sentiment topic attributes, and this sentence set is taken as the viewpoint summary, which ensures that the final viewpoint summary has the best diversity;

the construction of the objective function comprises the following steps: first, constructing a viewpoint sentence scoring function that fuses topic diversity with sentence importance, wherein the scoring function considers the difference in topic attributes between a sentence and the summary set and incorporates the importance of the sentence, the importance of the sentence being computed from the topic attribute weights and the topic attributes the sentence contains; then, at each selection, adding to the viewpoint summary the sentence that most increases the topic diversity of the viewpoint summary set, wherein the number of sentences is limited to at most 20 or is limited by a given compression ratio.

2. The topic diversity-based text data viewpoint summary mining method of claim 1, wherein: the filtering rules in step S1 are as follows:

(1) removing web page links from the comment sentences;

(2) removing comment sentences shorter than 3 characters;

(3) removing common irrelevant words from the comment sentences;

(4) converting all English letters uniformly to lowercase or uppercase.

3. The topic diversity-based text data viewpoint summary mining method of claim 1, wherein: step S2 comprises: for the preprocessed texts, setting the texts of the current topic as the topic corpus and the texts of other topics as the background corpus.

4. The topic diversity-based text data viewpoint summary mining method of claim 1, wherein: in step S3 the log-likelihood ratio of each word in the topic corpus is calculated by the log-likelihood ratio method, and the words are filtered by a threshold to extract the topic attributes of the topic corpus, wherein each word must be a noun, adjective, verb or numeral.

5. The topic diversity-based text data viewpoint summary mining method of claim 1, wherein:

the sentiment-bearing topic attributes obtained in step S4 are taken as sentiment topic attributes and used as the features for sentence feature vectorization in step S6;

the dynamic word sequence procedure is briefly as follows:

step S41: determining the positions of the evaluation objects in the sentence; for each sentence, the topic attributes in the topic attribute set are taken as evaluation objects, and their positions are determined by scanning the sentence from front to back, so that the positions are in ascending order;

step S42: taking the position of an evaluation object in the sentence as the center, expanding to the left and to the right until a punctuation mark or another evaluation object is encountered;

if a punctuation mark is encountered on the left or on the right, the words between the punctuation mark and the evaluation object are taken as the left word sequence or the right word sequence;

if another evaluation object is encountered on the left or on the right, the midpoint of the position coordinates of the two evaluation objects is taken, and the words between the midpoint and the evaluation object are taken as the left word sequence or the right word sequence;

step S43: the above steps yield a left word sequence and a right word sequence for each evaluation object, wherein any two dynamic word sequences satisfy the condition that they do not overlap, i.e. the two word sequences share no words.