CN113722482A - News comment opinion sentence identification method - Google Patents

News comment opinion sentence identification method

Info

Publication number
CN113722482A
Authority
CN
China
Prior art keywords
news
sentence
text
sentences
comment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110981244.2A
Other languages
Chinese (zh)
Inventor
王红斌
李伊仝
线岩团
相艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110981244.2A priority Critical patent/CN113722482A/en
Publication of CN113722482A publication Critical patent/CN113722482A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying opinion sentences in news comments. The method first extracts several key sentences from the news text with the TextRank algorithm and assembles them into a short abstract; it then feeds each comment on the news together with the news-abstract information into a BERT model to obtain a fused text representation; finally, the fused representation is passed into a fully connected layer, whose output a softmax function converts into the probability that the comment sentence is an opinion sentence. Compared with popular deep-learning text-classification models of recent years, the method reaches 84.01% accuracy, demonstrating its effectiveness, and experiments on the NLPCC 2012 microblog opinion-sentence identification data set verify that the model has a degree of generalization ability.

Description

News comment opinion sentence identification method
Technical Field
The invention belongs to the field of opinion mining in natural language processing, and particularly relates to a method for identifying opinion sentences in news comments.
Background
According to the NLPCC 2012 definition of an opinion sentence, any sentence expressing an evaluation of a specific thing or object is an opinion sentence, while a sentence that merely conveys emotion, willingness, or mood is not. Opinion-sentence recognition is treated here as a binary classification task: each sentence in a comment is classified, with label Y denoting an opinion sentence and label N a non-opinion sentence. Conventional classification methods generally classify the comment alone; in news-comment opinion-sentence recognition, however, we find that the opinions users express are strongly correlated with the news content, so the news text information cannot be ignored. In recent years the BERT pre-trained model, thanks to its strong text-representation capability, has achieved the best performance on downstream tasks such as question answering and text classification and has attracted enormous attention in NLP. This invention focuses on opinion-sentence recognition for news comments; because BERT cannot handle long texts such as full news articles well, a method combining the TextRank algorithm with the BERT model is proposed. The method first extracts several key sentences from the news text with TextRank and assembles them into a short abstract, then feeds each comment together with the news-abstract information into BERT to obtain a fused text representation; the fused representation is passed into a fully connected layer, whose output an activation function converts into the probability that the comment is an opinion sentence, and fusing the news-abstract information improves the recognition of opinion sentences.
Disclosure of Invention
The invention focuses on opinion-sentence recognition for news comments. Because the BERT model cannot handle long texts such as full news articles well, a method combining the TextRank algorithm with BERT is proposed. The method first extracts several key sentences from the news text with TextRank and assembles them into a short abstract; each news comment is then fed together with the news-abstract information into BERT to obtain a fused text representation; finally, the fused representation is passed into a fully connected layer, whose output an activation function converts into the probability that the comment is an opinion sentence. Fusing the news-abstract information improves the recognition of opinion sentences.
In order to achieve these technical effects, the invention provides a news-comment opinion-sentence recognition method, realized by the following technical scheme and characterized by comprising the following steps:
S1: extracting the news texts and their corresponding news comments from the data set;
S2: extracting n key sentences from the news text with the TextRank algorithm and composing the n key sentences into an abstract;
S3: feeding the news-abstract information and the news-comment text into a pre-trained BERT model to obtain a fused text representation;
S4: passing the fused text representation into a fully connected layer and converting the layer's output into the probability of being an opinion sentence with a softmax activation function;
Preferably, the TextRank step extracts key sentences from the news text to serve as its abstract: a news text can usually be summarized by a small number of sentences, and their representation still carries part of the news's semantic information. When extracting key sentences, each sentence in the text is treated as a node; if two sentences are similar, an undirected weighted edge is placed between their corresponding nodes. The similarity between sentences is measured as shown in formula (1):
Similarity(V_i, V_j) = |{w_k | w_k ∈ V_i ∧ w_k ∈ V_j}| / (log|V_i| + log|V_j|), (1)
wherein V_i and V_j denote the nodes of sentences i and j and also stand for the sentences' word sets; w_k denotes the k-th word in a sentence. The numerator counts the words w_k that appear in both sentences, while the denominator sums the logarithms of the two sentences' word counts, which effectively limits the advantage long sentences would otherwise enjoy in the similarity computation;
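As an illustrative sketch (not part of the patent text), the similarity measure of formula (1) can be computed as follows; the function name is hypothetical, and the guard for the zero denominator produced by two one-word sentences is an added assumption:

```python
import math

def sentence_similarity(sent_i, sent_j):
    # Formula (1): shared-word count divided by the log-sum of the two
    # sentence lengths, which damps the advantage of long sentences.
    overlap = len(set(sent_i) & set(sent_j))
    denom = math.log(len(sent_i)) + math.log(len(sent_j))
    if denom <= 0:  # added guard: two one-word sentences give log 1 + log 1 = 0
        return 0.0
    return overlap / denom
```

With sentences of three words sharing two of them, the score is 2 / (log 3 + log 3).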
Preferably, after formula (1) yields the similarity between two nodes, edges with low similarity are removed and a node-connection graph is constructed; the TextRank score of each node is then computed as shown in formula (2):
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} (w_ji / Σ_{V_k ∈ Out(V_j)} w_jk) × WS(V_j), (2)
wherein WS(V_i) is the TextRank score of node V_i and WS(V_j) that of node V_j; d is the damping coefficient, which prevents any node's score from becoming 0; In(V_i) is the set of nodes linking into V_i, Out(V_j) is the set of nodes that V_j links out to, and V_k is the k-th node of that out-link set; w_ji is the weight of the edge from V_j to V_i. Each sentence distributes its score to the other sentences in proportion to each edge weight's share of the total weight. All sentence scores are initialized to 1; each sentence's score is iterated several times with formula (2), and the score at convergence is the sentence's final score;
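The iterative scoring of formula (2) can be sketched as follows; the damping value d = 0.85, the fixed iteration count, and the use of sentence-similarity values directly as edge weights are illustrative assumptions:

```python
def textrank_scores(weights, d=0.85, iters=50):
    # Formula (2): each node's score is (1 - d) plus d times the sum, over
    # in-linking nodes j, of (w_ji / sum of j's out-link weights) * WS(V_j).
    n = len(weights)
    ws = [1.0] * n                      # all scores initialized to 1
    for _ in range(iters):
        new = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                if j == i or weights[j][i] == 0:
                    continue
                out_sum = sum(weights[j][k] for k in range(n) if k != j)
                if out_sum > 0:
                    total += weights[j][i] / out_sum * ws[j]
            new.append((1 - d) + d * total)
        ws = new
    return ws
```

On a fully symmetric graph every sentence receives exactly what it gives away, so all scores stay at the fixed point 1.0.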
Preferably, the final TextRank scores of the sentences are sorted in descending order, and the n highest-scoring sentences are taken as the abstract Abs of the news text, as shown in formula (3), where WS(V) denotes the final scores of all the news sentences;
Abs = Top_n(WS(V)), (3)
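Formula (3) amounts to a top-n selection over the final scores; in this sketch the chosen sentences are restored to their original document order for readability, an assumption not stated in the text:

```python
def build_summary(sentences, scores, n=4):
    # Formula (3): keep the n sentences with the highest final TextRank
    # scores as the abstract Abs (n = 4 per the parameter study below).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])  # restore original order (an added assumption)
    return [sentences[i] for i in chosen]
```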
Preferably, in S3 the extracted text abstract Abs serves as a background-knowledge supplement for the comment sentences of the corresponding news and is input into the text-representation layer together with each comment sentence, deeply fusing the semantic information of the comment with that of the corresponding news text. BERT is a pre-trained model with stable and excellent performance in the field of natural language processing, through which a high-quality text representation can be obtained. Specifically, the comment sentence Obj is input into the BERT model together with the news abstract Abs, as shown in formula (4):
R=BERT[Obj,Abs], (4)
wherein Obj is the comment sentence, Abs is the extracted text abstract, and R is the fused text representation produced by the BERT model;
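Formula (4) corresponds to BERT's standard sentence-pair packing, which can be sketched as follows; the token strings, the segment-id convention, and the maximum length of 100 (taken from the experimental settings) are illustrative, and a real run would use a BERT tokenizer's vocabulary:

```python
def build_bert_input(comment_tokens, summary_tokens, max_len=100):
    # Formula (4): pack the comment Obj (segment A) and the abstract Abs
    # (segment B) into one BERT input: [CLS] Obj [SEP] Abs [SEP].
    tokens = ["[CLS]"] + comment_tokens + ["[SEP]"] + summary_tokens + ["[SEP]"]
    segment_ids = [0] * (len(comment_tokens) + 2) + [1] * (len(summary_tokens) + 1)
    return tokens[:max_len], segment_ids[:max_len]
```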
Preferably, in S4, after the text representation fused with the news-abstract information is obtained, it is passed into the fully connected layer, and the softmax activation function converts the layer's output into the probability that the input is an opinion sentence, as shown in formula (5):
y=softmax(WR+b), (5)
where W is the weight matrix and b is the bias term.
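A minimal sketch of the classification head of formula (5): a fully connected layer over the fused representation R followed by a numerically stable softmax over the two classes (opinion vs. non-opinion); the shapes and values are illustrative, not the trained parameters:

```python
import math

def classify(fused_repr, W, b):
    # Formula (5): y = softmax(W R + b), where W is the weight matrix
    # and b the bias term of the fully connected layer.
    logits = [sum(w * r for w, r in zip(row, fused_repr)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```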
The beneficial effects of the invention are as follows: the invention provides a news-comment opinion-sentence identification method combining the TextRank algorithm with BERT. Abstract information is extracted from the news text with the TextRank algorithm and fed together with the comment text into a BERT model to obtain a fused text representation; the fused representation is then passed into a fully connected layer to compute the probability that the comment sentence is an opinion sentence. Experiments demonstrate the effectiveness and a degree of generalization ability of the invention.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram of a model of the present invention.
Detailed Description
The scheme of the invention is explained in detail below through specific examples.
example 1
The method comprises the following specific steps:
S1: about 200 news articles and their corresponding comments were crawled from online news; the 4 articles with the most comments, totaling 5000 comments, form the data set. After manual labeling, the label distribution is as shown in the table below, where Y denotes an opinion sentence and N a non-opinion sentence. Finally, the data set was randomly split into training, validation, and test sets at a 6:2:2 ratio. The distribution is shown in Table 1:
Table 1. Data set distribution

Category    Training set    Validation set    Test set
Y           1100            360               270
N           1900            640               730
Total       3000            1000              1000

S2: the maximum text sequence length is set to 100, the batch size to 24, the learning rate to 5e-5, and the number of epochs to 30;
S3: in addition to the data set collected by the crawler, the public NLPCC 2012 opinion-sentence recognition data set is used to test the model's generalization ability.
evaluation index
In order to verify the effect of the news-comment opinion-sentence identification model, the common indices Accuracy, Precision, Recall, and F1 are adopted as evaluation metrics; the formulas of the four indices are as follows:
Accuracy = (TP + TN) / N
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
wherein TP denotes the number of samples the model correctly predicts as opinion sentences, TN the number correctly predicted as non-opinion sentences, FP the number incorrectly predicted as opinion sentences, FN the number incorrectly predicted as non-opinion sentences, and N the total number of comment samples, with N = TP + TN + FP + FN.
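The four metrics can be computed directly from the confusion counts; this sketch mirrors the formulas above, with zero-division guards added as an assumption:

```python
def metrics(tp, tn, fp, fn):
    # Accuracy, Precision, Recall, F1 from the confusion counts,
    # with N = TP + TN + FP + FN; the zero guards are an addition.
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```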
Example 2
In order to verify the effectiveness of the invention, the following three groups of experiments are designed:
Experiment 1: selecting the number n of key sentences for TextRank. The essence of the TextRank algorithm is to score and rank each sentence in the text, select the top n as key sentences, and compose those n sentences into an abstract. A parameter-selection experiment is therefore set up to explore the influence of different key-sentence counts on the model's performance. The results are shown in Table 2:
Table 2. Influence of different key-sentence counts on model performance
(table reproduced only as an image in the original publication)
From the parameter-selection experiment, performance is best when the number of key sentences is 4, so the model sets n = 4 when composing the news-abstract information. The results for the other key-sentence counts show that a larger number of key sentences is not necessarily better, which may be related to the fact that the BERT model itself does not handle long text sequences well.
Experiment 2: comparative experiments on classification models. The method of the invention is compared with the TextCNN, TextRNN, Transformer, and BERT models. To further illustrate its effectiveness, the news-abstract information was also incorporated into the TextCNN, TextRNN, and Transformer models, respectively, for comparison. The results are shown in Table 3:
Table 3. Comparative experimental results
(table reproduced only as an image in the original publication)
In the comparative experiment, the model achieves the highest accuracy among the baseline models such as TextCNN, TextRNN, and Transformer, illustrating the effectiveness of fusing news-abstract information. In precision and F1 it is second only to TextCNN, suggesting that TextCNN is better suited to the short-text classification task, although TextCNN cannot capture the word-order and positional information of the text. Compared with BERT, all four indices improve, showing that fusing news-abstract information can raise the BERT model's performance.
After the news-abstract information is fused into the baseline models, all four indices of TextRNN, Transformer, and BERT improve, showing that fusing news-abstract information can effectively boost model performance. Performance degrades for TextCNN, possibly because TextCNN cannot capture the long-distance dependencies of the text well once the sequence becomes long.
The model here falls short in accuracy of the TextCNN and TextRNN models with news-abstract information fused in, and it also scores lower in recall and F1 than the TextCNN, TextRNN, and Transformer baselines with abstract information. Two reasons are possible: first, an abstract composed of key sentences ignores the context of the original news sentences, which affects model performance; second, fusing the abstract information lengthens the text sequence, and BERT by its nature does not handle long sequences well.
Experiment 3: experiments on a public data set. The model and the baseline models are also tested on the public NLPCC 2012 microblog opinion-sentence recognition data set; although the domain differs, this serves to verify the model's generalization ability. The test results are shown in Table 4:
Table 4. Experiments on the public data set
Model         Accuracy    Precision    Recall    F1
TextCNN       0.7155      0.7050       0.7155    0.6974
TextRNN       0.6464      0.5948       0.6464    0.5097
Transformer   0.6499      0.6320       0.6499    0.6352
BERT          0.7967      0.8425       0.8425    0.8425
Our model     0.4976      0.7597       0.3242    0.4544
From the results, although the model here scores lower in accuracy, recall, and F1, the reason may be that the public data set covers more domains while our data set covers relatively few, so the model has not learned sufficiently rich features; its precision, however, is second only to BERT, indicating that the model has a degree of generalization ability.

Claims (7)

1. A method for identifying opinion sentences in news comments, characterized by comprising the following steps:
S1: extracting the news texts and their corresponding news comments from the data set;
S2: extracting n key sentences from the news text with the TextRank algorithm and composing the n key sentences into an abstract;
S3: feeding the news-abstract information and the news-comment text into a pre-trained BERT model to obtain a fused text representation;
S4: passing the fused text representation into a fully connected layer and converting the layer's output into the probability of being an opinion sentence with a softmax activation function.
2. The method as claimed in claim 1, wherein the TextRank algorithm extracts key sentences from the news text to serve as its abstract: a news text can usually be summarized by a small number of sentences, and their representation still carries part of the news's semantic information; when extracting key sentences, each sentence in the text is treated as a node; if two sentences are similar, an undirected weighted edge is placed between their corresponding nodes, and the similarity between sentences is measured as shown in formula (1):
Similarity(V_i, V_j) = |{w_k | w_k ∈ V_i ∧ w_k ∈ V_j}| / (log|V_i| + log|V_j|), (1)
wherein V_i and V_j denote the nodes of sentences i and j and also stand for the sentences' word sets; w_k denotes the k-th word in a sentence. The numerator counts the words w_k that appear in both sentences, while the denominator sums the logarithms of the two sentences' word counts, which effectively limits the advantage long sentences would otherwise enjoy in the similarity computation.
3. The method for identifying news-comment opinion sentences according to claim 2, wherein after formula (1) yields the similarity between two nodes, edges with low similarity are removed and a node-connection graph is constructed, and the TextRank score of each node is computed as shown in formula (2):
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} (w_ji / Σ_{V_k ∈ Out(V_j)} w_jk) × WS(V_j), (2)
wherein WS(V_i) is the TextRank score of node V_i and WS(V_j) that of node V_j; d is the damping coefficient, which prevents any node's score from becoming 0; In(V_i) is the set of nodes linking into V_i, Out(V_j) is the set of nodes that V_j links out to, and V_k is the k-th node of that out-link set; w_ji is the weight of the edge from V_j to V_i. Each sentence distributes its score to the other sentences in proportion to each edge weight's share of the total weight. All sentence scores are initialized to 1; each sentence's score is iterated several times with formula (2), and the score at convergence is the sentence's final score.
4. The method for identifying news-comment opinion sentences according to claim 3, wherein the final TextRank scores of the sentences are sorted in descending order, and the n highest-scoring sentences are taken as the abstract Abs of the news text, as shown in formula (3), where WS(V) denotes the final scores of all the news sentences;
Abs = Top_n(WS(V)). (3)
5. The method for identifying news-comment opinion sentences according to claim 1, wherein in S3 the extracted text abstract Abs serves as a background-knowledge supplement for the comment sentences of the corresponding news and is input into the text-representation layer together with each comment sentence, deeply fusing the semantic information of the comment with that of the corresponding news text; BERT is a pre-trained model with stable and excellent performance in the field of natural language processing, through which a high-quality text representation can be obtained; specifically, the comment sentence Obj is input into the BERT model together with the news abstract Abs, as shown in formula (4):
R=BERT[Obj,Abs], (4)
wherein Obj is a comment sentence, Abs is an extracted text abstract, and R is a fused text representation obtained by training a BERT model.
6. The method according to claim 1, wherein in S4, after the text representation fused with the news-abstract information is obtained, it is passed into the fully connected layer, and the softmax activation function converts the layer's output into the probability that the input is an opinion sentence, as shown in formula (5):
y=softmax(WR+b), (5)
where W is the weight matrix and b is the bias term.
7. Use of the news-comment opinion-sentence identification method according to any one of claims 1 to 6 in the field of opinion mining in natural language processing.
CN202110981244.2A 2021-08-25 2021-08-25 News comment opinion sentence identification method Pending CN113722482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110981244.2A CN113722482A (en) 2021-08-25 2021-08-25 News comment opinion sentence identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110981244.2A CN113722482A (en) 2021-08-25 2021-08-25 News comment opinion sentence identification method

Publications (1)

Publication Number Publication Date
CN113722482A 2021-11-30

Family

ID=78677872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110981244.2A Pending CN113722482A (en) 2021-08-25 2021-08-25 News comment opinion sentence identification method

Country Status (1)

Country Link
CN (1) CN113722482A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070100863A1 (en) * 2005-10-27 2007-05-03 Newsdb, Inc. Newsmaker verification and commenting method and system
CN110263153A (en) * 2019-05-15 2019-09-20 北京邮电大学 Mixing text topic towards multi-source information finds method
CN111008274A (en) * 2019-12-10 2020-04-14 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN111680120A (en) * 2020-04-30 2020-09-18 中国科学院信息工程研究所 News category detection method and system
CN112115256A (en) * 2020-09-15 2020-12-22 大连大学 Method and device for generating news text abstract integrated with Chinese stroke information



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211130)