CN113722482A - News comment opinion sentence identification method - Google Patents

News comment opinion sentence identification method

Info

Publication number
CN113722482A
Authority
CN
China
Prior art keywords
news
sentence
text
sentences
comment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110981244.2A
Other languages
Chinese (zh)
Inventor
王红斌
李伊仝
线岩团
相艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110981244.2A priority Critical patent/CN113722482A/en
Publication of CN113722482A publication Critical patent/CN113722482A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for identifying opinion sentences in news comments. The method first extracts several key sentences from the news text with the TextRank algorithm and assembles them into a short abstract; it then feeds each comment on the news together with the news-abstract information into a BERT model to obtain a fused text representation; finally, the fused representation is passed into a fully connected layer, whose output a softmax function converts into the probability that the comment sentence is an opinion sentence. Compared with popular deep-learning text-classification models of recent years, the method reaches 84.01% accuracy, demonstrating its effectiveness, and experiments on the NLPCC 2012 microblog opinion-sentence identification data set verify that the model has a degree of generalization ability.

Description

News comment opinion sentence identification method
Technical Field
The invention belongs to the field of opinion mining in natural language processing, and particularly relates to a method for identifying opinion sentences in news comments.
Background
According to the NLPCC 2012 definition of an opinion sentence, any sentence expressing an evaluation of a specific thing or object is an opinion sentence, while a sentence that merely conveys emotion, willingness, or mood is not. Opinion-sentence recognition is treated here as a binary classification task: each sentence in a comment is classified, with label Y denoting an opinion sentence and label N a non-opinion sentence. Conventional classification methods generally classify the comment alone; in news-comment opinion-sentence recognition, however, we find that the opinions users express are strongly correlated with the news content, so the news text information cannot be ignored. In recent years the BERT pre-trained model, thanks to its strong text-representation capability, has achieved the best performance on downstream tasks such as question answering and text classification and has attracted enormous attention in NLP. This invention focuses on opinion-sentence recognition for news comments; because BERT cannot handle long texts such as full news articles well, a method combining the TextRank algorithm with the BERT model is proposed. The method first extracts several key sentences from the news text with TextRank and assembles them into a short abstract, then feeds each comment together with the news-abstract information into BERT to obtain a fused text representation; the fused representation is passed into a fully connected layer, whose output an activation function converts into the probability that the comment is an opinion sentence, and fusing the news-abstract information improves the recognition of opinion sentences.
Disclosure of Invention
The invention focuses on opinion-sentence recognition for news comments. Because the BERT model cannot handle long texts such as full news articles well, a method combining the TextRank algorithm with BERT is proposed. The method first extracts several key sentences from the news text with TextRank and assembles them into a short abstract; each news comment is then fed together with the news-abstract information into BERT to obtain a fused text representation; finally, the fused representation is passed into a fully connected layer, whose output an activation function converts into the probability that the comment is an opinion sentence. Fusing the news-abstract information improves the recognition of opinion sentences.
In order to achieve these technical effects, the invention provides a news-comment opinion-sentence recognition method, realized by the following technical scheme and characterized by comprising the following steps:
S1: extracting the news texts and their corresponding news comments from the data set;
S2: extracting n key sentences from the news text with the TextRank algorithm and composing the n key sentences into an abstract;
S3: feeding the news-abstract information and the news-comment text into a pre-trained BERT model to obtain a fused text representation;
S4: passing the fused text representation into a fully connected layer and converting the layer's output into the probability of being an opinion sentence with a softmax activation function;
Preferably, the TextRank step extracts key sentences from the news text to serve as its abstract: a news text can usually be summarized by a small number of sentences, and their representation still carries part of the news's semantic information. When extracting key sentences, each sentence in the text is treated as a node; if two sentences are similar, an undirected weighted edge is placed between their corresponding nodes. The similarity between sentences is measured as shown in formula (1):
Similarity(V_i, V_j) = |{w_k | w_k ∈ V_i ∧ w_k ∈ V_j}| / (log|V_i| + log|V_j|), (1)
wherein V_i and V_j denote the nodes of sentences i and j and also stand for the sentences' word sets; w_k denotes the k-th word in a sentence. The numerator counts the words w_k that appear in both sentences, while the denominator sums the logarithms of the two sentences' word counts, which effectively limits the advantage long sentences would otherwise enjoy in the similarity computation;
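As an illustrative sketch (not part of the patent text), the similarity measure of formula (1) can be computed as follows; the function name is hypothetical, and the guard for the zero denominator produced by two one-word sentences is an added assumption:

```python
import math

def sentence_similarity(sent_i, sent_j):
    # Formula (1): shared-word count divided by the log-sum of the two
    # sentence lengths, which damps the advantage of long sentences.
    overlap = len(set(sent_i) & set(sent_j))
    denom = math.log(len(sent_i)) + math.log(len(sent_j))
    if denom <= 0:  # added guard: two one-word sentences give log 1 + log 1 = 0
        return 0.0
    return overlap / denom
```

With sentences of three words sharing two of them, the score is 2 / (log 3 + log 3).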
Preferably, after formula (1) yields the similarity between two nodes, edges with low similarity are removed and a node-connection graph is constructed; the TextRank score of each node is then computed as shown in formula (2):
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} (w_ji / Σ_{V_k ∈ Out(V_j)} w_jk) × WS(V_j), (2)
wherein WS(V_i) is the TextRank score of node V_i and WS(V_j) that of node V_j; d is the damping coefficient, which prevents any node's score from becoming 0; In(V_i) is the set of nodes linking into V_i, Out(V_j) is the set of nodes that V_j links out to, and V_k is the k-th node of that out-link set; w_ji is the weight of the edge from V_j to V_i. Each sentence distributes its score to the other sentences in proportion to each edge weight's share of the total weight. All sentence scores are initialized to 1; each sentence's score is iterated several times with formula (2), and the score at convergence is the sentence's final score;
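The iterative scoring of formula (2) can be sketched as follows; the damping value d = 0.85, the fixed iteration count, and the use of sentence-similarity values directly as edge weights are illustrative assumptions:

```python
def textrank_scores(weights, d=0.85, iters=50):
    # Formula (2): each node's score is (1 - d) plus d times the sum, over
    # in-linking nodes j, of (w_ji / sum of j's out-link weights) * WS(V_j).
    n = len(weights)
    ws = [1.0] * n                      # all scores initialized to 1
    for _ in range(iters):
        new = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                if j == i or weights[j][i] == 0:
                    continue
                out_sum = sum(weights[j][k] for k in range(n) if k != j)
                if out_sum > 0:
                    total += weights[j][i] / out_sum * ws[j]
            new.append((1 - d) + d * total)
        ws = new
    return ws
```

On a fully symmetric graph every sentence receives exactly what it gives away, so all scores stay at the fixed point 1.0.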
Preferably, the final TextRank scores of the sentences are sorted in descending order, and the n highest-scoring sentences are taken as the abstract Abs of the news text, as shown in formula (3), where WS(V) denotes the final scores of all the news sentences;
Abs = Top_n(WS(V)), (3)
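Formula (3) amounts to a top-n selection over the final scores; in this sketch the chosen sentences are restored to their original document order for readability, an assumption not stated in the text:

```python
def build_summary(sentences, scores, n=4):
    # Formula (3): keep the n sentences with the highest final TextRank
    # scores as the abstract Abs (n = 4 per the parameter study below).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n])  # restore original order (an added assumption)
    return [sentences[i] for i in chosen]
```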
Preferably, in S3 the extracted text abstract Abs serves as a background-knowledge supplement for the comment sentences of the corresponding news and is input into the text-representation layer together with each comment sentence, deeply fusing the semantic information of the comment with that of the corresponding news text. BERT is a pre-trained model with stable and excellent performance in the field of natural language processing, through which a high-quality text representation can be obtained. Specifically, the comment sentence Obj is input into the BERT model together with the news abstract Abs, as shown in formula (4):
R=BERT[Obj,Abs], (4)
wherein Obj is the comment sentence, Abs is the extracted text abstract, and R is the fused text representation produced by the BERT model;
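Formula (4) corresponds to BERT's standard sentence-pair packing, which can be sketched as follows; the token strings, the segment-id convention, and the maximum length of 100 (taken from the experimental settings) are illustrative, and a real run would use a BERT tokenizer's vocabulary:

```python
def build_bert_input(comment_tokens, summary_tokens, max_len=100):
    # Formula (4): pack the comment Obj (segment A) and the abstract Abs
    # (segment B) into one BERT input: [CLS] Obj [SEP] Abs [SEP].
    tokens = ["[CLS]"] + comment_tokens + ["[SEP]"] + summary_tokens + ["[SEP]"]
    segment_ids = [0] * (len(comment_tokens) + 2) + [1] * (len(summary_tokens) + 1)
    return tokens[:max_len], segment_ids[:max_len]
```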
Preferably, in S4, after the text representation fused with the news-abstract information is obtained, it is passed into the fully connected layer, and the softmax activation function converts the layer's output into the probability that the input is an opinion sentence, as shown in formula (5):
y=softmax(WR+b), (5)
where W is the weight matrix and b is the bias term.
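A minimal sketch of the classification head of formula (5): a fully connected layer over the fused representation R followed by a numerically stable softmax over the two classes (opinion vs. non-opinion); the shapes and values are illustrative, not the trained parameters:

```python
import math

def classify(fused_repr, W, b):
    # Formula (5): y = softmax(W R + b), where W is the weight matrix
    # and b the bias term of the fully connected layer.
    logits = [sum(w * r for w, r in zip(row, fused_repr)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```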
The beneficial effects of the invention are as follows: the invention provides a news-comment opinion-sentence identification method combining the TextRank algorithm with BERT. Abstract information is extracted from the news text with the TextRank algorithm and fed together with the comment text into a BERT model to obtain a fused text representation; the fused representation is then passed into a fully connected layer to compute the probability that the comment sentence is an opinion sentence. Experiments demonstrate the effectiveness and a degree of generalization ability of the invention.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram of a model of the present invention.
Detailed Description
The scheme of the invention is explained in detail below through specific examples.
example 1
The method comprises the following specific steps:
S1: about 200 news articles and their corresponding comments were crawled from online news; the 4 articles with the most comments, totaling 5000 comments, form the data set. After manual labeling, the label distribution is as shown in the table below, where Y denotes an opinion sentence and N a non-opinion sentence. Finally, the data set was randomly split into training, validation, and test sets at a 6:2:2 ratio. The distribution is shown in Table 1:
Table 1. Data set distribution

Category    Training set    Validation set    Test set
Y           1100            360               270
N           1900            640               730
Total       3000            1000              1000

S2: the maximum text sequence length is set to 100, the batch size to 24, the learning rate to 5e-5, and the number of epochs to 30;
S3: in addition to the data set collected by the crawler, the public NLPCC 2012 opinion-sentence recognition data set is used to test the model's generalization ability.
evaluation index
In order to verify the effect of the news-comment opinion-sentence identification model, the common indices Accuracy, Precision, Recall, and F1 are adopted as evaluation metrics; the formulas of the four indices are as follows:
Accuracy = (TP + TN) / N
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
wherein TP denotes the number of samples the model correctly predicts as opinion sentences, TN the number correctly predicted as non-opinion sentences, FP the number incorrectly predicted as opinion sentences, FN the number incorrectly predicted as non-opinion sentences, and N the total number of comment samples, with N = TP + TN + FP + FN.
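The four metrics can be computed directly from the confusion counts; this sketch mirrors the formulas above, with zero-division guards added as an assumption:

```python
def metrics(tp, tn, fp, fn):
    # Accuracy, Precision, Recall, F1 from the confusion counts,
    # with N = TP + TN + FP + FN; the zero guards are an addition.
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```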
Example 2
In order to verify the effectiveness of the invention, the following three groups of experiments are designed:
Experiment 1: selecting the number n of key sentences for TextRank. The essence of the TextRank algorithm is to score and rank each sentence in the text, select the top n as key sentences, and compose those n sentences into an abstract. A parameter-selection experiment is therefore set up to explore the influence of different key-sentence counts on the model's performance. The results are shown in Table 2:
Table 2. Influence of different key-sentence counts on model performance
(table reproduced only as an image in the original publication)
From the parameter-selection experiment, performance is best when the number of key sentences is 4, so the model sets n = 4 when composing the news-abstract information. The results for the other key-sentence counts show that a larger number of key sentences is not necessarily better, which may be related to the fact that the BERT model itself does not handle long text sequences well.
Experiment 2: comparative experiments on classification models. The method of the invention is compared with the TextCNN, TextRNN, Transformer, and BERT models. To further illustrate its effectiveness, the news-abstract information was also incorporated into the TextCNN, TextRNN, and Transformer models, respectively, for comparison. The results are shown in Table 3:
Table 3. Comparative experimental results
(table reproduced only as an image in the original publication)
In the comparative experiment, the model achieves the highest accuracy among the baseline models such as TextCNN, TextRNN, and Transformer, illustrating the effectiveness of fusing news-abstract information. In precision and F1 it is second only to TextCNN, suggesting that TextCNN is better suited to the short-text classification task, although TextCNN cannot capture the word-order and positional information of the text. Compared with BERT, all four indices improve, showing that fusing news-abstract information can raise the BERT model's performance.
After the news-abstract information is fused into the baseline models, all four indices of TextRNN, Transformer, and BERT improve, showing that fusing news-abstract information can effectively boost model performance. Performance degrades for TextCNN, possibly because TextCNN cannot capture the long-distance dependencies of the text well once the sequence becomes long.
The model here falls short in accuracy of the TextCNN and TextRNN models with news-abstract information fused in, and it also scores lower in recall and F1 than the TextCNN, TextRNN, and Transformer baselines with abstract information. Two reasons are possible: first, an abstract composed of key sentences ignores the context of the original news sentences, which affects model performance; second, fusing the abstract information lengthens the text sequence, and BERT by its nature does not handle long sequences well.
Experiment 3: experiments on a public data set. The model and the baseline models are also tested on the public NLPCC 2012 microblog opinion-sentence recognition data set; although the domain differs, this serves to verify the model's generalization ability. The test results are shown in Table 4:
Table 4. Experiments on the public data set
Model         Accuracy    Precision    Recall    F1
TextCNN       0.7155      0.7050       0.7155    0.6974
TextRNN       0.6464      0.5948       0.6464    0.5097
Transformer   0.6499      0.6320       0.6499    0.6352
BERT          0.7967      0.8425       0.8425    0.8425
Our model     0.4976      0.7597       0.3242    0.4544
From the results, although the model here scores lower in accuracy, recall, and F1, the reason may be that the public data set covers more domains while our data set covers relatively few, so the model has not learned sufficiently rich features; its precision, however, is second only to BERT, indicating that the model has a degree of generalization ability.

Claims (7)

1. A method for identifying opinion sentences in news comments, characterized by comprising the following steps:
S1: extracting the news texts and their corresponding news comments from the data set;
S2: extracting n key sentences from the news text with the TextRank algorithm and composing the n key sentences into an abstract;
S3: feeding the news-abstract information and the news-comment text into a pre-trained BERT model to obtain a fused text representation;
S4: passing the fused text representation into a fully connected layer and converting the layer's output into the probability of being an opinion sentence with a softmax activation function.
2. The method as claimed in claim 1, wherein the TextRank algorithm extracts key sentences from the news text to serve as its abstract: a news text can usually be summarized by a small number of sentences, and their representation still carries part of the news's semantic information; when extracting key sentences, each sentence in the text is treated as a node; if two sentences are similar, an undirected weighted edge is placed between their corresponding nodes, and the similarity between sentences is measured as shown in formula (1):
Similarity(V_i, V_j) = |{w_k | w_k ∈ V_i ∧ w_k ∈ V_j}| / (log|V_i| + log|V_j|), (1)
wherein V_i and V_j denote the nodes of sentences i and j and also stand for the sentences' word sets; w_k denotes the k-th word in a sentence. The numerator counts the words w_k that appear in both sentences, while the denominator sums the logarithms of the two sentences' word counts, which effectively limits the advantage long sentences would otherwise enjoy in the similarity computation.
3. The method for identifying news-comment opinion sentences according to claim 2, wherein after formula (1) yields the similarity between two nodes, edges with low similarity are removed and a node-connection graph is constructed, and the TextRank score of each node is computed as shown in formula (2):
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} (w_ji / Σ_{V_k ∈ Out(V_j)} w_jk) × WS(V_j), (2)
wherein WS(V_i) is the TextRank score of node V_i and WS(V_j) that of node V_j; d is the damping coefficient, which prevents any node's score from becoming 0; In(V_i) is the set of nodes linking into V_i, Out(V_j) is the set of nodes that V_j links out to, and V_k is the k-th node of that out-link set; w_ji is the weight of the edge from V_j to V_i. Each sentence distributes its score to the other sentences in proportion to each edge weight's share of the total weight. All sentence scores are initialized to 1; each sentence's score is iterated several times with formula (2), and the score at convergence is the sentence's final score.
4. The method for identifying news-comment opinion sentences according to claim 3, wherein the final TextRank scores of the sentences are sorted in descending order, and the n highest-scoring sentences are taken as the abstract Abs of the news text, as shown in formula (3), where WS(V) denotes the final scores of all the news sentences;
Abs = Top_n(WS(V)). (3)
5. The method for identifying news-comment opinion sentences according to claim 1, wherein in S3 the extracted text abstract Abs serves as a background-knowledge supplement for the comment sentences of the corresponding news and is input into the text-representation layer together with each comment sentence, deeply fusing the semantic information of the comment with that of the corresponding news text; BERT is a pre-trained model with stable and excellent performance in the field of natural language processing, through which a high-quality text representation can be obtained; specifically, the comment sentence Obj is input into the BERT model together with the news abstract Abs, as shown in formula (4):
R=BERT[Obj,Abs], (4)
wherein Obj is a comment sentence, Abs is an extracted text abstract, and R is a fused text representation obtained by training a BERT model.
6. The method according to claim 1, wherein in S4, after the text representation fused with the news-abstract information is obtained, it is passed into the fully connected layer, and the softmax activation function converts the layer's output into the probability that the input is an opinion sentence, as shown in formula (5):
y=softmax(WR+b), (5)
where W is the weight matrix and b is the bias term.
7. Use of the news-comment opinion-sentence identification method according to any one of claims 1 to 6 in the field of opinion mining in natural language processing.
CN202110981244.2A 2021-08-25 2021-08-25 News comment opinion sentence identification method Pending CN113722482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110981244.2A CN113722482A (en) 2021-08-25 2021-08-25 News comment opinion sentence identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110981244.2A CN113722482A (en) 2021-08-25 2021-08-25 News comment opinion sentence identification method

Publications (1)

Publication Number Publication Date
CN113722482A 2021-11-30

Family

ID=78677872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110981244.2A Pending CN113722482A (en) 2021-08-25 2021-08-25 News comment opinion sentence identification method

Country Status (1)

Country Link
CN (1) CN113722482A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070100863A1 (en) * 2005-10-27 2007-05-03 Newsdb, Inc. Newsmaker verification and commenting method and system
CN110263153A (en) * 2019-05-15 2019-09-20 北京邮电大学 Mixing text topic towards multi-source information finds method
CN111008274A (en) * 2019-12-10 2020-04-14 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN111680120A (en) * 2020-04-30 2020-09-18 中国科学院信息工程研究所 News category detection method and system
CN112115256A (en) * 2020-09-15 2020-12-22 大连大学 Method and device for generating news text abstract integrated with Chinese stroke information



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211130)