CN113032569A - Chinese automatic text abstract evaluation method based on semantic similarity - Google Patents

Chinese automatic text abstract evaluation method based on semantic similarity

Info

Publication number
CN113032569A
CN113032569A
Authority
CN
China
Prior art keywords
text
abstract
news
short
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110382498.2A
Other languages
Chinese (zh)
Inventor
张祖平
姜自高
郑瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202110382498.2A priority Critical patent/CN113032569A/en
Publication of CN113032569A publication Critical patent/CN113032569A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Chinese automatic text abstract evaluation method based on semantic similarity, which comprises the following specific steps: abstract texts, news short texts and manual labels are extracted from the LCSTS Chinese summarization data set; the abstract texts and news short texts are preprocessed and represented with pre-trained word vectors; and the abstract texts and news short texts are input into a DPCNN-Siamese hybrid network model for scoring. The invention provides a hybrid improved model based on the Siamese network structure, which uses the manually evaluated part of the LCSTS data set, takes news titles and news contents as input, extracts text features with DPCNN network structures respectively, combines the outputs of the two network layers, trains with the manual evaluation scores as label data, and evaluates the quality of machine-generated Chinese text abstracts by simulating the language habits of Chinese users.

Description

Chinese automatic text abstract evaluation method based on semantic similarity
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese automatic text abstract evaluation method based on semantic similarity.
Background
Society is currently developing toward artificial intelligence and big-data informatization, and all kinds of text information emerge endlessly. Massive amounts of information such as Sina Weibo microblogs, Douban short reviews and daily news have entered people's daily lives, bringing a large amount of redundant information with them; much accurate and important content is buried in overlong texts, making it difficult to obtain useful information quickly and efficiently.
With the continuous development and progress of artificial intelligence technology, automatic text summarization has gradually come to play an important role in compressing and extracting information efficiently. In traditional text generation tasks, it is often difficult to evaluate the quality of texts generated by a neural network model. If machine-generated text summaries are evaluated purely by manual evaluation, the process is very expensive and time-consuming, and scoring errors also arise because each person's evaluation criteria differ. Whenever researchers update and improve the model, the generated results need to be re-evaluated, which greatly reduces experimental efficiency. Moreover, because text summarization originated abroad, the generated texts have mainly been English, whose grammatical structure is relatively standardized, so methods such as BLEU and ROUGE, which judge the quality of a generated summary by word overlap, are feasible there. For Chinese, however, the most prominent feature of standard Chinese grammar is the absence of strict morphological change: for example, nouns are generally not inflected for case, nor verbs for tense, which is a major difference from European languages. Another feature of Chinese is omission, i.e., words that do not affect the main meaning are often left out. As a result, people can express the same meaning in different ways, and taking character overlap as the evaluation criterion is therefore inaccurate.
Disclosure of Invention
The invention aims to provide a Chinese automatic text abstract evaluation method based on semantic similarity, so as to solve the problem that existing Chinese text abstract evaluation mostly adopts English abstract evaluation methods and deviates from the language habits of actual Chinese users, thereby improving the accuracy of Chinese text abstract evaluation.
In order to achieve the aim, the invention provides a Chinese automatic text abstract evaluation method based on semantic similarity, which comprises the following steps:
step one, abstract texts, news short texts and manual labels are extracted from PART II and PART III of the LCSTS Chinese summarization data set;
secondly, preprocessing the extracted abstract texts and news short texts, and representing them with pre-trained word vectors;
and step three, inputting the abstract texts and news short texts represented by the pre-trained word vectors into a DPCNN-Siamese hybrid network model for scoring.
As a further scheme of the invention: the second step is to preprocess the abstract text and the short news text, and comprises the following specific steps:
step 2.1, extracting the abstract texts, news short texts and manually labeled contents from the LCSTS Chinese summarization data set through the lxml library of python, and outputting them respectively to different files in the corresponding order;
step 2.2, using the LTP word segmentation tool to segment the abstract texts and news short texts extracted from the LCSTS Chinese summarization data set, and using word vectors pre-trained on the Chinese Wikipedia corpus as the text word vectors of the Chinese data;
step 2.3, converting the Chinese in the abstract texts and news short texts into 300-dimensional pre-trained word vectors, and processing the length of each abstract text to 32 characters and the length of each news short text to 128 characters;
and step 2.4, respectively inputting the abstract texts processed to a length of 32 characters and the news short texts processed to a length of 128 characters into the neural network.
Preferably, the specific method for processing the length of the abstract text to 32 characters is as follows: an empty list of dimension (n,32) is set and the abstract text data are entered item by item; when the length of an abstract text is less than 32 characters, it is zero-padded; when an abstract text is longer than 32 characters, the content beyond 32 characters is cut off and only the data of the first 32 characters is entered.
Preferably, the specific method for processing the length of the news short text to 128 characters is as follows: an empty list of dimension (n,128) is set and the news short text data are entered item by item; when the length of a news short text is less than 128 characters, it is zero-padded; when a news short text is longer than 128 characters, the content beyond 128 characters is cut off and only the data of the first 128 characters is entered.
Wherein: n represents the number of samples in the LCSTS Chinese summarization data set.
As a further scheme of the invention: the third step of scoring the abstract text and the short news text comprises the following specific steps:
step 3.1, respectively inputting the abstract text and the short news text into a structure based on a Siamese network;
step 3.2, extracting features respectively through DPCNN 1 and DPCNN 2 of different depths, according to the lengths of the abstract text and the news short text input into the Siamese-network-based structure;
step 3.3, splicing and pooling the characteristics of the abstract text and the news short text through a concat function, and inputting the spliced and pooled characteristics into a full connection layer;
step 3.4, taking the manual label as a classification result, and performing semantic similarity matching and scoring on the characteristics of the abstract text and the news short text input into the full connection layer and the manual label by using a softmax function;
and 3.5, carrying out weighted average on the similarity scores of the generated abstract texts or the short news texts to obtain the text score of the abstract text or the short news text.
As a further scheme of the invention: the DPCNN-Siamese hybrid network model is a hybrid network model combining a Siamese network model and a DPCNN network model, wherein the Siamese network model is used for semantic similarity matching, and the DPCNN network model is used for extracting the features of the abstract texts or news short texts.
As a further scheme of the invention: in order to better match the semantic similarity of the characteristics of the abstract text or the news short text with the manual labels to obtain an abstract text or news short text evaluation model which accords with the reading of a Chinese user, the DPCNN-Simese mixed network model is trained when the semantic similarity matching and scoring of the characteristics of the abstract text or the news short text and the manual labels are carried out.
Preferably, the training of the DPCNN-Siamese hybrid network model is implemented as follows:
step 1, selecting the CNN1D, CNN2D and BiLSTM models in the experiment for comparison tests with the DPCNN-Siamese hybrid network model, wherein the CNN1D model is used as the baseline for effect comparison;
step 2, in the embedding layer, selecting word vectors trained on the Chinese Wikipedia corpus for Chinese and Stanford GloVe pre-trained word vectors for English, and setting the embedding to be trainable, wherein the hidden layer of the BiLSTM network is set to 128, the convolution kernel sizes of the CNN and DPCNN networks are set to 3, 4 and 5, the number of convolution kernels is set to 256, L1 and L2 regularization are enabled, the Adam algorithm is used as the optimization algorithm, and the learning rate is set to 0.001;
and step 3, evaluating the model training results with the three evaluation indexes P, R and F1.
Preferably, the calculation formulas of the P, R and F1 values are as follows:
P=TP/(TP+FP)
R=TP/(TP+FN)
F1=2*P*R/(P+R)
wherein: p is expressed as correct rate, R is expressed as recall rate, TP is expressed as predicted value and true value are true, FP is expressed as predicted value is true and true value is false, FN is expressed as predicted value is false and true value is true.
As a further scheme of the invention: the construction of the DPCNN-Siamese hybrid network model comprises the following specific steps:
firstly, the structure of the Siamese network is improved by combining the characteristics of a text abstract data structure;
step two, taking the abstract text and the news short text represented by word vectors, {CA(v), CB(v)}, as input of the Siamese network structure, and encoding {CA(v), CB(v)} into two text vectors of the same dimension through DPCNN 1 and DPCNN 2 of different depths serving as text feature extraction layers, where CA(v) and CB(v) respectively denote the word-embedded A sequence and B sequence input into the Siamese network structure;
thirdly, splicing the two text vectors with a concatenate function and then pooling them through a pooling layer;
and fourthly, after pooling, passing the text feature vectors through a fully connected layer and a softmax function for classification.
Preferably, the specific method for improving the structure of the Siamese network is as follows: the two sub-networks with the same structure and shared weights in the Siamese network are changed into two DPCNN networks of different depths, which perform feature extraction on the abstract texts and news short texts of different lengths in the Chinese summarization data set; a concat operation splices the outputs of the two DPCNN networks into one text vector, and after the text vector passes through a pooling layer and a fully connected layer, classification is carried out with softmax.
Preferably, the text feature extraction layer comprises two DPCNN network modules with different depths.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention provides a hybrid improved model based on the Siamese network structure, which uses the manually evaluated part of the LCSTS data set, takes news titles and news contents as input, extracts text features with DPCNN network structures respectively, combines the outputs of the two network layers, trains with the manual evaluation scores as label data, and evaluates the quality of machine-generated Chinese text abstracts by simulating the language habits of Chinese users.
(2) The invention provides an automatic text abstract evaluation model that simulates the reading habits of Chinese users, which uses the Siamese network as the basic framework, wherein, in view of the different lengths of the abstract text and the original text, DPCNN networks of different depths are adopted to extract text features, so as to evaluate the generated abstract text from the semantic perspective.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the overall architecture of the present invention;
FIG. 2 is an architectural diagram of text processing in the present invention;
FIG. 3 is a schematic diagram of the LCSTS Chinese short text summary raw data set according to the present invention;
FIG. 4 is a schematic diagram of the DPCNN-Siamese hybrid network model architecture in the present invention;
FIG. 5 is a diagram of the network model architecture of TextCNN in the present invention;
FIG. 6 is a schematic diagram of the DPCNN network model architecture in the present invention;
fig. 7 is a schematic diagram of the Siamese network model architecture in the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more clearly understandable, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the drawings of the present invention are simplified and not to precise scale, and are provided only for convenience and clarity in assisting the description of the embodiments of the present invention; the quantities mentioned in this disclosure are not limited to the specific numbers in the illustrated examples; the directions or positional relationships indicated by 'front', 'middle', 'rear', 'left', 'right', 'upper', 'lower', 'top', 'bottom', etc. are based on the directions or positional relationships shown in the drawings, and do not indicate or imply that the devices or components referred to must have a specific orientation, nor should they be construed as limiting the present invention.
In this embodiment:
referring to fig. 1 to 3, the method for automatically evaluating a chinese text summary based on semantic similarity provided by the present invention specifically includes the following steps:
step one, abstract texts, news short texts and manual labels are extracted from PART II and PART III of the LCSTS Chinese summarization data set;
step two, preprocessing the abstract texts and news short texts, and representing them with pre-trained word vectors;
and step three, inputting the abstract texts and news short texts into the DPCNN-Siamese hybrid network model for scoring.
Preferably, the specific steps of preprocessing the abstract text and the short news text in the second step are as follows:
step 2.1, extracting the abstract texts, news short texts and manually labeled contents from the LCSTS Chinese summarization data set through the lxml library of python, and outputting them respectively to different files in the corresponding order;
step 2.2, performing word segmentation on the text with the word segmentation tool of the Language Technology Platform (LTP), and using word vectors pre-trained on the Chinese Wikipedia corpus (Chinese word vectors) as the text word vectors;
step 2.3, converting the Chinese in the abstract texts and news short texts into 300-dimensional pre-trained word vectors, and processing the length of each abstract text to 32 characters and the length of each news short text to 128 characters;
and step 2.4, respectively inputting the abstract texts processed to a length of 32 characters and the news short texts processed to a length of 128 characters into the neural network.
Preferably, the specific method for processing the length of the abstract text to 32 characters is as follows: an empty list of dimension (n,32) is set and the abstract text data are entered item by item; when the length of an abstract text is less than 32 characters, it is zero-padded; when an abstract text is longer than 32 characters, the content beyond 32 characters is cut off and only the data of the first 32 characters is entered.
Preferably, the specific method for processing the length of the news short text to 128 characters is as follows: an empty list of dimension (n,128) is set and the news short text data are entered item by item; when the length of a news short text is less than 128 characters, it is zero-padded; when a news short text is longer than 128 characters, the content beyond 128 characters is cut off and only the data of the first 128 characters is entered.
Preferably, n represents the number of samples of the LCSTS Chinese summary data set.
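For illustration, a minimal numpy sketch of this fixed-length processing follows (the variable names and toy sequences are illustrative assumptions; real inputs would be the LTP-segmented texts mapped to word-vector indices):

```python
import numpy as np

# Toy pre-tokenized index sequences (illustrative only); in practice these
# come from the LTP-segmented texts mapped to word-vector indices.
summary_token_ids = [[5, 12, 7], [3, 9, 4, 1, 8]]
news_token_ids = [[2, 6, 6, 1], [7] * 200]

def pad_or_truncate(sequences, max_len):
    """Fill an empty (n, max_len) array item by item: zero-pad short
    sequences, cut off anything beyond the first max_len tokens."""
    out = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_len]
        out[i, :len(trimmed)] = trimmed
    return out

summary_ids = pad_or_truncate(summary_token_ids, 32)   # abstract texts -> (n, 32)
news_ids = pad_or_truncate(news_token_ids, 128)        # news short texts -> (n, 128)
print(summary_ids.shape, news_ids.shape)               # (2, 32) (2, 128)
```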
Preferably, the third step of scoring the abstract text and the short news text includes the following specific steps:
step 3.1, respectively inputting the abstract text and the short news text into a structure based on a Siamese network;
step 3.2, extracting features respectively through DPCNN 1 and DPCNN 2 of different depths, according to the lengths of the abstract text and the news short text input into the Siamese-network-based structure;
step 3.3, splicing and pooling the characteristics of the abstract text and the news short text through a concat function, and inputting the spliced and pooled characteristics into a full connection layer;
step 3.4, taking the manual label as a classification result, and scoring the similarity of the abstract text and the short news text input into the full connection layer by using a softmax function;
and 3.5, carrying out weighted average on the similarity scores of the generated abstract texts or the short news texts to obtain the text score of the abstract text or the short news text.
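A small sketch of steps 3.4-3.5, assuming (illustratively) that the manual labels form five score classes and that `probs` is the softmax output of the fully connected layer for one summary:

```python
import numpy as np

def weighted_score(probs, scores=(1, 2, 3, 4, 5)):
    """Collapse the softmax distribution over the manual score classes
    into a single text score by weighted averaging (step 3.5)."""
    return float(np.dot(probs, scores))

probs = np.array([0.05, 0.10, 0.20, 0.40, 0.25])  # softmax output for one summary
print(weighted_score(probs))                      # 3.7
```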
Preferably, the DPCNN-Siamese hybrid network model is set as a hybrid network model combining a Siamese network model and a DPCNN network model, wherein: the Siamese network model is used for carrying out semantic similarity matching; the DPCNN network model is used for extracting the characteristics of the abstract text or the short news text.
Referring to fig. 4, the specific steps of constructing the DPCNN-Siamese hybrid network model are as follows:
firstly, the structure of the Siamese network is improved by combining the characteristics of a text abstract data structure;
step two, taking the abstract text and the news short text represented by word vectors, {CA(v), CB(v)} (where CA(v) and CB(v) respectively denote the word-embedded A sequence and B sequence input into the Siamese network structure), as input, and encoding {CA(v), CB(v)} into two text vectors of the same dimension through DPCNN 1 and DPCNN 2 of different depths serving as text feature extraction layers (the text feature extraction layer contains two DPCNN network blocks of different depths, aiming at better capturing the features of CA(v) and CB(v): because the reference text of the summary data is the news headline, which is short relative to the news text, while the news content text is long, the corresponding features are learned and extracted through DPCNN networks of different depths);
thirdly, splicing the two text vectors with a concatenate function and then pooling them through a pooling layer;
and fourthly, after pooling, passing the text feature vectors through a fully connected layer and a softmax function for classification.
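A compact Keras sketch of this construction follows (a sketch under stated assumptions: the vocabulary size, block counts, hidden width and five-class output are illustrative; the patent itself fixes only the 300-dimensional word vectors, the 32/128 input lengths, and DPCNN feature extractors of different depths):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dpcnn_branch(x, blocks, filters=256):
    """DPCNN-style extractor: region embedding, then repeated
    [two equal-length convolutions -> shortcut add -> 1/2 pooling]."""
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)  # region embedding
    for _ in range(blocks):
        shortcut = x
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv1D(filters, 3, padding="same")(x)
        x = layers.Add()([x, shortcut])                                     # G(W) = z + f(z)
        x = layers.MaxPooling1D(pool_size=3, strides=2, padding="same")(x)  # 1/2 pooling
    return layers.GlobalMaxPooling1D()(x)

summary_in = layers.Input(shape=(32,))     # C_A(v): abstract text, 32 tokens
news_in = layers.Input(shape=(128,))       # C_B(v): news short text, 128 tokens
embed = layers.Embedding(100_000, 300)     # 300-dim; pre-trained vectors in practice

feat_a = dpcnn_branch(embed(summary_in), blocks=2)    # shallower DPCNN 1
feat_b = dpcnn_branch(embed(news_in), blocks=4)       # deeper DPCNN 2
merged = layers.Concatenate()([feat_a, feat_b])       # concat the two text vectors
hidden = layers.Dense(128, activation="relu")(merged)    # fully connected layer
out = layers.Dense(5, activation="softmax")(hidden)      # classification over score labels

model = tf.keras.Model([summary_in, news_in], out)
```

Here each branch is pooled before concatenation for simplicity; the ordering described above (concatenation, then a pooling layer, then the fully connected layer) changes only where the pooling sits.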
The specific implementation of using CNN network (convolutional neural network) to capture local features in text is as follows:
The DPCNN network (Deep Pyramid Convolutional Neural Network) was proposed on the basis of applying the TextCNN network (Text Convolutional Neural Network) to the field of natural language processing. The core idea of the CNN network (convolutional neural network) is to capture local features; for text, a local feature is a sliding window consisting of several words or characters, similar to an N-gram. The feature vector ci is obtained by convolving the text within the sliding window xi:i+h-1, as shown in formula (1) below:
ci=f(W*xi:i+h-1+b)..................................................(1)
wherein: ci denotes the feature vector obtained by convolving the i-th sliding window in the CNN network, f denotes the nonlinear activation function, W denotes the coefficient matrix parameter of the CNN network, xi:i+h-1 denotes the text vectors contained in the sliding window, b denotes the constant matrix (bias) parameter of the CNN network, and h denotes the size of the convolution window.
After the sliding window of the CNN network has traversed all the word vectors, the resulting features are:
c=[c1,c2,…,cn-h+1];............................................(2)
Wherein: c is the text characteristic vector after the convolution of the CNN convolution neural network is completed, c1,c2,…,cn-h+1Respectively, feature vectors obtained by convolution of 1 st, 2 … th and n-h +1 th sliding windows in the CNN neural network.
Referring to fig. 5, in the TextCNN network each word in the text is represented by an n-dimensional word vector, so the size of the input matrix is m × n, where m is the sentence length. The CNN network performs convolution on the input samples; for text data the convolution kernels do not slide horizontally but only downward, extracting local correlations between words in the manner of an n-gram. For example, the figure contains three convolution kernel sizes (2, 3 and 4), with two convolution kernels of each size; different convolution kernels are applied to different word windows, finally yielding 6 convolved vectors. A maximum pooling operation is then performed on each vector, followed by full connection, finally obtaining the feature representation of the sentence, and the sentence vector is input into a classifier for classification.
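A matching Keras sketch of the TextCNN structure just described (the vocabulary size, sentence length and class count are illustrative assumptions):

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(128,))             # sentence of m = 128 word indices
emb = layers.Embedding(100_000, 300)(inp)    # m x n input matrix, n = 300

pooled = []
for size in (2, 3, 4):                       # three kernel sizes, two kernels each
    conv = layers.Conv1D(filters=2, kernel_size=size, activation="relu")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))   # max pooling per feature map

sentence_vec = layers.Concatenate()(pooled)  # 6 pooled values -> sentence features
out = layers.Dense(5, activation="softmax")(sentence_vec)  # classifier
textcnn = Model(inp, out)
```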
Referring to fig. 6, the bottom layer of the DPCNN network keeps a structure similar to that of the TextCNN network. The convolution result of the convolutional layer containing multi-size convolution filters is the Region embedding, i.e., the embedding features generated after a set of convolution operations on a text region or segment (e.g., a 3-gram). The DPCNN network performs two equal-length convolutions after Region embedding so that each word obtains a richer representation; then equal-length convolution and 1/2 pooling are repeated to extract features, with residual connections used to alleviate gradient vanishing. Owing to the 1/2 pooling layers, the length of the text sequence decreases exponentially as the number of blocks increases, as shown in formula (3):
num_blocks=log2(seq_len).....................................(3)
wherein: num _ blocks is the number of blocks in the DPCNN network, and seq _ len is the sequence length of the text.
In order to alleviate gradient vanishing in the deep network, the DPCNN applies shortcut connections between the preceding and following equal-length convolution layers, as shown in formula (4):
G(W)=z+f(z)....................................................(4)
wherein: z is the output of the previous network layer, f(z) is the output of the current network layer, and G(W) is the output passed to the next network layer.
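Two short functions make formulas (3) and (4) concrete (illustrative; the 32/128 lengths are the input sizes fixed earlier):

```python
import math

def max_blocks(seq_len):
    """Formula (3): with 1/2 pooling the sequence halves per block,
    so about log2(seq_len) blocks exhaust the sequence."""
    return int(math.log2(seq_len))

print(max_blocks(32), max_blocks(128))  # 5 7

def shortcut(z, f):
    """Formula (4): the shortcut connection G(W) = z + f(z)."""
    return z + f(z)

print(shortcut(2.0, lambda v: 0.5 * v))  # 3.0
```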
The convolutional neural network is thus effective in text processing tasks; its advantage is that it can automatically combine and filter N-gram features to obtain semantic information at different levels of abstraction.
Referring to fig. 7, the Siamese network is provided with two sub-networks that have the same structure and share weights. The two sub-networks respectively receive two inputs X1 and X2, convert them into vectors Gw(X1) and Gw(X2), and then compute the distance E(X1,X2) between the two output vectors with a cosine distance measure, as shown in formula (5):
E(X1,X2)=cos(Gw(X1),Gw(X2))=Gw(X1)·Gw(X2)/(||Gw(X1)||·||Gw(X2)||)..................(5)
wherein: e (X)1,X2) The cosine similarity calculation formula is used, so-1 ≦ E (X)1,X2) 1 or less, which differs from the Euclidean distance in that EcosThe larger the value of (A), the closer the distance is, namely the higher the similarity between two sections of texts is; ecosThe smaller the value of (c) is, the farther the distance is, i.e., the lower the similarity between two pieces of text, and thus the same holds in the design of the LOSS function of the model.
When y is 0, the LOSS function increases monotonically with E; when y is 1, the LOSS function decreases monotonically with E. The specific formulas are as follows:
L+(X1,X2)=Ew^2...............................................(6)
L-(X1,X2)=(m-Ew)^2, Ew<m.....................................(7)
L-(X1,X2)=0, otherwise.......................................(8)
wherein: y indicates whether the two sentences are similar, with 1 for similar and 0 for dissimilar; L denotes the value of the LOSS function, with L+ applied to similar pairs and L- to dissimilar pairs; Ew denotes the distance between the two text vectors; m is a manually set threshold; and otherwise denotes the cases other than Ew < m.
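A numpy sketch of formulas (6)-(8); the squared forms follow the standard contrastive loss and the margin value is an illustrative assumption:

```python
import numpy as np

def contrastive_loss(e_w, y, m=1.0):
    """For similar pairs (y = 1) the loss grows with the distance Ew;
    for dissimilar pairs (y = 0) it is (m - Ew)^2 while Ew < m and
    0 otherwise, m being the manually set threshold."""
    loss_pos = e_w ** 2                        # formula (6)
    loss_neg = np.maximum(m - e_w, 0.0) ** 2   # formulas (7)/(8)
    return y * loss_pos + (1 - y) * loss_neg

print(contrastive_loss(np.array([0.2, 0.9]), np.array([1, 0])))  # [0.04 0.01]
```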
Preferably, in order to better match the semantic similarity between the features of the abstract text or news short text and the manual labels and obtain an abstract text or news short text evaluation model that accords with Chinese users' reading, the DPCNN-Siamese hybrid network model is trained when performing semantic similarity matching and scoring between the features of the abstract text or news short text and the manual labels.
The training of the DPCNN-Siamese hybrid network model is implemented as follows:
step 1, selecting the CNN1D, CNN2D and BiLSTM models in the experiment for comparison tests with the DPCNN-Siamese hybrid network model, wherein the CNN1D model is used as the baseline for effect comparison;
step 2, in the embedding layer, selecting word vectors trained on the Chinese Wikipedia corpus for Chinese and Stanford GloVe pre-trained word vectors for English, and setting the embedding to be trainable, wherein the hidden layer of the BiLSTM network is set to 128, the convolution kernel sizes of the CNN and DPCNN networks are set to 3, 4 and 5, the number of convolution kernels is set to 256, L1 and L2 regularization are enabled, the Adam algorithm is used as the optimization algorithm, and the learning rate is set to 0.001;
and step 3, evaluating the model training results with the three evaluation indexes P, R and F1.
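Assuming the Keras API and reusing the hybrid `model` sketched earlier, the training configuration of step 2 might look as follows (the loss choice is an illustrative assumption):

```python
import tensorflow as tf

# Adam optimizer with the learning rate fixed at 0.001 (step 2).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # manual score labels as classes (assumption)
    metrics=["accuracy"],
)

# L1/L2 regularization is attached per layer in Keras, e.g.:
# layers.Conv1D(256, 3, kernel_regularizer=tf.keras.regularizers.l1_l2(1e-5, 1e-4))
```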
Preferably, the calculation formulas of the P, R and F1 values are as follows:
P=TP/(TP+FP).................................................(9)
R=TP/(TP+FN)................................................(10)
F1=2*P*R/(P+R)..............................................(11)
wherein: p is expressed as correct rate, R is expressed as recall rate, TP is expressed as predicted value and true value are true, FP is expressed as predicted value is true and true value is false, FN is expressed as predicted value is false and true value is true.
The specific correspondence can be seen in table 1.
TABLE 1
                          True value is true    True value is false
Predicted value is true          TP                    FP
Predicted value is false         FN                    TN
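The three indexes follow directly from these counts; a minimal sketch with illustrative values:

```python
def precision_recall_f1(tp, fp, fn):
    """P, R and F1 from the TP/FP/FN counts defined above."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842
```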
Through comparison, the DPCNN-Siamese hybrid network model outperforms the other networks on the Chinese data set, is more efficient, and shows a greater advantage in comparing the semantic feature similarity of short texts.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A Chinese automatic text abstract evaluation method based on semantic similarity is characterized by comprising the following steps:
step one, abstract texts, news short texts and manual labels are extracted from PART II and PART III of the LCSTS Chinese summarization data set;
secondly, preprocessing the extracted abstract texts and news short texts, and representing them with pre-trained word vectors;
and step three, inputting the abstract texts and news short texts represented by the pre-trained word vectors into a DPCNN-Siamese hybrid network model for scoring.
2. The Chinese automatic text abstract evaluation method according to claim 1, wherein the second step of preprocessing the abstract text and the news short text comprises the following specific steps:
step 2.1, extracting the abstract texts, news short texts and manually labeled contents from the LCSTS Chinese summarization data set through the lxml library of python, and outputting them respectively to different files in the corresponding order;
step 2.2, using the LTP word segmentation tool to segment the abstract texts and news short texts extracted from the LCSTS Chinese summarization data set, and using word vectors pre-trained on the Chinese Wikipedia corpus as the text word vectors of the Chinese data;
step 2.3, converting the Chinese in the abstract texts and news short texts into 300-dimensional pre-trained word vectors, and processing the length of each abstract text to 32 characters and the length of each news short text to 128 characters;
and step 2.4, respectively inputting the abstract texts processed to a length of 32 characters and the news short texts processed to a length of 128 characters into the neural network.
3. The Chinese automatic text summarization evaluation method of claim 2,
the specific method for processing the length of the abstract text to 32 characters is as follows: an empty list of dimension (n,32) is set and the abstract text data are entered item by item; when the length of an abstract text is less than 32 characters, it is zero-padded; when an abstract text is longer than 32 characters, the content beyond 32 characters is cut off and only the data of the first 32 characters is entered;
the specific method for processing the length of the news short text to 128 characters is as follows: an empty list of dimension (n,128) is set and the news short text data are entered item by item; when the length of a news short text is less than 128 characters, it is zero-padded; when a news short text is longer than 128 characters, the content beyond 128 characters is cut off and only the data of the first 128 characters is entered;
wherein: n represents the number of samples in the LCSTS Chinese summarization data set.
4. The method for automatically evaluating a Chinese text summary according to claim 1, wherein the third step of scoring the summary text and the short news text comprises the following specific steps:
step 3.1, respectively inputting the abstract text and the short news text into a structure based on a Siamese network;
step 3.2, extracting features respectively through DPCNN 1 and DPCNN 2 of different depths, according to the lengths of the abstract text and the news short text input into the Siamese-network-based structure;
step 3.3, splicing and pooling the characteristics of the abstract text and the news short text through a concat function, and inputting the spliced and pooled characteristics into a full connection layer;
step 3.4, taking the manual label as a classification result, and performing semantic similarity matching and scoring on the characteristics of the abstract text and the news short text input into the full connection layer and the manual label by using a softmax function;
and 3.5, carrying out weighted average on the similarity scores of the generated abstract texts or the short news texts to obtain the text score of the abstract text or the short news text.
5. The method as claimed in claim 1, wherein the DPCNN-Siamese hybrid network model is a hybrid network model combining a Siamese network model and a DPCNN network model, wherein the Siamese network model is used for semantic similarity matching, and the DPCNN network model is used for extracting features of the digest text or the short news text.
6. The method for automatically evaluating the abstract of Chinese text as claimed in claim 5, wherein, in order to better match the semantic similarity between the features of the abstract text or news short text and the manual labels and obtain an abstract text or news short text evaluation model that accords with Chinese users' reading, the DPCNN-Siamese hybrid network model is trained when performing semantic similarity matching and scoring between the features of the abstract text or news short text and the manual labels;
the training of the DPCNN-Siamese hybrid network model is implemented as follows:
step 1, selecting the CNN1D, CNN2D and BiLSTM models in the experiment for comparison tests with the DPCNN-Siamese hybrid network model, wherein the CNN1D model is used as the baseline for effect comparison;
step 2, in the embedding layer, selecting word vectors trained on the Chinese Wikipedia corpus for Chinese and Stanford GloVe pre-trained word vectors for English, and setting the embedding to be trainable, wherein the hidden layer of the BiLSTM network is set to 128, the convolution kernel sizes of the CNN and DPCNN networks are set to 3, 4 and 5, the number of convolution kernels is set to 256, L1 and L2 regularization are enabled, the Adam algorithm is used as the optimization algorithm, and the learning rate is set to 0.001;
and step 3, evaluating the model training results with the three evaluation indexes P, R and F1.
7. The method for Chinese automatic text summarization evaluation according to claim 6, wherein the P, R and F1 values are calculated by the following formulas:
P=TP/(TP+FP)
R=TP/(TP+FN)
F1=2*P*R/(P+R)
wherein: p is expressed as correct rate, R is expressed as recall rate, TP is expressed as predicted value and true value are true, FP is expressed as predicted value is true and true value is false, FN is expressed as predicted value is false and true value is true.
8. The method for Chinese automatic text summarization evaluation according to claim 1, wherein the specific steps of building the DPCNN-Siamese hybrid network model are as follows:
firstly, the structure of the Siamese network is improved by combining the characteristics of a text abstract data structure;
step two, taking the abstract text and the news short text represented by word vectors, {CA(v), CB(v)}, as input of the Siamese network structure, and encoding {CA(v), CB(v)} into two text vectors of the same dimension through DPCNN 1 and DPCNN 2 of different depths serving as text feature extraction layers, where CA(v) and CB(v) respectively denote the word-embedded A sequence and B sequence input into the Siamese network structure;
thirdly, splicing the two text vectors with a concatenate function and then pooling them through a pooling layer;
and fourthly, after pooling, passing the text feature vectors through a fully connected layer and a softmax function for classification.
9. The method for automatically evaluating the abstract of Chinese text as claimed in claim 8, wherein the specific method for improving the structure of the Siamese network is as follows:
the two sub-networks with the same structure and shared weights in the Siamese network are changed into two DPCNN networks of different depths, which perform feature extraction on the abstract texts and news short texts of different lengths in the Chinese summarization data set; a concat operation splices the outputs of the two DPCNN networks into one text vector, and after the text vector passes through a pooling layer and a fully connected layer, classification is carried out with softmax.
10. The method of automatic chinese text summarization evaluation according to claim 8 wherein the text feature extraction layer comprises two DPCNN network modules of different depths.
CN202110382498.2A 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity Pending CN113032569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110382498.2A CN113032569A (en) 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110382498.2A CN113032569A (en) 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity

Publications (1)

Publication Number Publication Date
CN113032569A true CN113032569A (en) 2021-06-25

Family

ID=76456077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110382498.2A Pending CN113032569A (en) 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity

Country Status (1)

Country Link
CN (1) CN113032569A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766432A (en) * 2018-07-12 2019-05-17 中国科学院信息工程研究所 A kind of Chinese abstraction generating method and device based on generation confrontation network
CN110390103A (en) * 2019-07-23 2019-10-29 中国民航大学 Short text auto-abstracting method and system based on Dual-encoder
CN110688479A (en) * 2019-08-19 2020-01-14 中国科学院信息工程研究所 Evaluation method and sequencing network for generating abstract
CN111325029A (en) * 2020-02-21 2020-06-23 河海大学 Text similarity calculation method based on deep learning integration model
CN111552801A (en) * 2020-04-20 2020-08-18 大连理工大学 Neural network automatic abstract model based on semantic alignment
CN111930931A (en) * 2020-07-20 2020-11-13 桂林电子科技大学 Abstract evaluation method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766432A (en) * 2018-07-12 2019-05-17 中国科学院信息工程研究所 A kind of Chinese abstraction generating method and device based on generation confrontation network
CN110390103A (en) * 2019-07-23 2019-10-29 中国民航大学 Short text auto-abstracting method and system based on Dual-encoder
CN110688479A (en) * 2019-08-19 2020-01-14 中国科学院信息工程研究所 Evaluation method and sequencing network for generating abstract
CN111325029A (en) * 2020-02-21 2020-06-23 河海大学 Text similarity calculation method based on deep learning integration model
CN111552801A (en) * 2020-04-20 2020-08-18 大连理工大学 Neural network automatic abstract model based on semantic alignment
CN111930931A (en) * 2020-07-20 2020-11-13 桂林电子科技大学 Abstract evaluation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RIE JOHNSON et al.: "Deep Pyramid Convolutional Neural Networks for Text Categorization", Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics *

Similar Documents

Publication Publication Date Title
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN109960724B (en) Text summarization method based on TF-IDF
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN112257453B (en) Chinese-Yue text similarity calculation method fusing keywords and semantic features
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN106598959B (en) Method and system for determining mutual translation relationship of bilingual sentence pairs
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN111897917B (en) Rail transit industry term extraction method based on multi-modal natural language features
CN111858842A (en) Judicial case screening method based on LDA topic model
CN116244445B (en) Aviation text data labeling method and labeling system thereof
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN110222250A (en) A kind of emergency event triggering word recognition method towards microblogging
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN113962228A (en) Long document retrieval method based on semantic fusion of memory network
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN111460147A (en) Title short text classification method based on semantic enhancement
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN117371534B (en) Knowledge graph construction method and system based on BERT
CN115033753A (en) Training corpus construction method, text processing method and device
CN111191029A (en) AC construction method based on supervised learning and text classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210625

RJ01 Rejection of invention patent application after publication