CN113032569A - Chinese automatic text abstract evaluation method based on semantic similarity - Google Patents

Chinese automatic text abstract evaluation method based on semantic similarity

Info

Publication number
CN113032569A
CN113032569A
Authority
CN
China
Prior art keywords
text
abstract
news
short
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110382498.2A
Other languages
Chinese (zh)
Inventor
张祖平
姜自高
郑瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202110382498.2A priority Critical patent/CN113032569A/en
Publication of CN113032569A publication Critical patent/CN113032569A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Chinese automatic text abstract evaluation method based on semantic similarity, which comprises the following specific steps: abstract texts, news short texts and manual labels are extracted from the LCSTS Chinese summarization data set; the abstract texts and news short texts are preprocessed and represented with pre-trained word vectors; and the abstract texts and news short texts are input into a DPCNN-Siamese hybrid network model for scoring. The invention provides a hybrid improved model based on the Siamese network structure, which uses the manually evaluated part of the LCSTS data set, takes news titles and news contents as input, extracts text features with DPCNN network structures respectively, combines the outputs of the two network layers, trains with the manual evaluation scores as label data, and evaluates the quality of machine-generated Chinese text abstracts by simulating the language habits of Chinese users.

Description

Chinese automatic text abstract evaluation method based on semantic similarity
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese automatic text abstract evaluation method based on semantic similarity.
Background
Society is currently developing toward artificial intelligence and big-data informatization, and all kinds of text information emerge endlessly. Massive amounts of information such as Sina Weibo microblogs, Douban short reviews and daily news have entered people's daily lives, bringing a large amount of redundant information with them; much accurate and important content is buried in overlong texts, making it difficult to obtain useful information quickly and efficiently.
With the continuous development and progress of artificial intelligence technology, automatic text summarization has gradually come to play an important role in compressing and extracting information efficiently. In traditional text generation tasks, it is often difficult to evaluate the quality of texts generated by a neural network model. If machine-generated text summaries are evaluated purely by manual evaluation, the process is very expensive and time-consuming, and scoring errors also arise because each person's evaluation criteria differ. Whenever researchers update and improve the model, the generated results need to be re-evaluated, which greatly reduces experimental efficiency. Moreover, because text summarization originated abroad, the generated texts have mainly been English, whose grammatical structure is relatively standardized, so methods such as BLEU and ROUGE, which judge the quality of a generated summary by word overlap, are feasible there. For Chinese, however, the most prominent feature of standard Chinese grammar is the absence of strict morphological change: for example, nouns are generally not inflected for case, nor verbs for tense, which is a major difference from European languages. Another feature of Chinese is omission, i.e., words that do not affect the main meaning are often left out. As a result, people can express the same meaning in different ways, and taking character overlap as the evaluation criterion is therefore inaccurate.
Disclosure of Invention
The invention aims to provide a Chinese automatic text abstract evaluation method based on semantic similarity, so as to solve the problem that existing Chinese text abstract evaluation mostly adopts English abstract evaluation methods and deviates from the language habits of actual Chinese users, thereby improving the accuracy of Chinese text abstract evaluation.
In order to achieve the aim, the invention provides a Chinese automatic text abstract evaluation method based on semantic similarity, which comprises the following steps:
step one, abstract texts, news short texts and manual labels are extracted from PART II and PART III of the LCSTS Chinese summarization data set;
secondly, preprocessing the extracted abstract texts and news short texts, and representing them with pre-trained word vectors;
and step three, inputting the abstract texts and news short texts represented by the pre-trained word vectors into a DPCNN-Siamese hybrid network model for scoring.
As a further scheme of the invention: the second step is to preprocess the abstract text and the short news text, and comprises the following specific steps:
step 2.1, extracting the abstract texts, news short texts and manually labeled contents from the LCSTS Chinese summarization data set through the lxml library of python, and outputting them respectively to different files in the corresponding order;
step 2.2, using the LTP word segmentation tool to segment the abstract texts and news short texts extracted from the LCSTS Chinese summarization data set, and using word vectors pre-trained on the Chinese Wikipedia corpus as the text word vectors of the Chinese data;
step 2.3, converting the Chinese in the abstract texts and news short texts into 300-dimensional pre-trained word vectors, and processing the length of each abstract text to 32 characters and the length of each news short text to 128 characters;
and step 2.4, respectively inputting the abstract texts processed to a length of 32 characters and the news short texts processed to a length of 128 characters into the neural network.
Preferably, the specific method for processing the length of the abstract text to 32 characters is as follows: an empty list of dimension (n,32) is set and the abstract text data are entered item by item; when the length of an abstract text is less than 32 characters, it is zero-padded; when an abstract text is longer than 32 characters, the content beyond 32 characters is cut off and only the data of the first 32 characters is entered.
Preferably, the specific method for processing the length of the news short text to 128 characters is as follows: an empty list of dimension (n,128) is set and the news short text data are entered item by item; when the length of a news short text is less than 128 characters, it is zero-padded; when a news short text is longer than 128 characters, the content beyond 128 characters is cut off and only the data of the first 128 characters is entered.
Wherein: n represents the number of samples in the LCSTS Chinese summarization data set.
As a further scheme of the invention: the third step of scoring the abstract text and the short news text comprises the following specific steps:
step 3.1, respectively inputting the abstract text and the short news text into a structure based on a Siamese network;
step 3.2, extracting features respectively through DPCNN 1 and DPCNN 2 of different depths, according to the lengths of the abstract text and the news short text input into the Siamese-network-based structure;
step 3.3, splicing and pooling the characteristics of the abstract text and the news short text through a concat function, and inputting the spliced and pooled characteristics into a full connection layer;
step 3.4, taking the manual label as a classification result, and performing semantic similarity matching and scoring on the characteristics of the abstract text and the news short text input into the full connection layer and the manual label by using a softmax function;
and 3.5, carrying out weighted average on the similarity scores of the generated abstract texts or the short news texts to obtain the text score of the abstract text or the short news text.
As a further scheme of the invention: the DPCNN-Siamese hybrid network model is a hybrid network model combining a Siamese network model and a DPCNN network model, wherein the Siamese network model is used for semantic similarity matching, and the DPCNN network model is used for extracting the features of the abstract texts or news short texts.
As a further scheme of the invention: in order to better match the semantic similarity of the characteristics of the abstract text or the news short text with the manual labels to obtain an abstract text or news short text evaluation model which accords with the reading of a Chinese user, the DPCNN-Simese mixed network model is trained when the semantic similarity matching and scoring of the characteristics of the abstract text or the news short text and the manual labels are carried out.
Preferably, the training of the DPCNN-Siamese hybrid network model is implemented as follows:
step 1, selecting the CNN1D, CNN2D and BiLSTM models in the experiment for comparison tests with the DPCNN-Siamese hybrid network model, wherein the CNN1D model is used as the baseline for effect comparison;
step 2, in the embedding layer, selecting word vectors trained on the Chinese Wikipedia corpus for Chinese and Stanford GloVe pre-trained word vectors for English, and setting the embedding to be trainable, wherein the hidden layer of the BiLSTM network is set to 128, the convolution kernel sizes of the CNN and DPCNN networks are set to 3, 4 and 5, the number of convolution kernels is set to 256, L1 and L2 regularization are enabled, the Adam algorithm is used as the optimization algorithm, and the learning rate is set to 0.001;
and step 3, evaluating the model training results with the three evaluation indexes P, R and F1.
Preferably, the calculation formulas of the P, R and F1 values are as follows:
P=TP/(TP+FP)
R=TP/(TP+FN)
F1=2*P*R/(P+R)
wherein: p is expressed as correct rate, R is expressed as recall rate, TP is expressed as predicted value and true value are true, FP is expressed as predicted value is true and true value is false, FN is expressed as predicted value is false and true value is true.
As a further scheme of the invention: the construction of the DPCNN-Siamese hybrid network model comprises the following specific steps:
firstly, the structure of the Siamese network is improved by combining the characteristics of a text abstract data structure;
step two, taking the abstract text and the news short text represented by word vectors, {CA(v), CB(v)}, as input of the Siamese network structure, and encoding {CA(v), CB(v)} into two text vectors of the same dimension through DPCNN 1 and DPCNN 2 of different depths serving as text feature extraction layers, where CA(v) and CB(v) respectively denote the word-embedded A sequence and B sequence input into the Siamese network structure;
thirdly, splicing the two text vectors with a concatenate function and then pooling them through a pooling layer;
and fourthly, after pooling, passing the text feature vectors through a fully connected layer and a softmax function for classification.
Preferably, the specific method for improving the structure of the Siamese network is as follows: the two sub-networks with the same structure and shared weights in the Siamese network are changed into two DPCNN networks of different depths, which perform feature extraction on the abstract texts and news short texts of different lengths in the Chinese summarization data set; a concat operation splices the outputs of the two DPCNN networks into one text vector, and after the text vector passes through a pooling layer and a fully connected layer, classification is carried out with softmax.
Preferably, the text feature extraction layer comprises two DPCNN network modules with different depths.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention provides a hybrid improved model based on the Siamese network structure, which uses the manually evaluated part of the LCSTS data set, takes news titles and news contents as input, extracts text features with DPCNN network structures respectively, combines the outputs of the two network layers, trains with the manual evaluation scores as label data, and evaluates the quality of machine-generated Chinese text abstracts by simulating the language habits of Chinese users.
(2) The invention provides an automatic text abstract evaluation model that simulates the reading habits of Chinese users, which uses the Siamese network as the basic framework, wherein, in view of the different lengths of the abstract text and the original text, DPCNN networks of different depths are adopted to extract text features, so as to evaluate the generated abstract text from the semantic perspective.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the overall architecture of the present invention;
FIG. 2 is an architectural diagram of text processing in the present invention;
FIG. 3 is a schematic diagram of the LCSTS Chinese short text summary raw data set according to the present invention;
FIG. 4 is a schematic diagram of the DPCNN-Siamese hybrid network model architecture in the present invention;
FIG. 5 is a diagram of the network model architecture of TextCNN in the present invention;
FIG. 6 is a schematic diagram of the DPCNN network model architecture in the present invention;
fig. 7 is a schematic diagram of the Siamese network model architecture in the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more clearly understandable, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the drawings of the present invention are simplified and not to precise scale, and are provided only for convenience and clarity in assisting the description of the embodiments of the present invention; the quantities mentioned in this disclosure are not limited to the specific numbers in the illustrated examples; the directions or positional relationships indicated by 'front', 'middle', 'rear', 'left', 'right', 'upper', 'lower', 'top', 'bottom', etc. are based on the directions or positional relationships shown in the drawings, and do not indicate or imply that the devices or components referred to must have a specific orientation, nor should they be construed as limiting the present invention.
In this embodiment:
referring to fig. 1 to 3, the method for automatically evaluating a chinese text summary based on semantic similarity provided by the present invention specifically includes the following steps:
step one, abstract texts, news short texts and manual labels are extracted from PART II and PART III of the LCSTS Chinese summarization data set;
step two, preprocessing the abstract texts and news short texts, and representing them with pre-trained word vectors;
and step three, inputting the abstract texts and news short texts into the DPCNN-Siamese hybrid network model for scoring.
Preferably, the specific steps of preprocessing the abstract text and the short news text in the second step are as follows:
step 2.1, extracting the abstract texts, news short texts and manually labeled contents from the LCSTS Chinese summarization data set through the lxml library of python, and outputting them respectively to different files in the corresponding order;
step 2.2, performing word segmentation on the text with the word segmentation tool of the Language Technology Platform (LTP), and using word vectors pre-trained on the Chinese Wikipedia corpus (Chinese word vectors) as the text word vectors;
step 2.3, converting the Chinese in the abstract texts and news short texts into 300-dimensional pre-trained word vectors, and processing the length of each abstract text to 32 characters and the length of each news short text to 128 characters;
and step 2.4, respectively inputting the abstract texts processed to a length of 32 characters and the news short texts processed to a length of 128 characters into the neural network.
Preferably, the specific method for processing the length of the abstract text to 32 characters is as follows: an empty list of dimension (n,32) is set and the abstract text data are entered item by item; when the length of an abstract text is less than 32 characters, it is zero-padded; when an abstract text is longer than 32 characters, the content beyond 32 characters is cut off and only the data of the first 32 characters is entered.
Preferably, the specific method for processing the length of the news short text to 128 characters is as follows: an empty list of dimension (n,128) is set and the news short text data are entered item by item; when the length of a news short text is less than 128 characters, it is zero-padded; when a news short text is longer than 128 characters, the content beyond 128 characters is cut off and only the data of the first 128 characters is entered.
Preferably, n represents the number of samples of the LCSTS Chinese summary data set.
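For illustration, a minimal numpy sketch of this fixed-length processing follows (the variable names and toy sequences are illustrative assumptions; real inputs would be the LTP-segmented texts mapped to word-vector indices):

```python
import numpy as np

# Toy pre-tokenized index sequences (illustrative only); in practice these
# come from the LTP-segmented texts mapped to word-vector indices.
summary_token_ids = [[5, 12, 7], [3, 9, 4, 1, 8]]
news_token_ids = [[2, 6, 6, 1], [7] * 200]

def pad_or_truncate(sequences, max_len):
    """Fill an empty (n, max_len) array item by item: zero-pad short
    sequences, cut off anything beyond the first max_len tokens."""
    out = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_len]
        out[i, :len(trimmed)] = trimmed
    return out

summary_ids = pad_or_truncate(summary_token_ids, 32)   # abstract texts -> (n, 32)
news_ids = pad_or_truncate(news_token_ids, 128)        # news short texts -> (n, 128)
print(summary_ids.shape, news_ids.shape)               # (2, 32) (2, 128)
```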
Preferably, the third step of scoring the abstract text and the short news text includes the following specific steps:
step 3.1, respectively inputting the abstract text and the short news text into a structure based on a Siamese network;
step 3.2, extracting features respectively through DPCNN 1 and DPCNN 2 of different depths, according to the lengths of the abstract text and the news short text input into the Siamese-network-based structure;
step 3.3, splicing and pooling the characteristics of the abstract text and the news short text through a concat function, and inputting the spliced and pooled characteristics into a full connection layer;
step 3.4, taking the manual label as a classification result, and scoring the similarity of the abstract text and the short news text input into the full connection layer by using a softmax function;
and 3.5, carrying out weighted average on the similarity scores of the generated abstract texts or the short news texts to obtain the text score of the abstract text or the short news text.
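A small sketch of steps 3.4-3.5, assuming (illustratively) that the manual labels form five score classes and that `probs` is the softmax output of the fully connected layer for one summary:

```python
import numpy as np

def weighted_score(probs, scores=(1, 2, 3, 4, 5)):
    """Collapse the softmax distribution over the manual score classes
    into a single text score by weighted averaging (step 3.5)."""
    return float(np.dot(probs, scores))

probs = np.array([0.05, 0.10, 0.20, 0.40, 0.25])  # softmax output for one summary
print(weighted_score(probs))                      # 3.7
```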
Preferably, the DPCNN-Siamese hybrid network model is set as a hybrid network model combining a Siamese network model and a DPCNN network model, wherein: the Siamese network model is used for carrying out semantic similarity matching; the DPCNN network model is used for extracting the characteristics of the abstract text or the short news text.
Referring to fig. 4, the specific steps of constructing the DPCNN-Siamese hybrid network model are as follows:
firstly, the structure of the Siamese network is improved by combining the characteristics of a text abstract data structure;
step two, taking the abstract text and the news short text represented by word vectors, {CA(v), CB(v)} (where CA(v) and CB(v) respectively denote the word-embedded A sequence and B sequence input into the Siamese network structure), as input, and encoding {CA(v), CB(v)} into two text vectors of the same dimension through DPCNN 1 and DPCNN 2 of different depths serving as text feature extraction layers (the text feature extraction layer contains two DPCNN network blocks of different depths, aiming at better capturing the features of CA(v) and CB(v): because the reference text of the summary data is the news headline, which is short relative to the news text, while the news content text is long, the corresponding features are learned and extracted through DPCNN networks of different depths);
thirdly, splicing the two text vectors with a concatenate function and then pooling them through a pooling layer;
and fourthly, after pooling, passing the text feature vectors through a fully connected layer and a softmax function for classification.
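A compact Keras sketch of this construction follows (a sketch under stated assumptions: the vocabulary size, block counts, hidden width and five-class output are illustrative; the patent itself fixes only the 300-dimensional word vectors, the 32/128 input lengths, and DPCNN feature extractors of different depths):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dpcnn_branch(x, blocks, filters=256):
    """DPCNN-style extractor: region embedding, then repeated
    [two equal-length convolutions -> shortcut add -> 1/2 pooling]."""
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)  # region embedding
    for _ in range(blocks):
        shortcut = x
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv1D(filters, 3, padding="same")(x)
        x = layers.Add()([x, shortcut])                                     # G(W) = z + f(z)
        x = layers.MaxPooling1D(pool_size=3, strides=2, padding="same")(x)  # 1/2 pooling
    return layers.GlobalMaxPooling1D()(x)

summary_in = layers.Input(shape=(32,))     # C_A(v): abstract text, 32 tokens
news_in = layers.Input(shape=(128,))       # C_B(v): news short text, 128 tokens
embed = layers.Embedding(100_000, 300)     # 300-dim; pre-trained vectors in practice

feat_a = dpcnn_branch(embed(summary_in), blocks=2)    # shallower DPCNN 1
feat_b = dpcnn_branch(embed(news_in), blocks=4)       # deeper DPCNN 2
merged = layers.Concatenate()([feat_a, feat_b])       # concat the two text vectors
hidden = layers.Dense(128, activation="relu")(merged)    # fully connected layer
out = layers.Dense(5, activation="softmax")(hidden)      # classification over score labels

model = tf.keras.Model([summary_in, news_in], out)
```

Here each branch is pooled before concatenation for simplicity; the ordering described above (concatenation, then a pooling layer, then the fully connected layer) changes only where the pooling sits.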
The specific implementation of using CNN network (convolutional neural network) to capture local features in text is as follows:
The DPCNN network (Deep Pyramid Convolutional Neural Network) was proposed on the basis of applying the TextCNN network (Text Convolutional Neural Network) to the field of natural language processing. The core idea of the CNN network (convolutional neural network) is to capture local features; for text, a local feature is a sliding window consisting of several words or characters, similar to an N-gram. The feature vector ci is obtained by convolving the text within the sliding window xi:i+h-1, as shown in formula (1) below:
ci=f(W*xi:i+h-1+b)..................................................(1)
wherein: ci denotes the feature vector obtained by convolving the i-th sliding window in the CNN network, f denotes the nonlinear activation function, W denotes the coefficient matrix parameter of the CNN network, xi:i+h-1 denotes the text vectors contained in the sliding window, b denotes the constant matrix (bias) parameter of the CNN network, and h denotes the size of the convolution window.
After the sliding window of the CNN network has traversed all the word vectors, the resulting features are:
c=[c1,c2,…,cn-h+1];............................................(2)
Wherein: c is the text characteristic vector after the convolution of the CNN convolution neural network is completed, c1,c2,…,cn-h+1Respectively, feature vectors obtained by convolution of 1 st, 2 … th and n-h +1 th sliding windows in the CNN neural network.
Referring to fig. 5, in the TextCNN network each word in the text is represented by an n-dimensional word vector, so the size of the input matrix is m × n, where m is the sentence length. The CNN network performs convolution on the input samples; for text data the convolution kernels do not slide horizontally but only downward, extracting local correlations between words in the manner of an n-gram. For example, the figure contains three convolution kernel sizes (2, 3 and 4), with two convolution kernels of each size; different convolution kernels are applied to different word windows, finally yielding 6 convolved vectors. A maximum pooling operation is then performed on each vector, followed by full connection, finally obtaining the feature representation of the sentence, and the sentence vector is input into a classifier for classification.
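A matching Keras sketch of the TextCNN structure just described (the vocabulary size, sentence length and class count are illustrative assumptions):

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(128,))             # sentence of m = 128 word indices
emb = layers.Embedding(100_000, 300)(inp)    # m x n input matrix, n = 300

pooled = []
for size in (2, 3, 4):                       # three kernel sizes, two kernels each
    conv = layers.Conv1D(filters=2, kernel_size=size, activation="relu")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))   # max pooling per feature map

sentence_vec = layers.Concatenate()(pooled)  # 6 pooled values -> sentence features
out = layers.Dense(5, activation="softmax")(sentence_vec)  # classifier
textcnn = Model(inp, out)
```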
Referring to fig. 6, the bottom layer of the DPCNN network keeps a structure similar to that of the TextCNN network. The convolution result of the convolutional layer containing multi-size convolution filters is the Region embedding, i.e., the embedding features generated after a set of convolution operations on a text region or segment (e.g., a 3-gram). The DPCNN network performs two equal-length convolutions after Region embedding so that each word obtains a richer representation; then equal-length convolution and 1/2 pooling are repeated to extract features, with residual connections used to alleviate gradient vanishing. Owing to the 1/2 pooling layers, the length of the text sequence decreases exponentially as the number of blocks increases, as shown in formula (3):
num_blocks=log2(seq_len).....................................(3)
wherein: num _ blocks is the number of blocks in the DPCNN network, and seq _ len is the sequence length of the text.
In order to alleviate gradient vanishing in the deep network, the DPCNN applies shortcut connections between the preceding and following equal-length convolution layers, as shown in formula (4):
G(W)=z+f(z)....................................................(4)
wherein: z is the output of the previous network layer, f(z) is the output of the current network layer, and G(W) is the output passed to the next network layer.
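Two short functions make formulas (3) and (4) concrete (illustrative; the 32/128 lengths are the input sizes fixed earlier):

```python
import math

def max_blocks(seq_len):
    """Formula (3): with 1/2 pooling the sequence halves per block,
    so about log2(seq_len) blocks exhaust the sequence."""
    return int(math.log2(seq_len))

print(max_blocks(32), max_blocks(128))  # 5 7

def shortcut(z, f):
    """Formula (4): the shortcut connection G(W) = z + f(z)."""
    return z + f(z)

print(shortcut(2.0, lambda v: 0.5 * v))  # 3.0
```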
The convolutional neural network is thus effective in text processing tasks; its advantage is that it can automatically combine and filter N-gram features to obtain semantic information at different levels of abstraction.
Referring to fig. 7, the Siamese network is provided with two sub-networks that have the same structure and share weights. The two sub-networks respectively receive two inputs X1 and X2, convert them into vectors Gw(X1) and Gw(X2), and then compute the distance E(X1,X2) between the two output vectors with a cosine distance measure, as shown in formula (5):
E(X1,X2)=cos(Gw(X1),Gw(X2))=Gw(X1)·Gw(X2)/(||Gw(X1)||·||Gw(X2)||)..................(5)
wherein: e (X)1,X2) The cosine similarity calculation formula is used, so-1 ≦ E (X)1,X2) 1 or less, which differs from the Euclidean distance in that EcosThe larger the value of (A), the closer the distance is, namely the higher the similarity between two sections of texts is; ecosThe smaller the value of (c) is, the farther the distance is, i.e., the lower the similarity between two pieces of text, and thus the same holds in the design of the LOSS function of the model.
When y is 0, the LOSS function increases monotonically with E; when y is 1, the LOSS function decreases monotonically with E. The specific formulas are as follows:
L+(X1,X2)=Ew^2...............................................(6)
L-(X1,X2)=(m-Ew)^2, Ew<m.....................................(7)
L-(X1,X2)=0, otherwise.......................................(8)
wherein: y indicates whether the two sentences are similar, with 1 for similar and 0 for dissimilar; L denotes the value of the LOSS function, with L+ applied to similar pairs and L- to dissimilar pairs; Ew denotes the distance between the two text vectors; m is a manually set threshold; and otherwise denotes the cases other than Ew < m.
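A numpy sketch of formulas (6)-(8); the squared forms follow the standard contrastive loss and the margin value is an illustrative assumption:

```python
import numpy as np

def contrastive_loss(e_w, y, m=1.0):
    """For similar pairs (y = 1) the loss grows with the distance Ew;
    for dissimilar pairs (y = 0) it is (m - Ew)^2 while Ew < m and
    0 otherwise, m being the manually set threshold."""
    loss_pos = e_w ** 2                        # formula (6)
    loss_neg = np.maximum(m - e_w, 0.0) ** 2   # formulas (7)/(8)
    return y * loss_pos + (1 - y) * loss_neg

print(contrastive_loss(np.array([0.2, 0.9]), np.array([1, 0])))  # [0.04 0.01]
```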
Preferably, in order to better match the semantic similarity between the features of the abstract text or news short text and the manual labels and obtain an abstract text or news short text evaluation model that accords with Chinese users' reading, the DPCNN-Siamese hybrid network model is trained when performing semantic similarity matching and scoring between the features of the abstract text or news short text and the manual labels.
The training of the DPCNN-Siamese hybrid network model is implemented as follows:
step 1, selecting the CNN1D, CNN2D and BiLSTM models in the experiment for comparison tests with the DPCNN-Siamese hybrid network model, wherein the CNN1D model is used as the baseline for effect comparison;
step 2, in the embedding layer, selecting word vectors trained on the Chinese Wikipedia corpus for Chinese and Stanford GloVe pre-trained word vectors for English, and setting the embedding to be trainable, wherein the hidden layer of the BiLSTM network is set to 128, the convolution kernel sizes of the CNN and DPCNN networks are set to 3, 4 and 5, the number of convolution kernels is set to 256, L1 and L2 regularization are enabled, the Adam algorithm is used as the optimization algorithm, and the learning rate is set to 0.001;
and step 3, evaluating the model training results with the three evaluation indexes P, R and F1.
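Assuming the Keras API and reusing the hybrid `model` sketched earlier, the training configuration of step 2 might look as follows (the loss choice is an illustrative assumption):

```python
import tensorflow as tf

# Adam optimizer with the learning rate fixed at 0.001 (step 2).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # manual score labels as classes (assumption)
    metrics=["accuracy"],
)

# L1/L2 regularization is attached per layer in Keras, e.g.:
# layers.Conv1D(256, 3, kernel_regularizer=tf.keras.regularizers.l1_l2(1e-5, 1e-4))
```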
Preferably, the calculation formulas of the P, R and F1 values are as follows:
P=TP/(TP+FP).................................................(9)
R=TP/(TP+FN)................................................(10)
F1=2*P*R/(P+R)..............................................(11)
wherein: p is expressed as correct rate, R is expressed as recall rate, TP is expressed as predicted value and true value are true, FP is expressed as predicted value is true and true value is false, FN is expressed as predicted value is false and true value is true.
The specific correspondence can be seen in table 1.
TABLE 1
                          True value is true    True value is false
Predicted value is true          TP                    FP
Predicted value is false         FN                    TN
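The three indexes follow directly from these counts; a minimal sketch with illustrative values:

```python
def precision_recall_f1(tp, fp, fn):
    """P, R and F1 from the TP/FP/FN counts defined above."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.889 0.842
```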
Through comparison, the DPCNN-Siamese hybrid network model outperforms the other networks on the Chinese data set, is more efficient, and shows a greater advantage in comparing the semantic feature similarity of short texts.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A Chinese automatic text abstract evaluation method based on semantic similarity is characterized by comprising the following steps:
step one, abstract texts, news short texts and manual labels are extracted from PART II and PART III of the LCSTS Chinese summarization data set;
secondly, preprocessing the extracted abstract texts and news short texts, and representing them with pre-trained word vectors;
and step three, inputting the abstract texts and news short texts represented by the pre-trained word vectors into a DPCNN-Siamese hybrid network model for scoring.
2. The Chinese automatic text abstract evaluation method according to claim 1, wherein the second step of preprocessing the abstract text and the news short text comprises the following specific steps:
step 2.1, extracting the abstract texts, news short texts and manually labeled contents from the LCSTS Chinese summarization data set through the lxml library of python, and outputting them respectively to different files in the corresponding order;
step 2.2, using the LTP word segmentation tool to segment the abstract texts and news short texts extracted from the LCSTS Chinese summarization data set, and using word vectors pre-trained on the Chinese Wikipedia corpus as the text word vectors of the Chinese data;
step 2.3, converting the Chinese in the abstract texts and news short texts into 300-dimensional pre-trained word vectors, and processing the length of each abstract text to 32 characters and the length of each news short text to 128 characters;
and step 2.4, respectively inputting the abstract texts processed to a length of 32 characters and the news short texts processed to a length of 128 characters into the neural network.
3. The Chinese automatic text summarization evaluation method of claim 2,
the specific method for processing the length of the abstract text to 32 characters is as follows: an empty list of dimension (n,32) is set and the abstract text data are entered item by item; when the length of an abstract text is less than 32 characters, it is zero-padded; when an abstract text is longer than 32 characters, the content beyond 32 characters is cut off and only the data of the first 32 characters is entered;
the specific method for processing the length of the news short text to 128 characters is as follows: an empty list of dimension (n,128) is set and the news short text data are entered item by item; when the length of a news short text is less than 128 characters, it is zero-padded; when a news short text is longer than 128 characters, the content beyond 128 characters is cut off and only the data of the first 128 characters is entered;
wherein: n represents the number of samples in the LCSTS Chinese summarization data set.
4. The method for automatically evaluating a Chinese text summary according to claim 1, wherein the third step of scoring the summary text and the short news text comprises the following specific steps:
step 3.1, respectively inputting the abstract text and the short news text into a structure based on a Siamese network;
step 3.2, extracting features respectively through DPCNN 1 and DPCNN 2 of different depths, according to the lengths of the abstract text and the news short text input into the Siamese-network-based structure;
step 3.3, splicing and pooling the characteristics of the abstract text and the news short text through a concat function, and inputting the spliced and pooled characteristics into a full connection layer;
step 3.4, taking the manual label as a classification result, and performing semantic similarity matching and scoring on the characteristics of the abstract text and the news short text input into the full connection layer and the manual label by using a softmax function;
and 3.5, carrying out weighted average on the similarity scores of the generated abstract texts or the short news texts to obtain the text score of the abstract text or the short news text.
5. The method as claimed in claim 1, wherein the DPCNN-Siamese hybrid network model is a hybrid network model combining a Siamese network model and a DPCNN network model, wherein the Siamese network model is used for semantic similarity matching, and the DPCNN network model is used for extracting features of the digest text or the short news text.
6. The method for automatically evaluating the abstract of Chinese text as claimed in claim 5, wherein, in order to better match the semantic similarity between the features of the abstract text or news short text and the manual labels and obtain an abstract text or news short text evaluation model that accords with Chinese users' reading, the DPCNN-Siamese hybrid network model is trained when performing semantic similarity matching and scoring between the features of the abstract text or news short text and the manual labels;
the training of the DPCNN-Siamese hybrid network model is implemented as follows:
step 1, selecting the CNN1D, CNN2D and BiLSTM models in the experiment for comparison tests with the DPCNN-Siamese hybrid network model, wherein the CNN1D model is used as the baseline for effect comparison;
step 2, in the embedding layer, selecting word vectors trained on the Chinese Wikipedia corpus for Chinese and Stanford GloVe pre-trained word vectors for English, and setting the embedding to be trainable, wherein the hidden layer of the BiLSTM network is set to 128, the convolution kernel sizes of the CNN and DPCNN networks are set to 3, 4 and 5, the number of convolution kernels is set to 256, L1 and L2 regularization are enabled, the Adam algorithm is used as the optimization algorithm, and the learning rate is set to 0.001;
and step 3, evaluating the model training results with the three evaluation indexes P, R and F1.
7. The method for Chinese automatic text summarization evaluation according to claim 6, wherein the P, R and F1 values are calculated by the following formulas:
P=TP/(TP+FP)
R=TP/(TP+FN)
F1=2*P*R/(P+R)
wherein: p is expressed as correct rate, R is expressed as recall rate, TP is expressed as predicted value and true value are true, FP is expressed as predicted value is true and true value is false, FN is expressed as predicted value is false and true value is true.
8. The method for Chinese automatic text summarization evaluation according to claim 1, wherein the specific steps of building the DPCNN-Siamese hybrid network model are as follows:
firstly, the structure of the Siamese network is improved by combining the characteristics of a text abstract data structure;
step two, taking the abstract text and the news short text represented by word vectors, {CA(v), CB(v)}, as input of the Siamese network structure, and encoding {CA(v), CB(v)} into two text vectors of the same dimension through DPCNN 1 and DPCNN 2 of different depths serving as text feature extraction layers, where CA(v) and CB(v) respectively denote the word-embedded A sequence and B sequence input into the Siamese network structure;
thirdly, splicing the two text vectors with a concatenate function and then pooling them through a pooling layer;
and fourthly, after pooling, passing the text feature vectors through a fully connected layer and a softmax function for classification.
9. The method for automatically evaluating the abstract of Chinese text as claimed in claim 8, wherein the specific method for improving the structure of the Siamese network is as follows:
the two sub-networks with the same structure and shared weights in the Siamese network are changed into two DPCNN networks of different depths, which perform feature extraction on the abstract texts and news short texts of different lengths in the Chinese summarization data set; a concat operation splices the outputs of the two DPCNN networks into one text vector, and after the text vector passes through a pooling layer and a fully connected layer, classification is carried out with softmax.
10. The method of automatic chinese text summarization evaluation according to claim 8 wherein the text feature extraction layer comprises two DPCNN network modules of different depths.
CN202110382498.2A 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity Pending CN113032569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110382498.2A CN113032569A (en) 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110382498.2A CN113032569A (en) 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity

Publications (1)

Publication Number Publication Date
CN113032569A true CN113032569A (en) 2021-06-25

Family

ID=76456077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110382498.2A Pending CN113032569A (en) 2021-04-09 2021-04-09 Chinese automatic text abstract evaluation method based on semantic similarity

Country Status (1)

Country Link
CN (1) CN113032569A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766432A (en) * 2018-07-12 2019-05-17 中国科学院信息工程研究所 A kind of Chinese abstraction generating method and device based on generation confrontation network
CN110390103A (en) * 2019-07-23 2019-10-29 中国民航大学 Short text auto-abstracting method and system based on Dual-encoder
CN110688479A (en) * 2019-08-19 2020-01-14 中国科学院信息工程研究所 Evaluation method and sequencing network for generating abstract
CN111325029A (en) * 2020-02-21 2020-06-23 河海大学 Text similarity calculation method based on deep learning integration model
CN111552801A (en) * 2020-04-20 2020-08-18 大连理工大学 Neural network automatic abstract model based on semantic alignment
CN111930931A (en) * 2020-07-20 2020-11-13 桂林电子科技大学 Abstract evaluation method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766432A (en) * 2018-07-12 2019-05-17 中国科学院信息工程研究所 A kind of Chinese abstraction generating method and device based on generation confrontation network
CN110390103A (en) * 2019-07-23 2019-10-29 中国民航大学 Short text auto-abstracting method and system based on Dual-encoder
CN110688479A (en) * 2019-08-19 2020-01-14 中国科学院信息工程研究所 Evaluation method and sequencing network for generating abstract
CN111325029A (en) * 2020-02-21 2020-06-23 河海大学 Text similarity calculation method based on deep learning integration model
CN111552801A (en) * 2020-04-20 2020-08-18 大连理工大学 Neural network automatic abstract model based on semantic alignment
CN111930931A (en) * 2020-07-20 2020-11-13 桂林电子科技大学 Abstract evaluation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RIE JOHNSON et al.: "Deep Pyramid Convolutional Neural Networks for Text Categorization", Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics *

Similar Documents

Publication Publication Date Title
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN109960724B (en) Text summarization method based on TF-IDF
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN112257453B (en) Chinese-Yue text similarity calculation method fusing keywords and semantic features
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN106598959B (en) Method and system for determining mutual translation relationship of bilingual sentence pairs
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN111897917B (en) Rail transit industry term extraction method based on multi-modal natural language features
CN111858842A (en) Judicial case screening method based on LDA topic model
CN116244445B (en) Aviation text data labeling method and labeling system thereof
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN110222250A (en) A kind of emergency event triggering word recognition method towards microblogging
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN111984782A (en) Method and system for generating text abstract of Tibetan language
CN113962228A (en) Long document retrieval method based on semantic fusion of memory network
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN111460147A (en) Title short text classification method based on semantic enhancement
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN117371534B (en) Knowledge graph construction method and system based on BERT
CN115033753A (en) Training corpus construction method, text processing method and device
CN111191029A (en) AC construction method based on supervised learning and text classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210625

RJ01 Rejection of invention patent application after publication