CN114757203A - Chinese sentence simplification method and system based on contrastive learning - Google Patents

Chinese sentence simplification method and system based on contrastive learning

Info

Publication number
CN114757203A
Authority
CN
China
Prior art keywords
sentence
chinese
complex
model
ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210458189.3A
Other languages
Chinese (zh)
Inventor
王路路
张鹏
杜冀中
闫磊
陆弘锴
彭钰婷
刘佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhipu Huazhang Technology Co., Ltd.
Original Assignee
Beijing Zhipu Huazhang Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhipu Huazhang Technology Co., Ltd.
Priority to CN202210458189.3A
Publication of CN114757203A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a Chinese sentence simplification method and system based on contrastive learning, wherein the method comprises the following steps: mining a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner; calculating supervision signals for each sentence pair; prepending the supervision signals, in the form of a character string, to the start of the complex sentence in each pair, generating a data set of complex sentence-simple sentence pairs with supervision signals, and dividing the data set into a training set, a validation set, and a test set; pruning a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model; introducing a contrastive learning loss to fine-tune the Chinese monolingual pre-training model and jointly train a Chinese sentence simplification model; and inputting the complex sentences in the test set into the simplification model to generate predicted simplified sentences and evaluate the effect of the Chinese sentence simplification model. The method allows the generated simplified sentences to be controlled according to actual requirements and improves their fidelity.

Description

Chinese sentence simplification method and system based on contrastive learning
Technical Field
The application relates to the technical field of natural language processing, and in particular to a Chinese sentence simplification method and system based on contrastive learning.
Background
With the amount of information growing explosively, sentence simplification has attracted attention in many fields. Sentence simplification aims to modify the content and structure of an original sentence to reduce its complexity while preserving its main ideas and staying close to its original semantics, so that the sentence becomes easier to read and understand. This is especially valuable for readers with limited reading ability, such as children, hearing-impaired people, second-language learners, and people with low literacy. Moreover, sentence simplification can also improve the performance of other natural language processing (NLP) tasks, such as text summarization, information extraction, semantic role labeling, syntactic analysis, and machine translation.
In the related art, most sentence simplification research focuses on English; research on Chinese sentence simplification is far less developed and no public data set exists, so a mature, usable simplification method is urgently needed. Early approaches were rule-based and mainly considered lexical or syntactic simplification. Current research falls into two main categories: one treats sentence simplification as a monolingual machine translation task; the other treats it as a sentence editing task composed of operations such as delete, insert, and keep.
However, in some application scenarios the length of the simplified output cannot be controlled, and compared with the original sentence the simplified output tends to have low fidelity and to miss important points.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a Chinese sentence simplification method based on contrastive learning. The method treats sentence simplification as an end-to-end conditional generation task: it first mines semantically similar complex sentence-simple sentence pairs with an unsupervised method, then computes edit signals and lexical-syntactic complexity signals as supervision signals, and introduces contrastive learning for fine-tuning on top of an encoder-decoder pre-training model, so that a user can not only control the generated simplified sentences according to actual needs but also improve their fidelity.
A second objective of the present application is to provide a Chinese sentence simplification system based on contrastive learning.
A third object of the present application is to propose a non-transitory computer-readable storage medium.
To achieve the above objects, an embodiment of the first aspect of the present application provides a Chinese sentence simplification method based on contrastive learning, including the following steps:
mining a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner;
calculating supervision signals for each complex sentence-simple sentence pair;
prepending each supervision signal, in the form of a character string, to the start position of the complex sentence in the corresponding sentence pair to generate a data set of complex sentence-simple sentence pairs with supervision signals, and dividing this data set into a training set, a validation set, and a test set according to a preset ratio;
performing model pruning on a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model;
introducing a contrastive learning loss to fine-tune the Chinese monolingual pre-training model based on the training set and the validation set, jointly training a Chinese sentence simplification model;
and inputting the complex sentences in the test set into the Chinese sentence simplification model, generating predicted simplified sentences through the model, evaluating the simplification effect of the model, and, when the simplification effect is greater than a preset threshold, simplifying the Chinese sentences to be simplified through the Chinese sentence simplification model.
Optionally, in an embodiment of the present application, mining a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner includes: acquiring a large volume of Chinese sentences from a preset resource library; obtaining a vector for each sentence through a language tool library, creating an index, and mining a plurality of similar candidate sentences for each sentence; and conditionally filtering the candidate sentences of each sentence, determining the target sentence corresponding to each sentence, and generating a plurality of semantically similar complex sentence-simple sentence pairs.
Optionally, in an embodiment of the present application, the supervision signals include a sentence length ratio, an edit distance ratio, a vocabulary complexity ratio, and a syntax tree depth ratio, and calculating the supervision signals for each complex sentence-simple sentence pair includes: calculating the ratio of the length of the simple sentence to the length of the complex sentence in each sentence pair to obtain the sentence length ratio; calculating the Levenshtein distance between the complex sentence and the simple sentence, and calculating the Levenshtein distance ratio of each editing operation to obtain the edit distance ratio, wherein the editing operations include delete, insert, and replace; calculating the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain the vocabulary complexity ratio, wherein vocabulary complexity is expressed by word frequency; and obtaining the syntax tree depths of the complex sentence and the simple sentence through a natural language text processing library, and calculating the ratio of the syntax tree depth of the complex sentence to that of the simple sentence to obtain the syntax tree depth ratio.
Optionally, in an embodiment of the present application, prepending each supervision signal, in the form of a character string, to the start position of the complex sentence in the corresponding sentence pair includes: sequentially adding the supervision signals of each sentence pair before the start position of the corresponding complex sentence, with each signal's ratio placed after its name.
Optionally, in an embodiment of the present application, performing model pruning on the preset encoder-decoder-based multilingual pre-training model includes: selecting punctuation marks, digits, English letters, and high-frequency Chinese words commonly used in Chinese sentences as a new vocabulary; replacing the original vocabulary of the multilingual pre-training model with the new vocabulary, and updating the input and output vector representation parameters of the model to update the multilingual pre-training model; and saving the new vocabulary and the updated pre-training model.
Optionally, in an embodiment of the present application, evaluating the simplification effect of the Chinese sentence simplification model includes: comparing the predicted simplified sentences with standard reference simplified sentences; and evaluating the simplification effect of the model through a plurality of preset evaluation metrics, including the BLEU-4, Rouge-L, and SARI metrics.
To achieve the above objects, an embodiment of the second aspect of the present application provides a Chinese sentence simplification system based on contrastive learning, including the following modules:
a mining module, used for mining a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner;
a calculation module, used for calculating supervision signals for each complex sentence-simple sentence pair;
a first generation module, used for prepending each supervision signal, in the form of a character string, to the start position of the complex sentence in the corresponding sentence pair, generating a data set of complex sentence-simple sentence pairs with supervision signals, and dividing this data set into a training set, a validation set, and a test set according to a preset ratio;
a second generation module, used for performing model pruning on a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model;
a training module, used for introducing a contrastive learning loss to fine-tune the Chinese monolingual pre-training model based on the training set and the validation set, jointly training a Chinese sentence simplification model;
and a third generation module, used for inputting the complex sentences in the test set into the Chinese sentence simplification model, generating predicted simplified sentences through the model, evaluating the simplification effect of the model, and, when the simplification effect is greater than a preset threshold, simplifying the Chinese sentences to be simplified through the Chinese sentence simplification model.
Optionally, in an embodiment of the present application, the mining module is specifically configured to: acquire a large volume of Chinese sentences from a preset resource library; obtain a vector for each sentence through a language tool library, create an index, and mine a plurality of similar candidate sentences for each sentence; and conditionally filter the candidate sentences of each sentence, determine the target sentence corresponding to each sentence, and generate a plurality of semantically similar complex sentence-simple sentence pairs.
Optionally, in an embodiment of the present application, the supervision signals include a sentence length ratio, an edit distance ratio, a vocabulary complexity ratio, and a syntax tree depth ratio, and the calculation module is specifically configured to: calculate the ratio of the length of the simple sentence to the length of the complex sentence in each sentence pair to obtain the sentence length ratio; calculate the Levenshtein distance between the complex sentence and the simple sentence, and calculate the Levenshtein distance ratio of each editing operation to obtain the edit distance ratio, wherein the editing operations include delete, insert, and replace; calculate the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain the vocabulary complexity ratio, wherein vocabulary complexity is expressed by word frequency; and obtain the syntax tree depths of the complex sentence and the simple sentence through a natural language text processing library, and calculate the ratio of the syntax tree depth of the complex sentence to that of the simple sentence to obtain the syntax tree depth ratio.
To implement the foregoing embodiments, the third aspect of the present application further provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the contrastive-learning-based Chinese sentence simplification method of the foregoing embodiments.
The technical scheme provided by the embodiments of the application has at least the following beneficial effects: the method first mines semantically similar complex sentence-simple sentence pairs with an unsupervised method, then uses the edit signals and lexical-syntactic complexity signals between the sentence pairs as supervision signals, and fine-tunes an encoder-decoder pre-training model, so that a user can control the generated simplified sentences according to different requirements. In addition, the method introduces a contrastive learning loss on top of the cross-entropy loss function to increase the distance between positive and negative samples, thereby improving the fidelity of the generated simplified sentences. The application thus makes the generated results controllable according to different requirements and improves the fidelity of the target simplified sentences.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a Chinese sentence simplification method based on contrastive learning according to an embodiment of the present application;
Fig. 2 is a flowchart of a mining method for semantically similar sentence pairs according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for calculating supervision signals according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a Chinese sentence simplification system based on contrastive learning according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes the Chinese sentence simplification method and system based on contrastive learning according to the embodiments of the present invention with reference to the drawings.
Fig. 1 is a flowchart of a Chinese sentence simplification method based on contrastive learning according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
and step S101, mining a plurality of complex sentence-simple sentence pairs with similar semantics based on an unsupervised learning mode.
The unsupervised learning is a method for performing pattern recognition machine learning according to unmarked training samples, the training samples have no labels, and the unsupervised learning method can be applied to mining complex sentence-simple sentence pairs with similar semantics in the scene lacking prior knowledge.
The complex sentence-simple sentence pair is a sentence pair composed of a complex sentence with a longer length and a simple sentence with a similar semantic meaning but a shorter length, for example, "today, a line is built and completed" - "a line is completed today".
Specifically, a complex sentence-simple sentence pair with similar semantics is mined by an unsupervised method, and in order to more clearly explain a specific implementation process of mining the complex sentence-simple sentence pair in the present application, a mining method of a sentence pair with similar semantics, which is provided in an embodiment of the present application, is taken as an example to be described in detail below. As shown in fig. 2, the method comprises the steps of:
step S201: and acquiring large-data-volume Chinese sentences from a preset resource library.
Specifically, the resource library is a database in which a large number of chinese sentences are prestored, and may be various websites and cloud platform servers capable of reading chinese data. In this embodiment, a resource library is selected in advance according to actual needs, and a large number of processed chinese sentences are downloaded from the resource library. For example, large-scale Chinese sentences that have been crawled and preprocessed by the CC-100 website are downloaded from the website.
Step S202: and acquiring a vector of each sentence through a language tool library, creating an index, and excavating a plurality of similar candidate sentences corresponding to each sentence.
The Language tool library is an open source Language tool kit that includes multiple languages and supports Similarity vector retrieval, and for example, the Language tool library includes a LASER (Language-adaptive search retrieval) library and a faiss (facebook AI Similarity search) library. The LASER library is a tool kit capable of exploring multi-language sentence expression, can write multiple languages through different character strings, embeds more types of languages into an independent shared space, and realizes cross-language migration without modification. The Faiss library is a library aiming at clustering and similarity searching, provides efficient similarity searching and clustering for dense vectors, and realizes approximate searching.
In this embodiment, first, each language tool library is jointly utilized to execute a respective search mining function, a vector representation of each sentence mined in the above steps is obtained, and an index of each sentence is created. Then, a plurality of similar candidate sentences corresponding to each sentence are mined by utilizing the Euclidean distance. For example, 8 candidate sentences similar to each sentence in semanteme are mined by calculating Euclidean distances between sentences.
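As a concrete illustration, the following sketch embeds sentences with LASER and retrieves nearest neighbors with a Faiss L2 index. The laserembeddings wrapper, the toy corpus, and the use of an exact (non-approximate) index are assumptions made for illustration; the text above specifies only LASER, Faiss, Euclidean distance, and 8 candidates per sentence.

```python
# Minimal sketch of Step S202 (assumed tooling: the laserembeddings package,
# whose models must be downloaded once via `python -m laserembeddings download-models`).
import faiss
from laserembeddings import Laser

sentences = [
    "今天一号线建成通车了",
    "一号线今天通车了",
    "天气很好",
]  # in practice: the large corpus acquired in Step S201

laser = Laser()
embeddings = laser.embed_sentences(sentences, lang="zh").astype("float32")

# Exact L2 (Euclidean) index; at corpus scale an approximate index would be used.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# k includes the query itself as its own nearest neighbor, so k = 8 + 1
# yields the 8 candidate sentences described above.
distances, neighbor_ids = index.search(embeddings, k=min(9, len(sentences)))
```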
Step S203: and performing conditional filtering on the candidate sentences corresponding to each sentence, determining the target sentences corresponding to each sentence, and generating a plurality of complex sentence-simple sentence pairs with similar semantics.
In this embodiment, for each sentence, further conditional filtering is performed on a plurality of candidate sentences of the sentence, including operations of filtering out a sentence with wrong content and filtering out a sentence with higher semantic similarity, and a target sentence corresponding to the current sentence is determined from the plurality of candidate sentences. For example, if the current sentence is a complex sentence, the target sentence is a target simple sentence corresponding to the complex sentence.
Therefore, the corresponding target simple sentences or target complex sentences with similar semantics are matched for each sentence, each sentence and the corresponding target sentence are combined into sentence pairs, and a plurality of complex sentence-simple sentence pairs with similar semantics are generated.
Step S102: calculating supervision signals for each complex sentence-simple sentence pair.
In an embodiment of the present application, the supervision signals include the sentence length ratio, the edit distance ratio, the vocabulary complexity ratio, and the syntax tree depth ratio.
To describe the calculation of each supervision signal more clearly, the method for calculating supervision signals provided in an embodiment of the present application is described in detail below as an example. As shown in Fig. 3, the method includes the following steps:
step S301: and calculating the ratio of the simple sentence length to the complex sentence length in each sentence pair to obtain a sentence length ratio.
Specifically, the length of the simple sentence and the length of the complex sentence in each sentence pair are determined, and then the ratio of the length of the simple sentence to the length of the complex sentence is calculated, so that the sentence length ratio of each sentence pair is calculated.
Step S302: calculating the Levensian distance between the complex sentence and the simple sentence, and calculating the Levensian distance ratio of each editing operation to obtain the editing distance ratio, wherein the editing operations comprise: delete, insert, and replace.
The Levenshtein distance (Levenshtein) refers to the minimum number of editing operations required for converting one string into another string, and can be used for measuring the similarity between two strings.
In this embodiment, when calculating the edit distance ratio, the levenstein distance between the complex sentence and the simple sentence is calculated first, and then the levenstein ratios of various editing operations of deletion, insertion, and replacement are calculated respectively, that is, the present application calculates 4 edit distance ratios in total.
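The following sketch computes these four ratios with the python-Levenshtein package; normalizing each count by the complex-sentence length is an assumption, since the text does not fix the denominator of the ratios.

```python
# Sketch of Step S302 (assumed tooling: the python-Levenshtein package).
import Levenshtein

def edit_distance_ratios(complex_sent: str, simple_sent: str) -> dict:
    ops = Levenshtein.editops(complex_sent, simple_sent)  # per-operation alignment
    counts = {"delete": 0, "insert": 0, "replace": 0}
    for op, _, _ in ops:
        counts[op] += 1
    n = max(len(complex_sent), 1)  # assumed normalizer
    return {
        "LEV": len(ops) / n,           # overall Levenshtein distance ratio
        "DEL": counts["delete"] / n,   # deletion ratio
        "INS": counts["insert"] / n,   # insertion ratio
        "REP": counts["replace"] / n,  # replacement ratio
    }

ratios = edit_distance_ratios("今天一号线建成通车了", "一号线今天通车了")
```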
Step S303: and calculating the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain a vocabulary complexity ratio, wherein the vocabulary complexity is expressed by the word frequency of the vocabulary.
In this embodiment, the word complexity is expressed by the word frequency of the words, the complexity of each word in the complex sentence and the simple sentence is calculated for each sentence pair in sequence, and the word complexity of the complex sentence and the word complexity of the simple sentence are determined by the accumulative calculation. And then calculating the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain the vocabulary complexity ratio of each sentence pair.
Step S304: and respectively acquiring the syntax tree depths of the complex sentence and the simple sentence through a natural language text processing library, and calculating the ratio of the syntax tree depth of the complex sentence to the syntax tree depth of the simple sentence to obtain the syntax tree depth ratio.
In this embodiment, the natural language text processing library may be a spaCy library or other open source library for natural language processing, the spaCy library may be applied to natural language processing in Python, and in the present application, a space library is used to obtain the depths of the chinese dependent syntax trees of the complex sentence and the simple sentence, and then the ratio between the two is calculated to obtain the syntax tree depth ratio.
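A sketch of this step follows, assuming spaCy's Chinese pipeline zh_core_web_sm and defining depth as the longest root-to-leaf path in the dependency tree; both choices are assumptions, since the text names only the spaCy library.

```python
# Sketch of Step S304 (assumed model: zh_core_web_sm, installed via
# `python -m spacy download zh_core_web_sm`).
import spacy

nlp = spacy.load("zh_core_web_sm")

def tree_depth(token) -> int:
    # Depth of the dependency subtree rooted at `token`.
    children = list(token.children)
    if not children:
        return 1
    return 1 + max(tree_depth(child) for child in children)

def syntax_depth_ratio(complex_sent: str, simple_sent: str) -> float:
    depth_complex = max(tree_depth(sent.root) for sent in nlp(complex_sent).sents)
    depth_simple = max(tree_depth(sent.root) for sent in nlp(simple_sent).sents)
    return depth_complex / depth_simple
```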
Step S103: prepending each supervision signal, in the form of a character string, to the start position of the complex sentence in the corresponding sentence pair, generating a data set of complex sentence-simple sentence pairs with supervision signals, and dividing this data set into a training set, a validation set, and a test set according to a preset ratio.
In an embodiment of the present application, for the supervision signals calculated in the embodiment of step S102, prepending each supervision signal to the start of the complex sentence in the corresponding sentence pair in the form of a character string consists of sequentially adding the supervision signals of each sentence pair before the start position of the corresponding complex sentence, with each signal's ratio placed after its name. The signals may be added in the order in which they were calculated, or each signal may be assigned a weight and the signals sorted from largest to smallest.
For example, the supervision signals of a complex sentence-simple sentence pair are added before the start of the original complex sentence in the form <signal name_signal ratio>, forming a complex sentence-simple sentence data set with supervision signals on top of the original sentence pairs, where each row of data has the form: <signal name 1_signal ratio 1> … <signal name n_signal ratio n> complex sentence, simple sentence. The data set is then saved in a file format such as CSV.
Furthermore, the data set of complex sentence-simple sentence pairs with supervision signals is divided into a training set, a validation set, and a test set according to a preset ratio, for example 8:1:1, so that the model can subsequently be trained and its performance verified; see the sketch below.
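The sketch below builds such rows and performs the 8:1:1 split; the signal names, the two-decimal formatting, and the toy sentence pair are illustrative assumptions.

```python
# Sketch of Step S103: control-string prefixes plus the 8:1:1 split.
import csv
import random

mined_pairs = [  # (supervision signals, complex sentence, simple sentence)
    ({"LEN": 0.8, "LEV": 0.4, "VOC": 1.3, "DEP": 1.5},
     "今天一号线建成通车了", "一号线今天通车了"),
]

def make_row(signals: dict, complex_sent: str, simple_sent: str) -> list:
    prefix = "".join(f"<{name}_{ratio:.2f}>" for name, ratio in signals.items())
    return [prefix + complex_sent, simple_sent]

rows = [make_row(*pair) for pair in mined_pairs]
random.shuffle(rows)
n = len(rows)
splits = {
    "train": rows[: int(0.8 * n)],
    "valid": rows[int(0.8 * n): int(0.9 * n)],
    "test": rows[int(0.9 * n):],
}
for name, split in splits.items():
    with open(f"{name}.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(split)
```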
Step S104: performing model pruning on a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model.
It should be noted that most pre-training models used for sentence simplification in the related art are multilingual pre-training models implemented on an encoder-decoder structure, so an encoder-decoder pre-training model that covers Chinese usually covers other languages as well. For Chinese sentence simplification this redundant information is useless and consumes too many resources: the vector representation parameters of the other languages may account for more than half of the model parameters, wasting computing resources. Therefore, the present application prunes the redundant vector representations and converts the multilingual pre-training model into a monolingual pre-training model suitable for Chinese.
In a specific implementation, a multilingual pre-training model from the related art is selected in advance as the pre-training model for Chinese simplification; model pruning is then performed on this preset encoder-decoder-based multilingual pre-training model, the redundant vector representations are pruned, and the multilingual model is converted into a Chinese monolingual pre-training model.
As one possible implementation, punctuation marks, digits, English letters, and high-frequency Chinese words commonly used in Chinese sentences are selected as a new vocabulary. The preset multilingual pre-training model has an original vocabulary that may include the vocabularies of many other languages; in this embodiment, the new vocabulary selected above replaces the original one.
Then, the original vocabulary of the multilingual pre-training model is replaced with the new vocabulary, and the input and output vector representation parameters of the model are updated so as to update the multilingual pre-training model. In a specific implementation, the vocabulary can be replaced directly, or the tokens of the original vocabulary that are absent from the new vocabulary can be deleted and the newly appearing tokens introduced into it. The vector representations of the model's inputs and outputs are then updated by replacing the corresponding parameters in the multilingual pre-training model, updating the neural network.
Finally, the newly generated vocabulary and the updated pre-training model are saved, completing the pruning of the multilingual pre-training model and yielding the Chinese monolingual pre-training model.
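As an illustration only, the following sketch trims the tied embedding matrix of an mBART-style model from Hugging Face Transformers. The patent does not name a specific model, so mBART, the token-selection rule, and the omission of tokenizer retraining are all assumptions here.

```python
# Sketch of Step S104: keep only tokens useful for Chinese and slice the
# (tied) embedding matrix accordingly. A full implementation would also remap
# token ids inside the tokenizer and resize auxiliary buffers such as
# mBART's final_logits_bias; both are omitted for brevity.
import torch
from transformers import MBart50Tokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")

def keep(token: str) -> bool:
    # Illustrative rule: CJK characters plus ASCII letters, digits, punctuation.
    text = token.lstrip("▁")  # strip the SentencePiece word-boundary marker
    return all("\u4e00" <= ch <= "\u9fff" or ch.isascii() for ch in text)

vocab = tokenizer.get_vocab()
keep_ids = sorted({i for t, i in vocab.items() if keep(t)} | set(tokenizer.all_special_ids))

old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(keep_ids), old_embeddings.shape[1])
new_embeddings.weight.data = old_embeddings[keep_ids].clone()
model.set_input_embeddings(new_embeddings)
model.tie_weights()  # re-tie the output projection to the pruned embeddings
model.config.vocab_size = len(keep_ids)
```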
Step S105: introducing a contrastive learning loss to fine-tune the Chinese monolingual pre-training model based on the training set and the validation set, jointly training a Chinese sentence simplification model.
Contrastive learning is a form of self-supervised learning: it learns a feature representation of a sample by comparing the sample with positive and negative samples in feature space.
Specifically, on top of the pre-training model's original cross-entropy loss, a contrastive learning loss is added to jointly optimize the model during training. The data in the training set and the validation set are used as training data, the pruned pre-training model from step S104 is fine-tuned, and a Chinese simplification model is obtained by training, so that Chinese sentences can subsequently be simplified through it.
It should be noted that, when performing contrastive learning, the present application trains the pruned pre-training model by adding contrastive positive and negative samples, taking the vector representation of the target simple sentence as the learning target, where the simple sentence of the originally mined complex sentence-simple sentence pair serves as the target simple sentence. By training the encoder, the features output by the model become more similar to those of the positive sample and more dissimilar to those of the remaining negative samples. The specific implementation of contrastive learning may follow the related art and is not repeated here.
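The following sketch shows one way such a joint objective can look, assuming an InfoNCE-style contrastive term over pooled sentence representations with in-batch negatives; the pooling choice, temperature, and weighting are assumptions, since the text only states that a contrastive loss is added on top of the cross-entropy loss.

```python
# Sketch of the joint training loss of Step S105.
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, labels, anchor_repr, positive_repr,
               temperature: float = 0.1, alpha: float = 1.0):
    # Token-level cross-entropy: the standard generation objective.
    ce = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    # InfoNCE: each anchor (e.g. pooled decoder state) should be closest to its
    # own positive (the target simple sentence); the other positives in the
    # batch act as negatives, pushing positive and negative samples apart.
    anchor = F.normalize(anchor_repr, dim=-1)      # (batch, hidden)
    positive = F.normalize(positive_repr, dim=-1)  # (batch, hidden)
    logits = anchor @ positive.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    contrastive = F.cross_entropy(logits, targets)
    return ce + alpha * contrastive
```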
Step S106: inputting the complex sentences in the test set into the Chinese sentence simplification model, generating predicted simplified sentences through the Chinese sentence simplification model, evaluating the simplification effect of the model, and, when the simplification effect is greater than a preset threshold, simplifying the Chinese sentences to be simplified through the Chinese sentence simplification model.
Specifically, the complex sentences in the test set are input into the trained Chinese sentence simplification model, the fine-tuned model predicts the best simplified sentences, the predicted simplified sentences output by the model are collected, and the Chinese sentence simplification effect is evaluated on the basis of these predictions.
In an embodiment of the present application, evaluating the simplification effect of the Chinese sentence simplification model includes comparing the predicted simplified sentences output by the model with standard reference simplified sentences, and evaluating the simplification effect according to a plurality of preset evaluation metrics, including the BLEU-4, Rouge-L, and SARI metrics.
In a specific implementation, the standard reference simplified sentences may be the original simple sentences of the mined sentence pairs, or reference sentences obtained by manually annotating the complex sentences in the test set. The predicted simplified sentences output by the Chinese sentence simplification model are first compared with the standard reference simplified sentences, and the model is evaluated in turn with the BLEU-4, Rouge-L, and SARI metrics. BLEU is mainly used to evaluate machine translation quality and measures it chiefly through precision. Rouge-L is a commonly used metric for machine translation and summarization; it measures quality through recall and mainly considers the longest common subsequence between a system output and a reference. SARI is designed from the perspective of simplification and measures how well the model adds, deletes, and keeps words. In the embodiment of the present application, the evaluation value of the Chinese sentence simplification model under each metric can be calculated in this manner, giving the simplification effect of the model.
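The sketch below computes the three scores; the sacrebleu and easse packages, the character-level Rouge-L implementation, and the toy data are assumptions about tooling that the text does not specify.

```python
# Sketch of Step S106's metrics: BLEU-4 via sacrebleu, a character-level
# Rouge-L F1 (a common choice for Chinese), and SARI via easse.
import sacrebleu
from easse.sari import corpus_sari

sources = ["今天一号线建成通车了"]
predictions = ["一号线今天通车了"]
references = [["一号线今天通车了"]]  # one or more references per source

bleu4 = sacrebleu.corpus_bleu(
    predictions, [[refs[0] for refs in references]], tokenize="zh"
).score

def rouge_l_f1(pred: str, ref: str) -> float:
    # Longest common subsequence at the character level.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if p == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall) if lcs else 0.0

rouge_l = rouge_l_f1(predictions[0], references[0][0])

sari = corpus_sari(orig_sents=sources, sys_sents=predictions,
                   refs_sents=[[refs[0] for refs in references]])
```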
Furthermore, after the simplification effect of the Chinese sentence simplification model has been evaluated, the Chinese sentences to be simplified are simplified through the model when the simplification effect is greater than a preset threshold. The preset threshold indicates that the simplification effect of the model meets the minimum requirement for sentence simplification. It can be understood that, when the simplification effect is greater than the preset threshold, the Chinese sentence simplification model is applicable to the Chinese sentence simplification task: the task is actually carried out through the model, and the Chinese sentences to be simplified are input into it for simplification.
As an example, minimum score thresholds for the BLEU-4, Rouge-L, and SARI metrics are preset according to the application scenario of the simplified sentences: when the method is applied to text summarization, each preset threshold is set higher; when it is applied to a human-computer interaction system, each preset threshold is set slightly lower; and so on. When the metric values of the current Chinese sentence simplification model are evaluated, each value is compared with its corresponding preset threshold, and if every value is greater than its threshold, the Chinese sentence simplification task in the corresponding scenario is executed through the current model.
In this way, the present application not only controls the simplification result in terms of length, lexical complexity, syntactic complexity, and so on, but also improves the fidelity of the target simplified sentences.
In summary, the Chinese sentence simplification method based on contrastive learning according to the embodiments of the present application treats the Chinese sentence simplification task as an end-to-end conditional generation task: semantically similar complex sentence-simple sentence pairs are first mined with an unsupervised method, the edit signals and lexical-syntactic complexity signals between the sentence pairs are then used as supervision signals, and an encoder-decoder pre-training model is fine-tuned, so that users can control the generated simplified sentences according to different requirements. In addition, the method introduces a contrastive learning loss on top of the cross-entropy loss function to increase the distance between positive and negative samples, thereby improving the fidelity of the generated simplified sentences. The method thus makes the generated results controllable according to different requirements and improves the fidelity of the target simplified sentences.
To implement the above embodiments, the application further provides a Chinese sentence simplification system based on contrastive learning.
Fig. 4 is a schematic structural diagram of a Chinese sentence simplification system based on contrastive learning according to an embodiment of the present application.
As shown in Fig. 4, the Chinese sentence simplification system based on contrastive learning includes a mining module 100, a calculation module 200, a first generation module 300, a second generation module 400, a training module 500, and a third generation module 600.
The mining module 100 is configured to mine a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner.
The calculation module 200 is configured to calculate supervision signals for each complex sentence-simple sentence pair.
The first generation module 300 is configured to prepend each supervision signal, in the form of a character string, to the start position of the complex sentence in the corresponding sentence pair, generate a data set of complex sentence-simple sentence pairs with supervision signals, and divide this data set into a training set, a validation set, and a test set according to a preset ratio.
The second generation module 400 is configured to perform model pruning on a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model.
The training module 500 is configured to introduce a contrastive learning loss to fine-tune the Chinese monolingual pre-training model based on the training set and the validation set, jointly training a Chinese sentence simplification model.
The third generation module 600 is configured to input the complex sentences in the test set into the Chinese sentence simplification model, generate predicted simplified sentences through the model, evaluate the simplification effect of the model, and, when the simplification effect is greater than a preset threshold, simplify the Chinese sentences to be simplified through the Chinese sentence simplification model.
Optionally, in an embodiment of the present application, the mining module 100 is specifically configured to: acquire a large volume of Chinese sentences from a preset resource library; obtain a vector for each sentence through a language tool library, create an index, and mine a plurality of similar candidate sentences for each sentence; and conditionally filter the candidate sentences of each sentence, determine the target sentence corresponding to each sentence, and generate a plurality of semantically similar complex sentence-simple sentence pairs.
In an embodiment of the application, the supervision signals include a sentence length ratio, an edit distance ratio, a vocabulary complexity ratio, and a syntax tree depth ratio, and the calculation module 200 is specifically configured to: calculate the ratio of the length of the simple sentence to the length of the complex sentence in each sentence pair to obtain the sentence length ratio; calculate the Levenshtein distance between the complex sentence and the simple sentence, and calculate the Levenshtein distance ratio of each editing operation to obtain the edit distance ratio, wherein the editing operations include delete, insert, and replace; calculate the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain the vocabulary complexity ratio, wherein vocabulary complexity is expressed by word frequency; and obtain the syntax tree depths of the complex sentence and the simple sentence through a natural language text processing library, and calculate the ratio of the syntax tree depth of the complex sentence to that of the simple sentence to obtain the syntax tree depth ratio.
Optionally, in an embodiment of the present application, the first generation module 300 is specifically configured to: sequentially add the supervision signals of each sentence pair before the start position of the corresponding complex sentence, with each signal's ratio placed after its name.
Optionally, in an embodiment of the present application, the second generation module 400 is specifically configured to: select punctuation marks, digits, English letters, and high-frequency Chinese words commonly used in Chinese sentences as a new vocabulary; replace the original vocabulary of the multilingual pre-training model with the new vocabulary, and update the input and output vector representation parameters of the model so as to update the multilingual pre-training model; and save the new vocabulary and the updated pre-training model.
Optionally, in an embodiment of the present application, the third generation module 600 is specifically configured to: compare the predicted simplified sentences with standard reference simplified sentences; and evaluate the simplification effect of the Chinese sentence simplification model through a plurality of preset evaluation metrics, including the BLEU-4, Rouge-L, and SARI metrics.
In summary, the Chinese sentence simplification system based on contrastive learning according to the embodiments of the present application treats the Chinese sentence simplification task as an end-to-end conditional generation task: semantically similar complex sentence-simple sentence pairs are first mined with an unsupervised method, the edit signals and lexical-syntactic complexity signals between the sentence pairs are then used as supervision signals, and an encoder-decoder pre-training model is fine-tuned, so that users can control the generated simplified sentences according to different requirements. In addition, the system introduces a contrastive learning loss on top of the cross-entropy loss function to increase the distance between positive and negative samples, thereby improving the fidelity of the generated simplified sentences. The system thus makes the generated results controllable according to different requirements and improves the fidelity of the target simplified sentences.
To implement the foregoing embodiments, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the contrastive-learning-based Chinese sentence simplification method according to the embodiment of the first aspect of the present application.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In the present specification, if a schematic expression of the above-described terms is employed in a plurality of embodiments or examples, it does not mean that the embodiments or examples are the same. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried out in the method of implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A Chinese sentence simplification method based on contrastive learning, characterized by comprising the following steps:
mining a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner;
calculating supervision signals for each complex sentence-simple sentence pair;
prepending each supervision signal, in the form of a character string, to the start position of the complex sentence in the corresponding sentence pair to generate a data set of complex sentence-simple sentence pairs with supervision signals, and dividing this data set into a training set, a validation set, and a test set according to a preset ratio;
performing model pruning on a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model;
introducing a contrastive learning loss to fine-tune the Chinese monolingual pre-training model based on the training set and the validation set, jointly training a Chinese sentence simplification model;
and inputting the complex sentences in the test set into the Chinese sentence simplification model, generating predicted simplified sentences through the model, evaluating the simplification effect of the model, and, when the simplification effect is greater than a preset threshold, simplifying the Chinese sentences to be simplified through the Chinese sentence simplification model.
2. The simplification method according to claim 1, wherein mining a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner comprises:
acquiring a large volume of Chinese sentences from a preset resource library;
obtaining a vector for each sentence through a language tool library, creating an index, and mining a plurality of similar candidate sentences for each sentence;
and conditionally filtering the candidate sentences of each sentence, determining the target sentence corresponding to each sentence, and generating a plurality of semantically similar complex sentence-simple sentence pairs.
3. The compaction method of claim 1 or 2, wherein the supervisory signals include a sentence length ratio, an edit distance ratio, a vocabulary complexity ratio and a syntax tree depth ratio, and wherein the calculating the supervisory signals for each of the complex-simple sentence pairs comprises:
calculating the ratio of the length of the simple sentence to the length of the complex sentence in each sentence pair to obtain the sentence length ratio;
calculating a Levenshtein distance between the complex sentence and the simple sentence, and calculating a Levenshtein distance ratio for each editing operation to obtain the edit distance ratio, wherein the editing operations comprise: deletion, insertion, and replacement;
calculating the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain the vocabulary complexity ratio, wherein the vocabulary complexity is expressed by the word frequency of the vocabulary;
and obtaining the syntax tree depths of the complex sentence and the simple sentence respectively through a natural language text processing library, and calculating the ratio of the syntax tree depth of the complex sentence to the syntax tree depth of the simple sentence to obtain the syntax tree depth ratio.
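A sketch of the four signals of claim 3, assuming the python-Levenshtein package for edit operations; the word-frequency function, the character-level segmentation, and the tree-depth callable are stand-ins for the unnamed natural language text processing library:

```python
import Levenshtein  # pip install python-Levenshtein

def supervision_signals(complex_s, simple_s, word_freq, tree_depth):
    """Compute the four supervision signals for one sentence pair.

    word_freq:  callable giving a token's corpus frequency (assumed given;
                the claim expresses complexity by word frequency).
    tree_depth: callable returning a sentence's syntax-tree depth, e.g.
                from a constituency parser (stand-in for the unnamed library).
    """
    # 1. sentence length ratio: simple / complex
    len_ratio = len(simple_s) / len(complex_s)

    # 2. per-operation edit-distance ratios from the Levenshtein alignment
    ops = Levenshtein.editops(complex_s, simple_s)
    n = max(len(complex_s), 1)
    op_ratios = {op: sum(1 for o in ops if o[0] == op) / n
                 for op in ("delete", "insert", "replace")}

    # 3. vocabulary complexity ratio: complex / simple, using average token
    #    frequency as the proxy (character-level split is a stand-in for a
    #    real word segmenter)
    avg_freq = lambda s: sum(word_freq(t) for t in s) / max(len(s), 1)
    vocab_ratio = avg_freq(complex_s) / max(avg_freq(simple_s), 1e-9)

    # 4. syntax-tree depth ratio: complex / simple
    depth_ratio = tree_depth(complex_s) / max(tree_depth(simple_s), 1)

    return {"LEN": len_ratio, "DEL": op_ratios["delete"],
            "INS": op_ratios["insert"], "REP": op_ratios["replace"],
            "VOCAB": vocab_ratio, "DEPTH": depth_ratio}
```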
4. The simplification method according to claim 1, wherein the adding each of the supervision signals, in the form of a character string, to the starting position of the complex sentence in the corresponding sentence pair comprises:
and sequentially prepending the supervision signals of each sentence pair before the starting position of the corresponding complex sentence, each supervision signal being written as its name followed by its ratio value.
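Claim 4 fixes the order (name, then ratio) but not the exact surface form of the prefix. One plausible rendering, with the token format and rounding chosen here purely for illustration:

```python
def prepend_signals(complex_s, signals):
    """Write each signal as name-then-ratio and prefix them to the complex sentence."""
    prefix = " ".join(f"{name}_{value:.2f}" for name, value in signals.items())
    return f"{prefix} {complex_s}"

# prepend_signals("这是一个结构复杂的长句", {"LEN": 0.6, "DEPTH": 1.5})
# -> "LEN_0.60 DEPTH_1.50 这是一个结构复杂的长句"
```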
5. The simplification method according to claim 1, wherein the model pruning on the preset encoder-decoder-based multilingual pre-training model comprises:
selecting punctuation marks, numbers, English letters and high-frequency Chinese words commonly used in Chinese sentences as a new vocabulary;
replacing the original vocabulary of the multilingual pre-training model with the new vocabulary, and updating the representation parameters of the input vectors and output vectors of the multilingual pre-training model, so as to update the multilingual pre-training model;
and saving the new vocabulary and the updated pre-training model.
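A minimal sketch of the vocabulary pruning of claim 5 against a Hugging Face mBART-style checkpoint with tied input/output embeddings; the checkpoint name and the construction of keep_ids are illustrative assumptions:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# illustrative checkpoint; any multilingual encoder-decoder model would do
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-cc25")

def prune_vocab(model, keep_ids):
    """Shrink the tied embedding table to the kept token ids, updating the
    input- and output-vector parameters together."""
    old = model.get_input_embeddings().weight.data
    new_emb = torch.nn.Embedding(len(keep_ids), old.size(1))
    new_emb.weight.data = old[keep_ids].clone()
    model.set_input_embeddings(new_emb)
    # same-size "resize" re-ties the output projection, updates
    # config.vocab_size, and rebuilds model-specific buffers
    # (e.g. mBART's final_logits_bias)
    model.resize_token_embeddings(len(keep_ids))
    return model

# keep_ids would collect punctuation, digits, Latin letters and
# high-frequency Chinese tokens, e.g. (hypothetical id sets):
# keep_ids = sorted(special_token_ids | ascii_ids | frequent_chinese_ids)
# prune_vocab(model, keep_ids).save_pretrained("zh-mono-pruned")
# the remapped vocabulary must be saved alongside the model
```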
6. The simplification method according to claim 1, wherein the evaluating the simplification effect of the Chinese sentence simplification model comprises:
comparing the predicted simplified sentences with standard reference simplified sentences;
and evaluating the simplification effect of the Chinese sentence simplification model through a plurality of preset evaluation indexes, wherein the plurality of evaluation indexes comprise: a BLEU-4 index, a Rouge-L index, and a SARI index.
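All three indexes of claim 6 are available in open-source packages. A sketch assuming sacrebleu for BLEU-4, rouge-score for Rouge-L, and EASSE for SARI; the package choice is an assumption, since the claim names only the metrics, and Chinese inputs would normally be pre-segmented before scoring:

```python
import sacrebleu
from rouge_score import rouge_scorer
from easse.sari import corpus_sari

def evaluate(sources, predictions, references):
    """Score predicted simplifications against single reference sentences."""
    # BLEU-4: sacrebleu's default BLEU uses up to 4-gram precision
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score

    # Rouge-L: longest-common-subsequence F1, averaged over the corpus
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = sum(scorer.score(ref, pred)["rougeL"].fmeasure
                  for ref, pred in zip(references, predictions)) / len(predictions)

    # SARI: compares each prediction against both its source and reference
    sari = corpus_sari(orig_sents=sources, sys_sents=predictions,
                       refs_sents=[references])
    return {"BLEU-4": bleu, "Rouge-L": rouge_l, "SARI": sari}
```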
7. A Chinese sentence simplification system based on contrastive learning, characterized by comprising:
a mining module, configured to mine a plurality of semantically similar complex sentence-simple sentence pairs in an unsupervised manner;
a calculation module, configured to calculate a supervision signal for each complex sentence-simple sentence pair;
a first generation module, configured to prepend each supervision signal, in the form of a character string, to the starting position of the complex sentence in the corresponding sentence pair to generate a data set of complex sentence-simple sentence pairs carrying supervision signals, and to divide the data set into a training set, a verification set, and a test set according to a preset ratio;
a second generation module, configured to perform model pruning on a preset encoder-decoder-based multilingual pre-training model to obtain a Chinese monolingual pre-training model;
a training module, configured to introduce a contrastive learning loss to fine-tune the Chinese monolingual pre-training model based on the training set and the verification set, jointly training a Chinese sentence simplification model;
and a third generation module, configured to input the complex sentences in the test set into the Chinese sentence simplification model, generate predicted simplified sentences through the model, evaluate the simplification effect of the model, and, when the simplification effect is greater than a preset threshold, simplify the Chinese sentences to be simplified through the Chinese sentence simplification model.
8. The system of claim 7, wherein the mining module is specifically configured to:
acquire a large volume of Chinese sentences from a preset resource library;
obtain a vector for each sentence through a language tool library, create an index, and mine a plurality of similar candidate sentences corresponding to each sentence;
and perform condition filtering on the candidate sentences corresponding to each sentence, determine a target sentence corresponding to each sentence, and generate a plurality of semantically similar complex sentence-simple sentence pairs.
9. The system according to claim 7 or 8, wherein the supervision signals comprise a sentence length ratio, an edit distance ratio, a vocabulary complexity ratio, and a syntax tree depth ratio, and wherein the calculation module is specifically configured to:
calculate the ratio of the length of the simple sentence to the length of the complex sentence in each sentence pair to obtain the sentence length ratio;
calculate a Levenshtein distance between the complex sentence and the simple sentence, and calculate a Levenshtein distance ratio for each editing operation to obtain the edit distance ratio, wherein the editing operations comprise: deletion, insertion, and replacement;
calculate the ratio of the vocabulary complexity of the complex sentence to the vocabulary complexity of the simple sentence to obtain the vocabulary complexity ratio, wherein the vocabulary complexity is expressed by the word frequency of the vocabulary;
and obtain the syntax tree depths of the complex sentence and the simple sentence respectively through a natural language text processing library, and calculate the ratio of the syntax tree depth of the complex sentence to the syntax tree depth of the simple sentence to obtain the syntax tree depth ratio.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the contrastive-learning-based Chinese sentence simplification method of any one of claims 1 to 6.
CN202210458189.3A 2022-04-27 2022-04-27 Chinese sentence simplification method and system based on contrast learning Pending CN114757203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210458189.3A CN114757203A (en) 2022-04-27 2022-04-27 Chinese sentence simplification method and system based on contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210458189.3A CN114757203A (en) 2022-04-27 2022-04-27 Chinese sentence simplification method and system based on contrast learning

Publications (1)

Publication Number Publication Date
CN114757203A true CN114757203A (en) 2022-07-15

Family

ID=82332805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210458189.3A Pending CN114757203A (en) 2022-04-27 2022-04-27 Chinese sentence simplification method and system based on contrast learning

Country Status (1)

Country Link
CN (1) CN114757203A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
US20210390266A1 (en) * 2020-06-16 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training models in machine translation, electronic device and storage medium
CN112287694A (en) * 2020-09-18 2021-01-29 昆明理工大学 Shared encoder-based Chinese-crossing unsupervised neural machine translation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
钱镇宇 (Qian Zhenyu): "基于BERT的句子简化方法研究与实现" [Research and Implementation of a Sentence Simplification Method Based on BERT], China Masters' Theses Full-text Database, Information Science and Technology, no. 08, 15 August 2021 (2021-08-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306601A (en) * 2023-05-17 2023-06-23 上海蜜度信息技术有限公司 Training method, error correction method, system, medium and equipment for small language error correction model
CN116306601B (en) * 2023-05-17 2023-09-08 上海蜜度信息技术有限公司 Training method, error correction method, system, medium and equipment for small language error correction model
CN117808124A (en) * 2024-02-29 2024-04-02 云南师范大学 Llama 2-based text simplification method
CN117808124B (en) * 2024-02-29 2024-05-03 云南师范大学 Llama 2-based text simplification method

Similar Documents

Publication Publication Date Title
Uc-Cetina et al. Survey on reinforcement learning for language processing
CN108091328B (en) Speech recognition error correction method and device based on artificial intelligence and readable medium
US11966703B2 (en) Generating replacement sentences for a particular sentiment
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
CN112036162B (en) Text error correction adaptation method and device, electronic equipment and storage medium
CN111079431A (en) Entity relation joint extraction method based on transfer learning
CN115795009A Cross-language question-answering system construction method and device based on a generative multilingual model
CN114757203A (en) Chinese sentence simplification method and system based on contrast learning
CN112528001B (en) Information query method and device and electronic equipment
CN110442880B (en) Translation method, device and storage medium for machine translation
US20220414463A1 (en) Automated troubleshooter
CN100429648C Automatic segmentation of texts comprising chunks without separators
CN112528605B (en) Text style processing method, device, electronic equipment and storage medium
CN113033182B (en) Text creation assisting method, device and server
CN114154487A (en) Text automatic error correction method and device, electronic equipment and storage medium
CN113157727A (en) Method, apparatus and storage medium for providing recall result
CN115455175A (en) Cross-language abstract generation method and device based on multi-language model
CN108664464B (en) Method and device for determining semantic relevance
CN112185361A (en) Speech recognition model training method and device, electronic equipment and storage medium
CN117709355B (en) Method, device and medium for improving training effect of large language model
CN117591661A (en) Question-answer data construction method and device based on large language model
CN114298031A (en) Text processing method, computer device and storage medium
CN113705207A (en) Grammar error recognition method and device
CN112287077A (en) Statement extraction method and device for combining RPA and AI for document, storage medium and electronic equipment
CN113408292A (en) Semantic recognition method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination