CN115329784B - Sentence paraphrase generation system based on a pre-trained model - Google Patents

Sentence paraphrase generation system based on a pre-trained model

Info

Publication number
CN115329784B
CN115329784B (application CN202211245822.7A)
Authority
CN
China
Prior art keywords
sentence
model
generation
module
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211245822.7A
Other languages
Chinese (zh)
Other versions
CN115329784A (en)
Inventor
谢冰
尹越
袭向明
宋伟
朱世强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202211245822.7A priority Critical patent/CN115329784B/en
Publication of CN115329784A publication Critical patent/CN115329784A/en
Application granted granted Critical
Publication of CN115329784B publication Critical patent/CN115329784B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 16/355 — Information retrieval of unstructured textual data; clustering/classification; class or cluster creation or modification
    • G06F 40/205 — Natural language analysis; parsing
    • G06F 40/247 — Lexical tools; thesauruses; synonyms
    • G06F 40/30 — Semantic analysis

Abstract

The invention discloses a sentence paraphrase generation system based on pre-trained models. The system comprises a paraphrase generation module, a fluency filtering module and a semantic filtering module. The paraphrase generation module, which produces the candidate paraphrases, consists of a translation generation module, a model generation module and a synonym replacement generation module: the translation generation module generates paraphrases by direct translation and by back-translation; the model generation module generates paraphrases both by directly training a Chinese paraphrase generation model and by indirectly producing Chinese paraphrases with an English paraphrase generation model; and the synonym replacement generation module generates paraphrases by replacing words in the original sentence with synonyms.

Description

Sentence paraphrase generation system based on a pre-trained model
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to a sentence paraphrase generation system based on pre-trained models.
Background
The paraphrase generation task is to generate sentences that have the same meaning as an original sentence but a different wording; the generated sentence is called a paraphrase of the original sentence. Paraphrase generation plays an important role in tasks such as question answering, translation and natural language generation. In a question-answering system, the questions entered by a user can be expanded by paraphrase generation, making it easier to match similar questions in the question-answering library. When training a translation model, paraphrase generation can augment the training data and the labelled data. In natural language generation, paraphrasing the generated sentences yields richer and more diverse expressions.
Current paraphrase generation methods are mainly rule-based, statistics-based and neural-network-based.
Rule-based methods rewrite the original sentence according to hand-written rules, changing its wording and structure while keeping its meaning, for example by inserting appropriate modal particles at suitable positions, changing the word order, or replacing words with synonyms. Synonym replacement can produce rich paraphrases, but for a polysemous word the intended sense depends on the context of the original sentence, and some of its synonyms will not fit that sense. Blindly replacing the original word with such a synonym changes the meaning of the generated sentence or makes it ungrammatical.
Statistics-based methods are mainly paraphrase generation methods based on statistical machine translation. Paraphrase generation can be regarded as a translation task in which the source language and the target language are the same: a statistical machine translation model translates the original sentence into a sentence with the same meaning but a different wording. With the development of deep learning, translation models trained with neural networks have surpassed statistical machine translation models, so generating paraphrases with a neural translation model is the better choice.
Deep learning also provides new ideas and methods for paraphrase generation: a pre-trained language model can be fine-tuned to generate paraphrases. A language model pre-trained on a large corpus has a strong ability to extract general features of text, so fine-tuning such a model on a paraphrase generation data set can be very effective. As deep learning develops, more and more pre-trained models are open-sourced; their strong feature extraction greatly benefits downstream tasks, making them valuable resources worth exploiting.
Natural language processing has received increasing attention with the development of artificial intelligence and deep learning, and paraphrase generation, as one of its research directions, is being studied by more and more scholars. At present there is less research on Chinese paraphrase generation than on English paraphrase generation; the latter has a longer technical accumulation and a broader research community. Open-source frameworks and models are already available that can generate high-quality English paraphrases, and these are also resources worth exploiting.
At present the paraphrase generation task lacks a good automatic evaluation method. The most convincing method is manual evaluation, but it is time-consuming, labour-intensive and costly. How to automatically evaluate the quality of generated paraphrases, in particular whether a paraphrase is fluent and whether its meaning matches the original sentence, is a problem worth further study.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a sentence paraphrase generation system based on pre-trained models.
In order to achieve this purpose, the technical scheme of the invention is as follows:
A sentence paraphrase generation system based on pre-trained models comprises a paraphrase generation module, a fluency filtering module and a semantic filtering module connected in sequence; the paraphrase generation module is used to generate paraphrases; the fluency filtering module is used to compute the fluency of each paraphrase and keep, by filtering, the paraphrases whose fluency is not below a threshold; the semantic filtering module is used to compute the semantic similarity between each paraphrase and the original sentence and keep, by filtering, the paraphrases whose semantic similarity is not below a threshold.
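The three-stage flow above can be summarised in a few lines of Python. This is a minimal sketch, not the patent's implementation: the generator, fluency and similarity functions are assumed to be supplied by the modules described below, and the default thresholds are illustrative placeholders.

from typing import Callable, List

def paraphrase_pipeline(sentence: str,
                        generators: List[Callable[[str], List[str]]],
                        fluency_fn: Callable[[str], float],
                        similarity_fn: Callable[[str, str], float],
                        flu_threshold: float = 0.5,
                        sim_threshold: float = 0.85) -> List[str]:
    # Stage 1: paraphrase generation - union of all generator outputs, de-duplicated.
    candidates = {p for gen in generators for p in gen(sentence) if p != sentence}
    # Stage 2: fluency filtering - keep paraphrases whose fluency is not below the threshold.
    fluent = [p for p in candidates if fluency_fn(p) >= flu_threshold]
    # Stage 3: semantic filtering - keep paraphrases semantically close enough to the original.
    return [p for p in fluent if similarity_fn(sentence, p) >= sim_threshold]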
Further, the paraphrase generation module comprises a translation generation module, a model generation module and a synonym replacement generation module; the translation generation module generates paraphrases in two ways, direct translation and back-translation, the model generation module generates paraphrases with language models, and the synonym replacement generation module generates paraphrases by replacing words in the original sentence with synonyms.
Furthermore, the direct-translation mode generates paraphrases by translating the input sentence into classical Chinese, Cantonese and traditional Chinese, while the back-translation mode generates paraphrases by translating the input Chinese sentence into English and then translating the English sentence back into Chinese; the back-translation mode also uses a multilingual translation model to translate the input sentence into German and then translate the German sentence back into Chinese to generate paraphrases.
Furthermore, the model generation module generates paraphrases in two ways: directly training a Chinese paraphrase generation model, and indirectly generating Chinese paraphrases with an English paraphrase generation model. In the indirect mode, the Chinese sentence is translated into English, the English sentence is fed into the English paraphrase generation model to produce English paraphrases, and the English paraphrases are translated back into Chinese to obtain Chinese paraphrases.
Further, the synonym replacement generation module comprises a word segmentation sub-module, a named entity recognition sub-module, a replaceable word filtering sub-module, a synonym lookup sub-module, a synonym filtering sub-module and a synonym replacement sub-module.
Further, the replaceable word filtering sub-module of the synonym replacement generation module treats the set of entity types consisting of work titles (nw), other proper nouns (nz), punctuation marks (w), person names (PER), location names (LOC), organization names (ORG) and time expressions (TIME) as the set of non-replaceable entity types; entities whose type belongs to this set are not subjected to synonym replacement.
Further, the synonym filtering sub-module of the synonym replacement generation module introduces a masked language model (Masked Language Model) pre-trained on a large corpus to filter synonyms, in the following steps:
S1: replace the original word in the sentence with masks of the same length as the synonym to obtain a masked sentence;
S2: feed the masked sentence into the pre-trained masked language model, compute the probability that the output at each mask position generates the corresponding character of the synonym, and take the geometric mean of these per-character probabilities as the confidence;
S3: set a confidence threshold;
S4: filter out synonyms whose confidence is below the threshold, leaving the synonyms suited to the current context.
Furthermore, the fluency filtering module computes fluency with a pre-trained masked language model, sets a fluency threshold, and keeps fluent paraphrases by filtering. Fluency is computed by masking the characters of the sentence one at a time and computing the probability that the model generates the original character at the masked position; a perplexity is computed from these probabilities, and the value obtained by subtracting the perplexity from 1 is mapped into the interval [0, 1] with an exponential function:

flu(s) = exp(1 − PPL(s)),  PPL(s) = exp(−(1/N) Σ_{i=1}^{N} log p(s_i | s with its i-th character masked))

where flu is the fluency, PPL is the perplexity, N is the sentence length, i indexes the i-th position, p is the generation probability, s is the sentence and s_i is the i-th character of the sentence.
Furthermore, the semantic filtering module encodes sentences with a Sentence-BERT model to obtain sentence vectors and judges whether two sentences have the same meaning from the cosine similarity between their vectors; a cosine threshold is set, and two sentences are considered to have the same meaning if and only if the cosine similarity between their sentence vectors is greater than or equal to the threshold.
The semantic filtering module trains an independent Sentence-BERT model, unrelated to sentence generation, to judge semantics. A data set containing strong negative samples and weak negative samples is constructed to train the model, and the triplet objective function used to train the Sentence-BERT model is improved into the following form:

loss′ = max(‖s_a − s_p‖ − min(‖s_a − s_n‖, ‖s_p − s_n‖) + ε, 0)

where loss′ is the improved loss value, s_a, s_p and s_n are the sentence vectors of sentences a, p and n respectively; sentence p is a positive example of a, i.e. p has the same meaning as a; sentence n is a negative example of a, i.e. n differs from a in meaning; ‖·‖ denotes a distance measure and ε is the set margin value.
The invention has the beneficial effects that:
(1) The system makes full use of open-source pre-trained model resources: it uses translation models to generate paraphrases by direct translation and by back-translation, uses a multilingual translation model to improve the efficiency and diversity of paraphrase generation, fine-tunes a pre-trained language model to train a Chinese paraphrase generation model, and applies an English paraphrase generation model to Chinese paraphrase generation.
(2) During synonym replacement, a set of entity types is treated as the set of non-replaceable entity types, and entities whose type belongs to this set are not replaced with synonyms; this prevents the replacement of proper nouns and key information from changing the meaning of the sentence.
(3) A masked language model (Masked Language Model) pre-trained on a large corpus is introduced to filter synonyms, keeping only synonyms suited to the current context and preventing synonyms that do not fit the context from making the sentence disfluent or changing its meaning.
(4) Fluency is computed with a pre-trained masked language model, a fluency threshold is set, and only paraphrases whose fluency is not below the threshold are kept, filtering out disfluent sentences and retaining fluent paraphrases.
(5) An independent Sentence-BERT model, unrelated to sentence generation, is trained to judge semantics, so that the generation process does not influence the evaluation result. Multiple measures ensure that high-quality paraphrases are generated, including collecting the set of non-replaceable entity types, defining the synonym confidence and the fluency computation, and improving the Sentence-BERT loss function.
Drawings
FIG. 1 is a block diagram of the system according to the invention;
FIG. 2 is a structural diagram of the paraphrase generation module;
FIG. 3 is a structural diagram of the translation generation module;
FIG. 4 is a block diagram of the direct-translation generation module;
FIG. 5 is a block diagram of the back-translation generation module;
FIG. 6 is a block diagram of the model generation module;
FIG. 7 is a structural diagram of the synonym replacement generation module;
FIG. 8 is a structural diagram of the synonym filtering module.
Detailed Description
The sentence paraphrase generation system based on pre-trained models provided by the invention is described in detail below with reference to the accompanying drawings.
The sentence paraphrase generation system based on pre-trained models of the invention comprises a paraphrase generation module, a fluency filtering module and a semantic filtering module connected in sequence. The paraphrase generation module generates candidate paraphrases; the fluency filtering module computes the fluency of each paraphrase and keeps only paraphrases whose fluency is not below a threshold; the semantic filtering module computes the semantic similarity between each paraphrase and the original sentence and keeps only paraphrases whose similarity is not below a threshold.
The paraphrase generation module comprises a translation generation module, a model generation module and a synonym replacement generation module; the translation generation module generates paraphrases by direct translation and by back-translation, the model generation module generates paraphrases with language models, and the synonym replacement generation module generates paraphrases by replacing words in the original sentence with synonyms.
The direct-translation mode generates paraphrases by translating the input sentence into classical Chinese, Cantonese and traditional Chinese, while the back-translation mode generates paraphrases by translating the input Chinese sentence into English and then translating the English sentence back into Chinese; the back-translation mode also uses a multilingual translation model to translate the input sentence into German and then translate the German sentence back into Chinese to generate paraphrases.
The model generation module generates paraphrases in two ways: directly training a Chinese paraphrase generation model, and indirectly generating Chinese paraphrases with an English paraphrase generation model. In the indirect mode, the Chinese sentence is translated into English, the English sentence is fed into the English paraphrase generation model to produce English paraphrases, and the English paraphrases are translated back into Chinese to obtain Chinese paraphrases.
The synonym replacement generation module comprises a word segmentation sub-module, a named entity recognition sub-module, a replaceable word filtering sub-module, a synonym lookup sub-module, a synonym filtering sub-module and a synonym replacement sub-module.
The replaceable word filtering sub-module of the synonym replacement generation module treats the set of entity types consisting of work titles (nw), other proper nouns (nz), punctuation marks (w), person names (PER), location names (LOC), organization names (ORG) and time expressions (TIME) as the set of non-replaceable entity types; entities whose type belongs to this set are not subjected to synonym replacement.
The synonym filtering sub-module of the synonym replacement generation module introduces a masked language model (Masked Language Model) pre-trained on a large corpus to filter synonyms, in the following steps:
S1: replace the original word in the sentence with masks of the same length as the synonym to obtain a masked sentence;
S2: feed the masked sentence into the pre-trained masked language model, compute the probability that the output at each mask position generates the corresponding character of the synonym, and take the geometric mean of these per-character probabilities as the confidence;
S3: set a confidence threshold;
S4: filter out synonyms whose confidence is below the threshold, leaving the synonyms suited to the current context.
The fluency filtering module computes fluency with the pre-trained masked language model, sets a fluency threshold, and keeps fluent paraphrases by filtering. Fluency is computed by masking the characters of the sentence one at a time and computing the probability that the model generates the original character at the masked position; a perplexity is computed from these probabilities, and the value obtained by subtracting the perplexity from 1 is mapped into the interval [0, 1] with an exponential function:

flu(s) = exp(1 − PPL(s)),  PPL(s) = exp(−(1/N) Σ_{i=1}^{N} log p(s_i | s with its i-th character masked))

where flu is the fluency, PPL is the perplexity, N is the sentence length, i indexes the i-th position, p is the generation probability, s is the sentence and s_i is the i-th character of the sentence.
The semantic filtering module encodes sentences with a Sentence-BERT model to obtain sentence vectors and judges whether two sentences have the same meaning from the cosine similarity between their vectors; a cosine threshold is set, and two sentences are considered to have the same meaning if and only if the cosine similarity between their sentence vectors is greater than or equal to the threshold.
The semantic filtering module trains an independent Sentence-BERT model, unrelated to sentence generation, to judge semantics. A data set containing strong negative samples and weak negative samples is constructed to train the model, and the triplet objective function used to train the Sentence-BERT model is improved into the following form:

loss′ = max(‖s_a − s_p‖ − min(‖s_a − s_n‖, ‖s_p − s_n‖) + ε, 0)

where loss′ is the improved loss value, s_a, s_p and s_n are the sentence vectors of sentences a, p and n respectively; sentence p is a positive example of a, i.e. p has the same meaning as a; sentence n is a negative example of a, i.e. n differs from a in meaning; ‖·‖ denotes a distance measure and ε is the set margin value.
Example 1
The pre-trained model names referred to in this Embodiment 1 are the names of models in the Hugging Face model hub. A model can be loaded by name through the Hugging Face transformers library. For example, the model Vamsi/T5_Paraphrase_Paws can be loaded as follows:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
model = AutoModelForSeq2SeqLM.from_pretrained("Vamsi/T5_Paraphrase_Paws")
When a sentence is fed into a model or a sentence is generated by a model, expressions such as "feed the masked sentence into the pre-trained masked language model", "compute the probability that the model output at the masked position generates the original character" and "feed the English sentence into the Vamsi/T5_Paraphrase_Paws model to generate English paraphrases" are used for brevity and to avoid repetition. The steps of adding the separators ([CLS], [SEP]), mapping characters to integer indices, padding to a fixed length, building the token type id list (token_type_ids) and attention mask list (attention_mask) required as model input, and mapping integer indices back to characters are omitted. These omitted steps are obvious to practitioners in the field and need not be described in further detail.
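The omitted preprocessing can be illustrated with a short, hedged sketch: a single Hugging Face tokenizer call performs all of the steps listed above. The Hub id hfl/chinese-roberta-wwm-ext is assumed here for the chinese-roberta-wwm-ext model used later in this embodiment.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")  # assumed Hub id
# One call adds [CLS]/[SEP], maps characters to integer indices, pads to a fixed
# length, and builds the token_type_ids and attention_mask lists needed as model input.
batch = tokenizer(["图中有多少人"], padding="max_length", max_length=16,
                  truncation=True, return_tensors="pt")
print(batch["input_ids"], batch["token_type_ids"], batch["attention_mask"])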
The sentence paraphrase generation system based on pre-trained models takes simplified-Chinese Mandarin sentences as input and outputs sentences with the same meaning but different wording. The implementation steps are:
1. build a module that calls the Baidu translation API to translate Mandarin Chinese into classical Chinese, Cantonese and traditional Chinese;
2. implement Chinese-to-English translation based on the Helsinki-NLP/opus-mt-zh-en model and English-to-Chinese translation based on the Helsinki-NLP/opus-mt-en-zh model;
3. implement Chinese-to-German translation based on the Helsinki-NLP/opus-mt-zh-de model, and German-to-Chinese translation into Mandarin, Cantonese and local dialects based on the Helsinki-NLP/opus-mt-de-ZH model;
4. collect sentence pairs with the same meaning by crawling and manual annotation as a Chinese paraphrase generation data set, and train a Chinese paraphrase generation model by fine-tuning the thu-coai/CDial-GPT_LCCC-large model;
5. implement English paraphrase generation based on the Vamsi/T5_Paraphrase_Paws model; combined with the translation functions of step 2, translate the Chinese sentence into English, feed the English sentence into the Vamsi/T5_Paraphrase_Paws model to generate English paraphrases, and translate the English paraphrases back into Chinese;
6. use the Baidu LAC (Lexical Analysis of Chinese) tool to segment sentences and recognize named entities, and mark entities whose type is work title (nw), other proper noun (nz), punctuation (w), person name (PER), location name (LOC), organization name (ORG) or time (TIME) as non-replaceable; mark all other words as replaceable;
7. implement synonym lookup based on the Extended Synonym Forest of the Harbin Institute of Technology Information Retrieval Laboratory, and look up synonyms of the words marked as replaceable;
8. implement the computation of the probability of generating a word at a masked position based on the chinese-roberta-wwm-ext model: feed in the masked sentence, obtain the probability of generating each character of the synonym at the mask positions, and compute the geometric mean as the confidence; set the confidence threshold to 0.0015 and filter out synonyms whose confidence is below the threshold;
9. implement replacing the original words in the sentence with their synonyms;
10. implement fluency computation based on the chinese-roberta-wwm-ext model, set a fluency threshold, and filter out paraphrases whose fluency is below the threshold;
11. using the paraphrase data collected in step 4, construct a strong negative sample and a weak negative sample for each sentence to build a new data set, and train a Sentence-BERT model on the new data set with the improved triplet objective function;
12. encode sentences with the model trained in step 11 and compute the cosine similarity between the vectors of the original sentence and each paraphrase; set the cosine threshold to 0.85 and filter out paraphrases whose cosine similarity with the original sentence is below the threshold.
As shown in FIG. 1, the sentence paraphrase generation system based on pre-trained models of the invention comprises a paraphrase generation module, a fluency filtering module and a semantic filtering module. The paraphrase generation module generates candidate paraphrases; the fluency filtering module computes the fluency of each paraphrase and keeps only paraphrases whose fluency is not below a threshold; the semantic filtering module computes the semantic similarity between each paraphrase and the original sentence and keeps only paraphrases whose similarity is not below a threshold.
The structure of the paraphrase generation module is shown in FIG. 2; it comprises a translation generation module, a model generation module and a synonym replacement generation module.
The translation generation module is shown in FIG. 3 and comprises a direct-translation generation module and a back-translation generation module. In this embodiment the input is simplified-Chinese Mandarin. Direct-translation generation produces a paraphrase with a single translation step, while back-translation generation first translates the Chinese sentence into another language and then translates it back into Chinese.
The direct-translation generation module is shown in FIG. 4; its target languages are classical Chinese, Cantonese and traditional Chinese. The input is translated into classical Chinese with the open-source model raynardj/wenyanwen-chinese-translate-to-ancient, and into Cantonese and traditional Chinese with the Baidu general translation API.
The back-translation generation module is shown in FIG. 5. It uses the Helsinki-NLP/opus-mt-zh-en model to translate Chinese into English and the Helsinki-NLP/opus-mt-en-zh model to translate the English back into Chinese to generate a paraphrase. A multilingual translation model can translate the input into several target languages, so using one for paraphrase generation yields richer paraphrases and improves efficiency. This embodiment uses the Helsinki-NLP/opus-mt-de-ZH model, a multilingual model that translates German into Mandarin, Cantonese and local dialects. To use it, the Helsinki-NLP/opus-mt-zh-de model first translates the Chinese into German, and the Helsinki-NLP/opus-mt-de-ZH model then translates the German into Mandarin, Cantonese and other Chinese varieties to generate paraphrases.
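The German-pivot back-translation described above can be sketched with the transformers library. This is an illustrative sketch rather than the patent's code; in particular, the target-language prefix token accepted by the multilingual Helsinki-NLP/opus-mt-de-ZH model is an assumption and should be checked against the model card.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def load(name):
    return AutoTokenizer.from_pretrained(name), AutoModelForSeq2SeqLM.from_pretrained(name)

zh_de_tok, zh_de_model = load("Helsinki-NLP/opus-mt-zh-de")   # Chinese -> German
de_zh_tok, de_zh_model = load("Helsinki-NLP/opus-mt-de-ZH")   # German -> Chinese varieties

def translate(text, tok, model):
    batch = tok([text], return_tensors="pt", padding=True)
    out = model.generate(**batch, num_beams=4, max_length=128)
    return tok.batch_decode(out, skip_special_tokens=True)[0]

def back_translate_via_german(sentence, target_token=">>cmn_Hans<<"):
    german = translate(sentence, zh_de_tok, zh_de_model)
    # The ">>lang<<" prefix selects the target Chinese variety of the multilingual
    # de->ZH model; the exact token names are an assumption, not taken from the patent.
    return translate(f"{target_token} {german}", de_zh_tok, de_zh_model)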
As shown in FIG. 6, the model generation module includes two methods: directly training a Chinese paraphrase generation model, and indirectly generating Chinese paraphrases with an English paraphrase generation model. The GPT module in FIG. 6 is the Chinese paraphrase generation model obtained by fine-tuning the thu-coai/CDial-GPT_LCCC-large model. thu-coai/CDial-GPT_LCCC-large is an open-source Chinese language model that, after training on the LCCC-large data set, can generate high-quality dialogue responses. Sentence pairs with the same meaning but different wording are collected by web crawling and manual annotation as a Chinese paraphrase data set, and the thu-coai/CDial-GPT_LCCC-large model is fine-tuned on this data set to obtain the Chinese paraphrase generation model.
At present English paraphrase generation has been studied more deeply than Chinese paraphrase generation, and several open-source models and frameworks can generate high-quality English paraphrases. Although these resources were built for English, they are still valuable for the Chinese task. To exploit them, the most natural idea is to translate the Chinese sentence into English, feed the English sentence into an English paraphrase generation model to obtain English paraphrases, and translate the English paraphrases back into Chinese. This embodiment selects the Vamsi/T5_Paraphrase_Paws model: the translation method of the translation generation module translates the Chinese into English, the English sentence is fed into the Vamsi/T5_Paraphrase_Paws model to generate English paraphrases, and the same translation method translates the English paraphrases into Chinese to obtain Chinese paraphrases.
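The Chinese-English-Chinese pipeline just described can be sketched as follows. The "paraphrase: ... </s>" prompt format for Vamsi/T5_Paraphrase_Paws follows the usage commonly shown for that model and is an assumption here; models are reloaded on every call only for brevity.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def seq2seq(name, text, num_return_sequences=1):
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    batch = tok([text], return_tensors="pt")
    out = model.generate(**batch, num_beams=5, max_length=64,
                         num_return_sequences=num_return_sequences)
    return tok.batch_decode(out, skip_special_tokens=True)

def chinese_paraphrase_via_english(sentence, n=3):
    english = seq2seq("Helsinki-NLP/opus-mt-zh-en", sentence)[0]            # Chinese -> English
    en_paras = seq2seq("Vamsi/T5_Paraphrase_Paws",
                       "paraphrase: " + english + " </s>",
                       num_return_sequences=n)                              # English paraphrases
    return [seq2seq("Helsinki-NLP/opus-mt-en-zh", p)[0] for p in en_paras]  # English -> Chinese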
The synonym replacement generation module is shown in FIG. 7 and comprises a word segmentation module, a named entity recognition module, a replaceable word filtering module, a synonym lookup module, a synonym filtering module and a synonym replacement module. The idea of generating paraphrases by synonym replacement is to replace words in the sentence with synonyms, so that the wording changes while the meaning stays the same. However, the meaning of a word depends on the context of the sentence in which it occurs, and in some cases synonym replacement changes that meaning. For example, in a sentence asking when "联想" (Lenovo) was listed, the word "联想" taken as a common noun ("association of ideas") has synonyms such as "感想" (reflections), "想象" (imagination), "设想" (conception) and "遐想" (reverie); but in that sentence "联想" is an organization name, and replacing it with any of these synonyms changes the meaning. To avoid this, a set of non-replaceable entity types is compiled, and entity words whose type belongs to this set are not replaced with synonyms. The sentence is segmented and named entities are recognized with the Baidu LAC (Lexical Analysis of Chinese) tool. The set consisting of the entity types work title (nw), other proper noun (nz), punctuation (w), person name (PER), location name (LOC), organization name (ORG) and time (TIME) is taken as the set of non-replaceable entity types, and entities whose type belongs to this set are not subjected to synonym replacement.
For the replaceable words, synonyms are looked up in the Extended Synonym Forest of the Harbin Institute of Technology Information Retrieval Laboratory. This resource extends the laboratory's original Tongyici Cilin and contains 77,343 words. In the text of the extended word forest, words on the same line either share the same sense or are strongly related in sense, so synonyms can be looked up from it: given a word, the lines containing that word are queried, and the other words on those lines are taken as its synonyms. For example, looking up "黄豆" (soybean) yields synonyms such as "毛豆" (green soybean) and "大豆" (soya bean).
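A small sketch of the synonym lookup is given below. The file format assumed here (one line per entry: a category code followed by space-separated words, with codes ending in "=" marking true synonym lines) follows the commonly distributed text version of the Extended Synonym Forest and is an assumption, not a detail taken from the patent.

from collections import defaultdict

def load_cilin(path="cilin_ex.txt"):
    word2rows, rows = defaultdict(set), []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue
            code, words = parts[0], parts[1:]
            if not code.endswith("="):   # "#"/"@" lines list related words, not synonyms
                continue
            rows.append(words)
            for w in words:
                word2rows[w].add(len(rows) - 1)
    return word2rows, rows

def synonyms(word, word2rows, rows):
    return {w for rid in word2rows.get(word, ()) for w in rows[rid] if w != word}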
However, for a polysemous word the intended sense depends on its context, so some synonyms looked up in the word forest may not fit the current context. For example, "几" ("several") and "或多或少" ("more or less") can both be listed as synonyms of "多少" ("how many / how much"); yet in the sentence "how many people are in the picture" the original word cannot be replaced by "或多或少", and in the sentence "he is more or less angry" it cannot be replaced by "几". The queried synonyms therefore need to be filtered so that only synonyms suited to the current context remain. A sentence remains fluent after the original word is replaced by a synonym that fits the context, whereas a synonym that does not fit makes it disfluent; from a statistical point of view, a synonym suited to the current context should have a higher probability of occurring in that context than one that is not. Based on this idea, a masked language model (Masked Language Model) pre-trained on a large corpus is introduced to filter synonyms. Specifically, the original word in the sentence is replaced by masks of the same length as the synonym, yielding a masked sentence. The masked sentence is fed into the pre-trained masked language model, the probability that the output at each mask position generates the corresponding character of the synonym is computed, and the geometric mean of these per-character probabilities is taken as the confidence:

conf(w) = (Π_{i=1}^{n} p(w_i | masked sentence))^{1/n}

where conf is the confidence, n is the mask length, p is the generation probability, i indexes the i-th mask position, w is the synonym and w_i is the i-th character of the synonym.
Once the synonym confidence is defined, a confidence threshold is set and synonyms whose confidence falls below it are filtered out, leaving the synonyms suited to the current context. Take the sentence "how many people are in the picture" as an example, with "多少" ("how many") as the original word. Replacing the original word with masks of the same length as each candidate synonym gives the masked sentences: for the one-character synonym "几", the masked sentence contains a single [MASK]; for the four-character synonym "或多或少" ("more or less"), it contains four [MASK] tokens. Each masked sentence is fed into the pre-trained masked language model (chinese-roberta-wwm-ext in this embodiment) and the probability of generating each character of the synonym at the corresponding mask position is computed, as shown in FIG. 8. The confidence of "几" is the probability of generating "几" at its single mask position, and the confidence of "或多或少" is the geometric mean of the probabilities of its four characters at the four mask positions. With the threshold conf = 0.0015, the confidence of "几" lies above the threshold and that of "或多或少" below it, so "几" is kept and "或多或少" is filtered out.
After the synonyms suited to the context are obtained, the synonym replacement sub-module replaces the original words with their synonyms to generate paraphrases. For example, for the sentence "how many people are in the picture", synonyms are found for "picture" and for "how many", and the sub-module generates six paraphrases by substituting these synonyms individually and in combination.
The paraphrase generation module thus generates paraphrases by translation generation, model generation and synonym replacement. After de-duplication the paraphrases are fed into the fluency filtering module, which filters out disfluent candidates and keeps fluent paraphrases. As in the synonym filtering module, fluency is computed with a pre-trained masked language model: the characters of the sentence are masked one at a time and the probability of generating the original character at the masked position is computed; a perplexity is computed from these probabilities, and the value obtained by subtracting the perplexity from 1 is mapped into the interval [0, 1] with an exponential function:

flu(s) = exp(1 − PPL(s)),  PPL(s) = exp(−(1/N) Σ_{i=1}^{N} log p(s_i | s with its i-th character masked))

where flu is the fluency, PPL is the perplexity, N is the sentence length, i indexes the i-th position, p is the generation probability, s is the sentence and s_i is the i-th character of the sentence.
In this embodiment the auto-encoding model chinese-roberta-wwm-ext is selected to build the fluency filtering module. Take the sentence "how many people are in the picture" as an example: the first character is replaced by [MASK], the masked sentence is fed into the model, and the probability of generating the original first character at the [MASK] position is read from the model output; then the second character is masked and the probability of regenerating it is read; and so on for every position in the sentence. The fluency is then computed from these probabilities with the formula above. The same computation is carried out for a disfluent scrambled variant of the sentence. With the fluency threshold set, the fluency of "how many people are in the picture" lies above the threshold and the sentence passes the filter, while the fluency of the scrambled variant lies below the threshold and it is filtered out.
After fluent paraphrases are obtained by filtering, it must also be ensured that each paraphrase has the same meaning as the original sentence. One approach generates the paraphrase with a generation model, obtains sentence vectors of the original sentence and the paraphrase from the same generation model, and judges whether their meanings match from the cosine similarity between the vectors. This approach does not separate the generation task from the evaluation task: the generation model produced the paraphrase from the original sentence, which already implies that the model considers their meanings the same, so using sentence vectors from the same model to judge the semantics biases the judgement towards "same meaning". To prevent the generation process from influencing the evaluation result, the semantic filtering module should train an independent model, unrelated to sentence generation, to judge semantics.
Using the Sentence-BERT architecture, a data set is constructed to train this model, starting from the data set built for training the Chinese paraphrase model. Each sentence pair in that data set is a positive sample. For each sentence, one sentence randomly sampled from the data set is paired with it as a negative sample; since most randomly paired sentences differ in both meaning and wording, such a pair is called a strong negative sample. To train the model better, for each sentence a pair that is similar in wording but different in meaning is also selected from the data set as a negative sample, called a weak negative sample. The selection works as follows: extract the tf-idf vector of every sentence, recall the 10 most similar sentences by the cosine similarity of the tf-idf vectors, and manually pick from these 10 a sentence whose meaning differs from the current sentence; if none of the 10 recalled sentences differs in meaning, continue recalling the next 10 sentences by cosine value and screen again until a sentence with a different meaning is found.
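The tf-idf recall step used to mine weak-negative candidates can be sketched as below; the character n-gram featurization is an assumption made so that no Chinese word segmenter is needed, and a human annotator still makes the final choice among the recalled candidates.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recall_weak_negative_candidates(sentences, top_k=10):
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    tfidf = vec.fit_transform(sentences)
    sims = cosine_similarity(tfidf)
    np.fill_diagonal(sims, -1.0)           # exclude each sentence itself
    return {i: [sentences[j] for j in np.argsort(-sims[i])[:top_k]]
            for i in range(len(sentences))}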
After the new data set is constructed, training of the model begins. The Sentence-BERT model can be trained with a classification objective function (Classification Objective Function), a regression objective function (Regression Objective Function) or a triplet objective function (Triplet Objective Function). The conventional triplet objective minimizes the loss

loss = max(‖s_a − s_p‖ − ‖s_a − s_n‖ + ε, 0)

where loss is the loss value, s_a, s_p and s_n are the sentence vectors of sentences a, p and n respectively; sentence p is a positive example of a, i.e. p has the same meaning as a; sentence n is a negative example of a, i.e. n differs from a in meaning; ‖·‖ denotes a distance measure and ε is the set margin value.
The margin ε set in this loss function only requires that s_p be at least ε closer to s_a than s_n is. The loss can therefore pull s_a away from s_n, but it is not sufficient to pull s_p away from s_n. For example, when the distance between s_a and s_n is 2ε and s_p lies exactly midway between s_a and s_n, the loss is 0 and the model is no longer optimized during training. If sentence p is later given as input and sentences with the same meaning as p are to be retrieved, sentences a and n are then equally close to p in the sentence vector space, so it cannot be judged from the vector distances which of a and n has the same meaning as p. But p is a positive example of a, i.e. p and a have the same meaning; they can be regarded as equivalent, and interchanging p and a should not, and must not, change the value of the loss function. To embody this idea, the triplet objective function is modified into the form

loss′ = max(‖s_a − s_p‖ − min(‖s_a − s_n‖, ‖s_p − s_n‖) + ε, 0)

where loss′ is the improved loss value and the remaining symbols have the same meaning as in the conventional triplet objective. Now, when the distance between s_a and s_n is 2ε and s_p lies midway between s_a and s_n, the loss equals ε rather than 0, so the model continues to be optimized during training. The improved expression is symmetric under interchanging sentences a and p, and it requires ‖s_a − s_p‖ to be at least ε smaller than both ‖s_a − s_n‖ and ‖s_p − s_n‖. As a result, the improved triplet objective draws sentences with the same meaning closer together in the sentence vector space and pushes sentences with different meanings further apart.
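Under the reconstruction given above, the improved objective can be written in a few lines of PyTorch. This is a sketch consistent with the described properties (symmetry in a and p, and a margin of ε against both negative distances), not a verbatim copy of the patent's formula, which appears only as an image in the original text.

import torch

def improved_triplet_loss(s_a, s_p, s_n, epsilon=1.0):
    # Require ||s_a - s_p|| to be at least epsilon smaller than BOTH ||s_a - s_n|| and ||s_p - s_n||.
    d_ap = torch.norm(s_a - s_p, dim=-1)
    d_an = torch.norm(s_a - s_n, dim=-1)
    d_pn = torch.norm(s_p - s_n, dim=-1)
    return torch.clamp(d_ap - torch.minimum(d_an, d_pn) + epsilon, min=0).mean()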
A Sentence-BERT model is obtained by training with the classification objective function, the regression objective function and the improved triplet objective function. A cosine threshold is then set, and two sentences are considered to have the same meaning when the cosine similarity between their sentence vectors is greater than or equal to the threshold. Take the original sentence "how many people are in the picture" and the candidate paraphrases "there are several people in the picture" and "how many trees are in the picture" as an example: the cosine similarity between the original sentence and the first candidate is 0.86368716, and that with the second candidate is 0.14648251. With the cosine threshold set to 0.85, the first candidate is kept as a paraphrase of the original sentence and "how many trees are in the picture" is filtered out.
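The final cosine filter can be sketched as follows, assuming a sentence-transformers-style encoder whose encode method returns a 1-D sentence vector; the 0.85 threshold is the value used in this embodiment.

import torch
import torch.nn.functional as F

def is_same_meaning(encoder, sent_a, sent_b, threshold=0.85):
    v_a = torch.as_tensor(encoder.encode(sent_a), dtype=torch.float)
    v_b = torch.as_tensor(encoder.encode(sent_b), dtype=torch.float)
    return F.cosine_similarity(v_a, v_b, dim=0).item() >= threshold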
The modules above are assembled to build the paraphrase generation system of this embodiment; examples of paraphrases generated by the system are shown in Table 1.

Table 1 Generation examples of this embodiment
The above description covers only the preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (8)

1. A sentence paraphrase generation system based on pre-trained models, characterized in that: the system comprises a paraphrase generation module, a fluency filtering module and a semantic filtering module connected in sequence; the paraphrase generation module is used to generate paraphrases; the fluency filtering module is used to compute the fluency of each paraphrase and keep, by filtering, the paraphrases whose fluency is not below a threshold; the semantic filtering module is used to compute the semantic similarity between each paraphrase and the original sentence and keep, by filtering, the paraphrases whose semantic similarity is not below a threshold;
the fluency filtering module computes fluency with a pre-trained masked language model, sets a fluency threshold and keeps fluent paraphrases by filtering; fluency is computed by masking the characters of the sentence one at a time and computing the probability that the model generates the original character at the masked position; a perplexity is computed from these probabilities, and the value obtained by subtracting the perplexity from 1 is mapped into the interval [0, 1] with an exponential function:

flu(s) = exp(1 − PPL(s)),  PPL(s) = exp(−(1/N) Σ_{i=1}^{N} log p(s_i | s with its i-th character masked))

where flu is the fluency, PPL is the perplexity, N is the sentence length, i indexes the i-th position, p is the generation probability, s is the sentence and s_i is the i-th character of the sentence;
the semantic filtering module trains an independent Sentence-BERT model, unrelated to sentence generation, to judge semantics; a data set containing strong negative samples and weak negative samples is constructed to train the model, and the triplet objective function used in training the Sentence-BERT model is improved into the following form:

loss′ = max(‖s_a − s_p‖ − min(‖s_a − s_n‖, ‖s_p − s_n‖) + ε, 0)

where loss′ is the improved loss value, s_a, s_p and s_n are the sentence vectors of sentences a, p and n respectively; sentence p is a positive example of a, i.e. p has the same meaning as a; sentence n is a negative example of a, i.e. n differs from a in meaning; ‖·‖ denotes a distance measure and ε is the set margin value.
2. The sentence paraphrase generation system based on pre-trained models according to claim 1, characterized in that: the paraphrase generation module comprises a translation generation module, a model generation module and a synonym replacement generation module; the translation generation module generates paraphrases by direct translation and by back-translation, the model generation module generates paraphrases with language models, and the synonym replacement generation module generates paraphrases by replacing words in the original sentence with synonyms.
3. The sentence paraphrase generation system based on pre-trained models according to claim 2, characterized in that: the direct-translation mode generates paraphrases by translating the input sentence into Chinese, and the back-translation mode uses a multilingual translation model to translate the input Chinese sentence into a foreign-language sentence and then translates the foreign-language sentence back into Chinese to generate paraphrases.
4. The sentence paraphrase generation system based on pre-trained models according to claim 2, characterized in that: the model generation module generates paraphrases in two ways, directly training a Chinese paraphrase generation model and indirectly generating Chinese paraphrases with an English paraphrase generation model; in the indirect mode, the Chinese sentence is translated into English, the English sentence is fed into the English paraphrase generation model to produce English paraphrases, and the English paraphrases are translated back into Chinese to obtain Chinese paraphrases.
5. The sentence paraphrase generation system based on pre-trained models according to claim 2, characterized in that: the synonym replacement generation module comprises a word segmentation sub-module, a named entity recognition sub-module, a replaceable word filtering sub-module, a synonym lookup sub-module, a synonym filtering sub-module and a synonym replacement sub-module.
6. The sentence paraphrase generation system based on pre-trained models according to claim 5, characterized in that: the replaceable word filtering sub-module treats the set consisting of work titles, other proper nouns, punctuation marks, person names, location names, organization names and time expressions as the set of non-replaceable entity types, and entities whose type belongs to this set are not replaced with synonyms.
7. The sentence paraphrase generation system based on pre-trained models according to claim 5, characterized in that: the synonym filtering sub-module introduces a masked language model pre-trained on a large corpus to filter synonyms, in the following steps:
S1: replace the original word in the sentence with masks of the same length as the synonym to obtain a masked sentence;
S2: feed the masked sentence into the pre-trained masked language model, compute the probability that the output at each mask position generates the corresponding character of the synonym, and take the geometric mean of these per-character probabilities as the confidence;
S3: set a confidence threshold;
S4: filter out synonyms whose confidence is below the threshold, leaving the synonyms suited to the current context.
8. The sentence paraphrase generation system based on pre-trained models according to claim 1, characterized in that: the semantic filtering module encodes sentences with the Sentence-BERT model to obtain sentence vectors and judges whether two sentences have the same meaning from the cosine similarity between their sentence vectors; a cosine threshold is set, and the meanings are considered the same if and only if the cosine similarity between the sentence vectors is greater than or equal to the threshold.
CN202211245822.7A 2022-10-12 2022-10-12 Sentence paraphrase generation system based on a pre-trained model Active CN115329784B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211245822.7A CN115329784B (en) Sentence paraphrase generation system based on a pre-trained model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211245822.7A CN115329784B (en) Sentence paraphrase generation system based on a pre-trained model

Publications (2)

Publication Number Publication Date
CN115329784A CN115329784A (en) 2022-11-11
CN115329784B true CN115329784B (en) 2023-04-07

Family

ID=83915134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211245822.7A Active CN115329784B (en) Sentence paraphrase generation system based on a pre-trained model

Country Status (1)

Country Link
CN (1) CN115329784B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102455786A (en) * 2010-10-25 2012-05-16 三星电子(中国)研发中心 System and method for optimizing Chinese sentence input method
CN109558597A (en) * 2018-12-17 2019-04-02 北京百度网讯科技有限公司 Text interpretation method and device, equipment and storage medium
CN113590786A (en) * 2021-07-28 2021-11-02 平安科技(深圳)有限公司 Data prediction method, device, equipment and storage medium
CN114417825A (en) * 2022-01-19 2022-04-29 上海一者信息科技有限公司 English synonym recommendation method fusing dictionary and context information

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070043553A1 (en) * 2005-08-16 2007-02-22 Microsoft Corporation Machine translation models incorporating filtered training data
CN102650987A (en) * 2011-02-25 2012-08-29 北京百度网讯科技有限公司 Machine translation method and device both based on source language repeat resource
CN110555203B (en) * 2018-05-31 2023-05-30 北京百度网讯科技有限公司 Text replication method, device, server and storage medium
CN110543639B (en) * 2019-09-12 2023-06-02 扬州大学 English sentence simplification algorithm based on pre-training transducer language model
CN111027331B (en) * 2019-12-05 2022-04-05 百度在线网络技术(北京)有限公司 Method and apparatus for evaluating translation quality
CN111814451A (en) * 2020-05-21 2020-10-23 北京嘀嘀无限科技发展有限公司 Text processing method, device, equipment and storage medium
CN113553824A (en) * 2021-07-07 2021-10-26 临沂中科好孕智能技术有限公司 Sentence vector model training method
CN113971394A (en) * 2021-10-26 2022-01-25 上海交通大学 Text repeat rewriting system
CN114416984A (en) * 2022-01-12 2022-04-29 平安科技(深圳)有限公司 Text classification method, device and equipment based on artificial intelligence and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102455786A (en) * 2010-10-25 2012-05-16 三星电子(中国)研发中心 System and method for optimizing Chinese sentence input method
CN109558597A (en) * 2018-12-17 2019-04-02 北京百度网讯科技有限公司 Text interpretation method and device, equipment and storage medium
CN113590786A (en) * 2021-07-28 2021-11-02 平安科技(深圳)有限公司 Data prediction method, device, equipment and storage medium
CN114417825A (en) * 2022-01-19 2022-04-29 上海一者信息科技有限公司 English synonym recommendation method fusing dictionary and context information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Semi-supervised Neural Machine Translation with Data Selection Based on Sentence-level BLEU; Ye Shaolin et al.; Pattern Recognition and Artificial Intelligence; 2017-10-15 (No. 10); full text *

Also Published As

Publication number Publication date
CN115329784A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
WO2022057116A1 (en) Transformer deep learning model-based method for translating multilingual place name root into chinese
Cotterell et al. Labeled morphological segmentation with semi-markov models
CN110209818B (en) Semantic sensitive word and sentence oriented analysis method
Bertaglia et al. Exploring word embeddings for unsupervised textual user-generated content normalization
Patel et al. ES2ISL: an advancement in speech to sign language translation using 3D avatar animator
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
Sun et al. VCWE: visual character-enhanced word embeddings
Tawfik et al. Morphology-aware word-segmentation in dialectal Arabic adaptation of neural machine translation
Liu Research on the development of computer intelligent proofreading system based on the perspective of English translation application
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN108491399A (en) Chinese to English machine translation method based on context iterative analysis
CN113408307B (en) Neural machine translation method based on translation template
Zhu Deep learning for Chinese language sentiment extraction and analysis
CN115329784B (en) Sentence paraphrase generation system based on a pre-trained model
CN109960782A (en) A kind of Tibetan language segmenting method and device based on deep neural network
Li et al. New word discovery algorithm based on n-gram for multi-word internal solidification degree and frequency
Jafar Tafreshi et al. A novel approach to conditional random field-based named entity recognition using Persian specific features
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
CN112085985A (en) Automatic student answer scoring method for English examination translation questions
CN110569510A (en) method for identifying named entity of user request data
Satpathy et al. Analysis of Learning Approaches for Machine Translation Systems
Romero et al. Towards text simplification in spanish: A brief overview of deep learning approaches for text simplification
Nathani et al. Part of Speech Tagging for a Resource Poor Language: Sindhi in Devanagari Script using HMM and CRF
Seresangtakul et al. Thai-Isarn dialect parallel corpus construction for machine translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant