CN115809658A - Parallel corpus generation method and device and unsupervised synonymous transcription method and device - Google Patents

Parallel corpus generation method and device and unsupervised synonymous transcription method and device

Info

Publication number
CN115809658A
CN115809658A (application number CN202211497311.4A)
Authority
CN
China
Prior art keywords
transcription
corpus
synonymous
transcribed
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211497311.4A
Other languages
Chinese (zh)
Inventor
李涓子
刘金鑫
齐济
曹书林
侯磊
张鹏
唐杰
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202211497311.4A
Publication of CN115809658A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a parallel corpus generation method and device and an unsupervised synonymous transcription method and device. The parallel corpus generation method comprises the following steps: obtaining a corpus to be transcribed and the context of the corpus to be transcribed; obtaining a keyword set based on the corpus to be transcribed; inputting the keyword set and the context of the corpus to be transcribed into a pre-trained language model to obtain at least one candidate synonymous transcription corpus output by the model; and evaluating each candidate synonymous transcription corpus and determining a target synonymous transcription corpus based on the evaluation result. The unsupervised synonymous transcription method comprises the following steps: obtaining a sentence to be transcribed; and inputting the sentence to be transcribed into a synonymous transcription model to obtain a synonymous transcription sentence output by the model, wherein the synonymous transcription model is trained on parallel corpus pairs. The embodiment of the invention can obtain excellent synonymous transcription sentences.

Description

Parallel corpus generation method and device and unsupervised synonymous transcription method and device
Technical Field
The invention relates to the field of computer technology, and in particular to a parallel corpus generation method and device and an unsupervised synonymous transcription method and device.
Background
Synonymous transcription (paraphrasing) refers to expressing sentences of the same meaning in different forms. Since the early days of computational linguistics research, automatic paraphrase generation has been a basic task of natural language processing, with wide application in downstream tasks including question answering, semantic parsing, and machine translation. In addition, synonymous transcription generation is an important data augmentation method that can benefit learning in low-resource settings.
Current synonymous transcription methods fall into two categories: editing and optimizing on the basis of the corpus to be transcribed with a pre-trained language model, and generating directly with a pre-trained language model. However, editing and optimizing on the basis of the corpus to be transcribed with a pre-trained language model usually changes only a few common words locally, which hinders diversity, while methods that generate directly with a pre-trained language model typically suffer from some semantic bias.
Therefore, current synonymous transcription methods cannot achieve both the semantic consistency and the sentence diversity of synonymous transcription.
Disclosure of Invention
The invention provides a parallel corpus generation method and device and an unsupervised synonymous transcription method and device, which are used to overcome the defect in the prior art that the semantic consistency and sentence diversity of synonymous transcription cannot both be achieved.
In a first aspect, the present invention provides a method for generating parallel corpora, comprising:
obtaining a corpus to be transcribed and the context of the corpus to be transcribed;
obtaining a keyword set based on the corpus to be transcribed;
inputting the keyword set and the context of the corpus to be transcribed into a pre-trained language model to obtain at least one candidate synonymous transcription corpus output by the pre-trained language model;
and evaluating each candidate synonymous transcription corpus and determining a target synonymous transcription corpus based on the evaluation result, wherein the evaluation at least comprises diversity evaluation.
Optionally, the obtaining of the keyword set based on the corpus to be transcribed includes:
extracting keywords from the corpus to be transcribed to obtain at least one initial keyword;
filtering the initial keywords to obtain filtered keywords;
obtaining, based on the filtered keywords, synonymous keywords in one-to-one correspondence with the filtered keywords;
replacing some or all of the filtered keywords with synonymous keywords based on a preset replacement ratio to obtain an initial keyword set;
and rearranging the order of the keywords in the initial keyword set to obtain the keyword set.
Optionally, the obtaining, based on the filtered keywords, of synonymous keywords in one-to-one correspondence with the filtered keywords includes:
inputting the corpus to be transcribed into a pre-trained language model, taking the position of a single filtered keyword as the prediction object, and performing mask prediction on the corpus to be transcribed through the pre-trained language model to obtain candidate synonymous keywords output by the pre-trained language model, wherein each filtered keyword corresponds to at least one candidate synonymous keyword;
and determining, among the candidate synonymous keywords, the synonymous keyword corresponding to each filtered keyword.
Optionally, the inputting of the keyword set and the context of the corpus to be transcribed into a pre-trained language model to obtain at least one candidate synonymous transcription corpus output by the pre-trained language model includes:
inputting the keyword set and the context of the corpus to be transcribed into a pre-trained language model, wherein the pre-trained language model predicts the words between the keywords under the constraint of the context of the corpus to be transcribed to obtain at least one candidate synonymous transcription corpus.
Optionally, the evaluating of each candidate synonymous transcription corpus comprises:
calculating the semantic similarity between each candidate synonymous transcription corpus and the corpus to be transcribed to obtain the semantic score of each candidate synonymous transcription corpus;
calculating the generation probability of each candidate synonymous transcription corpus to obtain the fluency score of each candidate synonymous transcription corpus;
calculating the Jaccard similarity between each candidate synonymous transcription corpus and the corpus to be transcribed to obtain the diversity score of each candidate synonymous transcription corpus;
and weighting the semantic score, the fluency score, and the diversity score to obtain the evaluation result.
In a second aspect, the present invention provides an unsupervised synonymous transcription method, comprising:
obtaining a sentence to be transcribed;
inputting the sentence to be transcribed into a synonymous transcription model to obtain a synonymous transcription sentence output by the synonymous transcription model;
wherein the synonymous transcription model is trained on parallel corpus pairs, the parallel corpus pairs being obtained by the parallel corpus generation method of the first aspect.
In a third aspect, the present invention provides a device for generating parallel corpora, comprising:
a first obtaining unit, configured to obtain a corpus to be transcribed and the context of the corpus to be transcribed; a first processing unit, configured to obtain a keyword set based on the corpus to be transcribed;
a first transcription unit, configured to input the keyword set and the context of the corpus to be transcribed into a pre-trained language model and obtain at least one candidate synonymous transcription corpus output by the pre-trained language model;
and a first evaluation unit, configured to evaluate each candidate synonymous transcription corpus and determine a target synonymous transcription corpus based on the evaluation result, wherein the evaluation at least comprises diversity evaluation.
In a fourth aspect, the present invention provides an unsupervised synonymous transcription device, comprising:
a second obtaining unit, configured to obtain a sentence to be transcribed;
a second transcription unit, configured to input the sentence to be transcribed into a synonymous transcription model and obtain a synonymous transcription sentence output by the synonymous transcription model;
wherein the synonymous transcription model is trained on parallel corpus pairs, the parallel corpus pairs being obtained by the parallel corpus generation method of the first aspect.
In a fifth aspect, the present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the parallel corpus generation method of the first aspect and the unsupervised synonymous transcription method of the second aspect.
In a sixth aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the parallel corpus generation method of the first aspect and the unsupervised synonymous transcription method of the second aspect.
According to the parallel corpus generation method and device and the unsupervised synonymous transcription method and device provided by the invention, keywords of the corpus to be transcribed are used in parallel corpus generation, so the candidate synonymous transcription corpora generated from those keywords can remain semantically consistent with the corpus to be transcribed. The embodiment of the invention also constrains the generated candidates with the context of the corpus to be transcribed, making full use of the information the context contains so that the generated candidates fit the linguistic scene formed by that context and thus stay further consistent in semantics with the corpus to be transcribed. In addition, the embodiment of the invention evaluates the diversity of the candidate synonymous transcription corpora and can select, from several candidates, the one with the best diversity as the target synonymous transcription corpus; by taking both semantic consistency and expression diversity into account, the parallel corpora generated by the provided method and device yield better synonymous transcription sentences. The unsupervised synonymous transcription method and device provided by the invention, by training on the parallel corpora generated by the provided parallel corpus generation method and device, can learn the semantic consistency and expression diversity in the parallel corpora and realize unsupervised synonymous transcription without manually annotated corpora.
Drawings
In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of a method for generating parallel corpora according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of an unsupervised synonymous transcription method according to an embodiment of the present invention;
FIG. 3 is a second flowchart of the unsupervised synonymous transcription method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for generating parallel corpora according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an unsupervised synonymous transcription apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of protection of the present invention.
The technical terms related to the invention are described as follows:
(1) Synonymous transcription generation: for any given natural-language sentence, understand and capture the deep semantic information in the sentence, change the manner of linguistic expression, and generate sentences that are semantically consistent with the input text but diverse in expression. For example, an input sentence stating that the amusement park is at location A may be re-expressed with a different word order and wording while keeping the same meaning.
(2) The "generate-train-apply" paradigm: first, a large number of pseudo-parallel sentence pairs are generated using a large-scale unlabeled corpus and a pre-trained language model such as T5; then a sequence-to-sequence (Seq2Seq) language model is trained on the pseudo-parallel sentence pairs, yielding better initialization parameters for the synonymous transcription task. Finally, the Seq2Seq model is applied to multiple task scenarios, including unsupervised and supervised synonymous transcription, and data generation and augmentation for downstream natural language tasks.
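The paradigm can be sketched in a few lines. The following minimal Python example assumes the Hugging Face transformers library; the model name, training settings, and toy data are illustrative assumptions, not details prescribed by the patent:

    # Minimal sketch of the "generate-train-apply" paradigm (illustrative).
    import torch
    from transformers import (DataCollatorForSeq2Seq, T5ForConditionalGeneration,
                              T5Tokenizer, Trainer, TrainingArguments)

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # 1) "Generate": pseudo-parallel pairs; a toy list stands in for the
    #    unsupervised pipeline described in this document.
    pseudo_pairs = [
        ("The meeting starts at eight.", "At eight o'clock the meeting begins."),
    ]

    class PairDataset(torch.utils.data.Dataset):
        def __init__(self, pairs):
            self.items = []
            for src, tgt in pairs:
                enc = tokenizer(src, truncation=True, max_length=128)
                enc["labels"] = tokenizer(tgt, truncation=True,
                                          max_length=128)["input_ids"]
                self.items.append(enc)
        def __len__(self):
            return len(self.items)
        def __getitem__(self, i):
            return self.items[i]

    # 2) "Train": fine-tune the Seq2Seq model on the pseudo pairs.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="paraphrase_model", num_train_epochs=1),
        train_dataset=PairDataset(pseudo_pairs),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

    # 3) "Apply": paraphrase a new sentence with the trained model.
    ids = tokenizer("The project is almost finished.",
                    return_tensors="pt").input_ids
    print(tokenizer.decode(model.generate(ids, max_length=64)[0],
                           skip_special_tokens=True))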
(3) Keyword extraction and replacement: to ensure that the generated transcription sentence contains the key information of the corpus to be transcribed, and to serve as a generation constraint, keyword extraction techniques based on corpus statistics such as word co-occurrence and word frequency are first applied to extract some keywords (or phrases) from the corpus to be transcribed. Meanwhile, to increase the diversity of the key information, the extracted keywords are appropriately replaced.
(4) Pre-trained language model: a language model with good initialization parameters obtained by training on a large corpus with certain basic tasks; it can be applied to a wide range of downstream natural language understanding and generation tasks. Different pre-trained language models adopt different basic training tasks during pre-training and are strong at different downstream tasks. For example, GPT-2 completes an autoregressive pre-training task before being applied to downstream tasks and is better at weakly constrained generation, while BERT adopts a mask-prediction task in the pre-training stage and is better at understanding-type language tasks.
The related art generally uses the following methods for synonymous transcription. Rule-based and thesaurus-based methods generate paraphrases mainly by direct, explicit manipulation of words, phrases, or sentences; these methods often do not perform well and are limited by expensive manual annotation or by the scale of linguistic resources. Later, the sequence-to-sequence (Seq2Seq) paradigm was introduced into paraphrase generation. This paradigm greatly improves generation performance by training on large parallel corpora, combined with newer model structures such as GANs or VAEs. However, these supervised methods depend heavily on large amounts of annotated data, which is expensive and difficult to obtain at high quality.
Among the above approaches, one class of methods mainly uses a pre-trained language model to edit and optimize on the basis of the corpus to be transcribed, applying optimization methods (simulated annealing, reinforcement learning, etc.) to achieve synonymous transcription. However, because such methods mainly edit the corpus to be transcribed, they usually change only a few common words locally and ignore global expression factors (such as expression order), which hinders diversity. On the other hand, methods that generate directly with a pre-trained language model usually lack strong semantic constraints and cannot control the model's generation well, so some semantic deviation inevitably occurs.
The quality of a generated synonymous transcription can be evaluated mainly along two dimensions: semantic consistency with the corpus to be transcribed and expression diversity. A good synonymous transcription sentence should use expression as rich as possible while remaining semantically consistent with the corpus to be transcribed. Although existing techniques using pre-trained language models achieve some good results, they struggle to balance these two aspects.
The following describes a method for generating parallel corpora according to an embodiment of the present invention with reference to fig. 1.
Fig. 1 is a schematic flow diagram of a method for generating parallel corpora according to an embodiment of the present invention; as shown in fig. 1, the embodiment of the present invention provides a method for generating parallel corpora, comprising:
Step 110, obtaining a corpus to be transcribed and the context of the corpus to be transcribed;
specifically, the linguistic data to be transcribed is a sentence needing to be transcribed; the context of the corpus to be transcribed refers to the preceding and following sentences of the corpus to be transcribed, and can be one or more preceding sentences and one or more following sentences.
Step 120, obtaining a keyword set based on the corpus to be transcribed;
specifically, a keyword of the corpus to be transcribed is obtained, and it should be understood that the keyword may be a word or a phrase composed of words. The keywords can be original words in the corpus to be transcribed or synonyms.
Step 130, inputting the keyword set and the context of the corpus to be transcribed into a pre-trained language model, and obtaining at least one candidate synonymous transcription corpus output by the pre-trained language model;
Specifically, a Pre-trained Language Model (PLM) is a language model that obtains a good language representation through pre-training and is then applied to specific downstream natural language processing tasks; the pre-trained language model adopted in the embodiment of the invention is one that can generate a sentence from a keyword set and the context of the corpus to be transcribed.
Illustratively, given the keyword set "meeting, studio A, 8 o'clock" together with a preceding sentence and a following sentence, the pre-trained language model may output the sentence "The meeting will be held in studio A at 8 o'clock", which is a candidate synonymous transcription corpus.
The pre-trained language model may output a plurality of sentences as candidate synonymous transcription corpora.
And step 140, evaluating each candidate synonymous transcription corpus, and determining a target synonymous transcription corpus based on the evaluation result, wherein the evaluation at least comprises diversity evaluation.
In one possible implementation, the difference between a candidate synonymous transcription corpus and the corpus to be transcribed can be evaluated by calculating the Euclidean distance between them, so that the candidate is similar to, yet different from, the corpus to be transcribed, ensuring both semantic consistency and diversity of sentence expression.
It should be understood that the above is an example for ease of understanding; the invention does not limit the specific evaluation method, which may include various evaluation criteria such as semantic consistency evaluation, diversity evaluation, and fluency evaluation, nor does the invention limit the calculation method for each criterion.
The parallel corpus generation method provided by the embodiment of the invention uses the keywords of the corpus to be transcribed, so the candidate synonymous transcription corpora generated from those keywords can remain semantically consistent with the corpus to be transcribed. The embodiment also constrains the generated candidates with the context of the corpus to be transcribed, making full use of the information the context contains so that the candidates fit the linguistic scene it forms and stay further consistent in semantics with the corpus to be transcribed. In addition, the embodiment evaluates the diversity of the candidates and can select, from several candidates, the one with the best diversity as the target synonymous transcription corpus, obtaining excellent parallel corpora that take both semantic consistency and expression diversity into account.
In the following, possible implementations of the above steps are further described with specific embodiments.
Optionally, the obtaining of the keyword set based on the corpus to be transcribed includes:
Extracting keywords from the corpus to be transcribed to obtain at least one initial keyword:
specifically, the keyword extraction may extract key information in the corpus to be transcribed. The keyword extraction may adopt a keyword extraction technique in the related art, such as a Rake algorithm.
Alternatively, the nouns and verbs in the corpus to be transcribed can be used as the initial keywords by Part-of-Speech (POS) to form a set { k } i }。
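A minimal sketch of this extraction step, assuming NLTK (the library choice and tag filter are illustrative; a RAKE implementation such as the rake-nltk package could be used instead):

    # Sketch of initial keyword extraction via POS tagging (illustrative).
    # Requires the NLTK 'punkt' and 'averaged_perceptron_tagger' data.
    import nltk

    def initial_keywords(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # Keep nouns (NN*) and verbs (VB*) as the initial keyword set {k_i}.
        return [w for w, tag in tagged
                if tag.startswith("NN") or tag.startswith("VB")]

    print(initial_keywords("The final stage is nearing completion."))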
For filtering the initial keywords to obtain the filtered keywords:
Specifically, to balance the increase in computational complexity and the decrease in diversity caused by too many keywords in {k_i}, the embodiment of the invention adopts keyword filtering to remove the keywords of low information content, i.e., the less important keywords in the corpus to be transcribed, and obtain the filtered keywords.
Optionally, p(S|k_i) can be calculated to measure the information content of a keyword, where p(·|·) denotes the conditional generation probability of the pre-trained language model. The idea is that if an initial keyword can well restore and generate the corpus S to be transcribed, it is a representative keyword of high information content in the corpus to be transcribed.
Optionally, the filtering ratio may be 20%-30%, i.e., 20%-30% of the initial keywords are filtered out.
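A possible sketch of this criterion, assuming a T5 model accessed through transformers is used to estimate p(S|k_i) from its loss; the prompt format and everything not stated above are illustrative assumptions:

    # Sketch of keyword filtering by conditional generation probability.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model.eval()

    def keyword_score(sentence, keyword):
        # The model loss is the mean negative log-likelihood of S given k_i,
        # so a lower loss means a higher p(S | k_i).
        inputs = tokenizer(keyword, return_tensors="pt")
        labels = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(**inputs, labels=labels).loss
        return -loss.item()

    sentence = "The final stage is nearing completion."
    keywords = ["stage", "completion", "is"]
    ranked = sorted(keywords, key=lambda k: keyword_score(sentence, k),
                    reverse=True)
    kept = ranked[: max(1, int(len(ranked) * 0.7))]  # drop the lowest ~30%
    print(kept)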
For obtaining, based on the filtered keywords, synonymous keywords in one-to-one correspondence with them:
Specifically, synonymous keywords corresponding one-to-one to the filtered keywords are obtained based on the filtered keywords; that is, keyword replacement is performed on the filtered keywords to obtain the synonymous keywords.
In one possible implementation, the filtered keywords may be input into a pre-trained language model for synonymous transformation, and the synonymous keywords output by the model are obtained.
Optionally, the obtaining, based on the filtered keywords, of synonymous keywords in one-to-one correspondence with them includes:
inputting the corpus to be transcribed into a pre-trained language model, taking the position of a single filtered keyword as the prediction object, and performing mask prediction on the corpus to be transcribed with the pre-trained language model to obtain candidate synonymous keywords output by the model, wherein each filtered keyword corresponds to at least one candidate synonymous keyword;
and determining, among the candidate synonymous keywords, the synonymous keyword corresponding to each filtered keyword.
In one possible implementation, the keyword to be replaced (i.e., the filtered keyword) is covered with a special [MASK] symbol in the corpus to be transcribed, and the masked corpus is then predicted with a pre-trained language model based on mask prediction (such as BERT or T5); the model completes the covered part and outputs it, and the output completion is the synonymous keyword.
In one possible implementation, a beam search strategy may be employed for predicting the prediction object. Optionally, the embodiment of the invention takes the prediction result of the second beam as a possible replacement for the original word (i.e., the filtered keyword serving as the prediction object). It should be understood that the second beam is the beam ranked second in the output; since beam search outputs are sorted by probability, it can also be understood as the beam with the second-highest probability.
Of course, to prevent a word with the opposite meaning from happening to appear in the replacement (e.g., small → big), the embodiment of the invention also screens the replacement keywords against a synonym dictionary such as WordNet to prevent some commonly occurring errors.
Specifically, the screening applies to the adjectives, adverbs, and verbs contained in the keywords: if a word output by the pre-trained language model is not in the WordNet synonym set of the original word, the word is not used to replace the original word.
By way of example: if the extracted keyword is the phrase "small sofa" and the replacement predicted by the pre-trained language model is "large sofa", then under this rule "large" is not in the synonym set of the adjective "small", so the replacement is rejected and the original keyword is kept unchanged.
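A sketch of this replacement for single-word keywords, assuming transformers' fill-mask pipeline and NLTK's WordNet corpus, and approximating the "second beam" with the second-ranked fill-mask candidate:

    # Sketch of synonym replacement via mask prediction with a WordNet check.
    from nltk.corpus import wordnet
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")

    def wordnet_synonyms(word):
        return {lemma.name().replace("_", " ")
                for syn in wordnet.synsets(word) for lemma in syn.lemmas()}

    def replace_keyword(sentence, keyword):
        masked = sentence.replace(keyword, fill.tokenizer.mask_token, 1)
        candidate = fill(masked, top_k=2)[1]["token_str"].strip()  # 2nd rank
        # Reject the candidate if it is not a WordNet synonym of the original,
        # preventing antonym swaps such as "small" -> "big".
        return candidate if candidate in wordnet_synonyms(keyword) else keyword

    print(replace_keyword("She sat on the small sofa.", "small"))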
For replacing, based on a preset replacement ratio, some or all of the filtered keywords with synonymous keywords to obtain the initial keyword set:
Illustratively, the filtered keywords are {A1, B1, C1} and the synonymous keywords are {A2, B2, C2}; it should be understood that A1 corresponds to A2, B1 to B2, and C1 to C2. With a preset replacement ratio of 1/3, one filtered keyword in {A1, B1, C1} is replaced. The replacement can follow a preset rule (e.g., replacing the i-th, (i+2)-th, and (i+5)-th keywords) or be performed randomly. The initial keyword set after replacement may be {A1, B2, C1}.
It should be understood that the replacement ratio is adjustable.
For rearranging the order of the keywords in the initial keyword set to obtain the keyword set:
Specifically, the keywords after replacement may be rearranged randomly to obtain the keyword set.
Illustratively, the initial keyword set may be {A1, B2, C1}, and random rearrangement yields {B2, C1, A1}.
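A minimal sketch of the replacement-and-rearrangement step under the ratio described above (names and the random policy are illustrative):

    # Sketch of proportional replacement and random reordering (illustrative).
    import random

    def build_keyword_set(filtered, synonyms, ratio=1 / 3):
        # 'synonyms' maps each filtered keyword to its synonymous keyword.
        kws = list(filtered)
        n_replace = max(1, int(len(kws) * ratio))
        for i in random.sample(range(len(kws)), n_replace):
            kws[i] = synonyms.get(kws[i], kws[i])
        random.shuffle(kws)  # rearrange order to encourage diverse phrasing
        return kws

    print(build_keyword_set(["A1", "B1", "C1"],
                            {"A1": "A2", "B1": "B2", "C1": "C2"}))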
According to the parallel corpus generation method provided by the embodiment of the invention, keyword extraction preserves the key information of the corpus to be transcribed; keyword filtering balances the increase in computational complexity and the decrease in diversity caused by too many keywords; and keyword replacement and rearrangement improve the diversity of the sentences.
Optionally, the inputting of the keyword set and the context of the corpus to be transcribed into a pre-trained language model to obtain at least one candidate synonymous transcription corpus output by the model includes:
inputting the keyword set and the context of the corpus to be transcribed into a pre-trained language model, wherein the model predicts the words between the keywords under the constraint of the context of the corpus to be transcribed to obtain at least one candidate synonymous transcription corpus. It should be understood that the keywords here are the keywords in the keyword set.
Specifically, the pre-trained language model may be one capable of span prediction, such as a bidirectional pre-trained language model of the T5 series. Span prediction covers a segment of the corpus to be transcribed, usually several words, and the model predicts the covered segment in an autoregressive manner.
In one possible implementation, the keywords k_i in the keyword set serve as anchors: special covered symbols (span-mask tokens) are inserted between the k_i and predicted using the T5 pre-training task; the prediction results complete the sentence and connect the keywords, generating the candidate synonymous transcription corpora.
The constraint of the context of the corpus to be transcribed means that the context is inserted before and after the input keyword set, further reducing the generation space of the output (i.e., the candidate synonymous transcription corpora).
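A sketch of this anchored span infilling, assuming T5 through transformers, with T5 sentinel tokens standing in for the span-mask tokens; the prompt layout, decoding settings, and stitching logic are illustrative assumptions:

    # Sketch of candidate generation with keywords as anchors (illustrative).
    import re
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    def generate_candidates(keywords, left_context, right_context, n=4):
        # Sentinels before, between, and after the anchors; context on both
        # sides constrains the generation space.
        body = " ".join(f"<extra_id_{i}> {kw}" for i, kw in enumerate(keywords))
        prompt = (f"{left_context} {body} "
                  f"<extra_id_{len(keywords)}> {right_context}")
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        outs = model.generate(ids, do_sample=True, top_p=0.95,
                              num_return_sequences=n, max_length=64)
        candidates = []
        for o in outs:
            text = tokenizer.decode(o, skip_special_tokens=False)
            # T5 emits "<extra_id_0> span0 <extra_id_1> span1 ..."; stitch the
            # predicted spans back between the anchor keywords.
            spans = [s.replace("</s>", "").replace("<pad>", "").strip()
                     for s in re.split(r"<extra_id_\d+>", text)][1:]
            parts = []
            for i, kw in enumerate(keywords):
                parts.append(spans[i] if i < len(spans) else "")
                parts.append(kw)
            parts.append(spans[len(keywords)]
                         if len(spans) > len(keywords) else "")
            candidates.append(" ".join(p for p in parts if p))
        return candidates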
The parallel corpus generation method provided by the embodiment of the invention exploits the pre-trained language model's strength in language modeling and its ability to produce fluent, meaningful sentences, generating fluent and complete candidate synonymous transcription corpora. Using the keywords as anchors ensures that the extracted keywords all appear in the output candidates, improving semantic consistency between the candidates and the corpus to be transcribed. In addition, the context constraint of the corpus to be transcribed further ensures the semantic consistency between the candidate synonymous transcription corpora and the corpus to be transcribed.
Optionally, the evaluating of each candidate synonymous transcription corpus comprises:
calculating the semantic similarity between each candidate synonymous transcription corpus and the corpus to be transcribed to obtain the semantic score of each candidate;
calculating the generation probability of each candidate synonymous transcription corpus to obtain the fluency score of each candidate;
calculating the Jaccard similarity between each candidate synonymous transcription corpus and the corpus to be transcribed to obtain the diversity score of each candidate;
and weighting the semantic score, the fluency score, and the diversity score to obtain the evaluation result.
Specifically, for calculating the semantic similarity between each candidate synonymous transcription corpus and the corpus to be transcribed to obtain the semantic score of each candidate:
In one possible implementation, the automatic semantic similarity method BERTScore may be used to calculate the semantic similarity between each candidate synonymous transcription corpus and the corpus to be transcribed, giving the semantic score of each candidate.
It should be understood that the above is an example for ease of understanding; the invention does not limit the method for calculating semantic similarity.
For calculating the generation probability of each candidate synonymous transcription corpus to obtain the fluency score of each candidate:
In one possible implementation, the generation probability of the entire sentence can be taken as the fluency score S_flu:
S_flu = p(w_1) · p(w_2|w_1) · … · p(w_n|w_{n-1}…w_1)
where S_flu denotes the fluency score, w_n denotes the n-th word, and p(w_n|w_{n-1}…w_1) denotes the probability of generating w_n given w_{n-1}…w_1. These are conditional probabilities: to compute the probability of the sentence, the probability of each word in the sentence is multiplied. For example, p(w_1) denotes the probability that the first word is w_1; p(w_2|w_1) denotes the probability that the second word is w_2 given that the first word is w_1; and so on, p(w_n|w_{n-1}…w_1) denotes the probability that the n-th word is w_n given that the first n-1 words are w_{n-1}…w_1.
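A sketch of this computation in log space, assuming GPT-2 through transformers as an illustrative scoring model (the text above does not prescribe a particular model for the fluency score):

    # Sketch of the fluency score as a sentence log-probability (illustrative).
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2")
    lm.eval()

    def fluency_score(sentence):
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # The LM loss is the mean negative log p(w_t | w_<t) over the
            # predicted tokens; undoing the mean gives the total
            # log-probability, i.e. the log of the product S_flu above.
            loss = lm(ids, labels=ids).loss
        return -loss.item() * (ids.size(1) - 1)

    print(fluency_score("She reported that the final stage was nearing "
                        "completion."))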
For calculating the Jaccard similarity between each candidate synonymous transcription corpus and the corpus to be transcribed to obtain the diversity score of each candidate:
In one possible implementation, the score can be computed from both word choice and word order, using the Jaccard distance to calculate S_div:
S_div = β_1 · (1 − |S_1 ∩ S_2| / |S_1 ∪ S_2|) + β_2 · (1/|S|) · Σ_{w∈S} |p_{S_1}(w) − p_{S_2}(w)|
where S_div denotes the diversity score, S_1 denotes the set of words in the corpus to be transcribed, S_2 denotes the set of words in the candidate synonymous transcription corpus, β_1 and β_2 are hyperparameters weighting the word component and the word-order component respectively (their values may be set empirically), w denotes a word, S denotes the set of all words in the corpus to be transcribed and the candidate synonymous transcription corpus, p_{S_1}(w) denotes the position of the word in the corpus to be transcribed, and p_{S_2}(w) denotes the position of the word in the candidate synonymous transcription corpus.
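A minimal sketch of such a score, combining word-level Jaccard distance with a word-order difference over shared words; the weights, whitespace tokenization, and normalization are illustrative assumptions:

    # Sketch of the diversity score from word choice and word order.
    def diversity_score(src, cand, beta1=0.5, beta2=0.5):
        w1, w2 = src.lower().split(), cand.lower().split()
        s1, s2 = set(w1), set(w2)
        jaccard_distance = 1.0 - len(s1 & s2) / len(s1 | s2)
        shared = s1 & s2
        if shared:
            # Average difference of the relative positions of shared words.
            order = sum(abs(w1.index(w) / len(w1) - w2.index(w) / len(w2))
                        for w in shared) / len(shared)
        else:
            order = 1.0
        return beta1 * jaccard_distance + beta2 * order

    print(diversity_score("the final stage is nearing completion",
                          "completion of the final stage is near"))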
For weighting the semantic score, the fluency score, and the diversity score to obtain the evaluation result:
Specifically, the final evaluation result is a weighted combination of the three components:
s_final(S_1, S_2) = λ_1 · s_sem(S_1, S_2) + λ_2 · s_flu(S_1, S_2) + λ_3 · s_div(S_1, S_2)
where s_final denotes the evaluation result, S_1 denotes the corpus to be transcribed, S_2 denotes the candidate synonymous transcription corpus, λ_1 denotes the semantic score weight coefficient, s_sem the semantic score, λ_2 the fluency score weight coefficient, s_flu the fluency score, λ_3 the diversity score weight coefficient, and s_div the diversity score.
It should be understood that, in the above evaluation, each candidate synonymous transcription corpus is evaluated in turn.
Optionally, the candidate synonymous transcription corpus with the highest evaluation score is selected as the target synonymous transcription corpus.
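A minimal sketch of the weighted ranking (the λ values are illustrative hyperparameters):

    # Sketch of candidate ranking by weighted score (illustrative weights).
    def final_score(sem, flu, div, l1=0.4, l2=0.3, l3=0.3):
        return l1 * sem + l2 * flu + l3 * div

    def pick_target(candidates, scores):
        # 'scores' holds one (semantic, fluency, diversity) tuple per candidate.
        best = max(zip(candidates, scores), key=lambda cs: final_score(*cs[1]))
        return best[0]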
According to the parallel corpus generation method provided by the embodiment of the invention, the quality of the candidate synonymous transcription corpora is evaluated in three respects (semantic consistency, fluency, and diversity), and the target synonymous transcription corpus is determined from the candidates, ensuring that the target performs well in all three respects.
Optionally, the parallel corpus generation method of the above embodiments is an unsupervised synonymous transcription generation framework based on a pre-trained language model, emphasizing generation quality in terms of semantic consistency and expression diversity. Meanwhile, to make synonymous transcription better serve common natural language tasks, the parallel corpus generation method provided by the invention can serve as a universal data generation or data augmentation module for various downstream tasks; for example, a transcription model can be trained on the corpora it generates, improving downstream task performance.
Fig. 2 is a schematic flow diagram of the unsupervised synonymous transcription method provided by an embodiment of the present invention; as shown in fig. 2, the unsupervised synonymous transcription method provided by the embodiment of the invention includes:
Step 210, obtaining a sentence to be transcribed;
Step 220, inputting the sentence to be transcribed into a synonymous transcription model to obtain the synonymous transcription sentence output by the synonymous transcription model;
wherein the synonymous transcription model is trained on parallel corpus pairs, the parallel corpus pairs being obtained by the parallel corpus generation method according to any of the above embodiments.
Specifically, the synonymous transcription model may be a pre-trained language model, such as a Seq2Seq language model. Based on the parallel corpus generation method provided in the above embodiments, a target synonymous transcription corpus corresponding to each corpus to be transcribed is obtained, and each corpus to be transcribed together with its corresponding target synonymous transcription corpus is used as a parallel corpus pair.
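A sketch of applying such a trained model to steps 210-220, assuming a fine-tuned Seq2Seq checkpoint saved under an illustrative path:

    # Sketch of inference with a trained synonymous transcription model.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("paraphrase_model")       # assumed path
    mdl = T5ForConditionalGeneration.from_pretrained("paraphrase_model")

    def paraphrase(sentence):
        ids = tok(sentence, return_tensors="pt").input_ids
        out = mdl.generate(ids, max_length=64, num_beams=4)
        return tok.decode(out[0], skip_special_tokens=True)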
The synonymous transcription model provided by the embodiment of the invention can be trained on a large number of parallel corpus pairs, so that it learns the semantic consistency and expression diversity of the pairs generated by the parallel corpus generation method and produces synonymous transcription sentences with both properties. In addition, since no supervision signal is used in generating the parallel corpus pairs, the synonymous transcription model of the embodiment is an unsupervised transcription model: it requires no large-scale manual corpus annotation, reducing the cost of synonymous transcription.
The invention will now be described with reference to one embodiment:
the purpose of the embodiment of the invention is mainly the following two aspects:
one of the two aspects is considered to be indispensable for qualified synonymous transcription by the embodiment of the invention, so that the embodiment of the invention aims to simultaneously strengthen the two aspects on the basis of previous work to obtain more excellent transcription performance.
Secondly, the method before the embodiment of the invention discovers that the synonymy transcription rarely and truly plays a role in the downstream task, so the embodiment of the invention can provide a more universal synonymy transcription interface and can be used as a data generator or a data enhancer to help the downstream task to obtain better performance in a general sense.
The method can be mainly divided into two parts, wherein the first part is to use a pre-training language model and a single language data set without labels to construct a data set of pseudo-parallel synonym sentence pairs with higher quality in an unsupervised mode, namely the data set comprises a plurality of parallel corpus pairs, and the second part is to perform corresponding training on the data set constructed by the embodiment of the invention, so that the method can be directly used for generating the synonym transcription or applying the synonym transcription to downstream tasks.
First, the first part is described below, i.e. the parallel corpus pair dataset is constructed in an unsupervised manner
The essence of the parallel corpus to data set is to "distill" related monitor signals of synonymous transcription from the language knowledge of the corpus and the pre-trained language model, i.e. the original sentence in the corpus is used as the corpus to be transcribed, and the target synonymous transcription corpus corresponding to the corpus to be transcribed is used as the corresponding monitor signal (also called label). The construction of the parallel corpus pair in the embodiment of the present invention is based on two basic assumptions: firstly, the target synonymy transcription corpus necessarily contains some key information of the corpus to be transcribed, for example, some key words in the corpus to be transcribed; secondly, the sequence of the key information can be exchanged, and then a sentence which is smooth and consistent in semantics can be formed again through proper rewriting and connection.
Based on the above two assumptions, the generation framework proposed by the embodiment of the invention proceeds in the following three steps:
1) Keyword processing: extract keywords from the input corpus to be transcribed, and perform appropriate keyword filtering, replacement, and rearrangement to obtain a rearranged, replaced keyword set;
2) Generation based on the pre-trained language model: from the keyword set of the previous step, combined with the prompt and constraint of the context of the corpus to be transcribed, complete the keywords into a full, fluent sentence with the pre-trained language model, taking the generated sentence as a candidate synonymous transcription corpus of the corpus to be transcribed;
3) Candidate evaluation and ranking: finally, use a suitable automatic evaluation method to select the most appropriate sentence among the generated candidates as the final target synonymous transcription corpus; the target synonymous transcription corpus and the corpus to be transcribed form a parallel corpus pair in the data set.
In the generation process, to guarantee the semantic consistency and expression diversity of the parallel corpus pairs, the embodiment of the invention designs multi-angle semantic consistency constraints and a multi-granularity diversified expression mechanism.
The consistency constraints:
E1. Keyword constraint: the keywords mined from the corpus to be transcribed serve as anchors in the output sentence;
E2. Context constraint: the context of the corpus to be transcribed is added during generation by the pre-trained language model to reduce the output generation space;
E3. Semantic constraint: in the final evaluation stage, semantic consistency evaluation forms part of the final score.
The diversified expression mechanism operates at three levels:
D1. Word level: extracted keywords are replaced by the pre-trained language model with corresponding synonymous keywords;
D2. Phrase level: the order of the keywords is rearranged when the keyword set is produced, possibly breaking the original phrase collocations;
D3. Sentence level: in the final evaluation stage, the difference between the candidate synonymous transcription corpus and the corpus to be transcribed forms part of the final score.
Fig. 3 is a second schematic flow chart of the synonymous transcription method according to the embodiment of the present invention; the above three steps are described in detail below with reference to fig. 3.
Keyword processing: the term "keyword" in the embodiment of the invention does not refer only to single words; it also covers phrases that the extraction algorithm may extract.
Keyword extraction is the first step of the whole pipeline; its purpose is a coarse-grained selection of the key information in the corpus to be transcribed, embodying E1. Optionally, the RAKE algorithm may be employed.
Input corpus S to be transcribed (original sentence): "The final stage is nearing completion," the controller reported crisply.
The initial keywords {k_i} are obtained through the RAKE algorithm and POS tagging.
Optionally, to avoid losing key information in the corpus to be transcribed, POS tagging (part-of-speech analysis) may be employed to take all nouns and verbs in the corpus to be transcribed as the initial keywords {k_i}.
Initial keywords {k_i} obtained: "final stage", "completion", "reported crisply", "controller".
Then, to balance the increase in computational complexity and the decrease in diversity caused by too many keywords in {k_i}, the embodiment of the invention adopts keyword filtering to remove keywords of low information content, i.e., the less important keywords in the corpus to be transcribed. To measure the information content of a keyword, p(S|k_i) is calculated, where p(·|·) denotes the conditional generation probability of the pre-trained language model; if a keyword k_i can well restore and generate the corpus S to be transcribed, then k_i is a representative, high-information keyword of the corpus. In practice, roughly 20%-30% of the keywords are filtered out in this step.
The initial keywords are filtered according to the conditional generation probability, removing those whose probability does not meet the preset condition, and obtaining the filtered keywords: "final stage", "completion", "reported crisply".
The preset condition may be that the conditional generation probability is not lower than a preset threshold, i.e., keywords whose probability falls below the threshold are removed.
Keyword replacement and rearrangement are also an important part of the method; their purpose is to increase the diversity of the output, embodying D1 and D2 of the design concept. For replacement, the keyword to be replaced is covered with a special [MASK] symbol in the corpus to be transcribed, and the masked sentence is then predicted with a mask-prediction pre-trained language model such as BERT or T5, which outputs a completion of the covered part; the prediction usually adopts a beam search strategy. Of course, to prevent a word of opposite meaning from happening to appear in the replacement (e.g., small → big), the embodiment of the invention also uses a synonym dictionary such as WordNet with some judgment rules to prevent common errors. Within the keyword set, the replacement ratio is also an adjustable parameter.
Taking "final stage" as an example, its position is covered, and the PLM predicts the covered span of the corpus S to be transcribed (original sentence): "The () is nearing completion," the controller reported crisply, predicting the content of (). Three prediction beams may be obtained: the first beam "final stage", the second beam "final stage of the project", and the third beam "last stage"; the second beam is adopted.
Finally, the replaced keywords are randomly rearranged before proceeding to the generation step.
The filtered keywords are replaced with synonymous keywords based on the preset replacement ratio (1/3 in this embodiment), replacing only one keyword, to obtain the initial keyword set: "final stage of the project", "completion", "reported crisply"; the order of the keywords in the initial keyword set is rearranged to obtain the keyword set: "reported crisply", "completion", "final stage of the project".
Generating based on the pre-trained language model: pre-trained language models are very powerful at language modeling and good at generating fluent, meaningful sentences. In this step, the embodiment of the invention uses a T5-series bidirectional pre-trained language model; its original pre-training task is span prediction, i.e., a small segment of the corpus to be transcribed (usually several words) is masked out and the model predicts the masked segment in an autoregressive manner.
In the synonymous transcription generation task of the embodiment of the invention, in order for the extracted keywords to appear in the output in full, the keywords {k_i} are used as anchors, and span-mask tokens to be predicted are inserted between the k_i; by exploiting the T5 pre-training task, the model completes the sentence and links the keywords into a fluent, complete sentence.
In this step, the original context of the corpus to be transcribed is also inserted before and after the input, further reducing the output generation space. This step embodies E1 and E2 of the design concept as a whole.
The keyword set and the context of the corpus to be transcribed are input into the pre-trained language model; the context in the embodiment of the invention is divided into the preceding text and the following text, respectively:
Preceding text: "… he turned his attention to … as she approached."
Following text: "For the briefest of moments … her, …"
The pre-trained language model predicts from the keyword set and the context of the corpus to be transcribed and outputs at least one candidate synonymous transcription corpus, such as:
“She reported crisply that she was nearing completion of the final stage of the project.”
Evaluation: in this step, the invention considers the quality of the candidate synonymous transcription corpora from three aspects: semantic consistency, fluency, and diversity.
Semantic score: the semantic similarity between each candidate synonymous transcription corpus and the corpus to be transcribed can be calculated with the automatic semantic similarity method BERTScore. The method obtains a word vector (embedding) for each word from the pre-trained language model BERT, then derives the semantic similarity of the whole sentence from the cosine similarity of the word vectors in Euclidean space; this similarity serves as the semantic score S_sem.
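A sketch assuming the third-party bert-score package, using its F1 component as the semantic score (an assumption; the text above does not fix which component is used):

    # Sketch of the semantic score via BERTScore (illustrative).
    from bert_score import score

    def semantic_score(src, cand):
        _, _, f1 = score([cand], [src], lang="en", verbose=False)
        return f1.item()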
Fluency score: the generation probability of the entire sentence can be calculated as the fluency score S_flu:
S_flu = p(w_1) · p(w_2|w_1) · … · p(w_n|w_{n-1}…w_1)
Diversity score: the diversity score is intended to encourage the generation of more diverse sentences, so it is computed from both word usage and word order, using the Jaccard distance to calculate S_div as defined above, where S denotes the set of all words in the corpus to be transcribed and the candidate synonymous transcription corpus, and p_S(w) denotes the position of a word in sentence S.
The final evaluation result is a weighted combination of the three components:
s_final(S_1, S_2) = λ_1 · s_sem(S_1, S_2) + λ_2 · s_flu(S_1, S_2) + λ_3 · s_div(S_1, S_2)
The candidate synonymous transcription corpus with the highest s_final is taken as the target synonymous transcription corpus.
In this embodiment, the target synonymous transcription corpus S' obtained is: "She reported crisply that she was nearing completion of the final stage of the project."
The corpus S to be transcribed and the target synonymous transcription corpus S' form a pair of parallel corpora.
(II) The second part is described below: performing the corresponding training on the constructed data set, for synonymous transcription generation or application to downstream tasks.
With the parallel corpus generation method provided by the above embodiment (which may be called ParaMac), an original sentence (corpus to be transcribed) can be synonymously transcribed into a corresponding synonymous transcription corpus.
Thus, an appropriate corpus can be selected to generate the parallel corpus pair data set (called ParaNet).
In an embodiment of the present invention, because the generation process needs the context of the input corpus to be transcribed, the embodiment of the present invention selects the corpus BookCorpus of a long text as the generation input corpus of the embodiment of the present invention. Since the T5 model uses BookCorpus as one of its pre-training corpuses, in order not to let the pre-training language model used by the synonymous transcription model see already the input sentences in the pre-training phase, the embodiment of the present invention uses a newly grabbed version (supplied by Shawn press in 9.2020). After comparison with the original BookCorpus and removal of the repeated parts, 3551 books remained. The genres of these books include novel, non-novel, prose, poem, drama and movie script, ranging up to 100 topics such as romance, science fiction, fantasy, thriller and apprehension.
From this subset of BookCorpus, the embodiment of the present invention randomly draws 10k instances. Each input instance includes: 1) a complete sentence S as the corpus to be transcribed, between 60 and 100 characters in length; 2) the contexts before and after S, each with an average length of 250 characters. From these 10k instances, the parallel corpus generation method (ParaMac) provided by the embodiment of the present invention is used to generate the parallel corpus pair dataset ParaNet in an unsupervised manner. Then, on the basis of ParaNet, the embodiment of the present invention can train a Seq2Seq language model (called the synonymous transcription model ParaMod), which can generate a synonymous transcription sentence for any given sentence.
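By way of illustration only, a ParaNet-style dataset of (source, paraphrase) pairs could be used to fine-tune T5 into a ParaMod-like model roughly as follows; the sample pair, the task prefix and all hyperparameters are assumptions rather than the patent's actual training recipe:

```python
# A minimal sketch of fine-tuning T5 on ParaNet-style pairs to obtain a
# ParaMod-like Seq2Seq paraphraser. Pairs, prefix and hyperparameters
# are illustrative assumptions.
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration, T5TokenizerFast)

pairs = [{"src": "She reported crisply that she was nearing completion "
                 "of the final stage of the project.",
          "tgt": "She said briskly that the last phase of the project "
                 "was almost finished."}]  # stand-in for the 10k ParaNet pairs

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def encode(example):
    enc = tokenizer("paraphrase: " + example["src"],
                    truncation=True, max_length=128)
    enc["labels"] = tokenizer(example["tgt"],
                              truncation=True, max_length=128).input_ids
    return enc

train_ds = Dataset.from_list(pairs).map(encode, remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="paramod",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Inference: given any sentence, generate a synonymous transcription.
ids = tokenizer("paraphrase: The meeting was moved to Friday.",
                return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_length=64, num_beams=4)[0],
                       skip_special_tokens=True))
```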
When ParaMod is used directly for the synonymous transcription task, it amounts to an unsupervised transcription model, because no supervision signal is used in the generation of ParaNet. It can also be evaluated as a supervised synonymous transcription model after few-shot learning on downstream data, or used directly for data augmentation of downstream tasks, increasing the number of training samples and improving the generalization ability of the model.
The synonymous transcription method provided by the embodiment of the present invention offers an unsupervised synonymous transcription generation framework based on a pre-training language model, emphasizing the quality of the generated transcriptions in terms of semantic consistency and expression diversity. Meanwhile, so that synonymous transcription can better serve general natural language tasks, the transcription model provided by the invention can be used as a general-purpose data generation or data augmentation module on various downstream tasks, thereby improving their performance.
On the synonymous transcription task itself, the method provided by the invention not only achieves better results than previous methods on the commonly used Quora and MSCOCO datasets under the unsupervised setting (exceeding the previous best methods by 9.1% and 3.3% in BLEU, respectively), but under the supervised setting it matches the previous best supervised model using only 500 training samples, and greatly exceeds the previous best performance when using 10k samples. Meanwhile, in two downstream applications, the model provided by the invention can serve as a data generator in a knowledge question-answering task, replacing heavy manual annotation to a certain extent by generating more naturally phrased questions from template questions; and in GLUE evaluation, used as a data augmenter for various downstream tasks (including sentiment analysis, natural language inference, etc.), it helps achieve an average performance gain of 2.0%.
The following describes the synonymous transcription devices provided by the present invention; the devices described below and the methods described above may be referred to in correspondence with each other.
Fig. 4 is a schematic structural diagram of a parallel corpus generation device provided by an embodiment of the present invention. As shown in Fig. 4, the parallel corpus generation device provided by the embodiment of the present invention includes:
a first obtaining unit 410, configured to obtain a corpus to be transcribed and a context of the corpus to be transcribed;
A first processing unit 420, configured to obtain a keyword set based on the corpus to be transcribed;
a first transcription unit 430, configured to input the keyword set and the context of the corpus to be transcribed into a pre-training language model, and obtain at least one candidate synonymy transcription corpus output by the pre-training language model;
the first evaluation unit 440 is configured to evaluate each candidate synonymous transcription corpus, and determine a target synonymous transcription corpus based on an evaluation result, where the evaluation at least includes diversity evaluation.
Optionally, the first processing unit 420 is configured to perform keyword extraction on the corpus to be transcribed, and obtain at least one initial keyword;
the first processing unit 420 is configured to filter the initial keyword to obtain a filtered keyword;
the first processing unit 420 is configured to obtain, based on the filtering keyword, a synonymous keyword corresponding to the filtering keyword one to one;
the first processing unit 420 is configured to replace part or all of the filtering keywords with synonymous keywords based on a preset replacement ratio, so as to obtain an initial keyword set;
the first processing unit 420 is configured to rearrange the order of the keywords in the initial keyword set to obtain the keyword set.
Optionally, the first processing unit 420 is configured to input the corpus to be transcribed into a pre-training language model, perform mask prediction on the corpus to be transcribed through the pre-training language model by using a position of a single filtering keyword as a prediction object, and obtain candidate synonymous keywords output by the pre-training language model, where each filtering keyword corresponds to at least one candidate synonymous keyword;
the first processing unit 420 is configured to determine, from the candidate synonymous keywords, a synonymous keyword corresponding to each filtering keyword.
Optionally, the first transcription unit 430 is configured to input the keyword set and the context of the corpus to be transcribed into a pre-training language model, where the pre-training language model predicts words and phrases among the keywords under the constraint of the context of the corpus to be transcribed, and obtains at least one candidate synonymous transcription corpus.
Optionally, the first evaluation unit 440 is configured to calculate semantic similarity between each candidate synonymous transcription corpus and the corpus to be transcribed, and obtain a semantic score corresponding to each candidate synonymous transcription corpus;
the first evaluation unit 440 is configured to calculate a generation probability of each candidate synonymous transcription corpus, and obtain a fluency score corresponding to each candidate synonymous transcription corpus;
the first evaluation unit 440 is configured to calculate the Jaccard similarity between each candidate synonymous transcription corpus and the corpus to be transcribed, and obtain a diversity score corresponding to each candidate synonymous transcription corpus;
and weighting the semantic score, the fluency score and the diversity score to obtain an evaluation result.
It should be noted that the apparatus provided in the embodiment of the present invention can implement all the method steps implemented by the method embodiment, and can achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as the method embodiment in this embodiment are not repeated herein.
Fig. 5 is a schematic structural diagram of an unsupervised synonymous transcription device provided in an embodiment of the present invention. As shown in Fig. 5, the unsupervised synonymous transcription device provided in the embodiment of the present invention includes:
a second obtaining unit 510, configured to obtain a sentence to be transcribed;
a second transcription unit 520, configured to input the sentence to be transcribed to the synonymous transcription model, and obtain a synonymous transcription sentence output by the synonymous transcription model;
the synonymy transcription model is obtained by training based on a parallel corpus pair, and the parallel corpus pair is obtained by the parallel corpus generation method according to any one of the embodiments.
It should be noted that the apparatus provided in the embodiment of the present invention can implement all the method steps implemented by the method embodiment, and can achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as the method embodiment in this embodiment are not repeated herein.
Fig. 6 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 6, the electronic device may include: a processor (processor) 610, a communication interface (Communications Interface) 620, a memory (memory) 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform the parallel corpus generation method or the unsupervised synonymous transcription method. The parallel corpus generation method comprises: obtaining a corpus to be transcribed and a context of the corpus to be transcribed; obtaining a keyword set based on the corpus to be transcribed; inputting the keyword set and the context of the corpus to be transcribed into a pre-training language model to obtain at least one candidate synonymous transcription corpus output by the pre-training language model; and evaluating each candidate synonymous transcription corpus and determining a target synonymous transcription corpus based on an evaluation result, wherein the evaluation at least comprises diversity evaluation. The unsupervised synonymous transcription method comprises: obtaining a sentence to be transcribed; and inputting the sentence to be transcribed into a synonymous transcription model to obtain a synonymous transcription sentence output by the synonymous transcription model, wherein the synonymous transcription model is obtained by training based on parallel corpus pairs, and the parallel corpus pairs are obtained by the parallel corpus generation method according to any one of the embodiments.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention further provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer can execute the parallel corpus generation method or the unsupervised synonymous transcription method provided by the above methods. The parallel corpus generation method comprises: obtaining a corpus to be transcribed and a context of the corpus to be transcribed; obtaining a keyword set based on the corpus to be transcribed; inputting the keyword set and the context of the corpus to be transcribed into a pre-training language model to obtain at least one candidate synonymous transcription corpus output by the pre-training language model; and evaluating each candidate synonymous transcription corpus and determining a target synonymous transcription corpus based on an evaluation result, wherein the evaluation at least comprises diversity evaluation. The unsupervised synonymous transcription method comprises: obtaining a sentence to be transcribed; and inputting the sentence to be transcribed into a synonymous transcription model to obtain a synonymous transcription sentence output by the synonymous transcription model, wherein the synonymous transcription model is obtained by training based on parallel corpus pairs, and the parallel corpus pairs are obtained by the parallel corpus generation method according to any one of the embodiments.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the parallel corpus generation method or the unsupervised synonymous transcription method provided by the above methods. The parallel corpus generation method comprises: obtaining a corpus to be transcribed and a context of the corpus to be transcribed; obtaining a keyword set based on the corpus to be transcribed; inputting the keyword set and the context of the corpus to be transcribed into a pre-training language model to obtain at least one candidate synonymous transcription corpus output by the pre-training language model; and evaluating each candidate synonymous transcription corpus and determining a target synonymous transcription corpus based on an evaluation result, wherein the evaluation at least comprises diversity evaluation. The unsupervised synonymous transcription method comprises: obtaining a sentence to be transcribed; and inputting the sentence to be transcribed into a synonymous transcription model to obtain a synonymous transcription sentence output by the synonymous transcription model, wherein the synonymous transcription model is obtained by training based on parallel corpus pairs, and the parallel corpus pairs are obtained by the parallel corpus generation method according to any one of the embodiments.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for generating parallel corpora is characterized by comprising the following steps:
obtaining a corpus to be transcribed and a context of the corpus to be transcribed;
obtaining a keyword set based on the corpus to be transcribed;
inputting the keyword set and the context of the corpus to be transcribed into a pre-training language model to obtain at least one candidate synonymous transcription corpus output by the pre-training language model;
and evaluating each candidate synonymous transcription corpus, and determining a target synonymous transcription corpus based on an evaluation result, wherein the evaluation at least comprises diversity evaluation.
2. The method for generating parallel corpora according to claim 1, wherein the obtaining a keyword set based on the corpus to be transcribed comprises:
extracting keywords from the corpus to be transcribed to obtain at least one initial keyword;
filtering the initial keywords to obtain filtered keywords;
obtaining synonymous keywords corresponding to the filtering keywords one by one based on the filtering keywords;
replacing part or all of the filtering keywords with synonymous keywords based on a preset replacement proportion to obtain an initial keyword set;
and rearranging the sequence of the keywords in the initial keyword set to obtain the keyword set.
3. The method for generating parallel corpus according to claim 2, wherein said obtaining synonymous keywords corresponding to said filtering keywords one by one based on said filtering keywords comprises:
inputting the corpus to be transcribed into a pre-training language model, taking the position of a single filtering keyword as a prediction object, and performing mask prediction on the corpus to be transcribed through the pre-training language model to obtain candidate synonymous keywords output by the pre-training language model, wherein each filtering keyword corresponds to at least one candidate synonymous keyword;
and determining the synonymy keyword corresponding to each filtering keyword in the candidate synonymy keywords.
4. The method for generating parallel corpora according to claim 1, wherein the inputting the keyword set and the context of the corpus to be transcribed into a pre-training language model to obtain at least one candidate synonymous transcription corpus output by the pre-training language model comprises:
inputting the keyword set and the context of the corpus to be transcribed into a pre-training language model, wherein the pre-training language model predicts words among the keywords under the constraint of the context of the corpus to be transcribed, to obtain at least one candidate synonymous transcription corpus.
5. The method for generating parallel corpora according to any one of claims 1 to 4, wherein the evaluating each of the candidate synonymous transcription corpora includes:
respectively calculating the semantic similarity of each candidate synonymous transcription corpus and the corpus to be transcribed, and obtaining the semantic score corresponding to each candidate synonymous transcription corpus;
calculating the generation probability of each candidate synonymy transcription corpus to obtain the fluency score corresponding to each candidate synonymy transcription corpus;
respectively calculating the Jaccard similarity between each candidate synonymous transcription corpus and the corpus to be transcribed, and obtaining a diversity score corresponding to each candidate synonymous transcription corpus;
and weighting the semantic score, the fluency score and the diversity score to obtain an evaluation result.
6. An unsupervised synonymous transcription method, comprising:
acquiring a sentence to be transcribed;
inputting the sentence to be transcribed into a synonymous transcription model to obtain a synonymous transcription sentence output by the synonymous transcription model;
the synonymous transcription model is obtained based on parallel corpus training, and the parallel corpus pair is obtained based on the parallel corpus generation method according to any one of claims 1 to 5.
7. A parallel corpus generating apparatus, comprising:
the first obtaining unit is used for obtaining the linguistic data to be transcribed and the context of the linguistic data to be transcribed, and the first processing unit is used for obtaining a keyword set based on the linguistic data to be transcribed;
the first transcription unit is used for inputting the keyword set and the context of the linguistic data to be transcribed into a pre-training language model, and obtaining at least one candidate synonymy transcription linguistic data output by the pre-training language model;
and the first evaluation unit is used for evaluating each candidate synonymous transcription corpus and determining a target synonymous transcription corpus based on an evaluation result, wherein the evaluation at least comprises diversity evaluation.
8. An unsupervised synonymous transcription device, comprising:
the second acquisition unit is used for acquiring the statement to be transcribed;
the second transcription unit is used for inputting the sentence to be transcribed to the synonymous transcription model and obtaining the synonymous transcription sentence output by the synonymous transcription model;
the synonymous transcription model is obtained based on parallel corpus training, and the parallel corpus pair is obtained based on the parallel corpus generation method according to any one of claims 1 to 5.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for generating parallel corpora according to any one of claims 1 to 5 or the unsupervised synonymous transcription method according to claim 6.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for generating parallel corpuses according to any one of claims 1 to 5 or the unsupervised synonymous transcription method according to claim 6.
CN202211497311.4A 2022-11-25 2022-11-25 Parallel corpus generation method and device and unsupervised synonymy transcription method and device Pending CN115809658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211497311.4A CN115809658A (en) 2022-11-25 2022-11-25 Parallel corpus generation method and device and unsupervised synonymy transcription method and device


Publications (1)

Publication Number Publication Date
CN115809658A true CN115809658A (en) 2023-03-17

Family

ID=85484308


Country Status (1)

Country Link
CN (1) CN115809658A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117313747A (en) * 2023-09-19 2023-12-29 重庆邮电大学 Method for generating sports war report by sports event explanation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination