CN113673261A - Data generation method and device and readable storage medium - Google Patents

Data generation method and device and readable storage medium

Info

Publication number
CN113673261A
CN113673261A
Authority
CN
China
Prior art keywords
language
text
language text
training
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111045889.1A
Other languages
Chinese (zh)
Inventor
穆畅
李响
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202111045889.1A priority Critical patent/CN113673261A/en
Publication of CN113673261A publication Critical patent/CN113673261A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a data generation method, an apparatus and a readable storage medium, the method comprising: carrying out noise adding processing on the initial first language text to obtain a noise-added first language text; processing the first language text after noise addition according to a pre-training language model to obtain a target first language text; performing reverse translation processing on the target first language text to obtain a second language text; and obtaining training data for training a translation model based on the target first language text and the second language text. The method disclosed by the invention can improve the diversity of the training data for training the translation model and solve the problem of training data shortage.

Description

Data generation method and device and readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a data generation method and apparatus, and a readable storage medium.
Background
Machine translation refers to the process of converting a source language into a target language by a computer. With the development of machine learning and deep learning techniques, machine translation has gradually moved from statistical machine translation into the era of neural machine translation. For a neural machine translation model, the training data used to train the model has a very important influence on its predictions, so how to better construct the training data of the translation model to improve its translation accuracy is an urgent problem to be solved.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a data generation method, apparatus, and readable storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a data generation method, including:
carrying out noise adding processing on the initial first language text to obtain a noise-added first language text;
processing the first language text after noise addition according to a pre-training language model to obtain a target first language text;
performing reverse translation processing on the target first language text to obtain a second language text;
and obtaining training data for training a translation model based on the target first language text and the second language text.
In some embodiments, the first language text is a chapter-level text, and the performing a reverse translation process on the target first language text to obtain a second language text includes:
splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;
according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;
and performing fusion processing on the plurality of second language sub texts to obtain the second language texts.
In some embodiments, performing the noise adding processing on the initial first language text to obtain the noise-added first language text includes:
carrying out deletion-and-replacement noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.
In some embodiments, performing the noise adding processing on the initial first language text to obtain the noise-added first language text includes:
performing out-of-order noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.
In some embodiments, the pre-trained language model is a bi-directional autoregressive transformer model.
In some embodiments, the reverse translation model is trained by:
obtaining a plurality of training samples; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text; wherein the sample first language sub-text and the sample second language sub-text are the sentence level text;
iteratively updating parameters of the initial reverse translation model based on a plurality of training samples to reduce loss function values corresponding to the training samples, and obtaining a trained reverse translation model;
wherein, the loss function value corresponding to each training sample is determined by the following process:
processing the sample first language sub-text through a reverse translation model to obtain a predicted second language sub-text;
determining a loss function value based at least on a difference of the predicted second language sub-text and the sample second language sub-text.
In some embodiments, the reverse translation model is a Transformer model.
According to a second aspect of the embodiments of the present disclosure, there is provided a data generation apparatus including:
the noise adding module is configured to add noise to the initial first language text to obtain a first language text after noise addition;
the processing module is configured to process the first language text subjected to noise addition according to a pre-training language model to obtain a target first language text;
the reverse translation module is configured to perform reverse translation processing on the target first language text to obtain a second language text;
a training data determination module configured to obtain training data for training a translation model based on the target first language text and the second language text.
In some embodiments, the first language text is chapter-level text, and the reverse translation module is further configured to:
splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;
according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;
and performing fusion processing on the plurality of second language sub texts to obtain the second language texts.
In some embodiments, the noise adding module is further configured to: carry out deletion-and-replacement noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.
In some embodiments, the noise adding module is further configured to: perform out-of-order noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.
In some embodiments, the pre-trained language model is a bi-directional autoregressive transformer model.
In some embodiments, the apparatus further comprises a training module configured to:
obtaining a plurality of training samples; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text; wherein the sample first language sub-text and the sample second language sub-text are the sentence level text;
iteratively updating parameters of the initial reverse translation model based on a plurality of training samples to reduce loss function values corresponding to the training samples, and obtaining a trained reverse translation model;
wherein, the loss function value corresponding to each training sample is determined by the following process:
processing the sample first language sub-text through a reverse translation model to obtain a predicted second language sub-text;
determining a loss function value based at least on a difference of the predicted second language sub-text and the sample second language sub-text.
In some embodiments, the reverse translation model is a Transformer model.
According to a third aspect of the embodiments of the present disclosure, there is provided a data generation apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the computer program in the memory to implement the steps of the method of any one of the first aspects of the disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of the first aspects of the present disclosure.
The technical solution provided by the embodiments of the present disclosure may have the following beneficial effects: the noise-added first language text is processed by the pre-trained language model, so that a target first language text different from the initial first language text can be generated; training data is then constructed based on the target first language text and a reverse translation result of the target first language text, so that training data reflecting different chapter-level content can be obtained, diversification of the training data is achieved, and the training effect of the translation model is improved; moreover, the pre-trained model does not need to be additionally trained, so that the efficiency of generating training data is greatly improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a method of data generation according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating training a reverse translation model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a block diagram illustrating a data generation apparatus according to an exemplary embodiment of the present disclosure;
fig. 4 is a block diagram illustrating a data generation apparatus according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Before describing the data generation method provided by the present disclosure, an application scenario related to the various embodiments of the present disclosure is first introduced. The present disclosure may be applied to the process of training a translation model, where the trained translation model may be used for translating chapter-level text.
In some embodiments, in a chapter translation scenario, the original chapter-level text may be split to obtain a plurality of sentence-level texts, and training data for training a chapter-level translation model is constructed from the translation results corresponding to the plurality of sentence-level texts and the original chapter-level text. However, this approach has the following drawbacks: complex dependency relationships exist among the sentences of a chapter, and after the chapter-level text is simply split, the resulting sentence-level texts no longer reflect these dependency relationships, so the finally generated translation results are contextually inconsistent and poorly coherent. Training data constructed in this way therefore reduces the training effect of the model. In addition, in the chapter translation scenario, the training data required for training the translation model is in short supply, which limits the wide application of chapter-level translation models.
Therefore, the present disclosure provides a data generation method, a data generation apparatus and a readable storage medium. The noise-added first language text is processed by a pre-trained language model, so that a target first language text different from the initial first language text, i.e., a target first language text containing recovery errors, can be generated; training data is then constructed based on the target first language text and a reverse translation result of the target first language text, so that training data reflecting different chapter-level content can be obtained, diversification of the training data is achieved, and the training effect of the translation model is improved. Moreover, the pre-trained model does not need to be additionally trained, so that the efficiency of generating training data is greatly improved.
Fig. 1 is a flow chart illustrating a method of data generation according to an exemplary embodiment, which may include the following steps, as shown in fig. 1.
In step S11, noise processing is performed on the initial first language text to obtain a noise-added first language text.
In some embodiments, the initial first language text is first language text that has not yet been processed. The first language text may be a text obtained by translating a text to be translated (i.e., a source language text). The first language text may be text in any language, such as English, Chinese or German. In some embodiments, the initial first language text may be chapter-level text. A chapter can be an integral language unit consisting of a series of consecutive words, phrases, clauses, sentences or paragraphs, and chapter-level text can be text formed from such a unit, such as an article, a book or a periodical. It will be appreciated that chapter-level text may include a large number of characters (e.g., four thousand, ten thousand, etc.).
In some embodiments, the noise-added first language text may be a text obtained by performing noise adding processing on the initial first language text. In some embodiments, the noise processing may include at least one of: replacing at least one word in the initial first language text, deleting at least one word in the initial first language text, deleting at least one word segment in the initial first language text, changing the order of at least two sentences in the initial first language text, and rotating the initial first language text.
In some embodiments, the noise adding processing on the initial first language text to obtain the noise-added first language text may include: carrying out deletion-and-replacement noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.
It will be appreciated that the deletion-and-replacement noise addition processing may be used to obtain a noise-added first language text lacking a preset number of words, or words at preset positions, relative to the initial first language text. The preset number and the preset positions can be set according to the actual situation. In some embodiments, the deletion-and-replacement noise addition processing may include at least one of: deleting at least one word in the initial first language text, and deleting at least one word segment in the initial first language text.
In some embodiments, when the noising process is a replacement process, at least one word in the initial first language text may be replaced with a preset marker, for example the marker MASK. The replacement may be a random replacement. Illustratively, suppose the initial first language text includes a sentence A whose word sequence is {a1, a2, a3, a4, a5}. If the word a3 in sentence A is replaced, the word sequence of sentence A in the noise-added first language text becomes {a1, a2, MASK, a4, a5}. The manner of replacing other words is similar to this replacement process and is not repeated here.
In some embodiments, replacing at least one word in the initial first language text may include: replacing at least one word other than the connection words and the entity words in the initial first language text. A connection word may refer to a word used to join sentences, e.g., a word expressing causality, transition, summarization or inference. An entity may be any object that can be described, such as a service, a person name or a place name, and an entity word may be a word corresponding to an entity. By replacing only words other than the connection words and the entity words, it can be ensured as far as possible that the overall structure of the noise-added first language text does not change, so that the subsequently generated training data better matches the domain of real text data and the translation model is trained more effectively.
In some embodiments, at least one word in the initial first language text may be randomly deleted. For example, suppose the initial first language text includes sentence A whose word sequence is {a1, a2, a3, a4, a5}. If the words a4 and a5 are deleted, the word sequence of sentence A in the noise-added first language text becomes {a1, a2, a3}.
In some embodiments, a word segment may be a segment consisting of a plurality of contiguous words. In some embodiments, the word segment may be a contiguous plurality of words in a sentence included in the initial first language text. It is worth mentioning that a plurality of word segments may be word segments contained in the same sentence or in different sentences of the initial first language text. For example, still taking sentence A with the word sequence {a1, a2, a3, a4, a5} as an example, the plurality of word segments may include: a1-a2, a1-a2-a3, a2-a3-a4-a5, and the like.
In some embodiments, at least one word segment in the first language text may be randomly deleted. For example, still taking the foregoing example, the word segment a1-a2 or a1-a2-a3 may be deleted, and the word sequence of sentence A in the noise-added first language text then becomes {a3, a4, a5} or {a4, a5}. In some embodiments, the length of each of the at least one word segment follows a Poisson distribution with λ = 3.
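By way of non-limiting illustration only, the following Python sketch shows one possible implementation of the deletion-and-replacement noise addition processing described above (replacing words with a MASK marker, deleting individual words, and deleting word segments whose lengths follow a Poisson distribution with λ = 3). The function names, the probabilities and the sampling choices are assumptions of this sketch and do not limit the present disclosure.

```python
import math
import random

MASK = "MASK"  # preset marker used for replacement (illustrative assumption)

def sample_poisson(lam=3.0):
    # Minimal Poisson sampler (Knuth's method); numpy.random.poisson could be used instead.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def replace_words(words, num_to_replace=1, protected=frozenset()):
    """Randomly replace words with MASK, skipping protected connection/entity words."""
    words = words[:]
    candidates = [i for i, w in enumerate(words) if w not in protected]
    for i in random.sample(candidates, min(num_to_replace, len(candidates))):
        words[i] = MASK
    return words

def delete_words(words, num_to_delete=1):
    """Randomly delete a preset number of words."""
    drop = set(random.sample(range(len(words)), min(num_to_delete, len(words))))
    return [w for i, w in enumerate(words) if i not in drop]

def delete_segment(words):
    """Delete one contiguous word segment whose length follows Poisson(lambda = 3)."""
    length = min(sample_poisson(), len(words))
    if length == 0:
        return words[:]
    start = random.randrange(len(words) - length + 1)
    return words[:start] + words[start + length:]

# Example mirroring sentence A = {a1, ..., a5}: delete a segment, then mask one word.
sentence_a = ["a1", "a2", "a3", "a4", "a5"]
print(replace_words(delete_segment(sentence_a)))
```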
In some embodiments, the noise adding processing on the initial first language text to obtain the noise-added first language text may include: performing out-of-order noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.
It will be appreciated that the out-of-order noise addition processing can be used to obtain a noise-added first language text whose sentence order or word order differs from that of the initial first language text. In some embodiments, the out-of-order noise addition processing may include at least one of: replacing at least one word in the initial first language text, changing the order of at least two sentences in the initial first language text, and rotating the initial first language text.
In some embodiments, at least one word in the initial first language text may be replaced with a preset marker, for example the marker MASK. The replacement may be a random replacement. Illustratively, still taking sentence A in the initial first language text with the word sequence {a1, a2, a3, a4, a5} as an example, if the word a3 in sentence A is replaced, the word sequence of sentence A in the noise-added first language text becomes {a1, a2, MASK, a4, a5}. The manner of replacing other words is similar to this replacement process and is not repeated here.
In some embodiments, replacing at least one word in the initial first language text may include: replacing at least one word other than the connection words and the entity words in the initial first language text. A connection word may refer to a word used to join sentences, e.g., a word expressing causality, transition, summarization or inference. An entity may be any object that can be described, such as a service, a person name or a place name, and an entity word may be a word corresponding to an entity. By replacing only words other than the connection words and the entity words, it can be ensured as far as possible that the overall structure of the noise-added first language text does not change, so that the subsequently generated training data better matches the domain of real text data and the translation model is trained more effectively.
In some embodiments, the order of at least two sentences in the initial first language text may be randomly changed. In some embodiments, the initial first language text may be divided into a plurality of sentences with periods as separators, and the order of the plurality of sentences may be randomly changed. For example, taking the sentence sequence of the initial first language text as {A, B, C, D, E}, the order of sentences A to E can be changed arbitrarily, for example to obtain the sentence sequence {D, E, A, B, C} for the noise-added first language text.
In some embodiments, rotating the initial first language text may refer to randomly selecting a sentence and, taking that sentence as the new beginning, rotating the initial first language text around it. For example, still taking the sentence sequence of the initial first language text as {A, B, C, D, E}, if the initial first language text is rotated around sentence D, the sentence sequence of the noise-added first language text is {D, E, A, B, C}.
In some embodiments, multiple kinds of noise processing may be performed on the initial first language text to obtain a noise-added first language text. In some embodiments, the multiple kinds of noise processing may include the deletion-and-replacement noise addition processing and the out-of-order noise addition processing. The execution order of the multiple kinds of noise processing can be set according to the actual situation, and the present disclosure does not limit this. For example, still taking the initial first language text with the sentence sequence {A, B, C, D, E}, where the word sequence of sentence A is {a1, a2, a3, a4, a5}, as an example, suppose the noise processing includes: changing the order of sentences A and B → rotating the initial first language text around sentence D → deleting the word segment a1-a2 in sentence A → deleting the word a4 in sentence A → replacing the word a3 in sentence A. The sentence sequence of the noise-added first language text is then {D, E, B, A, C}, and the word sequence of sentence A is {MASK, a5}. It should be noted that, in order to simplify the examples of the present disclosure, the sentences included in the initial first language text and the words included in the sentences are not exhaustively listed, and the present disclosure does not limit the number of sentences or the number of words.
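As a further non-limiting illustration, the following Python sketch shows one possible implementation of the out-of-order noise addition processing (randomly reordering sentences and rotating the text around a randomly selected sentence), composed into a simple noising pipeline. Splitting on periods and the function names are assumptions of this sketch.

```python
import random

def split_sentences(text):
    """Divide the text into sentences using periods as separators (illustrative assumption)."""
    return [s.strip() for s in text.split(".") if s.strip()]

def shuffle_sentences(sentences):
    """Randomly change the order of at least two sentences."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled

def rotate_sentences(sentences):
    """Randomly select a sentence and rotate the text so that it becomes the new beginning."""
    if not sentences:
        return sentences
    start = random.randrange(len(sentences))
    return sentences[start:] + sentences[:start]

def out_of_order_noise(text):
    """Out-of-order noising; the deletion-and-replacement noising sketched earlier
    may additionally be applied to the individual sentences."""
    sentences = rotate_sentences(shuffle_sentences(split_sentences(text)))
    return ". ".join(sentences) + "."

# Example with the sentence sequence {A, B, C, D, E} used in the description.
print(out_of_order_noise("A. B. C. D. E"))
```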
In step S12, the noisy first language text is processed according to the pre-training language model to obtain a target first language text.
In some embodiments, the target first language text may be a text obtained by reconstructing the noise-added first language text with the pre-trained language model. Since the pre-trained language model processes noise-added data, in some embodiments there is a difference between the target first language text and the initial first language text; that is, the reconstructed text differs from the real text and contains recovery errors.
In some embodiments, the pre-trained language model may be a pre-trained neural network model for text generation. In some embodiments, the pre-trained language model may be a Bidirectional and Auto-Regressive Transformers (BART) model, which is a pre-trained language model built on the overall Transformer architecture. When the BART model is pre-trained, the input text is first corrupted with various kinds of noise and is then reconstructed by a sequence-to-sequence model to obtain an output sample.
In some embodiments, the BART model may include an encoder and a decoder, where the encoder uses the encoder component of the BERT (Bidirectional Encoder Representations from Transformers) model, which can encode the input text from two directions to obtain more context information, and the decoder uses the decoder component of GPT (Generative Pre-Training) for reconstructing the input text.
In some embodiments, the BART model may process noise-added first language text that includes a preset number of characters, and thus the initial first language text may be language text that includes the preset number of characters, for example two thousand characters. In some embodiments, the initial first language text may be a small chapter-level language text (e.g., a 2,000-character language text) resulting from splitting a large chapter-level language text (e.g., a 10,000-character language text). In some embodiments, a large amount of training data can be constructed by performing the data generation method of the present disclosure on a plurality of initial first language texts each including the preset number of characters, thereby improving the accuracy of translation model training.
Because the BART model uses the contextual, i.e., bidirectional, semantic information of the text during pre-training encoding, the target first language text reconstructed by the BART model is closer in semantics to the real language text (i.e., the initial first language text) and is more fluent and coherent. Consequently, the second language text obtained by reverse translation of the target first language text is also more fluent and coherent, which facilitates the subsequent generation of training data.
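For illustration only, a minimal sketch of step S12 using the Hugging Face transformers library is given below; the facebook/bart-base checkpoint, the generation parameters and the use of this particular library are assumptions of the sketch rather than requirements of the present disclosure.

```python
# A minimal sketch, assuming the pre-trained language model is a BART checkpoint
# available through the Hugging Face transformers library.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")            # assumed checkpoint
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def reconstruct(noised_text, max_length=1024):
    """Process the noise-added first language text with the pre-trained model and
    decode the generated sequence as the target first language text."""
    inputs = tokenizer(noised_text, return_tensors="pt",
                       truncation=True, max_length=max_length)
    output_ids = model.generate(**inputs, num_beams=4, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# BART's own mask token <mask> stands in for the MASK marker of the examples above.
target_first_language_text = reconstruct("D. E. B. a1 a2 <mask> a5. C.")
print(target_first_language_text)
```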
In step S13, the target first language text is subjected to a reverse translation process to obtain a second language text.
In step S14, based on the target first language text and the second language text, training data for training a translation model is obtained.
In some embodiments, the reverse translation process may refer to a process of translating the language text obtained by translation back into source language text. In some embodiments, the second language text may be source language text. For example, if the target first language text is English text, the second language text may be Chinese text. For another example, if the target first language text is German text, the second language text may be English text. The first language text and the second language text correspond to different languages. It should be noted that the first language text and the second language text may be texts corresponding to any pair of languages, and the disclosure is not limited in this respect.
In some embodiments, the reverse translation process may be performed according to a model. As mentioned above, the text in the first language is chapter-level text, and in some embodiments, the chapter-level text may be split to obtain sentence-level text, so as to perform the inverse translation process based on the sentence-level text.
In some embodiments, performing a reverse translation process on the target first language text to obtain a second language text may include: splitting the target first language text to obtain a plurality of target first language sub-texts; the first language sub-text is a sentence-level text; according to the reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts; and carrying out fusion processing on the plurality of second language sub texts to obtain the second language texts.
In some embodiments, the plurality of target first language sub-texts may be a plurality of sentence-level texts obtained by sentence-splitting the target first language texts at chapter level. In some embodiments, the reverse translation model may be a pre-trained machine learning model, and the trained reverse translation model may output the second language sub-text based on the input first language sub-text. The second language sub-text may be the source language text of the first language sub-text, and as will be appreciated, the second language sub-text may be sentence level text. For the training process of the reverse translation model, reference may be made to fig. 2 and the related description thereof, which are not described herein again.
In some embodiments, a plurality of second language sub-texts may be subjected to fusion processing to obtain a second language text. In some embodiments, the fusion process may include stitching. In some embodiments, the plurality of second language sub-texts may be stitched according to the sentence structure of the target first language text. For example, still taking the sentence sequence of the target first language text as {A, B, C, D, E}, if the plurality of second language sub-texts are A', B', C', D' and E', the concatenation order of the plurality of second language sub-texts is A' → B' → C' → D' → E'.
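The following Python sketch illustrates one possible implementation of the splitting, sentence-level reverse translation and fusion described above, together with the construction of a training pair in step S14. The English-to-Chinese MarianMT checkpoint Helsinki-NLP/opus-mt-en-zh and the period-based sentence splitting are assumptions of this sketch, since the disclosure only requires some trained sentence-level reverse translation model.

```python
# A minimal sketch, assuming an English target first language text, a Chinese second
# language text, and a MarianMT checkpoint as the reverse translation model.
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-zh"                 # assumed reverse translation model
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

def reverse_translate(target_first_language_text):
    """Split the chapter-level text into sentence-level sub-texts, reverse-translate
    each sub-text, and fuse (stitch) the second language sub-texts in the original order."""
    sub_texts = [s.strip() + "." for s in target_first_language_text.split(".") if s.strip()]
    second_language_sub_texts = []
    for sub_text in sub_texts:
        inputs = tokenizer(sub_text, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, num_beams=4, max_length=256)
        second_language_sub_texts.append(
            tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return "".join(second_language_sub_texts)

target_text = "Sentence one. Sentence two. Sentence three."
second_language_text = reverse_translate(target_text)
# Step S14: the pair forms one parallel training sample for the chapter-level translation model.
training_pair = (second_language_text, target_text)
```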
In some embodiments, training data may be derived based on the target first language text and the second language text. The target first language text is a language text obtained through translation, the second language text is a source language text, a parallel data set can be constructed through the target first language text and the second language text, and the parallel data set can be used for training a translation model. In some embodiments, the translation model may be a model for chapter-level text translation.
The chapter-level target first language text is split into a plurality of sentence-level target first language sub-texts; the plurality of target first language sub-texts are subjected to reverse translation processing according to the reverse translation model to obtain a plurality of second language sub-texts; and the plurality of second language sub-texts are fused to obtain the second language text. Because the reverse translation model only needs to perform sentence-level translation, its training is simple and easy to apply, and the efficiency of obtaining training data for the translation model can be improved. Moreover, because the target first language text conforms well to the semantics of the real language text and is highly fluent and coherent, the sentence-level target first language sub-texts obtained by splitting it still carry the contextual semantic information, so that the chapter-level second language text obtained from the plurality of sentence-level target first language sub-texts is also more fluent and coherent.
In addition, the present disclosure employs a BART model pre-trained on a large corpus, so no additional model needs to be trained. Meanwhile, the target first language text obtained through the BART model not only keeps the overall structural style of the initial first language text, but the noise addition also increases the diversity of the generated language text, which further increases the diversity of the finally generated chapter-level second language text, effectively alleviates the shortage of training data for the translation model in a chapter translation scenario, and facilitates the training of the subsequent translation model.
FIG. 2 is a flow diagram illustrating training a reverse translation model in accordance with an exemplary embodiment. As shown in fig. 2, the process includes:
in step S21, a plurality of training samples are acquired; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text; wherein the sample first language sub-text and the sample second language sub-text are the sentence level text.
In some embodiments, the training samples may be data input into the initial reverse translation model for training the reverse translation model. In some embodiments, the sample first language sub-text may be a sentence-level translated target language text and the sample second language sub-text may be a sentence-level source language text. For more details about the sample first language sub-text and the sample second language sub-text, reference may be made to step S14 and its related description, which are not repeated herein.
In some embodiments, multiple training samples may be obtained through a database or by invoking an associated interface.
In step S22, parameters of the initial reverse translation model are iteratively updated based on a plurality of training samples to reduce the loss function values corresponding to the training samples, so as to obtain a trained reverse translation model.
In some embodiments, the reverse translation model may be a Transformer model. During the training of the reverse translation model, parameters of the initial reverse translation model may be iteratively updated based on a plurality of training samples. Specifically, the parameters of the initial reverse translation model can be continuously adjusted to reduce the loss function values corresponding to the training samples, so that the loss function values satisfy a preset condition, for example, the loss function value converges or the loss function value is less than a preset value. When the loss function satisfies the preset condition, model training is completed and the trained reverse translation model is obtained. The trained reverse translation model can perform reverse translation processing on a target first language sub-text to obtain a second language sub-text.
In some embodiments, the loss function value for each training sample is determined by: processing the sample first language sub-text through a reverse translation model to obtain a predicted second language sub-text; determining a loss function value based at least on a difference between the predicted second language sub-text and the sample second language sub-text.
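For illustration, a minimal PyTorch-style training sketch of the process of FIG. 2 follows; the AdamW optimizer, the single-sample batches, the assumed initial checkpoint and the use of a recent transformers API that returns the loss when labels are provided are assumptions of this sketch rather than requirements of the present disclosure.

```python
# A minimal sketch of training the reverse translation model on sentence-level
# (sample first language sub-text, sample second language sub-text) pairs.
import torch
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-zh"            # assumed initial reverse translation model
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

training_samples = [
    ("A sample sentence in the first language.", "第二语言的样本句子。"),  # illustrative pair
]

model.train()
for epoch in range(3):                               # iterate until the loss meets a preset condition
    for first_sub_text, second_sub_text in training_samples:
        batch = tokenizer(first_sub_text, text_target=second_sub_text,
                          return_tensors="pt", truncation=True)
        # With labels provided, the forward pass returns the cross-entropy loss, i.e. the
        # difference between the predicted and the sample second language sub-text.
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```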
Fig. 3 is a block diagram illustrating a data generation apparatus 300 according to an example embodiment. Referring to fig. 3, the apparatus includes a noise adding module 310, a processing module 320, a reverse translation module 330, and a training data determining module 340.
The noise adding module 310 is configured to add noise to the initial first language text to obtain a noise-added first language text.
The processing module 320 is configured to process the noisy first language text according to a pre-training language model to obtain a target first language text.
The reverse translation module 330 is configured to perform a reverse translation process on the target first language text to obtain a second language text.
The training data determination module 340 is configured to derive training data for training a translation model based on the target first language text and the second language text.
In some embodiments, the first language text is chapter-level text, and the reverse translation module 330 is further configured to:
splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;
according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;
and performing fusion processing on the plurality of second language sub texts to obtain the second language texts.
In some embodiments, the noise adding module 310 is further configured to: carry out deletion-and-replacement noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.
In some embodiments, the noise adding module 310 is further configured to: perform out-of-order noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.
In some embodiments, the pre-trained language model is a bi-directional autoregressive transformer model.
In some embodiments, the apparatus further comprises a training module configured to:
obtaining a plurality of training samples; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text; wherein the sample first language sub-text and the sample second language sub-text are the sentence level text;
iteratively updating parameters of the initial reverse translation model based on a plurality of training samples to reduce loss function values corresponding to the training samples, and obtaining a trained reverse translation model;
wherein, the loss function value corresponding to each training sample is determined by the following process:
processing the sample first language sub-text through a reverse translation model to obtain a predicted second language sub-text;
determining a loss function value based at least on a difference of the predicted second language sub-text and the sample second language sub-text.
In some embodiments, the reverse translation model is a Transformer model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating a data generation apparatus 400 according to an example embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an interface for input/output (I/O) 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the data generation method described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the apparatus 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 406 provide power to the various components of device 400. Power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 400 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor assembly 414 may detect an open/closed state of the apparatus 400, the relative positioning of the components, such as a display and keypad of the apparatus 400, the sensor assembly 414 may also detect a change in the position of the apparatus 400 or a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor assembly 414 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described data generation methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 404 comprising instructions, executable by the processor 420 of the apparatus 400 to perform the data generation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned data generation method when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of generating data, comprising:
carrying out noise adding processing on the initial first language text to obtain a noise-added first language text;
processing the first language text after noise addition according to a pre-training language model to obtain a target first language text;
performing reverse translation processing on the target first language text to obtain a second language text;
and obtaining training data for training a translation model based on the target first language text and the second language text.
2. The method of claim 1, wherein the first language text is chapter-level text;
the performing reverse translation processing on the target first language text to obtain a second language text comprises:
splitting the target first language text to obtain a plurality of target first language sub-texts; wherein the first language sub-text is a sentence level text;
according to a reverse translation model, performing reverse translation processing on the plurality of target first language sub-texts to obtain a plurality of second language sub-texts;
and performing fusion processing on the plurality of second language sub texts to obtain the second language texts.
3. The method according to claim 1, wherein the performing noise adding processing on the initial first language text to obtain a noise-added first language text comprises:
carrying out deletion-and-replacement noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text lacks a preset number of words, or words at preset positions, relative to the initial first language text.
4. The method according to claim 1, wherein the performing noise adding processing on the initial first language text to obtain a noise-added first language text comprises:
performing out-of-order noise addition processing on the initial first language text to obtain a noise-added first language text, wherein the noise-added first language text differs in sentence order or word order from the initial first language text.
5. The method of claim 1, wherein the pre-trained language model is a bi-directional autoregressive transformer model.
6. The method of claim 2, wherein the reverse translation model is trained by:
obtaining a plurality of training samples; wherein each of the training samples comprises a sample first language sub-text and a sample second language sub-text; wherein the sample first language sub-text and the sample second language sub-text are the sentence level text;
iteratively updating parameters of the initial reverse translation model based on a plurality of training samples to reduce loss function values corresponding to the training samples, and obtaining a trained reverse translation model;
wherein, the loss function value corresponding to each training sample is determined by the following process:
processing the sample first language sub-text through a reverse translation model to obtain a predicted second language sub-text;
determining a loss function value based at least on a difference of the predicted second language sub-text and the sample second language sub-text.
7. The method of claim 6, wherein the reverse translation model is a Transformer model.
8. A data generation apparatus, comprising:
the noise adding module is configured to add noise to the initial first language text to obtain a first language text after noise addition;
the processing module is configured to process the first language text subjected to noise addition according to a pre-training language model to obtain a target first language text;
the reverse translation module is configured to perform reverse translation processing on the target first language text to obtain a second language text;
a training data determination module configured to obtain training data for training a translation model based on the target first language text and the second language text.
9. A data generation apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the computer program in the memory to implement the steps of the method of any one of claims 1-7.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202111045889.1A 2021-09-07 2021-09-07 Data generation method and device and readable storage medium Pending CN113673261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111045889.1A CN113673261A (en) 2021-09-07 2021-09-07 Data generation method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111045889.1A CN113673261A (en) 2021-09-07 2021-09-07 Data generation method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN113673261A true CN113673261A (en) 2021-11-19

Family

ID=78548699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111045889.1A Pending CN113673261A (en) 2021-09-07 2021-09-07 Data generation method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113673261A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444488A (en) * 2022-01-26 2022-05-06 中国科学技术大学 Reading understanding method, system, device and storage medium for few-sample machine
CN116187282A (en) * 2022-12-30 2023-05-30 北京百度网讯科技有限公司 Training method of text review model, text review method and device
CN116187282B (en) * 2022-12-30 2024-03-08 北京百度网讯科技有限公司 Training method of text review model, text review method and device
CN116579352A (en) * 2023-04-25 2023-08-11 无锡捷通数智科技有限公司 Translation model training method and device, mobile terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination