CN113408305A - Model training method, device, equipment and storage medium - Google Patents
- Publication number
- CN113408305A CN113408305A CN202110737842.5A CN202110737842A CN113408305A CN 113408305 A CN113408305 A CN 113408305A CN 202110737842 A CN202110737842 A CN 202110737842A CN 113408305 A CN113408305 A CN 113408305A
- Authority
- CN
- China
- Prior art keywords
- sample
- language
- pair
- real
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure provides a model training method, device, apparatus, and storage medium, relating to the field of computer technology, and in particular to artificial intelligence fields such as natural language processing and deep learning. The training method of the speech translation model comprises the following steps: acquiring an original data pair, wherein the original data pair comprises first original data and second original data, and the first original data and the second original data differ in language or modality; processing the original data pair with a corpus generation model to obtain a generated data pair, wherein the generated data pair comprises first generated data and second generated data, and the first generated data and the second generated data differ in both language and modality; and training a translation model with the generated data pairs. The present disclosure can expand the scale of the training data of translation models.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to artificial intelligence fields such as natural language processing and deep learning, and more particularly, to a model training method, apparatus, device, and storage medium.
Background
Speech translation refers to the conversion of source-language speech into target-language text. Speech translation can be performed in a pipeline mode or an end-to-end mode. In the pipeline mode, a speech recognition model converts the source-language speech into source-language text, and a machine translation model then converts the source-language text into target-language text; in the end-to-end mode, a single speech translation model converts the source-language speech directly into target-language text. The models used in both modes require large-scale training data, and, generally speaking, at the same training data scale the end-to-end mode achieves better speech translation results.
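The contrast between the two modes can be sketched as follows. This is an illustrative stand-in, not the patent's implementation: the three model functions only mimic the data flow (speech in, text out) with placeholder strings.

```python
# Toy stand-ins for the three models discussed above; names and behavior
# are assumptions made purely to illustrate the two composition patterns.

def asr(source_speech: str) -> str:
    """Speech recognition stand-in: source-language speech -> source text."""
    return f"src_text({source_speech})"

def mt(source_text: str) -> str:
    """Machine translation stand-in: source text -> target text."""
    return f"tgt_text({source_text})"

def st(source_speech: str) -> str:
    """End-to-end speech translation stand-in: speech -> target text."""
    return f"tgt_text(src_text({source_speech}))"

def pipeline_translate(source_speech: str) -> str:
    # Pipeline mode: two models chained, with text as the intermediate form.
    return mt(asr(source_speech))

def end_to_end_translate(source_speech: str) -> str:
    # End-to-end mode: one model maps speech directly to target text.
    return st(source_speech)
```

Both paths produce target-language text; the difference is whether an intermediate source-language transcript exists, which is why the two modes need different kinds of training data.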
In the end-to-end mode, a speech translation model must be trained, and its training data comprises source-language speech and the corresponding target-language text. In the related art, such training data can be obtained by manual labeling.
Disclosure of Invention
The disclosure provides a model training method, a model training device, a model training apparatus and a storage medium.
According to an aspect of the present disclosure, there is provided a method for training a speech translation model, including: acquiring an original data pair, wherein the original data pair comprises first original data and second original data, and the first original data and the second original data differ in language or modality; processing the original data pair with a corpus generation model to obtain a generated data pair, wherein the generated data pair comprises first generated data and second generated data, and the first generated data and the second generated data differ in both language and modality; and training a translation model with the generated data pairs.
According to another aspect of the present disclosure, there is provided a training method of a corpus generation model, including: performing corpus generation processing on a first real sample pair with a corpus generation model to obtain a pseudo sample pair corresponding to the first real sample pair, wherein the first real sample pair comprises a to-be-processed real sample and a non-processed real sample, which differ in language or modality, and the pseudo sample pair comprises a first pseudo sample and a second pseudo sample, which differ in both language and modality; performing discrimination processing on a second real sample pair and the pseudo sample pair with a discrimination model to obtain a discrimination result; constructing a loss function based on the discrimination result; and training the corpus generation model based on the loss function.
According to another aspect of the present disclosure, there is provided a training apparatus for a speech translation model, including: an acquisition module, configured to acquire an original data pair, wherein the original data pair comprises first original data and second original data, and the first original data and the second original data differ in language or modality; a generation module, configured to process the original data pair with a corpus generation model to obtain a generated data pair, wherein the generated data pair comprises first generated data and second generated data, and the first generated data and the second generated data differ in both language and modality; and a training module, configured to train a translation model with the generated data pairs.
According to another aspect of the present disclosure, there is provided a training apparatus for a corpus generation model, including: a generation module, configured to perform corpus generation processing on a first real sample pair with a corpus generation model to obtain a pseudo sample pair corresponding to the first real sample pair, wherein the first real sample pair comprises a to-be-processed real sample and a non-processed real sample, which differ in language or modality, and the pseudo sample pair comprises a first pseudo sample and a second pseudo sample, which differ in both language and modality; a discrimination module, configured to perform discrimination processing on a second real sample pair and the pseudo sample pair with a discrimination model to obtain a discrimination result; a construction module, configured to construct a loss function based on the discrimination result; and a training module, configured to train the corpus generation model based on the loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical scheme of the disclosure, the scale of the training data of the translation model can be expanded.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a ninth embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an electronic device for implementing any one of the training methods of the translation model and the corpus generation model according to the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a method for training a translation model, which comprises the following steps:
101. Acquiring an original data pair, wherein the original data pair comprises first original data and second original data, and the first original data and the second original data differ in language or modality.
102. Processing the original data pair with a corpus generation model to obtain a generated data pair, wherein the generated data pair comprises first generated data and second generated data, and the first generated data and the second generated data differ in both language and modality.
103. Training a translation model with the generated data pairs.
The translation model may be a cross-language, cross-modality translation model. Cross-language means that the input and output languages of the translation model differ, for example Chinese and English; cross-modality means that the input and output modalities differ. A modality is a data form, such as text or speech.
Taking the translation model as a speech translation model as an example, a Speech Translation (ST) model converts speech in a first language into text in a second language, where the first language may also be called the source language and the second language the target language. The first language is different from the second language.
As shown in fig. 2, the input of the speech translation model is speech in the first language and the output is text in the second language; for example, the first language is Chinese and the second language is English.
The training data of the speech translation model comprises speech in the first language, denoted s, and text in the second language, denoted y; a large number of < s, y > data pairs are required as training data for the speech translation model.
In the related art, < s, y > pairs can be acquired by manual labeling, but the resulting data scale is small and the process is inefficient.
The original data pair may be obtained from existing training data and may be expressed as < first original data, second original data >, where the first original data and the second original data differ in language or modality. When the languages differ, the first original data may be text in the first language and the second original data the corresponding text in the second language; for example, the first original data is the Chinese text "我" and the second original data is the English text "me". When the modalities differ, the first original data may be text in the first language and the second original data the corresponding speech in the first language; for example, the first original data is the text "我" and the second original data is the Chinese speech corresponding to "我".
In this embodiment, the generated data pair is obtained by using the original data pair and the corpus generating model, and the generated data pair may be used as training data of the translation model, and the translation model may be trained by using the generated data pair, so that the scale of the training data of the translation model may be expanded.
In some embodiments, processing the original data pair with the corpus generation model to obtain the generated data pair includes: processing the first original data with the corpus generation model to obtain the first generated data, wherein the first generated data differs from the first original data in language or modality and differs from the second original data in both language and modality; and taking the second original data as the second generated data.
In this embodiment, the corpus generation model processes the first original data to obtain first generated data whose language or modality differs from that of the first original data, and the second original data serves as the second generated data. A generated data pair whose elements differ in both language and modality can thus be obtained, and using such pairs as training data for the translation model expands the scale of that training data.
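The pair-construction logic above (transform the first element, reuse the second) can be sketched as follows. This is a minimal illustration under assumed names: `corpus_generate` stands in for whatever corpus generation model (TTS or MT) the embodiment uses.

```python
# Minimal sketch of the generated-pair construction in this embodiment.
# `corpus_generate` is a hypothetical stand-in for the corpus generation
# model; real implementations would call a TTS or MT model here.

def corpus_generate(first_original: str) -> str:
    """Stand-in: produce data in a different language or modality."""
    return f"generated({first_original})"

def build_generated_pair(original_pair):
    first_original, second_original = original_pair
    # The first element is transformed into a new language/modality...
    first_generated = corpus_generate(first_original)
    # ...while the second element is reused unchanged.
    second_generated = second_original
    return (first_generated, second_generated)
```

Each original pair thus yields one extra training pair whose elements differ in both language and modality, which is the property the translation model's training data needs.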
The pipeline mode of speech translation uses an Automatic Speech Recognition (ASR) model to convert source-language speech into source-language text, and then a Machine Translation (MT) model to convert the source-language text into target-language text. Because the pipeline-mode technology is mature, large amounts of training data for the corresponding models have been accumulated. Therefore, original data pairs can be obtained from the training data of the ASR model and/or the training data of the MT model, and generated data pairs can then be produced from those original data pairs.
In some embodiments, processing the first original data to obtain the first generated data includes: performing speech synthesis processing on the text in the first language with a Text-To-Speech (TTS) generation model to obtain generated speech in the first language corresponding to that text, and taking the generated speech in the first language as the first generated data.
That is, the training data of the speech translation model can be obtained using the existing training data corresponding to the MT model and the TTS generation model as the corpus generation model. The existing training data corresponding to the MT model includes a first language text and a second language text corresponding to each other.
For example, as shown in FIG. 3, the original data pair is <x0, y0>, which can be obtained from the training data of an existing MT model. x0 is text in the first language and y0 is the corresponding text in the second language; for example, x0 is the first-language text "Do you guess what is something big today?" and y0 is the corresponding second-language text "Guess what's going on today?". x0 is input into the TTS generation model, which performs TTS processing on x0 to obtain the generated first-language speech s0′ corresponding to the first-language text; s0′ is the TTS-synthesized Chinese speech for that sentence. Thereafter, <s0′, y0> forms a generated data pair used as training data for the speech translation model.
That is, starting from text in the first language and the corresponding text in the second language, and performing speech synthesis on the first-language text with the TTS generation model, generated speech in the first language is obtained; training data for the speech translation model can then be constructed from the generated first-language speech and the second-language text.
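The TTS route can be condensed into one function. A hedged sketch: `tts` is a placeholder for the TTS generation model, and the tuple layout follows the <x0, y0> convention of the example above.

```python
# Sketch of the TTS route: an MT training pair <x0, y0> (first-language
# text, second-language text) becomes an ST training pair <s0', y0>.
# `tts` is a hypothetical stand-in for the TTS generation model.

def tts(text: str) -> str:
    """Stand-in speech synthesis: first-language text -> generated speech."""
    return f"speech({text})"

def mt_pair_to_st_pair(mt_pair):
    x0, y0 = mt_pair        # <first-language text, second-language text>
    s0_prime = tts(x0)      # generated first-language speech
    return (s0_prime, y0)   # <generated speech, target-language text>
```

The second-language text y0 passes through untouched; only the first element changes modality (text to speech), so the resulting pair matches the <speech, text> shape the ST model trains on.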
In some embodiments, processing the first original data to obtain the first generated data includes: performing machine translation processing on the text in the first language with an MT generation model to obtain generated text in the second language corresponding to the first-language text, and taking the generated text in the second language as the first generated data.
That is, the existing training data corresponding to the ASR model and the MT generation model as the corpus generation model can be used to obtain the training data of the speech translation model. The existing training data corresponding to the ASR model includes speech in a first language and text in the first language that correspond to each other.
For example, as shown in FIG. 4, the original data pair is <s0, x0>, which can be obtained from the training data of an existing ASR model. s0 is speech in the first language and x0 is the corresponding text in the first language; for example, x0 is the first-language text "Do you guess what is something big today?" and s0 is the corresponding Chinese speech. x0 is input into the MT generation model, which performs MT processing on x0 to obtain the generated second-language text y0′ corresponding to the first-language text; y0′ is the MT-produced English text for that sentence. Thereafter, <s0, y0′> forms a generated data pair used as training data for the speech translation model.
That is, starting from speech in the first language and the corresponding text in the first language, and performing MT processing on the first-language text with the MT generation model, generated text in the second language is obtained; training data for the speech translation model can then be constructed from the first-language speech and the generated second-language text.
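The MT route is symmetric to the TTS route. Again a hedged sketch with placeholder names: `mt` stands in for the MT generation model, and the tuple layout follows the <s0, x0> convention of the example above.

```python
# Sketch of the MT route: an ASR training pair <s0, x0> (first-language
# speech, first-language text) becomes an ST training pair <s0, y0'>.
# `mt` is a hypothetical stand-in for the MT generation model.

def mt(text: str) -> str:
    """Stand-in machine translation: first-language text -> second-language text."""
    return f"translated({text})"

def asr_pair_to_st_pair(asr_pair):
    s0, x0 = asr_pair       # <first-language speech, first-language text>
    y0_prime = mt(x0)       # generated second-language text
    return (s0, y0_prime)   # <source speech, generated target text>
```

Here the first-language speech s0 is reused unchanged, and the transcript x0 is consumed by the MT model to produce the target-side text, again yielding a pair of the <speech, text> shape.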
Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. The embodiment provides a training method of a corpus generating model, which comprises the following steps:
501. Performing corpus generation processing on a first real sample pair with a corpus generation model to obtain a pseudo sample pair corresponding to the first real sample pair, wherein the first real sample pair comprises a to-be-processed real sample and a non-processed real sample, which differ in language or modality, and the pseudo sample pair comprises a first pseudo sample and a second pseudo sample, which differ in both language and modality.
502. Performing discrimination processing on a second real sample pair and the pseudo sample pair with a discrimination model to obtain a discrimination result.
503. Constructing a loss function based on the discrimination result.
504. Training the corpus generation model based on the loss function.
As shown in fig. 6, the corpus generation model may be trained with the Generative Adversarial Network (GAN) technique. That is, the corpus generation model (the Generator) and the discrimination model (the Discriminator) constitute a generative adversarial network, and the parameters of the corpus generation model and of the discrimination model are adjusted with the network's loss function until the final corpus generation model is obtained.
In some embodiments, performing corpus generation processing on the first real sample pair with the corpus generation model to obtain the pseudo sample pair corresponding to the first real sample pair includes: processing the to-be-processed real sample with the corpus generation model to obtain a generated sample, wherein the generated sample differs from the to-be-processed real sample in language or modality and differs from the non-processed real sample in both language and modality, and taking the generated sample as the first pseudo sample; and taking the non-processed real sample as the second pseudo sample.
Taking the speech translation model as an example, the real sample to be processed, such as a real text sample in a first language, the generated sample, such as a generated speech sample in the first language, and the non-processed sample, such as a real text sample in a second language, may construct a pseudo sample pair based on the generated speech sample in the first language and the real text sample in the second language. Alternatively, the real sample to be processed, such as a real text sample in a first language, the generated sample, such as a generated text sample in a second language, and the non-processed sample, such as a real speech sample in the first language, may be constructed as a pseudo sample pair based on the real speech sample in the first language and the generated text sample in the second language. The pair of dummy samples may be obtained by performing a generating process on the sample to be processed and based on the generated sample and the non-processed sample.
In this embodiment, the corpus generation model is used to process the real sample to be processed, so as to obtain a generated sample different from the non-processed sample in both language and modality, and a pseudo sample pair of the cross-language cross-modality translation model may be constructed based on the generated sample and the non-processed sample, so that the corpus generation model may be trained based on the real sample pair and the pseudo sample pair.
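Steps 501 to 504 together form one adversarial training iteration. The sketch below is illustrative only: the real generator and discriminator are neural models, while these stand-ins merely mimic the data flow and a standard GAN-style loss built from two discrimination results.

```python
import math

# Toy stand-ins for the adversarial setup of steps 501-504; names and
# internal behavior are assumptions, not the patent's implementation.

def generator(real_pair):
    """Corpus generation model: transform the to-be-processed real sample,
    keep the non-processed real sample, yielding a pseudo sample pair."""
    to_process, non_processed = real_pair
    return (f"pseudo({to_process})", non_processed)

def discriminator(pair):
    """Return a probability-like score that the pair is a real sample pair."""
    first, _ = pair
    return 0.1 if first.startswith("pseudo(") else 0.9

def training_step(first_real_pair, second_real_pair):
    pseudo_pair = generator(first_real_pair)   # step 501: make a pseudo pair
    d_real = discriminator(second_real_pair)   # step 502: score the real pair
    d_fake = discriminator(pseudo_pair)        # step 502: score the pseudo pair
    # Step 503: a GAN-style loss from the two discrimination results.
    loss = -(math.log(d_real) + math.log(1.0 - d_fake))
    # Step 504 would adjust the generator/discriminator parameters
    # to optimize this loss; here we just return it.
    return loss
```

In a real GAN the generator is updated to make `d_fake` rise (fooling the discriminator) while the discriminator is updated to keep the two scores apart; training alternates between the two objectives.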
For example, the corpus generation model may include: a TTS generation model, and/or an MT generation model.
In some embodiments, the corpus generation model comprises a TTS generation model, and processing the to-be-processed real sample comprises: performing speech synthesis processing on the real text sample in the first language with the TTS generation model to obtain a generated speech sample in the first language corresponding to that real text sample, and taking the generated speech sample in the first language as the first pseudo sample.
A pair of pseudo samples, which may be referred to as a first pair of pseudo samples, may then be composed from the generated speech samples in the first language and the real text samples in the second language.
The first pseudo sample pair is thus obtained from a real text sample in the first language, the corresponding real text sample in the second language, and the TTS generation model.
In some embodiments, the corpus generation model comprises an MT generation model, and processing the to-be-processed real sample comprises: performing machine translation processing on the real text sample in the first language with the MT generation model to obtain a generated text sample in the second language corresponding to that real text sample, and taking the generated text sample in the second language as the first pseudo sample.
A pair of pseudo samples, which may be referred to as a second pair of pseudo samples, may then be composed from the real speech samples in the first language and the generated text samples in the second language.
The second pseudo sample pair is thus obtained from a real speech sample in the first language, the corresponding real text sample in the first language, and the MT generation model.
Taking the example that the corpus generation model includes a TTS generation model and an MT generation model, in some embodiments the pseudo sample pairs include a first pseudo sample pair and a second pseudo sample pair. The first pseudo sample pair comprises a generated speech sample in the first language and a real text sample in the second language; the second pseudo sample pair comprises a real speech sample in the first language and a generated text sample in the second language. The discrimination result includes a first discrimination result, a second discrimination result, and a third discrimination result, and the second real sample pair comprises a speech translation real sample pair. Performing discrimination processing on the second real sample pair and the pseudo sample pairs with the discrimination model to obtain the discrimination result comprises: performing discrimination processing on the speech translation real sample pair with the discrimination model to obtain the first discrimination result; performing discrimination processing on the first pseudo sample pair with the discrimination model to obtain the second discrimination result; and performing discrimination processing on the second pseudo sample pair with the discrimination model to obtain the third discrimination result.
By processing the speech translation real sample pair, the first pseudo sample pair, and the second pseudo sample pair with the discrimination model, three discrimination results are obtained, and the GAN loss function can then be computed from these three results to train the corpus generation model.
As shown in fig. 7, in fig. 7:
Dx,yrepresenting reality in a second language corresponding to a sample of real text in the first languageA sample pair consisting of real text samples y;
Ds,yrepresenting a sample pair consisting of a real voice sample s of a first language and a real text sample y of a second language corresponding to the real voice sample s of the first language;
Ds,xrepresenting a sample pair consisting of a real voice sample s of a first language and a real text sample x of the first language corresponding to the real voice sample s of the first language;
Ds′,ya sample pair consisting of a generated speech sample s 'of the first language and a real text sample y of the second language, wherein the generated speech sample s' of the first language is obtained after speech synthesis processing of a TTS generation model;
Ds,y′and the sample pair is composed of a real voice sample s representing a first language and a generated text sample y' of a second language obtained after the real text sample x of the first language is subjected to machine translation processing of the MT generation model.
As shown in fig. 7, real sample pairs may be collected in advance, and include: dx,y、Ds,y、Ds,x。
Corresponds to Dx,yAfter the real text sample x of the first language passes through the TTS generation model, the generated voice sample s' of the first language can be obtained,<s′,y>a pair of dummy samples may be formed, which may be referred to as a first dummy sample pair Ds′,y。
Corresponding to D_{s,x}: after the real text sample x in the first language passes through the MT generation model, a generated text sample y' in the second language is obtained, and <s, y'> also forms a pseudo sample pair, which may be referred to as the second pseudo sample pair D_{s,y'}.
As for D_{s,y}: D_{s,y} is itself a real sample pair for speech translation, i.e., a sample pair consisting of a real speech sample s in the first language and a real text sample y in the second language.
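As a minimal sketch (in Python, with illustrative stand-in functions; the patent does not specify an implementation), the construction of the two pseudo sample pair sets from the real pair sets D_{x,y} and D_{s,x} can be written as:

```python
def tts_generate(text_x):
    # Hypothetical stand-in for the TTS generation model:
    # first-language text x -> generated speech s'
    return "speech(" + text_x + ")"

def mt_generate(text_x):
    # Hypothetical stand-in for the MT generation model:
    # first-language text x -> second-language generated text y'
    return "translation(" + text_x + ")"

def build_pseudo_pairs(d_xy, d_sx):
    """Build the pseudo sample pair sets D_{s',y} and D_{s,y'}.

    d_xy: real pairs (x, y) of first-language text and second-language text
    d_sx: real pairs (s, x) of first-language speech and first-language text
    """
    # First pseudo pair set D_{s',y}: generated speech s' with real text y.
    d_sprime_y = [(tts_generate(x), y) for x, y in d_xy]
    # Second pseudo pair set D_{s,y'}: real speech s with generated text y'.
    d_s_yprime = [(s, mt_generate(x)) for s, x in d_sx]
    return d_sprime_y, d_s_yprime
```

Both pseudo pair sets, together with the real pair set D_{s,y}, are then fed to the discrimination model.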
The discrimination model performs discrimination processing on the speech translation real sample pair D_{s,y}, the first pseudo sample pair D_{s',y}, and the second pseudo sample pair D_{s,y'}, respectively, to obtain the first discrimination result D(s, y), the second discrimination result D(s', y), and the third discrimination result D(s, y'); a loss function can then be constructed based on these three discrimination results.
The formula for the calculation of the loss function may be as follows:
L = E[log(D(s,y))] + E[log(1 - D(s',y))] + E[log(1 - D(s,y'))]
wherein E[·] denotes taking the mean (expectation) of the bracketed quantity over the corresponding sample pairs.
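A minimal sketch of this loss in code, assuming the discriminator scores are probabilities in (0, 1) and E[·] is taken as a batch mean (an assumption; the formula above leaves batching unspecified):

```python
import math

def gan_loss(d_real, d_fake_tts, d_fake_mt):
    """Compute L = E[log(D(s,y))] + E[log(1 - D(s',y))] + E[log(1 - D(s,y'))].

    Each argument is a list of discriminator scores in (0, 1) for a batch of
    the corresponding sample pairs; E[.] is taken here as the batch mean.
    """
    def mean(values):
        return sum(values) / len(values)

    return (mean([math.log(d) for d in d_real])
            + mean([math.log(1.0 - d) for d in d_fake_tts])
            + mean([math.log(1.0 - d) for d in d_fake_mt]))
```

As in a standard GAN, the discriminator is trained to increase this objective (scoring real pairs high and pseudo pairs low), while the generation models are trained adversarially against it.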
In this embodiment, a TTS generation model and an MT generation model may be obtained through GAN training, and then training data of a speech translation model may be generated based on the TTS generation model and/or the MT generation model to expand the scale of the training data of the speech translation model.
Fig. 8 is a schematic diagram according to an eighth embodiment of the present disclosure, which provides a training apparatus for a translation model. As shown in fig. 8, the training apparatus 800 for translation model includes: an acquisition module 801, a generation module 802 and a training module 803.
The obtaining module 801 is configured to obtain a pair of original data, where the pair of original data includes first original data and second original data, and languages or modalities of the first original data and the second original data are different; the generating module 802 is configured to process the original data pair by using a corpus generating model to obtain a generated data pair, where the generated data pair includes first generated data and second generated data, and languages and modalities of the first generated data and the second generated data are different; the training module 803 is configured to train the translation model using the generated data pairs.
In some embodiments, the generating module 802 is specifically configured to: processing the first original data by adopting a corpus generating model to obtain first generated data, wherein the language or modality of the first generated data is different from that of the first original data, and the language and modality of the first generated data are different from that of the second original data; and taking the second original data as the second generated data.
In some embodiments, the first original data is a text in a first language, the second original data is a text in a second language corresponding to the text in the first language, the corpus generating model includes a TTS generating model, and the generating module 802 is further specifically configured to: and performing speech synthesis processing on the text in the first language by adopting the TTS generation model to obtain generated speech in the first language corresponding to the text in the first language, and taking the generated speech in the first language as the first generated data.
In some embodiments, the first original data is a text in a first language, the second original data is a speech in the first language corresponding to the text in the first language, the corpus generating model includes an MT generating model, and the generating module 802 is further specifically configured to: perform machine translation processing on the text in the first language by using the MT generation model to obtain a generated text in a second language corresponding to the text in the first language, and take the generated text in the second language as the first generated data.
In this embodiment, the generated data pair is obtained by using the original data pair and the corpus generating model, and the generated data pair may be used as training data of the translation model, and the translation model may be trained by using the generated data pair, so that the scale of the training data of the translation model may be expanded.
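Under the assumption that the trained TTS and MT generation models are available as callables, the data expansion described above might look like the following sketch (all names are illustrative, not the patent's implementation):

```python
def expand_speech_translation_data(real_pairs, text_pairs, speech_text_pairs, tts, mt):
    """Expand (first-language speech, second-language text) training pairs.

    real_pairs:        collected real (speech_l1, text_l2) pairs
    text_pairs:        (text_l1, text_l2) pairs, fed to the TTS generator
    speech_text_pairs: (speech_l1, text_l1) pairs, fed to the MT generator
    tts, mt:           trained generation models, here plain callables
    """
    # Generated pairs from TTS: synthesized speech paired with real translation.
    generated = [(tts(x), y) for x, y in text_pairs]
    # Generated pairs from MT: real speech paired with machine-translated text.
    generated += [(s, mt(x)) for s, x in speech_text_pairs]
    # Real and generated pairs together form the expanded training set.
    return real_pairs + generated
```

The translation model is then trained on the union of real and generated pairs, which is how the training-data scale is expanded.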
Fig. 9 is a schematic diagram illustrating a ninth embodiment of the present disclosure, which provides a training apparatus for corpus generating models. As shown in fig. 9, the training apparatus 900 for corpus generating model includes: a generating module 901, a judging module 902, a constructing module 903 and a training module 904.
The generating module 901 is configured to perform corpus generation processing on a first real sample pair by using a corpus generation model to obtain a pseudo sample pair corresponding to the first real sample pair, where the first real sample pair includes a to-be-processed real sample and a non-processed real sample, the to-be-processed real sample and the non-processed real sample have different languages or modalities, the pseudo sample pair includes a first pseudo sample and a second pseudo sample, and the languages and modalities of the first pseudo sample and the second pseudo sample are different; the discrimination module 902 is configured to perform discrimination processing on the second real sample pair and the dummy sample pair by using a discrimination model to obtain a discrimination result; the construction module 903 is used for constructing a loss function based on the judgment result; the training module 904 is configured to train the corpus generation model based on the loss function.
In some embodiments, the generating module 901 is specifically configured to: process the to-be-processed real sample by using the corpus generating model to obtain a generated sample, wherein the generated sample is different from the to-be-processed real sample in language or modality, and the generated sample is different from the non-processed real sample in both language and modality; and take the non-processed real sample as the second pseudo sample.
In some embodiments, the corpus generation model comprises: a TTS generation model, where the to-be-processed real sample is a real text sample in a first language, and the non-processed real sample is a real text sample in a second language corresponding to the real text sample in the first language, and the generation module 901 is further specifically configured to: and performing voice synthesis processing on the real text sample of the first language by adopting the TTS generation model to obtain a generated voice sample of the first language corresponding to the real text sample of the first language, and taking the generated voice sample of the first language as the first pseudo sample.
In some embodiments, the corpus generation model comprises: the MT generating model, where the to-be-processed real sample is a real text sample in a first language, and the non-processed real sample is a real voice sample in the first language corresponding to the real text sample in the first language, and the generating module 901 is further specifically configured to: and performing machine translation processing on the real text sample of the first language by adopting the MT generation model to obtain a generated text sample of a second language corresponding to the real text sample of the first language, and taking the generated text sample of the second language as the first pseudo sample.
In some embodiments, the pseudo sample pair comprises a first pseudo sample pair and a second pseudo sample pair. The first pseudo sample pair comprises a generated speech sample in a first language and a real text sample in a second language; the second pseudo sample pair comprises a real speech sample in the first language and a generated text sample in the second language. The discrimination result comprises a first discrimination result, a second discrimination result, and a third discrimination result, and the second real sample pair comprises a speech translation real sample pair. The discrimination module 902 is specifically configured to: perform discrimination processing on the speech translation real sample pair by using a discrimination model to obtain the first discrimination result; perform discrimination processing on the first pseudo sample pair by using the discrimination model to obtain the second discrimination result; and perform discrimination processing on the second pseudo sample pair by using the discrimination model to obtain the third discrimination result.
In this embodiment, the corpus generation model is used to process the real sample to be processed, so as to obtain a generated sample different from the non-processed sample in both language and modality, and a pseudo sample pair of the cross-language cross-modality translation model may be constructed based on the generated sample and the non-processed sample, so that the corpus generation model may be trained based on the real sample pair and the pseudo sample pair.
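The generate/discriminate/construct-loss flow implemented by modules 901-903 can be sketched as follows; the `discriminator` callable and all names are hypothetical stand-ins, not the patent's actual modules:

```python
import math

def discriminate_and_build_loss(discriminator, real_pair, first_pseudo_pair, second_pseudo_pair):
    """One discrimination pass plus loss construction.

    `discriminator` is a hypothetical callable mapping a (speech, text) pair
    to a realness score in (0, 1); the three results correspond to the first,
    second, and third discrimination results D(s, y), D(s', y), and D(s, y').
    """
    d1 = discriminator(real_pair)           # first discrimination result
    d2 = discriminator(first_pseudo_pair)   # second discrimination result
    d3 = discriminator(second_pseudo_pair)  # third discrimination result
    # Single-sample form of the loss function constructed by module 903.
    loss = math.log(d1) + math.log(1.0 - d2) + math.log(1.0 - d3)
    return (d1, d2, d3), loss
```

Module 904 would then use this loss to update the corpus generation model (and the discriminator) in the usual adversarial fashion.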
It is to be understood that in the disclosed embodiments, the same or similar elements in different embodiments may be referenced.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from the storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the electronic device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service expansibility in traditional physical host and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (21)
1. A method for training a translation model, comprising:
acquiring an original data pair, wherein the original data pair comprises first original data and second original data, and the language or modality of the first original data is different from that of the second original data;
processing the original data pair by adopting a corpus generating model to obtain a generated data pair, wherein the generated data pair comprises first generated data and second generated data, and the languages and modalities of the first generated data and the second generated data are different;
and training a translation model by adopting the generated data pairs.
2. The method of claim 1, wherein the processing the original data pair by adopting the corpus generating model to obtain the generated data pair comprises:
processing the first original data by adopting a corpus generating model to obtain first generated data, wherein the language or modality of the first generated data is different from that of the first original data, and the language and modality of the first generated data are different from that of the second original data;
and taking the second original data as the second generated data.
3. The method of claim 2, wherein the first original data is a text in a first language, the second original data is a text in a second language corresponding to the text in the first language, the corpus generating model comprises a text-to-speech (TTS) generation model, and the processing the first original data by adopting the corpus generating model to obtain the first generated data comprises:
and performing speech synthesis processing on the text in the first language by adopting the TTS generation model to obtain generated speech in the first language corresponding to the text in the first language, and taking the generated speech in the first language as the first generated data.
4. The method according to claim 2, wherein the first original data is a text in a first language, the second original data is a speech in the first language corresponding to the text in the first language, the corpus generating model comprises a machine translation (MT) generation model, and the processing the first original data by adopting the corpus generating model to obtain the first generated data comprises:
and performing machine translation processing on the text in the first language by adopting the MT generation model to obtain a generated text in a second language corresponding to the text in the first language, and taking the generated text in the second language as the first generated data.
5. A training method of a corpus generating model comprises the following steps:
performing corpus generation processing on a first real sample pair by adopting a corpus generation model to obtain a pseudo sample pair corresponding to the first real sample pair, wherein the first real sample pair comprises a to-be-processed real sample and a non-processed real sample, the to-be-processed real sample and the non-processed real sample have different languages or modalities, the pseudo sample pair comprises a first pseudo sample and a second pseudo sample, and the languages and modalities of the first pseudo sample and the second pseudo sample are different;
adopting a discrimination model to discriminate the second real sample pair and the pseudo sample pair to obtain a discrimination result;
constructing a loss function based on the discrimination result;
and training the corpus generating model based on the loss function.
6. The method according to claim 5, wherein the performing, by adopting the corpus generation model, corpus generation processing on the first real sample pair to obtain the pseudo sample pair corresponding to the first real sample pair comprises:
processing the to-be-processed real sample by adopting the corpus generation model to obtain a generated sample, wherein the generated sample is different from the to-be-processed real sample in language or modality, and the generated sample is different from the non-processed real sample in both language and modality;
taking the non-processed real sample as the second pseudo sample.
7. The method of claim 6, wherein the corpus generation model comprises a TTS generation model, the to-be-processed real sample is a real text sample in a first language, the non-processed real sample is a real text sample in a second language corresponding to the real text sample in the first language, and the processing the to-be-processed real sample by adopting the corpus generation model to obtain the generated sample comprises:
and performing voice synthesis processing on the real text sample of the first language by adopting the TTS generation model to obtain a generated voice sample of the first language corresponding to the real text sample of the first language, and taking the generated voice sample of the first language as the first pseudo sample.
8. The method of claim 6, wherein the corpus generation model comprises an MT generation model, the to-be-processed real sample is a real text sample in a first language, the non-processed real sample is a real speech sample in the first language corresponding to the real text sample in the first language, and the processing the to-be-processed real sample by adopting the corpus generation model to obtain the generated sample comprises:
and performing machine translation processing on the real text sample of the first language by adopting the MT generation model to obtain a generated text sample of a second language corresponding to the real text sample of the first language, and taking the generated text sample of the second language as the first pseudo sample.
9. The method of any of claims 5-8, wherein the pseudo sample pair comprises a first pseudo sample pair and a second pseudo sample pair, the first pseudo sample pair comprising: a generated speech sample in a first language and a real text sample in a second language, and the second pseudo sample pair comprising: a real speech sample in the first language and a generated text sample in the second language; wherein the discrimination result comprises a first discrimination result, a second discrimination result and a third discrimination result, the second real sample pair comprises a speech translation real sample pair, and the performing discrimination processing on the second real sample pair and the pseudo sample pair by adopting the discrimination model to obtain the discrimination result comprises:
performing discrimination processing on the speech translation real sample pair by adopting a discrimination model to obtain the first discrimination result;
performing discrimination processing on the first pseudo sample pair by adopting the discrimination model to obtain the second discrimination result;
and performing discrimination processing on the second pseudo sample pair by adopting the discrimination model to obtain the third discrimination result.
10. A training apparatus for translation models, comprising:
an acquisition module, which is used for acquiring an original data pair, wherein the original data pair comprises first original data and second original data, and the language or modality of the first original data is different from that of the second original data;
the generating module is used for processing the original data pair by adopting a corpus generating model to obtain a generated data pair, wherein the generated data pair comprises first generated data and second generated data, and the languages and modalities of the first generated data and the second generated data are different;
and the training module is used for training a translation model by adopting the generated data pair.
11. The apparatus of claim 10, wherein the generation module is specifically configured to:
processing the first original data by adopting a corpus generating model to obtain first generated data, wherein the language or modality of the first generated data is different from that of the first original data, and the language and modality of the first generated data are different from that of the second original data;
and taking the second original data as the second generated data.
12. The apparatus of claim 11, wherein the first original data is a text in a first language, the second original data is a text in a second language corresponding to the text in the first language, the corpus generation model comprises a TTS generation model, and the generation module is further specifically configured to:
and performing speech synthesis processing on the text in the first language by adopting the TTS generation model to obtain generated speech in the first language corresponding to the text in the first language, and taking the generated speech in the first language as the first generated data.
13. The apparatus according to claim 11, wherein the first original data is a text in a first language, the second original data is a speech in the first language corresponding to the text in the first language, the corpus generating model comprises an MT generating model, and the generating module is further specifically configured to:
and performing machine translation processing on the text in the first language by adopting the MT generation model to obtain a generated text in a second language corresponding to the text in the first language, and taking the generated text in the second language as the first generated data.
14. A training apparatus for corpus generating models, comprising:
the generating module is used for performing corpus generation processing on a first real sample pair by adopting a corpus generation model so as to obtain a pseudo sample pair corresponding to the first real sample pair, wherein the first real sample pair comprises a to-be-processed real sample and a non-processed real sample, the to-be-processed real sample and the non-processed real sample have different languages or modalities, the pseudo sample pair comprises a first pseudo sample and a second pseudo sample, and the languages and modalities of the first pseudo sample and the second pseudo sample are different;
the judging module is used for judging the second real sample pair and the pseudo sample pair by adopting a judging model so as to obtain a judging result;
the construction module is used for constructing a loss function based on the discrimination result;
and the training module is used for training the corpus generating model based on the loss function.
15. The apparatus of claim 14, wherein the generation module is specifically configured to:
processing the to-be-processed real sample by adopting the corpus generating model to obtain a generated sample, wherein the generated sample is different from the to-be-processed real sample in language or modality, and the generated sample is different from the non-processed real sample in both language and modality;
taking the non-processed real sample as the second pseudo sample.
16. The apparatus of claim 15, wherein the corpus generation model comprises: a TTS generation model, where the to-be-processed real sample is a real text sample in a first language, and the non-processed real sample is a real text sample in a second language corresponding to the real text sample in the first language, and the generation module is further specifically configured to:
and performing voice synthesis processing on the real text sample of the first language by adopting the TTS generation model to obtain a generated voice sample of the first language corresponding to the real text sample of the first language, and taking the generated voice sample of the first language as the first pseudo sample.
17. The apparatus of claim 15, wherein the corpus generation model comprises: the MT generating model, where the to-be-processed real sample is a real text sample in a first language, and the non-processed real sample is a real voice sample in the first language corresponding to the real text sample in the first language, and the generating module is further specifically configured to:
and performing machine translation processing on the real text sample of the first language by adopting the MT generation model to obtain a generated text sample of a second language corresponding to the real text sample of the first language, and taking the generated text sample of the second language as the first pseudo sample.
18. The apparatus of any of claims 14-17, wherein the pseudo sample pair comprises a first pseudo sample pair and a second pseudo sample pair, the first pseudo sample pair comprising: a generated speech sample in a first language and a real text sample in a second language, and the second pseudo sample pair comprising: a real speech sample in the first language and a generated text sample in the second language; wherein the discrimination result comprises a first discrimination result, a second discrimination result and a third discrimination result, the second real sample pair comprises a speech translation real sample pair, and the discrimination module is specifically configured to:
perform discrimination processing on the speech translation real sample pair by adopting a discrimination model to obtain the first discrimination result;
perform discrimination processing on the first pseudo sample pair by adopting the discrimination model to obtain the second discrimination result;
and perform discrimination processing on the second pseudo sample pair by adopting the discrimination model to obtain the third discrimination result.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110737842.5A CN113408305B (en) | 2021-06-30 | 2021-06-30 | Model training method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113408305A true CN113408305A (en) | 2021-09-17 |
CN113408305B CN113408305B (en) | 2023-03-24 |
Family
ID=77680726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110737842.5A Active CN113408305B (en) | 2021-06-30 | 2021-06-30 | Model training method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113408305B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116450771A (en) * | 2022-12-16 | 2023-07-18 | 镁佳(北京)科技有限公司 | Multilingual speech translation model construction method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635305A (en) * | 2018-12-17 | 2019-04-16 | 北京百度网讯科技有限公司 | Voice translation method and device, equipment and storage medium |
CN110955765A (en) * | 2019-11-22 | 2020-04-03 | 中国南方电网有限责任公司 | Corpus construction method and apparatus of intelligent assistant, computer device and storage medium |
CN111859994A (en) * | 2020-06-08 | 2020-10-30 | 北京百度网讯科技有限公司 | Method, device and storage medium for obtaining machine translation model and translating text |
CN112183120A (en) * | 2020-09-18 | 2021-01-05 | 北京字节跳动网络技术有限公司 | Speech translation method, device, equipment and storage medium |
US20210027784A1 (en) * | 2019-07-24 | 2021-01-28 | Alibaba Group Holding Limited | Translation and speech recognition method, apparatus, and device |
- 2021-06-30: CN application CN202110737842.5A granted as patent CN113408305B (status: Active)
Non-Patent Citations (1)
Title |
---|
Yang Yun et al.: "Research on the Application of the EM Algorithm in Neural Machine Translation Models", Computer Applications and Software *
Also Published As
Publication number | Publication date |
---|---|
CN113408305B (en) | 2023-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116051668B (en) | Training method of diffusion model of draft map and image generation method based on text | |
CN112466288B (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN113408299B (en) | Training method, device, equipment and storage medium of semantic representation model | |
US20240144570A1 (en) | Method for generating drivable 3d character, electronic device and storage medium | |
CN114023342B (en) | Voice conversion method, device, storage medium and electronic equipment | |
US20230178067A1 (en) | Method of training speech synthesis model and method of synthesizing speech | |
CN112560846B (en) | Error correction corpus generation method and device and electronic equipment | |
CN114267375B (en) | Phoneme detection method and device, training method and device, equipment and medium | |
CN113407610B (en) | Information extraction method, information extraction device, electronic equipment and readable storage medium | |
CN112307188B (en) | Dialog generation method, system, electronic device and readable storage medium | |
CN114937478B (en) | Method for training a model, method and apparatus for generating molecules | |
CN112580666A (en) | Image feature extraction method, training method, device, electronic equipment and medium | |
CN112786108A (en) | Molecular understanding model training method, device, equipment and medium | |
CN114495977B (en) | Speech translation and model training method, device, electronic equipment and storage medium | |
CN113408305B (en) | Model training method, device, equipment and storage medium | |
CN114282551B (en) | Translation method, translation device, electronic equipment and storage medium | |
CN112989797B (en) | Model training and text expansion methods, devices, equipment and storage medium | |
JP2023162104A (en) | Machine translation method, apparatus, and storage medium | |
JP2023078411A (en) | Information processing method, model training method, apparatus, appliance, medium and program product | |
CN113408304B (en) | Text translation method and device, electronic equipment and storage medium | |
CN114023310A (en) | Method, device and computer program product applied to voice data processing | |
CN115906987A (en) | Deep learning model training method, virtual image driving method and device | |
CN116257611A (en) | Question-answering model training method, question-answering processing device and storage medium | |
CN115357710A (en) | Training method and device for table description text generation model and electronic equipment | |
CN112541557B (en) | Training method and device for generating countermeasure network and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||