CN117113091B - Speech translation model training method and device, electronic equipment and storage medium - Google Patents

Speech translation model training method and device, electronic equipment and storage medium

Info

Publication number
CN117113091B
Authority
CN
China
Prior art keywords
data
text
voice
sequence
speech
Legal status
Active
Application number
CN202311380008.0A
Other languages
Chinese (zh)
Other versions
CN117113091A (en)
Inventor
刘宇宸
向露
张亚萍
周玉
宗成庆
Current and original assignee
Institute of Automation of Chinese Academy of Science
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202311380008.0A
Publication of CN117113091A (application publication)
Application granted
Publication of CN117113091B (granted publication)


Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/28: Pattern recognition; determining representative reference patterns, e.g. by averaging or distorting; generating dictionaries
    • G06N3/045: Neural networks; combinations of networks; auto-encoder networks; encoder-decoder networks
    • G06N3/08: Neural networks; learning methods
    • Y02T10/40: Climate change mitigation technologies related to transportation; engine management systems


Abstract

The invention provides a speech translation model training method and device, an electronic device and a storage medium, which are applied to the technical field of natural language processing. The method comprises the following steps: acquiring first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; training an encoder of a speech translation model based on the plurality of mask sequences; freezing parameters of the encoder and training a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and training the speech translation model based on the first speech translation data.

Description

Speech translation model training method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for training a speech translation model, an electronic device, and a storage medium.
Background
Speech translation technology uses computer technology to convert spoken content in one language into text content in another language. With the increasing frequency of international communication, the need for automatic conversion between different languages through speech translation technology has grown increasingly strong.
In the prior art, an end-to-end speech translation method can establish a mapping relation between source language speech and target language text, thereby realizing an end-to-end translation process.
However, end-to-end speech translation methods generally adopt neural network architectures that rely on large amounts of labeled data for training. Because end-to-end labeled data are relatively scarce, existing speech translation methods cannot bridge the modal gap between speech and text.
Disclosure of Invention
The invention provides a speech translation model training method and device, an electronic device and a storage medium, to solve the problem of the modal gap between speech and text in existing speech translation methods.
The invention provides a speech translation model training method, which comprises the following steps: acquiring first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; training an encoder of a speech translation model based on the plurality of mask sequences; freezing parameters of the encoder and training a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and training the speech translation model based on the first speech translation data.
According to the speech translation model training method provided by the invention, the plurality of mask sequences comprise a first mask sequence and a second mask sequence; the performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences comprises: generating discretized voice characterization units according to the first voice data to obtain a first unit sequence; performing a masking operation on the first unit sequence to obtain the first mask sequence; and performing a masking operation on the first text data to obtain the second mask sequence.
According to the speech translation model training method provided by the invention, the plurality of mask sequences comprise a third mask sequence; the first voice recognition data comprise second voice data and corresponding second text data; the performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences comprises: generating discretized voice characterization units according to the second voice data to obtain a second unit sequence; in the case that a cross-modal dictionary comprises a first voice characterization unit, masking the first word corresponding to the first voice characterization unit in the second text data to obtain a text mask sequence, wherein the first voice characterization unit is part of the voice characterization units in the second unit sequence; in the case that the cross-modal dictionary comprises a second word, performing a masking operation on the second voice characterization unit corresponding to the second word in the second unit sequence to obtain a voice characterization mask sequence, wherein the second word is part of the text in the second text data; and splicing the text mask sequence and the second unit sequence, or splicing the voice characterization mask sequence and the second text data, to obtain the third mask sequence; the cross-modal dictionary comprises a plurality of phrases, wherein each phrase is a combination of a voice characterization unit and a word text.
According to the present invention, there is provided a speech translation model training method, before performing the masking operation, the method further comprising: acquiring third voice data and corresponding third text data; generating a discretized voice characterization unit according to the third voice data to obtain a third unit sequence; performing word alignment operation on the third unit sequence and the third text data to obtain a first phrase; the first phrase is a phrase in the cross-modal dictionary.
According to the invention, the first text translation data comprises a first source language text and a first target language text; the decoder for training the speech translation model based on the first text translation data comprises: carrying out random initialization processing on parameters of the decoder; inputting the first source language text into the encoder and the decoder to obtain a first predicted text, and determining a first loss based on the first target language text and the first predicted text; determining a third voice characterization unit corresponding to a third word in a cross-modal dictionary, and replacing the third word in the first source language text with the third voice characterization unit to obtain a first source language text sequence; inputting the first source language text sequence into the encoder and the decoder to obtain a second predicted text, and determining a second loss based on the first target language text and the second predicted text; and determining a second loss function based on the first loss and the second loss, and updating model parameters of the encoder and the decoder according to the second loss function until the second loss function is in a convergence state.
According to the voice translation model training method provided by the invention, the first voice translation data comprises first source language voice and corresponding second target language text; the training the speech translation model based on the first speech translation data includes: generating a discretized voice characterization unit according to the first source language voice to obtain a fourth unit sequence; and inputting the fourth unit sequence and the second target language text into the speech translation model to obtain a third predicted text, determining a third loss function based on the second target language text and the third predicted text, and updating model parameters of the encoder and the decoder according to the third loss function until the third loss function is in a convergence state.
The invention also provides a speech translation model training device, which comprises an acquisition module and a processing module. The acquisition module is used for acquiring first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; the processing module may be configured to perform masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; train an encoder of a speech translation model based on the plurality of mask sequences; freeze parameters of the encoder and train a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and train the speech translation model based on the first speech translation data.
According to the invention, a speech translation model training device is provided, wherein the mask sequences comprise a first mask sequence and a second mask sequence; the processing module may be specifically configured to generate a discretized speech characterization unit according to the first speech data, to obtain a first unit sequence; masking the first unit sequence to obtain a first masking sequence; and carrying out masking operation on the first text data to obtain the second masking sequence.
According to the invention, a speech translation model training device is provided, wherein the plurality of mask sequences comprise a third mask sequence; the first voice recognition data comprises second voice data and corresponding second text data; the processing module may be specifically configured to generate a discretized speech characterization unit according to the second speech data, to obtain a second unit sequence; under the condition that a cross-modal dictionary comprises a first voice characterization unit, masking a first word corresponding to the first voice characterization unit in the second text data to obtain a text masking sequence, wherein the first voice characterization unit is a part of voice characterization units in the second unit sequence; performing masking operation on a second voice characterization unit corresponding to a second word in the second unit sequence under the condition that the cross-modal dictionary comprises the second word, so as to obtain a voice characterization masking sequence, wherein the second word is part of text in the second text data; splicing the text mask sequence and the second unit sequence, or splicing the voice characterization mask sequence and the second text data to obtain the third mask sequence; the cross-modal dictionary comprises a plurality of phrases, wherein the phrases are combinations of a voice characterization unit and word texts.
According to the invention, the acquisition module can be used for acquiring third voice data and corresponding third text data; the processing module may be further configured to generate a discretized speech characterization unit according to the third speech data, to obtain a third unit sequence; performing word alignment operation on the third unit sequence and the third text data to obtain a first phrase; the first phrase is a phrase in the cross-modal dictionary.
According to the invention, the first text translation data comprises a first source language text and a first target language text; the processing module can be specifically used for carrying out random initialization processing on the parameters of the decoder; inputting the first source language text into the encoder and the decoder to obtain a first predicted text, and determining a first loss based on the first target language text and the first predicted text; determining a third voice characterization unit corresponding to a third word in a cross-modal dictionary, and replacing the third word in the first source language text with the third voice characterization unit to obtain a first source language text sequence; inputting the first source language text sequence into the encoder and the decoder to obtain a second predicted text, and determining a second loss based on the first target language text and the second predicted text; and determining a second loss function based on the first loss and the second loss, and updating model parameters of the encoder and the decoder according to the second loss function until the second loss function is in a convergence state.
According to the invention, the first voice translation data comprises first source language voice and corresponding second target language text; the processing module is specifically configured to generate a discretized speech characterization unit according to the first source language speech, so as to obtain a fourth unit sequence; and inputting the fourth unit sequence and the second target language text into the speech translation model to obtain a third predicted text, determining a third loss function based on the second target language text and the third predicted text, and updating model parameters of the encoder and the decoder according to the third loss function until the third loss function is in a convergence state.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the speech translation model training method as described in any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech translation model training method as described in any of the above.
The speech translation model training method and device, the electronic device and the storage medium provided by the invention can acquire first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; perform masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; train an encoder of a speech translation model based on the plurality of mask sequences; freeze parameters of the encoder and train a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and train the speech translation model based on the first speech translation data. According to this scheme, the encoder of the speech translation model can be pre-trained based on the plurality of mask sequences, the encoder and the decoder are then pre-trained based on the first text translation data, and finally the speech translation model is trained based on the first speech translation data. Because the mask sequences are generated from the first voice data, the first text data and the first voice recognition data respectively, multi-source data such as large-scale speech, text, speech recognition and machine translation data can be utilized, the cross-modal and cross-lingual characterization capability of the model is gradually enhanced by means of progressive transfer training, and the translation effect of the speech translation model is significantly improved in low-resource speech translation scenarios.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a speech translation model training method provided by the invention;
FIG. 2 is a schematic diagram of a training process of an encoder provided by the present invention;
FIG. 3 is a schematic diagram of a training process for an encoder and decoder provided by the present invention;
FIG. 4 is a schematic diagram of a speech translation model training device according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present invention is not limited to performing the functions in the order shown or discussed; depending on the functions involved, the functions may also be performed in a substantially simultaneous manner or in the opposite order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
In order to clearly describe the technical solution of the embodiment of the present invention, in the embodiment of the present invention, the words "first", "second", etc. are used to distinguish identical items or similar items having substantially the same function and effect, and those skilled in the art will understand that the words "first", "second", etc. are not limited in number and execution order.
Some exemplary embodiments of the invention have been described for illustrative purposes; it should be understood that the invention may be practiced otherwise than as specifically shown in the accompanying drawings.
The technical solutions of the present invention are described in detail below with reference to specific embodiments and the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention provides a speech translation model training method, which may be applied to a speech translation model training device. The speech translation model training method may include S101-S105:
S101, the speech translation model training device acquires first voice data, first text data, first voice recognition data, first text translation data and first voice translation data.
Optionally, the first voice data is source language voice data, and the first text data is source language text data.
It can be understood that the language corresponding to the first voice data is the same as the language corresponding to the first text data, and the source language is the language before translation. For example, if the speech translation model is used to implement an English-to-Chinese translation process, the first voice data may be English speech, and the first text data may be English text; if the speech translation model is used to implement a Chinese-to-English translation process, the first voice data may be Chinese speech, and the first text data may be Chinese text.
Optionally, the first voice recognition data includes aligned source language voice data and source language text data. That is, the source language text data may be generated by transcribing the source language speech data. For example, if the speech translation model is used to implement an English-to-Chinese translation process, the source language speech data may be English speech, and the source language text data may be the English transcription text corresponding to that speech.
Optionally, the first text translation data includes aligned source language text data and aligned target language text data. The target language text data is data obtained by translating the source language text data. For example, if the speech translation model is used to implement a translation process of english-to-chinese, the source language text data is english text, and the target language text data is chinese text corresponding to the english text.
Optionally, the first speech translation data includes aligned source language speech data and target language text data. The target language text data is data obtained by translating the source language voice data. For example, if the speech translation model is used to implement a translation process of english to chinese, the source language speech data is english speech and the target language text data is chinese text corresponding to the english speech.
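Illustratively, the five data sources described above may be organized as follows. This is a minimal sketch for illustration only; all names and types are assumptions rather than part of the patent:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TrainingCorpora:
    """Illustrative grouping of the five data sources acquired in S101."""
    speech_only: List[str] = field(default_factory=list)             # first voice data: source-language audio
    text_only: List[str] = field(default_factory=list)               # first text data: source-language sentences
    asr_pairs: List[Tuple[str, str]] = field(default_factory=list)   # first voice recognition data: (audio, transcript)
    mt_pairs: List[Tuple[str, str]] = field(default_factory=list)    # first text translation data: (source text, target text)
    st_pairs: List[Tuple[str, str]] = field(default_factory=list)    # first voice translation data: (audio, target text)
```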
S102, the speech translation model training device performs masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences.
Optionally, the plurality of mask sequences includes a first mask sequence generated from the first speech data S, a second mask sequence generated from the first text data X, and a third mask sequence generated from the first speech recognition data.
Optionally, the speech translation model training device may generate discretized speech characterization units from the first speech data to obtain a first unit sequence; perform a masking operation on the first unit sequence to obtain the first mask sequence; and perform a masking operation on the first text data to obtain the second mask sequence.
Specifically, the speech translation model training device may encode the first speech data S into continuous intermediate representations through a one-dimensional convolution layer and a 12-layer Transformer encoder based on the HuBERT model, then convert these intermediate representations into discretized labels by k-means clustering, and delete adjacent repeated discrete labels, finally obtaining the discretized speech characterization units corresponding to the first speech data S, i.e., the first unit sequence U. Then, the speech translation model training device can perform a masking operation on randomly selected units in the first unit sequence U to obtain the first mask sequence U^mask. The speech translation model training device may also perform a masking operation on randomly selected tokens in the first text data X to obtain the second mask sequence X^mask. It can be understood that the tag [m] indicates that the label at that position has been masked.
Illustratively, the speech translation model training device may encode the first speech data S into continuous intermediate representations and then convert these intermediate representations into discretized labels using k-means clustering. For example, if the discretized labels are 181, 232, 232, 897, 23, 45, 45, 45, the device may delete the adjacent repeated labels, and the resulting discretized speech characterization units are 181, 232, 897, 23, 45.
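Illustratively, this discretization step may be sketched as follows, assuming an open-source HuBERT checkpoint and a k-means model fitted beforehand on HuBERT features (the patent fixes neither the checkpoint name nor the number of clusters):

```python
import torch
from itertools import groupby
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")  # assumed checkpoint

def discretize(waveform: torch.Tensor, kmeans: KMeans) -> list:
    """Encode speech into continuous representations with HuBERT
    (1-D conv feature extractor + 12 Transformer layers), quantize the
    frame representations with k-means, and delete adjacent repeats."""
    with torch.no_grad():
        hidden = hubert(waveform).last_hidden_state.squeeze(0)  # (frames, 768)
    labels = kmeans.predict(hidden.numpy())                     # discretized labels
    return [int(lab) for lab, _ in groupby(labels)]             # drop adjacent repeated labels
```

Applied to the example above, discretize collapses the label run 181, 232, 232, 897, 23, 45, 45, 45 into 181, 232, 897, 23, 45.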
Optionally, the first voice recognition data may include second voice data S′ and corresponding second text data X′, where the second voice data S′ is source language speech data and the second text data X′ is source language text data.
Optionally, the speech translation model training device may generate a discretized speech characterization unit according to the second speech data to obtain a second unit sequence; under the condition that a cross-modal dictionary comprises a first voice characterization unit, masking a first word corresponding to the first voice characterization unit in the second text data to obtain a text masking sequence, wherein the first voice characterization unit is a part of voice characterization units in the second unit sequence; performing masking operation on a second voice characterization unit corresponding to a second word in the second unit sequence under the condition that the cross-modal dictionary comprises the second word, so as to obtain a voice characterization masking sequence, wherein the second word is part of text in the second text data; splicing the text mask sequence and the second unit sequence, or splicing the voice characterization mask sequence and the second text data to obtain the third mask sequence; the cross-modal dictionary comprises a plurality of phrases, wherein the phrases are combinations of a voice characterization unit and word texts.
Specifically, the speech translation model training device may encode the second voice data S′ into continuous intermediate representations through a one-dimensional convolution layer and a 12-layer Transformer encoder based on the HuBERT model, convert the intermediate representations into discretized labels by k-means clustering, and delete adjacent repeated discrete labels, finally obtaining the discretized speech characterization units corresponding to the second voice data S′, i.e., the second unit sequence U′. After that, the speech translation model training device can perform a masking operation on the second unit sequence U′, or perform a masking operation on the second text data X′, to obtain the third mask sequence.
Optionally, when a first voice characterization unit in the second unit sequence U′ is present in the cross-modal dictionary D, the speech translation model training device may determine the first word corresponding to the first voice characterization unit in the second text data X′ and mask that word to obtain the text mask sequence X′^mask; finally, the text mask sequence X′^mask and the second unit sequence U′ are spliced (units followed by text, with a [sep] tag, as shown in fig. 2) to obtain the third mask sequence (U′, [sep], X′^mask).
Optionally, when a second word in the second text data X′ is present in the cross-modal dictionary D, and after determining the second voice characterization unit corresponding to the second word in the second unit sequence U′, the speech translation model training device may perform a masking operation on the second voice characterization unit to obtain the voice characterization mask sequence U′^mask; finally, the voice characterization mask sequence U′^mask and the second text data X′ are spliced to obtain the third mask sequence (U′^mask, [sep], X′).
Illustratively, as shown in fig. 2, the speech translation model training device may obtain: the first mask sequence "123 258 [m] 198 258 1369 [m] 13 1" based on the sequence "123 258 567 198 258 1369 236 13 1" generated from the first speech data; the second mask sequence "A [m] sat on the mat" based on the first text data "A cat sat on the mat"; the third mask sequence 1 "236 278 385 92 37 219 [sep] Today a [m] day" based on the first voice recognition data 1 "236 278 385 92 37 219 [sep] Today is a nice day"; and the third mask sequence 2 "265 349 [m] 36 65 91 [sep] It is a good idea" based on the first voice recognition data 2 "265 349 782 36 65 91 [sep] It is a good idea".
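Illustratively, the random masking applied to the first unit sequence and the first text data may be sketched as follows; the mask token [m] follows fig. 2, while the masking ratio is an assumption:

```python
import random

MASK = "[m]"

def mask_sequence(tokens: list, ratio: float = 0.15) -> list:
    """Randomly replace a fraction of positions with the mask token,
    as done for the first unit sequence and the first text data."""
    out = list(tokens)
    n = max(1, int(len(out) * ratio))
    for i in random.sample(range(len(out)), n):
        out[i] = MASK
    return out

units = "123 258 567 198 258 1369 236 13 1".split()
text = "A cat sat on the mat".split()
first_mask_seq = mask_sequence(units)   # e.g. 123 258 [m] 198 258 1369 [m] 13 1
second_mask_seq = mask_sequence(text)   # e.g. A [m] sat on the mat
```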
Optionally, before performing the masking operation, the speech translation model training device may acquire third speech data and corresponding third text data; generating a discretized voice characterization unit according to the third voice data to obtain a third unit sequence; performing word alignment operation on the third unit sequence and the third text data to obtain a first phrase; the first phrase is a phrase in the cross-modal dictionary.
Specifically, before generating the third mask sequence, the speech translation model training device may acquire third voice data and corresponding third text data from the first voice recognition data; then generate discretized speech characterization units from the third voice data to obtain a third unit sequence, and perform a word alignment operation on the third unit sequence and the third text data based on a word alignment tool to obtain the first phrase. By repeating the above operations, the cross-modal dictionary D between source language text words and discretized speech characterization units is obtained.
It will be appreciated that, in performing the word alignment operation, if multiple adjacent continuous discrete voice tags are aligned with the same text word, the sequence of those adjacent continuous discrete voice tags is considered a phrase.
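Illustratively, assuming alignments produced by a standard word alignment tool in the common Pharaoh "i-j" format (the patent does not name a specific tool), the construction of the cross-modal dictionary D with the phrase rule above may be sketched as follows:

```python
from collections import defaultdict

def build_cross_modal_dict(unit_seqs, texts, alignments):
    """Build the cross-modal dictionary D: each entry pairs a word with the
    phrase formed by the adjacent discrete units aligned to it."""
    dictionary = defaultdict(set)
    for units, words, align in zip(unit_seqs, texts, alignments):
        word_to_units = defaultdict(list)
        for pair in align.split():                   # one link per "unit_idx-word_idx"
            u, w = (int(x) for x in pair.split("-"))
            word_to_units[w].append(u)
        for w, idxs in word_to_units.items():
            idxs = sorted(idxs)
            if idxs == list(range(idxs[0], idxs[-1] + 1)):   # adjacent continuous units only
                dictionary[words[w]].add(" ".join(str(units[i]) for i in idxs))
    return dictionary
```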
S103, the speech translation model training device trains the encoder of the speech translation model based on the plurality of mask sequences.
Optionally, the speech translation model may include an encoder. After the first mask sequence, the second mask sequence and the third mask sequence are obtained, the speech translation model training device may input them into the encoder respectively, predict the original label corresponding to each mask position, calculate a loss function of the prediction results against the original labels at the mask positions, and determine the parameters of the encoder after minimizing the loss function and back-propagating.
Illustratively, with continued reference to fig. 2, after passing through the encoder, the first mask sequence may yield the original labels "567" and "236" corresponding to its mask positions, the second mask sequence may yield the original label "cat", the third mask sequence 1 may yield the original label "nice", and the third mask sequence 2 may yield the original label "782".
Optionally, the speech translation model training device may determine the first loss function of the encoder as the cross-entropy between the predictions at the masked positions and the original labels, for example

L_enc = −Σ_{t∈M} log P(y_t | Z^mask),

where Z^mask denotes the input mask sequence (the first, second or third mask sequence), M is the set of masked positions in Z^mask, and y_t is the original label at position t.
In the embodiment of the invention, the original labels corresponding to the mask positions of each mask sequence can be predicted by the encoder, so that the problem of the modal gap between speech and text can be solved and large-scale monolingual data and speech data can be fully utilized.
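Illustratively, one encoder pre-training step of S103 may be sketched as follows, assuming the encoder exposes token logits over a joint vocabulary of speech characterization units and text tokens (an assumption about the architecture, not stated in the patent):

```python
import torch.nn.functional as F

def encoder_pretrain_step(encoder, original_ids, masked_ids, mask_positions, optimizer):
    """One S103 step: predict the original labels at the masked positions
    of a mask sequence and minimize the cross-entropy against them."""
    logits = encoder(masked_ids)                  # (batch, length, vocab)
    loss = F.cross_entropy(
        logits[mask_positions],                   # predictions at masked positions only
        original_ids[mask_positions],             # original labels at those positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```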
S104, the speech translation model training device freezes the parameters of the encoder when the first loss function of the encoder is in a converged state, and trains the decoder of the speech translation model based on the first text translation data.
Optionally, the first text translation data comprises a first source language text X_mt and a first target language text Y_mt, where the first source language text X_mt is source language text data and the first target language text Y_mt is target language text data.
As shown in fig. 3, the speech translation model further includes a decoder; the speech translation model training device can perform random initialization processing on parameters of the decoder; inputting the first source language text into the encoder and the decoder to obtain a first predicted text, and determining a first loss based on the first target language text and the first predicted text; determining a third voice characterization unit corresponding to a third word in a cross-modal dictionary, and replacing the third word in the first source language text with the third voice characterization unit to obtain a first source language text sequence; inputting the first source language text sequence into the encoder and the decoder to obtain a second predicted text, and determining a second loss based on the first target language text and the second predicted text; and determining a second loss function based on the first loss and the second loss, and updating model parameters of the encoder and the decoder according to the second loss function until the second loss function is in a convergence state.
Specifically, as shown in fig. 3, the speech translation model training device may perform random initialization processing on the parameters of the decoder; then input the first source language text X_mt "Today is a nice day" into the encoder and the decoder to obtain a first predicted text, and determine a first loss L1 based on the first target language text Y_mt and the first predicted text. Thereafter, a third word "nice" is randomly determined from the first source language text X_mt, the third speech characterization unit "9237" corresponding to the third word is determined in the cross-modal dictionary D, and the third word in the first source language text X_mt is replaced with the third speech characterization unit to obtain the first source language text sequence X_mt′ "Today is a 9237 day". The first source language text sequence X_mt′ is input into the encoder and the decoder to obtain a second predicted text, and a second loss L2 is determined based on the first target language text Y_mt and the second predicted text. A second loss function L_mt is determined based on the first loss L1 and the second loss L2, and the model parameters of the speech translation model are updated according to the second loss function L_mt until the second loss function L_mt is in a converged state.
It should be noted that a third word is randomly determined from the first source language text and replaced with the speech characterization unit from the cross-modal dictionary, so that the replaced first source language text and the first target language text form code-mixed parallel data, thereby realizing model training based on the cross-modal dictionary and further improving the model training effect.
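Illustratively, the word-to-unit replacement that produces this mixed parallel data may be sketched as follows, reusing the dictionary built earlier; the single-word sampling strategy is an assumption:

```python
import random

def code_switch(source_words: list, cross_modal_dict: dict) -> list:
    """Randomly pick one word present in the cross-modal dictionary and
    replace it with its discretized speech characterization units."""
    candidates = [i for i, w in enumerate(source_words) if w in cross_modal_dict]
    if not candidates:
        return source_words
    i = random.choice(candidates)
    units = random.choice(sorted(cross_modal_dict[source_words[i]]))
    return source_words[:i] + units.split() + source_words[i + 1:]

# "Today is a nice day" -> e.g. "Today is a 9237 day" when "nice" maps to unit 9237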
Optionally, the second loss function may be the sum of the two losses, for example L_mt = L1 + L2.
S105, the speech translation model training device trains the speech translation model based on the first speech translation data.
Optionally, the first speech translation data may include a first source language speech S_st and a second target language text Y_st, where the first source language speech S_st is source language speech data and the second target language text Y_st is target language text data.
Optionally, the speech translation model training device may generate a discretized speech characterization unit according to the first source language speech to obtain a fourth unit sequence; and inputting the fourth unit sequence and the second target language text into the speech translation model to obtain a third predicted text, determining a third loss function based on the second target language text and the third predicted text, and updating model parameters of the encoder and the decoder according to the third loss function until the third loss function is in a convergence state.
Specifically, the speech translation model training device may encode the first source language speech S_st in the first speech translation data based on the open-source HuBERT model to extract discretized speech characterization units, and perform de-duplication processing on consecutively repeated speech characterization units to form the fourth unit sequence U_st. The fourth unit sequence U_st and the second target language text Y_st are then input into the speech translation model to obtain a third predicted text; the loss between the third predicted text and the second target language text Y_st is calculated, and the trained speech translation model is obtained after minimizing the loss function and back-propagation.
Optionally, the third loss function may take the standard sequence-to-sequence cross-entropy form, for example L_st = −Σ_t log P(y_t | y_<t, U_st), where y_t denotes the t-th token of the second target language text Y_st.
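Illustratively, one fine-tuning step of S105 may be sketched as follows, under an assumed seq2seq interface (teacher forcing with a shifted target); both encoder and decoder parameters are updated:

```python
import torch.nn.functional as F

def st_finetune_step(model, unit_ids, target_ids, optimizer):
    """Stage three: source speech units in, target-language text out."""
    logits = model(unit_ids, target_ids[:, :-1])    # teacher forcing (assumed interface)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```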
Optionally, after the speech translation model is obtained, the speech translation model training device may obtain new source language speech data; and generating a discretized voice characterization unit according to the new source language voice data, and inputting the voice characterization unit into the voice translation model to obtain a corresponding target language translation text. That is, the speech translation model training means may translate the source language speech data into the target language translation text based on the speech translation model.
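Illustratively, this deployment step may be sketched as follows, reusing the discretize helper assumed earlier; generate and decode are assumed interfaces of the trained model and its tokenizer, not APIs named in the patent:

```python
def translate(model, kmeans, waveform, tokenizer):
    """Translate new source-language speech into target-language text."""
    units = discretize(waveform, kmeans)   # discretization sketch from S102
    output_ids = model.generate(units)     # autoregressive decoding (assumed API)
    return tokenizer.decode(output_ids)
```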
In the embodiment of the invention, the encoder of the speech translation model can be pre-trained based on the plurality of mask sequences, the encoder and the decoder can then be pre-trained based on the first text translation data, and finally the speech translation model is trained based on the first speech translation data. Because the mask sequences are generated from the first voice data, the first text data and the first voice recognition data respectively, multi-source data such as large-scale speech, text, speech recognition and machine translation data can be utilized, the cross-modal and cross-lingual characterization capability of the model is gradually enhanced by means of progressive transfer training, and the translation effect of the speech translation model is significantly improved in low-resource speech translation scenarios.
The foregoing description of the solution provided by the embodiments of the present invention has been presented mainly in terms of the method. In order to achieve the above functions, the solution includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
For the speech translation model training method provided by the embodiment of the present invention, the execution subject may be a speech translation model training device, or a control module in the speech translation model training device for executing the speech translation model training method. In the embodiment of the present invention, the speech translation model training device provided by the embodiment of the present invention is described by taking the speech translation model training device executing the speech translation model training method as an example.
It should be noted that, in the embodiment of the present invention, the function modules of the speech translation model training device may be divided according to the above method example, for example, each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. Optionally, the division of the modules in the embodiment of the present invention is schematic, which is merely a logic function division, and other division manners may be implemented in practice.
As shown in fig. 4, an embodiment of the present invention provides a speech translation model training device 400. The speech translation model training device 400 includes an acquisition module 401 and a processing module 402. The acquisition module 401 may be configured to acquire first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; the processing module 402 may be configured to perform masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; train an encoder of a speech translation model based on the plurality of mask sequences; freeze parameters of the encoder and train a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and train the speech translation model based on the first speech translation data.
Optionally, the plurality of mask sequences includes a first mask sequence and a second mask sequence; the processing module 402 may be specifically configured to generate a discretized speech characterization unit according to the first speech data, to obtain a first unit sequence; masking the first unit sequence to obtain a first masking sequence; and carrying out masking operation on the first text data to obtain the second masking sequence.
Optionally, the plurality of mask sequences includes a third mask sequence; the first voice recognition data comprises second voice data and corresponding second text data; the processing module 402 may be specifically configured to generate a discretized speech characterization unit according to the second speech data, to obtain a second unit sequence; under the condition that a cross-modal dictionary comprises a first voice characterization unit, masking a first word corresponding to the first voice characterization unit in the second text data to obtain a text masking sequence, wherein the first voice characterization unit is a part of voice characterization units in the second unit sequence; performing masking operation on a second voice characterization unit corresponding to a second word in the second unit sequence under the condition that the cross-modal dictionary comprises the second word, so as to obtain a voice characterization masking sequence, wherein the second word is part of text in the second text data; splicing the text mask sequence and the second unit sequence, or splicing the voice characterization mask sequence and the second text data to obtain the third mask sequence; the cross-modal dictionary comprises a plurality of phrases, wherein the phrases are combinations of a voice characterization unit and word texts.
According to the present invention, the above-mentioned obtaining module 401 may be further configured to obtain third voice data and corresponding third text data; the processing module 402 may be further configured to generate a discretized speech characterization unit according to the third speech data, to obtain a third unit sequence; performing word alignment operation on the third unit sequence and the third text data to obtain a first phrase; the first phrase is a phrase in the cross-modal dictionary.
According to the invention, the first text translation data comprises a first source language text and a first target language text; the processing module 402 may be specifically configured to perform random initialization processing on parameters of the decoder; inputting the first source language text into the encoder and the decoder to obtain a first predicted text, and determining a first loss based on the first target language text and the first predicted text; determining a third voice characterization unit corresponding to a third word in a cross-modal dictionary, and replacing the third word in the first source language text with the third voice characterization unit to obtain a first source language text sequence; inputting the first source language text sequence into the encoder and the decoder to obtain a second predicted text, and determining a second loss based on the first target language text and the second predicted text; and determining a second loss function based on the first loss and the second loss, and updating model parameters of the encoder and the decoder according to the second loss function until the second loss function is in a convergence state.
Optionally, the first speech translation data includes a first source language speech and a corresponding second target language text; the processing module 402 may be specifically configured to generate a discretized speech characterization unit according to the first source language speech, so as to obtain a fourth unit sequence; and inputting the fourth unit sequence and the second target language text into the speech translation model to obtain a third predicted text, determining a third loss function based on the second target language text and the third predicted text, and updating model parameters of the encoder and the decoder according to the third loss function until the third loss function is in a convergence state.
In the embodiment of the invention, the encoder of the speech translation model can be pre-trained based on the plurality of mask sequences, the encoder and the decoder can then be pre-trained based on the first text translation data, and finally the speech translation model is trained based on the first speech translation data. Because the mask sequences are generated from the first voice data, the first text data and the first voice recognition data respectively, multi-source data such as large-scale speech, text, speech recognition and machine translation data can be utilized, the cross-modal and cross-lingual characterization capability of the model is gradually enhanced by means of progressive transfer training, and the translation effect of the speech translation model is significantly improved in low-resource speech translation scenarios.
Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a speech translation model training method comprising: acquiring first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; training an encoder of a speech translation model based on the plurality of mask sequences; freezing parameters of the encoder and training a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and training the speech translation model based on the first speech translation data.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the speech translation model training method provided above, the method comprising: acquiring first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; training an encoder of a speech translation model based on the plurality of mask sequences; freezing parameters of the encoder and training a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and training the speech translation model based on the first speech translation data.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech translation model training method provided above, the method comprising: acquiring first voice data, first text data, first voice recognition data, first text translation data and first voice translation data; performing masking operations on the first voice data, the first text data and the first voice recognition data respectively to generate a plurality of mask sequences; training an encoder of a speech translation model based on the plurality of mask sequences; freezing parameters of the encoder and training a decoder of the speech translation model based on the first text translation data, with a first loss function of the encoder in a converged state; and training the speech translation model based on the first speech translation data.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for training a speech translation model, comprising:
acquiring first speech data, first text data, first speech recognition data, first text translation data and first speech translation data;
performing a masking operation on each of the first speech data, the first text data and the first speech recognition data to generate a plurality of mask sequences;
training an encoder of a speech translation model based on the plurality of mask sequences;
with a first loss function of the encoder in a converged state, freezing parameters of the encoder and training a decoder of the speech translation model based on the first text translation data; and
training the speech translation model based on the first speech translation data;
wherein the first speech data is source language speech data; the first text data is source language text data; the first speech recognition data comprises aligned source language speech data and source language text data; the first text translation data comprises aligned source language text data and target language text data; and the first speech translation data comprises aligned source language speech data and target language text data;
wherein the plurality of mask sequences includes a third mask sequence, and the first speech recognition data comprises second speech data and corresponding second text data;
wherein performing the masking operation on each of the first speech data, the first text data and the first speech recognition data to generate the plurality of mask sequences comprises:
generating discretized speech characterization units from the second speech data to obtain a second unit sequence; in the case that a cross-modal dictionary includes a first speech characterization unit, masking a first word corresponding to the first speech characterization unit in the second text data to obtain a text mask sequence, wherein the first speech characterization unit is one of the speech characterization units in the second unit sequence; in the case that the cross-modal dictionary includes a second word, masking a second speech characterization unit corresponding to the second word in the second unit sequence to obtain a speech characterization mask sequence, wherein the second word is part of the text in the second text data; and splicing the text mask sequence with the second unit sequence, or splicing the speech characterization mask sequence with the second text data, to obtain the third mask sequence; wherein the cross-modal dictionary comprises a plurality of phrases, each phrase being a combination of a speech characterization unit and a word text.
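As an illustration of the cross-modal masking recited in claim 1, the following sketch builds the third mask sequence from a unit sequence, its transcript, and a cross-modal dictionary of (unit, word) phrases. The dictionary format, the `<mask>` token, and all names are illustrative assumptions, not the patented implementation.

```python
# Build the third mask sequence by masking whichever side (text or speech
# units) has a counterpart in the cross-modal dictionary, then splicing
# the masked side with the unmasked other side.
MASK = "<mask>"

def build_third_mask_sequence(unit_seq, text_tokens, cross_modal_dict,
                              mask_text_side=True):
    # cross_modal_dict: iterable of (speech characterization unit, word).
    word_of = {u: w for u, w in cross_modal_dict}
    unit_of = {w: u for u, w in cross_modal_dict}

    if mask_text_side:
        # First speech characterization units: units of the second unit
        # sequence found in the dictionary; mask their paired first words.
        covered_words = {word_of[u] for u in unit_seq if u in word_of}
        text_mask_seq = [MASK if w in covered_words else w
                         for w in text_tokens]
        return text_mask_seq + list(unit_seq)        # splice with units
    else:
        # Second words: words of the second text data found in the
        # dictionary; mask their paired second characterization units.
        covered_units = {unit_of[w] for w in text_tokens if w in unit_of}
        unit_mask_seq = [MASK if u in covered_units else u
                         for u in unit_seq]
        return unit_mask_seq + list(text_tokens)     # splice with text
```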
2. The speech translation model training method according to claim 1, wherein the plurality of mask sequences comprises a first mask sequence and a second mask sequence;
wherein performing the masking operation on each of the first speech data, the first text data and the first speech recognition data to generate the plurality of mask sequences comprises:
generating discretized speech characterization units from the first speech data to obtain a first unit sequence;
masking the first unit sequence to obtain the first mask sequence; and
masking the first text data to obtain the second mask sequence.
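A minimal sketch of generating discretized speech characterization units and masking them might look as follows; the k-means quantizer (e.g. a fitted sklearn.cluster.KMeans over frame-level speech features) and the 15% masking ratio are assumptions for illustration, since claim 2 fixes neither choice.

```python
import random

def speech_to_units(features, kmeans):
    # features: [T, D] array of frame-level speech representations;
    # each frame is assigned a discrete unit id and consecutive
    # duplicates are collapsed into a single unit.
    ids = kmeans.predict(features)
    units = []
    for u in ids:
        if not units or units[-1] != int(u):
            units.append(int(u))
    return units                     # e.g. the first unit sequence

def mask_sequence(seq, mask_token="<mask>", ratio=0.15, rng=random):
    # Randomly replace a fraction of the tokens/units with a mask token.
    return [mask_token if rng.random() < ratio else tok for tok in seq]
```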
3. The speech translation model training method according to claim 1, wherein, prior to performing the masking operation, the method further comprises:
acquiring third speech data and corresponding third text data;
generating discretized speech characterization units from the third speech data to obtain a third unit sequence; and
performing a word alignment operation on the third unit sequence and the third text data to obtain a first phrase,
wherein the first phrase is a phrase in the cross-modal dictionary.
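How the word alignment in claim 3 yields dictionary phrases is not specified; in practice a statistical aligner such as fast_align or GIZA++ could be applied to the unit/word pair. The toy sketch below instead splits the unit sequence evenly across the words, purely to illustrate the (unit span, word) phrase format assumed in the earlier sketch.

```python
# Toy monotonic alignment: distribute the unit sequence evenly over the
# transcript words and pair each word with its unit span. Each resulting
# pair is a candidate phrase for the cross-modal dictionary.
def align_units_to_words(unit_seq, words):
    phrases = []
    n = max(len(words), 1)
    step = max(len(unit_seq) // n, 1)
    for i, word in enumerate(words):
        span = tuple(unit_seq[i * step:(i + 1) * step])
        if span:
            phrases.append((span, word))
    return phrases
```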
4. The speech translation model training method according to claim 1, wherein the first text translation data comprises a first source language text and a first target language text;
wherein training the decoder of the speech translation model based on the first text translation data comprises:
randomly initializing parameters of the decoder;
inputting the first source language text into the encoder and the decoder to obtain a first predicted text, and determining a first loss based on the first target language text and the first predicted text;
determining a third speech characterization unit corresponding to a third word in the cross-modal dictionary, and replacing the third word in the first source language text with the third speech characterization unit to obtain a first source language text sequence;
inputting the first source language text sequence into the encoder and the decoder to obtain a second predicted text, and determining a second loss based on the first target language text and the second predicted text;
and determining a second loss function based on the first loss and the second loss, and updating model parameters of the encoder and the decoder according to the second loss function until the second loss function is in a converged state.
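A sketch of one decoder-training step in claim 4, combining the loss on the original source text with the loss on its code-switched variant, might look as follows. The callable `model` (token sequence in, vocabulary logits out), the word-to-unit replacement, and the equal weighting of the two losses are illustrative assumptions; the claim does not fix how the two losses are combined.

```python
import torch.nn.functional as F

def decoder_training_step(model, src_tokens, tgt_ids, cross_modal_dict):
    # First loss: encoder + decoder on the original source language text.
    logits = model(src_tokens)
    loss1 = F.cross_entropy(logits.view(-1, logits.size(-1)),
                            tgt_ids.view(-1))

    # Replace dictionary words with their speech characterization units
    # to obtain the first source language text sequence.
    unit_of = {w: u for u, w in cross_modal_dict}
    mixed_tokens = [unit_of.get(tok, tok) for tok in src_tokens]

    # Second loss: same target text, code-switched input sequence.
    logits = model(mixed_tokens)
    loss2 = F.cross_entropy(logits.view(-1, logits.size(-1)),
                            tgt_ids.view(-1))

    # Second loss function: here simply the unweighted sum of both losses.
    return loss1 + loss2
```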
5. The speech translation model training method according to any one of claims 1 to 4, wherein the first speech translation data comprises a first source language speech and a corresponding second target language text, and wherein training the speech translation model based on the first speech translation data comprises:
generating discretized speech characterization units from the first source language speech to obtain a fourth unit sequence; and
inputting the fourth unit sequence and the second target language text into the speech translation model to obtain a third predicted text, determining a third loss function based on the second target language text and the third predicted text, and updating model parameters of the encoder and the decoder according to the third loss function until the third loss function is in a converged state.
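The end-to-end fine-tuning stage of claim 5 could be sketched as below, reusing the `speech_to_units` helper from the earlier sketch; the epoch budget, the loss-difference convergence test, and the `translation_loss` method are all assumptions.

```python
# Fine-tune the full model on speech translation pairs until the third
# loss function stops improving (a crude convergence test).
def finetune_on_speech_translation(model, st_pairs, optimizer, kmeans,
                                   max_epochs=10, tol=1e-3):
    prev_total = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for features, tgt_ids in st_pairs:
            unit_seq = speech_to_units(features, kmeans)  # fourth unit sequence
            loss = model.translation_loss(unit_seq, tgt_ids)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            total += float(loss)
        if abs(prev_total - total) < tol:
            break
        prev_total = total
```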
6. A speech translation model training device, comprising an acquisition module and a processing module, wherein:
the acquisition module is configured to acquire first speech data, first text data, first speech recognition data, first text translation data and first speech translation data; and
the processing module is configured to: perform a masking operation on each of the first speech data, the first text data and the first speech recognition data to generate a plurality of mask sequences; train an encoder of a speech translation model based on the plurality of mask sequences; with a first loss function of the encoder in a converged state, freeze parameters of the encoder and train a decoder of the speech translation model based on the first text translation data; and train the speech translation model based on the first speech translation data;
wherein the first speech data is source language speech data; the first text data is source language text data; the first speech recognition data comprises aligned source language speech data and source language text data; the first text translation data comprises aligned source language text data and target language text data; and the first speech translation data comprises aligned source language speech data and target language text data;
wherein the plurality of mask sequences includes a third mask sequence, and the first speech recognition data comprises second speech data and corresponding second text data; and the processing module is specifically configured to: generate discretized speech characterization units from the second speech data to obtain a second unit sequence; in the case that a cross-modal dictionary includes a first speech characterization unit, mask a first word corresponding to the first speech characterization unit in the second text data to obtain a text mask sequence, wherein the first speech characterization unit is one of the speech characterization units in the second unit sequence; in the case that the cross-modal dictionary includes a second word, mask a second speech characterization unit corresponding to the second word in the second unit sequence to obtain a speech characterization mask sequence, wherein the second word is part of the text in the second text data; and splice the text mask sequence with the second unit sequence, or splice the speech characterization mask sequence with the second text data, to obtain the third mask sequence; wherein the cross-modal dictionary comprises a plurality of phrases, each phrase being a combination of a speech characterization unit and a word text.
7. The speech translation model training device according to claim 6, wherein the first text translation data comprises a first source language text and a first target language text; and the processing module is specifically configured to: randomly initialize parameters of the decoder; input the first source language text into the encoder and the decoder to obtain a first predicted text, and determine a first loss based on the first target language text and the first predicted text; determine a third speech characterization unit corresponding to a third word in the cross-modal dictionary, and replace the third word in the first source language text with the third speech characterization unit to obtain a first source language text sequence; input the first source language text sequence into the encoder and the decoder to obtain a second predicted text, and determine a second loss based on the first target language text and the second predicted text; and determine a second loss function based on the first loss and the second loss, and update model parameters of the encoder and the decoder according to the second loss function until the second loss function is in a converged state.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the speech translation model training method according to any one of claims 1 to 5.
9. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech translation model training method according to any one of claims 1 to 5.
CN202311380008.0A 2023-10-24 2023-10-24 Speech translation model training method and device, electronic equipment and storage medium Active CN117113091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311380008.0A CN117113091B (en) 2023-10-24 2023-10-24 Speech translation model training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117113091A CN117113091A (en) 2023-11-24
CN117113091B CN117113091B (en) 2024-02-13

Family

ID=88809593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311380008.0A Active CN117113091B (en) 2023-10-24 2023-10-24 Speech translation model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117113091B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183120A (en) * 2020-09-18 2021-01-05 北京字节跳动网络技术有限公司 Speech translation method, device, equipment and storage medium
CN112699690A (en) * 2020-12-29 2021-04-23 科大讯飞股份有限公司 Translation model training method, translation method, electronic device, and storage medium
CN113268996A (en) * 2021-06-02 2021-08-17 网易有道信息技术(北京)有限公司 Method for expanding corpus, training method for translation model and product
CN115394287A (en) * 2022-07-27 2022-11-25 科大讯飞股份有限公司 Mixed language voice recognition method, device, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bridging the Modality Gap for Speech-to-Text Translation; Yuchen Liu et al.; arXiv; full text *

Also Published As

Publication number Publication date
CN117113091A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
JP2020537765A (en) How to Train Multilingual Speech Recognition Networks, Speech Recognition Systems and Multilingual Speech Recognition Systems
CN109933808B (en) Neural machine translation method based on dynamic configuration decoding
JP2008165783A (en) Discriminative training for model for sequence classification
CN111339750A (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN110196963A (en) Model generation, the method for semantics recognition, system, equipment and storage medium
CN110084297A (en) A kind of image semanteme alignment structures towards small sample
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN115762489A (en) Data processing system and method of voice recognition model and voice recognition method
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN110298046B (en) Translation model training method, text translation method and related device
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN114548053A (en) Text comparison learning error correction system, method and device based on editing method
CN117113091B (en) Speech translation model training method and device, electronic equipment and storage medium
CN113947072A (en) Text error correction method and text error correction device
CN114003700A (en) Method and system for processing session information, electronic device and storage medium
CN112364659B (en) Automatic identification method and device for unsupervised semantic representation
József et al. Automated grapheme-to-phoneme conversion system for Romanian
CN115719072A (en) Chapter-level neural machine translation method and system based on mask mechanism
CN114328853B (en) Chinese problem generation method based on Unilm optimized language model
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN114417816A (en) Text scoring method, text scoring model, text scoring device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant