CN115983203A - Voice translation method, device, equipment and readable storage medium - Google Patents

Info

Publication number
CN115983203A
Authority
CN
China
Prior art keywords
source language
text
speech
training
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211682639.3A
Other languages
Chinese (zh)
Inventor
周心远
邓攀
张为泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Shanghai Technology Co ltd
Original Assignee
Iflytek Shanghai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Shanghai Technology Co ltd
Priority to CN202211682639.3A
Publication of CN115983203A
Legal status: Pending

Abstract

The application discloses a speech translation method, apparatus, device and readable storage medium. According to the scheme, after the source language speech to be translated is obtained, the source language speech is processed to obtain a source language type text representation whose sequence length is consistent with that of the source language text corresponding to the source language speech, and the source language type text representation is decoded to obtain the target language text, thereby improving the speech translation effect.

Description

Voice translation method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for speech translation.
Background
The purpose of Speech Translation (ST) is to translate speech in a source language into text in a target language. Conventional speech translation systems generally follow a cascade paradigm comprising two subtasks: Automatic Speech Recognition (ASR) and Machine Translation (MT). This cascade approach mainly suffers from error propagation and high latency. In recent years, end-to-end speech translation models have been proposed to address these problems. A mainstream end-to-end speech translation model uses the encoder of an automatic speech recognition model to encode the source input (i.e., source language speech) and the decoder of a machine translation model to decode it into the target output (i.e., target language text). However, a speech translation model with such an encoder-decoder structure has poor semantic modeling capability, resulting in a poor speech translation effect.
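For ease of understanding, the cascade paradigm can be sketched as follows. This is a minimal illustrative sketch only; `asr_model`, `mt_model`, and their methods are hypothetical placeholders rather than the API of any specific system:

```python
# Minimal sketch of the cascade paradigm (ASR followed by MT).
# `asr_model` and `mt_model` are hypothetical placeholders; any
# recognition error in `transcript` propagates into `target_text`,
# which is the error-propagation problem mentioned above.
def cascade_translate(source_speech, asr_model, mt_model):
    transcript = asr_model.transcribe(source_speech)  # ASR subtask
    target_text = mt_model.translate(transcript)      # MT subtask
    return target_text
```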
To improve the semantic modeling capability of a speech translation model, some speech translation models insert the encoder of a machine translation model between the encoder and decoder of the speech translation model and feed the acoustic feature sequence produced by the speech translation model's encoder into the machine translation model's encoder. However, because the machine translation model received text sequences as encoder input during its training, its encoder expects a text sequence, and the length of the acoustic feature sequence is inconsistent with the length of the text sequences the encoder saw during machine translation training. This mismatch hinders the learning of the speech translation model and results in a poor speech translation effect.
Therefore, how to provide a speech translation method that improves the speech translation effect has become an urgent technical problem for those skilled in the art.
Disclosure of Invention
In view of the foregoing problems, the present application provides a speech translation method, apparatus, device and readable storage medium. The specific scheme is as follows:
a method of speech translation, the method comprising:
acquiring source language voice to be translated;
and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
Optionally, the processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text includes:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module performs acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice, and obtains a source language prediction text based on the acoustic representation of the source language voice;
the text encoder module obtains a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text;
the semantic decoder module performs semantic decoding processing on the source language text representation to obtain a source language type text representation, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech;
and the voice translation decoder module decodes the source language type text representation to obtain a target language text.
Optionally, the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain acoustic representation of the source language speech, and obtains a source language prediction text based on the acoustic representation of the source language speech, including:
the acoustic characterization extraction unit is used for performing acoustic characterization extraction on the source language speech to obtain an acoustic characterization of the source language speech;
the source language text prediction unit obtains a source language prediction text based on the acoustic characterization of the source language speech.
Optionally, the text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit, and an encoding unit;
the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text, and comprises:
the mapping processing unit is used for mapping the acoustic representation of the source language voice to obtain mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit fuses the mapped features and the embedded features to obtain fused features;
and the coding unit codes the fused features to obtain a source language text representation corresponding to the source language speech.
Optionally, the speech translation model is trained in the following manner:
obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein the source language text corresponding to the source language voice for training is marked with a sentence end label;
obtaining a machine translation model trained in advance;
inputting the source language text corresponding to the training source language speech into the pre-trained machine translation model to obtain source language text characteristics output by the pre-trained machine translation model;
and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
Optionally, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the training source language voice to obtain a source language type text representation corresponding to the training source language voice, and performs sentence tail prediction on the basis of the source language type text representation corresponding to the training source language voice to obtain a source language predicted text corresponding to the training source language voice, wherein the source language predicted text corresponding to the training source language voice comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
Optionally, the training of the speech translation model is performed by taking a first loss, a second loss, a third loss and a fourth loss as joint losses in a training process;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language speech for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
A speech translation apparatus, the apparatus comprising:
the obtaining unit is used for obtaining source language voice to be translated;
and the speech translation unit is used for processing the source language speech to obtain a source language type text representation and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
Optionally, the speech translation unit is specifically configured to:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module is used for performing acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice and obtaining a source language prediction text based on the acoustic representation of the source language voice;
the text encoder module is used for obtaining a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text;
the semantic decoder module is used for performing semantic decoding processing on the source language text representation to obtain a source language type text representation, and the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language voice;
and the voice translation decoder module is used for decoding the source language type text representation to obtain a target language text.
Optionally, the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic characterization extraction unit is used for performing acoustic characterization extraction on the source language speech to obtain an acoustic characterization of the source language speech;
the source language text prediction unit is used for obtaining a source language prediction text based on the acoustic characterization of the source language voice.
Optionally, the text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit, and an encoding unit;
the mapping processing unit is used for mapping the acoustic representation of the source language speech to obtain mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit is used for fusing the mapped features and the embedded features to obtain fused features;
and the coding unit is used for coding the fused features to obtain a source language text representation corresponding to the source language speech.
Optionally, the speech translation model is trained in the following manner:
obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein the source language text corresponding to the source language voice for training is marked with a sentence end label;
obtaining a machine translation model trained in advance;
inputting the source language text corresponding to the training source language speech into the pre-trained machine translation model to obtain source language text characteristics output by the pre-trained machine translation model;
and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
Optionally, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the source language voice for training to obtain a source language type text representation corresponding to the source language voice for training, and performs sentence tail prediction based on the source language type text representation corresponding to the source language voice for training to obtain a source language predicted text corresponding to the source language voice for training, wherein the source language predicted text corresponding to the source language voice for training comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
Optionally, the training of the speech translation model is performed with a first loss, a second loss, a third loss and a fourth loss as joint losses in the training process;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language speech for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
A speech translation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech translation method.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the speech translation method as described above.
By means of the above technical solution, the application discloses a speech translation method, apparatus, device and readable storage medium. According to the scheme, after the source language speech to be translated is obtained, the source language speech is processed to obtain a source language type text representation, and the source language type text representation is decoded to obtain the target language text. Because the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech, the speech translation effect is effectively improved.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a speech translation method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech translation model disclosed in an embodiment of the present application;
FIG. 3 is a schematic process diagram illustrating a process of processing a source language speech based on a speech translation model to obtain a source language-class text representation and decoding the source language-class text representation to obtain a target language text, disclosed in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating training of a speech translation model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a speech translation apparatus disclosed in the embodiment of the present application;
fig. 6 is a block diagram of a hardware structure of a speech translation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For better understanding of the scheme of the present application, the end-to-end speech translation model is first explained in detail.
Current end-to-end speech translation models are based mainly on sequence-to-sequence modeling. Similar to sequence-to-sequence tasks such as machine translation and automatic speech recognition, they encode the source input (i.e., source language speech) with a Transformer encoder and decode it with a decoder into the target output (i.e., target language text). However, sequence-to-sequence modeling requires a large amount of training data, while speech translation data is scarce and costly to label. A conventional remedy is therefore to start from the training data of an automatic speech recognition model, translate its transcripts into target language text with a machine translation model, and thereby construct pseudo speech translation data following the idea of knowledge distillation. The encoder and decoder of the speech translation model are then initialized with the trained automatic speech recognition model and machine translation model respectively, so as to exploit the large amounts of ASR and MT data; after initialization, the model is trained and fine-tuned successively on the pseudo speech translation data and on a small amount of real speech translation data. That is, the mainstream end-to-end speech translation model encodes the source input (i.e., source language speech) with the encoder of the automatic speech recognition model and decodes it into the target output (i.e., target language text) with the decoder of the machine translation model.
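For ease of understanding, the construction of pseudo speech translation data described above can be sketched as follows. This is an illustrative sketch only; `mt_model` and its `translate` method are hypothetical placeholders, not real library calls:

```python
# Sketch of building pseudo speech translation data from ASR training
# data via knowledge distillation: the MT model's translations of the
# transcripts serve as target labels. `mt_model.translate` is a
# hypothetical placeholder.
def build_pseudo_st_data(asr_corpus, mt_model):
    """asr_corpus: iterable of (speech, source_transcript) pairs."""
    st_corpus = []
    for speech, transcript in asr_corpus:
        target_text = mt_model.translate(transcript)  # distilled label
        st_corpus.append((speech, transcript, target_text))
    return st_corpus
```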
However, a speech translation model with such an encoder-decoder structure does not fully exploit the automatic speech recognition model and the machine translation model: using only the ASR model's encoder and the MT model's decoder yields poor semantic modeling capability and therefore a poor speech translation effect.
To improve the semantic modeling capability of the speech translation model, the inventors of the present application found through research that some current speech translation models insert the encoder of a machine translation model between the encoder and decoder of the speech translation model and feed the acoustic feature sequence produced by the speech translation model's encoder into the machine translation model's encoder. However, because the machine translation model received text sequences as encoder input during training, its encoder expects a text sequence, and an acoustic feature sequence differs significantly from a text sequence. This mismatch hinders model learning, so the speech translation effect of such models remains poor.
In view of the problems of the above solutions, the present inventors have conducted intensive studies and finally proposed a speech translation method. Next, a speech translation method provided by the present application will be described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart of a speech translation method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring the source language voice to be translated.
In this application, the source language speech to be translated may be speech in any language; the application imposes no limitation on the language.
Step S102: and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
In this application, the processing of the source language speech to obtain a source language-class text representation, and the decoding of the source language-class text representation to obtain a target language text may include: performing acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice, and obtaining a source language prediction text based on the acoustic representation of the source language voice; obtaining a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language prediction text; semantic decoding is carried out on the source language text representation to obtain a source language type text representation, and the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language voice; and decoding the source language type text representation to obtain a target language text.
It should be noted that the process of processing the source language speech to obtain the source language type text representation and decoding the source language type text representation to obtain the target language text may be implemented based on a neural network, which will be specifically described in detail through the following embodiments.
This embodiment discloses a speech translation method. According to the scheme, after the source language speech to be translated is obtained, the source language speech is processed to obtain a source language type text representation, and the source language type text representation is decoded to obtain the target language text. Because the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech, the speech translation effect is effectively improved.
In the above embodiment, it is pointed out that the process of processing the source language speech to obtain a source language-class text representation and decoding the source language-class text representation to obtain a target language text can be implemented based on a neural network. The details will be explained by the following examples.
In one embodiment of the present application, the speech translation model disclosed in the present application is described in detail.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech translation model disclosed in an embodiment of the present application, where the speech translation model may include: the device comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a voice translation decoder module.
Wherein the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit. As one implementable manner, the acoustic encoder module may be initialized with the encoder of a pre-trained automatic speech recognition model, the acoustic representation extraction unit may be implemented with a VGG Block layer and Transformer Encoder layers, and the source language text prediction unit may be implemented with a CTC (Connectionist Temporal Classification) projection layer.
The text encoder module comprises a mapping processing unit, an embedding processing unit, a feature fusion unit and an encoding unit. As one implementable manner, the text encoder module may be initialized with the encoder of a pre-trained machine translation model, the mapping processing unit may be implemented as a mapping layer, the embedding processing unit as an embedding layer, and the encoding unit as a Transformer Encoder.
The semantic decoder module comprises a semantic decoding unit, which may be implemented as a Transformer Decoder.
The speech translation decoder module comprises a speech translation decoding unit, which may be initialized with the decoder (i.e., a Transformer Decoder) of a pre-trained machine translation model.
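For ease of understanding, the overall structure can be sketched in PyTorch-style code as follows. This is a minimal illustrative skeleton under the assumption that each module is implemented as an `nn.Module` with the interfaces shown; it is not the patent's exact implementation:

```python
import torch.nn as nn

# High-level skeleton of the speech translation model of Fig. 2.
# The four submodules are assumed to be nn.Modules with the call
# signatures shown; initialization from pretrained ASR/MT weights
# is indicated only by comments.
class SpeechTranslationModel(nn.Module):
    def __init__(self, acoustic_enc, text_enc, semantic_dec, st_dec):
        super().__init__()
        self.acoustic_enc = acoustic_enc  # init from pretrained ASR encoder
        self.text_enc = text_enc          # init from pretrained MT encoder
        self.semantic_dec = semantic_dec  # Transformer decoder
        self.st_dec = st_dec              # init from pretrained MT decoder

    def forward(self, speech_features):
        # acoustic representation + CTC-based source text prediction
        h_sph, p_ctc = self.acoustic_enc(speech_features)
        # source language text representation
        h_text = self.text_enc(h_sph, p_ctc)
        # source language type text representation (downsampled to the
        # length of the corresponding source language text)
        h_semantic = self.semantic_dec(h_text)
        # target language text (token logits)
        return self.st_dec(h_semantic)
```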
Based on the above speech translation model, in another embodiment of the present application, a specific implementation manner of processing the source language speech in step S102 to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text is described, where the implementation manner specifically includes:
step S201: inputting the source language speech into a speech translation model.
The speech translation model is the speech translation model disclosed in the previous embodiment.
Step S202: and the acoustic encoder module performs acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice, and obtains a source language prediction text based on the acoustic representation of the source language voice.
In the above embodiment, it is explained that the acoustic encoder module includes an acoustic representation extracting unit and a source language text predicting unit, and then the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain an acoustic representation of the source language speech, and obtains a source language predicted text based on the acoustic representation of the source language speech, including:
the acoustic representation extraction unit is used for carrying out acoustic representation extraction on the source language voice to obtain an acoustic representation of the source language voice;
the source language text prediction unit obtains a source language prediction text based on the acoustic characterization of the source language speech.
It should be noted that, before performing acoustic representation extraction on the source language speech, the acoustic representation extraction unit needs to acquire an audio feature sequence of the source language speech, and then extract an acoustic representation of the source language speech from the audio feature sequence of the source language speech.
For ease of understanding, assume that the audio feature sequence of the source language speech is X = {x_1, x_2, ..., x_U}. The acoustic representation of the source language speech, H_sph = {h_1, h_2, ..., h_N}, can then be obtained according to the formula H_sph = Enc_sph(X), where U denotes the length of the audio feature sequence of the source language speech and N denotes the length of the acoustic representation of the source language speech.
Step S203: and the text encoder module obtains a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text.
The text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit and an encoding unit; the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text, and includes:
the mapping processing unit carries out mapping processing on the acoustic representation of the source language voice to obtain the mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit fuses the mapped features and the embedded features to obtain fused features;
and the coding unit codes the fused features to obtain a source language text representation corresponding to the source language speech.
The feature fusion unit may add the mapped feature and the embedded feature to obtain a fused feature.
For ease of understanding, assume that the acoustic representation of the source language speech is H_sph, the source language predicted text obtained from the acoustic representation of the source language speech is P_CTC, the fused feature is H_adaptor, and the source language text representation corresponding to the source language speech is H_text. The source language text representation corresponding to the source language speech can then be calculated based on the following formulas:

H_text = Enc_text(H_adaptor)

wherein:

H_adaptor = H_map + H_embed

H_map = ReLU(W_map · H_sph + b_map)

H_embed = W_embed · P_CTC

P_CTC = Softmax(W_ctc · H_sph + b_ctc)

where W_map and b_map are parameters of the mapping processing unit, W_ctc and b_ctc are parameters of the source language text prediction unit, and W_embed is a parameter of the embedding processing unit.
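For ease of understanding, the above formulas can be transcribed directly into PyTorch-style code. The dimensions and random parameters below are illustrative assumptions only; in the model they are learned parameters of the respective units:

```python
import torch
import torch.nn.functional as F

# Direct transcription of the formulas above. d_model, vocab and the
# sequence length n are illustrative; W_map/b_map, W_ctc/b_ctc and
# W_embed correspond to the parameters of the mapping processing unit,
# the source language text prediction unit and the embedding
# processing unit, respectively.
d_model, vocab, n = 256, 5000, 37
H_sph = torch.randn(1, n, d_model)               # acoustic representation
W_map = torch.randn(d_model, d_model); b_map = torch.zeros(d_model)
W_ctc = torch.randn(d_model, vocab);   b_ctc = torch.zeros(vocab)
W_embed = torch.randn(vocab, d_model)

P_ctc = F.softmax(H_sph @ W_ctc + b_ctc, dim=-1)  # source language prediction
H_map = F.relu(H_sph @ W_map + b_map)             # mapped features
H_embed = P_ctc @ W_embed                         # embedded features
H_adaptor = H_map + H_embed                       # feature fusion (addition)
# H_text = Enc_text(H_adaptor): Transformer encoder, omitted here
```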
Step S204: the semantic decoder module performs semantic decoding processing on the source language text representation to obtain a source language type text representation, where the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
It should be noted that the semantic decoder module may extract semantic information from the source language text representation, and obtain the source language type text representation by down-sampling to remove redundant noise information.
Step S205: and the voice translation decoder module decodes the source language type text representation to obtain a target language text.
For convenience of understanding, referring to fig. 3, fig. 3 is a schematic process diagram of processing a source language speech based on a speech translation model to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, which is disclosed in an embodiment of the present application.
In another embodiment of the present application, a detailed description is given of a training mode of the speech translation model.
The training mode of the speech translation model specifically comprises the following steps:
step S301: the method comprises the steps of obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein sentence end labels are marked on the source language text corresponding to the source language voice for training.
Step S302: and acquiring a pre-trained machine translation model.
Step S303: and inputting the source language text corresponding to the training source language voice into the pre-trained machine translation model to obtain the source language text characteristics output by the pre-trained machine translation model.
Step S304: and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
Specifically, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the training source language voice to obtain a source language type text representation corresponding to the training source language voice, and performs sentence tail prediction on the basis of the source language type text representation corresponding to the training source language voice to obtain a source language predicted text corresponding to the training source language voice, wherein the source language predicted text corresponding to the training source language voice comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
It should be noted that the semantic decoder module generates features in an autoregressive manner during semantic decoding, and therefore needs to predict the exact end-of-sentence (EOS) position. An inaccurate EOS position causes the generated features either to lose useful information or to retain redundant information. Therefore, in the present application, when training the speech translation model, an end-of-sentence prediction unit may be added to the semantic decoder module; after the end-of-sentence prediction unit performs end-of-sentence prediction, the source language predicted text corresponding to the source language speech for training is obtained.
As one implementable way, the predicted end-of-sentence position can be obtained by directly normalizing the predicted feature ŷ_t and determining the time index corresponding to the maximum value, where ŷ_t denotes the feature predicted by the semantic decoder module at decoding step t.
As another implementable way, a single linear layer may be used to project the features into the dimension of the source language dictionary. When the output of this projection network is classified as the end-of-sentence label, the semantic decoder module stops semantic decoding, yielding the source language predicted text corresponding to the source language speech for training. This approach predicts the end-of-sentence position more accurately.
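For ease of understanding, the second variant can be sketched as follows. The vocabulary size, EOS label id, step limit, and the `decoder_step` callable are illustrative assumptions:

```python
import torch.nn as nn

# Sketch of EOS prediction with a single linear projection into the
# source language dictionary dimension: decoding stops once the
# projection's top class is the end-of-sentence label.
d_model, vocab_src, EOS_ID, MAX_STEPS = 256, 5000, 2, 200
eos_proj = nn.Linear(d_model, vocab_src)

def semantic_decode(decoder_step, h_text):
    """decoder_step: hypothetical callable returning the feature
    predicted by the semantic decoder at step t (autoregressive)."""
    features = []
    for t in range(MAX_STEPS):
        y_t = decoder_step(h_text, features)      # shape (d_model,)
        if eos_proj(y_t).argmax().item() == EOS_ID:
            break                                 # predicted sentence end
        features.append(y_t)
    return features  # length tracks the source text sequence length
```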
In the present application, the end-of-sentence prediction unit enables down-sampling of the source language type text representation corresponding to the source language speech for training, keeping its length consistent with the real text sequence. At the same time, it reduces the difficulty of aligning the target language text sequence with the source language type text representation during learning. As a result, the trained semantic decoder module can remove redundant information from the source language speech, increase the density of semantic information in the source language type text representation corresponding to the source language speech, and effectively improve the quality of speech translation.
In addition, in the present application, the speech translation model may be trained by a multi-task learning method; that is, during training, a plurality of losses may be combined into a joint loss, with the weights of the losses summing to 1.
As an implementation manner, in the present application, the speech translation model may be trained with a first loss, a second loss, a third loss, and a fourth loss as a joint loss, and a sum of weights of the first loss, the second loss, the third loss, and the fourth loss is 1;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language voice for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
The first loss may be a CTC loss, the second loss an L1 loss, the third loss a CE (cross-entropy) loss, and the fourth loss a CE loss.
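For ease of understanding, the joint loss can be sketched as follows. The weight values are illustrative assumptions only (the scheme requires merely that they sum to 1), and each criterion is assumed to receive inputs shaped as PyTorch expects:

```python
import torch.nn as nn

# Sketch of the four-term joint training objective.
ctc_loss = nn.CTCLoss()        # loss 1: acoustic encoder's CTC prediction
l1_loss = nn.L1Loss()          # loss 2: text-like repr. vs. MT text features
ce_eos = nn.CrossEntropyLoss() # loss 3: semantic decoder's predicted text
ce_st = nn.CrossEntropyLoss()  # loss 4: target language predicted text

w1, w2, w3, w4 = 0.3, 0.2, 0.2, 0.3  # assumed weights, summing to 1

def joint_loss(ctc_args, l1_args, eos_args, st_args):
    return (w1 * ctc_loss(*ctc_args) + w2 * l1_loss(*l1_args)
            + w3 * ce_eos(*eos_args) + w4 * ce_st(*st_args))
```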
For understanding, referring to fig. 4, fig. 4 is a schematic diagram illustrating training of a speech translation model disclosed in an embodiment of the present application.
In summary, the scheme improves on the current state-of-the-art end-to-end speech translation technology. Using deep learning, multi-task learning and related techniques, a semantic decoder is embedded in the conventional end-to-end speech translation model to generate text-like features rich in semantic information; down-sampling of the text-like features is achieved through EOS prediction, keeping the length of the text-like feature sequence consistent with that of the real text sequence and effectively improving speech translation quality. Compared with the prior art, the scheme has the following advantages:
the text features of the machine translation model with excellent translation performance are used as external constraint targets, the capability of the text-like features containing semantic information coded by the voice translation model is improved, and therefore the quality of voice translation is effectively improved. Redundant information in the input voice signal of the voice translation model is removed through the EOS prediction network, the density of semantic information in the coded text-like features is improved, the introduction of a pre-trained machine translation model is facilitated, and the difficulty of a voice translation decoder in learning an alignment task is reduced.
The following describes a speech translation apparatus disclosed in an embodiment of the present application, and the speech translation apparatus described below and the speech translation method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech translation apparatus disclosed in the embodiment of the present application. As shown in fig. 5, the speech translation apparatus may include:
an obtaining unit 11, configured to obtain source language speech to be translated;
and the speech translation unit 12 is configured to process the source language speech to obtain a source language type text representation and decode the source language type text representation to obtain a target language text, where the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
As an implementable embodiment, the speech translation unit is specifically configured to:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module is used for performing acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice and obtaining a source language prediction text based on the acoustic representation of the source language voice;
the text encoder module is used for obtaining a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text;
the semantic decoder module is used for performing semantic decoding processing on the source language text representation to obtain a source language type text representation, where the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech;
and the voice translation decoder module is used for decoding the source language type text representation to obtain a target language text.
As an implementable embodiment, the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic characterization extraction unit is used for performing acoustic characterization extraction on the source language speech to obtain an acoustic characterization of the source language speech;
the source language text prediction unit is used for obtaining a source language prediction text based on the acoustic characterization of the source language voice.
As an implementable manner, the text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit and an encoding unit;
the mapping processing unit is used for mapping the acoustic representation of the source language speech to obtain mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit is used for fusing the mapped features and the embedded features to obtain fused features;
and the coding unit is used for coding the fused features to obtain a source language text representation corresponding to the source language voice.
As an implementation, the speech translation model is trained as follows:
obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein the source language text corresponding to the source language voice for training is marked with a sentence end label;
acquiring a pre-trained machine translation model;
inputting the source language text corresponding to the training source language speech into the pre-trained machine translation model to obtain source language text characteristics output by the pre-trained machine translation model;
and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
As an implementation, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the source language voice for training to obtain a source language type text representation corresponding to the source language voice for training, and performs sentence tail prediction based on the source language type text representation corresponding to the source language voice for training to obtain a source language predicted text corresponding to the source language voice for training, wherein the source language predicted text corresponding to the source language voice for training comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
As an implementation manner, the training process of the speech translation model is trained by taking a first loss, a second loss, a third loss and a fourth loss as a joint loss;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language speech for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of a speech translation apparatus according to an embodiment of the present application, and referring to fig. 6, the hardware structure of the speech translation apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an Application Specific Integrated Circuit ASIC (Application Specific Integrated Circuit), or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring source language voice to be translated;
and processing the source language voice to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language voice.
Alternatively, the detailed function and the extended function of the program may refer to the above description.
Embodiments of the present application further provide a readable storage medium, which may store a program adapted to be executed by a processor, where the program is configured to:
acquiring source language voice to be translated;
and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
Alternatively, the detailed function and the extended function of the program may refer to the above description.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech translation, the method comprising:
acquiring source language voice to be translated;
and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
2. The method of claim 1, wherein processing the source language speech to obtain a source language-class text representation and decoding the source language-class text representation to obtain a target language text comprises:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain an acoustic representation of the source language speech, and obtains a source language predicted text based on the acoustic representation of the source language speech;
the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text;
the semantic decoder module performs semantic decoding processing on the source language text representation to obtain a source language type text representation, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech;
and the speech translation decoder module decodes the source language type text representation to obtain a target language text.
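For orientation, the following toy sketch shows one way the four modules of claim 2 could be wired together. The layer choices (GRU, Transformer blocks), the dimensions, and the frame-level fusion are assumptions for illustration; the claim fixes only the module boundaries and the data flow, and the length adjustment it requires is only noted in a comment.

```python
# Illustrative wiring of the four modules in claim 2; layer choices and
# dimensions are assumptions, not claimed subject matter.
import torch
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    def __init__(self, feat_dim=80, d=256, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        self.acoustic_encoder = nn.GRU(feat_dim, d, batch_first=True)
        self.src_text_head = nn.Linear(d, src_vocab)       # source text prediction
        self.src_embed = nn.Embedding(src_vocab, d)
        self.text_encoder = nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.semantic_decoder = nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.st_decoder = nn.Linear(d, tgt_vocab)          # target text prediction

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        acoustic, _ = self.acoustic_encoder(speech)          # (B, T, d)
        src_pred = self.src_text_head(acoustic).argmax(-1)   # frame-level source ids
        # Text encoder: fuse the acoustic representation with the predicted text.
        src_text_repr = self.text_encoder(acoustic + self.src_embed(src_pred))
        # Semantic decoder: produce the source-language-type text representation.
        # Here the length stays at T for simplicity; the claim requires it to be
        # shrunk to the source transcript length (e.g. by collapsing frames).
        text_like = self.semantic_decoder(src_text_repr)
        return self.st_decoder(text_like)                    # target language logits

logits = SpeechTranslationModel()(torch.randn(2, 120, 80))
print(logits.shape)  # (2, 120, 1000)
```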
3. The method of claim 2, wherein the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain an acoustic representation of the source language speech, and obtains a source language predicted text based on the acoustic representation of the source language speech, including:
the acoustic representation extraction unit performs acoustic representation extraction on the source language speech to obtain the acoustic representation of the source language speech;
the source language text prediction unit obtains the source language predicted text based on the acoustic representation of the source language speech.
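One plausible realization of claim 3's two units is sketched below: a recurrent extractor plus a CTC-style prediction head with greedy blank/repeat collapsing. CTC is an assumption here, since the claim does not name the prediction mechanism.

```python
# Sketch of claim 3's acoustic representation extraction unit and source
# language text prediction unit; the CTC-style greedy decode is an assumed
# realization, not claimed text.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, feat_dim=80, d=256, src_vocab=1000, blank_id=0):
        super().__init__()
        self.extract = nn.GRU(feat_dim, d, batch_first=True)  # representation unit
        self.predict = nn.Linear(d, src_vocab)                # text prediction unit
        self.blank_id = blank_id

    def forward(self, speech: torch.Tensor):
        acoustic, _ = self.extract(speech)              # (B, T, d)
        frame_ids = self.predict(acoustic).argmax(-1)   # (B, T) frame-level ids
        predicted_texts = []
        for seq in frame_ids:
            # Collapse repeated ids and drop blanks, CTC-style, so the
            # prediction approximates the source transcript length.
            collapsed = torch.unique_consecutive(seq)
            predicted_texts.append(collapsed[collapsed != self.blank_id])
        return acoustic, predicted_texts

acoustic, texts = AcousticEncoder()(torch.randn(1, 120, 80))
print(acoustic.shape, texts[0].shape)
```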
4. The method of claim 2, wherein the text encoder module comprises a mapping processing unit, an embedding processing unit, a feature fusion unit, and an encoding unit;
the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text, including:
the mapping processing unit performs mapping processing on the acoustic representation of the source language speech to obtain mapped features;
the embedding processing unit performs embedding processing on the source language predicted text to obtain embedded features;
the feature fusion unit fuses the mapped features and the embedded features to obtain fused features;
and the encoding unit encodes the fused features to obtain the source language text representation corresponding to the source language speech.
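A small sketch of claim 4's four units follows. Fusion by elementwise addition and a single Transformer layer are illustrative assumptions; the claim names only the mapping, embedding, fusion, and encoding steps, and the sketch assumes both inputs share the same frame-level length.

```python
# Sketch of claim 4's text encoder module; fusion by addition and a single
# Transformer encoder layer are assumptions for illustration.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, d=256, src_vocab=1000):
        super().__init__()
        self.mapping = nn.Linear(d, d)               # mapping processing unit
        self.embedding = nn.Embedding(src_vocab, d)  # embedding processing unit
        self.encoder = nn.TransformerEncoderLayer(d, 4, batch_first=True)  # encoding unit

    def forward(self, acoustic_repr, src_pred_ids):
        mapped = self.mapping(acoustic_repr)     # mapped acoustic features
        embedded = self.embedding(src_pred_ids)  # embedded predicted source text
        fused = mapped + embedded                # feature fusion unit (assumed: add)
        return self.encoder(fused)               # source language text representation

out = TextEncoder()(torch.randn(1, 120, 256), torch.randint(0, 1000, (1, 120)))
print(out.shape)  # (1, 120, 256)
```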
5. The method of claim 2, wherein the speech translation model is trained as follows:
obtaining source language speech for training, a source language text corresponding to the source language speech for training, and a target language text corresponding to the source language speech for training, wherein the source language text corresponding to the source language speech for training is labeled with a sentence end label;
obtaining a pre-trained machine translation model;
inputting the source language text corresponding to the source language speech for training into the pre-trained machine translation model to obtain source language text features output by the pre-trained machine translation model;
and training by taking the source language speech for training as a training sample, and taking the source language text corresponding to the source language speech for training, the target language text corresponding to the source language speech for training, the source language text features output by the pre-trained machine translation model, and the sentence end label of the source language text corresponding to the source language speech for training as sample labels, to obtain the speech translation model.
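The label construction in claim 5 amounts to running the frozen pre-trained machine translation model over the source transcript (carrying a sentence end label) and keeping its text features as one of the supervision targets. The sketch below assumes the MT model exposes its encoder features directly; that interface is hypothetical.

```python
# Sketch of the sample-label construction in claim 5; the mt_encoder
# interface is a hypothetical assumption for illustration.
import torch
import torch.nn as nn

@torch.no_grad()
def build_sample_labels(mt_encoder: nn.Module, src_text_ids: torch.Tensor,
                        eos_id: int) -> dict:
    # Mark the sentence end of the source transcript.
    eos = torch.full((src_text_ids.size(0), 1), eos_id, dtype=src_text_ids.dtype)
    src_with_eos = torch.cat([src_text_ids, eos], dim=1)
    # The frozen pre-trained MT model yields the source language text features
    # that the semantic decoder's output will later be trained to match.
    mt_text_features = mt_encoder(src_with_eos)
    return {"src_text_with_eos": src_with_eos, "mt_text_features": mt_text_features}

mt_encoder = nn.Sequential(nn.Embedding(1000, 256))  # stand-in for a real MT encoder
labels = build_sample_labels(mt_encoder, torch.randint(1, 1000, (2, 10)), eos_id=2)
print(labels["mt_text_features"].shape)  # (2, 11, 256)
```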
6. The method of claim 5, wherein during the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language speech for training to obtain an acoustic representation of the source language speech for training, and obtains a source language predicted text corresponding to the source language speech for training based on the acoustic representation of the source language speech for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and the source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the source language speech for training to obtain a source language type text representation corresponding to the source language speech for training, and performs sentence end prediction based on the source language type text representation corresponding to the source language speech for training to obtain a source language predicted text corresponding to the source language speech for training, wherein the source language predicted text corresponding to the source language speech for training comprises a predicted sentence end position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
7. The method of claim 6, wherein the speech translation model is trained with a first loss, a second loss, a third loss, and a fourth loss as a joint loss;
wherein the first loss characterizes a difference between the source language predicted text obtained by the acoustic encoder module for the source language speech for training and the source language text corresponding to the source language speech for training;
the second loss characterizes a difference between the source language type text representation corresponding to the source language speech for training and the source language text features output by the pre-trained machine translation model;
the third loss characterizes a difference between the source language predicted text obtained by the semantic decoder module, which includes the predicted sentence end position, and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the source language speech for training and the target language text corresponding to the source language speech for training.
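The four losses in claim 7 can be combined as a simple sum, as sketched below. The concrete loss functions (CTC for the source prediction, MSE for the feature-matching term, cross-entropy for the sentence-end-aware source prediction and for the translation) and the uniform weighting are assumptions; the claim only states which difference each loss characterizes.

```python
# Sketch of the joint loss in claim 7; the specific loss functions and the
# equal weighting are assumptions for illustration.
import torch.nn.functional as F

def joint_loss(src_log_probs, src_text, input_lens, target_lens,   # first loss
               text_like_repr, mt_text_features,                   # second loss
               sem_dec_logits, src_text_with_eos,                  # third loss
               st_logits, tgt_text):                               # fourth loss
    # 1) acoustic encoder's source text prediction vs. the source transcript
    #    (CTC assumed; src_log_probs is (T, B, V) log-softmax output).
    l1 = F.ctc_loss(src_log_probs, src_text, input_lens, target_lens)
    # 2) source-language-type text representation vs. the pre-trained MT
    #    model's text features (feature matching; MSE assumed).
    l2 = F.mse_loss(text_like_repr, mt_text_features)
    # 3) semantic decoder's source prediction, including the sentence end
    #    position, vs. the transcript carrying its sentence end label.
    l3 = F.cross_entropy(sem_dec_logits.transpose(1, 2), src_text_with_eos)
    # 4) predicted target text vs. the reference target text.
    l4 = F.cross_entropy(st_logits.transpose(1, 2), tgt_text)
    return l1 + l2 + l3 + l4
```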
8. An apparatus for speech translation, the apparatus comprising:
an acquisition unit configured to acquire source language speech to be translated;
and the speech translation unit is used for processing the source language speech to obtain a source language type text representation and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
9. A speech translation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to implement the steps of the speech translation method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech translation method according to any one of claims 1 to 7.
CN202211682639.3A 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium Pending CN115983203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211682639.3A CN115983203A (en) 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211682639.3A CN115983203A (en) 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115983203A true CN115983203A (en) 2023-04-18

Family

ID=85969549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211682639.3A Pending CN115983203A (en) 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115983203A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416969A (en) * 2023-06-09 2023-07-11 深圳市江元科技(集团)有限公司 Multi-language real-time translation method, system and medium based on big data

Similar Documents

Publication Publication Date Title
CN108899013B (en) Voice search method and device and voice recognition system
CN113283244B (en) Pre-training model-based bidding data named entity identification method
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110472255B (en) Neural network machine translation method, model, electronic terminal, and storage medium
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN112446221B (en) Translation evaluation method, device, system and computer storage medium
CN115983203A (en) Voice translation method, device, equipment and readable storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115881104A (en) Speech recognition method, device and storage medium based on hot word coding
CN115273830A (en) Method, device and equipment for stream type speech recognition and model training
CN112668346B (en) Translation method, device, equipment and storage medium
CN112036122B (en) Text recognition method, electronic device and computer readable medium
JP2021524095A (en) Text-level text translation methods and equipment
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN113643694A (en) Voice recognition method and device, electronic equipment and storage medium
CN113889087B (en) Speech recognition and model establishment method, device, equipment and storage medium
CN112735392B (en) Voice processing method, device, equipment and storage medium
CN114283786A (en) Speech recognition method, device and computer readable storage medium
CN111048065A (en) Text error correction data generation method and related device
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium
CN115081459B (en) Spoken language text generation method, device, equipment and storage medium
CN113688309B (en) Training method for generating model and generation method and device for recommendation reason
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN113011176A (en) Language model training and language reasoning method, device and computer storage medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination