CN115983203A - Voice translation method, device, equipment and readable storage medium - Google Patents

Info

Publication number
CN115983203A
Authority
CN
China
Prior art keywords
source language
text
speech
training
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211682639.3A
Other languages
Chinese (zh)
Inventor
周心远
邓攀
张为泰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Shanghai Technology Co ltd
Original Assignee
Iflytek Shanghai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Shanghai Technology Co ltd
Priority to CN202211682639.3A
Publication of CN115983203A
Legal status: Pending

Abstract

The application discloses a speech translation method, apparatus, device and readable storage medium. According to the scheme, after the source language speech to be translated is obtained, the source language speech is processed to obtain a source language type text representation whose sequence length is consistent with that of the source language text corresponding to the source language speech, and the source language type text representation is decoded to obtain the target language text, thereby improving the speech translation effect.

Description

Voice translation method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for speech translation.
Background
The purpose of Speech Translation (ST) is to translate speech in a source language into text in a target language. Conventional speech translation systems generally follow a cascade paradigm comprising two subtasks: Automatic Speech Recognition (ASR) and Machine Translation (MT). This cascade approach mainly suffers from error propagation and high latency. In recent years, end-to-end speech translation models have been proposed to address these problems. A mainstream end-to-end speech translation model uses the encoder of an automatic speech recognition model to encode the source input (i.e., source language speech) and the decoder of a machine translation model to decode it into the target output (i.e., target language text). However, a speech translation model with such an encoder-decoder structure has poor semantic modeling capability, resulting in a poor speech translation effect.
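For ease of understanding, the cascade paradigm can be sketched as follows. This is a minimal illustrative sketch only; `asr_model`, `mt_model`, and their methods are hypothetical placeholders rather than the API of any specific system:

```python
# Minimal sketch of the cascade paradigm (ASR followed by MT).
# `asr_model` and `mt_model` are hypothetical placeholders; any
# recognition error in `transcript` propagates into `target_text`,
# which is the error-propagation problem mentioned above.
def cascade_translate(source_speech, asr_model, mt_model):
    transcript = asr_model.transcribe(source_speech)  # ASR subtask
    target_text = mt_model.translate(transcript)      # MT subtask
    return target_text
```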
To improve the semantic modeling capability of a speech translation model, some speech translation models insert the encoder of a machine translation model between the encoder and decoder of the speech translation model and feed the acoustic feature sequence produced by the speech translation model's encoder into the machine translation model's encoder. However, because the machine translation model received text sequences as encoder input during its training, its encoder expects a text sequence, and the length of the acoustic feature sequence is inconsistent with the length of the text sequences the encoder saw during machine translation training. This mismatch hinders the learning of the speech translation model and results in a poor speech translation effect.
Therefore, how to provide a speech translation method that improves the speech translation effect has become an urgent technical problem for those skilled in the art.
Disclosure of Invention
In view of the foregoing problems, the present application provides a speech translation method, apparatus, device and readable storage medium. The specific scheme is as follows:
a method of speech translation, the method comprising:
acquiring source language voice to be translated;
and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
Optionally, the processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text includes:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module performs acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice, and obtains a source language prediction text based on the acoustic representation of the source language voice;
the text encoder module obtains a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text;
the semantic decoder module performs semantic decoding processing on the source language text representation to obtain a source language type text representation, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech;
and the voice translation decoder module decodes the source language type text representation to obtain a target language text.
Optionally, the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain acoustic representation of the source language speech, and obtains a source language prediction text based on the acoustic representation of the source language speech, including:
the acoustic characterization extraction unit is used for performing acoustic characterization extraction on the source language speech to obtain an acoustic characterization of the source language speech;
the source language text prediction unit obtains a source language prediction text based on the acoustic characterization of the source language speech.
Optionally, the text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit, and an encoding unit;
the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text, and comprises:
the mapping processing unit is used for mapping the acoustic representation of the source language voice to obtain mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit fuses the mapped features and the embedded features to obtain fused features;
and the coding unit codes the fused features to obtain a source language text representation corresponding to the source language speech.
Optionally, the speech translation model is trained in the following manner:
obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein the source language text corresponding to the source language voice for training is marked with a sentence end label;
obtaining a machine translation model trained in advance;
inputting the source language text corresponding to the training source language speech into the pre-trained machine translation model to obtain source language text characteristics output by the pre-trained machine translation model;
and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
Optionally, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the training source language voice to obtain a source language type text representation corresponding to the training source language voice, and performs sentence tail prediction on the basis of the source language type text representation corresponding to the training source language voice to obtain a source language predicted text corresponding to the training source language voice, wherein the source language predicted text corresponding to the training source language voice comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
Optionally, the training of the speech translation model is performed by taking a first loss, a second loss, a third loss and a fourth loss as joint losses in a training process;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language speech for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
A speech translation apparatus, the apparatus comprising:
the obtaining unit is used for obtaining source language voice to be translated;
and the speech translation unit is used for processing the source language speech to obtain a source language type text representation and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
Optionally, the speech translation unit is specifically configured to:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module is used for performing acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice and obtaining a source language prediction text based on the acoustic representation of the source language voice;
the text encoder module is used for obtaining a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text;
the semantic decoder module is used for performing semantic decoding processing on the source language text representation to obtain a source language type text representation, and the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language voice;
and the voice translation decoder module is used for decoding the source language type text representation to obtain a target language text.
Optionally, the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic characterization extraction unit is used for performing acoustic characterization extraction on the source language speech to obtain an acoustic characterization of the source language speech;
the source language text prediction unit is used for obtaining a source language prediction text based on the acoustic characterization of the source language voice.
Optionally, the text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit, and an encoding unit;
the mapping processing unit is used for mapping the acoustic representation of the source language speech to obtain mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit is used for fusing the mapped features and the embedded features to obtain fused features;
and the coding unit is used for coding the fused features to obtain a source language text representation corresponding to the source language speech.
Optionally, the speech translation model is trained in the following manner:
obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein the source language text corresponding to the source language voice for training is marked with a sentence end label;
obtaining a machine translation model trained in advance;
inputting the source language text corresponding to the training source language speech into the pre-trained machine translation model to obtain source language text characteristics output by the pre-trained machine translation model;
and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
Optionally, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the source language voice for training to obtain a source language type text representation corresponding to the source language voice for training, and performs sentence tail prediction based on the source language type text representation corresponding to the source language voice for training to obtain a source language predicted text corresponding to the source language voice for training, wherein the source language predicted text corresponding to the source language voice for training comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
Optionally, the training of the speech translation model is performed with a first loss, a second loss, a third loss and a fourth loss as joint losses in the training process;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language speech for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
A speech translation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech translation method.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the speech translation method as described above.
By means of the above technical solution, the application discloses a speech translation method, apparatus, device and readable storage medium. According to the scheme, after the source language speech to be translated is obtained, the source language speech is processed to obtain a source language type text representation, and the source language type text representation is decoded to obtain the target language text. Because the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech, the speech translation effect is effectively improved.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a speech translation method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech translation model disclosed in an embodiment of the present application;
FIG. 3 is a schematic process diagram illustrating a process of processing a source language speech based on a speech translation model to obtain a source language-class text representation and decoding the source language-class text representation to obtain a target language text, disclosed in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating training of a speech translation model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a speech translation apparatus disclosed in the embodiment of the present application;
fig. 6 is a block diagram of a hardware structure of a speech translation apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For better understanding of the scheme of the present application, the end-to-end speech translation model is first explained in detail.
Current end-to-end speech translation models are based mainly on sequence-to-sequence modeling. Similar to sequence-to-sequence tasks such as machine translation and automatic speech recognition, they encode the source input (i.e., source language speech) with a Transformer encoder and decode it with a decoder into the target output (i.e., target language text). However, sequence-to-sequence modeling requires a large amount of training data, while speech translation data is scarce and costly to label. A conventional remedy is therefore to start from the training data of an automatic speech recognition model, translate its transcripts into target language text with a machine translation model, and thereby construct pseudo speech translation data following the idea of knowledge distillation. The encoder and decoder of the speech translation model are then initialized with the trained automatic speech recognition model and machine translation model respectively, so as to exploit the large amounts of ASR and MT data; after initialization, the model is trained and fine-tuned successively on the pseudo speech translation data and on a small amount of real speech translation data. That is, the mainstream end-to-end speech translation model encodes the source input (i.e., source language speech) with the encoder of the automatic speech recognition model and decodes it into the target output (i.e., target language text) with the decoder of the machine translation model.
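For ease of understanding, the construction of pseudo speech translation data described above can be sketched as follows. This is an illustrative sketch only; `mt_model` and its `translate` method are hypothetical placeholders, not real library calls:

```python
# Sketch of building pseudo speech translation data from ASR training
# data via knowledge distillation: the MT model's translations of the
# transcripts serve as target labels. `mt_model.translate` is a
# hypothetical placeholder.
def build_pseudo_st_data(asr_corpus, mt_model):
    """asr_corpus: iterable of (speech, source_transcript) pairs."""
    st_corpus = []
    for speech, transcript in asr_corpus:
        target_text = mt_model.translate(transcript)  # distilled label
        st_corpus.append((speech, transcript, target_text))
    return st_corpus
```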
However, a speech translation model with such an encoder-decoder structure does not fully exploit the automatic speech recognition model and the machine translation model: using only the ASR model's encoder and the MT model's decoder yields poor semantic modeling capability and therefore a poor speech translation effect.
To improve the semantic modeling capability of the speech translation model, the inventors of the present application found through research that some current speech translation models insert the encoder of a machine translation model between the encoder and decoder of the speech translation model and feed the acoustic feature sequence produced by the speech translation model's encoder into the machine translation model's encoder. However, because the machine translation model received text sequences as encoder input during training, its encoder expects a text sequence, and an acoustic feature sequence differs significantly from a text sequence. This mismatch hinders model learning, so the speech translation effect of such models remains poor.
In view of the problems of the above solutions, the present inventors have conducted intensive studies and finally proposed a speech translation method. Next, a speech translation method provided by the present application will be described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart of a speech translation method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring the source language voice to be translated.
In this application, the source language speech to be translated may be speech in any language; the application imposes no limitation on the language.
Step S102: and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
In this application, the processing of the source language speech to obtain a source language-class text representation, and the decoding of the source language-class text representation to obtain a target language text may include: performing acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice, and obtaining a source language prediction text based on the acoustic representation of the source language voice; obtaining a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language prediction text; semantic decoding is carried out on the source language text representation to obtain a source language type text representation, and the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language voice; and decoding the source language type text representation to obtain a target language text.
It should be noted that the process of processing the source language speech to obtain the source language type text representation and decoding the source language type text representation to obtain the target language text may be implemented based on a neural network, which will be specifically described in detail through the following embodiments.
This embodiment discloses a speech translation method. According to the scheme, after the source language speech to be translated is obtained, the source language speech is processed to obtain a source language type text representation, and the source language type text representation is decoded to obtain the target language text. Because the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech, the speech translation effect is effectively improved.
In the above embodiment, it is pointed out that the process of processing the source language speech to obtain a source language-class text representation and decoding the source language-class text representation to obtain a target language text can be implemented based on a neural network. The details will be explained by the following examples.
In one embodiment of the present application, the speech translation model disclosed in the present application is described in detail.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech translation model disclosed in an embodiment of the present application, where the speech translation model may include: the device comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a voice translation decoder module.
Wherein the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit. As one implementable manner, the acoustic encoder module may be initialized with the encoder of a pre-trained automatic speech recognition model, the acoustic representation extraction unit may be implemented with a VGG Block layer and Transformer Encoder layers, and the source language text prediction unit may be implemented with a CTC (Connectionist Temporal Classification) projection layer.
The text encoder module comprises a mapping processing unit, an embedding processing unit, a feature fusion unit and an encoding unit. As one implementable manner, the text encoder module may be initialized with the encoder of a pre-trained machine translation model, the mapping processing unit may be implemented as a mapping layer, the embedding processing unit as an embedding layer, and the encoding unit as a Transformer Encoder.
The semantic decoder module comprises a semantic decoding unit, which may be implemented as a Transformer Decoder.
The speech translation decoder module comprises a speech translation decoding unit, which may be initialized with the decoder (i.e., a Transformer Decoder) of a pre-trained machine translation model.
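For ease of understanding, the overall structure can be sketched in PyTorch-style code as follows. This is a minimal illustrative skeleton under the assumption that each module is implemented as an `nn.Module` with the interfaces shown; it is not the patent's exact implementation:

```python
import torch.nn as nn

# High-level skeleton of the speech translation model of Fig. 2.
# The four submodules are assumed to be nn.Modules with the call
# signatures shown; initialization from pretrained ASR/MT weights
# is indicated only by comments.
class SpeechTranslationModel(nn.Module):
    def __init__(self, acoustic_enc, text_enc, semantic_dec, st_dec):
        super().__init__()
        self.acoustic_enc = acoustic_enc  # init from pretrained ASR encoder
        self.text_enc = text_enc          # init from pretrained MT encoder
        self.semantic_dec = semantic_dec  # Transformer decoder
        self.st_dec = st_dec              # init from pretrained MT decoder

    def forward(self, speech_features):
        # acoustic representation + CTC-based source text prediction
        h_sph, p_ctc = self.acoustic_enc(speech_features)
        # source language text representation
        h_text = self.text_enc(h_sph, p_ctc)
        # source language type text representation (downsampled to the
        # length of the corresponding source language text)
        h_semantic = self.semantic_dec(h_text)
        # target language text (token logits)
        return self.st_dec(h_semantic)
```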
Based on the above speech translation model, in another embodiment of the present application, a specific implementation manner of processing the source language speech in step S102 to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text is described, where the implementation manner specifically includes:
step S201: inputting the source language speech into a speech translation model.
The speech translation model is the speech translation model disclosed in the previous embodiment.
Step S202: and the acoustic encoder module performs acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice, and obtains a source language prediction text based on the acoustic representation of the source language voice.
In the above embodiment, it is explained that the acoustic encoder module includes an acoustic representation extracting unit and a source language text predicting unit, and then the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain an acoustic representation of the source language speech, and obtains a source language predicted text based on the acoustic representation of the source language speech, including:
the acoustic representation extraction unit is used for carrying out acoustic representation extraction on the source language voice to obtain an acoustic representation of the source language voice;
the source language text prediction unit obtains a source language prediction text based on the acoustic characterization of the source language speech.
It should be noted that, before performing acoustic representation extraction on the source language speech, the acoustic representation extraction unit needs to acquire an audio feature sequence of the source language speech, and then extract an acoustic representation of the source language speech from the audio feature sequence of the source language speech.
For ease of understanding, assume that the audio feature sequence of the source language speech is X = {x_1, x_2, ..., x_U}. The acoustic representation of the source language speech, H_sph = {h_1, h_2, ..., h_N}, can then be obtained according to the formula H_sph = Enc_sph(X), where U denotes the length of the audio feature sequence of the source language speech and N denotes the length of the acoustic representation of the source language speech.
Step S203: and the text encoder module obtains a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text.
The text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit and an encoding unit; the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text, and includes:
the mapping processing unit carries out mapping processing on the acoustic representation of the source language voice to obtain the mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit fuses the mapped features and the embedded features to obtain fused features;
and the coding unit codes the fused features to obtain a source language text representation corresponding to the source language speech.
The feature fusion unit may add the mapped feature and the embedded feature to obtain a fused feature.
For ease of understanding, assume that the acoustic representation of the source language speech is H_sph, the source language predicted text obtained from the acoustic representation of the source language speech is P_CTC, the fused feature is H_adaptor, and the source language text representation corresponding to the source language speech is H_text. The source language text representation corresponding to the source language speech can then be calculated based on the following formulas:

H_text = Enc_text(H_adaptor)

wherein:

H_adaptor = H_map + H_embed

H_map = ReLU(W_map · H_sph + b_map)

H_embed = W_embed · P_CTC

P_CTC = Softmax(W_ctc · H_sph + b_ctc)

where W_map and b_map are parameters of the mapping processing unit, W_ctc and b_ctc are parameters of the source language text prediction unit, and W_embed is a parameter of the embedding processing unit.
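For ease of understanding, the above formulas can be transcribed directly into PyTorch-style code. The dimensions and random parameters below are illustrative assumptions only; in the model they are learned parameters of the respective units:

```python
import torch
import torch.nn.functional as F

# Direct transcription of the formulas above. d_model, vocab and the
# sequence length n are illustrative; W_map/b_map, W_ctc/b_ctc and
# W_embed correspond to the parameters of the mapping processing unit,
# the source language text prediction unit and the embedding
# processing unit, respectively.
d_model, vocab, n = 256, 5000, 37
H_sph = torch.randn(1, n, d_model)               # acoustic representation
W_map = torch.randn(d_model, d_model); b_map = torch.zeros(d_model)
W_ctc = torch.randn(d_model, vocab);   b_ctc = torch.zeros(vocab)
W_embed = torch.randn(vocab, d_model)

P_ctc = F.softmax(H_sph @ W_ctc + b_ctc, dim=-1)  # source language prediction
H_map = F.relu(H_sph @ W_map + b_map)             # mapped features
H_embed = P_ctc @ W_embed                         # embedded features
H_adaptor = H_map + H_embed                       # feature fusion (addition)
# H_text = Enc_text(H_adaptor): Transformer encoder, omitted here
```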
Step S204: the semantic decoder module performs semantic decoding processing on the source language text representation to obtain a source language type text representation, where the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
It should be noted that the semantic decoder module may extract semantic information from the source language text representation, and obtain the source language type text representation by down-sampling to remove redundant noise information.
Step S205: and the voice translation decoder module decodes the source language type text representation to obtain a target language text.
For convenience of understanding, referring to fig. 3, fig. 3 is a schematic process diagram of processing a source language speech based on a speech translation model to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, which is disclosed in an embodiment of the present application.
In another embodiment of the present application, a detailed description is given of a training mode of the speech translation model.
The training mode of the speech translation model specifically comprises the following steps:
step S301: the method comprises the steps of obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein sentence end labels are marked on the source language text corresponding to the source language voice for training.
Step S302: and acquiring a pre-trained machine translation model.
Step S303: and inputting the source language text corresponding to the training source language voice into the pre-trained machine translation model to obtain the source language text characteristics output by the pre-trained machine translation model.
Step S304: and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
Specifically, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the training source language voice to obtain a source language type text representation corresponding to the training source language voice, and performs sentence tail prediction on the basis of the source language type text representation corresponding to the training source language voice to obtain a source language predicted text corresponding to the training source language voice, wherein the source language predicted text corresponding to the training source language voice comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
It should be noted that the semantic decoder module generates features in an autoregressive manner during semantic decoding, and therefore needs to predict the exact end-of-sentence (EOS) position. An inaccurate EOS position causes the generated features either to lose useful information or to retain redundant information. Therefore, in the present application, when training the speech translation model, an end-of-sentence prediction unit may be added to the semantic decoder module; after the end-of-sentence prediction unit performs end-of-sentence prediction, the source language predicted text corresponding to the source language speech for training is obtained.
As one implementable way, the predicted end-of-sentence position can be obtained by directly normalizing the predicted feature ŷ_t and determining the time index corresponding to the maximum value, where ŷ_t denotes the feature predicted by the semantic decoder module at decoding step t.
As another implementable way, a single linear layer may be used to project the features into the dimension of the source language dictionary. When the output of this projection network is classified as the end-of-sentence label, the semantic decoder module stops semantic decoding, yielding the source language predicted text corresponding to the source language speech for training. This approach predicts the end-of-sentence position more accurately.
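For ease of understanding, the second variant can be sketched as follows. The vocabulary size, EOS label id, step limit, and the `decoder_step` callable are illustrative assumptions:

```python
import torch.nn as nn

# Sketch of EOS prediction with a single linear projection into the
# source language dictionary dimension: decoding stops once the
# projection's top class is the end-of-sentence label.
d_model, vocab_src, EOS_ID, MAX_STEPS = 256, 5000, 2, 200
eos_proj = nn.Linear(d_model, vocab_src)

def semantic_decode(decoder_step, h_text):
    """decoder_step: hypothetical callable returning the feature
    predicted by the semantic decoder at step t (autoregressive)."""
    features = []
    for t in range(MAX_STEPS):
        y_t = decoder_step(h_text, features)      # shape (d_model,)
        if eos_proj(y_t).argmax().item() == EOS_ID:
            break                                 # predicted sentence end
        features.append(y_t)
    return features  # length tracks the source text sequence length
```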
In the present application, the end-of-sentence prediction unit enables down-sampling of the source language type text representation corresponding to the source language speech for training, keeping its length consistent with the real text sequence. At the same time, it reduces the difficulty of aligning the target language text sequence with the source language type text representation during learning. As a result, the trained semantic decoder module can remove redundant information from the source language speech, increase the density of semantic information in the source language type text representation corresponding to the source language speech, and effectively improve the quality of speech translation.
In addition, in the present application, the speech translation model may be trained by a multi-task learning method; that is, during training, a plurality of losses may be combined into a joint loss, with the weights of the losses summing to 1.
As an implementation manner, in the present application, the speech translation model may be trained with a first loss, a second loss, a third loss, and a fourth loss as a joint loss, and a sum of weights of the first loss, the second loss, the third loss, and the fourth loss is 1;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language voice for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
The first loss may be a CTC loss, the second loss an L1 loss, the third loss a CE (cross-entropy) loss, and the fourth loss a CE loss.
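For ease of understanding, the joint loss can be sketched as follows. The weight values are illustrative assumptions only (the scheme requires merely that they sum to 1), and each criterion is assumed to receive inputs shaped as PyTorch expects:

```python
import torch.nn as nn

# Sketch of the four-term joint training objective.
ctc_loss = nn.CTCLoss()        # loss 1: acoustic encoder's CTC prediction
l1_loss = nn.L1Loss()          # loss 2: text-like repr. vs. MT text features
ce_eos = nn.CrossEntropyLoss() # loss 3: semantic decoder's predicted text
ce_st = nn.CrossEntropyLoss()  # loss 4: target language predicted text

w1, w2, w3, w4 = 0.3, 0.2, 0.2, 0.3  # assumed weights, summing to 1

def joint_loss(ctc_args, l1_args, eos_args, st_args):
    return (w1 * ctc_loss(*ctc_args) + w2 * l1_loss(*l1_args)
            + w3 * ce_eos(*eos_args) + w4 * ce_st(*st_args))
```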
For understanding, referring to fig. 4, fig. 4 is a schematic diagram illustrating training of a speech translation model disclosed in an embodiment of the present application.
In summary, the scheme improves on the current state-of-the-art end-to-end speech translation technology. Using deep learning, multi-task learning and related techniques, a semantic decoder is embedded in the conventional end-to-end speech translation model to generate text-like features rich in semantic information; down-sampling of the text-like features is achieved through EOS prediction, keeping the length of the text-like feature sequence consistent with that of the real text sequence and effectively improving speech translation quality. Compared with the prior art, the scheme has the following advantages:
the text features of the machine translation model with excellent translation performance are used as external constraint targets, the capability of the text-like features containing semantic information coded by the voice translation model is improved, and therefore the quality of voice translation is effectively improved. Redundant information in the input voice signal of the voice translation model is removed through the EOS prediction network, the density of semantic information in the coded text-like features is improved, the introduction of a pre-trained machine translation model is facilitated, and the difficulty of a voice translation decoder in learning an alignment task is reduced.
The following describes a speech translation apparatus disclosed in an embodiment of the present application, and the speech translation apparatus described below and the speech translation method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech translation apparatus disclosed in the embodiment of the present application. As shown in fig. 5, the speech translation apparatus may include:
an obtaining unit 11, configured to obtain source language speech to be translated;
and the speech translation unit 12 is configured to process the source language speech to obtain a source language type text representation and decode the source language type text representation to obtain a target language text, where the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
As an implementable embodiment, the speech translation unit is specifically configured to:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module is used for performing acoustic representation extraction on the source language voice to obtain acoustic representation of the source language voice and obtaining a source language prediction text based on the acoustic representation of the source language voice;
the text encoder module is used for obtaining a source language text representation corresponding to the source language voice based on the acoustic representation of the source language voice and the source language predicted text;
the semantic decoder module is used for performing semantic decoding processing on the source language text representation to obtain a source language type text representation, where the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech;
and the voice translation decoder module is used for decoding the source language type text representation to obtain a target language text.
As an implementable embodiment, the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic characterization extraction unit is used for performing acoustic characterization extraction on the source language speech to obtain an acoustic characterization of the source language speech;
the source language text prediction unit is used for obtaining a source language prediction text based on the acoustic characterization of the source language voice.
As an implementable manner, the text encoder module includes a mapping processing unit, an embedding processing unit, a feature fusion unit and an encoding unit;
the mapping processing unit is used for mapping the acoustic representation of the source language speech to obtain mapped features;
the embedding processing unit is used for embedding the source language predicted text to obtain embedded characteristics;
the feature fusion unit is used for fusing the mapped features and the embedded features to obtain fused features;
and the coding unit is used for coding the fused features to obtain a source language text representation corresponding to the source language voice.
As an implementation, the speech translation model is trained as follows:
obtaining a source language voice for training, a source language text corresponding to the source language voice for training and a target language text corresponding to the source language voice for training, wherein the source language text corresponding to the source language voice for training is marked with a sentence end label;
acquiring a pre-trained machine translation model;
inputting the source language text corresponding to the training source language speech into the pre-trained machine translation model to obtain source language text characteristics output by the pre-trained machine translation model;
and training by taking the source language voice for training as a training sample, and taking a source language text corresponding to the source language voice for training, a target language text corresponding to the source language voice for training, source language text characteristics output by the pre-trained machine translation model and sentence end labels labeled by the source language text corresponding to the source language voice for training as sample labels to obtain the voice translation model.
As an implementation, in the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language voice for training to obtain acoustic representation of the source language voice for training, and obtains a source language predicted text corresponding to the source language voice for training based on the acoustic representation of the source language voice for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and a source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the source language voice for training to obtain a source language type text representation corresponding to the source language voice for training, and performs sentence tail prediction based on the source language type text representation corresponding to the source language voice for training to obtain a source language predicted text corresponding to the source language voice for training, wherein the source language predicted text corresponding to the source language voice for training comprises a predicted sentence tail position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
As an implementation manner, the training process of the speech translation model is trained by taking a first loss, a second loss, a third loss and a fourth loss as a joint loss;
wherein the first loss characterizes a difference between a source language predicted text corresponding to the source language speech for training and a source language text corresponding to the source language speech for training;
the second loss represents the difference between the source language type text representation corresponding to the source language speech for training and the source language text feature output by the machine translation model trained in advance;
the third loss represents a difference between the predicted text corresponding to the source language speech for training and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the training source language speech and the target language text corresponding to the training source language speech.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of a speech translation apparatus according to an embodiment of the present application, and referring to fig. 6, the hardware structure of the speech translation apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an Application Specific Integrated Circuit ASIC (Application Specific Integrated Circuit), or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring source language voice to be translated;
and processing the source language voice to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language voice.
Alternatively, the detailed function and the extended function of the program may refer to the above description.
Embodiments of the present application further provide a readable storage medium, which may store a program adapted to be executed by a processor, where the program is configured to:
acquiring source language voice to be translated;
and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
Alternatively, the detailed function and the extended function of the program may refer to the above description.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech translation, the method comprising:
acquiring source language voice to be translated;
and processing the source language speech to obtain a source language type text representation, and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of a source language text corresponding to the source language speech.
2. The method of claim 1, wherein processing the source language speech to obtain a source language-class text representation and decoding the source language-class text representation to obtain a target language text comprises:
inputting the source language speech into a speech translation model, wherein the speech translation model comprises an acoustic encoder module, a text encoder module, a semantic decoder module and a speech translation decoder module;
the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain an acoustic representation of the source language speech, and obtains a source language predicted text based on the acoustic representation of the source language speech;
the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text;
the semantic decoder module performs semantic decoding processing on the source language text representation to obtain a source language type text representation, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech;
and the speech translation decoder module decodes the source language type text representation to obtain a target language text.
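For orientation, the following toy sketch shows one way the four modules of claim 2 could be wired together. The layer choices (GRU, Transformer blocks), the dimensions, and the frame-level fusion are assumptions for illustration; the claim fixes only the module boundaries and the data flow, and the length adjustment it requires is only noted in a comment.

```python
# Illustrative wiring of the four modules in claim 2; layer choices and
# dimensions are assumptions, not claimed subject matter.
import torch
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    def __init__(self, feat_dim=80, d=256, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        self.acoustic_encoder = nn.GRU(feat_dim, d, batch_first=True)
        self.src_text_head = nn.Linear(d, src_vocab)       # source text prediction
        self.src_embed = nn.Embedding(src_vocab, d)
        self.text_encoder = nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.semantic_decoder = nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.st_decoder = nn.Linear(d, tgt_vocab)          # target text prediction

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        acoustic, _ = self.acoustic_encoder(speech)          # (B, T, d)
        src_pred = self.src_text_head(acoustic).argmax(-1)   # frame-level source ids
        # Text encoder: fuse the acoustic representation with the predicted text.
        src_text_repr = self.text_encoder(acoustic + self.src_embed(src_pred))
        # Semantic decoder: produce the source-language-type text representation.
        # Here the length stays at T for simplicity; the claim requires it to be
        # shrunk to the source transcript length (e.g. by collapsing frames).
        text_like = self.semantic_decoder(src_text_repr)
        return self.st_decoder(text_like)                    # target language logits

logits = SpeechTranslationModel()(torch.randn(2, 120, 80))
print(logits.shape)  # (2, 120, 1000)
```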
3. The method of claim 2, wherein the acoustic encoder module comprises an acoustic representation extraction unit and a source language text prediction unit;
the acoustic encoder module performs acoustic representation extraction on the source language speech to obtain an acoustic representation of the source language speech, and obtains a source language predicted text based on the acoustic representation of the source language speech, including:
the acoustic representation extraction unit performs acoustic representation extraction on the source language speech to obtain the acoustic representation of the source language speech;
the source language text prediction unit obtains the source language predicted text based on the acoustic representation of the source language speech.
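One plausible realization of claim 3's two units is sketched below: a recurrent extractor plus a CTC-style prediction head with greedy blank/repeat collapsing. CTC is an assumption here, since the claim does not name the prediction mechanism.

```python
# Sketch of claim 3's acoustic representation extraction unit and source
# language text prediction unit; the CTC-style greedy decode is an assumed
# realization, not claimed text.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, feat_dim=80, d=256, src_vocab=1000, blank_id=0):
        super().__init__()
        self.extract = nn.GRU(feat_dim, d, batch_first=True)  # representation unit
        self.predict = nn.Linear(d, src_vocab)                # text prediction unit
        self.blank_id = blank_id

    def forward(self, speech: torch.Tensor):
        acoustic, _ = self.extract(speech)              # (B, T, d)
        frame_ids = self.predict(acoustic).argmax(-1)   # (B, T) frame-level ids
        predicted_texts = []
        for seq in frame_ids:
            # Collapse repeated ids and drop blanks, CTC-style, so the
            # prediction approximates the source transcript length.
            collapsed = torch.unique_consecutive(seq)
            predicted_texts.append(collapsed[collapsed != self.blank_id])
        return acoustic, predicted_texts

acoustic, texts = AcousticEncoder()(torch.randn(1, 120, 80))
print(acoustic.shape, texts[0].shape)
```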
4. The method of claim 2, wherein the text encoder module comprises a mapping processing unit, an embedding processing unit, a feature fusion unit, and an encoding unit;
the text encoder module obtains a source language text representation corresponding to the source language speech based on the acoustic representation of the source language speech and the source language predicted text, including:
the mapping processing unit performs mapping processing on the acoustic representation of the source language speech to obtain mapped features;
the embedding processing unit performs embedding processing on the source language predicted text to obtain embedded features;
the feature fusion unit fuses the mapped features and the embedded features to obtain fused features;
and the encoding unit encodes the fused features to obtain the source language text representation corresponding to the source language speech.
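A small sketch of claim 4's four units follows. Fusion by elementwise addition and a single Transformer layer are illustrative assumptions; the claim names only the mapping, embedding, fusion, and encoding steps, and the sketch assumes both inputs share the same frame-level length.

```python
# Sketch of claim 4's text encoder module; fusion by addition and a single
# Transformer encoder layer are assumptions for illustration.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, d=256, src_vocab=1000):
        super().__init__()
        self.mapping = nn.Linear(d, d)               # mapping processing unit
        self.embedding = nn.Embedding(src_vocab, d)  # embedding processing unit
        self.encoder = nn.TransformerEncoderLayer(d, 4, batch_first=True)  # encoding unit

    def forward(self, acoustic_repr, src_pred_ids):
        mapped = self.mapping(acoustic_repr)     # mapped acoustic features
        embedded = self.embedding(src_pred_ids)  # embedded predicted source text
        fused = mapped + embedded                # feature fusion unit (assumed: add)
        return self.encoder(fused)               # source language text representation

out = TextEncoder()(torch.randn(1, 120, 256), torch.randint(0, 1000, (1, 120)))
print(out.shape)  # (1, 120, 256)
```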
5. The method of claim 2, wherein the speech translation model is trained as follows:
obtaining source language speech for training, a source language text corresponding to the source language speech for training, and a target language text corresponding to the source language speech for training, wherein the source language text corresponding to the source language speech for training is labeled with a sentence end label;
obtaining a pre-trained machine translation model;
inputting the source language text corresponding to the source language speech for training into the pre-trained machine translation model to obtain source language text features output by the pre-trained machine translation model;
and training by taking the source language speech for training as a training sample, and taking the source language text corresponding to the source language speech for training, the target language text corresponding to the source language speech for training, the source language text features output by the pre-trained machine translation model, and the sentence end label of the source language text corresponding to the source language speech for training as sample labels, to obtain the speech translation model.
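The label construction in claim 5 amounts to running the frozen pre-trained machine translation model over the source transcript (carrying a sentence end label) and keeping its text features as one of the supervision targets. The sketch below assumes the MT model exposes its encoder features directly; that interface is hypothetical.

```python
# Sketch of the sample-label construction in claim 5; the mt_encoder
# interface is a hypothetical assumption for illustration.
import torch
import torch.nn as nn

@torch.no_grad()
def build_sample_labels(mt_encoder: nn.Module, src_text_ids: torch.Tensor,
                        eos_id: int) -> dict:
    # Mark the sentence end of the source transcript.
    eos = torch.full((src_text_ids.size(0), 1), eos_id, dtype=src_text_ids.dtype)
    src_with_eos = torch.cat([src_text_ids, eos], dim=1)
    # The frozen pre-trained MT model yields the source language text features
    # that the semantic decoder's output will later be trained to match.
    mt_text_features = mt_encoder(src_with_eos)
    return {"src_text_with_eos": src_with_eos, "mt_text_features": mt_text_features}

mt_encoder = nn.Sequential(nn.Embedding(1000, 256))  # stand-in for a real MT encoder
labels = build_sample_labels(mt_encoder, torch.randint(1, 1000, (2, 10)), eos_id=2)
print(labels["mt_text_features"].shape)  # (2, 11, 256)
```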
6. The method of claim 5, wherein during the speech translation model training process:
the acoustic encoder module performs acoustic representation extraction on the source language speech for training to obtain an acoustic representation of the source language speech for training, and obtains a source language predicted text corresponding to the source language speech for training based on the acoustic representation of the source language speech for training;
the text encoder module obtains a source language text representation corresponding to the source language speech for training based on the acoustic representation of the source language speech for training and the source language predicted text corresponding to the source language speech for training;
the semantic decoder module performs semantic decoding on the source language text representation corresponding to the source language speech for training to obtain a source language type text representation corresponding to the source language speech for training, and performs sentence end prediction based on the source language type text representation corresponding to the source language speech for training to obtain a source language predicted text corresponding to the source language speech for training, wherein the source language predicted text corresponding to the source language speech for training comprises a predicted sentence end position;
and the speech translation decoder module decodes the source language type text representation corresponding to the source language speech for training to obtain a target language predicted text corresponding to the source language speech for training.
7. The method of claim 6, wherein the speech translation model is trained with a first loss, a second loss, a third loss, and a fourth loss as a joint loss;
wherein the first loss characterizes a difference between the source language predicted text obtained by the acoustic encoder module for the source language speech for training and the source language text corresponding to the source language speech for training;
the second loss characterizes a difference between the source language type text representation corresponding to the source language speech for training and the source language text features output by the pre-trained machine translation model;
the third loss characterizes a difference between the source language predicted text obtained by the semantic decoder module, which includes the predicted sentence end position, and the source language text corresponding to the source language speech for training;
the fourth loss characterizes a difference between the target language predicted text corresponding to the source language speech for training and the target language text corresponding to the source language speech for training.
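The four losses in claim 7 can be combined as a simple sum, as sketched below. The concrete loss functions (CTC for the source prediction, MSE for the feature-matching term, cross-entropy for the sentence-end-aware source prediction and for the translation) and the uniform weighting are assumptions; the claim only states which difference each loss characterizes.

```python
# Sketch of the joint loss in claim 7; the specific loss functions and the
# equal weighting are assumptions for illustration.
import torch.nn.functional as F

def joint_loss(src_log_probs, src_text, input_lens, target_lens,   # first loss
               text_like_repr, mt_text_features,                   # second loss
               sem_dec_logits, src_text_with_eos,                  # third loss
               st_logits, tgt_text):                               # fourth loss
    # 1) acoustic encoder's source text prediction vs. the source transcript
    #    (CTC assumed; src_log_probs is (T, B, V) log-softmax output).
    l1 = F.ctc_loss(src_log_probs, src_text, input_lens, target_lens)
    # 2) source-language-type text representation vs. the pre-trained MT
    #    model's text features (feature matching; MSE assumed).
    l2 = F.mse_loss(text_like_repr, mt_text_features)
    # 3) semantic decoder's source prediction, including the sentence end
    #    position, vs. the transcript carrying its sentence end label.
    l3 = F.cross_entropy(sem_dec_logits.transpose(1, 2), src_text_with_eos)
    # 4) predicted target text vs. the reference target text.
    l4 = F.cross_entropy(st_logits.transpose(1, 2), tgt_text)
    return l1 + l2 + l3 + l4
```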
8. An apparatus for speech translation, the apparatus comprising:
an acquisition unit configured to acquire source language speech to be translated;
and the speech translation unit is used for processing the source language speech to obtain a source language type text representation and decoding the source language type text representation to obtain a target language text, wherein the sequence length of the source language type text representation is consistent with that of the source language text corresponding to the source language speech.
9. A speech translation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program to implement the steps of the speech translation method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech translation method according to any one of claims 1 to 7.
CN202211682639.3A 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium Pending CN115983203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211682639.3A CN115983203A (en) 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211682639.3A CN115983203A (en) 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115983203A true CN115983203A (en) 2023-04-18

Family

ID=85969549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211682639.3A Pending CN115983203A (en) 2022-12-27 2022-12-27 Voice translation method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115983203A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416969A (en) * 2023-06-09 2023-07-11 深圳市江元科技(集团)有限公司 Multi-language real-time translation method, system and medium based on big data

Similar Documents

Publication Publication Date Title
CN108899013B (en) Voice search method and device and voice recognition system
CN113283244B (en) Pre-training model-based bidding data named entity identification method
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110472255B (en) Neural network machine translation method, model, electronic terminal, and storage medium
CN116884391B (en) Multimode fusion audio generation method and device based on diffusion model
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN112446221B (en) Translation evaluation method, device, system and computer storage medium
CN115983203A (en) Voice translation method, device, equipment and readable storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115881104A (en) Speech recognition method, device and storage medium based on hot word coding
CN115273830A (en) Method, device and equipment for stream type speech recognition and model training
CN112668346B (en) Translation method, device, equipment and storage medium
CN112036122B (en) Text recognition method, electronic device and computer readable medium
JP2021524095A (en) Text-level text translation methods and equipment
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN113643694A (en) Voice recognition method and device, electronic equipment and storage medium
CN113889087B (en) Speech recognition and model establishment method, device, equipment and storage medium
CN112735392B (en) Voice processing method, device, equipment and storage medium
CN114283786A (en) Speech recognition method, device and computer readable storage medium
CN111048065A (en) Text error correction data generation method and related device
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium
CN115081459B (en) Spoken language text generation method, device, equipment and storage medium
CN113688309B (en) Training method for generating model and generation method and device for recommendation reason
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN113011176A (en) Language model training and language reasoning method, device and computer storage medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination