WO2022057637A1 - Speech translation method, apparatus, device and storage medium - Google Patents

Speech translation method, apparatus, device and storage medium

Publication number
WO2022057637A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech
decoder
encoder
translation model
Prior art date
Application number
PCT/CN2021/116232
Other languages
English (en)
French (fr)
Inventor
李磊
王明轩
董倩倩
赵程绮
Original Assignee
北京字节跳动网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Priority to US18/245,802 (US20240028841A1)
Publication of WO2022057637A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present application relates to the field of computer technology, for example, to a speech translation method, apparatus, device and storage medium.
  • With the continuous development of neural networks and the explosive growth of data, end-to-end speech translation technology came into being. End-to-end speech translation establishes a mapping between the source language speech signal and the target language text, and thereby translates the original speech directly into the target text. However, the prediction performance of end-to-end speech translation models still falls short of expectations.
  • the present application provides a speech translation method, apparatus, device and storage medium to improve the prediction performance of speech translation.
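  • A speech translation method, including:
  • extracting the semantic features of the speech to be processed through the encoder of an end-to-end speech translation model;
  • decoding, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic features from the semantic features;
  • decoding, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, wherein the text sequence includes the source language text and the target language text corresponding to the source language text; and
  • splitting the text sequence to obtain the target language text corresponding to the speech to be processed.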
  • a voice translation device comprising:
  • an encoding module configured to extract the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model
  • a first decoding module configured to decode the source language text corresponding to the semantic feature from the semantic feature through the decoder of the end-to-end speech translation model
  • a second decoding module configured to decode the semantic feature according to the source language text through the decoder of the end-to-end speech translation model to obtain a text sequence corresponding to the semantic feature, wherein the text sequence includes the source language text and the target language text corresponding to the source language text;
  • the splitting module is configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
  • An electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor implements the above-mentioned speech translation method provided by the present application when the computer program is executed.
  • a computer-readable storage medium is also provided, storing a computer program, and when the computer program is executed by a processor, the above-mentioned speech translation method provided by the present application is implemented.
  • FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the principle of a speech translation process provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another speech translation method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a training process of an end-to-end speech translation model provided by an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of a speech translation apparatus provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” means “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Speech-to-text translation typically uses a pipelined system of automatic speech recognition and machine translation.
  • pipelined systems suffer from long delays, parameter redundancy, error accumulation, and loss of speech features.
  • end-to-end speech translation technology has received extensive attention.
  • This end-to-end speech translation can directly translate speech in the form of the source language into text in the form of the target language, effectively avoiding the technical problems existing in the pipeline system.
  • end-to-end speech translation still faces many problems, and the prediction performance of end-to-end speech translation still falls short of expectations. Therefore, the technical solutions provided by the embodiments of the present application can improve the prediction performance of the end-to-end speech translation model.
  • the execution subject of the following method embodiments may be a speech translation apparatus, and the apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of software and hardware.
  • the electronic device may be a client, including but not limited to a smart phone, a tablet computer, an e-book reader, a vehicle terminal, and the like.
  • the electronic device may also be an independent server or a server cluster, and the embodiment of the present application does not limit the form of the electronic device.
  • the following method embodiments are described by taking the execution subject being an electronic device as an example.
  • FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application. This embodiment relates to the process of how the electronic device translates the speech in the form of the source language into the text in the form of the target language based on the end-to-end speech translation model. As shown in Figure 1, the method may include the following steps.
  • the speech to be processed is the speech that needs to be translated into text.
  • the speech to be processed can be in any source language, and the translated text is in a target language different from the source language. For example, if the source language is English, the target language can be French.
  • the electronic device also needs to acquire the to-be-processed speech in the source language.
  • the electronic device can select from a database the to-be-processed speech that needs to be translated into text, or can obtain the to-be-processed speech input by the user through translation software installed on the electronic device. This embodiment does not limit how the to-be-processed speech is acquired.
  • the above-mentioned end-to-end speech translation model may be a pre-trained multi-layer neural network, and the end-to-end speech translation model may include an encoder and a decoder.
  • the encoder and decoder can have various network structures, for example a Recurrent Neural Network (RNN).
  • the encoder and decoder can also be other forms of network structures, such as Convolutional Neural Networks (CNN).
  • the encoder can perform feature extraction on the input content to obtain feature vectors.
  • the electronic device inputs the speech to be processed into the end-to-end speech translation model, and extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model.
  • the semantic feature contains all the information of the to-be-processed speech, as a high-dimensional intermediate representation of the to-be-processed speech.
  • the electronic device may also perform spectrogram, log-Mel filter bank, discrete cosine transform and other processing on the to-be-processed speech, thereby extracting the Mel cepstrum.
  • the low-dimensional audio features obtained are used as the input of the end-to-end speech translation model.
  • the low-dimensional audio features are encoded by the encoder of the end-to-end speech translation model to obtain the semantic features of the speech to be processed.
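  • The last step of the feature chain described above, the discrete cosine transform that turns log-Mel filter bank energies into cepstral coefficients, can be sketched as follows. This is only an illustrative sketch: the function name `dct_ii`, the number of coefficients and the energy values are assumptions, and the spectrogram and filter bank stages are taken as already done.

```python
import math

def dct_ii(log_mel_energies, num_ceps):
    """Unnormalized type-II DCT of one frame of log-Mel filter bank energies."""
    n = len(log_mel_energies)
    return [
        sum(e * math.cos(math.pi * k * (i + 0.5) / n)
            for i, e in enumerate(log_mel_energies))
        for k in range(num_ceps)
    ]

# One frame of made-up Mel filter bank energies (one value per filter).
mel_energies = [4.0, 2.0, 1.0, 0.5, 0.25, 0.125]
log_mel = [math.log(e) for e in mel_energies]

# Keep the first few coefficients as the low-dimensional audio feature of the frame.
cepstrum = dct_ii(log_mel, num_ceps=4)
```

  • Keeping only the first few coefficients is what makes the resulting audio features low-dimensional; how many to keep is a design choice that the source does not specify.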
  • the encoder of the end-to-end speech translation model performs feature extraction on the input content to obtain a feature vector, and the decoder decodes the feature vector according to the context information to obtain the corresponding output text.
  • the embodiment of the present application introduces a continuous decoding mechanism, that is, in the decoding process, the decoder first predicts the relatively simple source language text, and then predicts the relatively difficult target language text.
  • the electronic device inputs the semantic features into the decoder of the end-to-end speech translation model, and the decoder decodes the semantic features to obtain the corresponding source language text.
  • the electronic device inputs the speech to be processed into the end-to-end speech translation model 201, extracts the semantic features of the speech to be processed through the encoder 202 of the end-to-end speech translation model 201, and inputs the obtained semantic features into the decoder 203 of the end-to-end speech translation model 201.
  • the source language text “see you” corresponding to the semantic features is decoded from the semantic features.
  • the decoder of the end-to-end speech translation model continues to perform secondary decoding on the semantic features.
  • the electronic device can then use the source language text as a reference for subsequent decoding, and continue to decode the semantic features of the to-be-processed speech according to the source-language text, thereby obtaining a text sequence corresponding to the to-be-processed speech.
  • the text sequence includes the source language text and the target language text corresponding to the source language text, and the source language text and the target language text are connected by a task identifier.
  • the continuous decoding mechanism introduced in the embodiment of the present application relieves the decoding pressure of the decoder.
  • because the source language text corresponding to the to-be-processed speech is already known, continuing to decode the semantic features of the to-be-processed speech in combination with that source language text can improve decoding accuracy.
  • the electronic device uses the source language text “see you” decoded by the decoder 203 as a reference for decoding the semantic features, and performs secondary decoding on the semantic features through the decoder 203 in combination with the source language text “see you”, so as to decode the text sequence “see you<st>Au revoir” corresponding to the speech to be processed.
  • the electronic device can split the text sequence output by the decoder according to the task identifier, so as to obtain the target language text corresponding to the speech to be processed.
  • for example, if the connection identifier between the source language text and the target language text is “<st>”, the electronic device can split the text sequence “see you<st>Au revoir” at the connection identifier, so that the to-be-processed speech “see you” is translated into the target language text “Au revoir”.
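  • The splitting step can be sketched in a few lines. The sketch assumes the task identifier is the literal string `<st>` and that it occurs once in the decoder output; the function name `split_text_sequence` is hypothetical.

```python
TASK_ID = "<st>"  # assumed literal form of the task (connection) identifier

def split_text_sequence(text_sequence, task_id=TASK_ID):
    """Return (source_text, target_text), split at the first task identifier."""
    source, sep, target = text_sequence.partition(task_id)
    if not sep:
        # The decoder output did not contain the identifier at all.
        raise ValueError(f"task identifier {task_id!r} not found")
    return source.strip(), target.strip()

source_text, target_text = split_text_sequence("see you<st>Au revoir")
```

  • Only `target_text` is returned to the user as the translation; `source_text` is a by-product that can be kept as a transcript.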
  • the electronic device extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, decodes the source language text corresponding to the semantic features through the decoder of the end-to-end speech translation model, decodes the semantic features again through the decoder according to the source language text to obtain a text sequence corresponding to the semantic features, and splits the text sequence to obtain the target language text corresponding to the speech to be processed.
  • the decoder of the end-to-end speech translation model can first decode the source language text from the semantic features, and then continue to decode the semantic features based on the known source language text.
  • the relatively simple source language text is predicted first, and then the relatively difficult target language text, which relieves the decoding pressure of the decoder. In addition, when the target language text is predicted, the corresponding source language text is already known, which improves the decoding performance of the decoder and thereby the prediction performance of the end-to-end speech translation model.
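  • The control flow of the continuous decoding mechanism can be sketched as two greedy decoding passes over the same semantic features. Everything named here is an assumption made for illustration: `toy_decoder_step` is a stub that replays a fixed script instead of a trained decoder, and `<st>` and `<eos>` stand in for the task identifier and an end-of-sequence token.

```python
ST, EOS = "<st>", "<eos>"

def toy_decoder_step(semantic_features, prefix):
    """Hypothetical stand-in for one decoder step; returns the next token.

    A real decoder would attend over semantic_features and the prefix. This stub
    replays a fixed sequence so that the control flow can actually be run.
    """
    script = ["see", "you", ST, "Au", "revoir", EOS]
    return script[len(prefix)]

def decode_until(semantic_features, prefix, stop_token,
                 step=toy_decoder_step, max_len=50):
    """Greedily extend prefix until stop_token is produced (stop token included)."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        token = step(semantic_features, tokens)
        tokens.append(token)
        if token == stop_token:
            break
    return tokens

features = [0.0] * 4  # placeholder for the encoder output

# First pass: decode the source language text, ending at the task identifier.
source_part = decode_until(features, [], ST)
# Second pass: keep decoding the same features, conditioned on the known source text.
text_sequence = decode_until(features, source_part, EOS)
```

  • In the second pass the already decoded source tokens serve as the prefix, which is the sense in which the source language text acts as a reference for predicting the target language text.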
  • the encoder includes a first encoder and a second encoder. As shown in FIG. 3 , the foregoing S101 may include the following steps.
  • the encoder of the end-to-end speech translation model may include a first encoder and a second encoder to extract different features of the input speech to be processed, respectively.
  • the first encoder is used to extract acoustic features of the speech to be processed
  • the second encoder is used to extract semantic features of the speech to be processed.
  • the network structures of the first encoder and the second encoder can be constructed according to actual needs.
  • both the first encoder and the second encoder may be an RNN network.
  • both the first encoder and the second encoder may be implemented by a Transformer network.
  • the Transformer network can include multiple layers of multi-head self-attention modules, and can also include a linear layer and a softmax layer.
  • the electronic device inputs the to-be-processed speech to the first encoder, and performs acoustic encoding of the to-be-processed speech through the multi-head self-attention module of the first encoder, thereby extracting a high-dimensional acoustic feature representation of the to-be-processed speech.
  • the electronic device may further perform down-sampling and linear layer processing on the audio features of the to-be-processed speech.
  • downsampling refers to reducing the dimension of the input audio features in the time domain.
  • manual downsampling can be used for downsampling, that is, one frame is taken for every three frames of audio.
  • Other downsampling manners may also be used, which are not limited in this embodiment.
  • the linear layer can map the frequency domain feature dimension of the downsampled audio to the hidden layer dimension of the network.
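  • The downsampling and linear layer described above can be sketched as follows, taking one frame out of every three and then projecting each kept frame to the hidden dimension. The weights, dimensions and function names are made up for illustration; a trained model would learn the projection.

```python
def downsample(frames, stride=3):
    """Keep one frame out of every `stride` frames (reduction along time)."""
    return frames[::stride]

def linear(frame, weight, bias):
    """Map one frame from the feature dimension to the hidden dimension."""
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weight, bias)]

# Seven frames with two features each (toy values).
frames = [[float(t), float(t) * 2.0] for t in range(7)]

# Hypothetical 2 -> 3 projection.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

hidden = [linear(f, weight, bias) for f in downsample(frames)]
```

  • Seven input frames become three, so the first encoder attends over a sequence roughly one third as long.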
  • Blank frames and repeated frames may exist in the acoustic features of the speech to be processed extracted by the first encoder.
  • a blank frame can be understood as an audio frame without content information
  • a repeated frame can be understood as an audio frame with repeated content information.
  • an acoustic unit shrinkage layer can be added between the first encoder and the second encoder. The acoustic unit shrinkage layer detects the blank frames and repeated frames in the acoustic features, removes the detected blank frames, and merges the repeated frames to obtain the shrunk acoustic features.
  • the shrunk acoustic features are much shorter than the acoustic features before processing, which makes it easier for the subsequent second encoder to extract high-level semantic features.
  • the electronic device may detect blank frames and repeated frames in the acoustic features based on the spike characteristic of the probability distribution of the Connectionist Temporal Classification (CTC) loss function.
  • CTC introduces a blank label (meaning the frame has no predicted value). Each predicted class corresponds to a spike in the whole utterance, and all positions that are not spikes are treated as blank.
  • the final output of CTC is a sequence of spikes, and it does not care how long each phoneme lasts. Therefore, for the speech to be processed, the electronic device can detect blank frames and repeated frames in the acoustic features of the speech to be processed based on the spike characteristic of the CTC probability distribution.
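  • One way to realize the shrinking step is sketched below: each frame takes the label with the highest CTC probability, frames labeled blank are dropped, and consecutive frames with the same non-blank label are merged. Merging by averaging and using index 0 for the blank are assumptions of this sketch; the source only says that repeated frames are merged.

```python
BLANK = 0  # index of the CTC blank label (assumed to be 0 here)

def shrink(acoustic_features, ctc_probs, blank=BLANK):
    """Drop blank frames and average runs of frames sharing the same CTC label."""
    # Per-frame label = argmax of that frame's CTC probability distribution.
    labels = [max(range(len(p)), key=p.__getitem__) for p in ctc_probs]
    shrunk, run, prev = [], [], None
    for frame, label in zip(acoustic_features, labels):
        if label != prev and run:
            # The label changed: close the current run by averaging its frames.
            shrunk.append([sum(col) / len(run) for col in zip(*run)])
            run = []
        if label != blank:
            run.append(frame)
        prev = label
    if run:
        shrunk.append([sum(col) / len(run) for col in zip(*run)])
    return shrunk

# Five frames of toy acoustic features and their CTC posteriors over
# {blank, phoneme 1, phoneme 2}.
features = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [4.0, 4.0]]
probs = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # phoneme 1
    [0.20, 0.70, 0.10],  # phoneme 1 again (repeated frame)
    [0.80, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.80],  # phoneme 2
]
shrunk = shrink(features, probs)
```

  • Here five frames shrink to two, one per detected phoneme. A blank between two identical labels keeps them as separate units, which matches how CTC distinguishes a repeated phoneme from a long one.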
  • after the acoustic features of the speech to be processed are shrunk, the electronic device inputs the shrunk acoustic features into the second encoder, so as to extract the high-level semantic features from the shrunk acoustic features through the second encoder.
  • the second encoder may include multi-layer self-attention modules, and high-level semantic features in the shrunk acoustic features are extracted through the stacked multi-layer self-attention modules.
  • the electronic device may shrink the acoustic features of the speech to be processed extracted by the first encoder, that is, remove the blank frames and merge the repeated frames in the acoustic features. This reduces the interference of blank frames and repeated frames and makes it easier for the second encoder to extract high-level semantic features, which improves the encoding performance of the encoder and thereby the prediction performance of the end-to-end speech translation model.
  • a training process for an end-to-end speech translation model is also provided, which can make full use of machine translation data with rich data sources, thereby improving the decoding performance of the decoder.
  • the training process of the end-to-end speech translation model may include the following steps.
  • although speech recognition parallel data can improve the prediction performance of the end-to-end speech translation model, such data is scarce, so training the end-to-end speech translation model is very time-consuming and labor-intensive, and the prediction performance still falls short of requirements. Machine translation parallel data, however, is available in large quantities, and how to use it to train an end-to-end speech translation model is a technical issue worth considering.
  • the decoder of the end-to-end speech translation model introduces a continuous decoding mechanism, that is, the decoder first decodes the source language samples in the semantic features, and then decodes the text sequences corresponding to the semantic features based on the source language samples. Therefore, under the continuous decoding structure of the decoder, the machine translation data can be fully utilized, so that the electronic device can pre-train the decoder of the end-to-end speech translation model based on the machine translation data (i.e., text translation samples).
  • the above-mentioned text translation samples may include source language sample text and target language sample text corresponding to the source language sample text.
  • the decoder of the end-to-end speech translation model is pre-trained through a large number of source language sample texts and target language sample texts corresponding to the source language sample texts, so that the decoder converges better.
  • the electronic device pre-trains the decoder of the end-to-end speech translation model according to the text translation samples, and the process of obtaining the initial decoder may be the following steps.
  • S4011: splice the source language sample text and the target language sample text to obtain a spliced sample sequence.
  • the electronic device may splice the source language sample text and the target language sample text through a preset task identifier to obtain a spliced sample sequence.
  • the masked cross-entropy loss function is used to mask the prediction loss of the source language prediction text corresponding to the all-zero vector.
  • the pre-training stage has no corresponding audio features as input.
  • an all-zero vector can be used as the output of the encoder of the end-to-end speech translation model, that is, the all-zero vector is input into the decoder in place of the semantic features of the speech to be processed.
  • the masked cross-entropy loss function can be used as the decoder's objective function to mask out the prediction loss of the source language prediction text corresponding to the all-zero vector.
  • the electronic device can input the source language sample text and the all-zero vector into the decoder of the end-to-end speech translation model, predict the predicted sample sequence corresponding to the all-zero vector according to the source language sample text, use the masked cross-entropy loss function to calculate the loss value between the predicted sample sequence and the spliced sample sequence, and adjust the parameters of the decoder based on the loss value until the convergence condition of the masked cross-entropy loss function is reached, thereby obtaining the initial decoder.
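  • The pre-training objective can be sketched as follows: the source and target sample texts are spliced with the task identifier, and the cross-entropy is averaged only over the positions that are not masked. The sketch assumes that the mask covers the source language part up to and including the identifier, that the identifier is the literal `<st>`, and that the decoder probabilities are given; all function names and numbers are illustrative.

```python
import math

TASK_ID = "<st>"  # assumed literal form of the preset task identifier

def splice(source_tokens, target_tokens, task_id=TASK_ID):
    """Join source and target sample text with the task identifier."""
    return list(source_tokens) + [task_id] + list(target_tokens)

def loss_mask(spliced, task_id=TASK_ID):
    """0 for the source language part (identifier included), 1 for the target part."""
    boundary = spliced.index(task_id)
    return [0 if i <= boundary else 1 for i in range(len(spliced))]

def masked_cross_entropy(token_probs, mask):
    """Mean negative log-likelihood over the unmasked positions only.

    token_probs[i] is the probability the decoder assigned to the reference
    token at position i of the spliced sample sequence.
    """
    kept = [-math.log(p) for p, m in zip(token_probs, mask) if m]
    return sum(kept) / len(kept)

spliced = splice(["see", "you"], ["Au", "revoir"])
mask = loss_mask(spliced)

# Made-up decoder probabilities for the five reference tokens.
probs = [0.01, 0.02, 0.03, 0.5, 0.25]
loss = masked_cross_entropy(probs, mask)
```

  • The low probabilities on the first three positions do not affect `loss`, which reflects the idea that the decoder should not be penalized for failing to predict source text from an all-zero encoder output.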
  • the electronic device adopts a joint optimization method to train the initialized end-to-end speech translation model as a whole. Since the end-to-end speech translation model includes an encoder and a decoder, and the encoder includes a first encoder and a second encoder, the loss value of the loss function of the end-to-end speech translation model is the weighted sum of the first loss value corresponding to the first encoder, the second loss value corresponding to the second encoder, and the third loss value corresponding to the decoder. The parameters of the end-to-end speech translation model are adjusted based on this weighted sum until the convergence condition of the loss function is reached, thereby obtaining the trained end-to-end speech translation model.
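  • The joint objective is a plain weighted sum of the three loss values. In the sketch below the weights are hypothetical placeholders, since the source does not give their values.

```python
def joint_loss(first_loss, second_loss, third_loss, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of the first encoder, second encoder and decoder losses.

    The default weights are illustrative only; they are not taken from the source.
    """
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss

total = joint_loss(first_loss=2.0, second_loss=4.0, third_loss=8.0)
```

  • Because a single scalar is back-propagated, every component of the model receives gradients from all three terms that depend on it.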
  • the training process of the first encoder may be: using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder, taking the sample phoneme sequence corresponding to the sample speech as the expected output, and training the first encoder based on the Connectionist Temporal Classification (CTC) loss function.
  • the training sample set of the end-to-end speech translation model includes multiple training samples, and each training sample includes sample speech, a sample phoneme sequence corresponding to the sample speech, and a sample text sequence corresponding to the sample speech.
  • the electronic device can input the sample speech in the training sample into the first encoder, and obtain the actual output of the first encoder, that is, obtain the actual phoneme sequence.
  • the electronic device can use the sample phoneme sequence corresponding to the sample speech in the training sample as the expected output of the first encoder, and calculate the difference between the actual output and the expected output of the first encoder based on the CTC loss function as the first loss value of the end-to-end speech translation model. The parameters of the first encoder are then adjusted based on the weighted sum of the first loss value, the second loss value and the third loss value, so as to train the first encoder.
  • the CTC loss function is introduced as an auxiliary supervision signal for training, and the output of the softmax layer of the first encoder is supervised by the CTC loss function; at the same time, the phoneme sequence is used as the optimization target of the CTC loss function.
  • the phoneme sequence is used because phonemes involve fewer modeling units and are selected according to the pronunciation dictionary, so they are closer to the pronunciation information of the speech. This makes it easier for the model to learn the mapping from speech to phonemes and to capture more of the acoustic information in the sample speech, which improves the encoding performance of the encoder.
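  • For readers unfamiliar with how a CTC loss compares frame-level softmax outputs with a shorter phoneme sequence, the standard CTC forward algorithm is sketched below. It is the textbook formulation and is not taken from the source; the blank index 0 and the toy probabilities are assumptions.

```python
import math

def ctc_neg_log_likelihood(probs, labels, blank=0):
    """-log P(labels | probs) computed with the CTC forward algorithm.

    probs[t][k] is the softmax probability of symbol k at frame t;
    labels is the target phoneme index sequence without blanks.
    """
    # Extended target: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank between two labels is allowed only if they differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

# Three frames over {blank, phoneme 1, phoneme 2}; target phoneme sequence [1, 2].
frame_probs = [[0.1, 0.8, 0.1], [0.6, 0.2, 0.2], [0.1, 0.1, 0.8]]
first_loss = ctc_neg_log_likelihood(frame_probs, [1, 2])
```

  • The function raises `ValueError` when no alignment is possible (for example, more phonemes than frames), because the likelihood is then zero. A production system would normally use a log-space implementation from a deep learning library instead.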
  • the electronic device may pre-train the decoder of the end-to-end speech translation model based on the continuous decoding mechanism of the decoder, initialize the decoder of the end-to-end speech translation model with the initial decoder obtained after pre-training, and then train the initialized end-to-end speech translation model.
  • the parallel data of machine translation with a large number of samples can be fully utilized, thereby improving the decoding performance of the decoder and the prediction performance of the end-to-end speech translation model.
  • it also avoids the long and slow learning phase in the early stage of model training, which greatly reduces the model training time and improves the training efficiency of the end-to-end speech translation model.
  • FIG. 5 is a schematic structural diagram of a speech translation apparatus provided by an embodiment of the present application.
  • the apparatus may include: an encoding module 501, a first decoding module 502, a second decoding module 503 and a splitting module 504.
  • the encoding module 501 is configured to extract the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model;
  • the first decoding module 502 is configured to decode the source language text corresponding to the semantic feature from the semantic feature through the decoder of the end-to-end speech translation model;
  • the second decoding module 503 is configured to decode the semantic feature according to the source language text through the decoder of the end-to-end speech translation model to obtain a text sequence corresponding to the semantic feature, where the text sequence includes the source language text and the target language text corresponding to the source language text;
  • the splitting module 504 is configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
  • the electronic device extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, decodes the source language text corresponding to the semantic features from the semantic features through the decoder of the end-to-end speech translation model, decodes the semantic features according to the source language text through the decoder to obtain a text sequence corresponding to the semantic features, and splits the text sequence to obtain the target language text corresponding to the speech to be processed.
  • the decoder of the end-to-end speech translation model can first decode the source language text from the semantic features, and then continue to decode the semantic features based on the known source language text.
  • the relatively simple source language text is predicted first, and then the relatively difficult target language text is predicted, which relieves the decoding pressure on the decoder; and when the target language text is predicted, the corresponding source language text is already known, which improves the decoding performance of the decoder and thereby the prediction performance of the end-to-end speech translation model.
  • the encoder includes a first encoder and a second encoder
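  • the encoding module 501 may include: a first encoding unit, an acoustic shrinking unit, and a second encoding unit;
  • the first encoding unit is configured to extract acoustic features of the speech to be processed through the first encoder;
  • the acoustic shrinking unit is configured to detect blank frames and repeated frames in the acoustic features, remove the blank frames, and merge the repeated frames to obtain shrunk acoustic features;
  • the second encoding unit is configured to extract semantic features from the shrunk acoustic features through the second encoder.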
  • the pre-training module is set to pre-train the decoder of the end-to-end speech translation model according to the text translation samples to obtain an initial decoder;
  • the initialization module is set to initialize the decoder of the end-to-end speech translation model based on the initial decoder;
  • the text translation samples include source language sample text and target language sample text corresponding to the source language sample text.
  • the pre-training module is configured to splice the source language sample text and the target language sample text to obtain a spliced sample sequence; to use the source language sample text and an all-zero vector as the input of the decoder of the end-to-end speech translation model and the spliced sample sequence as the expected output; and to pre-train the decoder based on a masked cross-entropy loss function to obtain an initial decoder, wherein the masked cross-entropy loss function is used to mask the prediction loss of the source language predicted text corresponding to the all-zero vector.
  • the training process of the first encoder includes: using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder, using the sample phoneme sequence corresponding to the sample speech as the expected output, and training the first encoder based on the connectionist temporal classification (CTC) loss function.
  • the acoustic shrinking unit is configured to detect blank frames and repeated frames in the acoustic feature based on the peak characteristic of the probability distribution of the time-series classification loss function.
  • FIG. 6 shows a schematic structural diagram of an electronic device 600 suitable for implementing an embodiment of the present disclosure.
  • the electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Device, PAD), portable multimedia players (Personal Multimedia Player, PMP), and in-vehicle terminals (for example, in-vehicle navigation terminals), and fixed terminals such as digital televisions (TVs) and desktop computers.
  • the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 600 may include a processing device (such as a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from the storage device 606 into a random access memory (RAM) 603.
  • the RAM 603 also stores various programs and data required for the operation of the electronic device 600.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An Input/Output (I/O) interface 605 is also connected to the bus 604 .
  • the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, vibrator, etc.; storage devices 606 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data.
  • although FIG. 6 shows the electronic device 600 having various devices, it is not required to implement or have all of the illustrated devices; more or fewer devices may be implemented or provided instead.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 606, or from the ROM 602.
  • when the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Flash memory, optical fiber, portable Compact Disc Read-Only Memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code embodied on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • clients and servers can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model; Through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic feature is decoded from the semantic feature; through the decoder of the end-to-end speech translation model, according to the source language text Decoding the semantic feature to obtain a text sequence corresponding to the semantic feature, wherein the text sequence includes the source language text and the target language text corresponding to the source language text; splitting the text sequence to obtain the The target language text corresponding to the speech to be processed.
  • Computer program code for performing operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, using an Internet service provider to connect through the Internet).
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in special purpose hardware-based systems that perform the specified functions or operations, or special purpose hardware implemented in combination with computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner.
  • in some cases, the name of a unit does not constitute a limitation of the unit itself; for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • Machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fibers, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • the speech translation apparatus, device, and storage medium provided in the above embodiments can execute the speech translation method provided by any embodiment of the present application, and have corresponding functional modules and effects for executing the method.
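  • technical details not described in detail in the above embodiments may be found in the speech translation method provided by any embodiment of the present application.
  • a speech translation method comprising:
  • extracting the semantic feature of the speech to be processed through the encoder of the end-to-end speech translation model;
  • decoding, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic feature from the semantic feature;
  • decoding, through the decoder of the end-to-end speech translation model, the semantic feature according to the source language text to obtain a text sequence corresponding to the semantic feature, the text sequence including the source language text and the target language text corresponding to the source language text;
  • splitting the text sequence to obtain the target language text corresponding to the speech to be processed.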
  • the encoder includes a first encoder and a second encoder.
  • the above speech translation method is provided, further comprising: extracting the acoustic features of the speech to be processed through the first encoder; detecting blank frames and repeated frames in the acoustic features, removing the blank frames, and merging the repeated frames to obtain shrunk acoustic features; and extracting semantic features from the shrunk acoustic features through the second encoder.
  • the above speech translation method, further comprising: pre-training the decoder of the end-to-end speech translation model according to text translation samples to obtain an initial decoder; initializing the decoder of the end-to-end speech translation model based on the initial decoder; and training the initialized end-to-end speech translation model.
  • the text translation sample includes source language sample text and target language sample text corresponding to the source language sample text.
  • the above speech translation method further comprising: splicing the source language sample text and the target language sample text to obtain a spliced sample sequence; combining the source language sample text and the target language sample text
  • the all-zero vector is used as the input of the decoder of the end-to-end speech translation model, the spliced sample sequence is used as the expected output, and the decoder is pre-trained based on the masked cross-entropy loss function to obtain the initial decoder, where , the masked cross-entropy loss function is used to mask the prediction loss of the source language prediction text corresponding to the all-zero vector.
  • the above speech translation method further comprising: using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder,
  • the sample phoneme sequence corresponding to the sample speech is used as the expected output, and the first encoder is trained based on the time series classification loss function.
  • the above speech translation method further comprising: detecting blank frames and repeated frames in the acoustic feature based on the peak characteristic of the probability distribution of the time-series classification loss function.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

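A speech translation method, apparatus, device, and storage medium. The method includes: extracting semantic features of speech to be processed through an encoder of an end-to-end speech translation model (S101); decoding, through a decoder of the end-to-end speech translation model, source language text corresponding to the semantic features from the semantic features (S102); decoding, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features (S103); and splitting the text sequence to obtain target language text corresponding to the speech to be processed (S104).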

Description

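Speech Translation Method, Apparatus, Device, and Storage Medium
This application claims priority to Chinese Patent Application No. 202010987456.7, filed with the China Patent Office on September 18, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, for example, to a speech translation method, apparatus, device, and storage medium.
Background
With the continuous development of neural networks and the explosive growth of data, end-to-end speech translation technology has emerged. End-to-end speech translation establishes a mapping from source-language speech signals to target-language text, thereby translating raw speech directly into the target translation. However, the prediction performance of end-to-end speech translation models still falls short of expectations.
Summary
The present application provides a speech translation method, apparatus, device, and storage medium to improve the prediction performance of speech translation.
A speech translation method is provided, including:
extracting semantic features of speech to be processed through an encoder of an end-to-end speech translation model;
decoding, through a decoder of the end-to-end speech translation model, source language text corresponding to the semantic features from the semantic features;
decoding, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, where the text sequence includes the source language text and target language text corresponding to the source language text;
splitting the text sequence to obtain the target language text corresponding to the speech to be processed.
A speech translation apparatus is also provided, including:
an encoding module, configured to extract semantic features of speech to be processed through an encoder of an end-to-end speech translation model;
a first decoding module, configured to decode, through a decoder of the end-to-end speech translation model, source language text corresponding to the semantic features from the semantic features;
a second decoding module, configured to decode, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, where the text sequence includes the source language text and target language text corresponding to the source language text;
a splitting module, configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
An electronic device is also provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the above speech translation method provided by the present application when executing the computer program.
A computer-readable storage medium is also provided, storing a computer program that, when executed by a processor, implements the above speech translation method provided by the present application.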
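Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the principle of a speech translation process provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another speech translation method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a training process of an end-to-end speech translation model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech translation apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
Embodiments of the present disclosure are described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be implemented in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for understanding the present disclosure. The drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
The steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "including" and its variants are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not intended to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.
The modifiers "one" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and should be understood as "one or more" unless the context indicates otherwise.
The names of messages or information exchanged between apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of those messages or information.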
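Speech-to-text translation typically uses a pipeline system of automatic speech recognition followed by machine translation. However, pipeline systems suffer from long latency, parameter redundancy, error accumulation, and loss of speech features.
In recent years, end-to-end speech translation has attracted wide attention. It translates speech in a source language directly into text in a target language, effectively avoiding the technical problems of pipeline systems. However, end-to-end speech translation still faces many problems, and its prediction performance still falls short of expectations. To this end, the technical solutions provided by the embodiments of the present application can improve the prediction performance of the end-to-end speech translation model.
Embodiments of the present application are described below with reference to the drawings. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
The following method embodiments may be executed by a speech translation apparatus, which may be implemented as part or all of an electronic device through software, hardware, or a combination of the two. Optionally, the electronic device may be a client, including but not limited to a smartphone, a tablet computer, an e-book reader, an in-vehicle terminal, and the like. The electronic device may also be a standalone server or a server cluster; the embodiments of the present application do not limit the form of the electronic device. The following method embodiments are described with an electronic device as the executing entity.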
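FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application. This embodiment concerns how the electronic device translates speech in a source language into text in a target language based on an end-to-end speech translation model. As shown in FIG. 1, the method may include the following steps.
S101. Extract semantic features of the speech to be processed through the encoder of the end-to-end speech translation model.
The speech to be processed is speech that requires speech-to-text translation. It may be in any source language, and the translated text is in a different target language corresponding to the source language. For example, if the source language is English, the corresponding target language may be French. Optionally, before S101, the electronic device also needs to acquire the speech to be processed in the source language. As an example, the electronic device may select the speech to be processed from a database, or may acquire speech input by a user through translation software installed on the electronic device; this embodiment does not limit how the speech to be processed is acquired.
The end-to-end speech translation model may be a pre-trained multi-layer neural network and may include an encoder and a decoder. Depending on actual needs, the encoder and the decoder may have a variety of network structures. As an example, one recurrent neural network (RNN) may be used as the encoder and another RNN as the decoder. The encoder and the decoder may also take other forms, such as convolutional neural networks (CNNs). The encoder performs feature extraction on its input to obtain a feature vector. That is, the electronic device inputs the speech to be processed into the end-to-end speech translation model and extracts the semantic features of the speech through the encoder of the model. The semantic features contain all the information of the speech to be processed and serve as its high-dimensional intermediate representation.
Optionally, before inputting the speech to be processed into the end-to-end speech translation model, the electronic device may also process the speech with a spectrogram, a log-Mel filter bank, a discrete cosine transform, and the like, thereby extracting Mel-frequency cepstral coefficients as audio features, and use the resulting low-dimensional audio features as the input of the end-to-end speech translation model. The encoder of the model encodes these low-dimensional audio features to obtain the semantic features of the speech to be processed.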
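S102. Decode, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic features from the semantic features.
As described above, the encoder of the end-to-end speech translation model performs feature extraction on the input to obtain a feature vector, and the decoder decodes the feature vector according to context information to obtain the corresponding output text. To improve the decoding performance of the decoder, the embodiments of the present application introduce a continuous decoding mechanism: during decoding, the decoder first predicts the relatively simple source language text and then predicts the relatively difficult target language text. Thus, after obtaining the semantic features of the speech to be processed, the electronic device inputs the semantic features into the decoder of the end-to-end speech translation model, and the decoder decodes them to obtain the corresponding source language text.
Referring to FIG. 2, assume that the source language of the speech to be processed is English and the target language of the translated text is French. If the speech to be processed is "see you", the electronic device inputs it into the end-to-end speech translation model 201, extracts its semantic features through the encoder 202 of the model, inputs the resulting semantic features into the decoder 203 of the model, and decodes the source language text "see you" from the semantic features through the decoder 203.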
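S103. Decode, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features.
After decoding the source language text corresponding to the semantic features, the decoder of the end-to-end speech translation model continues with a second decoding pass over the semantic features. Because the source language text is already known during this second pass, the electronic device can use it as a reference for subsequent decoding and continue decoding the semantic features of the speech to be processed according to the source language text, thereby obtaining the text sequence corresponding to the speech to be processed. The text sequence includes the source language text and the corresponding target language text, which are joined by a task identifier. Compared with a decoder that decodes multiple tasks simultaneously, the continuous decoding mechanism introduced in the embodiments of the present application relieves the decoding pressure on the decoder. Moreover, when the text sequence is predicted, the source language text corresponding to the speech is already known, so continuing to decode the semantic features in combination with the known source language text improves decoding accuracy.
Continuing with FIG. 2, the electronic device uses the source language text "see you" decoded by the decoder 203 as a reference for decoding the semantic features, and continues, through the decoder 203 and in combination with the source language text "see you", to perform the second decoding pass on the semantic features, thereby decoding the text sequence "see you<st>Au revoir" corresponding to the speech to be processed.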
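The continuous decoding order of S102 and S103 can be illustrated with a small greedy loop. This is only a sketch under stated assumptions: `step` is a hypothetical stand-in for one decoder step conditioned on the semantic features, `<eos>` is an assumed end-of-sequence token, and `toy_step` simply replays the FIG. 2 example instead of running a real model.

```python
from typing import Callable, List

ST = "<st>"    # task identifier joining source and target text (from the FIG. 2 example)
EOS = "<eos>"  # hypothetical end-of-sequence token, not named in the source

def continuous_decode(step: Callable[[List[str]], str], max_len: int = 50) -> List[str]:
    """Greedily decode one sequence: source text first, then ST, then target text.

    Every step sees the full prefix, so target tokens are predicted with the
    already-decoded source text as context.
    """
    prefix: List[str] = []
    for _ in range(max_len):
        token = step(prefix)  # next token given everything decoded so far
        if token == EOS:
            break
        prefix.append(token)
    return prefix

def toy_step(prefix: List[str]) -> str:
    # Toy stand-in for the decoder: replays the example sequence token by token.
    script = ["see", "you", ST, "Au", "revoir", EOS]
    return script[len(prefix)]

sequence = continuous_decode(toy_step)
```

In a real model the step function would also attend to the encoder output; that part is omitted here because the source does not specify the decoder interface.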
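S104. Split the text sequence to obtain the target language text corresponding to the speech to be processed.
Because the source language text and the corresponding target language text are joined by a task identifier, the electronic device can split the text sequence output by the decoder according to the task identifier, thereby obtaining the target language text corresponding to the speech to be processed.
Continuing with FIG. 2, assume that the identifier joining the source language text and the target language text is "<st>". The electronic device can then split the text sequence "see you<st>Au revoir" output by the decoder 203 based on this identifier, thereby translating the speech to be processed "see you" into the target language text "Au revoir".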
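The splitting step in S104 amounts to cutting the decoder output at the task identifier. A minimal sketch follows; the function name `split_text_sequence` and the error handling are illustrative assumptions, and only the "<st>" identifier and the example strings come from the text.

```python
from typing import Tuple

def split_text_sequence(text_sequence: str, task_identifier: str = "<st>") -> Tuple[str, str]:
    """Split a decoded text sequence into (source text, target text)."""
    source_text, separator, target_text = text_sequence.partition(task_identifier)
    if not separator:
        # The decoder output did not contain the identifier at all.
        raise ValueError("task identifier not found in text sequence")
    return source_text.strip(), target_text.strip()

source, target = split_text_sequence("see you<st>Au revoir")
```

Using `str.partition` splits on the first occurrence only, so any later occurrence of the identifier would stay in the target text.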
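In the speech translation method provided by the embodiments of the present application, the electronic device extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, decodes the source language text corresponding to the semantic features from the semantic features through the decoder of the model, decodes the semantic features according to the source language text through the decoder to obtain the corresponding text sequence, and splits the text sequence to obtain the target language text corresponding to the speech to be processed. When decoding the semantic features, the decoder can first decode the source language text from the semantic features and then continue decoding the semantic features based on the known source language text. In other words, the continuous decoding mechanism predicts the relatively simple source language text first and the relatively difficult target language text second, which relieves the decoding pressure on the decoder. In addition, when the target language text is predicted, its corresponding source language text is already known, which improves the decoding performance of the decoder and thus the prediction performance of the end-to-end speech translation model.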
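In practice, the audio features of the speech to be processed are far longer than the corresponding source language text, which makes it harder for the encoder of the end-to-end speech translation model to extract the semantic features. The semantic features may therefore be extracted by the process described in the following embodiment. On the basis of the above embodiment, optionally, the encoder includes a first encoder and a second encoder, and as shown in FIG. 3, S101 may include the following steps.
S301. Extract acoustic features of the speech to be processed through the first encoder.
The encoder of the end-to-end speech translation model may include a first encoder and a second encoder that extract different features of the input speech. For example, the first encoder is used to extract the acoustic features of the speech to be processed, and the second encoder is used to extract its semantic features. The network structures of the two encoders may be built according to actual needs. In one optional implementation, the first encoder and the second encoder may each be an RNN. In another optional implementation, both may be implemented by a Transformer network, which may include multiple layers of multi-head self-attention modules as well as a linear layer, a softmax layer, and the like.
The electronic device inputs the speech to be processed into the first encoder and performs acoustic encoding through the multi-head self-attention modules of the first encoder, thereby extracting a high-dimensional acoustic feature representation of the speech. Optionally, before inputting the speech into the first encoder, the electronic device may also apply downsampling and a linear layer to the audio features of the speech. Downsampling reduces the dimension of the input audio features in the time domain. To keep the end-to-end speech translation model simple, downsampling may be done by manual subsampling, that is, keeping one out of every three audio frames. Other downsampling methods may also be used; this embodiment does not limit this. The linear layer maps the frequency-domain feature dimension of the downsampled audio to the hidden dimension of the network.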
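The manual subsampling described above (keeping one frame out of every three) can be sketched with plain list slicing. The function name and the toy one-dimensional frames are assumptions for illustration; the source does not state which frame of each group of three is kept, so this sketch keeps the first.

```python
from typing import List, Sequence

def downsample_frames(frames: Sequence[Sequence[float]], stride: int = 3) -> List[Sequence[float]]:
    """Keep one frame out of every `stride` frames (the first of each group)."""
    return list(frames[::stride])

# Seven toy frames, each a 1-dimensional feature vector.
frames = [[float(i)] for i in range(7)]
kept = downsample_frames(frames)
```

With a stride of 3, the time dimension shrinks to roughly one third of its original length before the linear layer is applied.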
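S302. Detect blank frames and repeated frames in the acoustic features, remove the blank frames, and merge the repeated frames to obtain shrunk acoustic features.
The acoustic features extracted by the first encoder may contain blank frames and repeated frames. A blank frame can be understood as an audio frame with no content information, and a repeated frame as an audio frame whose content information is repeated. To avoid the semantic interference caused by blank and repeated frames, an acoustic unit shrinking layer may be added between the first encoder and the second encoder. The acoustic unit shrinking layer detects blank frames and repeated frames in the acoustic features, removes the detected blank frames, and merges the repeated frames, thereby obtaining the shrunk acoustic features. After blank frames are removed and repeated frames merged, the shrunk acoustic features are much shorter than the acoustic features before processing, which helps the second encoder extract high-level semantic features.
In one optional implementation, the electronic device may detect the blank frames and repeated frames in the acoustic features based on the spike characteristic of the probability distribution of the connectionist temporal classification loss function.
Connectionist Temporal Classification (CTC) introduces a blank label (meaning the frame has no predicted value). Each predicted class corresponds to one spike in the whole utterance, and positions that are not spikes are treated as blank. For an utterance, the final output of CTC is a sequence of spikes, regardless of how long each phoneme lasts. Therefore, for the speech to be processed, the electronic device can detect the blank frames and repeated frames in its acoustic features based on the spike characteristic of the CTC probability distribution.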
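One way to picture the shrinking step is the sketch below. It rests on several assumptions that the source does not spell out: the blank label has index 0, a frame's label is the argmax of its CTC distribution, consecutive frames with the same non-blank label count as repeated frames, and merging is done by averaging the frames in a run.

```python
from typing import List, Optional

BLANK = 0  # assumed index of the CTC blank label

def shrink(features: List[List[float]], ctc_probs: List[List[float]]) -> List[List[float]]:
    """Drop blank frames and merge runs of repeated frames by averaging."""
    shrunk: List[List[float]] = []
    group: List[List[float]] = []   # frames of the current run of identical labels
    prev_label: Optional[int] = None

    def flush() -> None:
        # Emit the average of the current run, if any, and start a new run.
        if group:
            dim = len(group[0])
            shrunk.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
            group.clear()

    for feat, probs in zip(features, ctc_probs):
        label = max(range(len(probs)), key=probs.__getitem__)  # per-frame argmax
        if label == BLANK:
            flush()               # a blank ends the current run and is discarded
        else:
            if label != prev_label:
                flush()           # a new label starts a new run
            group.append(feat)
        prev_label = label
    flush()
    return shrunk

features = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [5.0, 5.0]]
ctc_probs = [
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # label 1
    [0.2, 0.7, 0.1],    # label 1 again (repeated frame)
    [0.8, 0.1, 0.1],    # blank
    [0.1, 0.1, 0.8],    # label 2
]
shrunk = shrink(features, ctc_probs)
```

Here five input frames shrink to two: the two frames labeled 1 are averaged into one, the blank frames are removed, and the frame labeled 2 is kept as is.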
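S303. Extract semantic features from the shrunk acoustic features through the second encoder.
After shrinking the acoustic features of the speech to be processed, the electronic device inputs the shrunk acoustic features into the second encoder to extract the high-level semantic features from them. The second encoder may include multiple layers of self-attention modules, and the stacked self-attention modules extract the high-level semantic features from the shrunk acoustic features.
In this embodiment, while extracting the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, the electronic device can shrink the acoustic features extracted by the first encoder, that is, remove the blank frames and merge the repeated frames in the acoustic features. This reduces the interference from blank and repeated frames and makes it easier for the second encoder to extract high-level semantic features, which improves the encoding performance of the encoder and, in turn, the prediction performance of the whole end-to-end speech translation model.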
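In practice, training data for end-to-end speech translation models is scarce, which makes training such models very time-consuming and laborious. To this end, one embodiment also provides a training process for the end-to-end speech translation model that makes full use of the more abundant machine translation data, thereby improving the decoding performance of the decoder. On the basis of the above embodiments, optionally, as shown in FIG. 4, the training process of the end-to-end speech translation model may include the following steps.
S401. Pre-train the decoder of the end-to-end speech translation model according to text translation samples to obtain an initial decoder.
Although speech recognition parallel data can improve the prediction performance of the end-to-end speech translation model, such data is scarce, so training the model is very time-consuming and laborious, and the prediction performance still falls short of expectations. Machine translation parallel data, by contrast, is plentiful, and how to use it to train the end-to-end speech translation model is a technical problem worth considering.
In the embodiments of the present application, the decoder of the end-to-end speech translation model uses the continuous decoding mechanism: the decoder first decodes the source language text from the semantic features and then decodes the text sequence corresponding to the semantic features based on that source language text. Under this continuous decoding structure, machine translation data can be fully utilized, so the electronic device can pre-train the decoder of the end-to-end speech translation model on machine translation data (that is, text translation samples).
Optionally, the text translation samples may include source language sample text and the target language sample text corresponding to the source language sample text. Pre-training the decoder of the end-to-end speech translation model on a large amount of source language sample text and corresponding target language sample text helps the decoder converge better. To this end, optionally, the process by which the electronic device pre-trains the decoder according to the text translation samples to obtain the initial decoder may include the following steps.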
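S4011. Splice the source language sample text and the target language sample text to obtain a spliced sample sequence.
To make it easy to distinguish the source language sample text from its corresponding target language sample text, the electronic device may splice the two with a preset task identifier, thereby obtaining the spliced sample sequence.
S4012. Use the source language sample text and an all-zero vector as the input of the decoder of the end-to-end speech translation model, use the spliced sample sequence as the expected output, and pre-train the decoder based on a masked cross-entropy loss function to obtain the initial decoder.
The masked cross-entropy loss function is used to mask the prediction loss of the source language predicted text corresponding to the all-zero vector. Unlike the training of the end-to-end speech translation model, the pre-training stage has no corresponding audio features as input. To stay consistent with the later fine-tuning of the decoder parameters, an all-zero vector may be used as the output of the encoder of the end-to-end speech translation model; that is, the all-zero vector is fed into the decoder in place of the semantic features of the speech to be processed. In addition, so that the decoder, given the known source language sample text, focuses only on predicting the target language text, the masked cross-entropy loss function may be used as the objective function of the decoder to mask out the prediction loss of the source language predicted text corresponding to the all-zero vector.
The electronic device can thus input the source language sample text and the all-zero vector into the decoder of the end-to-end speech translation model, predict the predicted sample sequence corresponding to the all-zero vector according to the source language sample text, compute the loss between the predicted sample sequence and the spliced sample sequence based on the masked cross-entropy loss function, and adjust the decoder parameters based on this loss until the convergence condition of the masked cross-entropy loss function is reached, thereby obtaining the initial decoder.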
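The masked cross-entropy used in S4012 can be sketched as an average of per-token losses taken only over unmasked positions. The toy vocabulary, the probability values, and the choice to mask the task identifier together with the source tokens are all illustrative assumptions; the source only states that the loss on the source language predicted text is masked.

```python
import math
from typing import Sequence

def masked_cross_entropy(
    probs: Sequence[Sequence[float]],  # per-position predicted distributions
    targets: Sequence[int],            # token ids of the spliced sample sequence
    loss_mask: Sequence[int],          # 1 = position counts toward the loss, 0 = masked
) -> float:
    """Average negative log-likelihood over the unmasked positions only."""
    total, count = 0.0, 0
    for dist, target, keep in zip(probs, targets, loss_mask):
        if keep:
            total -= math.log(dist[target])
            count += 1
    return total / count if count else 0.0

# Hypothetical toy vocabulary for the spliced sequence "see you <st> Au revoir".
vocab = {"see": 0, "you": 1, "<st>": 2, "Au": 3, "revoir": 4}
spliced = ["see", "you", "<st>", "Au", "revoir"]
targets = [vocab[token] for token in spliced]
loss_mask = [0, 0, 0, 1, 1]  # assumption: source tokens and the identifier are masked

uniform = [0.2] * 5
probs = [
    uniform, uniform, uniform,            # masked positions: values do not matter
    [0.125, 0.125, 0.125, 0.5, 0.125],    # P("Au") = 0.5
    [0.125, 0.125, 0.125, 0.125, 0.5],    # P("revoir") = 0.5
]
loss = masked_cross_entropy(probs, targets, loss_mask)
```

Because only the two target positions contribute and each assigns probability 0.5 to the correct token, the loss equals ln 2 regardless of what the model predicts at the masked source positions.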
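S402. Initialize the decoder of the end-to-end speech translation model based on the initial decoder.
After obtaining the initial decoder, the electronic device may initialize the parameters of the decoder of the end-to-end speech translation model with the parameter values of the initial decoder, that is, set the initial value of each decoder parameter to the value of the corresponding parameter of the initial decoder. This pre-training approach avoids the long, slow learning phase early in model training and thus greatly reduces training time. It also avoids a great deal of tedious hyperparameter tuning.
S403. Train the initialized end-to-end speech translation model.
The electronic device trains the initialized end-to-end speech translation model as a whole using joint optimization. Because the end-to-end speech translation model includes an encoder and a decoder, and the encoder in turn includes a first encoder and a second encoder, the loss value of the model's loss function is the weighted sum of a first loss value corresponding to the first encoder, a second loss value corresponding to the second encoder, and a third loss value corresponding to the decoder. The parameters of the end-to-end speech translation model are adjusted based on this weighted sum until the convergence condition of the loss function is reached, thereby obtaining the end-to-end speech translation model.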
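The joint objective in S403 is a weighted sum of three loss values. A minimal sketch is shown below; the weight values are placeholders chosen for illustration, since the source does not specify them.

```python
from typing import Tuple

def joint_loss(
    first_loss: float,   # loss of the first encoder (e.g. the CTC loss)
    second_loss: float,  # loss of the second encoder
    third_loss: float,   # loss of the decoder
    weights: Tuple[float, float, float] = (0.3, 0.3, 0.4),  # hypothetical weights
) -> float:
    """Weighted sum of the three loss values used for joint optimization."""
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss

total = joint_loss(1.0, 2.0, 3.0, weights=(0.5, 0.25, 0.25))
```

In practice the weights would be treated as hyperparameters that balance the auxiliary encoder objectives against the decoder objective.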
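Optionally, while training the above end-to-end speech translation model, the first encoder may be trained as follows: use the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder, use the sample phoneme sequence corresponding to the sample speech as the expected output, and train the first encoder based on the connectionist temporal classification loss function.
The training sample set of the end-to-end speech translation model includes multiple training samples, each of which includes sample speech, the sample phoneme sequence corresponding to the sample speech, and the sample text sequence corresponding to the sample speech. The electronic device may input the sample speech of a training sample into the first encoder to obtain the actual output of the first encoder, that is, the actual phoneme sequence.
The electronic device may use the sample phoneme sequence corresponding to the sample speech as the expected output of the first encoder, compute the difference between the actual output and the expected output of the first encoder based on the CTC loss function as the first loss value of the end-to-end speech translation model, and adjust the parameters of the first encoder based on the weighted sum of the first loss value and the above second and third loss values, thereby training the first encoder. During training of the first encoder, the CTC loss function is introduced as an auxiliary supervision signal, and the output of the softmax layer of the first decoder is supervised by the CTC loss function. The phoneme sequence is used as the optimization target of the CTC loss function because phonemes have fewer modeling units and are selected according to a pronunciation dictionary, so they are closer to the pronunciation information of the speech. This makes it easier for the model to learn the mapping from speech to phonemes, retains more acoustic information from the sample speech, and improves the encoding performance of the encoder.
In this embodiment, the electronic device can pre-train the decoder of the end-to-end speech translation model based on the continuous decoding mechanism of the decoder, initialize the decoder of the model based on the initial decoder obtained from pre-training, and train the initialized end-to-end speech translation model. This training approach makes full use of the plentiful machine translation parallel data, which improves the decoding performance of the decoder and the prediction performance of the end-to-end speech translation model. It also avoids the long, slow learning phase early in model training, which greatly reduces training time and improves the training efficiency of the end-to-end speech translation model.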
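FIG. 5 is a schematic structural diagram of a speech translation apparatus provided by an embodiment of the present application. As shown in FIG. 5, the apparatus may include an encoding module 501, a first decoding module 502, a second decoding module 503, and a splitting module 504.
The encoding module 501 is configured to extract semantic features of the speech to be processed through the encoder of the end-to-end speech translation model;
the first decoding module 502 is configured to decode, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic features from the semantic features;
the second decoding module 503 is configured to decode, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, the text sequence including the source language text and the target language text corresponding to the source language text;
the splitting module 504 is configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
In the speech translation apparatus provided by the embodiments of the present application, the electronic device extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, decodes the source language text corresponding to the semantic features from the semantic features through the decoder of the model, decodes the semantic features according to the source language text through the decoder to obtain the corresponding text sequence, and splits the text sequence to obtain the target language text corresponding to the speech to be processed. When decoding the semantic features, the decoder can first decode the source language text from the semantic features and then continue decoding the semantic features based on the known source language text. In other words, the continuous decoding mechanism predicts the relatively simple source language text first and the relatively difficult target language text second, which relieves the decoding pressure on the decoder. In addition, when the target language text is predicted, its corresponding source language text is already known, which improves the decoding performance of the decoder and thus the prediction performance of the end-to-end speech translation model.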
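On the basis of the above embodiment, optionally, the encoder includes a first encoder and a second encoder, and the encoding module 501 may include a first encoding unit, an acoustic shrinking unit, and a second encoding unit;
the first encoding unit is configured to extract acoustic features of the speech to be processed through the first encoder;
the acoustic shrinking unit is configured to detect blank frames and repeated frames in the acoustic features, remove the blank frames, and merge the repeated frames to obtain shrunk acoustic features;
the second encoding unit is configured to extract semantic features from the shrunk acoustic features through the second encoder.
On the basis of the above embodiment, optionally, the apparatus may further include a pre-training module, an initialization module, and an end-to-end training module;
the pre-training module is configured to pre-train the decoder of the end-to-end speech translation model according to text translation samples to obtain an initial decoder;
the initialization module is configured to initialize the decoder of the end-to-end speech translation model based on the initial decoder;
the end-to-end training module is configured to train the initialized end-to-end speech translation model.
Optionally, the text translation samples include source language sample text and the target language sample text corresponding to the source language sample text.
On the basis of the above embodiment, optionally, the pre-training module is configured to splice the source language sample text and the target language sample text to obtain a spliced sample sequence; to use the source language sample text and an all-zero vector as the input of the decoder of the end-to-end speech translation model and the spliced sample sequence as the expected output; and to pre-train the decoder based on a masked cross-entropy loss function to obtain the initial decoder, where the masked cross-entropy loss function is used to mask the prediction loss of the source language predicted text corresponding to the all-zero vector.
On the basis of the above embodiment, optionally, the training process of the first encoder includes: using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder, using the sample phoneme sequence corresponding to the sample speech as the expected output, and training the first encoder based on the connectionist temporal classification loss function.
On the basis of the above embodiment, optionally, the acoustic shrinking unit is configured to detect the blank frames and repeated frames in the acoustic features based on the spike characteristic of the probability distribution of the connectionist temporal classification loss function.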
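Referring now to FIG. 6, which shows a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (Portable Android Device, PAD), portable multimedia players (Personal Multimedia Player, PMP), and in-vehicle terminals (for example, in-vehicle navigation terminals), and fixed terminals such as digital televisions (TVs) and desktop computers. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device (such as a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from the storage device 606 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, vibrator, etc.; storage devices 606 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 having various devices, it is not required to implement or have all of the illustrated devices; more or fewer devices may be implemented or provided instead.
According to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, installed from the storage device 606, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of the embodiments of the present disclosure are performed.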
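The computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), flash memory, optical fiber, portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: electric wire, optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
In some implementations, clients and servers can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The above computer-readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: extract semantic features of the speech to be processed through the encoder of the end-to-end speech translation model; decode, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic features from the semantic features; decode, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, where the text sequence includes the source language text and the target language text corresponding to the source language text; and split the text sequence to obtain the target language text corresponding to the speech to be processed.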
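Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architecture, functionality, and operation of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation of the unit itself; for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".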
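The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fibers, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
The speech translation apparatus, device, and storage medium provided in the above embodiments can execute the speech translation method provided by any embodiment of the present application, and have the corresponding functional modules and effects for executing the method. Technical details not described in detail in the above embodiments may be found in the speech translation method provided by any embodiment of the present application.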
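According to one or more embodiments of the present disclosure, a speech translation method is provided, including:
extracting semantic features of the speech to be processed through an encoder of an end-to-end speech translation model;
decoding, through a decoder of the end-to-end speech translation model, the source-language text corresponding to the semantic features from the semantic features;
decoding, through the decoder of the end-to-end speech translation model, the semantic features according to the source-language text to obtain a text sequence corresponding to the semantic features, where the text sequence includes the source-language text and the target-language text corresponding to the source-language text; and
splitting the text sequence to obtain the target-language text corresponding to the speech to be processed.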
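The final splitting step can be illustrated with a minimal Python sketch. It assumes the decoder emits a single token sequence in which a separator token divides the source-language text from the target-language text. The separator `<sep>` and the function name `split_text_sequence` are hypothetical; the text above does not say how the boundary is marked.

```python
from typing import List, Tuple

# Hypothetical separator token: the source does not name the marker that
# divides source-language text from target-language text.
SEP = "<sep>"


def split_text_sequence(tokens: List[str], sep: str = SEP) -> Tuple[List[str], List[str]]:
    """Split a decoded text sequence into (source tokens, target tokens)."""
    if sep not in tokens:
        raise ValueError("separator token not found in decoded sequence")
    cut = tokens.index(sep)  # position of the first separator
    return tokens[:cut], tokens[cut + 1:]


decoded = ["hello", "world", "<sep>", "bonjour", "le", "monde"]
source_text, target_text = split_text_sequence(decoded)
print(" ".join(target_text))  # prints "bonjour le monde"
```

Only the second half of the split is returned to the user as the translation; the first half is the intermediate transcription that the decoder conditioned on.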
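Optionally, the encoder includes a first encoder and a second encoder. According to one or more embodiments of the present disclosure, the speech translation method described above is provided, further including: extracting acoustic features of the speech to be processed through the first encoder; detecting blank frames and repeated frames in the acoustic features, removing the blank frames, and merging the repeated frames to obtain shrunk acoustic features; and extracting the semantic features from the shrunk acoustic features through the second encoder.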
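The shrinking step can be sketched as follows, assuming each acoustic frame already has a predicted label and that label `0` marks a blank frame. Merging a run of repeated frames by averaging is an assumption made for illustration; the text above only says that repeated frames are merged. The name `shrink_features` is hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the blank label


def shrink_features(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    blank: int = BLANK,
) -> List[List[float]]:
    """Drop blank frames and average each run of consecutive same-label frames."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    shrunk: List[List[float]] = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        run = features[index:index + length]
        index += length
        if label == blank:
            continue  # blank frames are removed
        dim = len(run[0])
        # Repeated frames are merged into one frame by element-wise averaging.
        shrunk.append([sum(frame[d] for frame in run) / length for d in range(dim)])
    return shrunk


frames = [[1.0, 1.0], [3.0, 5.0], [0.0, 0.0], [2.0, 2.0], [9.0, 9.0]]
frame_labels = [4, 4, 0, 4, 0]
print(shrink_features(frames, frame_labels))
```

In this example five input frames shrink to two, so the second encoder sees a much shorter sequence than the raw acoustic features.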
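According to one or more embodiments of the present disclosure, the speech translation method described above is provided, further including: pre-training the decoder of the end-to-end speech translation model on text translation samples to obtain an initial decoder; initializing the decoder of the end-to-end speech translation model from the initial decoder; and training the initialized end-to-end speech translation model.
Optionally, the text translation samples include source-language sample text and the target-language sample text corresponding to the source-language sample text.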
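According to one or more embodiments of the present disclosure, the speech translation method described above is provided, further including: concatenating the source-language sample text and the target-language sample text to obtain a concatenated sample sequence; and pre-training the decoder with a masked cross-entropy loss function, using the source-language sample text and an all-zero vector as the input of the decoder of the end-to-end speech translation model and the concatenated sample sequence as the expected output, to obtain the initial decoder, where the masked cross-entropy loss function masks the prediction loss of the source-language predicted text corresponding to the all-zero vector.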
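A masked cross-entropy loss of this kind can be sketched in plain Python. The sketch assumes that the mask is `0` on the source-language positions of the concatenated sequence and `1` on the target-language positions, so that only the target part contributes to the loss. The function name and the per-position probability layout are illustrative, not taken from the text above.

```python
import math
from typing import Sequence


def masked_cross_entropy(
    probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    mask: Sequence[int],
) -> float:
    """Mean negative log-likelihood over the positions whose mask value is 1."""
    total = 0.0
    count = 0
    for dist, target, keep in zip(probs, targets, mask):
        if keep:
            total += -math.log(dist[target])
            count += 1
    return total / count if count else 0.0


# Concatenated sequence: two source-language positions, then two target positions.
predicted = [[0.5, 0.5], [0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
expected = [0, 1, 1, 0]
loss_mask = [0, 0, 1, 1]  # assumed: source-language positions are masked out
loss = masked_cross_entropy(predicted, expected, loss_mask)
print(loss)
```

Masking rather than truncating keeps the expected output aligned with the concatenated sequence that the decoder later produces on real speech, while the all-zero input contributes no gradient through the source-language predictions.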
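According to one or more embodiments of the present disclosure, the speech translation method described above is provided, further including: training the first encoder with a temporal classification loss function, using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder and the sample phoneme sequence corresponding to the sample speech as the expected output.
According to one or more embodiments of the present disclosure, the speech translation method described above is provided, further including: detecting the blank frames and repeated frames in the acoustic features based on the spike characteristic of the probability distribution of the temporal classification loss function.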
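One way to read the spike characteristic is that the per-frame distribution concentrates on a single label, so the most probable label of each frame can be used directly. The sketch below assumes a blank label at index `0` and marks a frame as repeated when its most probable label equals that of the preceding frame. This argmax rule and the name `classify_frames` are assumptions for illustration, not the detection rule stated in the source.

```python
from typing import List, Sequence


def classify_frames(posteriors: Sequence[Sequence[float]], blank: int = 0) -> List[str]:
    """Tag each frame as 'blank', 'repeat', or 'keep' from its most probable label."""
    kinds: List[str] = []
    previous = None
    for dist in posteriors:
        best = max(range(len(dist)), key=dist.__getitem__)  # per-frame argmax
        if best == blank:
            kinds.append("blank")
        elif best == previous:
            kinds.append("repeat")
        else:
            kinds.append("keep")
        previous = best
    return kinds


frame_posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
print(classify_frames(frame_posteriors))
```

A label that reappears after a blank frame is tagged `keep` rather than `repeat`, which matches the usual convention that a blank separates two genuine occurrences of the same label.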
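In addition, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation details, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.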

Claims (10)

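  1. A speech translation method, comprising:
    extracting semantic features of speech to be processed through an encoder of an end-to-end speech translation model;
    decoding, through a decoder of the end-to-end speech translation model, source-language text corresponding to the semantic features from the semantic features;
    decoding, through the decoder of the end-to-end speech translation model, the semantic features according to the source-language text to obtain a text sequence corresponding to the semantic features, wherein the text sequence comprises the source-language text and target-language text corresponding to the source-language text; and
    splitting the text sequence to obtain the target-language text corresponding to the speech to be processed.
  2. The method according to claim 1, wherein the encoder comprises a first encoder and a second encoder, and extracting the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model comprises:
    extracting acoustic features of the speech to be processed through the first encoder;
    detecting blank frames and repeated frames in the acoustic features, removing the blank frames, and merging the repeated frames to obtain shrunk acoustic features; and
    extracting the semantic features from the shrunk acoustic features through the second encoder.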
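  3. The method according to claim 1, wherein a training process of the end-to-end speech translation model comprises:
    pre-training the decoder of the end-to-end speech translation model on text translation samples to obtain an initial decoder;
    initializing the decoder of the end-to-end speech translation model based on the initial decoder; and
    training the initialized end-to-end speech translation model.
  4. The method according to claim 3, wherein the text translation samples comprise source-language sample text and target-language sample text corresponding to the source-language sample text.
  5. The method according to claim 4, wherein pre-training the decoder of the end-to-end speech translation model on the text translation samples to obtain the initial decoder comprises:
    concatenating the source-language sample text and the target-language sample text to obtain a concatenated sample sequence; and
    pre-training the decoder based on a masked cross-entropy loss function, with the source-language sample text and an all-zero vector as the input of the decoder of the end-to-end speech translation model and the concatenated sample sequence as the expected output, to obtain the initial decoder, wherein the masked cross-entropy loss function is used to mask the prediction loss of the source-language predicted text corresponding to the all-zero vector.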
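  6. The method according to any one of claims 2 to 5, wherein a training process of the first encoder comprises:
    training the first encoder based on a temporal classification loss function, with sample speech in a training sample set of the end-to-end speech translation model as the input of the first encoder and a sample phoneme sequence corresponding to the sample speech as the expected output.
  7. The method according to claim 2, wherein detecting the blank frames and repeated frames in the acoustic features comprises:
    detecting the blank frames and repeated frames in the acoustic features based on the spike characteristic of the probability distribution of a temporal classification loss function.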
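  8. A speech translation apparatus, comprising:
    an encoding module configured to extract semantic features of speech to be processed through an encoder of an end-to-end speech translation model;
    a first decoding module configured to decode, through a decoder of the end-to-end speech translation model, source-language text corresponding to the semantic features from the semantic features;
    a second decoding module configured to decode, through the decoder of the end-to-end speech translation model, the semantic features according to the source-language text to obtain a text sequence corresponding to the semantic features, wherein the text sequence comprises the source-language text and target-language text corresponding to the source-language text; and
    a splitting module configured to split the text sequence to obtain the target-language text corresponding to the speech to be processed.
  9. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program implements the method according to any one of claims 1 to 7 when executed by a processor.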
PCT/CN2021/116232 2020-09-18 2021-09-02 Speech translation method, apparatus, device, and storage medium WO2022057637A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/245,802 US20240028841A1 (en) 2020-09-18 2021-09-02 Speech translation method, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010987456.7 2020-09-18
CN202010987456.7A CN112183120B (zh) 2020-09-18 2020-09-18 Speech translation method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022057637A1 true WO2022057637A1 (zh) 2022-03-24

Family

ID=73955233

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116232 WO2022057637A1 (zh) 2020-09-18 2021-09-02 Speech translation method, apparatus, device, and storage medium

Country Status (3)

Country Link
US (1) US20240028841A1 (zh)
CN (1) CN112183120B (zh)
WO (1) WO2022057637A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
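CN112183120B (zh) * 2020-09-18 2023-10-20 北京字节跳动网络技术有限公司 Speech translation method, apparatus, device, and storage medium
CN112800782B (zh) * 2021-01-29 2023-10-03 中国科学院自动化研究所 Speech translation method, system, and device fusing text semantic features
CN112908341B (zh) * 2021-02-22 2023-01-03 哈尔滨工程大学 Voiceprint recognition method for language learners based on a multi-task self-attention mechanism
CN113129868B (zh) * 2021-03-12 2022-02-25 北京百度网讯科技有限公司 Method for obtaining a speech recognition model, speech recognition method, and corresponding apparatus
CN113299274B (zh) * 2021-05-18 2024-03-01 平安科技(深圳)有限公司 Method, apparatus, device, and medium for mutual translation between vernacular and classical Chinese and for speech synthesis
CN113408305B (zh) * 2021-06-30 2023-03-24 北京百度网讯科技有限公司 Model training method, apparatus, device, and storage medium
CN113505611B (zh) * 2021-07-09 2022-04-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining a better speech translation model in generative adversarial training
CN113571044A (zh) * 2021-07-28 2021-10-29 北京有竹居网络技术有限公司 Speech information processing method, apparatus, and electronic device
CN114048758A (zh) * 2021-11-10 2022-02-15 北京有竹居网络技术有限公司 Training method, speech translation method, device, and computer-readable medium
CN115831089B (zh) * 2021-12-27 2023-12-01 北京百度网讯科技有限公司 Method, apparatus, device, medium, and product for determining acoustic features
CN114822498B (zh) * 2022-03-29 2024-06-07 北京有竹居网络技术有限公司 Training method for a speech translation model, speech translation method, apparatus, and device
CN117113091B (zh) * 2023-10-24 2024-02-13 中国科学院自动化研究所 Speech translation model training method, apparatus, electronic device, and storage medium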

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231062A (zh) * 2018-01-12 2018-06-29 科大讯飞股份有限公司 Speech translation method and apparatus
US10249294B2 (en) * 2016-09-09 2019-04-02 Electronics And Telecommunications Research Institute Speech recognition system and method
CN111326157A (zh) * 2020-01-20 2020-06-23 北京字节跳动网络技术有限公司 Text generation method, apparatus, electronic device, and computer-readable medium
CN111368559A (zh) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 Speech translation method, apparatus, electronic device, and storage medium
CN112183120A (zh) * 2020-09-18 2021-01-05 北京字节跳动网络技术有限公司 Speech translation method, apparatus, device, and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
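CN115312029A (zh) * 2022-10-12 2022-11-08 之江实验室 Speech translation method and system based on deep speech representation mapping
CN117056709A (zh) * 2023-10-11 2023-11-14 腾讯科技(深圳)有限公司 Training method and apparatus for a time-series prediction model, storage medium, and electronic device
CN117094329A (zh) * 2023-10-13 2023-11-21 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Speech translation method and apparatus for resolving speech ambiguity
CN117094329B (zh) * 2023-10-13 2024-02-02 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Speech translation method and apparatus for resolving speech ambiguity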

Also Published As

Publication number Publication date
CN112183120B (zh) 2023-10-20
CN112183120A (zh) 2021-01-05
US20240028841A1 (en) 2024-01-25

Similar Documents

Publication Publication Date Title
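WO2022057637A1 (zh) Speech translation method, apparatus, device, and storage medium
CN111326157B (zh) Text generation method, apparatus, electronic device, and computer-readable medium
WO2022033327A1 (zh) Video generation method, generative model training method, apparatus, medium, and device
CN112786006B (zh) Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN113327609B (zh) Method and apparatus for speech recognition
CN111402861B (zh) Speech recognition method, apparatus, device, and storage medium
CN110097870B (zh) Speech processing method, apparatus, device, and storage medium
CN111368559A (zh) Speech translation method, apparatus, electronic device, and storage medium
WO2022037526A1 (zh) Speech recognition method, apparatus, electronic device, and storage medium
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
CN111489735B (zh) Speech recognition model training method and apparatus
WO2022228041A1 (zh) Translation model training method, apparatus, device, and storage medium
WO2022127620A1 (zh) Voice wake-up method, apparatus, electronic device, and storage medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
WO2023005729A1 (zh) Speech information processing method, apparatus, and electronic device
US20230306979A1 (en) Voice processing method and apparatus, electronic device, and computer readable medium
CN114550702A (zh) Speech recognition method and apparatus
CN115967833A (zh) Video generation method, apparatus, device, and storage medium
CN111128131B (zh) Speech recognition method, apparatus, electronic device, and computer-readable storage medium
CN111933119A (zh) Method, apparatus, electronic device, and medium for generating a speech recognition network
CN113986958B (zh) Text information conversion method, apparatus, readable medium, and electronic device
CN112836476B (zh) Minutes generation method, apparatus, device, and medium
CN117316160B (zh) Silent speech recognition method, apparatus, electronic device, and computer-readable medium
CN110634475B (zh) Speech recognition method, apparatus, electronic device, and computer-readable storage medium
CN117376634B (zh) Short-video soundtrack method, apparatus, electronic device, and storage medium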

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868464

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18245802

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 030723)

122 Ep: pct application non-entry in european phase

Ref document number: 21868464

Country of ref document: EP

Kind code of ref document: A1