WO2022057637A1 - Speech translation method, apparatus, device and storage medium - Google Patents
Speech translation method, apparatus, device and storage medium
- Publication number
- WO2022057637A1 PCT/CN2021/116232 CN2021116232W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- speech
- decoder
- encoder
- translation model
- Prior art date
Links
- 238000013519 translation Methods 0.000 title claims abstract description 169
- 238000000034 method Methods 0.000 title claims abstract description 61
- 238000012549 training Methods 0.000 claims description 41
- 230000006870 function Effects 0.000 claims description 34
- 230000008569 process Effects 0.000 claims description 17
- 239000013598 vector Substances 0.000 claims description 16
- 238000004590 computer program Methods 0.000 claims description 15
- 239000000284 extract Substances 0.000 claims description 12
- 238000000605 extraction Methods 0.000 claims description 7
- 230000002123 temporal effect Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 description 11
- 238000010586 diagram Methods 0.000 description 9
- 238000004891 communication Methods 0.000 description 6
- 230000003287 optical effect Effects 0.000 description 5
- 230000007246 mechanism Effects 0.000 description 4
- 238000013528 artificial neural network Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 239000013307 optical fiber Substances 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000000644 propagated effect Effects 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000002360 explosive Substances 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/47—Machine-assisted translation, e.g. using translation memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the present application relates to the field of computer technology, for example, to a speech translation method, apparatus, device and storage medium.
- With the continuous development of neural networks and the explosive growth of data, end-to-end speech translation technology has emerged. End-to-end speech translation establishes a mapping between the source language speech signal and the target language text, thereby translating the original speech directly into the target translation. However, the prediction performance of end-to-end speech translation models still falls short of expectations.
- the present application provides a speech translation method, apparatus, device and storage medium to improve the prediction performance of speech translation.
- a speech translation method including: extracting the semantic features of the speech to be processed through the encoder of an end-to-end speech translation model;
- decoding, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic features from the semantic features;
- decoding the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, wherein the text sequence includes the source language text and the target language text corresponding to the source language text;
- splitting the text sequence to obtain the target language text corresponding to the speech to be processed;
- a speech translation apparatus comprising:
- an encoding module configured to extract the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model
- a first decoding module configured to decode the source language text corresponding to the semantic feature from the semantic feature through the decoder of the end-to-end speech translation model
- the second decoding module is configured to decode the semantic feature according to the source language text through the decoder of the end-to-end speech translation model to obtain a text sequence corresponding to the semantic feature, wherein the text sequence includes the source language text and the target language text corresponding to the source language text;
- the splitting module is configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
- An electronic device including a memory and a processor, wherein the memory stores a computer program, and the processor implements the above-mentioned speech translation method provided by the present application when the computer program is executed.
- a computer-readable storage medium is also provided, storing a computer program, and when the computer program is executed by a processor, the above-mentioned speech translation method provided by the present application is implemented.
- FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of the principle of a speech translation process provided by an embodiment of the present application.
- FIG. 3 is a schematic flowchart of another speech translation method provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a training process of an end-to-end speech translation model provided by an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a speech translation apparatus provided by an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
- the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
- the term “based on” is “based at least in part on.”
- the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
- Speech-to-text translation typically uses a pipelined system of automatic speech recognition and machine translation.
- pipelined systems suffer from long delays, parameter redundancy, error accumulation, and loss of speech features.
- end-to-end speech translation technology has received extensive attention.
- This end-to-end speech translation can directly translate speech in the form of the source language into text in the form of the target language, effectively avoiding the technical problems existing in the pipeline system.
- end-to-end speech translation still faces many problems, and the prediction performance of end-to-end speech translation still falls short of expectations. Therefore, the technical solutions provided by the embodiments of the present application can improve the prediction performance of the end-to-end speech translation model.
- the execution subject of the following method embodiments may be a speech translation apparatus, and the apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of software and hardware.
- the electronic device may be a client, including but not limited to a smart phone, a tablet computer, an e-book reader, a vehicle terminal, and the like.
- the electronic device may also be an independent server or a server cluster, and the embodiment of the present application does not limit the form of the electronic device.
- the following method embodiments are described by taking the execution subject being an electronic device as an example.
- FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application. This embodiment relates to the process of how the electronic device translates the speech in the form of the source language into the text in the form of the target language based on the end-to-end speech translation model. As shown in Figure 1, the method may include the following steps.
- the speech to be processed is the speech that needs to be translated into text.
- the speech to be processed can be in any source language, and the translated text is in a corresponding target language; for example, if the source language is English, the target language can be French.
- the electronic device also needs to acquire the to-be-processed speech in the source language.
- the electronic device can select from a database the to-be-processed speech that needs to be translated into text, or can obtain the to-be-processed speech input by a user through translation software installed on the electronic device; the acquisition method of the to-be-processed speech is not limited in this embodiment.
- the above-mentioned end-to-end speech translation model may be a pre-trained multi-layer neural network, and the end-to-end speech translation model may include an encoder and a decoder.
- the encoder and decoder can be of various network structures.
- for example, a Recurrent Neural Network (RNN).
- the encoder and decoder can also be other forms of network structures, such as Convolutional Neural Networks (CNN).
- the encoder can perform feature extraction on the input content to obtain feature vectors.
- the electronic device inputs the speech to be processed into the end-to-end speech translation model, and extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model.
- the semantic feature contains all the information of the to-be-processed speech and serves as a high-dimensional intermediate representation of it.
- the electronic device may also apply spectrogram, log-Mel filter bank, discrete cosine transform, and other processing to the to-be-processed speech, thereby extracting the Mel cepstrum.
- the low-dimensional audio features obtained are used as the input of the end-to-end speech translation model.
- the low-dimensional audio features are encoded by the encoder of the end-to-end speech translation model to obtain the semantic features of the speech to be processed.
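- As a minimal illustrative sketch (not part of the claimed method), the low-dimensional audio front-end described above could look as follows in PyTorch/torchaudio; the file name, 80-band filter bank, and frame parameters are assumptions, not values specified by this application:

```python
import torch
import torchaudio

# Hypothetical input file; any waveform of the speech to be processed works.
waveform, sample_rate = torchaudio.load("speech_to_process.wav")

# Log-Mel filter bank features: spectrogram -> Mel filter bank -> log compression.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)
log_mel = torch.log(mel + 1e-6)  # (channel, n_mels, time)

# A discrete cosine transform over the log-Mel energies yields the Mel cepstrum
# (MFCC), the low-dimensional audio feature mentioned above.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)(waveform)  # (channel, n_mfcc, time), fed to the end-to-end model's encoder
```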
- the encoder of the end-to-end speech translation model performs feature extraction on the input content to obtain a feature vector, and the decoder decodes the feature vector according to the context information to obtain the corresponding output text.
- the embodiment of the present application introduces a continuous decoding mechanism, that is, in the decoding process, the decoder first predicts the relatively simple source language text, and then predicts the relatively difficult target language text.
- the electronic device inputs the semantic features into the decoder of the end-to-end speech translation model, and the decoder decodes the semantic features to obtain the source language text corresponding to them.
- the electronic device inputs the speech to be processed into the end-to-end speech translation model 201, extracts the semantic features of the speech to be processed through the encoder 202 of the end-to-end speech translation model 201, and inputs the obtained semantic features into the decoder 203 of the end-to-end speech translation model 201.
- the source language text "see you” corresponding to the semantic features is decoded from the semantic features.
- the decoder of the end-to-end speech translation model continues to perform secondary decoding on the semantic features.
- the electronic device can then use the source language text as a reference for subsequent decoding, and continue to decode the semantic features of the to-be-processed speech according to the source-language text, thereby obtaining a text sequence corresponding to the to-be-processed speech.
- the text sequence includes the source language text and the target language text corresponding to the source language text, and the source language text and the target language text are connected by a task identifier.
- the continuous decoding mechanism introduced in the embodiment of the present application relieves the decoding pressure of the decoder.
- since the source language text corresponding to the to-be-processed speech is known, continuing to decode the semantic features of the to-be-processed speech in combination with the known source language text can improve the accuracy of decoding.
- the electronic device uses the source language text "see you" decoded by the decoder 203 as a reference for decoding the semantic features, and continues, through the decoder 203, to perform secondary decoding on the semantic features in combination with the source language text "see you", so as to decode the text sequence "see you<st>Au revoir" corresponding to the speech to be processed.
- the electronic device can split the text sequence output by the decoder according to the task identifier, so as to obtain the target language text corresponding to the speech to be processed.
- the connection identifier between the source language text and the target language text is "<st>".
- based on this connection identifier, the electronic device can split the text sequence "see you<st>Au revoir", so that the to-be-processed speech "see you" is translated into the target language text "Au revoir", as sketched below.
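- A minimal sketch of this splitting step, assuming the decoder emits the source text and target text joined by the task identifier "<st>" as in the example above:

```python
def split_text_sequence(text_sequence: str, task_id: str = "<st>") -> str:
    """Return the target language text from a decoded text sequence."""
    source_text, sep, target_text = text_sequence.partition(task_id)
    # If the task identifier is absent, fall back to the whole decoded string.
    return target_text.strip() if sep else source_text.strip()

print(split_text_sequence("see you<st>Au revoir"))  # -> "Au revoir"
```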
- the electronic device extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, and decodes the source language text corresponding to the semantic features from the semantic features through the decoder of the end-to-end speech translation model.
- the decoder of the end-to-end speech translation model then decodes the above semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, and the text sequence is split to obtain the target language text corresponding to the speech to be processed.
- the decoder of the end-to-end speech translation model can first decode the source language text from the semantic features, and then continue to decode the semantic features based on the known source language text.
- the relatively simple source language text is predicted first, and then the relatively difficult target language text is predicted, which relieves the decoding pressure of the decoder; and when predicting the target language text, the source language text corresponding to the target language text is already known, which improves the decoding performance of the decoder and thereby the prediction performance of the end-to-end speech translation model. A sketch of this two-stage decoding follows.
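- A minimal sketch of the continuous decoding mechanism, under the assumption of greedy decoding and a hypothetical `model.decoder` interface (token ids and shapes are illustrative, not specified by this application). A single autoregressive pass first emits the source language tokens, then, after the task identifier, the target language tokens conditioned on the already-decoded source text:

```python
import torch

def continuous_decode(model, semantic_features, bos_id, eos_id, max_len=200):
    tokens = [bos_id]
    for _ in range(max_len):
        # The decoder attends to the encoder's semantic features and to all
        # previously generated tokens: source text, then <st>, then target text.
        logits = model.decoder(torch.tensor([tokens]), memory=semantic_features)
        next_id = int(logits[0, -1].argmax())  # greedy choice of the next token
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens  # encodes "source_text <st> target_text"
```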
- the encoder includes a first encoder and a second encoder. As shown in FIG. 3 , the foregoing S101 may include the following steps.
- the encoder of the end-to-end speech translation model may include a first encoder and a second encoder to extract different features of the input speech to be processed, respectively.
- the first encoder is used to extract acoustic features of the speech to be processed
- the second encoder is used to extract semantic features of the speech to be processed.
- the network structures of the first encoder and the second encoder can be constructed according to actual needs.
- both the first encoder and the second encoder may be an RNN network.
- both the first encoder and the second encoder may be implemented by a Transformer network.
- the Transformer network can include multiple layers of multi-head self-attention modules, and can also include a linear layer and a softmax layer.
- the electronic device inputs the to-be-processed speech to the first encoder, and performs acoustic encoding of the to-be-processed speech through the multi-head self-attention module of the first encoder, thereby extracting a high-dimensional acoustic feature representation of the to-be-processed speech.
- the electronic device may further perform down-sampling and linear layer processing on the audio features of the to-be-processed speech.
- downsampling refers to reducing the dimension of the input audio features in the time domain.
- manual downsampling can be used, that is, one frame is taken for every three frames of audio.
- Other downsampling manners may also be used, which are not limited in this embodiment.
- the linear layer can map the frequency domain feature dimension of the downsampled audio to the hidden layer dimension of the network.
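- A minimal sketch of this downsampling and linear mapping (the feature and hidden dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 300, 80)   # (batch, time, audio_feature_dim)
downsampled = features[:, ::3, :]    # keep one frame for every three -> (1, 100, 80)

to_hidden = nn.Linear(80, 512)       # map the feature dim to the network's hidden dim
encoder_input = to_hidden(downsampled)  # (1, 100, 512), fed to the first encoder
```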
- Blank frames and repeated frames may exist in the acoustic features of the speech to be processed extracted by the first encoder.
- a blank frame can be understood as an audio frame without content information
- a repeated frame can be understood as an audio frame with repeated content information.
- an acoustic unit shrinking layer can be added between the first encoder and the second encoder. The blank frames and repeated frames in the acoustic features are detected by the acoustic unit shrinking layer, the detected blank frames are removed from the acoustic features, and the repeated frames are merged to obtain the shrunk acoustic features.
- the length of the obtained shrunk acoustic features is greatly shortened compared with the acoustic features before processing, which is beneficial for the subsequent extraction of high-level semantic features by the second encoder.
- the electronic device may detect blank frames and repeated frames in the acoustic feature based on the peak characteristic of the probability distribution of the time-series classification loss function.
- the connectionist temporal classification (CTC) loss function introduces a blank label (meaning a frame has no predicted value); over a whole utterance, each predicted unit corresponds to a spike in the probability distribution, and positions that are not spikes are considered blank.
- the final output of CTC is a sequence of spikes, and it does not care how long each phoneme lasts. Therefore, for the speech to be processed, the electronic device can detect blank frames and repeated frames in its acoustic features based on the peak characteristic of the probability distribution of the CTC, as sketched below.
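- A minimal sketch of the acoustic unit shrinking described above, assuming per-frame CTC posteriors are available from the first encoder's softmax layer; merging repeated frames by averaging their feature vectors is one plausible choice, not a detail specified by this application:

```python
import torch

def shrink_acoustic_features(acoustic, ctc_log_probs, blank_id=0):
    """acoustic: (time, dim); ctc_log_probs: (time, vocab) from the CTC softmax."""
    labels = ctc_log_probs.argmax(dim=-1)  # per-frame CTC prediction (spikes or blank)
    segments, current, prev = [], [], None
    for t, lab in enumerate(labels.tolist()):
        if lab == blank_id:                # drop blank frames
            prev = None
            continue
        if lab != prev and current:        # a new unit starts: close the segment
            segments.append(torch.stack(current).mean(dim=0))
            current = []
        current.append(acoustic[t])        # accumulate repeated frames of one unit
        prev = lab
    if current:
        segments.append(torch.stack(current).mean(dim=0))
    return torch.stack(segments) if segments else acoustic[:0]
```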
- after the acoustic features of the speech to be processed are shrunk, the electronic device inputs the shrunk acoustic features into the second encoder, so as to extract high-level semantic features from the shrunk acoustic features through the second encoder.
- the second encoder may include multi-layer self-attention modules, and high-level semantic features in the shrunk acoustic features are extracted through the stacked multi-layer self-attention modules.
- the electronic device may perform shrink processing on the acoustic features of the speech to be processed extracted by the first encoder, that is, eliminate the blank frames and merge the repeated frames in the acoustic features. This reduces the interference of blank frames and repeated frames and facilitates the extraction of high-level semantic features by the second encoder, thereby improving the encoding performance of the encoder and, in turn, the prediction performance of the end-to-end speech translation model.
- a training process for an end-to-end speech translation model is also provided, which can make full use of machine translation data with rich data sources, thereby improving the decoding performance of the decoder.
- the training process of the end-to-end speech translation model may include the following steps.
- although speech recognition parallel data can improve the prediction performance of the end-to-end speech translation model, due to the lack of such parallel data, training the end-to-end speech translation model is very time-consuming and labor-intensive, and the prediction performance still falls short of expectations. However, the number of samples of machine translation parallel data is large, and how to use machine translation parallel data with a large number of samples to train an end-to-end speech translation model is a technical issue worth considering.
- the decoder of the end-to-end speech translation model introduces a continuous decoding mechanism, that is, the decoder first decodes the source language samples in the semantic features, and then decodes the text sequences corresponding to the semantic features based on the source language samples. Therefore, under the continuous decoding structure of the decoder, the machine translation data can be fully utilized, so that the electronic device can pre-train the decoder of the end-to-end speech translation model based on the machine translation data (ie, text translation samples).
- the above-mentioned text translation samples may include source language sample text and target language sample text corresponding to the source language sample text.
- the decoder of the end-to-end speech translation model is pre-trained through a large number of source language sample texts and target language sample texts corresponding to the source language sample texts, so that the decoder converges better.
- the electronic device pre-trains the decoder of the end-to-end speech translation model according to the text translation samples, and the process of obtaining the initial decoder may include the following steps.
- S4011: splice the source language sample text and the target language sample text to obtain a spliced sample sequence.
- the electronic device may splice the source language sample text and the target language sample text through a preset task identifier to obtain a spliced sample sequence.
- the masked cross-entropy loss function is used to mask the prediction loss of the source language prediction text corresponding to the all-zero vector.
- the pre-training stage has no corresponding audio features as input.
- an all-zero vector can be used as the output of the encoder of the end-to-end speech translation model, that is, the all-zero vector is fed into the decoder as the semantic feature of the speech to be processed.
- the masked cross-entropy loss function can be used as the decoder's objective function to mask out the prediction loss of the source language prediction text corresponding to the all-zero vector.
- the electronic device can input the source language sample text and the all-zero vector into the decoder of the end-to-end speech translation model, predict the predicted sample sequence corresponding to the all-zero vector according to the source language sample text, use the masked cross-entropy loss function to calculate the loss value between the predicted sample sequence and the spliced sample sequence, and adjust the parameters of the decoder based on the loss value until the convergence condition of the masked cross-entropy loss function is reached, thereby obtaining the initial decoder. A sketch of this masked objective follows.
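- A minimal sketch of the masked cross-entropy objective, under the assumption that the source-language positions form a contiguous prefix of the spliced sequence; shapes, the pad id, and the split point are illustrative:

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, targets, source_len, pad_id=0):
    """logits: (seq, vocab); targets: (seq,); source_len: #tokens whose loss is masked."""
    loss = F.cross_entropy(logits, targets, ignore_index=pad_id, reduction="none")
    mask = torch.ones_like(loss)
    mask[:source_len] = 0.0  # mask the source language prediction loss
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

# During pre-training there is no audio, so an all-zero vector stands in for
# the encoder output (the semantic feature of the speech to be processed):
zero_memory = torch.zeros(1, 1, 512)  # hidden size 512 is an assumption
```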
- the electronic device adopts a joint optimization method to train the initialized end-to-end speech translation model as a whole. Since the end-to-end speech translation model includes an encoder and a decoder, and the encoder includes a first encoder and a second encoder, the loss value of the loss function of the end-to-end speech translation model is the weighted sum of the first loss value corresponding to the first encoder, the second loss value corresponding to the second encoder, and the third loss value corresponding to the decoder. Based on this weighted sum of the first, second, and third loss values, the parameters of the speech translation model are adjusted until the convergence condition of the loss function is reached, thereby obtaining the end-to-end speech translation model. A sketch of the weighted sum follows.
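- A minimal sketch of the joint objective; the weighting coefficients are illustrative assumptions, not values specified by this application:

```python
import torch

def joint_loss(first_encoder_loss, second_encoder_loss, decoder_loss,
               w1=0.3, w2=0.3, w3=1.0):
    """Weighted sum of the three per-module loss values."""
    return w1 * first_encoder_loss + w2 * second_encoder_loss + w3 * decoder_loss

# Example with placeholder scalar losses:
total = joint_loss(torch.tensor(1.2), torch.tensor(0.8), torch.tensor(2.5))
```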
- the training process of the first encoder may be: using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder, taking the sample phoneme sequence corresponding to the sample speech as the expected output, and training the first encoder based on the connectionist temporal classification loss function.
- the training sample set of the end-to-end speech translation model includes multiple training samples, and each training sample includes sample speech, a sample phoneme sequence corresponding to the sample speech, and a sample text sequence corresponding to the sample speech.
- the electronic device can input the sample speech in the training sample into the first encoder, and obtain the actual output of the first encoder, that is, obtain the actual phoneme sequence.
- the electronic device can use the sample phoneme sequence corresponding to the sample speech in the training sample as the expected output of the first encoder, and calculate the difference between the actual output and the expected output of the first encoder based on the CTC loss function as the first loss value of the end-to-end speech translation model; this first loss value is combined with the second and third loss values in the weighted sum to adjust the parameters of the first encoder, so as to realize the training of the first encoder, as sketched below.
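- A minimal sketch of the CTC supervision of the first encoder against the sample phoneme sequence, using torch.nn.CTCLoss; the phoneme vocabulary size and sequence lengths are illustrative assumptions (blank label = 0):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Stand-in for the first encoder's softmax-layer output over phoneme labels.
logits = torch.randn(100, 1, 60, requires_grad=True)  # (time, batch, phonemes)
log_probs = logits.log_softmax(-1)

phonemes = torch.randint(1, 60, (1, 30))  # sample phoneme sequence (batch, target_len)
input_lengths, target_lengths = torch.tensor([100]), torch.tensor([30])

first_loss = ctc(log_probs, phonemes, input_lengths, target_lengths)
first_loss.backward()  # contributes the first loss value to the weighted sum
```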
- the CTC loss function is introduced as an auxiliary supervision signal for training, and the output of the softmax layer of the first encoder is supervised by the CTC loss function; at the same time, the phoneme sequence is used as the optimization target of the CTC loss function.
- the phoneme sequence is used because phonemes have fewer modeling units, and phonemes, selected according to a pronunciation dictionary, are closer to the pronunciation information of the speech; this makes it easier for the model to learn the speech-to-phoneme mapping and to capture more acoustic information in the sample speech, improving the encoding performance of the encoder.
- the electronic device may pre-train the decoder of the end-to-end speech translation model based on the continuous decoding mechanism of the decoder, initialize the decoder of the end-to-end speech translation model based on the initial decoder obtained after pre-training, and train the initialized end-to-end speech translation model.
- the parallel data of machine translation with a large number of samples can be fully utilized, thereby improving the decoding performance of the decoder and the prediction performance of the end-to-end speech translation model.
- it also avoids the long and slow learning phase in the early stage of model training, which greatly reduces the model training time and improves the training efficiency of the end-to-end speech translation model.
- FIG. 5 is a schematic structural diagram of a speech translation apparatus provided by an embodiment of the present application.
- the apparatus may include: an encoding module 501 , a first decoding module 502 , a second decoding module 503 and a splitting module 504 .
- the encoding module 501 is configured to extract the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model;
- the first decoding module 502 is configured to decode the source language text corresponding to the semantic feature from the semantic feature through the decoder of the end-to-end speech translation model;
- the second decoding module 503 is configured to decode the semantic feature according to the source language text through the decoder of the end-to-end speech translation model to obtain a text sequence corresponding to the semantic feature, where the text sequence includes the source language text and the target language text corresponding to the source language text;
- the splitting module 504 is configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
- the electronic device extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model, and decodes the source language text corresponding to the semantic features from the semantic features through the decoder of the end-to-end speech translation model.
- the decoder of the end-to-end speech translation model then decodes the above semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, and the text sequence is split to obtain the target language text corresponding to the speech to be processed.
- the decoder of the end-to-end speech translation model can first decode the source language text from the semantic features, and then continue to decode the semantic features based on the known source language text.
- the relatively simple source language text is predicted first, and then the relatively difficult target language text is predicted, which relieves the decoding pressure of the decoder; and when predicting the target language text, the source language text corresponding to the target language text is already known, which improves the decoding performance of the decoder and thereby the prediction performance of the end-to-end speech translation model.
- the encoder includes a first encoder and a second encoder
- the encoding module 501 may include: a first encoding unit, an acoustic shrinking unit, and a second encoding unit;
- the first encoding unit is configured to extract acoustic features of the speech to be processed through the first encoder;
- the second encoding unit is configured to extract semantic features in the shrunk acoustic features through the second encoder.
- the pre-training module is set to pre-train the decoder of the end-to-end speech translation model according to the text translation samples to obtain an initial decoder;
- the initialization module is set to initialize the decoder of the end-to-end speech translation model based on the initial decoder;
- the text translation samples include source language sample text and target language sample text corresponding to the source language sample text.
- the pre-training module is configured to splice the source language sample text and the target language sample text to obtain a spliced sample sequence, take the source language sample text and the all-zero vector as the input of the decoder of the end-to-end speech translation model and the spliced sample sequence as the expected output, and pre-train the decoder based on the masked cross-entropy loss function to obtain an initial decoder, wherein the masked cross-entropy loss function is used to mask the prediction loss of the source language prediction text corresponding to the all-zero vector.
- the training process of the first encoder includes: using the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder, Taking the sample phoneme sequence corresponding to the sample speech as the expected output, the first encoder is trained based on the time series classification loss function.
- the acoustic shrinking unit is configured to detect blank frames and repeated frames in the acoustic feature based on the peak characteristic of the probability distribution of the time-series classification loss function.
- FIG. 6 shows a schematic structural diagram of an electronic device 600 suitable for implementing an embodiment of the present disclosure.
- the electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (for example, in-vehicle navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers.
- the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
- the electronic device 600 may include a processing device (such as a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from the storage device 606 into a random access memory (RAM) 603.
- in the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored.
- the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
- An Input/Output (I/O) interface 605 is also connected to the bus 604.
- the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, a vibrator, etc.; storage devices 606 including, for example, magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data.
- although FIG. 6 shows the electronic device 600 having various means, it is not required to implement or have all of the illustrated means; more or fewer means may alternatively be implemented or provided.
- embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
- the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 606, or from the ROM 602.
- when the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
- the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
- the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
- Computer readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Flash memory, optical fiber, portable Compact Disc Read-Only Memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- the program code embodied on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
- clients and servers can communicate using any currently known or future developed network protocol, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
- examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
- the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
- the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: extracts the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model; Through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic feature is decoded from the semantic feature; through the decoder of the end-to-end speech translation model, according to the source language text Decoding the semantic feature to obtain a text sequence corresponding to the semantic feature, wherein the text sequence includes the source language text and the target language text corresponding to the source language text; splitting the text sequence to obtain the The target language text corresponding to the speech to be processed.
- computer program code for performing operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, using an Internet service provider to connect through the Internet).
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the blocks may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or operations, or by combinations of special purpose hardware and computer instructions.
- the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner.
- the name of a unit does not, in some cases, constitute a limitation of the unit itself; for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
- exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
- the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
- machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fibers, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the above.
- the speech translation apparatus, device, and storage medium provided in the above embodiments can execute the speech translation method provided by any embodiment of the present application, and have corresponding functional modules and effects for executing the method. For technical details not described in detail above, reference may be made to the speech translation method provided by any embodiment of the present application.
- a speech translation method comprising: extracting the semantic features of the speech to be processed through the encoder of an end-to-end speech translation model;
- decoding, through the decoder of the end-to-end speech translation model, the source language text corresponding to the semantic features from the semantic features;
- decoding the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, where the text sequence includes the source language text and the target language text corresponding to the source language text;
- splitting the text sequence to obtain the target language text corresponding to the speech to be processed;
- the encoder includes a first encoder and a second encoder.
- the above speech translation method is provided, further comprising: extracting the acoustic features of the speech to be processed through the first encoder; detecting blank frames and repeated frames in the acoustic features, removing the blank frames and merging the repeated frames to obtain shrunk acoustic features; and extracting, through the second encoder, the semantic features in the shrunk acoustic features.
- the above speech translation method is provided, further comprising: pre-training the decoder of the end-to-end speech translation model according to text translation samples to obtain an initial decoder; initializing the decoder of the end-to-end speech translation model based on the initial decoder; and training the initialized end-to-end speech translation model.
- the text translation sample includes source language sample text and target language sample text corresponding to the source language sample text.
- the above speech translation method is provided, further comprising: splicing the source language sample text and the target language sample text to obtain a spliced sample sequence; taking the source language sample text and the all-zero vector as the input of the decoder of the end-to-end speech translation model and the spliced sample sequence as the expected output; and pre-training the decoder based on the masked cross-entropy loss function to obtain the initial decoder, where the masked cross-entropy loss function is used to mask the prediction loss of the source language prediction text corresponding to the all-zero vector.
- the above speech translation method is provided, further comprising: taking the sample speech in the training sample set of the end-to-end speech translation model as the input of the first encoder and the sample phoneme sequence corresponding to the sample speech as the expected output, and training the first encoder based on the connectionist temporal classification loss function.
- the above speech translation method further comprising: detecting blank frames and repeated frames in the acoustic feature based on the peak characteristic of the probability distribution of the time-series classification loss function.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (10)
- A speech translation method, comprising: extracting semantic features of speech to be processed through an encoder of an end-to-end speech translation model; decoding, through a decoder of the end-to-end speech translation model, source language text corresponding to the semantic features from the semantic features; decoding, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, wherein the text sequence includes the source language text and target language text corresponding to the source language text; and splitting the text sequence to obtain the target language text corresponding to the speech to be processed.
- The method according to claim 1, wherein the encoder includes a first encoder and a second encoder, and extracting the semantic features of the speech to be processed through the encoder of the end-to-end speech translation model includes: extracting acoustic features of the speech to be processed through the first encoder; detecting blank frames and repeated frames in the acoustic features, removing the blank frames and merging the repeated frames to obtain shrunk acoustic features; and extracting semantic features in the shrunk acoustic features through the second encoder.
- The method according to claim 1, wherein the training process of the end-to-end speech translation model includes: pre-training the decoder of the end-to-end speech translation model according to text translation samples to obtain an initial decoder; initializing the decoder of the end-to-end speech translation model based on the initial decoder; and training the initialized end-to-end speech translation model.
- The method according to claim 3, wherein the text translation samples include source language sample text and target language sample text corresponding to the source language sample text.
- The method according to claim 4, wherein pre-training the decoder of the end-to-end speech translation model according to the text translation samples to obtain the initial decoder includes: splicing the source language sample text and the target language sample text to obtain a spliced sample sequence; taking the source language sample text and an all-zero vector as the input of the decoder of the end-to-end speech translation model and the spliced sample sequence as the expected output, and pre-training the decoder based on a masked cross-entropy loss function to obtain the initial decoder, wherein the masked cross-entropy loss function is used to mask the prediction loss of the source language prediction text corresponding to the all-zero vector.
- The method according to any one of claims 2 to 5, wherein the training process of the first encoder includes: taking sample speech in a training sample set of the end-to-end speech translation model as the input of the first encoder and a sample phoneme sequence corresponding to the sample speech as the expected output, and training the first encoder based on a connectionist temporal classification loss function.
- The method according to claim 2, wherein detecting the blank frames and repeated frames in the acoustic features includes: detecting the blank frames and repeated frames in the acoustic features based on the peak characteristic of the probability distribution of a connectionist temporal classification loss function.
- A speech translation apparatus, comprising: an encoding module configured to extract semantic features of speech to be processed through an encoder of an end-to-end speech translation model; a first decoding module configured to decode, through a decoder of the end-to-end speech translation model, source language text corresponding to the semantic features from the semantic features; a second decoding module configured to decode, through the decoder of the end-to-end speech translation model, the semantic features according to the source language text to obtain a text sequence corresponding to the semantic features, wherein the text sequence includes the source language text and target language text corresponding to the source language text; and a splitting module configured to split the text sequence to obtain the target language text corresponding to the speech to be processed.
- An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the method of any one of claims 1 to 7 when executing the computer program.
- A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/245,802 US20240028841A1 (en) | 2020-09-18 | 2021-09-02 | Speech translation method, device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010987456.7 | 2020-09-18 | ||
CN202010987456.7A CN112183120B (zh) | 2020-09-18 | Speech translation method, apparatus, device and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022057637A1 true WO2022057637A1 (zh) | 2022-03-24 |
Family
ID=73955233
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/116232 WO2022057637A1 (zh) | Speech translation method, apparatus, device and storage medium | 2020-09-18 | 2021-09-02
Country Status (3)
Country | Link |
---|---|
US (1) | US20240028841A1 (zh) |
CN (1) | CN112183120B (zh) |
WO (1) | WO2022057637A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115312029A (zh) * | 2022-10-12 | 2022-11-08 | Zhejiang Lab | Speech translation method and system based on deep speech representation mapping |
CN117056709A (zh) * | 2023-10-11 | 2023-11-14 | Tencent Technology (Shenzhen) Co., Ltd. | Training method and apparatus for a time-series prediction model, storage medium, and electronic device |
CN117094329A (zh) * | 2023-10-13 | 2023-11-21 | Harbin Institute of Technology (Shenzhen) | Speech translation method and apparatus for resolving speech ambiguity |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183120B (zh) * | 2020-09-18 | 2023-10-20 | Beijing ByteDance Network Technology Co., Ltd. | Speech translation method, apparatus, device and storage medium |
CN112800782B (zh) * | 2021-01-29 | 2023-10-03 | Institute of Automation, Chinese Academy of Sciences | Speech translation method, system and device fusing text semantic features |
CN112908341B (zh) * | 2021-02-22 | 2023-01-03 | Harbin Engineering University | Voiceprint recognition method for language learners based on a multi-task self-attention mechanism |
CN113129868B (zh) * | 2021-03-12 | 2022-02-25 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method for obtaining a speech recognition model, speech recognition method, and corresponding apparatus |
CN113299274B (zh) * | 2021-05-18 | 2024-03-01 | Ping An Technology (Shenzhen) Co., Ltd. | Method, apparatus, device and medium for translation between vernacular and classical Chinese and for speech synthesis |
CN113408305B (zh) * | 2021-06-30 | 2023-03-24 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Model training method, apparatus, device and storage medium |
CN113505611B (zh) * | 2021-07-09 | 2022-04-15 | PLA Strategic Support Force Information Engineering University | Training method and system for obtaining a better speech translation model under adversarial training |
CN113571044A (zh) * | 2021-07-28 | 2021-10-29 | Beijing Youzhuju Network Technology Co., Ltd. | Speech information processing method, apparatus and electronic device |
CN114048758A (zh) * | 2021-11-10 | 2022-02-15 | Beijing Youzhuju Network Technology Co., Ltd. | Training method, speech translation method, device and computer-readable medium |
CN115831089B (zh) * | 2021-12-27 | 2023-12-01 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method, apparatus, device, medium and product for determining acoustic features |
CN114822498B (zh) * | 2022-03-29 | 2024-06-07 | Beijing Youzhuju Network Technology Co., Ltd. | Training method for a speech translation model, speech translation method, apparatus and device |
CN117113091B (zh) * | 2023-10-24 | 2024-02-13 | Institute of Automation, Chinese Academy of Sciences | Speech translation model training method, apparatus, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108231062A (zh) * | 2018-01-12 | 2018-06-29 | iFlytek Co., Ltd. | Speech translation method and apparatus |
US10249294B2 (en) * | 2016-09-09 | 2019-04-02 | Electronics And Telecommunications Research Institute | Speech recognition system and method |
CN111326157A (zh) * | 2020-01-20 | 2020-06-23 | Beijing ByteDance Network Technology Co., Ltd. | Text generation method, apparatus, electronic device and computer-readable medium |
CN111368559A (zh) * | 2020-02-28 | 2020-07-03 | Beijing ByteDance Network Technology Co., Ltd. | Speech translation method, apparatus, electronic device and storage medium |
CN112183120A (zh) * | 2020-09-18 | 2021-01-05 | Beijing ByteDance Network Technology Co., Ltd. | Speech translation method, apparatus, device and storage medium |
-
2020
- 2020-09-18 CN CN202010987456.7A patent/CN112183120B/zh active Active
-
2021
- 2021-09-02 US US18/245,802 patent/US20240028841A1/en active Pending
- 2021-09-02 WO PCT/CN2021/116232 patent/WO2022057637A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10249294B2 (en) * | 2016-09-09 | 2019-04-02 | Electronics And Telecommunications Research Institute | Speech recognition system and method |
CN108231062A (zh) * | 2018-01-12 | 2018-06-29 | iFlytek Co., Ltd. | Speech translation method and apparatus |
CN111326157A (zh) * | 2020-01-20 | 2020-06-23 | Beijing ByteDance Network Technology Co., Ltd. | Text generation method, apparatus, electronic device and computer-readable medium |
CN111368559A (zh) * | 2020-02-28 | 2020-07-03 | Beijing ByteDance Network Technology Co., Ltd. | Speech translation method, apparatus, electronic device and storage medium |
CN112183120A (zh) * | 2020-09-18 | 2021-01-05 | Beijing ByteDance Network Technology Co., Ltd. | Speech translation method, apparatus, device and storage medium |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115312029A (zh) * | 2022-10-12 | 2022-11-08 | Zhejiang Lab | Speech translation method and system based on deep speech representation mapping |
CN117056709A (zh) * | 2023-10-11 | 2023-11-14 | Tencent Technology (Shenzhen) Co., Ltd. | Training method and apparatus for a time-series prediction model, storage medium, and electronic device |
CN117094329A (zh) * | 2023-10-13 | 2023-11-21 | Harbin Institute of Technology (Shenzhen) | Speech translation method and apparatus for resolving speech ambiguity |
CN117094329B (zh) * | 2023-10-13 | 2024-02-02 | Harbin Institute of Technology (Shenzhen) | Speech translation method and apparatus for resolving speech ambiguity |
Also Published As
Publication number | Publication date |
---|---|
CN112183120B (zh) | 2023-10-20 |
CN112183120A (zh) | 2021-01-05 |
US20240028841A1 (en) | 2024-01-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022057637A1 (zh) | Speech translation method, apparatus, device and storage medium | |
CN111326157B (zh) | Text generation method, apparatus, electronic device and computer-readable medium | |
WO2022033327A1 (zh) | Video generation method, generation model training method, apparatus, medium and device | |
CN112786006B (zh) | Speech synthesis method, synthesis model training method, apparatus, medium and device | |
CN113327609B (zh) | Method and apparatus for speech recognition | |
CN111402861B (zh) | Speech recognition method, apparatus, device and storage medium | |
CN110097870B (zh) | Speech processing method, apparatus, device and storage medium | |
CN111368559A (zh) | Speech translation method, apparatus, electronic device and storage medium | |
WO2022037526A1 (zh) | Speech recognition method, apparatus, electronic device and storage medium | |
US20240029709A1 (en) | Voice generation method and apparatus, device, and computer readable medium | |
CN111489735B (zh) | Speech recognition model training method and apparatus | |
WO2022228041A1 (zh) | Translation model training method, apparatus, device and storage medium | |
WO2022127620A1 (zh) | Voice wake-up method, apparatus, electronic device and storage medium | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
WO2023005729A1 (zh) | Speech information processing method, apparatus and electronic device | |
US20230306979A1 (en) | Voice processing method and apparatus, electronic device, and computer readable medium | |
CN114550702A (zh) | Speech recognition method and apparatus | |
CN115967833A (zh) | Video generation method, apparatus, device and storage medium | |
CN111128131B (zh) | Speech recognition method, apparatus, electronic device and computer-readable storage medium | |
CN111933119A (zh) | Method, apparatus, electronic device and medium for generating a speech recognition network | |
CN113986958B (zh) | Text information conversion method, apparatus, readable medium and electronic device | |
CN112836476B (zh) | Minutes generation method, apparatus, device and medium | |
CN117316160B (zh) | Silent speech recognition method, apparatus, electronic device and computer-readable medium | |
CN110634475B (zh) | Speech recognition method, apparatus, electronic device and computer-readable storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21868464 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18245802 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 030723) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21868464 Country of ref document: EP Kind code of ref document: A1 |