CN117094329B

CN117094329B - Voice translation method and device for solving voice ambiguity

Info

Publication number: CN117094329B
Application number: CN202311326597.4A
Authority: CN
Inventors: 刘学博; 于腾斐; 李辰; 陈科海; 张民
Original assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Current assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Priority date: 2023-10-13
Filing date: 2023-10-13
Publication date: 2024-02-02
Anticipated expiration: 2043-10-13
Also published as: CN117094329A

Abstract

The invention discloses a voice translation method and device for solving voice ambiguity, and relates to the technical field of voice translation. Comprising the following steps: acquiring voice data to be translated; constructing a homonym dictionary; inputting voice data into the constructed voice translation model; and obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model. The invention constructs a high-efficiency voice disambiguation method, which can effectively relieve ambiguity in a voice translation model and improve the accuracy of voice translation.

Description

Voice translation method and device for solving voice ambiguity

Technical Field

The invention relates to the technical field of voice translation, in particular to a voice translation method and device for solving voice ambiguity.

Background

With the development of globalization and the increase of cross-cultural communication, the automatic speech translation technology is widely applied in various application scenes. End-to-End ST (End-to-End Speech Translation ) is an important research direction in this field recently. It aims to directly convert acoustic speech signals from one language into textual descriptions in another language. Compared with the traditional cascade staged translation method, the method can reduce error accumulation in the information transmission process and realize lower time delay, and therefore, the method has been widely paid attention in recent years.

Recent research progress has shown that problems caused by data limitation in ST model development can be effectively handled by combining joint pre-training of speech and text, but this increases the complexity of the model to be processed due to the cross-modal (acoustic to text) and cross-language transformations involved. Specifically, ST models face the problem of double ambiguity of acoustics and semantics. One corresponding area of this problem in text machine translation is word sense disambiguation. In the context of ST, the dual ambiguous representation of acoustics and semantics deals with homonyms, i.e., vocabularies that are the same in pronunciation but differ semantically. Accurate translation of these words is critical to ensure accuracy and reliability of translation.

In the related art, other inventions have attempted to solve the ambiguity problem by enhancing the understanding of the context by a speech translation model. However, the enhancement achieved still does not meet the desired requirements. Thus, ambiguity remains one of the significant sources of error in the speech translation model.

Disclosure of Invention

The invention aims at the problem of voice ambiguity in the prior art and provides the voice ambiguity detection method.

In order to solve the technical problems, the invention provides the following technical scheme:

in one aspect, the present invention provides a speech translation method for resolving speech ambiguity, the method implemented by an electronic device, the method comprising:

s1, acquiring voice data to be translated.

S2, constructing a homonym dictionary.

S3, inputting the voice data into the constructed voice translation model.

S4, obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model.

Optionally, constructing a homonym dictionary in S2 includes:

and constructing a homonym dictionary by using the source voice data and the Montreal forced aligner.

Optionally, the speech translation model includes a speech encoder, a translation encoder, and a translation decoder.

The construction process of the voice translation model in the S3 comprises the following steps:

s31, acquiring a source voice sequence sample, and marking an ambiguous voice fragment in the source voice sequence sample by utilizing a homonym dictionary to obtain a marked source voice sequence sample.

S32, compressing the marked source voice sequence sample by utilizing a voice encoder to obtain the source voice sequence sample in a hidden state.

S33, processing the source voice sequence sample in the hidden state by using a translation encoder to obtain a first voice coding characteristic.

S34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using a translation encoder to obtain a second voice coding feature.

And S35, calculating to obtain a character level comparison learning loss function and a sentence level comparison learning loss function according to the first voice coding characteristic and the second voice coding characteristic.

S36, processing the source voice sequence sample in the hidden state by using a translation decoder to obtain a first probability distribution of the target text.

S37, processing the masked speech sequence samples by using a translation decoder to obtain a second probability distribution of the target text.

S38, according to the first probability distribution and the second probability distribution, a model level comparison learning loss function and a voice translation loss function are obtained through calculation, a total loss function is further obtained, a voice translation model is trained according to the total loss function, and a built voice translation model is obtained.

Optionally, the obtaining a source voice sequence sample in S31, labeling the ambiguous voice segment in the source voice sequence sample by using the homonym dictionary, to obtain a labeled source voice sequence sample, including:

and obtaining a source voice sequence sample of the triplet, and marking the ambiguous voice fragments in the source voice sequence sample by utilizing a homonym dictionary to obtain the marked source voice sequence sample of the triplet.

Wherein the quintuples include speech, transcribed text, translated text, ambiguous words, and the position of the ambiguous words in the sentence.

Optionally, masking the source speech sequence samples in the hidden state in S34 to obtain masked speech sequence samples, including:

s341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state.

S342, aligning the source voice sequence sample in the hidden state and the transcribed text.

S343, generating homonym perception shielding matrix according to the aligned hidden state source voice sequence sample, the transcribed text and the homonym dictionary.

S344, masking the source voice sequence sample in the hidden state according to the homonym perception masking matrix to obtain a masked voice sequence sample.

Optionally, the character level contrast learning loss function in S35The following formula (1) shows:

（1）

wherein,for indicating function +.>For the number of source speech sequence samples, < > for>Masking representation of the context of the source speech sequence sample for the hidden state, +.>Context representation of source speech sequence sample for hidden state, for example>Is used for the temperature super-parameter,to calculate cosine similarity.

Optionally, sentence level contrast learning loss function in S35The following formula (2) shows:

（2）

wherein,is->Sentence-level representation obtained by averaging the context representation of the source speech sequence samples of the hidden states in the time dimension +.>Is->Sentence-level representation, obtained by averaging, in the time dimension, the masked representation of the context of the source speech sequence sample of the hidden state,/v>Is temperature super parameter, < >>To be of the size of a small batch example,to calculate cosine similarity.

Optionally, model level contrast learning loss function in S38The following formula (3) shows:

（3）

wherein,represents Kullback-Leibler divergence, < >>For the total label of the translated text, +.>Contextual representation of source speech sequence samples for a given hidden state +.>In the case of input as translation encoder +.>Predictive probability distribution of individual target markers, +.>Masking table for context of source speech sequence samples for a given hidden stateShow->In the case of input as translation encoder +.>Predictive probability distribution for each target marker.

Optionally, the total loss function in S38The following formula (4) shows:

（4）

wherein,、is the coefficient weight, ++>For speech translation loss function, < >>Learning loss function for model level contrast +.>Learning a loss function for character level contrast, +.>The loss function is learned for sentence level comparison.

In another aspect, the present invention provides a speech translation apparatus for resolving speech ambiguity, the apparatus being applied to implement a speech translation method for resolving speech ambiguity, the apparatus comprising:

and the acquisition module is used for acquiring the voice data to be translated.

And the construction module is used for constructing homonym dictionary.

And the input module is used for inputting the voice data into the constructed voice translation model.

And the output module is used for obtaining the translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model.

Optionally, the building module is further configured to:

An input module, further configured to:

Optionally, the input module is further configured to:

Optionally, character level contrast learning loss functionThe following formula (1) shows:

（1）

Alternatively, sentence level contrast learning loss functionThe following formula (2) shows:

（2）

Optionally, model level contrast learning loss functionThe following formula (3) shows:

（3）

wherein,represents Kullback-Leibler divergence, < >>For translating the total label of the text, +.>Contextual representation of source speech sequence samples for a given hidden state +.>In the case of input as translation encoder +.>Predictive probability distribution of individual target markers, +.>Masking representation of the context of a source speech sequence sample for a given hidden state +.>In the case of input as translation encoder +.>Predictive probability distribution for each target marker.

Optionally, the total loss functionThe following formula (4) shows:

（4）

In one aspect, an electronic device is provided, the electronic device including a processor and a memory, the memory storing at least one instruction loaded and executed by the processor to implement the above-described speech translation method for resolving speech ambiguities.

In one aspect, a computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement the above-described speech translation method for resolving speech ambiguities is provided.

Compared with the prior art, the technical scheme has at least the following beneficial effects:

according to the scheme, an efficient voice disambiguation method is constructed, and the current best performance (BLEU score) is obtained on the voice translation tasks of MUST-C data sets from English to German, from English to French and from English to Spanish.

The method has the advantages that no extra parameter is added to the model, only the data are preprocessed, a contrast learning strategy is adopted in the training stage, and the method is visual and easy to understand, simple and efficient.

The invention can solve the problem of voice ambiguity existing in the prior art, and especially aims at challenges of homonyms.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow diagram of a speech translation method for resolving speech ambiguity according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a homonym dictionary construction method according to an embodiment of the present invention;

FIG. 3 is a diagram of an overall architecture of a model provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a voice masking step provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of character level contrast learning provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of sentence level contrast learning provided by an embodiment of the present invention;

FIG. 7 is a block diagram of a speech translation apparatus for resolving speech ambiguities provided by an embodiment of the present invention;

fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides a speech translation method for resolving speech ambiguity, which may be implemented by an electronic device. A speech translation method flowchart for resolving speech ambiguities as shown in fig. 1, the process flow of which may include the steps of:

s1, acquiring voice data to be translated.

S2, constructing a homonym dictionary by using the source voice data and the Montreal forced aligner.

In a possible embodiment, the speech translation dataset comprises (speech-transcribed text-translated text) triples. Thus, the present invention creates a dictionary of homonyms and labels a data set containing homonym information based thereon.

Specifically, FIG. 2 illustrates a process for constructing a homonym dictionary that includes obtaining an original dataset, inputting the original dataset into an acoustic model, and then obtaining a pronunciation-labeled word set to construct the homonym dictionary. The invention utilizes a Montreal forced alignment tool to share transcription of phonemes to construct a homonym dictionary. Homonym dictionary consists of a set of words that share the same pronunciation. Phonemes such as "HH uhd" will include, for example, { hood }. Also the phonemes of "hhae D" will include { had, head }.

S3, inputting the voice data into the constructed voice translation model.

The speech translation model comprises a speech coder, a translation coder and a translation decoder.

The construction process of the speech translation model in S3 may include the following steps S31-S38:

s31, acquiring a source voice sequence sample of the triplet, and marking the ambiguous voice fragments in the source voice sequence sample by utilizing a homonym dictionary to obtain the marked source voice sequence sample of the triplet.

In one possible embodiment, the annotated data set consists of a five-tuple containing the ambiguous words and their position in the sentence. For a five-tuple data, the invention locates homonyms in the data by using transcribed text and a homonym dictionary and saves their positions in the sentence. For one labeling example, as shown in Table 1 below:

TABLE 1

After the data are marked, the invention performs model training based on the data set.

In one possible embodiment, FIG. 3 illustrates the overall model architecture of the present invention. The invention reconstructs the base model into three unique components: speech encoder, translation encoder and translation decoder. The speech coder first compresses the speech representation into a hidden state. These hidden states then serve as inputs to the translation encoder, producing rich semantic information extracted from the reduced speech data. The translation decoder generates a result from the output of the translation encoder. In addition, the model of the invention integrates the pre-training parameters from the unified voice text pre-training method, thereby enhancing the effect of the model in voice translation tasks.

Optionally, masking the source speech sequence samples in the hidden state in the step S34 to obtain masked speech sequence samples may include the following steps S341 to S344:

In one possible implementation, to alleviate the problem of phonetic ambiguity, the present invention introduces a novel homonym perception masking strategy, using the constructed annotation dataset. The flow is shown in FIG. 4, where the speech encoder receives the original sequence given an audioAs input and generate its context representation +.>。

In one possible embodiment, the present invention employs a word-level based forced alignment technique to align between speech and transcription to determine the temporal occurrence of individual words in a speech segment.

In one possible implementation, homonym perception masking matrices for phonetic representations are generated in correspondence with homonym dictionaries, denoted asTo determine speechRepresenting the exact location of the homonym segment. With a certain probability->The masking matrix is calculated as follows:

（1）

wherein,from uniform distribution +.>Middle sampling,/->An index set representing homonym representations in a speech sequence.

In a possible implementation, the semantics of a precise single homonymic unit are critical to effectively resolve phonetic meaning ambiguity, as shown in fig. 5 for character level contrast learning. The invention aims to promote the progress of contrast learning by utilizing homonym information, and particularly focuses on complex granularity levels, so that a character-level contrast learning method is provided. The present invention uses the same model to generate the output of the speech encoder twice. In one example, homonym-aware masking policies are applied to generate a masked representation, denoted as。

The proposed marker level contrast learning objective definition is shown in the following formula (2):

（2）

wherein,to indicate a function, if->Representing a masked homonym, 1 otherwise 0,/or->For the number of source speech sequence samples, < > for>Masking representation of the context of the source speech sequence sample for the hidden state, +.>Context representation of source speech sequence sample for hidden state, for example>Is temperature super parameter, < >>To calculate cosine similarity.

The basic idea of this approach is to encourage the generation of mask-marks that are closely aligned with the corresponding homonyms generated by the model, while remaining distinguishable from other marks in the sequence. Thus, the present invention enhances the ability of the model to efficiently understand homonym related features.

Further, sentence level contrast learning as shown in fig. 6, in order to further enhance the effect of contrast learning and identify the best sentence level representation, the present invention introduces a self-supervision method focusing on sentence level contrast learning. The invention is to pair in the time dimensionAnd->Averaging is performed. Thus, sentence-level representations of original and homonymous masked forms are obtained, noted +.>，. For a size of +.>Is a small lot example of>Sentence-level comparison learning objectives of the individual sentences are shown in the following formula (3):

（3）

wherein,is->Sentence-level representation obtained by averaging the context representation of the source speech sequence samples of the hidden states in the time dimension +.>Is->Sentence-level representation, obtained by averaging, in the time dimension, the masked representation of the context of the source speech sequence sample of the hidden state,/v>Is temperature super parameter, < >>Size of small lot example, +.>To calculate cosine similarity.

By employing this goal, the model is enabled to fully take into account the broader context and interrelationships inherent in sentences, thereby capturing complex semantic nuances.

In one possible embodiment, to ensure consistent guidance for extracting context-aware representations from speech, the present invention proposes a fine model-level contrast learning framework. This framework is specifically designed to address challenges presented by homonyms and to identify information located near ambiguous markers. It exploits the inherent knowledge of a single network, i.e. self knowledge distillation, through prediction of diverse samples. This strategy can be seen as a unique variant of contrast learning, which is characterized by the presence of only positive examples.

Specifically, after the original and occluded context representations are obtained from the speech encoder, the model level contrast learning objective is defined as follows:

（4）

wherein,represents Kullback-Leibler divergence, < >>For translating the total label of the text, +.>Source language for a given hidden stateContext representation of a sound sequence sample +.>In the case of input as translation encoder +.>Predictive probability distribution of individual target markers, +.>Masking representation of the context of a source speech sequence sample for a given hidden state +.>In the case of input as translation encoder +.>Predictive probability distribution for each target marker.

Further, by all the methods proposed, speech translation loss is combinedFinal training targetThe expression can be as follows:

（5）

wherein,、for controlling->The present invention uses them to ensure that the model maintains a balanced distribution of interest among different tasks, +.>For speech translation loss function, < >>Learning loss function for model level contrast +.>Learning a loss function for character level contrast, +.>The loss function is learned for sentence level comparison.

Table 2 below shows the test results on the tst-COMMON dataset of the multilingual speech translation corpus MuST-C English to German, english to French, and English to Spanish. Models 1 through 4 represent existing benchmarks in this translation task. Model 5 represents an implementation of the present invention on a spec UT, which previously achieved the most advanced performance.

TABLE 2

The invention first breaks it down into constituent sub-modules and evaluates their respective properties. The components are as follows: model 6 introduces model-level contrast learning on the basis of model 5. Model 7 incorporates a sentence-level contrast learning method into model 6. Model 8 applies character-level contrast learning, rather than sentence-level contrast learning methods used in model 6. The evaluation results show that the individual application of each sub-module further improves the BLEU (bilingual evaluation understudy, bilingual evaluation study) and BLEURT (Bilingual Evaluation Understudy with Representations from Transformers, transformer-characterization-based bilingual evaluation study) scores, as well as the accuracy of homonym translation.

In the embodiment of the invention, an efficient voice disambiguation method is constructed, and the current best performance (BLEU score) is obtained on the voice translation tasks of MUST-C data sets from English to German, from English to French and from English to Spanish.

As shown in fig. 7, an embodiment of the present invention provides a speech translation apparatus 700 for resolving speech ambiguity, the apparatus 700 being applied to implement a speech translation method for resolving speech ambiguity, the apparatus 700 comprising:

the obtaining module 710 is configured to obtain voice data to be translated.

A construction module 720, configured to construct a homonym dictionary.

An input module 730 for inputting the speech data into the constructed speech translation model.

And an output module 740, configured to obtain a translated text of the speech data according to the speech data, the homonym dictionary, and the speech translation model.

Optionally, the construction module 720 is further configured to:

The input module 730 is further configured to:

Optionally, the input module 730 is further configured to:

（1）/>

（2）

（3）

Optionally, the total loss functionThe following formula (4) shows:

（4）

Fig. 8 is a schematic structural diagram of an electronic device 800 according to an embodiment of the present invention, where the electronic device 800 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 801 and one or more memories 802, where at least one instruction is stored in the memories 802, and the at least one instruction is loaded and executed by the processor 801 to implement the following speech translation method for resolving speech ambiguity:

s1, acquiring voice data to be translated.

S2, constructing a homonym dictionary.

S3, inputting the voice data into the constructed voice translation model.

In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the above-described speech translation method for resolving speech ambiguities, is also provided. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A speech translation method for resolving speech ambiguities, the method comprising:

s1, acquiring voice data to be translated;

s2, constructing a homonym dictionary;

s3, inputting the voice data into the constructed voice translation model;

s4, obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model;

the speech translation model comprises a speech encoder, a translation encoder and a translation decoder;

the construction process of the speech translation model in the S3 comprises the following steps:

s31, acquiring a source voice sequence sample, and marking an ambiguous voice fragment in the source voice sequence sample by using the homonym dictionary to obtain a marked source voice sequence sample;

s32, compressing the marked source voice sequence sample by utilizing the voice encoder to obtain a source voice sequence sample in a hidden state;

s33, processing the source voice sequence sample in the hidden state by using the translation encoder to obtain a first voice coding characteristic;

s34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using the translation encoder to obtain a second voice coding feature;

s35, calculating to obtain a character level comparison learning loss function and a sentence level comparison learning loss function according to the first voice coding characteristics and the second voice coding characteristics;

s36, processing the source voice sequence sample in the hidden state by using the translation decoder to obtain a first probability distribution of the target text;

s37, processing the masked voice sequence samples by using the translation decoder to obtain second probability distribution of the target text;

s38, according to the first probability distribution and the second probability distribution, a model level comparison learning loss function and a voice translation loss function are obtained through calculation, and then a total loss function is obtained, a voice translation model is trained according to the total loss function, and a built voice translation model is obtained.

2. The method of claim 1, wherein constructing the homonym dictionary in S2 comprises:

3. The method according to claim 1, wherein the obtaining the source voice sequence sample in S31, labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary, and obtaining the labeled source voice sequence sample includes:

obtaining a source voice sequence sample of a triplet, and marking an ambiguous voice fragment in the source voice sequence sample by using the homonym dictionary to obtain a marked source voice sequence sample of the triplet;

wherein the quintuples include speech, transcribed text, translated text, ambiguous words, and locations of ambiguous words in sentences.

4. The method according to claim 1, wherein masking the source speech sequence samples in the hidden state in S34 to obtain masked speech sequence samples comprises:

s341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state;

s342, aligning the source voice sequence sample in the hidden state and the transcribed text;

s343, generating a homonym perception shielding matrix according to the aligned source voice sequence sample in the hidden state, the transcribed text and the homonym dictionary;

and S344, masking the source voice sequence sample in the hidden state according to the homonym perception masking matrix to obtain a masked voice sequence sample.

5. The method of claim 1, wherein the character level contrast learning loss function in S35The following formula (1) shows:

（1）

6. The method of claim 1, wherein the sentence-level contrast learning loss function in S35The following formula (2) shows:

（2）

7. The method of claim 1, wherein the model level contrast learning loss function in S38The following formula (3) shows:

（3）

8. The method according to claim 1, wherein the total loss function in S38The following formula (4) shows:

（4）

wherein,、is the coefficient weight, ++>For speech translation loss function, < >>The loss function is learned for model-level comparison,learning a loss function for character level contrast, +.>The loss function is learned for sentence level comparison.

9. A speech translation apparatus for resolving speech ambiguities, the apparatus comprising:

the acquisition module is used for acquiring voice data to be translated;

the construction module is used for constructing homonym dictionary;

the input module is used for inputting the voice data into the constructed voice translation model;

the output module is used for obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model;

the construction process of the voice translation model comprises the following steps: