CN117094329B - Voice translation method and device for solving voice ambiguity
- Publication number: CN117094329B
- Application number: CN202311326597.4A
- Authority: CN (China)
- Prior art keywords: voice, translation, source, sequence sample, speech
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/44 - Handling natural language data; processing or translation of natural language; data-driven translation; statistical methods, e.g. probability models
- G06F40/211 - Handling natural language data; natural language analysis; parsing; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/242 - Handling natural language data; natural language analysis; lexical tools; dictionaries
- G06F40/30 - Handling natural language data; semantic analysis
- G10L15/1815 - Speech recognition; speech classification or search using natural language modelling; semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/26 - Speech recognition; speech to text systems
Abstract
The invention discloses a voice translation method and device for solving voice ambiguity, and relates to the technical field of voice translation. The method comprises the following steps: acquiring voice data to be translated; constructing a homonym dictionary; inputting the voice data into a constructed voice translation model; and obtaining a translated text of the voice data according to the voice data, the homonym dictionary and the voice translation model. The invention constructs an efficient voice disambiguation method that effectively alleviates ambiguity in a voice translation model and improves the accuracy of voice translation.
Description
Technical Field
The invention relates to the technical field of voice translation, in particular to a voice translation method and device for solving voice ambiguity.
Background
With the development of globalization and the increase of cross-cultural communication, automatic speech translation technology is widely applied in various scenarios. End-to-End Speech Translation (End-to-End ST) has recently become an important research direction in this field. It aims to directly convert an acoustic speech signal in one language into a textual description in another language. Compared with the traditional cascaded, staged translation approach, it can reduce error accumulation in the information transfer process and achieve lower latency, and has therefore received wide attention in recent years.
Recent research progress has shown that the problems caused by data limitations in ST model development can be effectively handled by joint pre-training of speech and text, but the cross-modal (acoustic-to-text) and cross-language transformations involved increase the complexity the model must handle. Specifically, ST models face a problem of dual acoustic and semantic ambiguity. The corresponding problem in text machine translation is word sense disambiguation. In the ST setting, the dual acoustic and semantic ambiguity concerns homonyms, i.e., words that share the same pronunciation but differ in meaning. Accurate translation of these words is critical to ensuring the accuracy and reliability of translation.
In the related art, other inventions have attempted to solve the ambiguity problem by enhancing the speech translation model's understanding of context. However, the improvement achieved still falls short of what is needed, and ambiguity remains one of the significant sources of error in speech translation models.
Disclosure of Invention
The invention addresses the problem of voice ambiguity in the prior art and provides a voice translation method and device for solving voice ambiguity.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the present invention provides a speech translation method for resolving speech ambiguity, the method implemented by an electronic device, the method comprising:
s1, acquiring voice data to be translated.
S2, constructing a homonym dictionary.
S3, inputting the voice data into the constructed voice translation model.
S4, obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model.
Optionally, constructing a homonym dictionary in S2 includes:
constructing a homonym dictionary by using the source voice data and the Montreal Forced Aligner.
Optionally, the speech translation model includes a speech encoder, a translation encoder, and a translation decoder.
The construction process of the voice translation model in S3 comprises the following steps:
S31, acquiring a source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using a homonym dictionary to obtain a labeled source voice sequence sample.
S32, compressing the labeled source voice sequence sample by using a voice encoder to obtain a source voice sequence sample in a hidden state.
S33, processing the source voice sequence sample in the hidden state by using a translation encoder to obtain a first voice coding feature.
S34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using a translation encoder to obtain a second voice coding feature.
S35, calculating a character-level contrast learning loss function and a sentence-level contrast learning loss function according to the first voice coding feature and the second voice coding feature.
S36, processing the source voice sequence sample in the hidden state by using a translation decoder to obtain a first probability distribution of the target text.
S37, processing the masked voice sequence sample by using a translation decoder to obtain a second probability distribution of the target text.
S38, calculating a model-level contrast learning loss function and a voice translation loss function according to the first probability distribution and the second probability distribution, further obtaining a total loss function, and training the voice translation model according to the total loss function to obtain the constructed voice translation model.
Optionally, the acquiring a source voice sequence sample in S31, and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled source voice sequence sample, includes:
acquiring a triplet source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled quintuple source voice sequence sample.
The quintuple includes the speech, the transcribed text, the translated text, the ambiguous word, and the position of the ambiguous word in the sentence.
Optionally, masking the source speech sequence sample in the hidden state in S34 to obtain the masked speech sequence sample includes:
S341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state.
S342, aligning the source voice sequence sample in the hidden state with the transcribed text.
S343, generating a homonym-aware masking matrix according to the aligned source voice sequence sample in the hidden state, the transcribed text, and the homonym dictionary.
S344, masking the source voice sequence sample in the hidden state according to the homonym-aware masking matrix to obtain the masked voice sequence sample.
Optionally, the character-level contrast learning loss function $\mathcal{L}_{TCL}$ in S35 is shown in the following formula (1):

$$\mathcal{L}_{TCL} = -\sum_{i=1}^{N} \mathbb{1}(i)\,\log \frac{\exp\left(\mathrm{sim}(\tilde{h}_i, h_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{h}_i, h_j)/\tau\right)} \qquad (1)$$

wherein $\mathbb{1}(\cdot)$ is the indicator function, $N$ is the number of source speech sequence samples, $\tilde{h}_i$ is the masked representation of the context of the source speech sequence sample in the hidden state, $h_i$ is the context representation of the source speech sequence sample in the hidden state, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
Optionally, the sentence-level contrast learning loss function $\mathcal{L}_{SCL}$ in S35 is shown in the following formula (2):

$$\mathcal{L}_{SCL} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\tilde{s}_i, s_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(\tilde{s}_i, s_j)/\tau\right)} \qquad (2)$$

wherein $s_i$ is the sentence-level representation obtained by averaging the context representation of the source speech sequence sample in the hidden state over the time dimension, $\tilde{s}_i$ is the sentence-level representation obtained by averaging the masked representation of the context of the source speech sequence sample in the hidden state over the time dimension, $\tau$ is the temperature hyperparameter, $B$ is the mini-batch size, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
Optionally, the model-level contrast learning loss function $\mathcal{L}_{MCL}$ in S38 is shown in the following formula (3):

$$\mathcal{L}_{MCL} = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( P(y_t \mid H) \,\|\, P(y_t \mid \tilde{H}) \right) \qquad (3)$$

wherein $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $T$ is the total number of tokens of the translated text, $P(y_t \mid H)$ is the predictive probability distribution of the $t$-th target token when the context representation $H$ of the source speech sequence sample in the hidden state is given as the translation encoder input, and $P(y_t \mid \tilde{H})$ is the predictive probability distribution of the $t$-th target token when the masked representation $\tilde{H}$ of the context of the source speech sequence sample in the hidden state is given as the translation encoder input.
Optionally, the total loss function $\mathcal{L}$ in S38 is shown in the following formula (4):

$$\mathcal{L} = \mathcal{L}_{ST} + \lambda_1\,\mathcal{L}_{MCL} + \lambda_2\left(\mathcal{L}_{TCL} + \mathcal{L}_{SCL}\right) \qquad (4)$$

wherein $\lambda_1$ and $\lambda_2$ are coefficient weights, $\mathcal{L}_{ST}$ is the speech translation loss function, $\mathcal{L}_{MCL}$ is the model-level contrast learning loss function, $\mathcal{L}_{TCL}$ is the character-level contrast learning loss function, and $\mathcal{L}_{SCL}$ is the sentence-level contrast learning loss function.
In another aspect, the present invention provides a speech translation apparatus for resolving speech ambiguity, the apparatus being applied to implement a speech translation method for resolving speech ambiguity, the apparatus comprising:
The acquisition module is used for acquiring the voice data to be translated.
The construction module is used for constructing a homonym dictionary.
The input module is used for inputting the voice data into the constructed voice translation model.
The output module is used for obtaining the translated text of the voice data according to the voice data, the homonym dictionary and the voice translation model.
Optionally, the building module is further configured to:
construct a homonym dictionary by using the source voice data and the Montreal Forced Aligner.
Optionally, the speech translation model includes a speech encoder, a translation encoder, and a translation decoder.
An input module, further configured to:
S31, acquiring a source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using a homonym dictionary to obtain a labeled source voice sequence sample.
S32, compressing the labeled source voice sequence sample by using a voice encoder to obtain a source voice sequence sample in a hidden state.
S33, processing the source voice sequence sample in the hidden state by using a translation encoder to obtain a first voice coding feature.
S34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using a translation encoder to obtain a second voice coding feature.
S35, calculating a character-level contrast learning loss function and a sentence-level contrast learning loss function according to the first voice coding feature and the second voice coding feature.
S36, processing the source voice sequence sample in the hidden state by using a translation decoder to obtain a first probability distribution of the target text.
S37, processing the masked voice sequence sample by using a translation decoder to obtain a second probability distribution of the target text.
S38, calculating a model-level contrast learning loss function and a voice translation loss function according to the first probability distribution and the second probability distribution, further obtaining a total loss function, and training the voice translation model according to the total loss function to obtain the constructed voice translation model.
Optionally, the input module is further configured to:
acquire a triplet source voice sequence sample, and label the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled quintuple source voice sequence sample.
The quintuple includes the speech, the transcribed text, the translated text, the ambiguous word, and the position of the ambiguous word in the sentence.
Optionally, the input module is further configured to:
S341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state.
S342, aligning the source voice sequence sample in the hidden state with the transcribed text.
S343, generating a homonym-aware masking matrix according to the aligned source voice sequence sample in the hidden state, the transcribed text, and the homonym dictionary.
S344, masking the source voice sequence sample in the hidden state according to the homonym-aware masking matrix to obtain the masked voice sequence sample.
Optionally, the character-level contrast learning loss function $\mathcal{L}_{TCL}$ is shown in the following formula (1):

$$\mathcal{L}_{TCL} = -\sum_{i=1}^{N} \mathbb{1}(i)\,\log \frac{\exp\left(\mathrm{sim}(\tilde{h}_i, h_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{h}_i, h_j)/\tau\right)} \qquad (1)$$

wherein $\mathbb{1}(\cdot)$ is the indicator function, $N$ is the number of source speech sequence samples, $\tilde{h}_i$ is the masked representation of the context of the source speech sequence sample in the hidden state, $h_i$ is the context representation of the source speech sequence sample in the hidden state, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
Optionally, the sentence-level contrast learning loss function $\mathcal{L}_{SCL}$ is shown in the following formula (2):

$$\mathcal{L}_{SCL} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\tilde{s}_i, s_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(\tilde{s}_i, s_j)/\tau\right)} \qquad (2)$$

wherein $s_i$ is the sentence-level representation obtained by averaging the context representation of the source speech sequence sample in the hidden state over the time dimension, $\tilde{s}_i$ is the sentence-level representation obtained by averaging the masked representation of the context of the source speech sequence sample in the hidden state over the time dimension, $\tau$ is the temperature hyperparameter, $B$ is the mini-batch size, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
Optionally, the model-level contrast learning loss function $\mathcal{L}_{MCL}$ is shown in the following formula (3):

$$\mathcal{L}_{MCL} = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( P(y_t \mid H) \,\|\, P(y_t \mid \tilde{H}) \right) \qquad (3)$$

wherein $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $T$ is the total number of tokens of the translated text, $P(y_t \mid H)$ is the predictive probability distribution of the $t$-th target token when the context representation $H$ of the source speech sequence sample in the hidden state is given as the translation encoder input, and $P(y_t \mid \tilde{H})$ is the predictive probability distribution of the $t$-th target token when the masked representation $\tilde{H}$ of the context of the source speech sequence sample in the hidden state is given as the translation encoder input.
Optionally, the total loss function $\mathcal{L}$ is shown in the following formula (4):

$$\mathcal{L} = \mathcal{L}_{ST} + \lambda_1\,\mathcal{L}_{MCL} + \lambda_2\left(\mathcal{L}_{TCL} + \mathcal{L}_{SCL}\right) \qquad (4)$$

wherein $\lambda_1$ and $\lambda_2$ are coefficient weights, $\mathcal{L}_{ST}$ is the speech translation loss function, $\mathcal{L}_{MCL}$ is the model-level contrast learning loss function, $\mathcal{L}_{TCL}$ is the character-level contrast learning loss function, and $\mathcal{L}_{SCL}$ is the sentence-level contrast learning loss function.
In one aspect, an electronic device is provided, the electronic device including a processor and a memory, the memory storing at least one instruction loaded and executed by the processor to implement the above-described speech translation method for resolving speech ambiguities.
In one aspect, a computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement the above-described speech translation method for resolving speech ambiguities is provided.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
According to the scheme, an efficient voice disambiguation method is constructed, achieving the current best performance (BLEU score) on the MuST-C English-to-German, English-to-French, and English-to-Spanish speech translation tasks.
The method adds no extra parameters to the model; it only preprocesses the data and adopts a contrast learning strategy in the training stage, making it intuitive, easy to understand, simple, and efficient.
The invention can solve the problem of voice ambiguity existing in the prior art, especially the challenge posed by homonyms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic flow diagram of a speech translation method for resolving speech ambiguity according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a homonym dictionary construction method according to an embodiment of the present invention;
FIG. 3 is a diagram of an overall architecture of a model provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a voice masking step provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of character level contrast learning provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of sentence level contrast learning provided by an embodiment of the present invention;
FIG. 7 is a block diagram of a speech translation apparatus for resolving speech ambiguities provided by an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by a person skilled in the art without creative effort, based on the described embodiments of the present invention, fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a speech translation method for resolving speech ambiguity, which may be implemented by an electronic device. The processing flow of the method, shown in the flowchart of fig. 1, may include the following steps:
s1, acquiring voice data to be translated.
S2, constructing a homonym dictionary by using the source voice data and the Montreal Forced Aligner.
In a possible embodiment, the speech translation dataset comprises (speech, transcribed text, translated text) triples. On this basis, the invention creates a homonym dictionary and labels a dataset containing homonym information.
Specifically, FIG. 2 illustrates the process of constructing the homonym dictionary: the original dataset is input into an acoustic model to obtain a set of pronunciation-labeled words, from which the homonym dictionary is constructed. The invention uses the Montreal Forced Aligner to obtain the shared phoneme transcriptions for building the homonym dictionary. The homonym dictionary consists of sets of words that share the same pronunciation. For example, the phoneme sequence "HH UH D" maps to {hood}, and the phoneme sequence "HH AE D" maps to {had, head}.
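The following is a minimal illustrative sketch of this dictionary-construction step, assuming the pronunciations are available as (word, phoneme string) pairs taken from a forced-alignment pronunciation lexicon; the function and variable names are hypothetical and the toy lexicon follows the example above:

```python
from collections import defaultdict

def build_homophone_dict(pronunciations):
    """Group words that share the same phoneme sequence into homophone sets.

    pronunciations: iterable of (word, phoneme_string) pairs, e.g. taken from
    the pronunciation lexicon used by the Montreal Forced Aligner.
    Returns a dict mapping a phoneme string to the set of words that share it,
    keeping only entries with at least two words (true homophones).
    """
    by_phonemes = defaultdict(set)
    for word, phonemes in pronunciations:
        by_phonemes[phonemes].add(word.lower())
    return {p: words for p, words in by_phonemes.items() if len(words) > 1}

# Toy lexicon entries following the example in the text.
lexicon = [("hood", "HH UH D"), ("had", "HH AE D"), ("head", "HH AE D")]
homophone_dict = build_homophone_dict(lexicon)
# -> {'HH AE D': {'had', 'head'}}
```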
S3, inputting the voice data into the constructed voice translation model.
The speech translation model comprises a speech encoder, a translation encoder, and a translation decoder.
The construction process of the speech translation model in S3 may include the following steps S31-S38:
S31, acquiring a triplet source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled quintuple source voice sequence sample.
The quintuple includes the speech, the transcribed text, the translated text, the ambiguous word, and the position of the ambiguous word in the sentence.
In one possible embodiment, the annotated dataset consists of quintuples containing the ambiguous words and their positions in the sentence. For each piece of data, the invention locates the homonyms using the transcribed text and the homonym dictionary and saves their positions in the sentence. A labeling example is shown in Table 1 below:
TABLE 1
After the data are labeled, the invention performs model training based on this dataset.
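A minimal sketch of this labeling step is shown below, assuming whitespace-tokenized transcripts and one quintuple per ambiguous occurrence; the toy dictionary, file name, and example sentences are hypothetical:

```python
def annotate_sample(speech_path, transcript, translation, homophone_words):
    """Expand a (speech, transcript, translation) triple into quintuples:
    (speech, transcript, translation, ambiguous word, position in sentence)."""
    quintuples = []
    for position, word in enumerate(transcript.lower().split()):
        if word in homophone_words:
            quintuples.append((speech_path, transcript, translation, word, position))
    return quintuples

# All words that appear in some homophone set of the dictionary.
homophone_dict = {"HH AE D": {"had", "head"}}      # toy dictionary for illustration
homophone_words = {w for words in homophone_dict.values() for w in words}

samples = annotate_sample("audio_0001.wav",        # hypothetical file name
                          "she had a red hood",
                          "sie hatte eine rote Kapuze",
                          homophone_words)
# -> [('audio_0001.wav', 'she had a red hood', 'sie hatte eine rote Kapuze', 'had', 1)]
```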
S32, compressing the marked source voice sequence sample by utilizing a voice encoder to obtain the source voice sequence sample in a hidden state.
In one possible embodiment, FIG. 3 illustrates the overall model architecture of the invention. The invention restructures the base model into three distinct components: a speech encoder, a translation encoder, and a translation decoder. The speech encoder first compresses the speech representation into hidden states. These hidden states then serve as inputs to the translation encoder, which produces rich semantic information extracted from the reduced speech data. The translation decoder generates the result from the output of the translation encoder. In addition, the model of the invention integrates pre-training parameters from a unified speech-text pre-training method, thereby enhancing its effectiveness on the speech translation task.
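A schematic PyTorch-style sketch of this three-component pipeline is given below; the component modules are placeholders (in practice they would be initialized from the unified speech-text pre-training parameters), and the class and argument names are illustrative assumptions only:

```python
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    """Speech encoder -> translation encoder -> translation decoder pipeline."""

    def __init__(self, speech_encoder, translation_encoder, translation_decoder):
        super().__init__()
        self.speech_encoder = speech_encoder            # compresses speech into hidden states
        self.translation_encoder = translation_encoder  # extracts semantic features
        self.translation_decoder = translation_decoder  # predicts target-text distributions

    def forward(self, speech_features, target_tokens, mask=None):
        hidden = self.speech_encoder(speech_features)   # hidden-state source sequence H
        if mask is not None:                            # optional homonym-aware masking
            hidden = hidden * mask.unsqueeze(-1)
        encoded = self.translation_encoder(hidden)      # first / second speech coding feature
        return self.translation_decoder(encoded, target_tokens)  # target probability distribution
```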
S33, processing the source voice sequence sample in the hidden state by using a translation encoder to obtain a first voice coding feature.
S34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using a translation encoder to obtain a second voice coding feature.
Optionally, masking the source speech sequence samples in the hidden state in the step S34 to obtain masked speech sequence samples may include the following steps S341 to S344:
s341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state.
In one possible implementation, to alleviate the problem of phonetic ambiguity, the invention introduces a novel homonym-aware masking strategy using the constructed annotated dataset. The flow is shown in FIG. 4: given an audio input, the speech encoder receives the original sequence $X$ as input and generates its context representation $H$.
S342, aligning the source voice sequence sample in the hidden state and the transcribed text.
In one possible embodiment, the invention employs a word-level forced alignment technique to align the speech with its transcription and determine when each word occurs in the speech segment.
S343, generating a homonym-aware masking matrix according to the aligned source voice sequence sample in the hidden state, the transcribed text, and the homonym dictionary.
S344, masking the source voice sequence sample in the hidden state according to the homonym-aware masking matrix to obtain a masked voice sequence sample.
In one possible implementation, a homonym-aware masking matrix for the speech representation, denoted as $M$, is generated according to the homonym dictionary to determine the exact locations of the homonym segments in the speech representation. With a masking probability $p_{\text{mask}}$, the masking matrix is calculated as follows:

$$M_i = \begin{cases} 0, & i \in \mathcal{A} \ \text{and}\ p < p_{\text{mask}} \\ 1, & \text{otherwise} \end{cases}, \qquad p \sim U(0,1) \qquad (1)$$

wherein $p$ is sampled from the uniform distribution $U(0,1)$ and $\mathcal{A}$ is the index set of the homonym representations in the speech sequence.
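A small sketch of the masking step in formula (1) is shown below, assuming a single uniform draw per utterance and a hypothetical masking probability p_mask; the tensor sizes are illustrative only:

```python
import torch

def homophone_aware_mask(seq_len, homophone_indices, p_mask=0.5):
    """Masking vector M of formula (1): with probability p_mask, the positions
    belonging to homonym segments (index set A) are zeroed out; all other
    positions are always kept."""
    mask = torch.ones(seq_len)
    p = torch.rand(()).item()                # p ~ U(0, 1)
    if p < p_mask:
        for i in homophone_indices:
            mask[i] = 0.0
    return mask

hidden = torch.randn(50, 256)                # toy hidden states H: (time, dim)
mask = homophone_aware_mask(50, {12, 13, 14})
masked_hidden = hidden * mask.unsqueeze(-1)  # masked counterpart of H
```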
S35, calculating a character-level contrast learning loss function and a sentence-level contrast learning loss function according to the first voice coding feature and the second voice coding feature.
In a possible implementation, the semantics of the individual homonym units are critical to effectively resolving phonetic ambiguity; character-level contrast learning is shown in fig. 5. The invention aims to advance contrast learning by exploiting homonym information, focusing in particular on fine granularity, and therefore proposes a character-level contrast learning method. The invention uses the same model to generate the output of the speech encoder twice. In one pass, the homonym-aware masking strategy is applied to generate a masked representation, denoted as $\tilde{H}$.
The proposed token-level (character-level) contrast learning objective is defined as shown in the following formula (2):

$$\mathcal{L}_{TCL} = -\sum_{i=1}^{N} \mathbb{1}(i)\,\log \frac{\exp\left(\mathrm{sim}(\tilde{h}_i, h_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{h}_i, h_j)/\tau\right)} \qquad (2)$$

wherein $\mathbb{1}(i)$ is the indicator function, equal to 1 if position $i$ is a masked homonym and 0 otherwise, $N$ is the number of source speech sequence samples, $\tilde{h}_i$ is the masked representation of the context of the source speech sequence sample in the hidden state, $h_i$ is the context representation of the source speech sequence sample in the hidden state, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
The basic idea of this approach is to encourage the masked tokens generated by the model to align closely with the corresponding homonym tokens while remaining distinguishable from the other tokens in the sequence. In this way, the invention enhances the model's ability to understand homonym-related features.
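A sketch of the token-level objective in formula (2) is given below, assuming unbatched (T, D) representations and an illustrative temperature value; it pulls each masked homonym state towards its original counterpart while pushing it away from the other positions of the same utterance:

```python
import torch
import torch.nn.functional as F

def token_level_contrastive_loss(h, h_masked, homophone_positions, tau=0.1):
    """h, h_masked: (T, D) context representations of the original and masked
    hidden-state sequences; homophone_positions: indices whose indicator is 1."""
    h = F.normalize(h, dim=-1)
    h_masked = F.normalize(h_masked, dim=-1)
    logits = h_masked @ h.t() / tau                      # cosine similarities / temperature
    targets = torch.arange(h.size(0))                    # positive pair = same position
    per_position = F.cross_entropy(logits, targets, reduction="none")
    indicator = torch.zeros(h.size(0))
    indicator[list(homophone_positions)] = 1.0           # only masked homonym positions count
    return (indicator * per_position).sum()

loss = token_level_contrastive_loss(torch.randn(50, 256), torch.randn(50, 256), {12, 13, 14})
```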
Further, sentence-level contrast learning is shown in fig. 6. To further enhance the effect of contrast learning and obtain the best sentence-level representation, the invention introduces a self-supervised method focusing on sentence-level contrast learning. The invention averages $H$ and $\tilde{H}$ over the time dimension, thus obtaining sentence-level representations of the original and homonym-masked forms, denoted $s$ and $\tilde{s}$ respectively. For a mini-batch of size $B$, the sentence-level contrast learning objective is shown in the following formula (3):
$$\mathcal{L}_{SCL} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\tilde{s}_i, s_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(\tilde{s}_i, s_j)/\tau\right)} \qquad (3)$$

wherein $s_i$ is the sentence-level representation obtained by averaging the context representation of the source speech sequence sample in the hidden state over the time dimension, $\tilde{s}_i$ is the sentence-level representation obtained by averaging the masked representation of the context of the source speech sequence sample in the hidden state over the time dimension, $\tau$ is the temperature hyperparameter, $B$ is the mini-batch size, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
By employing this goal, the model is enabled to fully take into account the broader context and interrelationships inherent in sentences, thereby capturing complex semantic nuances.
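A corresponding sketch of the sentence-level objective in formula (3) is given below, assuming batched (B, T, D) representations; sentence vectors are obtained by averaging over time and each masked sentence vector is contrasted against all original sentence vectors in the mini-batch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def sentence_level_contrastive_loss(H, H_masked, tau=0.1):
    """H, H_masked: (B, T, D) original and masked context representations."""
    s = F.normalize(H.mean(dim=1), dim=-1)               # (B, D) original sentence vectors
    s_masked = F.normalize(H_masked.mean(dim=1), dim=-1)
    logits = s_masked @ s.t() / tau                       # (B, B) cosine similarities
    targets = torch.arange(H.size(0))                     # positive pair = same sentence
    return F.cross_entropy(logits, targets, reduction="sum")

loss = sentence_level_contrastive_loss(torch.randn(8, 50, 256), torch.randn(8, 50, 256))
```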
S36, processing the source voice sequence sample in the hidden state by using a translation decoder to obtain a first probability distribution of the target text.
S37, processing the masked speech sequence samples by using a translation decoder to obtain a second probability distribution of the target text.
S38, according to the first probability distribution and the second probability distribution, a model level comparison learning loss function and a voice translation loss function are obtained through calculation, a total loss function is further obtained, a voice translation model is trained according to the total loss function, and a built voice translation model is obtained.
In one possible embodiment, to ensure consistent guidance for extracting context-aware representations from speech, the invention proposes a fine-grained model-level contrast learning framework. This framework is specifically designed to address the challenges posed by homonyms and to identify information located near ambiguous tokens. It exploits the inherent knowledge of a single network, i.e., self knowledge distillation, through predictions on diverse samples. This strategy can be regarded as a special variant of contrast learning characterized by the presence of only positive examples.
Specifically, after the original and masked context representations are obtained from the speech encoder, the model-level contrast learning objective is defined as shown in the following formula (4):

$$\mathcal{L}_{MCL} = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( P(y_t \mid H) \,\|\, P(y_t \mid \tilde{H}) \right) \qquad (4)$$

wherein $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $T$ is the total number of tokens of the translated text, $P(y_t \mid H)$ is the predictive probability distribution of the $t$-th target token when the context representation $H$ of the source speech sequence sample in the hidden state is given as the translation encoder input, and $P(y_t \mid \tilde{H})$ is the predictive probability distribution of the $t$-th target token when the masked representation $\tilde{H}$ of the context of the source speech sequence sample in the hidden state is given as the translation encoder input.
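A sketch of the model-level objective in formula (4) is given below; the direction of the divergence (with the distribution from the original representation as the reference) is an assumption, and the decoder logits are hypothetical:

```python
import torch
import torch.nn.functional as F

def model_level_contrastive_loss(logits_original, logits_masked):
    """logits_*: (T_target, V) decoder logits over the target vocabulary obtained
    from the original (H) and masked (H-tilde) context representations."""
    p_original = F.softmax(logits_original, dim=-1)
    log_p_masked = F.log_softmax(logits_masked, dim=-1)
    # Sum over target positions of KL(P(. | H) || P(. | H-tilde)).
    return F.kl_div(log_p_masked, p_original, reduction="sum")

loss = model_level_contrastive_loss(torch.randn(20, 1000), torch.randn(20, 1000))
```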
Further, combining the speech translation loss $\mathcal{L}_{ST}$ with all of the proposed methods, the final training objective $\mathcal{L}$ can be expressed as follows:

$$\mathcal{L} = \mathcal{L}_{ST} + \lambda_1\,\mathcal{L}_{MCL} + \lambda_2\left(\mathcal{L}_{TCL} + \mathcal{L}_{SCL}\right) \qquad (5)$$

wherein $\lambda_1$ and $\lambda_2$ are coefficient weights controlling the contribution of each term; the invention uses them to ensure that the model maintains a balanced distribution of attention among the different tasks. $\mathcal{L}_{ST}$ is the speech translation loss function, $\mathcal{L}_{MCL}$ is the model-level contrast learning loss function, $\mathcal{L}_{TCL}$ is the character-level contrast learning loss function, and $\mathcal{L}_{SCL}$ is the sentence-level contrast learning loss function.
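A one-line sketch of the training objective in formula (5) is given below; how the two weights are distributed over the auxiliary terms is an assumption, and the default weight values are illustrative:

```python
def total_loss(l_st, l_mcl, l_tcl, l_scl, lambda1=1.0, lambda2=1.0):
    """Combine the speech translation loss with the three contrast learning objectives."""
    return l_st + lambda1 * l_mcl + lambda2 * (l_tcl + l_scl)
```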
S4, obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model.
Table 2 below shows the test results on the tst-COMMON sets of the MuST-C multilingual speech translation corpus for English-to-German, English-to-French, and English-to-Spanish. Models 1 through 4 represent existing baselines for this translation task. Model 5 represents the implementation of the present invention on SpeechUT, which previously achieved state-of-the-art performance.
TABLE 2
The invention first decomposes the method into its constituent sub-modules and evaluates their respective contributions. The components are as follows: Model 6 introduces model-level contrast learning on the basis of Model 5. Model 7 incorporates the sentence-level contrast learning method into Model 6. Model 8 applies character-level contrast learning, rather than the sentence-level contrast learning method, on the basis of Model 6. The evaluation results show that applying each sub-module individually further improves the BLEU (Bilingual Evaluation Understudy) and BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) scores, as well as the accuracy of homonym translation.
In the embodiment of the invention, an efficient voice disambiguation method is constructed, achieving the current best performance (BLEU score) on the MuST-C English-to-German, English-to-French, and English-to-Spanish speech translation tasks.
The method adds no extra parameters to the model; it only preprocesses the data and adopts a contrast learning strategy in the training stage, making it intuitive, easy to understand, simple, and efficient.
The invention can solve the problem of voice ambiguity existing in the prior art, especially the challenge posed by homonyms.
As shown in fig. 7, an embodiment of the present invention provides a speech translation apparatus 700 for resolving speech ambiguity, the apparatus 700 being applied to implement a speech translation method for resolving speech ambiguity, the apparatus 700 comprising:
the obtaining module 710 is configured to obtain voice data to be translated.
A construction module 720, configured to construct a homonym dictionary.
An input module 730 for inputting the speech data into the constructed speech translation model.
An output module 740, configured to obtain a translated text of the speech data according to the speech data, the homonym dictionary, and the speech translation model.
Optionally, the construction module 720 is further configured to:
construct a homonym dictionary by using the source voice data and the Montreal Forced Aligner.
Optionally, the speech translation model includes a speech encoder, a translation encoder, and a translation decoder.
The input module 730 is further configured to:
S31, acquiring a source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using a homonym dictionary to obtain a labeled source voice sequence sample.
S32, compressing the labeled source voice sequence sample by using a voice encoder to obtain a source voice sequence sample in a hidden state.
S33, processing the source voice sequence sample in the hidden state by using a translation encoder to obtain a first voice coding feature.
S34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using a translation encoder to obtain a second voice coding feature.
S35, calculating a character-level contrast learning loss function and a sentence-level contrast learning loss function according to the first voice coding feature and the second voice coding feature.
S36, processing the source voice sequence sample in the hidden state by using a translation decoder to obtain a first probability distribution of the target text.
S37, processing the masked voice sequence sample by using a translation decoder to obtain a second probability distribution of the target text.
S38, calculating a model-level contrast learning loss function and a voice translation loss function according to the first probability distribution and the second probability distribution, further obtaining a total loss function, and training the voice translation model according to the total loss function to obtain the constructed voice translation model.
Optionally, the input module 730 is further configured to:
acquire a triplet source voice sequence sample, and label the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled quintuple source voice sequence sample.
The quintuple includes the speech, the transcribed text, the translated text, the ambiguous word, and the position of the ambiguous word in the sentence.
Optionally, the input module 730 is further configured to:
S341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state.
S342, aligning the source voice sequence sample in the hidden state with the transcribed text.
S343, generating a homonym-aware masking matrix according to the aligned source voice sequence sample in the hidden state, the transcribed text, and the homonym dictionary.
S344, masking the source voice sequence sample in the hidden state according to the homonym-aware masking matrix to obtain the masked voice sequence sample.
Optionally, the character-level contrast learning loss function $\mathcal{L}_{TCL}$ is shown in the following formula (1):

$$\mathcal{L}_{TCL} = -\sum_{i=1}^{N} \mathbb{1}(i)\,\log \frac{\exp\left(\mathrm{sim}(\tilde{h}_i, h_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{h}_i, h_j)/\tau\right)} \qquad (1)$$

wherein $\mathbb{1}(\cdot)$ is the indicator function, $N$ is the number of source speech sequence samples, $\tilde{h}_i$ is the masked representation of the context of the source speech sequence sample in the hidden state, $h_i$ is the context representation of the source speech sequence sample in the hidden state, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
Optionally, the sentence-level contrast learning loss function $\mathcal{L}_{SCL}$ is shown in the following formula (2):

$$\mathcal{L}_{SCL} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\tilde{s}_i, s_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(\tilde{s}_i, s_j)/\tau\right)} \qquad (2)$$

wherein $s_i$ is the sentence-level representation obtained by averaging the context representation of the source speech sequence sample in the hidden state over the time dimension, $\tilde{s}_i$ is the sentence-level representation obtained by averaging the masked representation of the context of the source speech sequence sample in the hidden state over the time dimension, $\tau$ is the temperature hyperparameter, $B$ is the mini-batch size, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
Optionally, the model-level contrast learning loss function $\mathcal{L}_{MCL}$ is shown in the following formula (3):

$$\mathcal{L}_{MCL} = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( P(y_t \mid H) \,\|\, P(y_t \mid \tilde{H}) \right) \qquad (3)$$

wherein $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $T$ is the total number of tokens of the translated text, $P(y_t \mid H)$ is the predictive probability distribution of the $t$-th target token when the context representation $H$ of the source speech sequence sample in the hidden state is given as the translation encoder input, and $P(y_t \mid \tilde{H})$ is the predictive probability distribution of the $t$-th target token when the masked representation $\tilde{H}$ of the context of the source speech sequence sample in the hidden state is given as the translation encoder input.
Optionally, the total loss function $\mathcal{L}$ is shown in the following formula (4):

$$\mathcal{L} = \mathcal{L}_{ST} + \lambda_1\,\mathcal{L}_{MCL} + \lambda_2\left(\mathcal{L}_{TCL} + \mathcal{L}_{SCL}\right) \qquad (4)$$

wherein $\lambda_1$ and $\lambda_2$ are coefficient weights, $\mathcal{L}_{ST}$ is the speech translation loss function, $\mathcal{L}_{MCL}$ is the model-level contrast learning loss function, $\mathcal{L}_{TCL}$ is the character-level contrast learning loss function, and $\mathcal{L}_{SCL}$ is the sentence-level contrast learning loss function.
In the embodiment of the invention, an efficient voice disambiguation method is constructed, achieving the current best performance (BLEU score) on the MuST-C English-to-German, English-to-French, and English-to-Spanish speech translation tasks.
The method adds no extra parameters to the model; it only preprocesses the data and adopts a contrast learning strategy in the training stage, making it intuitive, easy to understand, simple, and efficient.
The invention can solve the problem of voice ambiguity existing in the prior art, especially the challenge posed by homonyms.
Fig. 8 is a schematic structural diagram of an electronic device 800 according to an embodiment of the present invention. The electronic device 800 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 801 and one or more memories 802, where the memory 802 stores at least one instruction that is loaded and executed by the processor 801 to implement the following speech translation method for resolving speech ambiguity:
s1, acquiring voice data to be translated.
S2, constructing a homonym dictionary.
S3, inputting the voice data into the constructed voice translation model.
S4, obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor in a terminal to perform the above speech translation method for resolving speech ambiguity. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description covers only preferred embodiments of the invention and is not intended to limit the invention to the precise form disclosed; any modifications, equivalent substitutions, and improvements made within the spirit and scope of the invention are intended to be included within the protection scope of the invention.
Claims (9)
1. A speech translation method for resolving speech ambiguities, the method comprising:
s1, acquiring voice data to be translated;
s2, constructing a homonym dictionary;
s3, inputting the voice data into the constructed voice translation model;
s4, obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model;
the speech translation model comprises a speech encoder, a translation encoder and a translation decoder;
the construction process of the speech translation model in the S3 comprises the following steps:
s31, acquiring a source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled source voice sequence sample;
s32, compressing the labeled source voice sequence sample by using the voice encoder to obtain a source voice sequence sample in a hidden state;
s33, processing the source voice sequence sample in the hidden state by using the translation encoder to obtain a first voice coding feature;
s34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using the translation encoder to obtain a second voice coding feature;
s35, calculating a character-level contrast learning loss function and a sentence-level contrast learning loss function according to the first voice coding feature and the second voice coding feature;
s36, processing the source voice sequence sample in the hidden state by using the translation decoder to obtain a first probability distribution of the target text;
s37, processing the masked voice sequence sample by using the translation decoder to obtain a second probability distribution of the target text;
s38, calculating a model-level contrast learning loss function and a voice translation loss function according to the first probability distribution and the second probability distribution, further obtaining a total loss function, and training the voice translation model according to the total loss function to obtain the constructed voice translation model.
2. The method of claim 1, wherein constructing the homonym dictionary in S2 comprises:
constructing a homonym dictionary by using the source voice data and the Montreal Forced Aligner.
3. The method according to claim 1, wherein the acquiring a source voice sequence sample in S31 and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled source voice sequence sample comprises:
acquiring a triplet source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled quintuple source voice sequence sample;
wherein the quintuple includes the speech, the transcribed text, the translated text, the ambiguous word, and the position of the ambiguous word in the sentence.
4. The method according to claim 1, wherein masking the source speech sequence sample in the hidden state in S34 to obtain the masked speech sequence sample comprises:
s341, generating a transcribed text of the source voice sequence sample in the hidden state according to the voice encoder and the source voice sequence sample in the hidden state;
s342, aligning the source voice sequence sample in the hidden state with the transcribed text;
s343, generating a homonym-aware masking matrix according to the aligned source voice sequence sample in the hidden state, the transcribed text, and the homonym dictionary;
s344, masking the source voice sequence sample in the hidden state according to the homonym-aware masking matrix to obtain the masked voice sequence sample.
5. The method of claim 1, wherein the character-level contrast learning loss function $\mathcal{L}_{TCL}$ in S35 is shown in the following formula (1):

$$\mathcal{L}_{TCL} = -\sum_{i=1}^{N} \mathbb{1}(i)\,\log \frac{\exp\left(\mathrm{sim}(\tilde{h}_i, h_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{h}_i, h_j)/\tau\right)} \qquad (1)$$

wherein $\mathbb{1}(\cdot)$ is the indicator function, $N$ is the number of source speech sequence samples, $\tilde{h}_i$ is the masked representation of the context of the source speech sequence sample in the hidden state, $h_i$ is the context representation of the source speech sequence sample in the hidden state, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
6. The method of claim 1, wherein the sentence-level contrast learning loss function $\mathcal{L}_{SCL}$ in S35 is shown in the following formula (2):

$$\mathcal{L}_{SCL} = -\sum_{i=1}^{B} \log \frac{\exp\left(\mathrm{sim}(\tilde{s}_i, s_i)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(\tilde{s}_i, s_j)/\tau\right)} \qquad (2)$$

wherein $s_i$ is the sentence-level representation obtained by averaging the context representation of the source speech sequence sample in the hidden state over the time dimension, $\tilde{s}_i$ is the sentence-level representation obtained by averaging the masked representation of the context of the source speech sequence sample in the hidden state over the time dimension, $\tau$ is the temperature hyperparameter, $B$ is the mini-batch size, and $\mathrm{sim}(\cdot,\cdot)$ computes cosine similarity.
7. The method of claim 1, wherein the model-level contrast learning loss function $\mathcal{L}_{MCL}$ in S38 is shown in the following formula (3):

$$\mathcal{L}_{MCL} = \sum_{t=1}^{T} D_{\mathrm{KL}}\left( P(y_t \mid H) \,\|\, P(y_t \mid \tilde{H}) \right) \qquad (3)$$

wherein $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $T$ is the total number of tokens of the translated text, $P(y_t \mid H)$ is the predictive probability distribution of the $t$-th target token when the context representation $H$ of the source speech sequence sample in the hidden state is given as the translation encoder input, and $P(y_t \mid \tilde{H})$ is the predictive probability distribution of the $t$-th target token when the masked representation $\tilde{H}$ of the context of the source speech sequence sample in the hidden state is given as the translation encoder input.
8. The method according to claim 1, wherein the total loss function $\mathcal{L}$ in S38 is shown in the following formula (4):

$$\mathcal{L} = \mathcal{L}_{ST} + \lambda_1\,\mathcal{L}_{MCL} + \lambda_2\left(\mathcal{L}_{TCL} + \mathcal{L}_{SCL}\right) \qquad (4)$$

wherein $\lambda_1$ and $\lambda_2$ are coefficient weights, $\mathcal{L}_{ST}$ is the speech translation loss function, $\mathcal{L}_{MCL}$ is the model-level contrast learning loss function, $\mathcal{L}_{TCL}$ is the character-level contrast learning loss function, and $\mathcal{L}_{SCL}$ is the sentence-level contrast learning loss function.
9. A speech translation apparatus for resolving speech ambiguities, the apparatus comprising:
the acquisition module is used for acquiring voice data to be translated;
the construction module is used for constructing a homonym dictionary;
the input module is used for inputting the voice data into the constructed voice translation model;
the output module is used for obtaining a translation text of the voice data according to the voice data, the homonym dictionary and the voice translation model;
the speech translation model comprises a speech encoder, a translation encoder and a translation decoder;
the construction process of the voice translation model comprises the following steps:
s31, acquiring a source voice sequence sample, and labeling the ambiguous voice segments in the source voice sequence sample by using the homonym dictionary to obtain a labeled source voice sequence sample;
s32, compressing the labeled source voice sequence sample by using the voice encoder to obtain a source voice sequence sample in a hidden state;
s33, processing the source voice sequence sample in the hidden state by using the translation encoder to obtain a first voice coding feature;
s34, masking the source voice sequence sample in the hidden state to obtain a masked voice sequence sample, and processing the masked voice sequence sample by using the translation encoder to obtain a second voice coding feature;
s35, calculating a character-level contrast learning loss function and a sentence-level contrast learning loss function according to the first voice coding feature and the second voice coding feature;
s36, processing the source voice sequence sample in the hidden state by using the translation decoder to obtain a first probability distribution of the target text;
s37, processing the masked voice sequence sample by using the translation decoder to obtain a second probability distribution of the target text;
s38, calculating a model-level contrast learning loss function and a voice translation loss function according to the first probability distribution and the second probability distribution, further obtaining a total loss function, and training the voice translation model according to the total loss function to obtain the constructed voice translation model.
Priority Applications (1)
- CN202311326597.4A / CN117094329B (priority date 2023-10-13, filing date 2023-10-13): Voice translation method and device for solving voice ambiguity
Publications (2)
- CN117094329A: 2023-11-21
- CN117094329B: 2024-02-02
Family ID: 88771790
Family Applications (1)
- CN202311326597.4A / CN117094329B (priority date 2023-10-13, filing date 2023-10-13): granted, status Active
Country Status (1)
- CN: CN117094329B
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
US5526259A (en) * | 1990-01-30 | 1996-06-11 | Hitachi, Ltd. | Method and apparatus for inputting text |
CN113053362A (en) * | 2021-03-30 | 2021-06-29 | 建信金融科技有限责任公司 | Method, device, equipment and computer readable medium for speech recognition |
CN113450760A (en) * | 2021-06-07 | 2021-09-28 | 北京一起教育科技有限责任公司 | Method and device for converting text into voice and electronic equipment |
CN113569562A (en) * | 2021-07-02 | 2021-10-29 | 中译语通科技股份有限公司 | Method and system for reducing cross-modal and cross-language barrier of end-to-end voice translation |
WO2022057637A1 (en) * | 2020-09-18 | 2022-03-24 | 北京字节跳动网络技术有限公司 | Speech translation method and apparatus, and device, and storage medium |
CN115954001A (en) * | 2023-01-30 | 2023-04-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Speech recognition method and model training method |
CN116129902A (en) * | 2022-12-27 | 2023-05-16 | 中科凡语(武汉)科技有限公司 | Cross-modal alignment-based voice translation method and system |
CN116206616A (en) * | 2022-12-30 | 2023-06-02 | 沈阳雅译网络技术有限公司 | Speech translation and speech recognition method based on sequence dynamic compression |
CN116227504A (en) * | 2023-02-08 | 2023-06-06 | 广州数字未来文化科技有限公司 | Communication method, system, equipment and storage medium for simultaneous translation |
CN116663577A (en) * | 2023-06-02 | 2023-08-29 | 昆明理工大学 | Cross-modal characterization alignment-based english end-to-end speech translation method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
US7672833B2 (en) * | 2005-09-22 | 2010-03-02 | Fair Isaac Corporation | Method and apparatus for automatic entity disambiguation |
Non-Patent Citations (3)
Title
- Bo Hou et al., "An interactive mouthguard based on mechanoluminescence-powered optical fibre sensors for bite-controlled device operation," Nature Electronics, pp. 682-693.
- Mi Chenggang et al., "Research on out-of-vocabulary word recognition in Uyghur-Chinese machine translation," Application Research of Computers, pp. 1112-1115.
- Zhang Wei et al., "Implementation and application of language and translation systems for the speech processing field," Computer and Modernization, no. 4, pp. 13-20.
Also Published As
Publication number | Publication date |
---|---|
CN117094329A (en) | 2023-11-21 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant