CN116504223A - Speech translation method and device, electronic equipment and storage medium - Google Patents

Speech translation method and device, electronic equipment and storage medium

Info

Publication number
CN116504223A
CN116504223A CN202310522097.1A
Authority
CN
China
Prior art keywords
information
voice
acoustic feature
speech
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310522097.1A
Other languages
Chinese (zh)
Inventor
章峻珲
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202310522097.1A priority Critical patent/CN116504223A/en
Publication of CN116504223A publication Critical patent/CN116504223A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822 - Parsing for meaning understanding
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/16 - Vocoder architecture

Abstract

The present disclosure provides a speech translation method and apparatus, an electronic device, and a storage medium. The speech translation method includes the following steps: receiving a first speech, where the first speech is source language speech to be translated; extracting first acoustic feature information from the first speech, where the first acoustic feature information is used to represent the semantics of the source language speech; converting the first acoustic feature information into spectral information of a target language, and encoding the spectral information to generate a second speech, the second speech being the target language speech obtained by translating the source language speech. The speech translation scheme provided by the disclosure is applicable to more application scenarios, optimizes the speech translation function, and significantly improves the user experience.

Description

Speech translation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical field of speech processing, and in particular to a speech translation method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of speech processing technology, the accuracy and intelligence of the speech translation function have improved markedly. In the related art, the source language text corresponding to the source language speech to be translated is usually recognized first, the source language text is then translated into the corresponding target language text, and finally the target language speech corresponding to the target language text is generated from that text; the target language differs from the source language, for example the source language is Chinese and the target language is English, or the source language is Mandarin and the target language is a dialect.
The above-mentioned related art can be applied in many scenarios, but for a language that lacks a standardized written form, or even has no written form at all, for example a dialect without a standardized written form, the corresponding language text cannot be obtained, so the related art has difficulty realizing a speech translation function; this problem needs to be solved.
Disclosure of Invention
In order to solve the problem in the related art that the speech translation function is difficult to realize for languages that lack a standardized written form or have no written form at all, the present disclosure provides a speech translation method and apparatus, an electronic device, and a storage medium, so as to solve at least one problem of the related art.
To achieve the above technical object, the present disclosure provides a speech translation method, including: receiving a first speech, where the first speech is the source language speech to be translated; extracting first acoustic feature information from the first speech, where the first acoustic feature information is used to represent the semantics of the source language speech; converting the first acoustic feature information into spectral information of a target language; and encoding the spectral information to generate a second speech, the second speech being the target language speech obtained by translating the source language speech.
To achieve the above technical object, the present disclosure also provides a speech translation apparatus, including: a speech receiving module configured to receive a first speech, the first speech being the source language speech to be translated; a feature extraction module configured to extract first acoustic feature information from the first speech, the first acoustic feature information being used to represent the semantics of the source language speech; an information conversion module configured to convert the first acoustic feature information into spectral information of a target language; and a speech encoding module configured to encode the spectral information to generate a second speech, the second speech being the target language speech obtained by translating the source language speech.
To achieve the above technical purpose, the present disclosure may further provide an electronic device, including a memory and a processor, where the memory stores computer readable instructions that, when executed by the processor, cause the processor to execute the speech translation method according to any one of the embodiments of the present disclosure.
To achieve the above technical object, the present disclosure may further provide a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the speech translation method according to any one of the embodiments of the present disclosure.
The beneficial effects of the present disclosure include the following. Compared with the related art, the spectral information of the corresponding target language is determined based on the first acoustic feature information extracted from the source language speech, and the spectral information is then encoded into the target language speech. The entire speech translation process therefore requires no text information: even if the source language and/or the target language has no standardized written form, or no written form at all, the speech translation function can still be realized well. The speech translation scheme provided by the disclosure is particularly suitable for translating dialects that lack a standardized written form, so it applies to more application scenarios, optimizes the speech translation function, and significantly improves the user experience. In addition, the present disclosure can also significantly reduce the cost of implementing the speech translation function.
Drawings
FIG. 1 illustrates a flow diagram of a speech translation method in one or more embodiments of the present disclosure.
Fig. 2 illustrates a flow diagram of converting first acoustic feature information into spectral information in a target language in one or more embodiments of the present disclosure.
Fig. 3 illustrates a flow diagram for translating first acoustic feature information in one or more embodiments of the present disclosure.
Fig. 4 shows a flow diagram of converting second acoustic feature information into spectral information in a target language in one or more embodiments of the present disclosure.
FIG. 5 illustrates a flow diagram for generating spectral information in a target language using second semantic information in one or more embodiments of the present disclosure.
Fig. 6 illustrates a flow diagram for extracting first acoustic feature information from a first voice in one or more embodiments of the present disclosure.
FIG. 7 illustrates a schematic diagram of implementing a speech translation scheme of the present disclosure based on a trained neural network model in one or more embodiments of the present disclosure.
Fig. 8 illustrates a schematic diagram of the operational principles of the translator sub-module and synthesizer sub-module in one or more embodiments of the present disclosure.
Fig. 9 shows a schematic diagram of a speech translation apparatus structure in one or more embodiments of the present disclosure.
Fig. 10 shows a schematic diagram of the internal structural composition of an electronic device in one or more embodiments of the present disclosure.
Detailed Description
In the related art, the speech translation task first recognizes the source language text corresponding to the source language speech through an ASR (Automatic Speech Recognition) system, then translates the source language text into the corresponding target language text through a language translation system, and finally outputs the target language speech based on the target language text through a TTS (Text To Speech) system.
Taking dialects as an example: communication involving dialects is mainly oral, so the exact written form of a dialect is usually mastered only by language specialists, and most users have difficulty typing accurate dialect text to use as TTS input; for non-native speakers, entering accurate dialect text is even harder. In the case of Cantonese, most non-native speakers and even some native speakers find it difficult to write Cantonese in its exact written form: for the meaning that Mandarin expresses as "I didn't do it on purpose", written Cantonese uses its own characters, yet non-native speakers typically write only the Mandarin text, and some native speakers write only a homophonic approximation. The Northeastern dialect is another example: although it is not very different from Mandarin, a non-native speaker wanting to say "it is cold where we are" will phrase it in standard Mandarin, whereas native speakers have a more idiomatic local expression. Translating between Mandarin and dialects can therefore help non-native and native speakers communicate accurately, but it requires a great deal of dialect expertise; moreover, parallel translation data between dialects is very scarce, and many dialects do not even have a standardized written form, so the related art has difficulty realizing a speech translation function for dialects.
Therefore, the related art cannot perform automatic speech translation for some special, uncommon dialects that have no written form. In addition, the ASR system, the language translation system, and the TTS system each require a large amount of data to train, so implementation is costly.
In view of the above, the present disclosure can provide a speech translation method and apparatus, an electronic device, and a storage medium, so as to effectively solve at least one problem of the related art.
As shown in fig. 1, one or more embodiments of the present disclosure can provide a speech translation method including, but not limited to, steps S100 to S400.
Step S100, receiving a first voice, wherein the first voice is a source language voice to be translated.
In a specific application, the first speech may be, for example, Mandarin speech uttered by a user, such as the sentence "I didn't do it on purpose"; the embodiments of the present disclosure can be used to translate such source language speech into the corresponding target language speech, for example to translate the Mandarin speech into Cantonese speech.
Step S200, extracting first acoustic feature information from the first speech, where the first acoustic feature information is used to represent semantics of the source language speech.
In this embodiment, the acoustic feature extraction function may be implemented by means of a speech recognition model, so that the first acoustic feature information is extracted from the first speech, although the embodiment is not limited to this.
Step S300, converting the first acoustic feature information into spectrum information of the target language.
Embodiments of the present disclosure directly convert the acoustic features of the source language into the spectrum of the target language, thereby skipping the costly text translation process and overcoming at least one problem of the related art.
Step S400, encoding the spectrum information to generate second voice; the second speech is a target language speech obtained after translating the source language speech.
Embodiments of the present disclosure can encode the spectral information into the second speech using a vocoder, although the embodiment is not limited thereto.
In a specific application, the second speech may be, for example, the Cantonese speech expressing "I didn't do it on purpose" obtained after translation by the scheme of the embodiment of the present disclosure.
In one or more embodiments of the present disclosure, the spectral information of the corresponding target language is determined based on the first acoustic feature information extracted from the source language speech, and the spectral information is further encoded into the target language speech. The entire speech translation process therefore requires no text information: even if the source language and/or the target language has no standardized written form, or no written form at all, the speech translation function can still be realized well. The scheme is particularly suitable for translating dialects that lack a standardized written form, so it applies to more application scenarios, optimizes the speech translation function, and significantly improves the user experience. In addition, the present disclosure can also significantly reduce the cost of implementing the speech translation function.
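To make the flow of steps S100 to S400 concrete, the following minimal sketch shows the text-free pipeline, assuming the four stages are supplied as callables; the type aliases and the function name speech_to_speech_translate are illustrative placeholders, not an interface defined by this disclosure.

```python
from typing import Callable
import numpy as np

# Hypothetical component signatures; each stage is described in later sections.
FeatureExtractor = Callable[[np.ndarray, int], np.ndarray]  # waveform, sample rate -> source features
Translator = Callable[[np.ndarray], np.ndarray]             # source features -> target features
Synthesizer = Callable[[np.ndarray], np.ndarray]            # target features -> target mel spectrum
Vocoder = Callable[[np.ndarray], np.ndarray]                # mel spectrum -> waveform

def speech_to_speech_translate(first_speech: np.ndarray, sample_rate: int,
                               extract: FeatureExtractor, translate: Translator,
                               synthesize: Synthesizer, vocode: Vocoder) -> np.ndarray:
    """Steps S100-S400: no text is produced at any point in the pipeline."""
    features_src = extract(first_speech, sample_rate)  # S200: first acoustic feature information
    features_tgt = translate(features_src)             # S310: second acoustic feature information
    mel_tgt = synthesize(features_tgt)                  # S320: spectral information of the target language
    return vocode(mel_tgt)                              # S400: the second speech
```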
As shown in fig. 2, in one or more embodiments of the present disclosure, the first acoustic feature information is converted into spectral information of a target language, including but not limited to steps S310 to S320.
Step S310, translating the first acoustic feature information to obtain second acoustic feature information; the second acoustic feature information is used to represent the semantics of the target language speech.
Embodiments of the present disclosure may enable translating first acoustic feature information to second acoustic feature information through a translator sub-module in an end-to-end speech translation model, although not limited thereto.
Step S320, converting the second acoustic feature information into spectrum information of the target language.
The embodiments of the present disclosure implement, for example, converting the second acoustic feature information into the spectral information of the target language through a synthesizer sub-module in the trained end-to-end speech translation model, although not limited thereto.
By translating the first acoustic feature information, the embodiment of the present disclosure converts the first acoustic feature information, which represents the meaning of the source language speech, into the second acoustic feature information, which represents the meaning of the target language speech. This significantly improves the accuracy of the obtained target-language spectral information, thereby improving the accuracy of speech translation and further improving the performance of the speech translation function of the present disclosure.
As shown in fig. 3, in one or more embodiments of the present disclosure, the first acoustic feature information is translated, including but not limited to steps S311 to S313.
In step S311, the context information is acquired from the first acoustic feature information, where the first acoustic feature information includes the context information.
The context information refers to information in the first acoustic feature information that affects the first semantic information, and includes information of various acoustic feature changes caused by different semantics of adjacent sentences, such as one or more of intonation, rhythm, and accent, but is not limited thereto.
Step S312, extracting first semantic information from the first acoustic feature information according to the context information, the first semantic information being used to represent meaning expressed by the source language.
For example, the context information may be used as a basis for distinguishing the different semantics of adjacent sentences, so that the first semantic information is obtained accurately.
In this embodiment, the first semantic information can be extracted from the first acoustic feature information according to the context information through a depthwise convolution layer contained in the translator submodule of the trained end-to-end speech translation model.
Step S313, converting the first semantic information into second acoustic feature information.
In this embodiment, the first semantic information can be converted into the second acoustic feature information through the nonlinear activation function (ReLU), the dropout layer (Dropout), the fully connected layer (Dense), and the like contained in the translator submodule of the trained end-to-end speech translation model.
According to this embodiment, the first semantic information can be accurately extracted with the help of the acquired context information, so that the meaning of the source language speech is determined comprehensively and accurately and is fully reflected in the second acoustic feature information.
As shown in fig. 4, in one or more embodiments of the present disclosure, the second acoustic feature information is converted into spectral information of the target language, including but not limited to steps S321 to S322.
Step S321, identifying second semantic information from the second acoustic feature information, where the second semantic information is used to represent meaning expressed by the target language.
The second acoustic feature information is understood through a trained neural network model; for example, the semantics contained in the second acoustic feature information can be understood through a trained long short-term memory (LSTM) unit, so that the second semantic information is identified from the second acoustic feature information.
In step S322, spectrum information of the target language is generated using the second semantic information.
For example, the trained attention mechanism module predicts the spectral information of the target language according to the semantics contained in the second acoustic feature information.
Based on the second semantic information identified from the second acoustic feature information, the meaning expressed by the second acoustic feature information is accurately understood, so that a more accurate target-language spectrum is generated on the basis of that understanding.
As shown in fig. 5, in one or more embodiments of the present disclosure, spectrum information of the target language is generated using the second semantic information, including but not limited to steps S3220 through S3221.
In step S3220, first prediction information for characterizing the coarse-grained spectrum is predicted using the second semantic information.
In this embodiment, the coarse-grained spectral information of the target language can be predicted from the semantics contained in the second acoustic feature information through the trained attention mechanism module.
In step S3221, second prediction information for characterizing the fine-grained spectrum is predicted using the first prediction information, and the second prediction information is used as spectrum information of the target language.
In this embodiment, the fine-grained spectral information can be predicted based on N trained one-dimensional convolution layers and a fully connected layer, where the specific value of N is set according to actual requirements.
This embodiment provides a two-step prediction method for the spectral information of the target language, comprising predicting a coarse-grained spectrum and then predicting a fine-grained spectrum. This method not only effectively improves the speed of spectrum prediction but also maintains its quality, giving better overall performance, as illustrated by the sketch below.
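As an illustration only, the two-step prediction can be sketched in PyTorch as follows, assuming 80-dimensional mel frames, a 512-dimensional decoder state, and N = 3 refinement convolutions; none of these sizes are fixed by this disclosure.

```python
import torch
import torch.nn as nn

class CoarseToFineSpectrum(nn.Module):
    """Illustrative two-step spectrum prediction (steps S3220-S3221)."""
    def __init__(self, hidden_dim: int = 512, n_mels: int = 80, n_conv: int = 3):
        super().__init__()
        # Step S3220: a coarse-grained frame is predicted from the decoder state.
        self.frame_proj = nn.Linear(hidden_dim, n_mels)
        # Step S3221: N stacked 1-D convolutions refine the coarse spectrum.
        self.refine = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2), nn.Tanh())
            for _ in range(n_conv)
        ])

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, time, hidden_dim)
        coarse = self.frame_proj(decoder_states)         # (batch, time, n_mels), coarse-grained
        residual = self.refine(coarse.transpose(1, 2))   # convolve over the time axis
        fine = coarse + residual.transpose(1, 2)         # fine-grained spectrum of the target language
        return fine
```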
As shown in fig. 6, in one or more embodiments of the present disclosure, first acoustic feature information is extracted from a first voice, including, but not limited to, step S210 and step S220.
In step S210, original acoustic feature information is extracted from the first voice, where the original acoustic feature information includes first acoustic feature information and third acoustic feature information, and the third acoustic feature information includes tone information and pronunciation emotion information.
The present embodiment extracts the original acoustic feature information from the first speech through an ASR (Automatic Speech Recognition) model, but is not limited thereto.
Step S220, the original acoustic feature information is filtered to filter out the third acoustic feature information, and the first acoustic feature information is obtained.
The first acoustic feature information in this embodiment includes first semantic information, for example speech meaning information, and context information, for example prosody information.
For example, if the first speech is recorded by a young speaker in a cheerful mood, then after this embodiment is applied, the tone information and pronunciation emotion information in the speech are filtered out, while the speech meaning information and prosody information are retained, so that the meaning of the speech is understood without interference from redundant information.
Based on this embodiment, the third acoustic feature information, which is irrelevant to the speech translation function, can also be filtered out, so that more accurate first acoustic feature information is provided for the subsequent feature conversion and spectrum encoding processes. This improves the reliability of the subsequent processing, avoids interference from information irrelevant to speech translation, and significantly reduces the amount of data to be processed, thereby markedly improving the speech translation efficiency of the present disclosure.
Optionally, the source language voice is mandarin chinese voice and the target language voice is dialect voice, or the source language voice is dialect voice and the target language voice is mandarin chinese voice.
Of course, in an alternative embodiment of the present disclosure, the source language is Chinese and the target language is a foreign language, or the source language is a foreign language and the target language is Chinese, where the foreign language may be any non-Chinese language such as English, Japanese, or French.
The embodiments of the disclosure are well suited to speech translation between different languages, and are particularly suitable for speech translation between dialects, Mandarin, and the like when a language has no standardized written form or even no written form at all, so the applicable scenarios are broad.
Optionally, the spectral information of the target language is mel spectral information of the target language, that is, information composed of a mel spectrum.
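For reference, a mel spectrogram can be computed from a waveform as in the sketch below; the librosa parameters shown (n_fft, hop_length, n_mels) are common illustrative choices, not values prescribed by this disclosure.

```python
import librosa
import numpy as np

def waveform_to_mel(waveform: np.ndarray, sample_rate: int = 22050,
                    n_fft: int = 1024, hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Compute a log-mel spectrogram: the kind of 'mel spectral information'
    the synthesizer predicts and the vocoder consumes (illustrative settings)."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)  # shape: (n_mels, frames)
```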
In this embodiment, the mel spectral information may be encoded into audio by a MelGAN (Mel Generative Adversarial Network) vocoder, which encodes the mel spectral information into the second speech.
With the spectral information of the target language composed of a mel spectrum, the embodiment of the disclosure can accurately represent how the signal corresponding to the semantics of the source language speech is distributed across frequencies, which improves the accuracy of speech translation.
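The MelGAN-style encoding mentioned above can be pictured with the following greatly simplified, untrained generator; the real MelGAN adds residual stacks and adversarial training, and the channel counts and upsampling factors here are assumptions chosen only to show how transposed convolutions turn mel frames back into audio-rate samples.

```python
import torch
import torch.nn as nn

class TinyMelVocoder(nn.Module):
    """Simplified MelGAN-style generator: mel spectrogram -> waveform (sketch only)."""
    def __init__(self, n_mels: int = 80, channels: int = 256):
        super().__init__()
        layers = [nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)]
        # 8x * 8x * 2x * 2x = 256x upsampling, matching an assumed hop length of 256.
        for factor in (8, 8, 2, 2):
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=factor * 2, stride=factor,
                                   padding=factor // 2),
            ]
            channels //= 2
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(channels, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> waveform: (batch, 1, frames * 256)
        return self.net(mel)
```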
As shown in fig. 7, in the embodiment of the present disclosure, the feature extraction module extracts the first acoustic feature information from the input source language speech, thereby outputting the acoustic features corresponding to the source language speech; the feature extraction module may specifically be an acoustic feature extractor. In the embodiments of the present disclosure, the acoustic feature extractor is implemented as a neural network contained in an ASR (Automatic Speech Recognition) model based on a Conformer structure, and an intermediate result of the ASR model, the BNF (BottleNeck Features), is taken as the first acoustic feature information. The ASR model may include a 72-layer neural network, of which, for example, the first 32 layers may be used to extract the first acoustic feature information while non-content information such as tone and emotion is filtered out.
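A minimal sketch of taking bottleneck features from a pretrained encoder is given below, assuming the encoder's blocks are exposed as an nn.ModuleList and a frontend maps input frames to the model dimension; both handles are hypothetical, since the disclosure does not specify a particular ASR implementation.

```python
import torch
import torch.nn as nn

def extract_bottleneck_features(frontend: nn.Module, encoder_layers: nn.ModuleList,
                                input_frames: torch.Tensor, n_bnf_layers: int = 32) -> torch.Tensor:
    """Run only the first n_bnf_layers blocks of a pretrained ASR encoder
    (e.g. 32 of 72 blocks) and return the intermediate hidden states as BNF,
    i.e. the first acoustic feature information of steps S200/S220."""
    with torch.no_grad():
        hidden = frontend(input_frames)              # e.g. input frames -> model dimension
        for block in encoder_layers[:n_bnf_layers]:
            hidden = block(hidden)                   # content-focused intermediate representation
    return hidden                                     # (batch, frames, feature_dim)
```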
Next, the present embodiment may translate the acoustic features into the spectrum of the target language through an end-to-end speech translation model (S2ST, Speech-to-Speech Translation). The end-to-end speech translation model comprises a translator submodule and a synthesizer submodule: the translator submodule takes the acoustic features of the source language speech as input and outputs the acoustic features (BNF) of the target language speech, and the synthesizer submodule takes the acoustic features (BNF) of the target language speech as input and outputs the spectrum of the target language.
As shown in fig. 8, the translator submodule may include, stacked in sequence, a gated nonlinear activation function (GLU), a depthwise convolution layer (Depthwise convolution), a nonlinear activation function (ReLU), a dropout layer (Dropout), an adder, a normalization layer (Layer Norm), a fully connected layer (Dense), another dropout layer (Dropout), and another fully connected layer (Dense). The GLU and ReLU activations are used to alleviate the vanishing-gradient problem; the depthwise convolution layer extracts the semantic information of the current input according to the context information; the dropout layers introduce randomness into the model and improve its effect; the normalization layer normalizes the model parameters to ensure model stability; and both fully connected layers perform dimension adjustment. The N marked in the translator submodule indicates a stack of N such illustrated sub-blocks; a sketch of one such block is given below.
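The stacked structure described above can be roughed out as the following PyTorch module, representing one of the N repeated blocks; the feature dimension, kernel size, and dropout rate are illustrative assumptions, and the exact placement of the gating projection is not specified by the disclosure.

```python
import torch
import torch.nn as nn

class TranslatorBlock(nn.Module):
    """One of the N stacked blocks of the translator submodule (Fig. 8), sketched with assumed sizes."""
    def __init__(self, dim: int = 256, kernel_size: int = 15, dropout: float = 0.1):
        super().__init__()
        self.glu_proj = nn.Conv1d(dim, 2 * dim, kernel_size=1)  # projection feeding the GLU gate
        self.glu = nn.GLU(dim=1)                                  # gated activation
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)  # context modelling
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(dim)
        self.dense1 = nn.Linear(dim, dim)   # dimension adjustment
        self.dense2 = nn.Linear(dim, dim)   # dimension adjustment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), source-language acoustic features (BNF)
        residual = x
        h = self.glu(self.glu_proj(x.transpose(1, 2)))    # GLU
        h = torch.relu(self.depthwise(h))                  # depthwise conv + ReLU: semantics from context
        h = self.dropout(h).transpose(1, 2)
        h = self.norm(h + residual)                        # adder + Layer Norm
        return self.dense2(self.dropout(self.dense1(h)))   # Dense -> Dropout -> Dense
```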
As shown in fig. 8, the synthesizer submodule includes a nonlinear transformation network (Pre-net) for processing the feature input, a long short-term memory unit (LSTM cell) for understanding the semantics of the target-language acoustic features, and an attention mechanism module (Attention) for aligning input and output, i.e., aligning speech information between different languages; for example, time steps 1-5 of the source language may correspond to time steps 1-8 of the target language while expressing the same meaning (for example, the meaning of "today"). In addition, StopProj predicts whether to stop, i.e., whether the sentence has reached its last time step, in which case prediction stops; FrameProj predicts the coarse-grained spectrum; Conv1D refines the predicted coarse-grained spectrum to add more detail; and the fully connected layer (Dense) performs dimension adjustment and finally outputs the predicted fine-grained spectrum. A sketch of one decoding step is given below.
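One autoregressive decoding step of the synthesizer can be sketched as follows; the dot-product attention used here is a simplification of the attention mechanism in Fig. 8, all dimensions are illustrative assumptions, and the coarse frame returned would then be refined by a Conv1D post-net such as the one sketched after step S3221.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesizerDecoderStep(nn.Module):
    """One decoding step of the synthesizer sketched from Fig. 8 (assumed sizes)."""
    def __init__(self, enc_dim: int = 256, hidden: int = 512, n_mels: int = 80):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())   # Pre-net
        self.lstm = nn.LSTMCell(hidden + enc_dim, hidden)      # LSTM cell: understands target features
        self.attn_query = nn.Linear(hidden, enc_dim)           # simplified dot-product attention
        self.stop_proj = nn.Linear(hidden + enc_dim, 1)        # StopProj: last time step reached?
        self.frame_proj = nn.Linear(hidden + enc_dim, n_mels)  # FrameProj: coarse-grained mel frame

    def forward(self, prev_frame, encoder_out, state):
        # prev_frame: (batch, n_mels); encoder_out: (batch, src_time, enc_dim)
        h_prev, c_prev = state
        query = self.attn_query(h_prev)                                    # (batch, enc_dim)
        scores = torch.bmm(encoder_out, query.unsqueeze(2)).squeeze(2)     # align source/target steps
        context = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), encoder_out).squeeze(1)
        h, c = self.lstm(torch.cat([self.prenet(prev_frame), context], dim=1), (h_prev, c_prev))
        out = torch.cat([h, context], dim=1)
        stop_logit = self.stop_proj(out)     # stop decoding when sigmoid(stop_logit) exceeds a threshold
        coarse_frame = self.frame_proj(out)  # later refined by the Conv1D post-net
        return coarse_frame, stop_logit, (h, c)
```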
It should be appreciated that the translator sub-module referred to in this disclosure is specifically a trained translator sub-module, and the synthesizer sub-module referred to is specifically a trained synthesizer sub-module.
As shown in fig. 7, the present disclosure takes translating Mandarin into Cantonese as an example of the training process of the end-to-end speech translation model (S2ST), and specifically describes the training process of the translator submodule and the synthesizer submodule contained in the end-to-end speech translation model.
The training data used in the training process comprises a Mandarin audio corpus of a preset duration and a Cantonese audio corpus of a preset duration, where the preset duration is 1200 hours and the content expressed by the Mandarin audio corpus is the same as the content expressed by the Cantonese audio corpus.
During model training, the acoustic feature extractor is used to extract the acoustic feature information of the Mandarin audio corpus and of the Cantonese audio corpus, i.e., the acoustic feature information of the source language speech and of the target language speech. The acoustic feature information of the Mandarin audio corpus is input to the translator submodule to obtain predicted acoustic feature information, and the loss of the translator submodule (L_translator) is determined from the predicted acoustic feature information and the acoustic feature information of the Cantonese audio corpus; the loss may be an L2 loss. Whether the translator submodule has finished training is determined from this loss, for example training is considered complete when the loss (L_translator) is smaller than a first preset value. In this embodiment, a joint training approach may be adopted: the acoustic feature information output by the trained translator submodule is input to the synthesizer submodule to obtain predicted target-language spectral information, and the loss of the synthesizer submodule (L_synthesizer) is determined from the predicted target-language spectral information and the spectral information of the real target speech corresponding to the Cantonese audio corpus; this loss is, for example, an L2 loss. Whether the synthesizer submodule has finished training is determined from this loss, for example training is considered complete when the loss (L_synthesizer) is smaller than a second preset value. A simplified joint training step is sketched below.
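Under the assumption that the parallel Mandarin and Cantonese utterances have been aligned to the same number of frames (an alignment detail not spelled out in this description), one joint training step could look like the following sketch; F.mse_loss stands in for the L2 loss mentioned above, and all names are illustrative.

```python
import torch.nn.functional as F

def joint_training_step(translator, synthesizer, optimizer,
                        mandarin_bnf, cantonese_bnf, cantonese_mel):
    """One hypothetical joint training step for the translator and synthesizer
    sub-modules; the acoustic features are extracted beforehand by the (frozen)
    acoustic feature extractor from parallel Mandarin/Cantonese audio."""
    predicted_bnf = translator(mandarin_bnf)
    loss_translator = F.mse_loss(predicted_bnf, cantonese_bnf)    # L_translator

    predicted_mel = synthesizer(predicted_bnf)
    loss_synthesizer = F.mse_loss(predicted_mel, cantonese_mel)   # L_synthesizer

    loss = loss_translator + loss_synthesizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Training is considered complete once each loss falls below its preset threshold.
    return loss_translator.item(), loss_synthesizer.item()
```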
In a preferred embodiment, the disclosure computes the L2 loss of the coarse-grained spectrum and the L2 loss of the fine-grained spectrum output by the synthesizer submodule separately, and also determines the softmax (normalized exponential function) cross entropy between the stop symbol predicted by the attention mechanism module and the real stop symbol corresponding to the Cantonese audio corpus; the synthesizer submodule is determined to have finished training once the L2 loss of the coarse-grained spectrum, the L2 loss of the fine-grained spectrum, and the softmax cross entropy each satisfy their specified conditions.
Compared with the conventional technology, this embodiment uses the end-to-end speech translation model (S2ST) to translate the first acoustic feature information into predicted mel spectral information of the target language. Compared with the multi-system cascade formed by an ASR system, a language translation system, and a TTS system in the prior art, this embodiment realizes an end-to-end dialect translation function and effectively avoids the error accumulation caused by multiple systems; only the end-to-end speech translation model needs to be trained, which greatly reduces the required data volume, data annotation effort, model training time, and storage consumption. Moreover, the speech translation method provided by this embodiment requires no text information, which effectively solves the problem that uncommon dialects and similar languages have no written form, and further reduces the implementation cost. In addition, the amount of training data used by the present disclosure is small, which reduces the implementation cost even further.
As shown in fig. 9, based on the same inventive technical concept as the speech translation method provided in at least one embodiment of the present disclosure, at least one embodiment of the present disclosure can also provide a speech translation apparatus.
Among other things, speech translation devices in one or more embodiments of the present disclosure include, but are not limited to, a speech receiving module 901, a feature extraction module 902, an information conversion module 903, and a speech encoding module 904.
The voice receiving module 901 is configured to receive a first voice, where the first voice is a source language voice to be translated.
The feature extraction module 902 is configured to extract first acoustic feature information from a first voice, where the first acoustic feature information is used to represent semantics of a source language voice.
The information conversion module 903 is configured to convert the first acoustic feature information into spectrum information of the target language.
A speech encoding module 904 for encoding the spectral information to generate a second speech; the second speech is a target language speech obtained after translating the source language speech.
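Purely as an illustration of how the modules of fig. 9 compose, the apparatus can be sketched as below; the class name and the injected callables are hypothetical, and the concrete module implementations (for example those sketched in the method embodiments above) are supplied from outside.

```python
import numpy as np

class SpeechTranslationApparatus:
    """Illustrative wiring of modules 901-904; concrete implementations are injected."""
    def __init__(self, feature_extractor, information_converter, speech_encoder):
        self.feature_extractor = feature_extractor           # module 902
        self.information_converter = information_converter   # module 903
        self.speech_encoder = speech_encoder                  # module 904 (e.g. a vocoder)

    def receive_and_translate(self, first_speech: np.ndarray, sample_rate: int) -> np.ndarray:
        # Module 901: receive the first (source-language) speech.
        features = self.feature_extractor(first_speech, sample_rate)  # first acoustic features
        mel = self.information_converter(features)                    # target-language spectrum
        return self.speech_encoder(mel)                               # second (target-language) speech
```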
Optionally, the information conversion module 903 includes a translator sub-module and a synthesizer sub-module.
The translator sub-module is used for translating the first acoustic feature information to obtain second acoustic feature information; the second acoustic feature information is used to represent the semantics of the target language speech.
And the synthesizer submodule is used for converting the second acoustic characteristic information into the frequency spectrum information of the target language.
Optionally, the translator sub-module includes a context acquisition module, a semantic extraction module, and a semantic conversion module.
The context acquisition module is used for acquiring context information from the first acoustic feature information, wherein the first acoustic feature information comprises the context information.
The semantic extraction module is used for extracting first semantic information from the first acoustic feature information according to the context information, and the first semantic information is used for representing meaning expressed by the source language.
And the semantic conversion module is used for converting the first semantic information into second acoustic feature information.
Optionally, the synthesizer submodule includes a semantic recognition module and a spectrum generation module.
The semantic recognition module is used for recognizing second semantic information from the second acoustic feature information, and the second semantic information is used for representing meaning expressed by the target language.
And the spectrum generation module is used for generating spectrum information of the target language by using the second semantic information.
Optionally, the spectrum generation module includes a first prediction module and a second prediction module.
And the first prediction module is used for predicting first prediction information for representing the coarse-granularity frequency spectrum by using the second semantic information.
And the second prediction module is used for predicting, by using the first prediction information, second prediction information for representing the fine-grained spectrum, and for taking the second prediction information as the spectral information of the target language.
Optionally, the feature extraction module 902 includes an extraction sub-module and a filtering sub-module.
The extraction submodule is used for extracting original acoustic feature information from the first voice, the original acoustic feature information comprises first acoustic feature information and third acoustic feature information, and the third acoustic feature information comprises tone information and pronunciation emotion information.
And the filtering sub-module is used for filtering the original acoustic characteristic information to filter out the third acoustic characteristic information and obtain the first acoustic characteristic information.
Optionally, the source language voice is mandarin chinese voice and the target language voice is dialect voice, or the source language voice is dialect voice and the target language voice is mandarin chinese voice.
Optionally, the spectral information of the target language is mel spectral information of the target language.
As shown in fig. 10, based on the same inventive concept as the speech translation method provided in one or more embodiments of the present disclosure, one or more embodiments of the present disclosure can also provide an electronic device including a memory and a processor, where the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform the speech translation method in one or more embodiments of the present disclosure. The detailed implementation flow of the speech translation method has been described in detail in the present specification and will not be repeated here.
The electronic device according to the present disclosure may be used as an execution subject of the speech translation method, and the electronic device may include, but is not limited to, a computer, a mobile terminal, a portable translator, and the like, which are capable of implementing the speech translation method of the present disclosure.
As shown in fig. 10, based on the same inventive concept as the speech translation method provided in one or more embodiments of the present disclosure, one or more embodiments of the present disclosure can also provide a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the speech translation method in one or more embodiments of the present disclosure. The detailed implementation flow of the speech translation method has been described in detail in the present specification and will not be repeated here.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable storage medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) with one or more wires, a portable computer disk (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with appropriate combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and the like.
In the description of the present specification, a description referring to the terms "present embodiment," "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, the meaning of "a plurality" is at least two, such as two, three, etc., unless explicitly specified otherwise.
The above description is of the preferred embodiments of the present disclosure, but is not intended to limit the present disclosure, and any modifications, equivalents, and simple improvements made within the spirit of the present disclosure should be included in the scope of the present disclosure.

Claims (11)

1. A method of speech translation, the method comprising:
receiving a first voice, wherein the first voice is a source language voice to be translated;
extracting first acoustic feature information from the first voice, wherein the first acoustic feature information is used for representing the semantics of the source language voice;
converting the first acoustic feature information into spectrum information of a target language;
encoding the spectral information to generate a second speech; the second voice is a target language voice obtained after the source language voice is translated.
2. The method of claim 1, wherein the converting the first acoustic feature information into the spectral information of the target language comprises:
translating the first acoustic feature information to obtain second acoustic feature information; the second acoustic feature information is used for representing the semantics of the target language voice;
and converting the second acoustic characteristic information into spectrum information of a target language.
3. The method of claim 2, wherein translating the first acoustic feature information comprises:
acquiring context information from the first acoustic feature information, wherein the first acoustic feature information comprises the context information;
extracting the first semantic information from the first acoustic feature information according to the context information, wherein the first semantic information is used for representing meaning expressed by the source language;
and converting the first semantic information into the second acoustic feature information.
4. A speech translation method according to claim 2 or 3, wherein said converting said second acoustic feature information into spectral information of a target language comprises:
identifying second semantic information from the second acoustic feature information, the second semantic information being used to represent a meaning expressed by the target language;
and generating the spectrum information of the target language by using the second semantic information.
5. The speech translation method according to claim 4, wherein generating spectral information of the target language using the second semantic information comprises:
predicting first prediction information for representing a coarse-grained spectrum by using the second semantic information;
and predicting second prediction information used for representing fine-grained frequency spectrum by using the first prediction information, and taking the second prediction information as the frequency spectrum information of the target language.
6. The method of claim 1, wherein the extracting first acoustic feature information from the first speech comprises:
extracting original acoustic feature information from the first voice, wherein the original acoustic feature information comprises the first acoustic feature information and third acoustic feature information, and the third acoustic feature information comprises tone information and pronunciation emotion information;
and filtering the original acoustic characteristic information to filter out the third acoustic characteristic information, so as to obtain the first acoustic characteristic information.
7. The method for speech translation of claim 1,
the source language speech is mandarin chinese speech and the target language speech is dialect speech, or the source language speech is dialect speech and the target language speech is mandarin chinese speech.
8. The method for speech translation of claim 1,
the spectrum information of the target language is mel spectrum information of the target language.
9. A speech translation apparatus, the apparatus comprising:
the voice receiving module is used for receiving first voice which is source language voice to be translated;
the feature extraction module is used for extracting first acoustic feature information from the first voice, wherein the first acoustic feature information is used for representing the semantics of the source language voice;
the information conversion module is used for converting the first acoustic characteristic information into spectrum information of a target language;
a voice encoding module for encoding the spectrum information to generate a second voice; the second voice is a target language voice obtained after the source language voice is translated.
10. An electronic device comprising a memory and a processor, the memory having stored therein computer readable instructions that, when executed by the processor, cause the processor to perform the speech translation method of any of claims 1 to 8.
11. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the speech translation method of any of claims 1 to 8.
CN202310522097.1A 2023-05-10 2023-05-10 Speech translation method and device, electronic equipment and storage medium Pending CN116504223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310522097.1A CN116504223A (en) 2023-05-10 2023-05-10 Speech translation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310522097.1A CN116504223A (en) 2023-05-10 2023-05-10 Speech translation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116504223A

Family

ID=87330103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310522097.1A Pending CN116504223A (en) 2023-05-10 2023-05-10 Speech translation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116504223A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911323A (en) * 2023-09-13 2023-10-20 深圳市微克科技有限公司 Real-time translation method, system and medium of intelligent wearable device
CN116911323B (en) * 2023-09-13 2024-03-26 深圳市微克科技股份有限公司 Real-time translation method, system and medium of intelligent wearable device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination