CN108231062B - Voice translation method and device - Google Patents


Publication number
CN108231062B
Authority
CN
China
Prior art keywords
text
voice
feature
speech
translation
Prior art date
Legal status
Active
Application number
CN201810032112.3A
Other languages
Chinese (zh)
Other versions
CN108231062A (en)
Inventor
王雨蒙
周良
江源
胡国平
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201810032112.3A
Publication of CN108231062A
Application granted
Publication of CN108231062B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice translation method and apparatus. The method comprises: for voice data requiring text translation, performing voice recognition on the voice data to generate a voice recognition text; extracting acoustic features from the voice data; and translating the voice recognition text according to the extracted acoustic features to obtain a translated text carrying the voice style of the voice data. Because the acoustic characteristics of the voice data are taken into account during translation, the translated text conforms to the style and characteristics of the voice data, is more natural and expressive, and makes it easier for a reader to understand the semantics and context.

Description

Voice translation method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for speech translation.
Background
With the increasing maturity of artificial intelligence technology, people increasingly turn to intelligent technology to solve problems. For example, communicating with native speakers of another language once required spending a great deal of time learning that language; now, a translator built around voice recognition, intelligent translation and voice synthesis technology can take spoken input, translate the text, and speak the translated meaning aloud.
However, most current text translation technologies only achieve a literal translation of the text; that is, when text translation is performed on the speech data of a source speaker, the translated text often cannot express the style and features of the source speaker. For example, when Chinese speech is translated into English text, the Chinese text of that speech may correspond to several different English texts, and the language style and emotional characteristics expressed by those English texts may differ; the English text actually produced is often an ill-fitting choice, i.e., the translated text fails to express the style and characteristics of the source speaker.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a speech translation method and apparatus that, when translating speech data, make the translated text conform to the style and characteristics of the speech data.
The embodiment of the application provides a voice translation method, which comprises the following steps:
acquiring first voice data;
generating a voice recognition text by performing voice recognition on the first voice data;
extracting target acoustic features from the first voice data, and translating the voice recognition text according to the target acoustic features to obtain a translation text carrying the voice style of the first voice data.
Optionally, the extracting the target acoustic feature from the first speech data includes:
taking the voice recognition text as a unit recognition text, or taking each text segment forming the voice recognition text as a unit recognition text respectively;
determining a voice segment corresponding to the unit identification text in the first voice data;
determining a target acoustic feature of the speech segment.
Optionally, the translating the speech recognition text according to the target acoustic feature includes:
respectively carrying out text vectorization on each unit identification text;
respectively carrying out feature vectorization on the target acoustic features corresponding to the unit identification texts;
and taking the text vectorization result and the feature vectorization result as the input feature of a pre-constructed translation model, so as to realize the translation of the voice recognition text by using the translation model.
Optionally, if the target acoustic feature includes at least one feature type, the method further includes:
determining a value range corresponding to the feature type according to pre-collected sample voice data;
dividing the value range into at least two value intervals;
then, the performing feature vectorization on the target acoustic features corresponding to each unit identification text respectively includes:
for target acoustic features corresponding to each unit identification text, determining a feature value corresponding to each feature type in the target acoustic features;
and performing feature vectorization on the feature value according to the value interval into which the feature value falls.
Optionally, the method further includes:
and carrying out voice synthesis on the translated text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
Optionally, the performing speech synthesis on the translated text according to the target acoustic feature includes:
adjusting model parameters of a pre-constructed synthetic model by using the target acoustic characteristics;
and performing voice synthesis on the translated text by using the adjusted synthesis model.
Optionally, the performing speech synthesis on the translated text according to the target acoustic feature includes:
performing voice synthesis on the translation text by using a pre-constructed synthesis model to obtain initial voice data;
and adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics.
Optionally, the target acoustic features include one or more feature types of average speech speed, average pitch, and average volume.
An embodiment of the present application further provides a speech translation apparatus, including:
a voice data acquisition unit for acquiring first voice data;
a recognition text generation unit configured to generate a speech recognition text by performing speech recognition on the first speech data;
an acoustic feature extraction unit configured to extract a target acoustic feature from the first voice data;
and the translation text generation unit is used for translating the voice recognition text according to the target acoustic characteristics to obtain a translation text carrying the voice style of the first voice data.
Optionally, the acoustic feature extraction unit includes:
a unit text determining subunit, configured to use the speech recognition text as a unit recognition text, or use each text segment forming the speech recognition text as a unit recognition text;
a voice section determining subunit, configured to determine a voice section corresponding to the unit identification text in the first voice data;
and the acoustic feature determining subunit is used for determining the target acoustic features of the voice segments.
Optionally, the translation text generating unit includes:
the text vectorization subunit is used for respectively carrying out text vectorization on each unit identification text;
the feature vectorization subunit is configured to perform feature vectorization on the target acoustic features corresponding to each unit identification text;
and the translation text generation subunit is configured to use the text vectorization result and the feature vectorization result as input features of a pre-constructed translation model, so as to implement translation of the speech recognition text by using the translation model and obtain a translation text carrying the voice style of the first voice data.
Optionally, if the target acoustic feature includes at least one feature type, the apparatus further includes:
a value range determining unit, configured to determine a value range corresponding to the feature type according to sample voice data collected in advance;
a value interval dividing unit, configured to divide the value range into at least two value intervals;
then, the feature vectorization subunit includes:
the feature value determining subunit is configured to determine, for a target acoustic feature corresponding to each unit identification text, a feature value corresponding to each feature type in the target acoustic feature;
and the vectorization processing subunit is used for performing feature vectorization on the feature value according to the value interval into which the feature value falls.
Optionally, the apparatus further comprises:
and the translation voice generating unit is used for carrying out voice synthesis on the translation text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
Optionally, the translation speech generating unit includes:
the model parameter adjusting subunit is used for adjusting the model parameters of a pre-constructed synthesis model by using the target acoustic characteristics;
a first speech generation subunit, configured to perform speech synthesis on the translation text by using the adjusted synthesis model;
alternatively, the translation speech generating unit includes:
the second voice generation subunit is used for carrying out voice synthesis on the translation text by utilizing a pre-constructed synthesis model to obtain initial voice data;
and the voice data adjusting subunit is used for adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics.
Optionally, the target acoustic features include one or more feature types of average speech speed, average pitch, and average volume.
The embodiment of the present application further provides another speech translation apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is for storing one or more programs, the one or more programs including instructions, which when executed by the processor, cause the processor to perform the method of any of the above.
In the speech translation method and apparatus provided in this embodiment, for speech data that needs to be subjected to text translation, speech recognition is performed on the speech data to generate a speech recognition text; and extracting acoustic features from the voice data, and translating the voice recognition text according to the extracted acoustic features to obtain a translation text carrying the voice style of the voice data. Therefore, when the voice data is translated, the acoustic characteristics of the voice data are considered, so that the translated text can conform to the style and characteristics of the voice data, the translated text is more natural and has higher expressive force, and a text reader can understand the semantics and the context conveniently.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a speech translation method according to an embodiment of the present application;
fig. 2 is a second schematic flowchart of a speech translation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a text translation provided by an embodiment of the present application;
fig. 4 is a third schematic flowchart of a speech translation method according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a speech translation apparatus according to an embodiment of the present application;
fig. 6 is a schematic hardware structure diagram of a speech translation apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the current text translation technologies, most of the translation technologies only achieve literal translation of text, that is, when performing text translation on voice data of a source speaker, the translated text often cannot express the style and characteristics of the source speaker.
Therefore, the embodiment of the application provides a speech translation method, in the process of implementing speech translation, not only is text translation implemented, but also the linguistic style and emotional characteristics of a source speaker are transferred, that is, the translated text can be adaptive to the linguistic style and emotional characteristics of the source speaker, so as to generate a more natural and expressive translated text, thereby helping a text reader understand semantics and context.
Exemplary embodiments provided herein are described in detail below with reference to the accompanying drawings.
First embodiment
Referring to fig. 1, a flow chart of a speech translation method provided in an embodiment of the present application is schematically illustrated, where the method includes the following steps:
s101: first voice data is acquired.
In this embodiment, the voice data to be subjected to text translation is defined as first voice data.
The source of the first voice data is not limited in this embodiment, for example, the first voice data may be a real voice of a source speaker or a recorded voice, or may be a special-effect voice obtained by performing machine processing on the real voice or the recorded voice.
The embodiment also does not limit the length of the first voice data, for example, the first voice data may be a word, a sentence, or a paragraph.
S102: and generating a voice recognition text by performing voice recognition on the first voice data.
After the first voice data is acquired, the first voice data is converted into a voice recognition text through a voice recognition technology, such as an artificial neural network-based voice recognition technology.
S103: extracting target acoustic features from the first voice data, and translating the voice recognition text according to the target acoustic features to obtain a translation text carrying the voice style of the first voice data.
In this embodiment, one or more acoustic feature types, such as speech rate, pitch, and volume, may be preset. When performing text translation on the first speech data, a specific feature value for each acoustic feature type is extracted from the first speech data, and these specific feature values are used as the target acoustic features. When a plurality of candidate translation results exist, the most appropriate translated text is selected in combination with the target acoustic features.
It can be understood that the target acoustic features mainly describe the speech style of the first speech data, that is, the speech style of the source speaker of the first speech data. Therefore, if the target acoustic features are taken into account during text translation, the resulting translated text not only matches the wording of the source speaker but also reflects the expressive style of the source speaker's speech; in other words, it is a translated text that conforms to the source speaker's speech style.
For ease of understanding, an example is given. When the first speech data is the Chinese phrase meaning "certainly", the translated text generated by existing text translation technology may be "Yes, of course"; when the speech style of the first speech data is taken into account, as in this embodiment, the translated text may instead be "You bet", which better fits the meaning and tone that the source speaker's first speech data is intended to express.
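As a high-level illustration of the flow of S101 to S103 only, a minimal sketch follows; the three callables are hypothetical stand-ins for an ASR engine, the acoustic feature extraction detailed in the second embodiment, and the feature-conditioned translation model, and none of their interfaces comes from the patent.

```python
from typing import Callable, Dict

def speech_translate(first_voice_data: bytes,
                     recognize: Callable[[bytes], str],
                     extract_features: Callable[[bytes], Dict[str, float]],
                     translate: Callable[[str, Dict[str, float]], str]) -> str:
    """S101-S103: recognize the speech, extract its target acoustic features,
    and translate the recognition text conditioned on those features."""
    recognition_text = recognize(first_voice_data)         # S102: speech recognition
    target_features = extract_features(first_voice_data)   # S103: e.g. speech rate, pitch, volume
    return translate(recognition_text, target_features)    # S103: style-aware translation
```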
In the speech translation method provided by this embodiment, for speech data that needs to be subjected to text translation, speech recognition is performed on the speech data to generate a speech recognition text; and extracting acoustic features from the voice data, and translating the voice recognition text according to the extracted acoustic features to obtain a translation text carrying the voice style of the voice data. Therefore, when the voice data is translated, the acoustic characteristics of the voice data are considered, so that the translated text can conform to the style and characteristics of the voice data, the translated text is more natural and has higher expressive force, and a text reader can understand the semantics and the context conveniently.
Second embodiment
In this embodiment, a specific implementation manner of S103 in the first embodiment is described, and please refer to the description of the first embodiment for other relevant parts.
Referring to fig. 2, a flow chart of a speech translation method provided in the embodiment of the present application is schematically illustrated, where the method includes the following steps:
s201: first voice data is acquired.
S202: and generating a voice recognition text by performing voice recognition on the first voice data.
It should be noted that S201 and S202 in the present embodiment are the same as S101 and S102 in the first embodiment, and for related description, reference is made to the first embodiment, which is not repeated herein.
S203: and taking the voice recognition text as a unit recognition text, or taking each text segment forming the voice recognition text as the unit recognition text respectively.
In this embodiment, the entire voice recognition text may be taken as one unit recognition text; or the voice recognition text may be split, with each split text segment taken as a unit recognition text.
When the text is split, the text can be split based on a preset splitting unit, and a larger splitting unit or a smaller splitting unit can be used, for example, the text is split by taking a single syllable, a single character, a single word or the like as the splitting unit, so that a plurality of text segments are obtained. The preset splitting unit can be a manually preset splitting unit or a default splitting unit of the system.
For example, when the speech recognition text is Chinese, each split text segment may be a single character: if the speech recognition text is the three-character phrase meaning "certainly", the split text segments are its first, second, and third characters. When the speech recognition text is English, each split text segment may be a word: if the speech recognition text is "Yes, of course", the split text segments are "Yes", "of", and "course".
S204: and determining a voice segment corresponding to the unit recognition text in the first voice data, and determining a target acoustic feature of the voice segment.
In this embodiment, when the whole speech recognition text is used as a unit recognition text, a speech segment corresponding to the unit recognition text is the first speech data; when each text segment in the speech recognition text is taken as a unit recognition text, the speech segment corresponding to the unit recognition text is the speech data corresponding to the text segment in the first speech data.
After the speech segment corresponding to each unit recognition text is determined, the target acoustic features of each speech segment are further determined. Taking the Chinese phrase meaning "certainly" as an example, split into three single-character unit recognition texts: speech segment 1 corresponding to the first character is obtained and its target acoustic features are extracted, speech segment 2 corresponding to the second character is obtained and its target acoustic features are extracted, and speech segment 3 corresponding to the third character is obtained and its target acoustic features are extracted.
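The patent does not spell out how each unit recognition text is mapped to its speech segment. A minimal sketch, assuming the speech recognizer reports a start and end time for each recognized unit and that the first voice data is available as an array of samples, is given below.

```python
import numpy as np

def slice_speech_segments(samples: np.ndarray, sample_rate: int, aligned_units):
    """Cut the first voice data into one speech segment per unit recognition text.

    aligned_units: list of (unit_text, start_seconds, end_seconds) tuples, assumed
    to come from the recognizer's time alignment (an assumption of this sketch).
    Returns a list of (unit_text, segment_samples) pairs."""
    segments = []
    for unit_text, start, end in aligned_units:
        begin = int(start * sample_rate)
        stop = int(end * sample_rate)
        segments.append((unit_text, samples[begin:stop]))
    return segments

# Illustrative call for a three-character phrase (the times are made up):
# segments = slice_speech_segments(samples, 16000,
#     [("char1", 0.00, 0.25), ("char2", 0.25, 0.55), ("char3", 0.55, 0.90)])
```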
In an implementation manner of this embodiment, the target acoustic features may include one or more feature types of an average speech rate, an average pitch, and an average volume, and of course, this embodiment is not limited to these three types of acoustic features, and may also include other types of acoustic features, such as a pitch of a speaker, an intensity of an accent, and the like.
Next, a specific description will be given of a manner of determining the target acoustic feature.
(1) When the target acoustic feature includes a feature type of "average speech rate", the average speech rate of each speech segment may be specifically calculated as follows:
For each speech segment, first obtain the duration of the speech segment, then determine the number of text units contained in the unit recognition text corresponding to the speech segment, and finally divide the number of text units by the duration to obtain the specific average speech rate of the speech segment. A text unit is no longer than the unit recognition text; for example, if the text unit is a syllable, a Chinese character contains one syllable, while an English word may contain one or more syllables since word lengths vary.
Taking the word "this" in the above "certainly", as an example, first obtaining the duration (for example, the unit is Minute) of the speech segment corresponding to the word "this" and then dividing the number of Syllables of the word "this" by the duration to obtain the average speech rate in units of SPM (Syllables Per Minute); if the average speech rate of the whole sentence is calculated, the calculation mode is similar.
(2) When the target acoustic feature includes a feature type of "average pitch", the average pitch of each speech segment may be specifically calculated as follows:
For each speech segment, first divide the speech segment into a plurality of audio frames, where the length of each audio frame may be preset manually or take a system default value; then determine the vibration frequency of each audio frame, which may be expressed in Hertz (Hz); finally compute the average of the vibration frequencies of the audio frames, and this average is the specific average pitch of the speech segment.
Taking the word "this" in the above "of course" as an example, the word "this" is first divided into a plurality of audio frames, such as audio frame 1, audio frame 2, and audio frame 3; then determining the vibration frequency sizes of the audio frame 1, the audio frame 2 and the audio frame 3, such as frequency 1, frequency 2 and frequency 3 respectively; and finally, calculating (frequency 1+ frequency 2+ frequency 3)/3, wherein the calculation result is the average pitch of the 'current' pitch. If the average pitch is calculated "of course" for the entire sentence, the calculation is similar.
(3) When the target acoustic feature includes a feature type of "average volume", the average volume of each speech segment may be specifically calculated in the following manner:
For each speech segment, divide the speech segment into a plurality of audio frames, where the length of each audio frame may be preset manually or take a system default value; then determine the amplitude of each audio frame; finally compute the average of the amplitudes of the audio frames, and this average is the specific average volume of the speech segment.
Taking the word "this" in the above "of course" as an example, the word "this" is first divided into a plurality of audio frames, such as audio frame 1, audio frame 2, and audio frame 3; then determining the amplitude sizes of audio frame 1, audio frame 2 and audio frame 3, such as amplitude 1, amplitude 2 and amplitude 3, respectively; finally, calculating (amplitude 1+ amplitude 2+ amplitude 3)/3, wherein the calculation result is the average volume of the 'current' sound volume. If the average volume is calculated "of course" for the entire sentence, the calculation is similar.
S205: and respectively carrying out text vectorization on each unit identification text, and respectively carrying out feature vectorization on the target acoustic features corresponding to each unit identification text.
In this embodiment, each unit identification text may be vectorized to obtain a text vector of each unit identification text, that is, the text vector represents the corresponding unit identification text. For example, word2vec method may be adopted to implement text vectorization.
In this embodiment, the target acoustic features corresponding to each unit of recognition text may be vectorized to obtain a feature vector of the target acoustic features corresponding to each unit of recognition text, that is, the feature vector is used to represent the corresponding target acoustic features.
When the target acoustic feature includes at least one feature type, in order to implement feature vectorization, in this embodiment, a value range corresponding to the feature type may be determined according to sample voice data collected in advance, and the value range is divided into at least two value intervals.
Specifically, a large amount of speech data from person-to-person communication can be collected in advance, with each piece of collected speech data used as sample voice data. Speech recognition is performed on each piece of sample voice data, and the sample voice data is split according to the unit recognition text length used in S204, such as a single word, to obtain a plurality of sample text segments. The acoustic feature value of each sample text segment is then calculated for each feature type contained in the target acoustic features of S204, which yields the value range of each feature type, and each value range is divided into a plurality of value intervals. For example, when the target acoustic features include average speech rate, average pitch, and average volume, the value range of the average speech rate may be divided into, for example, 20 value intervals, the value range of the average pitch into, for example, 15 value intervals, and the value range of the average volume into, for example, 25 value intervals.
Based on the above partition result of the value interval, in an implementation manner of this embodiment, the feature vectorization may be performed according to the following steps:
Step A: for the target acoustic features corresponding to each unit recognition text, determine the feature value corresponding to each feature type in those target acoustic features.
In this embodiment, for each unit of recognized text, a feature value corresponding to each feature type in the target acoustic feature may be calculated based on a speech segment corresponding to the unit of recognized text in the manner described above. For example, when the target acoustic feature includes three feature types of average speech speed, average pitch, and average volume, a specific average speech speed, a specific average pitch, and a specific average volume of the unit identification text are calculated.
Step B: perform feature vectorization on each feature value according to the value interval into which it falls.
After the feature value tz of a unit recognition text for a certain feature type tx (such as average speech rate) is calculated, it can be determined which value interval tz falls into, since the value range of tx has been divided into value intervals in advance. Each value interval of tx corresponds to one vector element: the element corresponding to the interval that tz falls into is set to a preset value such as 1, and the elements corresponding to the other intervals are set to another preset value such as 0. The resulting vector of preset values is the feature vector corresponding to the feature value tz.
For ease of understanding, an example is given. Suppose the value range of the "average speech rate" obtained from the sample speech data is 30 to 350 SPM and is divided into 20 value intervals in numerical order, such as 30 to 60, 60 to 90, 90 to 120, and so on. The vector for the "average speech rate" then has 20 elements, one per value interval. When the specific average speech rate of a unit recognition text falls into one of the intervals, for example a rate of 40 falling into the interval 30 to 60, the vector element corresponding to that interval takes the value 1 and the other elements take the value 0, so the feature vector corresponding to the specific average speech rate 40 is: (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0).
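A minimal sketch of this binning and one-hot feature vectorization follows; the equal-width split over the illustrative 30 to 350 SPM range is an assumption, since the patent only requires that the value range be divided into value intervals.

```python
import numpy as np

def build_value_intervals(sample_values, num_intervals: int) -> np.ndarray:
    """Divide the value range observed over the sample voice data into
    equal-width value intervals; returns the num_intervals + 1 interval edges."""
    return np.linspace(min(sample_values), max(sample_values), num_intervals + 1)

def one_hot_feature(value: float, edges: np.ndarray) -> np.ndarray:
    """Feature vectorization of one feature value: one vector element per value
    interval, set to 1 for the interval the value falls into and 0 elsewhere."""
    num_intervals = len(edges) - 1
    index = int(np.searchsorted(edges, value, side="right")) - 1
    index = min(max(index, 0), num_intervals - 1)  # clamp values at or beyond the ends
    vector = np.zeros(num_intervals, dtype=np.float32)
    vector[index] = 1.0
    return vector

edges = np.linspace(30.0, 350.0, 21)   # 20 equal-width intervals over the illustrative range
print(one_hot_feature(40.0, edges))    # 40 falls into the first interval, so only element 1 is set
```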
S206: and using the text vectorization result and the feature vectorization result as the input feature of a pre-constructed translation model, so as to realize the translation of the voice recognition text by using the translation model, and obtain a translation text carrying the voice style of the first voice data.
In this embodiment, a text vector of each unit of recognition text in the speech recognition text and a feature vector of a target acoustic feature corresponding to each unit of recognition text are input into a pre-constructed translation model, so that the translation model can translate the text of the speech recognition text based on the input text vector and the feature vector to obtain a translated text conforming to the speaking style of the first speech data. The translation model may be trained in advance by a large amount of collected speech recognition texts and acoustic features of corresponding speech, and may include an encoding model and a decoding model.
The text vector and feature vector corresponding to each unit recognition text in the speech recognition text are input into the coding model included in the translation model. The coding model first performs primary coding on the text vector of each unit recognition text, and then performs secondary coding on the primary coding result together with the feature vector corresponding to that unit recognition text. Finally, the secondary coding result is input into the decoding model included in the translation model for decoding, yielding the translated text of the speech recognition text.
For example, assume the speech recognition text is the Chinese phrase meaning "certainly" and each of its three characters is taken as a unit recognition text; the text vector and feature vector corresponding to each character are then input into the coding model included in the translation model. As shown in the text translation diagram of Fig. 3, the coding model may use a Bidirectional Long Short-Term Memory (BLSTM) model for the primary coding of the text vectors, for example three BLSTM models encoding the three characters, so that the primary coding captures the relationship between each unit recognition text and the other unit recognition texts. The coding model may then use a Deep Neural Network (DNN) model for the secondary coding, for example three DNN models, each performing secondary coding on the primary coding result of one character together with the feature vector corresponding to that character.
Since the acoustic feature information of the first speech data is incorporated into the secondary encoding, when the secondary encoding result is input to the decoding model included in the translation model, a translation text carrying the speech style of the first speech data can be generated.
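The patent describes the coding stage only at this structural level. As one hedged illustration of how the primary BLSTM coding and the DNN secondary coding might be wired up (the framework choice, layer sizes, and concatenation-based fusion are all assumptions of this sketch, not details taken from the patent), a PyTorch version could look like:

```python
import torch
import torch.nn as nn

class StyleAwareEncoder(nn.Module):
    """Sketch of the two-stage encoding: a bidirectional LSTM first encodes the
    text vectors of the unit recognition texts (primary coding), then a
    feed-forward network fuses each primary coding result with the acoustic
    feature vector of the same unit (secondary coding)."""

    def __init__(self, text_dim=100, feat_dim=60, hidden_dim=128, out_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(text_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(
            nn.Linear(2 * hidden_dim + feat_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.ReLU(),
        )

    def forward(self, text_vecs, feat_vecs):
        # text_vecs: (batch, seq_len, text_dim), one text vector per unit text
        # feat_vecs: (batch, seq_len, feat_dim), one acoustic feature vector per unit text
        primary, _ = self.blstm(text_vecs)                              # primary coding
        secondary = self.dnn(torch.cat([primary, feat_vecs], dim=-1))   # secondary coding
        return secondary    # would be fed to the decoding model, which this sketch omits

encoder = StyleAwareEncoder()
enc_out = encoder(torch.randn(1, 3, 100), torch.randn(1, 3, 60))  # e.g. three unit texts
```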
It should be noted that, in the specific translation, when the speech recognition text includes one sentence or a plurality of sentences, the text translation may be performed in units of sentences; in translating the current sentence, the translation may be performed in units of the size of the unit recognition text (such as a single word).
In the speech translation method provided by this embodiment, for speech data that needs to be subjected to text translation, speech recognition is performed on the speech data to generate a speech recognition text; taking the voice recognition text as a unit recognition text or splitting the voice recognition text into a plurality of unit recognition texts, and extracting acoustic features from voice segments of the unit recognition texts; and performing text translation according to a vectorization result by vectorizing the unit recognition text and the acoustic features corresponding to the unit recognition text to obtain a translation text carrying the voice style of the voice data. Therefore, when the voice data is translated, the acoustic characteristics of the voice data are considered, so that the translated text can conform to the style and characteristics of the voice data, the translated text is more natural and has higher expressive force, and a text reader can understand the semantics and the context conveniently.
Third embodiment
In current speech translation technology, the translated audio synthesized by a machine entirely follows the speaking style of the training speaker of the synthesis model; the synthesized audio bears little relation to the speaking style of the source speaker before translation, and audio that is merely translated in this way sometimes struggles to express the style and characteristics of the source speaker.
To solve this problem, the present embodiment provides a speech translation method that translates the speech data of a source speaker (i.e., the first speech data in the first and second embodiments) to obtain a translated text, and performs audio synthesis in combination with the acoustic features of that speech data, so that the synthesized audio adapts to the speech style of the source speaker, achieving more natural and expressive speech translation. The method is suitable for real-time spoken language translation and similar scenarios, and yields synthesized audio adapted to the source speaker's style.
Referring to fig. 4, a flow chart of a speech translation method provided in the embodiment of the present application is schematically illustrated, where the method includes the following steps:
s401: first voice data is acquired.
S402: and generating a voice recognition text by performing voice recognition on the first voice data.
It should be noted that S401 and S402 in the present embodiment are the same as S101 and S102 in the first embodiment, and for related description, reference is made to the first embodiment, which is not repeated herein.
S403: extracting target acoustic features from the first voice data, and translating the voice recognition text according to the target acoustic features to obtain a translation text carrying the voice style of the first voice data.
It should be noted that S403 in this embodiment is the same as S103 in the first embodiment, and for related description, reference is made to specific implementation manners of the first embodiment or the second embodiment, which is not described herein again.
S404: and carrying out voice synthesis on the translated text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
In this embodiment, the synthesized audio is referred to as second voice data; the second voice data carries the voice style of the first voice data, that is, its wording and pronunciation style conform to those of the first voice data.
It is understood that the first speech data before translation and the second speech data after translation are usually in different languages; for example, the first speech data may be Chinese and the second speech data English.
In this embodiment, a pre-constructed synthesis model and a target acoustic feature of the first speech data may be utilized to synthesize a speech from the translated text of the first speech data, so as to obtain second speech data with a speech style of the first speech data.
More specifically, S404 can be implemented in one of the following two embodiments.
In a first specific implementation manner, S404 may include: and adjusting model parameters of a pre-constructed synthesis model by using the target acoustic characteristics, and performing voice synthesis on the translation text by using the adjusted synthesis model to obtain second voice data carrying the voice style of the first voice data.
In this embodiment, the synthesis model synthesizes voice data using acoustic parameters such as fundamental frequency, duration, and spectrum, but the pronunciation of the synthesized voice follows the pronunciation style of the training speaker of the synthesis model. Therefore, the corresponding acoustic parameters of the synthesis model can be adjusted using the target acoustic features extracted from the first voice data, for example by a deep learning method, so that the second voice data synthesized by the adjusted synthesis model better matches the style of the first voice data in pronunciation.
When the target acoustic features were extracted for each unit recognition text as in the second embodiment, a modeling unit of the synthesis model is first determined before the acoustic parameters are adjusted; the modeling unit is the text unit on which the synthesis model performs speech synthesis. If the modeling unit of the synthesis model differs in length from the unit recognition text, the speech recognition text needs to be re-split according to the modeling unit of the synthesis model, and the target acoustic features re-extracted for each split text. For example, if the modeling unit of the synthesis model is a syllable while the unit recognition text is a single word, the target acoustic features are re-extracted in units of syllables, and the acoustic parameters of the synthesis model are adjusted using the newly extracted target acoustic features.
In a second specific implementation manner, S404 may include: and performing voice synthesis on the translated text by using a pre-constructed synthesis model to obtain initial voice data, and adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
In this embodiment, speech synthesis is first performed on the translated text using a pre-constructed synthesis model; the resulting speech is referred to as initial voice data. Because the pronunciation of this initial voice data may follow the pronunciation style of the training speaker of the synthesis model, it may differ from the pronunciation style of the first voice data. Therefore, after the initial voice data is synthesized, its acoustic features can be adjusted using the target acoustic features extracted from the first voice data to obtain the adjusted second voice data, so that the second voice data better matches the style of the first voice data in pronunciation.
When the initial voice data is acoustically adjusted, the adjustment may be performed directly in units of the unit recognition text length, or in units smaller or larger than the unit recognition text, for example in units of syllables, words, or phrases.
Taking target acoustic features that include average duration, average pitch, and average volume as an example, when the adjustment is performed in units of syllables, the average duration, average pitch, and average volume of each syllable in the first voice data are calculated, the average duration, average pitch, and average volume of each syllable in the initial voice data are calculated, and the initial voice data is acoustically adjusted according to the results. Specifically, when the average duration of a certain syllable 1 in the initial voice data is too short, syllable 1 is lengthened in pronunciation time according to the average duration of the corresponding syllable in the first voice data, and vice versa; when the average pitch of a syllable 2 in the initial voice data is too high, syllable 2 is compressed in pronunciation frequency according to the average pitch of the corresponding syllable in the first voice data, and vice versa; when the average volume of a syllable 3 in the initial voice data is too high, syllable 3 is compressed in pronunciation amplitude according to the average volume of the corresponding syllable in the first voice data, and vice versa.
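A minimal sketch of the per-syllable adjustment follows, assuming the per-syllable features have already been computed for both the first voice data and the initial voice data; only the volume adjustment is applied directly here, since stretching duration and shifting pitch require a time-stretch / pitch-shift routine (or the vocoder itself) that the sketch does not include.

```python
import numpy as np

def adjust_syllable(synth_syllable: np.ndarray, source_feats: dict, synth_feats: dict):
    """Adjust one syllable of the initial (synthesized) voice data toward the
    corresponding syllable of the first voice data.  Returns the volume-adjusted
    samples plus the duration and pitch ratios still to be applied."""
    duration_ratio = source_feats["avg_duration"] / synth_feats["avg_duration"]
    pitch_ratio = source_feats["avg_pitch"] / synth_feats["avg_pitch"]
    volume_ratio = source_feats["avg_volume"] / synth_feats["avg_volume"]

    # Volume: scale the amplitude directly (ratio > 1 raises it, ratio < 1 lowers it).
    adjusted = synth_syllable * volume_ratio
    # Duration and pitch: a ratio > 1 means the source syllable is longer / higher,
    # so the synthesized syllable should be stretched / raised accordingly.
    return adjusted, duration_ratio, pitch_ratio

# Illustrative values (made up): the source syllable is longer, lower, and louder.
out, d, p = adjust_syllable(np.zeros(4800),
                            {"avg_duration": 0.30, "avg_pitch": 180.0, "avg_volume": 0.50},
                            {"avg_duration": 0.25, "avg_pitch": 200.0, "avg_volume": 0.40})
```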
Of course, in this embodiment, the initial voice data may not be adjusted, and the initial voice data may be directly used as the translated final voice data.
In the speech translation method provided by this embodiment, for speech data that needs text translation, speech recognition is performed on the speech data to generate a speech recognition text; acoustic features are extracted from the speech data, and the speech recognition text is translated according to the extracted acoustic features to obtain a translated text carrying the voice style of the speech data; finally, speech synthesis is performed on the translated text according to the target acoustic features to obtain second voice data carrying the voice style of the first voice data. Because the acoustic characteristics of the voice data are considered when translating it, the translated text conforms to the style and characteristics of the voice data, is more natural and expressive, and makes it easier for a text reader to understand the semantics and context.
Fourth embodiment
Based on the speech translation methods provided in the first to third embodiments, the present application also provides a speech translation apparatus, and a fourth embodiment will be described with reference to the accompanying drawings.
Referring to fig. 5, a schematic diagram of a speech translation apparatus provided in an embodiment of the present application is shown, where the apparatus 500 includes:
a voice data acquisition unit 501 configured to acquire first voice data;
a recognition text generation unit 502 for generating a voice recognition text by performing voice recognition on the first voice data;
an acoustic feature extraction unit 503, configured to extract a target acoustic feature from the first speech data;
a translated text generating unit 504, configured to translate the speech recognition text according to the target acoustic features, so as to obtain a translated text carrying the voice style of the first voice data.
In an implementation manner of this embodiment, the acoustic feature extraction unit 503 includes:
a unit text determining subunit, configured to use the speech recognition text as a unit recognition text, or use each text segment forming the speech recognition text as a unit recognition text;
a voice section determining subunit, configured to determine a voice section corresponding to the unit identification text in the first voice data;
and the acoustic feature determining subunit is used for determining the target acoustic features of the voice segments.
In an implementation manner of this embodiment, the translation text generating unit 504 includes:
the text vectorization subunit is used for respectively carrying out text vectorization on each unit identification text;
the feature vectorization subunit is configured to perform feature vectorization on the target acoustic features corresponding to each unit identification text;
and the translation text generation subunit is configured to use the text vectorization result and the feature vectorization result as input features of a pre-constructed translation model, so as to implement translation of the speech recognition text by using the translation model and obtain a translation text carrying the voice style of the first voice data.
In an implementation manner of this embodiment, if the target acoustic feature includes at least one feature type, the apparatus 500 further includes:
a value range determining unit, configured to determine a value range corresponding to the feature type according to sample voice data collected in advance;
a value interval dividing unit, configured to divide the value range into at least two value intervals;
then, the feature vectorization subunit includes:
the feature value determining subunit is configured to determine, for a target acoustic feature corresponding to each unit identification text, a feature value corresponding to each feature type in the target acoustic feature;
and the vectorization processing subunit is used for performing feature vectorization on the feature value according to the value interval into which the feature value falls.
In an implementation manner of this embodiment, the apparatus 500 further includes:
and the translation voice generating unit is used for carrying out voice synthesis on the translation text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
In an implementation manner of this embodiment, the translation speech generating unit includes:
the model parameter adjusting subunit is used for adjusting the model parameters of a pre-constructed synthesis model by using the target acoustic characteristics;
a first speech generation subunit, configured to perform speech synthesis on the translation text by using the adjusted synthesis model;
alternatively, the translation speech generating unit includes:
the second voice generation subunit is used for carrying out voice synthesis on the translation text by utilizing a pre-constructed synthesis model to obtain initial voice data;
and the voice data adjusting subunit is used for adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics.
In one implementation manner of this embodiment, the target acoustic features include one or more feature types of average speech speed, average pitch, and average volume.
In the speech translation apparatus provided in this embodiment, for speech data that needs to be subjected to text translation, speech recognition is performed on the speech data to generate a speech recognition text; and extracting acoustic features from the voice data, and translating the voice recognition text according to the extracted acoustic features to obtain a translation text carrying the voice style of the voice data. Therefore, when the voice data is translated, the acoustic characteristics of the voice data are considered, so that the translated text can conform to the style and characteristics of the voice data, the translated text is more natural and has higher expressive force, and a text reader can understand the semantics and the context conveniently.
Fifth embodiment
Referring to fig. 6, a hardware structure diagram of a speech translation apparatus provided for the embodiment of the present application is shown, where the system 600 includes a memory 601 and a receiver 602, and a processor 603 connected to the memory 601 and the receiver 602 respectively, where the memory 601 is used to store a set of program instructions, and the processor 603 is used to call the program instructions stored in the memory 601 to perform the following operations:
acquiring first voice data;
generating a voice recognition text by performing voice recognition on the first voice data;
extracting target acoustic features from the first voice data, and translating the voice recognition text according to the target acoustic features to obtain a translation text carrying the voice style of the first voice data.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
taking the voice recognition text as a unit recognition text, or taking each text segment forming the voice recognition text as a unit recognition text respectively;
determining a voice segment corresponding to the unit identification text in the first voice data;
determining a target acoustic feature of the speech segment.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
respectively carrying out text vectorization on each unit identification text;
respectively carrying out feature vectorization on the target acoustic features corresponding to the unit identification texts;
and taking the text vectorization result and the feature vectorization result as the input feature of a pre-constructed translation model, so as to realize the translation of the voice recognition text by using the translation model.
In an implementation manner of this embodiment, if the target acoustic feature includes at least one feature type, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
determining a value range corresponding to the feature type according to pre-collected sample voice data;
dividing the value range into at least two value intervals;
for target acoustic features corresponding to each unit identification text, determining a feature value corresponding to each feature type in the target acoustic features;
and performing feature vectorization on the feature value according to the value interval into which the feature value falls.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
and carrying out voice synthesis on the translated text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
adjusting model parameters of a pre-constructed synthetic model by using the target acoustic characteristics;
and performing voice synthesis on the translated text by using the adjusted synthesis model.
In an implementation manner of this embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
performing voice synthesis on the translation text by using a pre-constructed synthesis model to obtain initial voice data;
and adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics.
In one implementation manner of this embodiment, the target acoustic features include one or more feature types of average speech speed, average pitch, and average volume.
In some embodiments, the processor 603 may be a Central Processing Unit (CPU), the memory 601 may be a Random Access Memory (RAM), and the receiver 602 may include a common physical interface, such as an Ethernet interface or an Asynchronous Transfer Mode (ATM) interface. The processor 603, the receiver 602, and the memory 601 may be integrated into one or more separate circuits or hardware, such as an Application Specific Integrated Circuit (ASIC).
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of speech translation, comprising:
acquiring first voice data;
generating a voice recognition text by performing voice recognition on the first voice data;
taking the voice recognition text as a unit recognition text, or taking each text segment forming the voice recognition text as a unit recognition text respectively;
determining a voice segment corresponding to the unit recognition text in the first voice data;
determining a target acoustic feature of the speech segment;
respectively carrying out text vectorization on each unit recognition text;
respectively carrying out feature vectorization on the target acoustic features corresponding to each unit recognition text;
and using the text vectorization result and the feature vectorization result as input features of a pre-constructed translation model, so that the voice recognition text is translated by the translation model to obtain a translation text carrying the voice style of the first voice data.
2. The method of claim 1, wherein if the target acoustic feature comprises at least one feature type, the method further comprises:
determining a value range corresponding to the feature type according to pre-collected sample voice data;
dividing the value range into at least two value intervals;
then, the performing feature vectorization on the target acoustic features corresponding to each unit recognition text respectively includes:
for the target acoustic features corresponding to each unit recognition text, determining a feature value corresponding to each feature type in the target acoustic features;
and performing feature vectorization on the feature value according to the value interval in which the feature value falls.
3. The method of claim 1, further comprising:
and carrying out voice synthesis on the translated text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
4. The method of claim 3, wherein the speech synthesizing the translated text according to the target acoustic features comprises:
adjusting model parameters of a pre-constructed synthetic model by using the target acoustic characteristics;
and performing voice synthesis on the translated text by using the adjusted synthesis model.
5. The method of claim 3, wherein the speech synthesizing the translated text according to the target acoustic features comprises:
performing voice synthesis on the translation text by using a pre-constructed synthesis model to obtain initial voice data;
and adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics.
6. The method of any one of claims 1 to 5, wherein the target acoustic features comprise one or more feature types of average speech rate, average pitch, and average volume.
7. A speech translation apparatus, comprising:
a voice data acquisition unit for acquiring first voice data;
a recognition text generation unit configured to generate a speech recognition text by performing speech recognition on the first speech data;
an acoustic feature extraction unit configured to extract a target acoustic feature from the first voice data;
the translation text generation unit is used for translating the voice recognition text according to the target acoustic characteristics to obtain a translation text carrying the voice style of the first voice data;
the acoustic feature extraction unit includes:
a unit text determining subunit, configured to use the speech recognition text as a unit recognition text, or use each text segment forming the speech recognition text as a unit recognition text;
a voice segment determining subunit, configured to determine a voice segment corresponding to the unit recognition text in the first voice data;
the acoustic feature determining subunit is used for determining a target acoustic feature of the voice segment;
the translation text generation unit includes:
the text vectorization subunit is used for respectively carrying out text vectorization on each unit identification text;
the feature vectorization subunit is configured to perform feature vectorization on the target acoustic features corresponding to each unit recognition text;
and the translation text generation subunit is configured to use the text vectorization result and the feature vectorization result as input features of a pre-constructed translation model, so as to implement translation of the speech recognition text by using the translation model and obtain a translation text carrying the voice style of the first speech data.
8. The apparatus of claim 7, wherein if the target acoustic feature comprises at least one feature type, the apparatus further comprises:
a value range determining unit, configured to determine a value range corresponding to the feature type according to sample voice data collected in advance;
a value interval dividing unit, configured to divide the value range into at least two value intervals;
then, the feature vectorization subunit includes:
the feature value determining subunit is configured to determine, for the target acoustic features corresponding to each unit recognition text, a feature value corresponding to each feature type in the target acoustic features;
and the vectorization processing subunit is configured to perform feature vectorization on the feature value according to the value interval in which the feature value falls.
9. The apparatus of claim 7, further comprising:
and the translation voice generating unit is used for carrying out voice synthesis on the translation text according to the target acoustic characteristics to obtain second voice data carrying the voice style of the first voice data.
10. The apparatus according to claim 9, wherein the translation speech generating unit comprises:
the model parameter adjusting subunit is used for adjusting the model parameters of a pre-constructed synthesis model by using the target acoustic characteristics;
a first speech generation subunit, configured to perform speech synthesis on the translation text by using the adjusted synthesis model;
alternatively, the translation speech generating unit includes:
the second voice generation subunit is used for carrying out voice synthesis on the translation text by utilizing a pre-constructed synthesis model to obtain initial voice data;
and the voice data adjusting subunit is used for adjusting the acoustic characteristics of the initial voice data by using the target acoustic characteristics.
11. The apparatus according to any one of claims 7 to 10, wherein the target acoustic features comprise one or more feature types of average speech rate, average pitch, and average volume.
12. A speech translation apparatus, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-6.
CN201810032112.3A 2018-01-12 2018-01-12 Voice translation method and device Active CN108231062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810032112.3A CN108231062B (en) 2018-01-12 2018-01-12 Voice translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810032112.3A CN108231062B (en) 2018-01-12 2018-01-12 Voice translation method and device

Publications (2)

Publication Number Publication Date
CN108231062A CN108231062A (en) 2018-06-29
CN108231062B true CN108231062B (en) 2020-12-22

Family

ID=62641526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810032112.3A Active CN108231062B (en) 2018-01-12 2018-01-12 Voice translation method and device

Country Status (1)

Country Link
CN (1) CN108231062B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271645A (en) * 2018-08-24 2019-01-25 深圳市福瑞达显示技术有限公司 A kind of multilingual translation machine based on holographic fan screen
CN109448458A (en) * 2018-11-29 2019-03-08 郑昕匀 A kind of Oral English Training device, data processing method and storage medium
CN110008481B (en) * 2019-04-10 2023-04-28 南京魔盒信息科技有限公司 Translated voice generating method, device, computer equipment and storage medium
CN111865752A (en) * 2019-04-23 2020-10-30 北京嘀嘀无限科技发展有限公司 Text processing device, method, electronic device and computer readable storage medium
CN112037768A (en) * 2019-05-14 2020-12-04 北京三星通信技术研究有限公司 Voice translation method and device, electronic equipment and computer readable storage medium
WO2021102647A1 (en) * 2019-11-25 2021-06-03 深圳市欢太科技有限公司 Data processing method and apparatus, and storage medium
CN111768756B (en) * 2020-06-24 2023-10-20 华人运通(上海)云计算科技有限公司 Information processing method, information processing device, vehicle and computer storage medium
CN111785258B (en) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN112183120B (en) * 2020-09-18 2023-10-20 北京字节跳动网络技术有限公司 Speech translation method, device, equipment and storage medium
CN112382272B (en) * 2020-12-11 2023-05-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium capable of controlling speech speed
CN113571044A (en) * 2021-07-28 2021-10-29 北京有竹居网络技术有限公司 Voice information processing method and device and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133245A1 (en) * 2006-12-04 2008-06-05 Sehda, Inc. Methods for speech-to-speech translation
JP4213755B2 (en) * 2007-03-28 2009-01-21 株式会社東芝 Speech translation apparatus, method and program
JP2009048003A (en) * 2007-08-21 2009-03-05 Toshiba Corp Voice translation device and method
CN101727904B (en) * 2008-10-31 2013-04-24 国际商业机器公司 Voice translation method and device
CN101937431A (en) * 2010-08-18 2011-01-05 华南理工大学 Emotional voice translation device and processing method
CN105786801A (en) * 2014-12-22 2016-07-20 中兴通讯股份有限公司 Speech translation method, communication method and related device
KR102396250B1 (en) * 2015-07-31 2022-05-09 삼성전자주식회사 Apparatus and Method for determining target word
CN106339371B (en) * 2016-08-30 2019-04-30 齐鲁工业大学 A kind of English-Chinese meaning of a word mapping method and device based on term vector
CN107170453B (en) * 2017-05-18 2020-11-03 百度在线网络技术(北京)有限公司 Cross-language voice transcription method, equipment and readable medium based on artificial intelligence
CN107464559B (en) * 2017-07-11 2020-12-15 中国科学院自动化研究所 Combined prediction model construction method and system based on Chinese prosody structure and accents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Preserving Word-Level Emphasis in Speech-to-Speech Translation";Quoc Truong Do等;《IEEE/ACM Transactions on Audio, Speech, and Language Processing》;20170331;第25卷(第3期);全文 *
"英语翻译器语音识别系统设计及其应用";杜卫卫;《电子测试》;20150228(第04期);全文 *

Also Published As

Publication number Publication date
CN108231062A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108231062B (en) Voice translation method and device
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
JP6777768B2 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs
CN115516552A (en) Speech recognition using synthesis of unexplained text and speech
Demircan et al. Feature extraction from speech data for emotion recognition
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
US20190164551A1 (en) Response sentence generation apparatus, method and program, and voice interaction system
JP2015180966A (en) Speech processing system
CN112562634B (en) Multi-style audio synthesis method, device, equipment and storage medium
CN111260761B (en) Method and device for generating mouth shape of animation character
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN109065073A (en) Speech-emotion recognition method based on depth S VM network model
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN112599113A (en) Dialect voice synthesis method and device, electronic equipment and readable storage medium
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN109326278B (en) Acoustic model construction method and device and electronic equipment
Gabdrakhmanov et al. Ruslan: Russian spoken language corpus for speech synthesis
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
KR20210045217A (en) Device and method for emotion transplantation
KR20160061071A (en) Voice recognition considering utterance variation
CN112767912A (en) Cross-language voice conversion method and device, computer equipment and storage medium
Haque et al. Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech
Syfullah et al. Efficient vector code-book generation using K-means and Linde-Buzo-Gray (LBG) algorithm for Bengali voice recognition
JP6433063B2 (en) Audio processing apparatus and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant