CN108231062A - A kind of voice translation method and device - Google Patents
A kind of voice translation method and device
- Publication number
- Publication number: CN108231062A (application CN201810032112.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- voice data
- target acoustical
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
This application discloses a voice translation method and device. For voice data that requires text translation, the method performs speech recognition on the voice data to generate a speech recognition text, extracts acoustic features from the voice data, and translates the speech recognition text according to the extracted acoustic features to obtain a translated text that carries the voice style of the voice data. Because the acoustic features inherent in the voice data are taken into account during text translation, the translated text conforms to the style and characteristics of the voice data, making it more natural and expressive and helping readers understand its semantics and context.
Description
Technical field
This application relates to the field of computer technology, and in particular to a voice translation method and device.
Background art
As artificial intelligence technology matures, people increasingly turn to intelligent technology to solve problems. For example, in the past one had to spend a great deal of time learning a new language in order to communicate with its native speakers; today, a translator device built around speech recognition, intelligent translation, and speech synthesis can take spoken input, translate the text, and speak the translated meaning aloud.
However, most current text translation technology only performs a literal translation. That is, when the voice data of a source speaker is translated into text, the translated text often fails to convey the speaker's style and characteristics. For example, when a Chinese utterance is translated into English, its Chinese text may correspond to several different English texts, each with a different diction and emotional character, and the English text actually produced is often an unsuitable one; that is, the translated text cannot express the style and characteristics of the source speaker.
Summary of the invention
The main purpose of the embodiments of this application is to provide a voice translation method and device that make the translated text conform to the style and characteristics of the voice data being translated.
An embodiment of this application provides a voice translation method, including:
obtaining first voice data;
generating a speech recognition text by performing speech recognition on the first voice data; and
extracting a target acoustic feature from the first voice data, translating the speech recognition text according to the target acoustic feature, and obtaining a translated text that carries the voice style of the first voice data.
Optionally, extracting the target acoustic feature from the first voice data includes:
taking the speech recognition text as a unit recognition text, or taking each text fragment that makes up the speech recognition text as a unit recognition text;
determining the speech segment in the first voice data that corresponds to the unit recognition text; and
determining the target acoustic feature of that speech segment.
Optionally, translating the speech recognition text according to the target acoustic feature includes:
vectorizing each unit recognition text;
vectorizing the target acoustic feature corresponding to each unit recognition text; and
using the text vectors and the feature vectors as input features of a pre-built translation model, so that the translation model translates the speech recognition text.
Optionally, if the target acoustic feature includes at least one feature type, the method further includes:
determining the value range corresponding to the feature type from sample voice data collected in advance; and
dividing the value range into at least two value intervals.
Vectorizing the target acoustic feature corresponding to each unit recognition text then includes:
for the target acoustic feature corresponding to each unit recognition text, determining the feature value corresponding to each feature type in the target acoustic feature; and
vectorizing the feature value according to the value interval in which it falls.
Optionally, the method further includes:
performing speech synthesis on the translated text according to the target acoustic feature to obtain second voice data that carries the voice style of the first voice data.
Optionally, performing speech synthesis on the translated text according to the target acoustic feature includes:
adjusting the model parameters of a pre-built synthesis model using the target acoustic feature; and
performing speech synthesis on the translated text using the adjusted synthesis model.
Optionally, performing speech synthesis on the translated text according to the target acoustic feature includes:
performing speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data; and
adjusting the acoustic features of the initial voice data using the target acoustic feature.
Optionally, the target acoustic feature includes one or more of the feature types average speaking rate, average pitch, and average volume.
An embodiment of this application further provides a speech translation device, including:
a voice data obtaining unit for obtaining first voice data;
a recognition text generation unit for generating a speech recognition text by performing speech recognition on the first voice data;
an acoustic feature extraction unit for extracting a target acoustic feature from the first voice data; and
a translated text generation unit for translating the speech recognition text according to the target acoustic feature to obtain a translated text that carries the voice style of the first voice data.
Optionally, the acoustic feature extraction unit includes:
a unit text determination subunit for taking the speech recognition text as a unit recognition text, or taking each text fragment that makes up the speech recognition text as a unit recognition text;
a speech segment determination subunit for determining the speech segment in the first voice data that corresponds to the unit recognition text; and
an acoustic feature determination subunit for determining the target acoustic feature of that speech segment.
Optionally, the translated text generation unit includes:
a text vectorization subunit for vectorizing each unit recognition text;
a feature vectorization subunit for vectorizing the target acoustic feature corresponding to each unit recognition text; and
a translated text generation subunit for using the text vectors and the feature vectors as input features of a pre-built translation model, so that the translation model translates the speech recognition text and obtains a translated text carrying the voice style of the first voice data.
Optionally, if the target acoustic feature includes at least one feature type, the device further includes:
a value range determination unit for determining, from sample voice data collected in advance, the value range corresponding to the feature type; and
a value interval determination unit for dividing the value range into at least two value intervals.
The feature vectorization subunit then includes:
a feature value determination subunit for determining, for the target acoustic feature corresponding to each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature; and
a vectorization processing subunit for vectorizing the feature value according to the value interval in which it falls.
Optionally, the device further includes:
a translated speech generation unit for performing speech synthesis on the translated text according to the target acoustic feature to obtain second voice data carrying the voice style of the first voice data.
Optionally, the translated speech generation unit includes:
a model parameter adjustment subunit for adjusting the model parameters of a pre-built synthesis model using the target acoustic feature; and
a first speech generation subunit for performing speech synthesis on the translated text using the adjusted synthesis model.
Alternatively, the translated speech generation unit includes:
a second speech generation subunit for performing speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data; and
a voice data adjustment subunit for adjusting the acoustic features of the initial voice data using the target acoustic feature.
Optionally, the target acoustic feature includes one or more of the feature types average speaking rate, average pitch, and average volume.
An embodiment of this application further provides another speech translation device, including a processor, a memory, and a system bus. The processor and the memory are connected by the system bus. The memory stores one or more programs that include instructions which, when executed by the processor, cause the processor to perform the method described in any of the above embodiments.
With the voice translation method and device provided in these embodiments, voice data requiring text translation undergoes speech recognition to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features to obtain a translated text carrying the voice style of the voice data. Because the acoustic features inherent in the voice data are considered during text translation, the translated text conforms to the style and characteristics of the voice data, making it more natural and expressive and helping readers understand its semantics and context.
Description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is the first flow diagram of a voice translation method provided by an embodiment of this application;
Fig. 2 is the second flow diagram of a voice translation method provided by an embodiment of this application;
Fig. 3 is a schematic diagram of text translation provided by an embodiment of this application;
Fig. 4 is the third flow diagram of a voice translation method provided by an embodiment of this application;
Fig. 5 is a schematic composition diagram of a speech translation device provided by an embodiment of this application;
Fig. 6 is a hardware structure diagram of a speech translation device provided by an embodiment of this application.
Specific embodiment
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative work fall within the protection scope of this application.
In current text translation technology, most translation technology only performs a literal translation; that is, when the voice data of a source speaker is translated into text, the translated text often cannot express the speaker's style and characteristics.
To this end, the embodiments of this application provide a voice translation method that not only translates the text during voice translation but also conveys the diction and emotional characteristics of the source speaker; that is, the translated text is adapted to the source speaker's diction and emotional characteristics, producing a translation that is more natural and expressive and thus helps readers understand its semantics and context.
The exemplary embodiments of this application are introduced in detail below with reference to the drawings.
First embodiment
Referring to Fig. 1, a flow diagram of a voice translation method provided by an embodiment of this application, the method includes the following steps:
S101: Obtain the first voice data.
In this embodiment, the voice data to be translated into text is defined as the first voice data.
This embodiment does not limit the source of the first voice data. For example, the first voice data can be the real speech or recorded speech of a source speaker, or special-effect speech obtained by machine processing of that real or recorded speech.
Nor does this embodiment limit the length of the first voice data. For example, the first voice data can be a word, a sentence, or a passage.
S102: Generate a speech recognition text by performing speech recognition on the first voice data.
After the first voice data is obtained, it is converted into a speech recognition text by a speech recognition technique, for example one based on artificial neural networks.
S103: Extract a target acoustic feature from the first voice data, translate the speech recognition text according to the target acoustic feature, and obtain a translated text that carries the voice style of the first voice data.
In this embodiment, one or more acoustic feature types can be set in advance, such as speaking rate, pitch, and volume. When the first voice data is translated into text, the specific feature value of each acoustic feature type is extracted from the first voice data, and these specific feature values serve as the target acoustic feature. When there are multiple candidate translations, the most suitable translated text is selected with reference to the target acoustic feature.
It should be understood that the target acoustic feature mainly describes the voice style of the first voice data, that is, the speaking style of the source speaker behind the first voice data. Therefore, if the target acoustic feature is considered during text translation, the resulting translated text not only matches the words spoken by the source speaker but also reflects the speaker's expressive style; in other words, it is a translated text that conforms to the source speaker's speaking style.
For ease of understanding, consider an example. When the first voice data is the Chinese phrase "当然啦" ("of course"), existing text translation technology might generate the translated text "Yes, of course", whereas this embodiment, which considers the voice style of the first voice data, might generate "You bet", a translated text better suited to the meaning and tone style the source speaker intends to express.
With the voice translation method provided in this embodiment, voice data requiring text translation undergoes speech recognition to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features to obtain a translated text carrying the voice style of the voice data. Because the acoustic features inherent in the voice data are considered during text translation, the translated text conforms to the style and characteristics of the voice data, making it more natural and expressive and helping readers understand its semantics and context.
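The three steps above can be sketched end to end. The following is a minimal toy sketch, not the patent's actual implementation: every function name is hypothetical, the recognizer and feature extractor are stand-ins that read precomputed values, and the style-aware selection is reduced to picking the more emphatic of two candidate translations when the speech is fast and loud.

```python
def recognize(voice_data):
    # Stand-in for an ASR system; the patent assumes a real recognizer,
    # e.g. one based on artificial neural networks (S102).
    return voice_data["transcript"]

def extract_acoustic_features(voice_data):
    # Target acoustic feature types named in the patent: average speaking
    # rate, average pitch, average volume (values here are precomputed).
    return {
        "avg_speed_spm": voice_data["syllables"] / voice_data["duration_min"],
        "avg_pitch_hz": voice_data["pitch_hz"],
        "avg_volume": voice_data["volume"],
    }

def translate_with_style(text, features):
    # Toy stand-in for the style-aware translation model: choose the more
    # emphatic candidate when the delivery is fast and loud.
    candidates = {"当然啦": ["Yes, of course", "You bet"]}
    options = candidates.get(text, [text])
    emphatic = features["avg_speed_spm"] > 200 and features["avg_volume"] > 0.5
    return options[-1] if emphatic and len(options) > 1 else options[0]

def translate_voice(voice_data):
    text = recognize(voice_data)                      # S102
    features = extract_acoustic_features(voice_data)  # S103, feature extraction
    return translate_with_style(text, features)       # S103, styled translation

sample = {"transcript": "当然啦", "syllables": 3, "duration_min": 0.01,
          "pitch_hz": 220.0, "volume": 0.8}
print(translate_voice(sample))  # → You bet
```

A slower, quieter delivery of the same phrase would fall back to the literal candidate "Yes, of course", which is the behavioral difference the embodiment claims over literal-only translation.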
Second embodiment
This embodiment focuses on a specific implementation of S103 in the first embodiment above; for the other parts, please see the introduction of the first embodiment.
Referring to Fig. 2, a flow diagram of a voice translation method provided by an embodiment of this application, the method includes the following steps:
S201: Obtain the first voice data.
S202: Generate a speech recognition text by performing speech recognition on the first voice data.
Note that S201 and S202 in this embodiment are the same as S101 and S102 in the first embodiment; see the first embodiment for their description, which is not repeated here.
S203: Take the speech recognition text as a unit recognition text, or take each text fragment that makes up the speech recognition text as a unit recognition text.
In this embodiment, the whole speech recognition text can serve as one unit recognition text; alternatively, the speech recognition text can be split, with each resulting text fragment serving as one unit recognition text.
When splitting the text, the split can be based on a preset split unit, which may be larger or smaller, for example a single syllable, a single character, or a single word, yielding multiple text fragments. The preset split unit can be set manually or be a system default.
For example, when the speech recognition text is Chinese, each text fragment can be a single character: if the speech recognition text is "当然啦" ("of course"), the text fragments are "当", "然", and "啦". When the speech recognition text is English, each text fragment can be a single word: if the speech recognition text is "Yes, of course", the text fragments are "Yes", "of", and "course".
S204: Determine the speech segment in the first voice data that corresponds to each unit recognition text, and determine the target acoustic feature of that speech segment.
In this embodiment, when the whole speech recognition text serves as the unit recognition text, the speech segment corresponding to the unit recognition text is the first voice data itself; when each text fragment of the speech recognition text serves as a unit recognition text, the speech segment corresponding to a unit recognition text is the part of the first voice data corresponding to that text fragment.
After the speech segment corresponding to each unit recognition text is determined, the target acoustic feature of each speech segment is further determined. Taking the Chinese text "当然啦" split into the three unit recognition texts "当", "然", and "啦" as an example, one needs to obtain the speech segment 1 corresponding to "当" and extract its target acoustic feature, obtain the speech segment 2 corresponding to "然" and extract its target acoustic feature, and obtain the speech segment 3 corresponding to "啦" and extract its target acoustic feature.
In one implementation of this embodiment, the target acoustic feature can include one or more of the feature types average speaking rate, average pitch, and average volume. Of course, this embodiment is not limited to these three kinds of acoustic features; acoustic features of other types can also be included, for example the speaker's tone height and stress intensity.
Next, how the target acoustic feature is determined is introduced in detail.
(1) When the target acoustic feature includes the feature type "average speaking rate", the average speaking rate of each speech segment can be calculated as follows:
For each speech segment, first obtain the duration of the speech segment, then determine the number of text units contained in the corresponding unit recognition text, and finally divide the number of text units by the duration to obtain the specific average speaking rate of the speech segment. A text unit is no longer than the unit recognition text; for example, the text unit can be a syllable: a Chinese character contains one syllable, while an English word, whose length varies, contains one or more syllables.
Taking the character "当" in "当然啦" above as an example, first obtain the duration of the speech segment corresponding to "当" (for example, in minutes), then divide the syllable count of "当" by the duration, obtaining an average speaking rate in SPM (Syllables Per Minute). If the average speaking rate of the whole phrase "当然啦" is calculated, the calculation is similar.
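A minimal sketch of this average-speaking-rate calculation; the function name and the sample durations are illustrative:

```python
def average_speed_spm(syllable_count, duration_seconds):
    """Average speaking rate in syllables per minute (SPM)."""
    return syllable_count * 60.0 / duration_seconds

# "当" is one syllable; suppose its speech segment lasts 0.25 s.
print(average_speed_spm(1, 0.25))  # → 240.0
# The whole phrase "当然啦" has 3 syllables; suppose it lasts 1.5 s.
print(average_speed_spm(3, 1.5))   # → 120.0
```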
(2) When the target acoustic feature includes the feature type "average pitch", the average pitch of each speech segment can be calculated as follows:
For each speech segment, first divide the speech segment into several audio frames, whose length can be preset manually or take a system default; then determine the vibration frequency of each audio frame, in hertz (Hz); finally calculate the mean of the vibration frequencies of these audio frames, which is the specific average pitch of the speech segment.
Taking the character "当" in "当然啦" above as an example, first divide "当" into several audio frames, for example audio frame 1, audio frame 2, and audio frame 3; then determine their vibration frequencies, for example frequency 1, frequency 2, and frequency 3; finally calculate (frequency 1 + frequency 2 + frequency 3) / 3, which is the average pitch of "当". If the average pitch of the whole phrase "当然啦" is calculated, the calculation is similar.
(3) When the target acoustic feature includes the feature type "average volume", the average volume of each speech segment can be calculated as follows:
For each speech segment, first divide the speech segment into several audio frames, whose length can be preset manually or take a system default; then determine the amplitude of each audio frame; finally calculate the mean of the amplitudes of these audio frames, which is the specific average volume of the speech segment.
Taking the character "当" in "当然啦" above as an example, first divide "当" into several audio frames, for example audio frame 1, audio frame 2, and audio frame 3; then determine their amplitudes, for example amplitude 1, amplitude 2, and amplitude 3; finally calculate (amplitude 1 + amplitude 2 + amplitude 3) / 3, which is the average volume of "当". If the average volume of the whole phrase "当然啦" is calculated, the calculation is similar.
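The frame-based averages in (2) and (3) are both simple means over per-frame measurements. The sketch below assumes the per-frame frequencies and amplitudes are already available; a real system would estimate them from the waveform:

```python
def frame_mean(values):
    """Mean of per-frame measurements (pitch in Hz, or amplitude)."""
    return sum(values) / len(values)

# Suppose the segment for "当" yields three audio frames.
frame_pitches_hz = [210.0, 220.0, 230.0]
frame_amplitudes = [3000.0, 5000.0, 7000.0]  # e.g. sample magnitudes

print(frame_mean(frame_pitches_hz))  # → 220.0 (average pitch)
print(frame_mean(frame_amplitudes))  # → 5000.0 (average volume)
```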
S205: Vectorize each unit recognition text, and vectorize the target acoustic feature corresponding to each unit recognition text.
In this embodiment, each unit recognition text can be vectorized to obtain its text vector, so that the text vector represents the corresponding unit recognition text; for example, the word2vec method may be used for text vectorization.
Likewise, the target acoustic feature corresponding to each unit recognition text can be vectorized to obtain its feature vector, so that the feature vector represents the corresponding target acoustic feature.
When the target acoustical feature includes at least one characteristic type, in order to realize feature vector, the present embodiment
The corresponding value range of the characteristic type can be determined according to the sample voice data collected in advance, and by the value model
It encloses and is divided at least two intervals.
Specifically, a large amount of voice data from person-to-person conversations may be collected in advance, and each piece of collected voice data is taken as sample voice data. Speech recognition is performed on each piece of sample voice data, and the sample voice data is split according to the length of the unit recognition text described in S204 (for example, a single word), yielding multiple sample text fragments. According to the feature types included in the target acoustic feature described in S204, the acoustic feature value of each sample text fragment is calculated, so that the value range of each feature type is obtained, and each value range is divided into multiple value intervals. For example, when the target acoustic feature includes feature types such as average speech rate, average pitch, and average volume, the value range of average speech rate may be divided into multiple intervals, such as 20 intervals; the value range of average pitch into multiple intervals, such as 15 intervals; and the value range of average volume into multiple intervals, such as 25 intervals.
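The range-division step above can be sketched as follows; the sample speech-rate values and the choice of equal-width intervals are illustrative assumptions, since the patent does not prescribe how the range is partitioned.

```python
# Sketch: deriving value intervals for one feature type (e.g. average
# speech rate) from sample voice data collected in advance.

def make_intervals(sample_values, n_intervals):
    """Divide the observed value range into n equal-width intervals,
    returned as (lower, upper) pairs."""
    lo, hi = min(sample_values), max(sample_values)
    width = (hi - lo) / n_intervals
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_intervals)]

# Hypothetical per-fragment average speech rates measured from samples (SPM).
speeds = [30, 95, 170, 240, 350]
intervals = make_intervals(speeds, 20)   # 20 intervals over 30 to 350 SPM
```

Each feature type (speech rate, pitch, volume) would get its own interval list, possibly with a different interval count, as the text describes.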
Based on the above interval division, in one implementation of the present embodiment, feature vectorization may be carried out through the following steps:
Step A: For the target acoustic feature corresponding to each unit recognition text, determine the feature value corresponding to each feature type in the target acoustic feature.
In the present embodiment, for each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature may be calculated from the speech segment corresponding to that unit recognition text, in the manner introduced above. For example, when the target acoustic feature includes the three feature types of average speech rate, average pitch, and average volume, the specific average speech rate, average pitch, and average volume of the unit recognition text are calculated.
Step B: Vectorize the feature value according to the value interval in which it falls.
After the feature value tz of the unit recognition text for a certain feature type tx (such as average speech rate) is calculated, since the value range of feature type tx has been divided into multiple intervals in advance, it can be determined which interval the feature value tz belongs to. Each value interval of feature type tx corresponds to one vector element; therefore, the vector element corresponding to the interval to which tz belongs can be set to a preset value such as 1, and the vector elements corresponding to the other intervals set to another preset value such as 0. In this way, a feature vector composed of these preset values is obtained, namely the feature vector corresponding to the feature value tz.
For ease of understanding, an example is given. Suppose the value range of "average speech rate" calculated from the sample voice data is 30 to 350 SPM; it may be divided into 20 intervals in order of numerical value, such as 30-60, 60-90, 90-120, and so on. The vector size of "average speech rate" is then 20, i.e., the vector of "average speech rate" contains 20 vector elements, each corresponding to one value interval. When the specific average speech rate of the unit recognition text falls into one of the intervals, for example a rate of 40, which falls into the interval 30-60, the vector element corresponding to that interval takes the value 1 and the elements of the other intervals take the value 0, so the feature vector corresponding to an average speech rate of 40 is: (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0).
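The worked example can be reproduced with a small one-hot encoder. The interval lower bounds follow the document's illustrative sequence (30, 60, 90, ...) rather than an exact equal split of 30 to 350; values below the first bound are clamped to the first interval, which is an assumption of this sketch.

```python
# Sketch: feature vectorization by one-hot encoding over value intervals.

def one_hot(value, lower_bounds):
    """lower_bounds[i] is the lower bound of interval i (left-closed).
    Returns a one-hot vector marking the interval containing `value`."""
    vec = [0] * len(lower_bounds)
    idx = 0
    for i, lower in enumerate(lower_bounds):
        if value >= lower:
            idx = i            # last interval whose lower bound is reached
    vec[idx] = 1
    return vec

bounds = [30 + 30 * i for i in range(20)]   # 30, 60, 90, ... as in the text
fv = one_hot(40, bounds)                     # average speech rate of 40
```

`fv` is the (1, 0, ..., 0) vector from the example: 40 falls into the first interval 30-60.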
S206: Take the text vectorization result and the feature vectorization result as the input features of a pre-built translation model, so as to translate the speech recognition text using the translation model and obtain a translated text carrying the voice style of the first voice data.
In the present embodiment, the text vector of each unit recognition text in the speech recognition text and the feature vector of the target acoustic feature corresponding to each unit recognition text are input into the pre-built translation model. The translation model can then translate the speech recognition text based on the input text vectors and feature vectors, obtaining a translated text that matches the speaking style of the first voice data. The translation model may be obtained by training in advance on a large number of speech recognition texts and corresponding voice acoustic features collected beforehand, and may include an encoding model and a decoding model.
The text vector and feature vector corresponding to each unit recognition text in the speech recognition text are input into the encoding model included in the translation model. The encoding model first performs primary encoding on the text vector of each unit recognition text, and then performs secondary encoding on the primary encoding result together with the feature vector corresponding to that unit recognition text. Finally, the secondary encoding result is input into the decoding model included in the translation model for decoding, yielding the translated text of the speech recognition text.
For example, suppose the speech recognition text is the phrase "of course", split character by character into three unit recognition texts. The text vector and feature vector corresponding to each unit recognition text are first input into the encoding model included in the translation model. As shown in the text translation schematic diagram of Fig. 3, the encoding model may use a Bidirectional Long Short-Term Memory (BLSTM) model to perform primary encoding on the text vectors; specifically, three BLSTM units may be used to perform primary encoding on the three unit recognition texts respectively, so that the primary encoding captures the relationship between each unit recognition text and the other unit recognition texts. Afterwards, the encoding model may use a Deep Neural Network (DNN) model to perform secondary encoding; specifically, three DNN models may be used, each performing secondary encoding on the primary encoding result of one unit recognition text together with the feature vector corresponding to that unit recognition text.
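The two-stage encoding can be sketched with plain matrix operations standing in for the trained BLSTM and DNN; all dimensions, random weights, and the use of simple tanh layers are assumptions for illustration only, not the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

text_dim, feat_dim, hidden = 8, 20, 16
n_units = 3                                      # three unit recognition texts

text_vecs = rng.normal(size=(n_units, text_dim))  # text vectors (stand-ins)
feat_vecs = np.eye(feat_dim)[:n_units]            # one-hot acoustic features

# Primary encoding: each unit's text vector -> hidden state (BLSTM stand-in).
W1 = rng.normal(size=(text_dim, hidden))
primary = np.tanh(text_vecs @ W1)

# Secondary encoding: primary result concatenated with the unit's acoustic
# feature vector, then passed through a DNN-like layer (DNN stand-in).
W2 = rng.normal(size=(hidden + feat_dim, hidden))
secondary = np.tanh(np.concatenate([primary, feat_vecs], axis=1) @ W2)
# `secondary` would then be handed to the decoding model.
```

The point of the structure is visible in the shapes: acoustic information enters only at the second stage, so the decoder sees encodings that already mix text content with voice style.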
Since the secondary encoding incorporates the acoustic feature information of the first voice data, when the secondary encoding result is input into the decoding model included in the translation model, a translated text carrying the voice style of the first voice data can be generated.
It should be noted that, in actual translation, when the speech recognition text includes one sentence or multiple sentences, text translation may be carried out sentence by sentence; when translating the current sentence, the translation may be performed in units of the unit recognition text size (for example, a single word).
In the voice translation method provided in this embodiment, for voice data that requires text translation, speech recognition is performed on the voice data to generate a speech recognition text; the speech recognition text is taken as one unit recognition text or split into multiple unit recognition texts, and acoustic features are extracted from the speech segment of each unit recognition text; the unit recognition texts are vectorized, the acoustic features corresponding to the unit recognition texts are vectorized, and text translation is carried out based on the vectorization results, obtaining a translated text carrying the voice style of the voice data. It can be seen that, since the acoustic features of the voice data itself are taken into account when translating, the translated text can match the style and characteristics of the voice data, making it more natural and expressive, and thus easier for the reader to understand its semantics and context.
Third embodiment
In current voice translation technology, the audio synthesized by the machine after translation entirely follows the speaking style of the speaker whose voice was used to train the synthesis model; the synthesized audio bears little relation to the speaking style of the source speaker before translation, and the translated audio often fails to express the source speaker's style and characteristics.
To address this defect, the present embodiment provides a voice translation method that can translate the voice data of the source speaker (i.e., the first voice data in the first and second embodiments) to obtain a translated text, and then perform audio synthesis in combination with the acoustic features of that voice data, so that the synthesized audio adapts to the voice style of the source speaker, thereby realizing a more natural and expressive voice translation. This method is suitable for scenarios such as real-time interpretation, where synthesized audio adapted to the source speaker's style can be obtained.
Referring to Fig. 4, a flow diagram of a voice translation method provided by an embodiment of the present application, the method includes the following steps:
S401: Acquire the first voice data.
S402: Generate a speech recognition text by performing speech recognition on the first voice data.
It should be noted that S401 and S402 in the present embodiment are consistent with S101 and S102 in the first embodiment; for the related description, refer to the first embodiment, which is not repeated here.
S403: Extract the target acoustic feature from the first voice data, and translate the speech recognition text according to the target acoustic feature, to obtain a translated text carrying the voice style of the first voice data.
It should be noted that S403 in the present embodiment is consistent with S103 in the first embodiment; for the related description, refer to the specific implementation in the first or second embodiment, which is not repeated here.
S404: Perform speech synthesis on the translated text according to the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In the present embodiment, the synthesized audio is referred to as the second voice data. Saying that the second voice data carries the voice style of the first voice data means that it matches the wording or pronunciation style of the first voice data. It should be understood that the first voice data before translation and the second voice data after translation are typically in different languages; for example, the first voice data is in Chinese and the second voice data is in English.
In the present embodiment, a pre-built synthesis model and the target acoustic feature of the first voice data may be used to synthesize the translated text of the first voice data into speech, obtaining second voice data with the voice style of the first voice data.
More specifically, either of the following two implementations may be used to realize S404.
In the first specific implementation, S404 may include: adjusting the model parameters of the pre-built synthesis model using the target acoustic feature, and performing speech synthesis on the translated text using the adjusted synthesis model, to obtain second voice data carrying the voice style of the first voice data.
In the present embodiment, the synthesis model may synthesize voice data using acoustic parameters such as fundamental frequency, duration, and spectrum. However, the pronunciation pattern of the synthesized speech may follow the pronunciation style of the speaker on whose voice the synthesis model was trained. Therefore, the target acoustic feature extracted from the first voice data may be used to adjust the corresponding acoustic parameters of the synthesis model, for example by adjusting them with a deep learning method, so that the second voice data synthesized by the adjusted model better matches the pronunciation style of the first voice data.
It should be noted that when the extraction object of the target acoustic feature is each unit recognition text, as in the second embodiment, the modeling unit of the synthesis model must first be determined before adjusting the acoustic parameters; the modeling unit indicates the kind of text unit on which the synthesis model performs speech synthesis. If the modeling unit of the synthesis model differs in length from the unit recognition text, the speech recognition text needs to be re-split according to the modeling unit, and the target acoustic feature re-extracted from each split text. For example, if the modeling unit of the synthesis model is the syllable while the unit recognition text is a single word, the target acoustic feature needs to be re-extracted in units of syllables, and the newly extracted target acoustic feature used to adjust the acoustic parameters of the synthesis model.
In the second specific implementation, S404 may include: performing speech synthesis on the translated text using the pre-built synthesis model to obtain initial voice data, and then adjusting the acoustic features of the initial voice data using the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In the present embodiment, speech synthesis is first performed on the translated text using the pre-built synthesis model; the present embodiment refers to the synthesized speech data as the initial voice data. The pronunciation pattern of the initial voice data may follow the pronunciation style of the speaker on whose voice the synthesis model was trained, which may differ from the pronunciation style of the first voice data. Therefore, after the initial voice data is synthesized, the target acoustic feature extracted from the first voice data may be used to adjust the acoustic features of the initial voice data, obtaining the adjusted second voice data; in this way, the second voice data can better match the pronunciation style of the first voice data.
When performing acoustic adjustment on the initial voice data, the adjustment may be carried out directly at the length of the unit recognition text, or at a length smaller or larger than the unit recognition text, for example in units of syllables, characters, words, or short phrases.
Taking as an example a target acoustic feature that includes average duration, average pitch, and average volume, when adjusting in units of syllables, the average duration, average pitch, and average volume of each syllable in the first voice data are calculated, as are those of each syllable in the initial voice data, and acoustic adjustment is performed on the initial voice data according to the calculation results. Specifically, when the average duration of a syllable 1 in the initial voice data is shorter, syllable 1 is stretched in time using the average duration of the corresponding syllable in the first voice data, and vice versa; when the average pitch of a syllable 2 in the initial voice data is higher, syllable 2 is compressed in pronunciation frequency using the average pitch of the corresponding syllable in the first voice data, and vice versa; when the average volume of a syllable 3 in the initial voice data is higher, syllable 3 is compressed in pronunciation amplitude using the average volume of the corresponding syllable in the first voice data, and vice versa.
Of course, the present embodiment may also skip adjusting the initial voice data and directly take the initial voice data as the final voice data after translation.
In the voice translation method provided in this embodiment, for voice data that requires text translation, speech recognition is performed on the voice data to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features, obtaining a translated text carrying the voice style of the voice data; finally, speech synthesis is performed on the translated text according to the target acoustic feature, obtaining second voice data carrying the voice style of the first voice data. It can be seen that, since the acoustic features of the voice data itself are taken into account when translating, the translated text can match the style and characteristics of the voice data, making it more natural and expressive, and thus easier for the reader to understand its semantics and context. In addition, when synthesizing speech from the translated text, the voice style of the voice data before translation is considered during synthesis, so the synthesized speech can match the style and characteristics of the voice data before translation, producing a more natural and expressive voice translation result.
Fourth embodiment
Based on the voice translation methods provided in the first through third embodiments above, the present application further provides a voice translation device, which the fourth embodiment introduces with reference to the accompanying drawings.
Referring to Fig. 5, a structural diagram of a voice translation device provided by an embodiment of the present application, the device 500 includes:
a voice data acquiring unit 501, configured to acquire first voice data;
a recognition text generation unit 502, configured to generate a speech recognition text by performing speech recognition on the first voice data;
an acoustic feature extraction unit 503, configured to extract a target acoustic feature from the first voice data;
a translated text generation unit 504, configured to translate the speech recognition text according to the target acoustic feature, to obtain a translated text carrying the voice style of the first voice data.
In one implementation of the present embodiment, the acoustic feature extraction unit 503 includes:
a text unit determination subunit, configured to take the speech recognition text as a unit recognition text, or take each text fragment forming the speech recognition text as a unit recognition text;
a speech segment determination subunit, configured to determine the speech segment in the first voice data corresponding to the unit recognition text;
an acoustic feature determination subunit, configured to determine the target acoustic feature of the speech segment.
In one implementation of the present embodiment, the translated text generation unit 504 includes:
a text vectorization subunit, configured to vectorize each unit recognition text;
a feature vectorization subunit, configured to vectorize the target acoustic feature corresponding to each unit recognition text;
a translated text generation subunit, configured to take the text vectorization result and the feature vectorization result as the input features of a pre-built translation model, so as to translate the speech recognition text using the translation model, obtaining a translated text carrying the voice style of the first voice data.
In one implementation of the present embodiment, if the target acoustic feature includes at least one feature type, the device 500 further includes:
a value range determination unit, configured to determine the value range corresponding to the feature type according to sample voice data collected in advance;
a value interval determination unit, configured to divide the value range into at least two value intervals;
the feature vectorization subunit then includes:
a feature value determination subunit, configured to determine, for the target acoustic feature corresponding to each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature;
a vectorization processing subunit, configured to vectorize the feature value according to the value interval in which it falls.
In one implementation of the present embodiment, the device 500 further includes:
a translated speech generation unit, configured to perform speech synthesis on the translated text according to the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In one implementation of the present embodiment, the translated speech generation unit includes:
a model parameter adjustment subunit, configured to adjust the model parameters of a pre-built synthesis model using the target acoustic feature;
a first speech generation subunit, configured to perform speech synthesis on the translated text using the adjusted synthesis model;
alternatively, the translated speech generation unit includes:
a second speech generation subunit, configured to perform speech synthesis on the translated text using the pre-built synthesis model, to obtain initial voice data;
a voice data adjustment subunit, configured to adjust the acoustic features of the initial voice data using the target acoustic feature.
In one implementation of the present embodiment, the target acoustic feature includes one or more of the feature types of average speech rate, average pitch, and average volume.
In the voice translation device provided in this embodiment, for voice data that requires text translation, speech recognition is performed on the voice data to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features, obtaining a translated text carrying the voice style of the voice data. It can be seen that, since the acoustic features of the voice data itself are taken into account when translating, the translated text can match the style and characteristics of the voice data, making it more natural and expressive, and thus easier for the reader to understand its semantics and context.
Fifth embodiment
Referring to Fig. 6, a hardware structural diagram of a voice translation system provided by an embodiment of the present application, the system 600 includes a memory 601, a receiver 602, and a processor 603 connected to both the memory 601 and the receiver 602. The memory 601 is configured to store a set of program instructions, and the processor 603 is configured to call the program instructions stored in the memory 601 to perform the following operations:
acquiring first voice data;
generating a speech recognition text by performing speech recognition on the first voice data;
extracting a target acoustic feature from the first voice data, and translating the speech recognition text according to the target acoustic feature, to obtain a translated text carrying the voice style of the first voice data.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
taking the speech recognition text as a unit recognition text, or taking each text fragment forming the speech recognition text as a unit recognition text;
determining the speech segment in the first voice data corresponding to the unit recognition text;
determining the target acoustic feature of the speech segment.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
vectorizing each unit recognition text;
vectorizing the target acoustic feature corresponding to each unit recognition text;
taking the text vectorization result and the feature vectorization result as the input features of a pre-built translation model, so as to translate the speech recognition text using the translation model.
In one implementation of the present embodiment, if the target acoustic feature includes at least one feature type, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
determining the value range corresponding to the feature type according to sample voice data collected in advance;
dividing the value range into at least two value intervals;
determining, for the target acoustic feature corresponding to each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature;
vectorizing the feature value according to the value interval in which it falls.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operation:
performing speech synthesis on the translated text according to the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
adjusting the model parameters of a pre-built synthesis model using the target acoustic feature;
performing speech synthesis on the translated text using the adjusted synthesis model.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
performing speech synthesis on the translated text using the pre-built synthesis model, to obtain initial voice data;
adjusting the acoustic features of the initial voice data using the target acoustic feature.
In one implementation of the present embodiment, the target acoustic feature includes one or more of the feature types of average speech rate, average pitch, and average volume.
In some embodiments, the processor 603 may be a Central Processing Unit (CPU), the memory 601 may be an internal memory of the Random Access Memory (RAM) type, and the receiver 602 may include a general physical interface, which may be an Ethernet interface or an Asynchronous Transfer Mode (ATM) interface. The processor 603, the receiver 602, and the memory 601 may be integrated into one or more independent circuits or hardware components, such as an Application-Specific Integrated Circuit (ASIC).
As can be seen from the description of the embodiments above, those skilled in the art will clearly understand that all or part of the steps in the methods of the above embodiments may be implemented by software together with a necessary general hardware platform. Based on this understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in each embodiment of the present application or in certain parts thereof.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. As for the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and related parts may be found in the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (16)
1. A voice translation method, comprising:
acquiring first voice data;
generating a speech recognition text by performing speech recognition on the first voice data;
extracting a target acoustic feature from the first voice data, and translating the speech recognition text according to the target acoustic feature, to obtain a translated text carrying a voice style of the first voice data.
2. The method according to claim 1, wherein extracting the target acoustic feature from the first voice data comprises:
taking the speech recognition text as a unit recognition text, or taking each text fragment forming the speech recognition text as a unit recognition text;
determining a speech segment in the first voice data corresponding to the unit recognition text;
determining the target acoustic feature of the speech segment.
3. The method according to claim 2, wherein translating the speech recognition text according to the target acoustic feature comprises:
vectorizing each unit recognition text;
vectorizing the target acoustic feature corresponding to each unit recognition text;
taking the text vectorization result and the feature vectorization result as input features of a pre-built translation model, so as to translate the speech recognition text using the translation model.
4. The method according to claim 3, wherein, if the target acoustic feature includes at least one feature type, the method further comprises:
determining a value range corresponding to the feature type according to pre-collected sample voice data;
dividing the value range into at least two intervals;
wherein the vectorizing the target acoustic feature corresponding to each unit recognition text comprises:
for the target acoustic feature corresponding to each unit recognition text, determining a feature value corresponding to each feature type in the target acoustic feature;
vectorizing the feature value according to the interval of the value range in which it falls.
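One plausible reading of claim 4's interval-based vectorization is a one-hot encoding over the interval a feature value falls into. The equal-width split and the one-hot scheme are assumptions for illustration; the patent does not fix the encoding.

```python
def to_intervals(low, high, n):
    # Split the value range [low, high] into n equal-width intervals.
    step = (high - low) / n
    return [(low + i * step, low + (i + 1) * step) for i in range(n)]

def vectorize_feature_value(value, intervals):
    # One-hot over the interval containing the value; the last interval
    # is closed on the right so the range maximum is representable.
    vec = [0] * len(intervals)
    for i, (lo, hi) in enumerate(intervals):
        if lo <= value < hi or (i == len(intervals) - 1 and value == hi):
            vec[i] = 1
            break
    return vec
```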
5. The method according to claim 1, further comprising:
performing speech synthesis on the translated text according to the target acoustic feature to obtain second voice data carrying the voice style of the first voice data.
6. The method according to claim 5, wherein the performing speech synthesis on the translated text according to the target acoustic feature comprises:
adjusting model parameters of a pre-built synthesis model using the target acoustic feature;
performing speech synthesis on the translated text using the adjusted synthesis model.
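Claim 6 adjusts the synthesizer before synthesis. A minimal sketch, assuming a synthesis model parameterized by a plain dictionary — the parameter names (`rate`, `base_pitch_hz`, `gain`) are hypothetical, not an actual TTS API.

```python
DEFAULT_PARAMS = {"rate": 1.0, "base_pitch_hz": 160.0, "gain": 1.0}

def adjust_synthesis_params(params, target):
    # Overwrite the prosody-related parameters with the speaker's
    # extracted style; a real synthesis model exposes different knobs.
    adjusted = dict(params)          # copy, leave the defaults untouched
    adjusted["rate"] = target["avg_speed"]
    adjusted["base_pitch_hz"] = target["avg_pitch"]
    adjusted["gain"] = target["avg_volume"]
    return adjusted
```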
7. The method according to claim 5, wherein the performing speech synthesis on the translated text according to the target acoustic feature comprises:
performing speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data;
adjusting the acoustic features of the initial voice data using the target acoustic feature.
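Claim 7 instead adjusts the audio after synthesis. A sketch for the volume dimension only, under the assumption that samples are a list of floats; matching pitch or tempo would need signal-processing techniques (e.g. PSOLA) beyond this illustration.

```python
def adjust_volume(samples, current_avg, target_avg):
    # Rescale sample amplitudes so the average volume of the initial
    # voice data matches the target average volume.
    gain = target_avg / current_avg
    return [s * gain for s in samples]
```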
8. The method according to any one of claims 1 to 7, wherein the target acoustic feature includes one or more of the following feature types: average speech rate, average pitch, and average volume.
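The three feature types of claim 8 reduce to simple averages over an utterance. A sketch, assuming word-aligned input, per-frame F0 values (with 0 marking unvoiced frames), and per-frame energies — all input conventions are assumptions.

```python
def average_speed(word_count, duration_s):
    # Words per second over the utterance.
    return word_count / duration_s

def average_pitch(f0_per_frame):
    # Mean F0 over voiced frames only (F0 == 0 marks unvoiced frames).
    voiced = [f for f in f0_per_frame if f > 0]
    return sum(voiced) / len(voiced)

def average_volume(frame_energies):
    return sum(frame_energies) / len(frame_energies)
```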
9. A speech translation apparatus, comprising:
a voice data acquiring unit, configured to acquire first voice data;
a recognition text generation unit, configured to generate a speech recognition text by performing speech recognition on the first voice data;
an acoustic feature extraction unit, configured to extract a target acoustic feature from the first voice data;
a translated text generation unit, configured to translate the speech recognition text according to the target acoustic feature to obtain a translated text carrying the voice style of the first voice data.
10. The apparatus according to claim 9, wherein the acoustic feature extraction unit comprises:
a unit text determining subunit, configured to take the speech recognition text as a unit recognition text, or to take each text fragment forming the speech recognition text as a unit recognition text;
a speech segment determining subunit, configured to determine a speech segment in the first voice data corresponding to the unit recognition text;
an acoustic feature determining subunit, configured to determine the target acoustic feature of the speech segment.
11. The apparatus according to claim 10, wherein the translated text generation unit comprises:
a text vectorization subunit, configured to vectorize each unit recognition text;
a feature vectorization subunit, configured to vectorize the target acoustic feature corresponding to each unit recognition text;
a translated text generation subunit, configured to take the text vectorization result and the feature vectorization result as input features of a pre-built translation model, so as to translate the speech recognition text using the translation model and obtain a translated text carrying the voice style of the first voice data.
12. The apparatus according to claim 11, wherein, if the target acoustic feature includes at least one feature type, the apparatus further comprises:
a value range determining unit, configured to determine a value range corresponding to the feature type according to pre-collected sample voice data;
an interval determining unit, configured to divide the value range into at least two intervals;
wherein the feature vectorization subunit comprises:
a feature value determining subunit, configured to determine, for the target acoustic feature corresponding to each unit recognition text, a feature value corresponding to each feature type in the target acoustic feature;
a vectorization processing subunit, configured to vectorize the feature value according to the interval of the value range in which it falls.
13. The apparatus according to claim 9, further comprising:
a translated speech generation unit, configured to perform speech synthesis on the translated text according to the target acoustic feature to obtain second voice data carrying the voice style of the first voice data.
14. The apparatus according to claim 13, wherein the translated speech generation unit comprises:
a model parameter adjusting subunit, configured to adjust model parameters of a pre-built synthesis model using the target acoustic feature;
a first speech generation subunit, configured to perform speech synthesis on the translated text using the adjusted synthesis model;
alternatively, the translated speech generation unit comprises:
a second speech generation subunit, configured to perform speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data;
a voice data adjusting subunit, configured to adjust the acoustic features of the initial voice data using the target acoustic feature.
15. The apparatus according to any one of claims 9 to 14, wherein the target acoustic feature includes one or more of the following feature types: average speech rate, average pitch, and average volume.
16. A speech translation apparatus, comprising: a processor, a memory, and a system bus;
wherein the processor and the memory are connected via the system bus;
and the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810032112.3A CN108231062B (en) | 2018-01-12 | 2018-01-12 | Voice translation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108231062A true CN108231062A (en) | 2018-06-29 |
CN108231062B CN108231062B (en) | 2020-12-22 |
Family
ID=62641526
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810032112.3A Active CN108231062B (en) | 2018-01-12 | 2018-01-12 | Voice translation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108231062B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080133245A1 (en) * | 2006-12-04 | 2008-06-05 | Sehda, Inc. | Methods for speech-to-speech translation |
CN101281518A (en) * | 2007-03-28 | 2008-10-08 | 株式会社东芝 | Speech translation apparatus, method and program |
CN101373592A (en) * | 2007-08-21 | 2009-02-25 | 株式会社东芝 | Speech translation apparatus and method |
CN101727904A (en) * | 2008-10-31 | 2010-06-09 | 国际商业机器公司 | Voice translation method and device |
CN101937431A (en) * | 2010-08-18 | 2011-01-05 | 华南理工大学 | Emotional voice translation device and processing method |
CN105786801A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Speech translation method, communication method and related device |
CN106339371A (en) * | 2016-08-30 | 2017-01-18 | 齐鲁工业大学 | English and Chinese word meaning mapping method and device based on word vectors |
US20170031899A1 (en) * | 2015-07-31 | 2017-02-02 | Samsung Electronics Co., Ltd. | Apparatus and method for determining translation word |
CN107170453A (en) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence |
CN107464559A (en) * | 2017-07-11 | 2017-12-12 | 中国科学院自动化研究所 | Joint forecast model construction method and system based on Chinese rhythm structure and stress |
- 2018-01-12: CN CN201810032112.3A patent CN108231062B (status: Active)
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271645A (en) * | 2018-08-24 | 2019-01-25 | 深圳市福瑞达显示技术有限公司 | A kind of multilingual translation machine based on holographic fan screen |
CN109448458A (en) * | 2018-11-29 | 2019-03-08 | 郑昕匀 | A kind of Oral English Training device, data processing method and storage medium |
CN110008481A (en) * | 2019-04-10 | 2019-07-12 | 南京魔盒信息科技有限公司 | Translated speech generation method, device, computer equipment and storage medium |
CN110008481B (en) * | 2019-04-10 | 2023-04-28 | 南京魔盒信息科技有限公司 | Translated voice generating method, device, computer equipment and storage medium |
CN111865752A (en) * | 2019-04-23 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Text processing device, method, electronic device and computer readable storage medium |
CN112037768A (en) * | 2019-05-14 | 2020-12-04 | 北京三星通信技术研究有限公司 | Voice translation method and device, electronic equipment and computer readable storage medium |
WO2021102647A1 (en) * | 2019-11-25 | 2021-06-03 | 深圳市欢太科技有限公司 | Data processing method and apparatus, and storage medium |
CN111768756A (en) * | 2020-06-24 | 2020-10-13 | 华人运通(上海)云计算科技有限公司 | Information processing method, information processing apparatus, vehicle, and computer storage medium |
CN111768756B (en) * | 2020-06-24 | 2023-10-20 | 华人运通(上海)云计算科技有限公司 | Information processing method, information processing device, vehicle and computer storage medium |
CN111785258A (en) * | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN111785258B (en) * | 2020-07-13 | 2022-02-01 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN112183120A (en) * | 2020-09-18 | 2021-01-05 | 北京字节跳动网络技术有限公司 | Speech translation method, device, equipment and storage medium |
WO2022057637A1 (en) * | 2020-09-18 | 2022-03-24 | 北京字节跳动网络技术有限公司 | Speech translation method and apparatus, and device, and storage medium |
CN112183120B (en) * | 2020-09-18 | 2023-10-20 | 北京字节跳动网络技术有限公司 | Speech translation method, device, equipment and storage medium |
CN112382272A (en) * | 2020-12-11 | 2021-02-19 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium capable of controlling speech speed |
CN112382272B (en) * | 2020-12-11 | 2023-05-23 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium capable of controlling speech speed |
WO2022121187A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus capable of controlling speech speed, and device and storage medium |
WO2023005729A1 (en) * | 2021-07-28 | 2023-02-02 | 北京有竹居网络技术有限公司 | Speech information processing method and apparatus, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN108231062B (en) | 2020-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108231062A (en) | A kind of voice translation method and device | |
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment | |
JP7280386B2 (en) | Multilingual speech synthesis and cross-language voice cloning | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
JP7395792B2 (en) | 2-level phonetic prosody transcription | |
JP6777768B2 (en) | Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN108447486A (en) | A kind of voice translation method and device | |
CN109767778B (en) | Bi-LSTM and WaveNet fused voice conversion method | |
CN115516552A (en) | Speech recognition using synthesis of unexplained text and speech | |
CN115485766A (en) | Speech synthesis prosody using BERT models | |
CN113506562B (en) | End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features | |
CN112489629A (en) | Voice transcription model, method, medium, and electronic device | |
WO2023245389A1 (en) | Song generation method, apparatus, electronic device, and storage medium | |
JP2024012423A (en) | Predicting parametric vocoder parameter from prosodic feature | |
CN109326278B (en) | Acoustic model construction method and device and electronic equipment | |
CN116364055A (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN114387945A (en) | Voice generation method and device, electronic equipment and storage medium | |
Raghavendra et al. | A multilingual screen reader in Indian languages | |
CN116524898A (en) | Sound video generation method and device, electronic equipment and storage medium | |
CN114333903A (en) | Voice conversion method and device, electronic equipment and storage medium | |
KR102426020B1 (en) | Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker | |
CN116403562B (en) | Speech synthesis method and system based on semantic information automatic prediction pause | |
CN113763924B (en) | Acoustic deep learning model training method, and voice generation method and device | |
Jiang et al. | A Phoneme Sequence Driven Lightweight End-To-End Speech Synthesis Approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||