CN108231062A - A kind of voice translation method and device - Google Patents
A kind of voice translation method and device
- Publication number
- Publication number: CN108231062A (application CN201810032112.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- voice data
- target acoustical
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
This application discloses a voice translation method and device. For voice data that requires text translation, the method performs speech recognition on the voice data to generate a speech recognition text, extracts acoustic features from the voice data, and translates the speech recognition text according to the extracted acoustic features to obtain a translated text that carries the voice style of the voice data. Because the acoustic features inherent in the voice data are taken into account during text translation, the translated text conforms to the style and characteristics of the voice data, making it more natural and expressive and helping readers understand its semantics and context.
Description
Technical field
This application relates to the field of computer technology, and in particular to a voice translation method and device.
Background art
As artificial intelligence technology matures, people increasingly turn to intelligent technology to solve problems. For example, in the past one had to spend a great deal of time learning a new language in order to communicate with its native speakers; today, a translator device built around speech recognition, intelligent translation, and speech synthesis can take spoken input, translate the text, and speak the translated meaning aloud.
However, most current text translation technology only performs a literal translation. That is, when the voice data of a source speaker is translated into text, the translated text often fails to convey the speaker's style and characteristics. For example, when a Chinese utterance is translated into English, its Chinese text may correspond to several different English texts, each with a different diction and emotional character, and the English text actually produced is often an unsuitable one; that is, the translated text cannot express the style and characteristics of the source speaker.
Summary of the invention
The main purpose of the embodiments of this application is to provide a voice translation method and device that make the translated text conform to the style and characteristics of the voice data being translated.
An embodiment of this application provides a voice translation method, including:
obtaining first voice data;
generating a speech recognition text by performing speech recognition on the first voice data; and
extracting a target acoustic feature from the first voice data, translating the speech recognition text according to the target acoustic feature, and obtaining a translated text that carries the voice style of the first voice data.
Optionally, extracting the target acoustic feature from the first voice data includes:
taking the speech recognition text as a unit recognition text, or taking each text fragment that makes up the speech recognition text as a unit recognition text;
determining the speech segment in the first voice data that corresponds to the unit recognition text; and
determining the target acoustic feature of that speech segment.
Optionally, translating the speech recognition text according to the target acoustic feature includes:
vectorizing each unit recognition text;
vectorizing the target acoustic feature corresponding to each unit recognition text; and
using the text vectors and the feature vectors as input features of a pre-built translation model, so that the translation model translates the speech recognition text.
Optionally, if the target acoustic feature includes at least one feature type, the method further includes:
determining the value range corresponding to the feature type from sample voice data collected in advance; and
dividing the value range into at least two value intervals.
Vectorizing the target acoustic feature corresponding to each unit recognition text then includes:
for the target acoustic feature corresponding to each unit recognition text, determining the feature value corresponding to each feature type in the target acoustic feature; and
vectorizing the feature value according to the value interval in which it falls.
Optionally, the method further includes:
performing speech synthesis on the translated text according to the target acoustic feature to obtain second voice data that carries the voice style of the first voice data.
Optionally, performing speech synthesis on the translated text according to the target acoustic feature includes:
adjusting the model parameters of a pre-built synthesis model using the target acoustic feature; and
performing speech synthesis on the translated text using the adjusted synthesis model.
Optionally, performing speech synthesis on the translated text according to the target acoustic feature includes:
performing speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data; and
adjusting the acoustic features of the initial voice data using the target acoustic feature.
Optionally, the target acoustic feature includes one or more of the feature types average speaking rate, average pitch, and average volume.
An embodiment of this application further provides a speech translation device, including:
a voice data obtaining unit for obtaining first voice data;
a recognition text generation unit for generating a speech recognition text by performing speech recognition on the first voice data;
an acoustic feature extraction unit for extracting a target acoustic feature from the first voice data; and
a translated text generation unit for translating the speech recognition text according to the target acoustic feature to obtain a translated text that carries the voice style of the first voice data.
Optionally, the acoustic feature extraction unit includes:
a unit text determination subunit for taking the speech recognition text as a unit recognition text, or taking each text fragment that makes up the speech recognition text as a unit recognition text;
a speech segment determination subunit for determining the speech segment in the first voice data that corresponds to the unit recognition text; and
an acoustic feature determination subunit for determining the target acoustic feature of that speech segment.
Optionally, the translated text generation unit includes:
a text vectorization subunit for vectorizing each unit recognition text;
a feature vectorization subunit for vectorizing the target acoustic feature corresponding to each unit recognition text; and
a translated text generation subunit for using the text vectors and the feature vectors as input features of a pre-built translation model, so that the translation model translates the speech recognition text and obtains a translated text carrying the voice style of the first voice data.
Optionally, if the target acoustic feature includes at least one feature type, the device further includes:
a value range determination unit for determining, from sample voice data collected in advance, the value range corresponding to the feature type; and
a value interval determination unit for dividing the value range into at least two value intervals.
The feature vectorization subunit then includes:
a feature value determination subunit for determining, for the target acoustic feature corresponding to each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature; and
a vectorization processing subunit for vectorizing the feature value according to the value interval in which it falls.
Optionally, the device further includes:
a translated speech generation unit for performing speech synthesis on the translated text according to the target acoustic feature to obtain second voice data carrying the voice style of the first voice data.
Optionally, the translated speech generation unit includes:
a model parameter adjustment subunit for adjusting the model parameters of a pre-built synthesis model using the target acoustic feature; and
a first speech generation subunit for performing speech synthesis on the translated text using the adjusted synthesis model.
Alternatively, the translated speech generation unit includes:
a second speech generation subunit for performing speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data; and
a voice data adjustment subunit for adjusting the acoustic features of the initial voice data using the target acoustic feature.
Optionally, the target acoustic feature includes one or more of the feature types average speaking rate, average pitch, and average volume.
An embodiment of this application further provides another speech translation device, including a processor, a memory, and a system bus. The processor and the memory are connected by the system bus. The memory stores one or more programs that include instructions which, when executed by the processor, cause the processor to perform the method described in any of the above embodiments.
With the voice translation method and device provided in these embodiments, voice data requiring text translation undergoes speech recognition to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features to obtain a translated text carrying the voice style of the voice data. Because the acoustic features inherent in the voice data are considered during text translation, the translated text conforms to the style and characteristics of the voice data, making it more natural and expressive and helping readers understand its semantics and context.
Description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is the first flow diagram of a voice translation method provided by an embodiment of this application;
Fig. 2 is the second flow diagram of a voice translation method provided by an embodiment of this application;
Fig. 3 is a schematic diagram of text translation provided by an embodiment of this application;
Fig. 4 is the third flow diagram of a voice translation method provided by an embodiment of this application;
Fig. 5 is a schematic composition diagram of a speech translation device provided by an embodiment of this application;
Fig. 6 is a hardware structure diagram of a speech translation device provided by an embodiment of this application.
Specific embodiment
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative work fall within the protection scope of this application.
In current text translation technology, most translation technology only performs a literal translation; that is, when the voice data of a source speaker is translated into text, the translated text often cannot express the speaker's style and characteristics.
To this end, the embodiments of this application provide a voice translation method that not only translates the text during voice translation but also conveys the diction and emotional characteristics of the source speaker; that is, the translated text is adapted to the source speaker's diction and emotional characteristics, producing a translation that is more natural and expressive and thus helps readers understand its semantics and context.
The exemplary embodiments of this application are introduced in detail below with reference to the drawings.
First embodiment
Referring to Fig. 1, a flow diagram of a voice translation method provided by an embodiment of this application, the method includes the following steps:
S101: Obtain the first voice data.
In this embodiment, the voice data to be translated into text is defined as the first voice data.
This embodiment does not limit the source of the first voice data. For example, the first voice data can be the real speech or recorded speech of a source speaker, or special-effect speech obtained by machine processing of that real or recorded speech.
Nor does this embodiment limit the length of the first voice data. For example, the first voice data can be a word, a sentence, or a passage.
S102: Generate a speech recognition text by performing speech recognition on the first voice data.
After the first voice data is obtained, it is converted into a speech recognition text by a speech recognition technique, for example one based on artificial neural networks.
S103: Extract a target acoustic feature from the first voice data, translate the speech recognition text according to the target acoustic feature, and obtain a translated text that carries the voice style of the first voice data.
In this embodiment, one or more acoustic feature types can be set in advance, such as speaking rate, pitch, and volume. When the first voice data is translated into text, the specific feature value of each acoustic feature type is extracted from the first voice data, and these specific feature values serve as the target acoustic feature. When there are multiple candidate translations, the most suitable translated text is selected with reference to the target acoustic feature.
It should be understood that the target acoustic feature mainly describes the voice style of the first voice data, that is, the speaking style of the source speaker behind the first voice data. Therefore, if the target acoustic feature is considered during text translation, the resulting translated text not only matches the words spoken by the source speaker but also reflects the speaker's expressive style; in other words, it is a translated text that conforms to the source speaker's speaking style.
For ease of understanding, consider an example. When the first voice data is the Chinese phrase "当然啦" ("of course"), existing text translation technology might generate the translated text "Yes, of course", whereas this embodiment, which considers the voice style of the first voice data, might generate "You bet", a translated text better suited to the meaning and tone style the source speaker intends to express.
With the voice translation method provided in this embodiment, voice data requiring text translation undergoes speech recognition to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features to obtain a translated text carrying the voice style of the voice data. Because the acoustic features inherent in the voice data are considered during text translation, the translated text conforms to the style and characteristics of the voice data, making it more natural and expressive and helping readers understand its semantics and context.
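The three steps above can be sketched end to end. The following is a minimal toy sketch, not the patent's actual implementation: every function name is hypothetical, the recognizer and feature extractor are stand-ins that read precomputed values, and the style-aware selection is reduced to picking the more emphatic of two candidate translations when the speech is fast and loud.

```python
def recognize(voice_data):
    # Stand-in for an ASR system; the patent assumes a real recognizer,
    # e.g. one based on artificial neural networks (S102).
    return voice_data["transcript"]

def extract_acoustic_features(voice_data):
    # Target acoustic feature types named in the patent: average speaking
    # rate, average pitch, average volume (values here are precomputed).
    return {
        "avg_speed_spm": voice_data["syllables"] / voice_data["duration_min"],
        "avg_pitch_hz": voice_data["pitch_hz"],
        "avg_volume": voice_data["volume"],
    }

def translate_with_style(text, features):
    # Toy stand-in for the style-aware translation model: choose the more
    # emphatic candidate when the delivery is fast and loud.
    candidates = {"当然啦": ["Yes, of course", "You bet"]}
    options = candidates.get(text, [text])
    emphatic = features["avg_speed_spm"] > 200 and features["avg_volume"] > 0.5
    return options[-1] if emphatic and len(options) > 1 else options[0]

def translate_voice(voice_data):
    text = recognize(voice_data)                      # S102
    features = extract_acoustic_features(voice_data)  # S103, feature extraction
    return translate_with_style(text, features)       # S103, styled translation

sample = {"transcript": "当然啦", "syllables": 3, "duration_min": 0.01,
          "pitch_hz": 220.0, "volume": 0.8}
print(translate_voice(sample))  # → You bet
```

A slower, quieter delivery of the same phrase would fall back to the literal candidate "Yes, of course", which is the behavioral difference the embodiment claims over literal-only translation.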
Second embodiment
This embodiment focuses on a specific implementation of S103 in the first embodiment above; for the other parts, please see the introduction of the first embodiment.
Referring to Fig. 2, a flow diagram of a voice translation method provided by an embodiment of this application, the method includes the following steps:
S201: Obtain the first voice data.
S202: Generate a speech recognition text by performing speech recognition on the first voice data.
Note that S201 and S202 in this embodiment are the same as S101 and S102 in the first embodiment; see the first embodiment for their description, which is not repeated here.
S203: Take the speech recognition text as a unit recognition text, or take each text fragment that makes up the speech recognition text as a unit recognition text.
In this embodiment, the whole speech recognition text can serve as one unit recognition text; alternatively, the speech recognition text can be split, with each resulting text fragment serving as one unit recognition text.
When splitting the text, the split can be based on a preset split unit, which may be larger or smaller, for example a single syllable, a single character, or a single word, yielding multiple text fragments. The preset split unit can be set manually or be a system default.
For example, when the speech recognition text is Chinese, each text fragment can be a single character: if the speech recognition text is "当然啦" ("of course"), the text fragments are "当", "然", and "啦". When the speech recognition text is English, each text fragment can be a single word: if the speech recognition text is "Yes, of course", the text fragments are "Yes", "of", and "course".
S204: Determine the speech segment in the first voice data that corresponds to each unit recognition text, and determine the target acoustic feature of that speech segment.
In this embodiment, when the whole speech recognition text serves as the unit recognition text, the speech segment corresponding to the unit recognition text is the first voice data itself; when each text fragment of the speech recognition text serves as a unit recognition text, the speech segment corresponding to a unit recognition text is the part of the first voice data corresponding to that text fragment.
After the speech segment corresponding to each unit recognition text is determined, the target acoustic feature of each speech segment is further determined. Taking the Chinese text "当然啦" split into the three unit recognition texts "当", "然", and "啦" as an example, one needs to obtain the speech segment 1 corresponding to "当" and extract its target acoustic feature, obtain the speech segment 2 corresponding to "然" and extract its target acoustic feature, and obtain the speech segment 3 corresponding to "啦" and extract its target acoustic feature.
In one implementation of this embodiment, the target acoustic feature can include one or more of the feature types average speaking rate, average pitch, and average volume. Of course, this embodiment is not limited to these three kinds of acoustic features; acoustic features of other types can also be included, for example the speaker's tone height and stress intensity.
Next, how the target acoustic feature is determined is introduced in detail.
(1) When the target acoustic feature includes the feature type "average speaking rate", the average speaking rate of each speech segment can be calculated as follows:
For each speech segment, first obtain the duration of the speech segment, then determine the number of text units contained in the corresponding unit recognition text, and finally divide the number of text units by the duration to obtain the specific average speaking rate of the speech segment. A text unit is no longer than the unit recognition text; for example, the text unit can be a syllable: a Chinese character contains one syllable, while an English word, whose length varies, contains one or more syllables.
Taking the character "当" in "当然啦" above as an example, first obtain the duration of the speech segment corresponding to "当" (for example, in minutes), then divide the syllable count of "当" by the duration, obtaining an average speaking rate in SPM (Syllables Per Minute). If the average speaking rate of the whole phrase "当然啦" is calculated, the calculation is similar.
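A minimal sketch of this average-speaking-rate calculation; the function name and the sample durations are illustrative:

```python
def average_speed_spm(syllable_count, duration_seconds):
    """Average speaking rate in syllables per minute (SPM)."""
    return syllable_count * 60.0 / duration_seconds

# "当" is one syllable; suppose its speech segment lasts 0.25 s.
print(average_speed_spm(1, 0.25))  # → 240.0
# The whole phrase "当然啦" has 3 syllables; suppose it lasts 1.5 s.
print(average_speed_spm(3, 1.5))   # → 120.0
```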
(2) When the target acoustic feature includes the feature type "average pitch", the average pitch of each speech segment can be calculated as follows:
For each speech segment, first divide the speech segment into several audio frames, whose length can be preset manually or take a system default; then determine the vibration frequency of each audio frame, in hertz (Hz); finally calculate the mean of the vibration frequencies of these audio frames, which is the specific average pitch of the speech segment.
Taking the character "当" in "当然啦" above as an example, first divide "当" into several audio frames, for example audio frame 1, audio frame 2, and audio frame 3; then determine their vibration frequencies, for example frequency 1, frequency 2, and frequency 3; finally calculate (frequency 1 + frequency 2 + frequency 3) / 3, which is the average pitch of "当". If the average pitch of the whole phrase "当然啦" is calculated, the calculation is similar.
(3) When the target acoustic feature includes the feature type "average volume", the average volume of each speech segment can be calculated as follows:
For each speech segment, first divide the speech segment into several audio frames, whose length can be preset manually or take a system default; then determine the amplitude of each audio frame; finally calculate the mean of the amplitudes of these audio frames, which is the specific average volume of the speech segment.
Taking the character "当" in "当然啦" above as an example, first divide "当" into several audio frames, for example audio frame 1, audio frame 2, and audio frame 3; then determine their amplitudes, for example amplitude 1, amplitude 2, and amplitude 3; finally calculate (amplitude 1 + amplitude 2 + amplitude 3) / 3, which is the average volume of "当". If the average volume of the whole phrase "当然啦" is calculated, the calculation is similar.
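The frame-based averages in (2) and (3) are both simple means over per-frame measurements. The sketch below assumes the per-frame frequencies and amplitudes are already available; a real system would estimate them from the waveform:

```python
def frame_mean(values):
    """Mean of per-frame measurements (pitch in Hz, or amplitude)."""
    return sum(values) / len(values)

# Suppose the segment for "当" yields three audio frames.
frame_pitches_hz = [210.0, 220.0, 230.0]
frame_amplitudes = [3000.0, 5000.0, 7000.0]  # e.g. sample magnitudes

print(frame_mean(frame_pitches_hz))  # → 220.0 (average pitch)
print(frame_mean(frame_amplitudes))  # → 5000.0 (average volume)
```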
S205: Vectorize each unit recognition text, and vectorize the target acoustic feature corresponding to each unit recognition text.
In this embodiment, each unit recognition text can be vectorized to obtain its text vector, so that the text vector represents the corresponding unit recognition text; for example, the word2vec method may be used for text vectorization.
Likewise, the target acoustic feature corresponding to each unit recognition text can be vectorized to obtain its feature vector, so that the feature vector represents the corresponding target acoustic feature.
When the target acoustical feature includes at least one characteristic type, in order to realize feature vector, the present embodiment
The corresponding value range of the characteristic type can be determined according to the sample voice data collected in advance, and by the value model
It encloses and is divided at least two intervals.
Specifically, a large amount of voice data from person-to-person conversations may be collected in advance, and each piece of collected voice data is taken as sample voice data. Speech recognition is performed on each piece of sample voice data, and the sample voice data is split according to the length of the unit recognition text described in S204 (for example, a single word), yielding multiple sample text fragments. According to the feature types included in the target acoustic feature described in S204, the acoustic feature value of each sample text fragment is calculated, so that the value range of each feature type is obtained, and each value range is divided into multiple value intervals. For example, when the target acoustic feature includes feature types such as average speech rate, average pitch, and average volume, the value range of average speech rate may be divided into multiple intervals, such as 20 intervals; the value range of average pitch into multiple intervals, such as 15 intervals; and the value range of average volume into multiple intervals, such as 25 intervals.
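The range-division step above can be sketched as follows; the sample speech-rate values and the choice of equal-width intervals are illustrative assumptions, since the patent does not prescribe how the range is partitioned.

```python
# Sketch: deriving value intervals for one feature type (e.g. average
# speech rate) from sample voice data collected in advance.

def make_intervals(sample_values, n_intervals):
    """Divide the observed value range into n equal-width intervals,
    returned as (lower, upper) pairs."""
    lo, hi = min(sample_values), max(sample_values)
    width = (hi - lo) / n_intervals
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n_intervals)]

# Hypothetical per-fragment average speech rates measured from samples (SPM).
speeds = [30, 95, 170, 240, 350]
intervals = make_intervals(speeds, 20)   # 20 intervals over 30 to 350 SPM
```

Each feature type (speech rate, pitch, volume) would get its own interval list, possibly with a different interval count, as the text describes.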
Based on the above interval division, in one implementation of the present embodiment, feature vectorization may be carried out through the following steps:
Step A: For the target acoustic feature corresponding to each unit recognition text, determine the feature value corresponding to each feature type in the target acoustic feature.
In the present embodiment, for each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature may be calculated from the speech segment corresponding to that unit recognition text, in the manner introduced above. For example, when the target acoustic feature includes the three feature types of average speech rate, average pitch, and average volume, the specific average speech rate, average pitch, and average volume of the unit recognition text are calculated.
Step B: Vectorize the feature value according to the value interval in which it falls.
After the feature value tz of the unit recognition text for a certain feature type tx (such as average speech rate) is calculated, since the value range of feature type tx has been divided into multiple intervals in advance, it can be determined which interval the feature value tz belongs to. Each value interval of feature type tx corresponds to one vector element; therefore, the vector element corresponding to the interval to which tz belongs can be set to a preset value such as 1, and the vector elements corresponding to the other intervals set to another preset value such as 0. In this way, a feature vector composed of these preset values is obtained, namely the feature vector corresponding to the feature value tz.
For ease of understanding, an example is given. Suppose the value range of "average speech rate" calculated from the sample voice data is 30 to 350 SPM; it may be divided into 20 intervals in order of numerical value, such as 30-60, 60-90, 90-120, and so on. The vector size of "average speech rate" is then 20, i.e., the vector of "average speech rate" contains 20 vector elements, each corresponding to one value interval. When the specific average speech rate of the unit recognition text falls into one of the intervals, for example a rate of 40, which falls into the interval 30-60, the vector element corresponding to that interval takes the value 1 and the elements of the other intervals take the value 0, so the feature vector corresponding to an average speech rate of 40 is: (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0).
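The worked example can be reproduced with a small one-hot encoder. The interval lower bounds follow the document's illustrative sequence (30, 60, 90, ...) rather than an exact equal split of 30 to 350; values below the first bound are clamped to the first interval, which is an assumption of this sketch.

```python
# Sketch: feature vectorization by one-hot encoding over value intervals.

def one_hot(value, lower_bounds):
    """lower_bounds[i] is the lower bound of interval i (left-closed).
    Returns a one-hot vector marking the interval containing `value`."""
    vec = [0] * len(lower_bounds)
    idx = 0
    for i, lower in enumerate(lower_bounds):
        if value >= lower:
            idx = i            # last interval whose lower bound is reached
    vec[idx] = 1
    return vec

bounds = [30 + 30 * i for i in range(20)]   # 30, 60, 90, ... as in the text
fv = one_hot(40, bounds)                     # average speech rate of 40
```

`fv` is the (1, 0, ..., 0) vector from the example: 40 falls into the first interval 30-60.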
S206: Take the text vectorization result and the feature vectorization result as the input features of a pre-built translation model, so as to translate the speech recognition text using the translation model and obtain a translated text carrying the voice style of the first voice data.
In the present embodiment, the text vector of each unit recognition text in the speech recognition text and the feature vector of the target acoustic feature corresponding to each unit recognition text are input into the pre-built translation model. The translation model can then translate the speech recognition text based on the input text vectors and feature vectors, obtaining a translated text that matches the speaking style of the first voice data. The translation model may be obtained by training in advance on a large number of speech recognition texts and corresponding voice acoustic features collected beforehand, and may include an encoding model and a decoding model.
The text vector and feature vector corresponding to each unit recognition text in the speech recognition text are input into the encoding model included in the translation model. The encoding model first performs primary encoding on the text vector of each unit recognition text, and then performs secondary encoding on the primary encoding result together with the feature vector corresponding to that unit recognition text. Finally, the secondary encoding result is input into the decoding model included in the translation model for decoding, yielding the translated text of the speech recognition text.
For example, suppose the speech recognition text is the phrase "of course", split character by character into three unit recognition texts. The text vector and feature vector corresponding to each unit recognition text are first input into the encoding model included in the translation model. As shown in the text translation schematic diagram of Fig. 3, the encoding model may use a Bidirectional Long Short-Term Memory (BLSTM) model to perform primary encoding on the text vectors; specifically, three BLSTM units may be used to perform primary encoding on the three unit recognition texts respectively, so that the primary encoding captures the relationship between each unit recognition text and the other unit recognition texts. Afterwards, the encoding model may use a Deep Neural Network (DNN) model to perform secondary encoding; specifically, three DNN models may be used, each performing secondary encoding on the primary encoding result of one unit recognition text together with the feature vector corresponding to that unit recognition text.
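The two-stage encoding can be sketched with plain matrix operations standing in for the trained BLSTM and DNN; all dimensions, random weights, and the use of simple tanh layers are assumptions for illustration only, not the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

text_dim, feat_dim, hidden = 8, 20, 16
n_units = 3                                      # three unit recognition texts

text_vecs = rng.normal(size=(n_units, text_dim))  # text vectors (stand-ins)
feat_vecs = np.eye(feat_dim)[:n_units]            # one-hot acoustic features

# Primary encoding: each unit's text vector -> hidden state (BLSTM stand-in).
W1 = rng.normal(size=(text_dim, hidden))
primary = np.tanh(text_vecs @ W1)

# Secondary encoding: primary result concatenated with the unit's acoustic
# feature vector, then passed through a DNN-like layer (DNN stand-in).
W2 = rng.normal(size=(hidden + feat_dim, hidden))
secondary = np.tanh(np.concatenate([primary, feat_vecs], axis=1) @ W2)
# `secondary` would then be handed to the decoding model.
```

The point of the structure is visible in the shapes: acoustic information enters only at the second stage, so the decoder sees encodings that already mix text content with voice style.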
Since the secondary encoding incorporates the acoustic feature information of the first voice data, when the secondary encoding result is input into the decoding model included in the translation model, a translated text carrying the voice style of the first voice data can be generated.
It should be noted that, in actual translation, when the speech recognition text includes one sentence or multiple sentences, text translation may be carried out sentence by sentence; when translating the current sentence, the translation may be performed in units of the unit recognition text size (for example, a single word).
In the voice translation method provided in this embodiment, for voice data that requires text translation, speech recognition is performed on the voice data to generate a speech recognition text; the speech recognition text is taken as one unit recognition text or split into multiple unit recognition texts, and acoustic features are extracted from the speech segment of each unit recognition text; the unit recognition texts are vectorized, the acoustic features corresponding to the unit recognition texts are vectorized, and text translation is carried out based on the vectorization results, obtaining a translated text carrying the voice style of the voice data. It can be seen that, since the acoustic features of the voice data itself are taken into account when translating, the translated text can match the style and characteristics of the voice data, making it more natural and expressive, and thus easier for the reader to understand its semantics and context.
Third embodiment
In current voice translation technology, the audio synthesized by the machine after translation entirely follows the speaking style of the speaker whose voice was used to train the synthesis model; the synthesized audio bears little relation to the speaking style of the source speaker before translation, and the translated audio often fails to express the source speaker's style and characteristics.
To address this defect, the present embodiment provides a voice translation method that can translate the voice data of the source speaker (i.e., the first voice data in the first and second embodiments) to obtain a translated text, and then perform audio synthesis in combination with the acoustic features of that voice data, so that the synthesized audio adapts to the voice style of the source speaker, thereby realizing a more natural and expressive voice translation. This method is suitable for scenarios such as real-time interpretation, where synthesized audio adapted to the source speaker's style can be obtained.
Referring to Fig. 4, a flow diagram of a voice translation method provided by an embodiment of the present application, the method includes the following steps:
S401: Acquire the first voice data.
S402: Generate a speech recognition text by performing speech recognition on the first voice data.
It should be noted that S401 and S402 in the present embodiment are consistent with S101 and S102 in the first embodiment; for the related description, refer to the first embodiment, which is not repeated here.
S403: Extract the target acoustic feature from the first voice data, and translate the speech recognition text according to the target acoustic feature, to obtain a translated text carrying the voice style of the first voice data.
It should be noted that S403 in the present embodiment is consistent with S103 in the first embodiment; for the related description, refer to the specific implementation in the first or second embodiment, which is not repeated here.
S404: Perform speech synthesis on the translated text according to the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In the present embodiment, the synthesized audio is referred to as the second voice data. Saying that the second voice data carries the voice style of the first voice data means that it matches the wording or pronunciation style of the first voice data. It should be understood that the first voice data before translation and the second voice data after translation are typically in different languages; for example, the first voice data is in Chinese and the second voice data is in English.
In the present embodiment, a pre-built synthesis model and the target acoustic feature of the first voice data may be used to synthesize the translated text of the first voice data into speech, obtaining second voice data with the voice style of the first voice data.
More specifically, either of the following two implementations may be used to realize S404.
In the first specific implementation, S404 may include: adjusting the model parameters of the pre-built synthesis model using the target acoustic feature, and performing speech synthesis on the translated text using the adjusted synthesis model, to obtain second voice data carrying the voice style of the first voice data.
In the present embodiment, the synthesis model may synthesize voice data using acoustic parameters such as fundamental frequency, duration, and spectrum. However, the pronunciation pattern of the synthesized speech may follow the pronunciation style of the speaker on whose voice the synthesis model was trained. Therefore, the target acoustic feature extracted from the first voice data may be used to adjust the corresponding acoustic parameters of the synthesis model, for example by adjusting them with a deep learning method, so that the second voice data synthesized by the adjusted model better matches the pronunciation style of the first voice data.
It should be noted that when the extraction object of the target acoustic feature is each unit recognition text, as in the second embodiment, the modeling unit of the synthesis model must first be determined before adjusting the acoustic parameters; the modeling unit indicates the kind of text unit on which the synthesis model performs speech synthesis. If the modeling unit of the synthesis model differs in length from the unit recognition text, the speech recognition text needs to be re-split according to the modeling unit, and the target acoustic feature re-extracted from each split text. For example, if the modeling unit of the synthesis model is the syllable while the unit recognition text is a single word, the target acoustic feature needs to be re-extracted in units of syllables, and the newly extracted target acoustic feature used to adjust the acoustic parameters of the synthesis model.
In the second specific implementation, S404 may include: performing speech synthesis on the translated text using the pre-built synthesis model to obtain initial voice data, and then adjusting the acoustic features of the initial voice data using the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In the present embodiment, speech synthesis is first performed on the translated text using the pre-built synthesis model; the present embodiment refers to the synthesized speech data as the initial voice data. The pronunciation pattern of the initial voice data may follow the pronunciation style of the speaker on whose voice the synthesis model was trained, which may differ from the pronunciation style of the first voice data. Therefore, after the initial voice data is synthesized, the target acoustic feature extracted from the first voice data may be used to adjust the acoustic features of the initial voice data, obtaining the adjusted second voice data; in this way, the second voice data can better match the pronunciation style of the first voice data.
When performing acoustic adjustment on the initial voice data, the adjustment may be carried out directly at the length of the unit recognition text, or at a length smaller or larger than the unit recognition text, for example in units of syllables, characters, words, or short phrases.
Taking as an example a target acoustic feature that includes average duration, average pitch, and average volume, when adjusting in units of syllables, the average duration, average pitch, and average volume of each syllable in the first voice data are calculated, as are those of each syllable in the initial voice data, and acoustic adjustment is performed on the initial voice data according to the calculation results. Specifically, when the average duration of a syllable 1 in the initial voice data is shorter, syllable 1 is stretched in time using the average duration of the corresponding syllable in the first voice data, and vice versa; when the average pitch of a syllable 2 in the initial voice data is higher, syllable 2 is compressed in pronunciation frequency using the average pitch of the corresponding syllable in the first voice data, and vice versa; when the average volume of a syllable 3 in the initial voice data is higher, syllable 3 is compressed in pronunciation amplitude using the average volume of the corresponding syllable in the first voice data, and vice versa.
Of course, the present embodiment may also skip adjusting the initial voice data and directly take the initial voice data as the final voice data after translation.
In the voice translation method provided in this embodiment, for voice data that requires text translation, speech recognition is performed on the voice data to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features, obtaining a translated text carrying the voice style of the voice data; finally, speech synthesis is performed on the translated text according to the target acoustic feature, obtaining second voice data carrying the voice style of the first voice data. It can be seen that, since the acoustic features of the voice data itself are taken into account when translating, the translated text can match the style and characteristics of the voice data, making it more natural and expressive, and thus easier for the reader to understand its semantics and context. In addition, when synthesizing speech from the translated text, the voice style of the voice data before translation is considered during synthesis, so the synthesized speech can match the style and characteristics of the voice data before translation, producing a more natural and expressive voice translation result.
Fourth embodiment
Based on the voice translation methods provided in the first through third embodiments above, the present application further provides a voice translation device, which the fourth embodiment introduces with reference to the accompanying drawings.
Referring to Fig. 5, a structural diagram of a voice translation device provided by an embodiment of the present application, the device 500 includes:
a voice data acquiring unit 501, configured to acquire first voice data;
a recognition text generation unit 502, configured to generate a speech recognition text by performing speech recognition on the first voice data;
an acoustic feature extraction unit 503, configured to extract a target acoustic feature from the first voice data;
a translated text generation unit 504, configured to translate the speech recognition text according to the target acoustic feature, to obtain a translated text carrying the voice style of the first voice data.
In one implementation of the present embodiment, the acoustic feature extraction unit 503 includes:
a text unit determination subunit, configured to take the speech recognition text as a unit recognition text, or take each text fragment forming the speech recognition text as a unit recognition text;
a speech segment determination subunit, configured to determine the speech segment in the first voice data corresponding to the unit recognition text;
an acoustic feature determination subunit, configured to determine the target acoustic feature of the speech segment.
In one implementation of the present embodiment, the translated text generation unit 504 includes:
a text vectorization subunit, configured to vectorize each unit recognition text;
a feature vectorization subunit, configured to vectorize the target acoustic feature corresponding to each unit recognition text;
a translated text generation subunit, configured to take the text vectorization result and the feature vectorization result as the input features of a pre-built translation model, so as to translate the speech recognition text using the translation model, obtaining a translated text carrying the voice style of the first voice data.
In one implementation of the present embodiment, if the target acoustic feature includes at least one feature type, the device 500 further includes:
a value range determination unit, configured to determine the value range corresponding to the feature type according to sample voice data collected in advance;
a value interval determination unit, configured to divide the value range into at least two value intervals;
the feature vectorization subunit then includes:
a feature value determination subunit, configured to determine, for the target acoustic feature corresponding to each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature;
a vectorization processing subunit, configured to vectorize the feature value according to the value interval in which it falls.
In one implementation of the present embodiment, the device 500 further includes:
a translated speech generation unit, configured to perform speech synthesis on the translated text according to the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In one implementation of the present embodiment, the translated speech generation unit includes:
a model parameter adjustment subunit, configured to adjust the model parameters of a pre-built synthesis model using the target acoustic feature;
a first speech generation subunit, configured to perform speech synthesis on the translated text using the adjusted synthesis model;
alternatively, the translated speech generation unit includes:
a second speech generation subunit, configured to perform speech synthesis on the translated text using the pre-built synthesis model, to obtain initial voice data;
a voice data adjustment subunit, configured to adjust the acoustic features of the initial voice data using the target acoustic feature.
In one implementation of the present embodiment, the target acoustic feature includes one or more of the feature types of average speech rate, average pitch, and average volume.
In the voice translation device provided in this embodiment, for voice data that requires text translation, speech recognition is performed on the voice data to generate a speech recognition text; acoustic features are extracted from the voice data, and the speech recognition text is translated according to the extracted acoustic features, obtaining a translated text carrying the voice style of the voice data. It can be seen that, since the acoustic features of the voice data itself are taken into account when translating, the translated text can match the style and characteristics of the voice data, making it more natural and expressive, and thus easier for the reader to understand its semantics and context.
Fifth embodiment
Referring to Fig. 6, a hardware structural diagram of a voice translation system provided by an embodiment of the present application, the system 600 includes a memory 601, a receiver 602, and a processor 603 connected to both the memory 601 and the receiver 602. The memory 601 is configured to store a set of program instructions, and the processor 603 is configured to call the program instructions stored in the memory 601 to perform the following operations:
acquiring first voice data;
generating a speech recognition text by performing speech recognition on the first voice data;
extracting a target acoustic feature from the first voice data, and translating the speech recognition text according to the target acoustic feature, to obtain a translated text carrying the voice style of the first voice data.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
taking the speech recognition text as a unit recognition text, or taking each text fragment forming the speech recognition text as a unit recognition text;
determining the speech segment in the first voice data corresponding to the unit recognition text;
determining the target acoustic feature of the speech segment.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
vectorizing each unit recognition text;
vectorizing the target acoustic feature corresponding to each unit recognition text;
taking the text vectorization result and the feature vectorization result as the input features of a pre-built translation model, so as to translate the speech recognition text using the translation model.
In one implementation of the present embodiment, if the target acoustic feature includes at least one feature type, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
determining the value range corresponding to the feature type according to sample voice data collected in advance;
dividing the value range into at least two value intervals;
determining, for the target acoustic feature corresponding to each unit recognition text, the feature value corresponding to each feature type in the target acoustic feature;
vectorizing the feature value according to the value interval in which it falls.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operation:
performing speech synthesis on the translated text according to the target acoustic feature, to obtain second voice data carrying the voice style of the first voice data.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
adjusting the model parameters of a pre-built synthesis model using the target acoustic feature;
performing speech synthesis on the translated text using the adjusted synthesis model.
In one implementation of the present embodiment, the processor 603 is further configured to call the program instructions stored in the memory 601 to perform the following operations:
performing speech synthesis on the translated text using the pre-built synthesis model, to obtain initial voice data;
adjusting the acoustic features of the initial voice data using the target acoustic feature.
In one implementation of the present embodiment, the target acoustic feature includes one or more of the feature types of average speech rate, average pitch, and average volume.
In some embodiments, the processor 603 may be a Central Processing Unit (CPU), the memory 601 may be an internal memory of the Random Access Memory (RAM) type, and the receiver 602 may include a general physical interface, which may be an Ethernet interface or an Asynchronous Transfer Mode (ATM) interface. The processor 603, the receiver 602, and the memory 601 may be integrated into one or more independent circuits or hardware components, such as an Application-Specific Integrated Circuit (ASIC).
As can be seen from the description of the embodiments above, those skilled in the art will clearly understand that all or part of the steps in the methods of the above embodiments may be implemented by software together with a necessary general hardware platform. Based on this understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to perform the methods described in each embodiment of the present application or in certain parts thereof.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. As for the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and related parts may be found in the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (16)
1. A voice translation method, comprising:
acquiring first voice data;
generating a speech recognition text by performing speech recognition on the first voice data;
extracting a target acoustic feature from the first voice data, and translating the speech recognition text according to the target acoustic feature, to obtain a translated text carrying a voice style of the first voice data.
2. The method according to claim 1, wherein extracting the target acoustic feature from the first voice data comprises:
taking the speech recognition text as a unit recognition text, or taking each text fragment forming the speech recognition text as a unit recognition text;
determining a speech segment in the first voice data corresponding to the unit recognition text;
determining the target acoustic feature of the speech segment.
3. The method according to claim 2, wherein translating the speech recognition text according to the target acoustic feature comprises:
vectorizing each unit recognition text;
vectorizing the target acoustic feature corresponding to each unit recognition text;
taking the text vectorization result and the feature vectorization result as input features of a pre-built translation model, so as to translate the speech recognition text using the translation model.
4. The method according to claim 3, wherein, if the target acoustic feature includes at least one feature type, the method further comprises:
determining a value range corresponding to the feature type according to pre-collected sample voice data;
dividing the value range into at least two intervals;
wherein the vectorizing the target acoustic feature corresponding to each unit recognition text comprises:
for the target acoustic feature corresponding to each unit recognition text, determining a feature value corresponding to each feature type in the target acoustic feature;
vectorizing the feature value according to the interval of the value range in which it falls.
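One plausible reading of claim 4's interval-based vectorization is a one-hot encoding over the interval a feature value falls into. The equal-width split and the one-hot scheme are assumptions for illustration; the patent does not fix the encoding.

```python
def to_intervals(low, high, n):
    # Split the value range [low, high] into n equal-width intervals.
    step = (high - low) / n
    return [(low + i * step, low + (i + 1) * step) for i in range(n)]

def vectorize_feature_value(value, intervals):
    # One-hot over the interval containing the value; the last interval
    # is closed on the right so the range maximum is representable.
    vec = [0] * len(intervals)
    for i, (lo, hi) in enumerate(intervals):
        if lo <= value < hi or (i == len(intervals) - 1 and value == hi):
            vec[i] = 1
            break
    return vec
```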
5. The method according to claim 1, further comprising:
performing speech synthesis on the translated text according to the target acoustic feature to obtain second voice data carrying the voice style of the first voice data.
6. The method according to claim 5, wherein the performing speech synthesis on the translated text according to the target acoustic feature comprises:
adjusting model parameters of a pre-built synthesis model using the target acoustic feature;
performing speech synthesis on the translated text using the adjusted synthesis model.
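Claim 6 adjusts the synthesizer before synthesis. A minimal sketch, assuming a synthesis model parameterized by a plain dictionary — the parameter names (`rate`, `base_pitch_hz`, `gain`) are hypothetical, not an actual TTS API.

```python
DEFAULT_PARAMS = {"rate": 1.0, "base_pitch_hz": 160.0, "gain": 1.0}

def adjust_synthesis_params(params, target):
    # Overwrite the prosody-related parameters with the speaker's
    # extracted style; a real synthesis model exposes different knobs.
    adjusted = dict(params)          # copy, leave the defaults untouched
    adjusted["rate"] = target["avg_speed"]
    adjusted["base_pitch_hz"] = target["avg_pitch"]
    adjusted["gain"] = target["avg_volume"]
    return adjusted
```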
7. The method according to claim 5, wherein the performing speech synthesis on the translated text according to the target acoustic feature comprises:
performing speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data;
adjusting the acoustic features of the initial voice data using the target acoustic feature.
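Claim 7 instead adjusts the audio after synthesis. A sketch for the volume dimension only, under the assumption that samples are a list of floats; matching pitch or tempo would need signal-processing techniques (e.g. PSOLA) beyond this illustration.

```python
def adjust_volume(samples, current_avg, target_avg):
    # Rescale sample amplitudes so the average volume of the initial
    # voice data matches the target average volume.
    gain = target_avg / current_avg
    return [s * gain for s in samples]
```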
8. The method according to any one of claims 1 to 7, wherein the target acoustic feature includes one or more of the following feature types: average speech rate, average pitch, and average volume.
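The three feature types of claim 8 reduce to simple averages over an utterance. A sketch, assuming word-aligned input, per-frame F0 values (with 0 marking unvoiced frames), and per-frame energies — all input conventions are assumptions.

```python
def average_speed(word_count, duration_s):
    # Words per second over the utterance.
    return word_count / duration_s

def average_pitch(f0_per_frame):
    # Mean F0 over voiced frames only (F0 == 0 marks unvoiced frames).
    voiced = [f for f in f0_per_frame if f > 0]
    return sum(voiced) / len(voiced)

def average_volume(frame_energies):
    return sum(frame_energies) / len(frame_energies)
```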
9. A speech translation apparatus, comprising:
a voice data acquiring unit, configured to acquire first voice data;
a recognition text generation unit, configured to generate a speech recognition text by performing speech recognition on the first voice data;
an acoustic feature extraction unit, configured to extract a target acoustic feature from the first voice data;
a translated text generation unit, configured to translate the speech recognition text according to the target acoustic feature to obtain a translated text carrying the voice style of the first voice data.
10. The apparatus according to claim 9, wherein the acoustic feature extraction unit comprises:
a unit text determining subunit, configured to take the speech recognition text as a unit recognition text, or to take each text fragment forming the speech recognition text as a unit recognition text;
a speech segment determining subunit, configured to determine a speech segment in the first voice data corresponding to the unit recognition text;
an acoustic feature determining subunit, configured to determine the target acoustic feature of the speech segment.
11. The apparatus according to claim 10, wherein the translated text generation unit comprises:
a text vectorization subunit, configured to vectorize each unit recognition text;
a feature vectorization subunit, configured to vectorize the target acoustic feature corresponding to each unit recognition text;
a translated text generation subunit, configured to take the text vectorization result and the feature vectorization result as input features of a pre-built translation model, so as to translate the speech recognition text using the translation model and obtain a translated text carrying the voice style of the first voice data.
12. The apparatus according to claim 11, wherein, if the target acoustic feature includes at least one feature type, the apparatus further comprises:
a value range determining unit, configured to determine a value range corresponding to the feature type according to pre-collected sample voice data;
an interval determining unit, configured to divide the value range into at least two intervals;
wherein the feature vectorization subunit comprises:
a feature value determining subunit, configured to determine, for the target acoustic feature corresponding to each unit recognition text, a feature value corresponding to each feature type in the target acoustic feature;
a vectorization processing subunit, configured to vectorize the feature value according to the interval of the value range in which it falls.
13. The apparatus according to claim 9, further comprising:
a translated speech generation unit, configured to perform speech synthesis on the translated text according to the target acoustic feature to obtain second voice data carrying the voice style of the first voice data.
14. The apparatus according to claim 13, wherein the translated speech generation unit comprises:
a model parameter adjusting subunit, configured to adjust model parameters of a pre-built synthesis model using the target acoustic feature;
a first speech generation subunit, configured to perform speech synthesis on the translated text using the adjusted synthesis model;
alternatively, the translated speech generation unit comprises:
a second speech generation subunit, configured to perform speech synthesis on the translated text using a pre-built synthesis model to obtain initial voice data;
a voice data adjusting subunit, configured to adjust the acoustic features of the initial voice data using the target acoustic feature.
15. The apparatus according to any one of claims 9 to 14, wherein the target acoustic feature includes one or more of the following feature types: average speech rate, average pitch, and average volume.
16. A speech translation apparatus, comprising: a processor, a memory, and a system bus;
wherein the processor and the memory are connected via the system bus;
and the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810032112.3A CN108231062B (en) | 2018-01-12 | 2018-01-12 | Voice translation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108231062A true CN108231062A (en) | 2018-06-29 |
CN108231062B CN108231062B (en) | 2020-12-22 |
Family
ID=62641526
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810032112.3A Active CN108231062B (en) | 2018-01-12 | 2018-01-12 | Voice translation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108231062B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080133245A1 (en) * | 2006-12-04 | 2008-06-05 | Sehda, Inc. | Methods for speech-to-speech translation |
CN101281518A (en) * | 2007-03-28 | 2008-10-08 | 株式会社东芝 | Speech translation apparatus, method and program |
CN101373592A (en) * | 2007-08-21 | 2009-02-25 | 株式会社东芝 | Speech translation apparatus and method |
CN101727904A (en) * | 2008-10-31 | 2010-06-09 | 国际商业机器公司 | Voice translation method and device |
CN101937431A (en) * | 2010-08-18 | 2011-01-05 | 华南理工大学 | Emotional voice translation device and processing method |
CN105786801A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Speech translation method, communication method and related device |
CN106339371A (en) * | 2016-08-30 | 2017-01-18 | 齐鲁工业大学 | English and Chinese word meaning mapping method and device based on word vectors |
US20170031899A1 (en) * | 2015-07-31 | 2017-02-02 | Samsung Electronics Co., Ltd. | Apparatus and method for determining translation word |
CN107170453A (en) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence |
CN107464559A (en) * | 2017-07-11 | 2017-12-12 | 中国科学院自动化研究所 | Joint forecast model construction method and system based on Chinese rhythm structure and stress |
- 2018-01-12: CN CN201810032112.3A patent CN108231062B (status: Active)
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271645A (en) * | 2018-08-24 | 2019-01-25 | 深圳市福瑞达显示技术有限公司 | A kind of multilingual translation machine based on holographic fan screen |
CN109448458A (en) * | 2018-11-29 | 2019-03-08 | 郑昕匀 | A kind of Oral English Training device, data processing method and storage medium |
CN110008481A (en) * | 2019-04-10 | 2019-07-12 | 南京魔盒信息科技有限公司 | Translated speech generation method, device, computer equipment and storage medium |
CN110008481B (en) * | 2019-04-10 | 2023-04-28 | 南京魔盒信息科技有限公司 | Translated voice generating method, device, computer equipment and storage medium |
CN111865752A (en) * | 2019-04-23 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Text processing device, method, electronic device and computer readable storage medium |
CN112037768A (en) * | 2019-05-14 | 2020-12-04 | 北京三星通信技术研究有限公司 | Voice translation method and device, electronic equipment and computer readable storage medium |
WO2021102647A1 (en) * | 2019-11-25 | 2021-06-03 | 深圳市欢太科技有限公司 | Data processing method and apparatus, and storage medium |
CN111768756A (en) * | 2020-06-24 | 2020-10-13 | 华人运通(上海)云计算科技有限公司 | Information processing method, information processing apparatus, vehicle, and computer storage medium |
CN111768756B (en) * | 2020-06-24 | 2023-10-20 | 华人运通(上海)云计算科技有限公司 | Information processing method, information processing device, vehicle and computer storage medium |
CN111785258A (en) * | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN111785258B (en) * | 2020-07-13 | 2022-02-01 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN112183120A (en) * | 2020-09-18 | 2021-01-05 | 北京字节跳动网络技术有限公司 | Speech translation method, device, equipment and storage medium |
WO2022057637A1 (en) * | 2020-09-18 | 2022-03-24 | 北京字节跳动网络技术有限公司 | Speech translation method and apparatus, and device, and storage medium |
CN112183120B (en) * | 2020-09-18 | 2023-10-20 | 北京字节跳动网络技术有限公司 | Speech translation method, device, equipment and storage medium |
CN112382272A (en) * | 2020-12-11 | 2021-02-19 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium capable of controlling speech speed |
CN112382272B (en) * | 2020-12-11 | 2023-05-23 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium capable of controlling speech speed |
WO2022121187A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus capable of controlling speech speed, and device and storage medium |
WO2023005729A1 (en) * | 2021-07-28 | 2023-02-02 | 北京有竹居网络技术有限公司 | Speech information processing method and apparatus, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN108231062B (en) | 2020-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108231062A (en) | A kind of voice translation method and device | |
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment | |
JP7280386B2 (en) | Multilingual speech synthesis and cross-language voice cloning | |
CN102779508B (en) | Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof | |
JP7395792B2 (en) | 2-level phonetic prosody transcription | |
JP6777768B2 (en) | Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN108447486A (en) | A kind of voice translation method and device | |
CN109767778B (en) | Bi-LSTM and WaveNet fused voice conversion method | |
CN115516552A (en) | Speech recognition using synthesis of unexplained text and speech | |
CN115485766A (en) | Speech synthesis prosody using BERT models | |
CN113506562B (en) | End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features | |
CN112489629A (en) | Voice transcription model, method, medium, and electronic device | |
WO2023245389A1 (en) | Song generation method, apparatus, electronic device, and storage medium | |
JP2024012423A (en) | Predicting parametric vocoder parameter from prosodic feature | |
CN109326278B (en) | Acoustic model construction method and device and electronic equipment | |
CN116364055A (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN114387945A (en) | Voice generation method and device, electronic equipment and storage medium | |
Raghavendra et al. | A multilingual screen reader in Indian languages | |
CN116524898A (en) | Sound video generation method and device, electronic equipment and storage medium | |
CN114333903A (en) | Voice conversion method and device, electronic equipment and storage medium | |
KR102426020B1 (en) | Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker | |
CN116403562B (en) | Speech synthesis method and system based on semantic information automatic prediction pause | |
CN113763924B (en) | Acoustic deep learning model training method, and voice generation method and device | |
Jiang et al. | A Phoneme Sequence Driven Lightweight End-To-End Speech Synthesis Approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||