CN107992485A - Simultaneous interpretation method and device - Google Patents
Simultaneous interpretation method and device
- Publication number: CN107992485A
- Application number: CN201711207834.XA
- Authority
- CN
- China
- Prior art keywords
- source language
- speech data
- target language speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
The embodiment of the present invention provides a simultaneous interpretation method and apparatus. The method includes: collecting source-language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in a target language. The speech translation result is obtained by natural-person speech synthesis (synthesis that reproduces a natural human voice), and the target language and the source language are different languages. The embodiment of the present invention can perform speech recognition and translation automatically, reducing labor cost and improving the accuracy and completeness of the translation result; because the speech translation result carries natural human voice characteristics, the listening experience is effectively improved.
Description
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a simultaneous interpretation method and device.
Background
At present, more and more scenarios require simultaneous interpretation. In the traditional arrangement, speaker A delivers a speech in the source language and a human interpreter B renders it into the target language. This approach requires additional interpreting personnel and offers no automatic speech recognition or translation. Because human interpretation is prone to omissions and mistranslations, the completeness and accuracy of the result are poor. Existing simultaneous interpretation methods therefore suffer from high cost and from poor translation completeness and accuracy.
Summary of the invention
Embodiments of the present invention provide a simultaneous interpretation method and device, intended to solve the technical problems of high cost and poor translation completeness and accuracy found in prior-art simultaneous interpretation methods.
To this end, embodiments of the present invention provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a simultaneous interpretation method, including: collecting source-language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target language and the source language are different languages.
In a second aspect, an embodiment of the present invention provides a simultaneous interpretation device, including: a collecting unit for collecting source-language speech data; and an acquiring unit for obtaining and outputting a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target language and the source language are different languages.
In a third aspect, an embodiment of the present invention provides a device for simultaneous interpretation that includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for: collecting source-language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target language and the source language are different languages.
In a fourth aspect, an embodiment of the present invention provides a machine-readable medium with instructions stored thereon that, when executed by one or more processors, cause a device to perform the simultaneous interpretation method of the first aspect.
The simultaneous interpretation method and device provided by embodiments of the present invention can collect source-language speech data and obtain and output a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in a target language, where the speech translation result is obtained by natural-person speech synthesis and the target language and the source language are different languages. Unlike the prior art, which depends on human interpretation, the method provided by embodiments of the present invention can perform speech recognition and translation automatically, reducing labor cost, improving efficiency, and effectively increasing the completeness and accuracy of the translation result. Furthermore, because the speech translation result is obtained by natural-person speech synthesis, it sounds natural and warm to the audience, significantly improving the quality of the simultaneous interpretation and the listening experience.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the simultaneous interpretation method provided by one embodiment of the invention;
Fig. 2 is a flow diagram of the simultaneous interpretation method provided by another embodiment of the invention;
Fig. 3 is a schematic diagram of the simultaneous interpretation device provided by one embodiment of the invention;
Fig. 4 is a block diagram of a device for simultaneous interpretation according to an exemplary embodiment;
Fig. 5 is a block diagram of the server according to an exemplary embodiment.
Detailed description of the embodiments
Embodiments of the present invention provide a simultaneous interpretation method and device that can perform speech recognition and translation automatically, reduce labor cost, and improve the accuracy and completeness of the translation result; the speech translation result carries natural human voice characteristics, effectively improving the listening experience.
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention; all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The simultaneous interpretation methods of exemplary embodiments of the present invention are introduced below with reference to Figs. 1 and 2.
Referring to Fig. 1, which is a flow diagram of the simultaneous interpretation method provided by one embodiment of the invention, the method may include:
S101: collect source-language speech data.
In a specific implementation, source-language speech data can be collected by an audio collection unit such as a microphone. The source-language data is the data to be translated into the target language. For example, a user speaks Chinese at a meeting and expects simultaneous interpretation with English as the target language; the client then collects the source-language speech data through the user's microphone. Before or after collecting the audio, the user may also select the desired target language through a user interface, and the translation result is expressed in that target language. Usually, the source language and the target language are different languages.
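As a purely illustrative sketch of the collection step, the following Python snippet records a few seconds of mono audio from the default microphone; the sounddevice library, the 16 kHz sample rate, and the buffer handling are assumptions made for illustration and are not prescribed by the patent.

```python
import sounddevice as sd

SAMPLE_RATE = 16000  # assumed rate; common for speech-recognition front-ends

def record_source_speech(seconds: float):
    """Record mono source-language audio from the default microphone."""
    frames = int(seconds * SAMPLE_RATE)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()  # block until the recording buffer is filled
    return audio.reshape(-1)  # 1-D PCM samples, ready for serialization

if __name__ == "__main__":
    pcm = record_source_speech(3.0)
    print(f"captured {len(pcm)} samples")
```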
S102: obtain and output a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in the target language.
In some embodiments, besides the speech translation result, a text translation result that corresponds to the recognition result of the source-language speech data and is expressed in the target language can also be obtained and output. The recognition result of the source-language speech data itself may, of course, also be obtained and output.
In some embodiments, two or more audio streams can be collected at the same time, and speech translation results are output for each stream separately. In this case, collecting source-language speech data includes collecting at least two streams of source-language speech data that differ in source and in voice characteristics, and obtaining and outputting the speech translation result includes obtaining and outputting, separately, at least two speech translation results that correspond to the recognition results of the at least two streams and are expressed in the target language. The voice characteristics include timbre features and style features. For example, in a meeting or salon where a man and a woman converse, the two voices can be collected separately, and an English synthesized translation in a female voice and one in a male voice can be output respectively.
In some embodiments, the speech translation result is obtained by natural-person speech synthesis, and the target language differs from the source language. For example, the speech translation result may carry the timbre features and/or style features of a target audio. Specifically, in response to a user's selection or switching of a target timbre type and/or target style type, the timbre type and/or style type corresponding to the operation is determined, and the timbre features and/or style features of the target audio corresponding to that timbre and/or style are obtained. Obtaining and outputting the speech translation result then includes obtaining and outputting a synthesized speech translation result that corresponds to the recognition result of the source-language speech data, is expressed in the target language, and matches the target timbre and/or target style: the synthesized result has the timbre features of the target timbre and/or the style features of the target style.
For example, a user interface can be provided through which the user selects the synthesized audio, including its timbre type and/or style type. Timbre types may include, for example, a sweet female voice, a child's voice, a husky male voice, a mature male voice, and so on. Correspondingly, timbre features generally comprise spectral features, energy features, and the like. Style features are generally used to characterize a person's manner of speaking, speech habits, or expressive power; in embodiments of the present invention they refer to at least one of duration prosodic features (which correlate strongly with duration and rhythmic variation), fundamental-frequency features, and energy features. Duration prosodic features generally include the duration of each character or word, pauses, and whether a syllable is stressed.
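To make the feature inventory above concrete, here is a minimal sketch of how the two feature bundles might be represented in code; the field names and array shapes are assumptions for illustration, not definitions taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TimbreFeatures:
    spectrum: np.ndarray  # per-frame spectral envelope, shape (frames, bins)
    f0: np.ndarray        # per-frame fundamental frequency, shape (frames,)
    energy: np.ndarray    # per-frame energy, shape (frames,)

@dataclass
class StyleFeatures:
    durations: np.ndarray  # per character/word duration, in frames
    pauses: np.ndarray     # pause length after each unit, in frames
    stress: np.ndarray     # 1 where the unit is stressed, else 0
    f0: np.ndarray         # style-level F0 contour statistics
    energy: np.ndarray     # style-level energy statistics
```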
When the speech translation result is synthesized, it is generally produced from the text translation result corresponding to the recognition result together with the target audio. Specifically, text feature data can be determined from the text translation result, and speech synthesis is performed with the text feature data and the timbre features and/or style features of the target audio, yielding synthesized speech data as the speech translation result.
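The end-to-end flow just described can be summarized in a short sketch; the stub recognize/translate/synthesize functions below are placeholders standing in for real recognition, translation, and synthesis engines, and none of their names or signatures come from the patent.

```python
def recognize(pcm):                          # stub ASR engine
    return "大家好"

def translate(text, target_lang):            # stub MT engine
    return "hello everyone"

def text_features(text):                     # stub linguistic front-end
    return {"phones": list(text)}

def synthesize(feats, timbre, style=None):   # stub acoustic model + vocoder
    return b"\x00\x00" * 16000               # placeholder PCM

def interpret(pcm, target_lang, timbre, style=None):
    """ASR -> MT -> text features -> synthesis conditioned on timbre/style."""
    text = recognize(pcm)
    translated = translate(text, target_lang)
    feats = text_features(translated)
    return synthesize(feats, timbre, style)
```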
In some embodiments, the speech translation result carries the timbre features of the source-language data. Specifically, the source-language speech data can be recognized and a speech translation result corresponding to the recognition result obtained, where the speech translation result is synthesized from the source-language speech data together with the text translation result corresponding to the recognition result; the text translation result is expressed in the target language, which differs from the source language; and the speech translation result carries at least the timbre features of the source-language speech data.
In one implementation, the timbre features of the speaker of the source-language speech data can be obtained, and the speech translation result is presented in the target language, achieving a 'same-voice' translation. Specifically, synthesizing the speech translation result from the source-language speech data and the text translation result corresponding to the recognition result includes:
(1) Determine text feature data from the text translation result.
It should be noted that after the source-language speech data is collected, speech recognition can be performed on the audio data to obtain a recognition result. The recognition result is then translated to obtain a text translation result expressed in the target language, and text feature data is determined from that text translation result. In a specific implementation, text feature data can be obtained for any given text through text analysis; the present invention does not restrict how the text feature data is obtained, and existing methods may be used.
(2) Obtain the timbre features of the source-language speech data.
The timbre features of the source-language speech data generally comprise the spectral features, fundamental-frequency features, and the like of the audio data.
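One way to obtain such spectral and fundamental-frequency features, offered only as an illustration, is the WORLD-style analysis exposed by the pyworld package; the patent does not prescribe any particular extractor, so treat this as an assumed toolchain.

```python
import numpy as np
import pyworld as pw

def extract_timbre(pcm: np.ndarray, fs: int):
    """Extract per-frame F0 and spectral envelope as a simple timbre proxy."""
    x = pcm.astype(np.float64)       # pyworld expects float64 samples
    f0, t = pw.dio(x, fs)            # coarse F0 estimation
    f0 = pw.stonemask(x, f0, t, fs)  # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs) # spectral envelope estimation
    return f0, sp
```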
(3) Perform speech synthesis with the timbre features of the source-language speech data and the text feature data to obtain synthesized speech data as the speech translation result.
In one possible implementation, speech synthesis is performed with the spectral features and/or fundamental-frequency features of the source-language speech data together with the text feature data, yielding synthesized speech data that carries the timbre of the source-language speaker but is presented in the target language. For example, if user A says '你好' in Chinese, the resulting synthesized speech is the English 'hello', and that English speech carries user A's timbre.
In another possible implementation, the style features of the source-language data can also be obtained. The style features include at least one of duration prosodic features, fundamental-frequency features, and energy features; as noted above, style features characterize a person's manner of speaking, speech habits, or expressive power, and the duration prosodic features generally include the duration of each character or word, pauses, and whether a syllable is stressed.
Correspondingly, performing speech synthesis with the timbre features of the source-language speech data and the text feature data to obtain the target-language speech translation result includes: performing speech synthesis with the timbre features of the source-language speech data, the style features of the source-language data, and the text feature data, obtaining synthesized speech data as the speech translation result. The speech translation result then carries both the timbre features and the style features of the source-language speech data, so that the translation is rendered not only in the 'same voice' but also in the 'same style' as the original speaker.
In some possible implementations, the style features of a target-style audio can be obtained instead; these style features likewise include at least one of duration prosodic features, fundamental-frequency features, and energy features, and the target-style audio corresponds to the target language. For example, the source language is Chinese and the target-style audio is English with particular style features, such as Donald Trump's manner of speaking.
Correspondingly, performing speech synthesis with the timbre features of the source-language speech data and the text feature data to obtain the target-language speech translation result includes: performing speech synthesis with the timbre features of the source-language speech data, the style features of the target-style audio, and the text feature data, obtaining synthesized speech data as the speech translation result. The speech translation result then carries the timbre features of the source-language speech data and the style features of the target-style audio. That is, in this implementation the speech translation result retains the original speaker's timbre but adopts the speaking style of another speaker of the target language, so that it better matches the habits of the target-language audience: a 'same-voice, different-style' interpretation.
In some embodiments, in response to a user's selection or switching of a target style, the target style corresponding to the operation is determined, and the style features of the target-style audio corresponding to that target style are obtained. For example, several styles can be offered for the user to select or switch among.
In some embodiments, performing speech synthesis with the timbre features of the source-language speech data, the style features of the target-style audio, and the text feature data to obtain synthesized speech data as the speech translation result includes the following steps:
A. Obtain the acoustic feature data of the source-language speech data from the text feature data, the duration prosodic features of the target-style audio, and the timbre features of the source-language speech data.
In some possible embodiments, this step includes: determining a target duration from the duration prosodic features of the target-style audio; and obtaining the acoustic feature data of the source-language speech data from the text feature data, the target duration, and the timbre features of the source-language speech data. In this implementation, the target duration is determined from the duration prosodic features of the target-style audio, replacing the prior-art approach of predicting duration from the source-language speech data.
In other possible embodiments, this step includes: obtaining a predicted duration from the text feature data and the duration features of the source-language speech data; performing linear interpolation between the predicted duration and a target duration to obtain interpolated duration features, the target duration being determined from the duration prosodic features of the target-style audio data; and obtaining the acoustic feature data of the source-language speech data from the text feature data, the interpolated duration features, and the timbre features of the source-language speech data. It should be noted that the synthesis can become unstable when sounds are drawn out; interpolating the source-predicted duration toward the target duration, and obtaining the acoustic feature data from the interpolated duration features and the timbre features of the source-language speech data, improves this situation.
B. Fuse the fundamental-frequency features and/or energy features of the target-style audio with the acoustic feature data of the source-language speech data to obtain fused acoustic feature data.
In a specific implementation, the fundamental-frequency features and/or energy features of the target-style audio are fused, respectively, with the fundamental-frequency features and/or energy features within the acoustic feature data of the source-language speech data, yielding the fused acoustic feature data.
The feature fusion algorithm can be chosen very flexibly; one example is:

S_tr(n) = (T(n) * S_mean / T_mean) * w + S(n) * (1 - w), where 0 ≤ w ≤ 1.0

where S_tr(n) is the fused fundamental-frequency (or energy) feature of frame n; S(n) is the fundamental-frequency (or energy) feature predicted for frame n of the source speaker at synthesis time; T(n) is the extracted fundamental-frequency (or energy) feature of frame n of the target speaker; S_mean is the feature mean over the source speaker's voice corpus; T_mean is the corresponding feature mean of the target speaker's audio; and w is the fusion coefficient.
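The same fusion written as a short sketch; the frame-wise arrays and the mean statistics mirror the symbols of the formula above, and the function name is illustrative only.

```python
import numpy as np

def fuse_feature(source: np.ndarray, target: np.ndarray,
                 source_mean: float, target_mean: float,
                 w: float) -> np.ndarray:
    """Frame-wise fusion of an F0 (or energy) track.

    source:  S(n), predicted for the source speaker at synthesis time
    target:  T(n), extracted from the target-style speaker
    source_mean / target_mean: corpus-level means S_mean and T_mean
    w: fusion coefficient in [0, 1]
    """
    assert 0.0 <= w <= 1.0 and source.shape == target.shape
    rescaled = target * (source_mean / target_mean)  # shift target track into
    return rescaled * w + source * (1.0 - w)         # the source speaker's range
```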
In embodiments where the duration features are obtained by linear interpolation between the predicted duration and the target duration, the method further includes, after step B: performing linear interpolation processing on the fused acoustic feature data so that the duration of the fused acoustic feature data is consistent with the target duration.
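Stretching the fused feature track so that its frame count matches the target duration can be done with ordinary linear resampling; the one-dimensional track and the use of numpy interpolation here are illustrative assumptions.

```python
import numpy as np

def stretch_to_length(track: np.ndarray, target_frames: int) -> np.ndarray:
    """Linearly resample a per-frame feature track so its length matches
    the target duration in frames."""
    src = np.linspace(0.0, 1.0, num=len(track))
    dst = np.linspace(0.0, 1.0, num=target_frames)
    return np.interp(dst, src, track)
```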
C. Convert the acoustic feature data into a speech waveform, obtaining synthesized speech data that has the style features of the target-style audio and the timbre features of the source-language speech data.
Unlike the preceding embodiment, which derives the acoustic features of the source audio using the duration prosodic features of the target-style audio, other embodiments may obtain a predicted duration from the duration features of the source audio, fuse the remaining style features of the target-style audio to obtain the fused acoustic features, and then interpolate those acoustic features to reduce the drawling effect.
Specifically, in this variant, performing speech synthesis with the timbre features of the source-language speech data, the style features of the target-style audio, and the text feature data to obtain synthesized speech data as the speech translation result includes:
A'. Obtain the acoustic feature data of the source-language speech data from the text feature data and the timbre features and duration features of the source-language speech data.
In this implementation, a predicted duration can be obtained from the text feature data and the duration features of the source-language speech data, and the acoustic feature data is obtained from that predicted duration and the timbre features of the source-language speech data.
B'. Fuse the fundamental-frequency features and/or energy features of the target-style audio with the acoustic feature data of the source-language speech data to obtain fused acoustic feature data.
C'. Perform linear interpolation processing on the fused acoustic feature data so that its duration is consistent with the target duration, the target duration being determined from the duration prosodic features of the target-style audio data.
D'. Convert the processed acoustic feature data into a speech waveform, obtaining synthesized speech data that has the style features of the target-style audio and the timbre features of the source-language speech data.
In some embodiments, in order to remove the stylistic imprint of the source speaker from the source-language speech data, the style information of the source-language speech data can be removed. When synthesizing the speech data, speech synthesis is then performed with the text feature data, the style features of the target-style audio, and the timbre features of the style-stripped source-language speech data, yielding the synthesized speech data.
In embodiments of the present invention, the style features of a target-style audio can thus be fused into the source-language speech data, so that the synthesized speech carries richer prosodic features and greater expressive power, effectively improving the quality of the speech synthesis.
It should be noted that embodiments of the present invention do not restrict the executing entity: the above steps may be performed by the client, by the server, or partly by the client and partly by the server.
To help those skilled in the art understand the embodiments more clearly in a concrete scenario, the embodiments are introduced below with a specific example. Note that the specific example only serves to help those skilled in the art understand the present invention more clearly; embodiments of the present invention are not limited to it.
S201: the client collects source-language speech data.
In a specific implementation, no manual operation is required after the client starts: as soon as the speaker says 'start simultaneous interpretation', the recognition and translation function is activated, that is, execution of S202 begins.
S202: the client serializes the source-language speech data, obtaining serialized source-language speech data.
S203: the client sends the serialized source-language speech data to the server.
S204: the server deserializes the received serialized source-language data.
S205: the server performs speech recognition on the source-language speech data, obtaining a speech recognition result.
S206: the server translates the speech recognition result, obtaining a text translation result.
S207: the server obtains a synthesized speech translation result from the text translation result and the source-language speech data. For the specific implementation, refer to the method shown in Fig. 1.
S208: the server sends the synthesized speech translation result and the recognition result to the client.
S209: the client outputs the synthesized speech translation result.
The client can output the synthesized speech translation result in the form of speech.
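A compact sketch of the S201-S209 exchange, assuming a JSON wire format with base64-encoded PCM for serialization; the field names and the stub engines are illustrative, since the patent only requires that the audio be serialized, sent, deserialized, and processed.

```python
import base64
import json

def serialize_audio(pcm_bytes: bytes) -> str:
    """S202: turn raw PCM into a transport-safe string."""
    return base64.b64encode(pcm_bytes).decode("ascii")

def deserialize_audio(payload: str) -> bytes:
    """S204: recover raw PCM on the server."""
    return base64.b64decode(payload)

def recognize(pcm): return "大家好"                 # stub ASR (S205)
def translate(text, lang): return "hello everyone"  # stub MT  (S206)
def synthesize(text, ref_pcm): return b"\x00\x00"   # stub TTS (S207)

def handle_request(body: str) -> str:
    """Server side of S204-S208."""
    pcm = deserialize_audio(json.loads(body)["audio"])
    text = recognize(pcm)
    translation = translate(text, "en")
    voice = synthesize(translation, pcm)
    return json.dumps({"recognition": text,
                       "translation": translation,
                       "audio": serialize_audio(voice)})

if __name__ == "__main__":
    request = json.dumps({"audio": serialize_audio(b"\x01\x02")})
    print(handle_request(request))  # the client then plays the audio (S209)
```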
The device and equipment corresponding to the method provided by the embodiments of the present invention are introduced below.
Referring to Fig. 3, which is a schematic diagram of the simultaneous interpretation device provided by one embodiment of the invention.
A simultaneous interpretation device 300 includes:
a collecting unit 301 for collecting source-language speech data, whose specific implementation may follow step S101 of the embodiment shown in Fig. 1; and
an acquiring unit 302 for obtaining and outputting a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in the target language, where the speech translation result is obtained by natural-person speech synthesis and the target language differs from the source language; its specific implementation may follow step S102 of the embodiment shown in Fig. 1.
In some embodiments, the device further includes: a text output unit for obtaining and outputting a text translation result that corresponds to the recognition result of the source-language speech data and is expressed in the target language; and/or a recognition result output unit for obtaining and outputting the recognition result of the source-language speech data.
In some embodiments, the collecting unit is specifically configured to collect at least two streams of source-language speech data that differ in source and in voice characteristics, and the acquiring unit is specifically configured to obtain and output, separately, at least two speech translation results that correspond to the recognition results of the at least two streams and are expressed in the target language.
In some embodiments, the device further includes a determination unit that, in response to a user's selection or switching of a target timbre type and/or target style type, determines the timbre type and/or style type corresponding to the operation and obtains the timbre features and/or style features of the target audio corresponding to that timbre and/or style. The acquiring unit is then specifically configured to obtain and output a synthesized speech translation result that corresponds to the recognition result of the source-language speech data, is expressed in the target language, and matches the target timbre and/or target style; the synthesized result has the timbre features of the target timbre and/or the style features of the target style.
In some embodiments, the acquiring unit is specifically configured to obtain and output a speech translation result carrying the timbre features of the source-language speech data, the result being synthesized from the source-language speech data and the text translation result corresponding to the recognition result; the text translation result is expressed in the target language.
In some implementations, the acquiring unit includes: a text feature data determination unit for determining text feature data from the text translation result; a timbre feature determination unit for obtaining the timbre features of the source-language speech data; and a speech synthesis unit for performing speech synthesis with the timbre features of the source-language speech data and the text feature data to obtain synthesized speech data as the speech translation result.
In some embodiments, the device further includes a first style feature determination unit for obtaining the style features of the source-language data, the style features including at least one of duration prosodic features, fundamental-frequency features, and energy features. The speech synthesis unit then includes a first speech synthesis unit for performing speech synthesis with the timbre features of the source-language speech data, the style features of the source-language data, and the text feature data, obtaining synthesized speech data as the speech translation result; the result carries the timbre features and the style features of the source-language speech data.
In some embodiments, the device further includes a second style feature determination unit for obtaining the style features of a target-style audio, the style features including at least one of duration prosodic features, fundamental-frequency features, and energy features, the target-style audio corresponding to the target language. The speech synthesis unit then includes a second speech synthesis unit for performing speech synthesis with the timbre features of the source-language speech data, the style features of the target-style audio, and the text feature data, obtaining synthesized speech data as the speech translation result; the result carries the timbre of the source-language speech data and the style features of the target-style audio.
In some embodiments, the second speech synthesis unit includes: a first fusion unit for obtaining the acoustic feature data of the source-language speech data from the text feature data, the duration prosodic features of the target-style audio, and the timbre features of the source-language speech data; a second fusion unit for fusing the fundamental-frequency features and/or energy features of the target-style audio with the acoustic feature data of the source-language speech data to obtain fused acoustic feature data; and a first conversion unit for converting the acoustic feature data into a speech waveform, obtaining synthesized speech data that has the style features of the target-style audio and the timbre features of the source-language speech data.
In other embodiments, the second speech synthesis unit includes: an acoustic feature prediction unit for obtaining the acoustic feature data of the source-language speech data from the text feature data and the timbre features and duration features of the source-language speech data; a third fusion unit for fusing the fundamental-frequency features and/or energy features of the target-style audio with the acoustic feature data of the source-language speech data to obtain fused acoustic feature data; a feature interpolation unit for performing linear interpolation processing on the fused acoustic feature data so that its duration is consistent with the target duration, the target duration being determined from the duration prosodic features of the target-style audio data; and a second conversion unit for converting the processed acoustic feature data into a speech waveform, obtaining synthesized speech data that has the style features of the target-style audio and the timbre features of the source-language speech data.
The configuration of each unit or module of the device of the present invention may follow the methods shown in Figs. 1 and 2 and is not repeated here.
Referring to Fig. 4, which is a block diagram of a device 400 for simultaneous interpretation according to an exemplary embodiment. For example, the device 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 4, the device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls the overall operation of the device 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 402 may include one or more processors 420 to execute instructions so as to complete all or part of the steps of the method described above. In addition, the processing component 402 may include one or more modules to facilitate interaction with other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operation on the device 400. Examples of such data include instructions for any application or method operated on the device 400, contact data, phone-book data, messages, pictures, video, and so on. The memory 404 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 406 supplies power to the various components of the device 400. The power component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 400.
The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it is implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the panel; the touch sensors can sense not only the boundary of a touch or slide action but also the duration and pressure associated with it. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the device 400 is in an operating mode, such as shooting mode or video mode, the front camera and/or rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC) configured to receive external audio signals when the device 400 is in an operating mode, such as call mode, recording mode, or speech recognition mode. The received audio signals may be further stored in the memory 404 or sent via the communication component 416. In some embodiments, the audio component 410 also includes a loudspeaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing state assessments of various aspects of the device 400. For example, the sensor component 414 can detect the open/closed state of the device 400 and the relative positioning of components (for example, the display and keypad of the device 400), a change in the position of the device 400 or of one of its components, the presence or absence of user contact with the device 400, and the orientation, acceleration/deceleration, or temperature change of the device 400. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, and may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the device 400 and other equipment. The device 400 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication; the NFC module may be implemented based on, for example, radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
Specifically, an embodiment of the present invention provides a simultaneous interpretation device 400 that includes a memory 404 and one or more programs, where the one or more programs are stored in the memory 404 and configured to be executed by one or more processors 420, and the one or more programs include instructions for: collecting source-language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source-language speech data and is expressed in a target language, where the speech translation result is obtained by natural-person speech synthesis and the target language and the source language are different languages.
Further, the processor 420 is also configured to execute instructions in the one or more programs for: obtaining and outputting a text translation result that corresponds to the recognition result of the source-language speech data and is expressed in the target language; and/or obtaining and outputting the recognition result of the source-language speech data.
Further, the processor 420 is also configured to execute instructions for: collecting at least two streams of source-language speech data that differ in source and in voice characteristics; and obtaining and outputting, separately, at least two speech translation results that correspond to the recognition results of the at least two streams and are expressed in the target language.
Further, the processor 420 is also configured to execute instructions for: in response to a user's selection or switching of a target timbre type and/or target style type, determining the timbre type and/or style type corresponding to the operation, and obtaining the timbre features and/or style features of the target audio corresponding to that timbre and/or style; and obtaining and outputting a synthesized speech translation result that corresponds to the recognition result of the source-language speech data, is expressed in the target language, and matches the target timbre and/or target style; the synthesized result has the timbre features of the target timbre and/or the style features of the target style.
Further, the processor 420 is also configured to execute instructions for: obtaining and outputting a speech translation result carrying the timbre features of the source-language speech data, the result being synthesized from the source-language speech data and the text translation result corresponding to the recognition result; the text translation result is expressed in the target language.
Further, the processor 420 is also configured to execute instructions for: determining text feature data from the text translation result; obtaining the timbre features of the source-language speech data; and performing speech synthesis with the timbre features of the source-language speech data and the text feature data to obtain synthesized speech data as the speech translation result.
Further, the processor 420 is also configured to execute instructions for: obtaining the style features of the source-language data, the style features including at least one of duration prosodic features, fundamental-frequency features, and energy features; and performing speech synthesis with the timbre features of the source-language speech data, the style features of the source-language data, and the text feature data to obtain synthesized speech data as the speech translation result, which carries the timbre features and style features of the source-language speech data.
Further, the processor 420 is also configured to execute instructions for: obtaining the style features of a target-style audio, the style features including at least one of duration prosodic features, fundamental-frequency features, and energy features, the target-style audio corresponding to the target language; and performing speech synthesis with the timbre features of the source-language speech data, the style features of the target-style audio, and the text feature data to obtain synthesized speech data as the speech translation result, which carries the timbre features of the source-language speech data and the style features of the target-style audio.
Further, the processor 420 is also configured to execute instructions for: obtaining the acoustic feature data of the source-language speech data from the text feature data, the duration prosodic features of the target-style audio, and the timbre features of the source-language speech data; fusing the fundamental-frequency features and/or energy features of the target-style audio with the acoustic feature data of the source-language speech data to obtain fused acoustic feature data; and converting the acoustic feature data into a speech waveform to obtain synthesized speech data that has the style features of the target-style audio and the timbre features of the source-language speech data.
Further, the processor 420 is also configured to execute instructions for: obtaining the acoustic feature data of the source-language speech data from the text feature data and the timbre features and duration features of the source-language speech data; fusing the fundamental-frequency features and/or energy features of the target-style audio with the acoustic feature data of the source-language speech data to obtain fused acoustic feature data; performing linear interpolation processing on the fused acoustic feature data so that its duration is consistent with the target duration, the target duration being determined from the duration prosodic features of the target-style audio data; and converting the processed acoustic feature data into a speech waveform to obtain synthesized speech data that has the style features of the target-style audio and the timbre features of the source-language speech data.
Further, the processor 420 is also configured to execute instructions for: in response to a user's selection or switching of a target style, determining the target style corresponding to the operation, and obtaining the style features of the target-style audio corresponding to that target style.
In the exemplary embodiment, a kind of non-transitorycomputer readable storage medium including instructing, example are additionally provided
Such as include the memory 404 of instruction, above-metioned instruction can be performed to complete the above method by the processor 420 of device 400.For example,
The non-transitorycomputer readable storage medium can be ROM, random access memory (RAM), CD-ROM, tape, floppy disk
With optical data storage devices etc..
A machine-readable medium is also provided; for example, the machine-readable medium may be a non-transitory computer-readable storage medium. When the instructions in the medium are executed by the processor of a device (terminal or server), the device is enabled to perform a simultaneous interpretation method, the method including: collecting source language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target speech is in a language different from the source language.
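End to end, the stored method follows a recognize-translate-synthesize loop. A minimal sketch with stub engines standing in for real ASR, machine translation, and speech synthesis components (every class and method name here is a placeholder, not an API defined by the patent):

```python
class Recognizer:
    def recognize(self, audio):    # stub ASR: audio -> source-language text
        return "你好，世界"

class Translator:
    def translate(self, text):     # stub MT: source text -> target-language text
        return "Hello, world"

class Synthesizer:
    def synthesize(self, text):    # stub TTS: text -> waveform bytes
        return text.encode("utf-8")

def interpret(audio_chunk, asr=Recognizer(), mt=Translator(), tts=Synthesizer()):
    """Collect source speech, recognize it, translate the recognition result,
    and synthesize target-language speech as the speech translation result."""
    source_text = asr.recognize(audio_chunk)
    target_text = mt.translate(source_text)
    return tts.synthesize(target_text)

print(interpret(b"\x00\x01"))  # b'Hello, world'
```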
Fig. 5 is a structural diagram of the server in an embodiment of the present invention. The server 500 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 522 (for example, one or more processors), memory 532, and one or more storage media 530 (for example, one or more mass storage devices) that store application programs 542 or data 544. The memory 532 and the storage medium 530 may provide transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and to execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 556, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope; the scope of the present invention is limited only by the appended claims. The foregoing are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.
It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise" and "include", and any other variants thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element. The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network; in a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and the relevant parts may refer to the description of the method embodiments. The device embodiments described above are merely schematic: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment's solution, which those of ordinary skill in the art can understand and implement without creative effort. The above are merely embodiments of the present invention; it should be pointed out that those skilled in the art may make further improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
- 1. A simultaneous interpretation method, characterized by comprising: collecting source language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target speech is in a language different from the source language.
- 2. The method according to claim 1, characterized in that the method further comprises: obtaining and outputting a text translation result that corresponds to the recognition result of the source language speech data and is expressed in the target language; and/or obtaining and outputting the recognition result of the source language speech data.
- 3. The method according to claim 1, characterized in that collecting source language speech data comprises: collecting at least two source language speech data that differ in source and in speech features; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source language speech data and is expressed in the target language comprises: separately obtaining and outputting at least two speech translation results that correspond to the recognition results of the at least two source language speech data and are expressed in the target language.
- 4. The method according to claim 1, characterized in that the method further comprises: in response to a user's selection or switching operation for a target timbre type and/or a target style type, determining the target timbre type and/or target style type corresponding to the selection or switching operation, and obtaining the timbre characteristics and/or style features of the target audio corresponding to the target timbre and/or target style; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source language speech data and is expressed in the target language comprises: obtaining and outputting a synthesized speech translation result that corresponds to the recognition result of the source language speech data, is expressed in the target language, and corresponds to the target timbre and/or target style; the synthesized speech translation result has timbre characteristics corresponding to the target timbre, and/or the synthesized speech translation result has style features corresponding to the target style.
- 5. The method according to claim 1, characterized in that the speech translation result has the timbre characteristics of the source language speech data, and the speech translation result is synthesized from the source language speech data and a text translation result corresponding to the recognition result; the text translation result is expressed in the target language.
- 6. The method according to claim 1, characterized in that synthesizing the speech translation result from the source language speech data and the text translation result corresponding to the recognition result comprises: determining text feature data according to the text translation result; obtaining the timbre characteristics of the source language speech data; and performing speech synthesis according to the timbre characteristics of the source language speech data and the text feature data to obtain synthesized speech data as the speech translation result.
- 7. The method according to claim 6, characterized in that the method further comprises: obtaining the style features of the source language speech data, the style features including at least one of duration prosodic features, fundamental frequency features, and energy features; and performing speech synthesis according to the timbre characteristics of the source language speech data and the text feature data to obtain the speech translation result in the target language comprises: performing speech synthesis according to the timbre characteristics of the source language speech data, the style features of the source language speech data, and the text feature data to obtain synthesized speech data as the speech translation result; the speech translation result has the timbre characteristics and style features of the source language speech data.
- 8. A simultaneous interpretation apparatus, characterized by comprising: a collecting unit for collecting source language speech data; and an acquiring unit for obtaining and outputting a speech translation result that corresponds to the recognition result of the source language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target speech is in a language different from the source language.
- 9. A device for simultaneous interpretation, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations: collecting source language speech data; and obtaining and outputting a speech translation result that corresponds to the recognition result of the source language speech data and is expressed in a target language; wherein the speech translation result is obtained by natural-person speech synthesis, and the target speech is in a language different from the source language.
- 10. A machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to perform the simultaneous interpretation method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711207834.XA CN107992485A (en) | 2017-11-27 | 2017-11-27 | A kind of simultaneous interpretation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711207834.XA CN107992485A (en) | 2017-11-27 | 2017-11-27 | A kind of simultaneous interpretation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107992485A true CN107992485A (en) | 2018-05-04 |
Family
ID=62032096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711207834.XA Pending CN107992485A (en) | 2017-11-27 | 2017-11-27 | A kind of simultaneous interpretation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107992485A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108447486A (en) * | 2018-02-28 | 2018-08-24 | 科大讯飞股份有限公司 | A kind of voice translation method and device |
CN109036416A (en) * | 2018-07-02 | 2018-12-18 | 腾讯科技(深圳)有限公司 | simultaneous interpretation method and system, storage medium and electronic device |
CN109448698A (en) * | 2018-10-17 | 2019-03-08 | 深圳壹账通智能科技有限公司 | Simultaneous interpretation method, apparatus, computer equipment and storage medium |
CN110059313A (en) * | 2019-04-03 | 2019-07-26 | 百度在线网络技术(北京)有限公司 | Translation processing method and device |
CN110415680A (en) * | 2018-09-05 | 2019-11-05 | 满金坝(深圳)科技有限公司 | A kind of simultaneous interpretation method, synchronous translation apparatus and a kind of electronic equipment |
CN110473516A (en) * | 2019-09-19 | 2019-11-19 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and electronic equipment |
CN110610720A (en) * | 2019-09-19 | 2019-12-24 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN112365880A (en) * | 2020-11-05 | 2021-02-12 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN112382269A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Audio synthesis method, device, equipment and storage medium |
WO2021102647A1 (en) * | 2019-11-25 | 2021-06-03 | 深圳市欢太科技有限公司 | Data processing method and apparatus, and storage medium |
WO2021134592A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市欢太科技有限公司 | Speech processing method, apparatus and device, and storage medium |
CN113192510A (en) * | 2020-12-29 | 2021-07-30 | 云从科技集团股份有限公司 | Method, system and medium for implementing voice age and/or gender identification service |
CN114781407A (en) * | 2022-04-21 | 2022-07-22 | 语联网(武汉)信息技术有限公司 | Voice real-time translation method and system and visual terminal |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201009613A (en) * | 2008-08-28 | 2010-03-01 | Inventec Corp | System and method for speech translation between classical Chinese and vernacular Chinese |
CN101727904A (en) * | 2008-10-31 | 2010-06-09 | 国际商业机器公司 | Voice translation method and device |
US20100198580A1 (en) * | 2000-10-25 | 2010-08-05 | Robert Glen Klinefelter | System, method, and apparatus for providing interpretive communication on a network |
CN106528547A (en) * | 2016-11-09 | 2017-03-22 | 王东宇 | Translation method for translation machine |
CN106791913A (en) * | 2016-12-30 | 2017-05-31 | 深圳市九洲电器有限公司 | Digital television program simultaneous interpretation output intent and system |
- 2017-11-27: CN CN201711207834.XA patent/CN107992485A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100198580A1 (en) * | 2000-10-25 | 2010-08-05 | Robert Glen Klinefelter | System, method, and apparatus for providing interpretive communication on a network |
TW201009613A (en) * | 2008-08-28 | 2010-03-01 | Inventec Corp | System and method for speech translation between classical Chinese and vernacular Chinese |
CN101727904A (en) * | 2008-10-31 | 2010-06-09 | 国际商业机器公司 | Voice translation method and device |
CN106528547A (en) * | 2016-11-09 | 2017-03-22 | 王东宇 | Translation method for translation machine |
CN106791913A (en) * | 2016-12-30 | 2017-05-31 | 深圳市九洲电器有限公司 | Digital television program simultaneous interpretation output intent and system |
Non-Patent Citations (1)
Title |
---|
张鹏 (Zhang Peng): "Research and Implementation of an Embedded Speech Synthesis System" (嵌入式语音合成系统的研究与实现), China Master's Theses Full-text Database, Information Science and Technology Series (Monthly) *
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108447486A (en) * | 2018-02-28 | 2018-08-24 | 科大讯飞股份有限公司 | A kind of voice translation method and device |
CN109036416A (en) * | 2018-07-02 | 2018-12-18 | 腾讯科技(深圳)有限公司 | simultaneous interpretation method and system, storage medium and electronic device |
CN109036416B (en) * | 2018-07-02 | 2022-12-20 | 腾讯科技(深圳)有限公司 | Simultaneous interpretation method and system, storage medium and electronic device |
CN110415680A (en) * | 2018-09-05 | 2019-11-05 | 满金坝(深圳)科技有限公司 | A kind of simultaneous interpretation method, synchronous translation apparatus and a kind of electronic equipment |
CN110415680B (en) * | 2018-09-05 | 2022-10-04 | 梁志军 | Simultaneous interpretation method, simultaneous interpretation device and electronic equipment |
EP3620939A1 (en) * | 2018-09-05 | 2020-03-11 | Manjinba (Shenzhen) Technology Co., Ltd. | Method and device for simultaneous interpretation based on machine learning |
WO2020048143A1 (en) * | 2018-09-05 | 2020-03-12 | 满金坝(深圳)科技有限公司 | Machine learning-based simultaneous interpretation method and device |
CN109448698A (en) * | 2018-10-17 | 2019-03-08 | 深圳壹账通智能科技有限公司 | Simultaneous interpretation method, apparatus, computer equipment and storage medium |
WO2020077868A1 (en) * | 2018-10-17 | 2020-04-23 | 深圳壹账通智能科技有限公司 | Simultaneous interpretation method and apparatus, computer device and storage medium |
CN110059313B (en) * | 2019-04-03 | 2021-02-12 | 百度在线网络技术(北京)有限公司 | Translation processing method and device |
CN110059313A (en) * | 2019-04-03 | 2019-07-26 | 百度在线网络技术(北京)有限公司 | Translation processing method and device |
CN110610720B (en) * | 2019-09-19 | 2022-02-25 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110610720A (en) * | 2019-09-19 | 2019-12-24 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110473516A (en) * | 2019-09-19 | 2019-11-19 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and electronic equipment |
WO2021051588A1 (en) * | 2019-09-19 | 2021-03-25 | 北京搜狗科技发展有限公司 | Data processing method and apparatus, and apparatus used for data processing |
US11417314B2 (en) | 2019-09-19 | 2022-08-16 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech synthesis method, speech synthesis device, and electronic apparatus |
CN110473516B (en) * | 2019-09-19 | 2020-11-27 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device and electronic equipment |
WO2021102647A1 (en) * | 2019-11-25 | 2021-06-03 | 深圳市欢太科技有限公司 | Data processing method and apparatus, and storage medium |
WO2021134592A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市欢太科技有限公司 | Speech processing method, apparatus and device, and storage medium |
CN112365880A (en) * | 2020-11-05 | 2021-02-12 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN112365880B (en) * | 2020-11-05 | 2024-03-26 | 北京百度网讯科技有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN112382269A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Audio synthesis method, device, equipment and storage medium |
CN113192510A (en) * | 2020-12-29 | 2021-07-30 | 云从科技集团股份有限公司 | Method, system and medium for implementing voice age and/or gender identification service |
CN113192510B (en) * | 2020-12-29 | 2024-04-30 | 云从科技集团股份有限公司 | Method, system and medium for realizing voice age and/or sex identification service |
CN114781407A (en) * | 2022-04-21 | 2022-07-22 | 语联网(武汉)信息技术有限公司 | Voice real-time translation method and system and visual terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107992485A (en) | A kind of simultaneous interpretation method and device | |
CN107705783A (en) | A kind of phoneme synthesizing method and device | |
CN109801644B (en) | Separation method, separation device, electronic equipment and readable medium for mixed sound signal | |
CN108346433A (en) | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing | |
CN109637518A (en) | Virtual newscaster's implementation method and device | |
CN107992812A (en) | A kind of lip reading recognition methods and device | |
CN110634483A (en) | Man-machine interaction method and device, electronic equipment and storage medium | |
CN111508511A (en) | Real-time sound changing method and device | |
CN110188177A (en) | Talk with generation method and device | |
CN110097890A (en) | A kind of method of speech processing, device and the device for speech processes | |
CN111583944A (en) | Sound changing method and device | |
CN107291690A (en) | Punctuate adding method and device, the device added for punctuate | |
CN108198569A (en) | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing | |
CN107644646A (en) | Method of speech processing, device and the device for speech processes | |
CN107211198A (en) | Apparatus and method for content of edit | |
CN104461348B (en) | Information choosing method and device | |
CN106484138B (en) | A kind of input method and device | |
CN107944447A (en) | Image classification method and device | |
CN111199160A (en) | Instant call voice translation method and device and terminal | |
CN108509412A (en) | A kind of data processing method, device, electronic equipment and storage medium | |
CN108073572A (en) | Information processing method and its device, simultaneous interpretation system | |
CN109002184A (en) | A kind of association method and device of input method candidate word | |
CN108648754A (en) | Sound control method and device | |
CN107291704A (en) | Treating method and apparatus, the device for processing | |
CN110730360A (en) | Video uploading and playing methods and devices, client equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||