CN108986820A - Speech translation method and apparatus, electronic device, and storage medium - Google Patents

Speech translation method and apparatus, electronic device, and storage medium

Info

Publication number
CN108986820A
CN108986820A
Authority
CN
China
Prior art keywords
source
language
text
named entity
phoneme sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810714043.4A
Other languages
Chinese (zh)
Other versions
CN108986820B (en)
Inventor
何中军 (He Zhongjun)
吴华 (Wu Hua)
王海峰 (Wang Haifeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810714043.4A priority Critical patent/CN108986820B/en
Publication of CN108986820A publication Critical patent/CN108986820A/en
Application granted granted Critical
Publication of CN108986820B publication Critical patent/CN108986820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

Embodiments of the present disclosure provide a method and apparatus for speech translation, an electronic device, and a computer-readable storage medium. In the method, a phoneme sequence corresponding to a named entity in source speech data from a user, the source speech data being in a source-language form, is determined, the phoneme sequence comprising at least one phoneme of the source-language form. Based on the phoneme sequence and the geographical location of the user, a target textual representation of the named entity in a target-language form is determined. Based on the target textual representation, target speech data in the target-language form corresponding to the source speech data is generated. Embodiments of the present disclosure can improve the accuracy of speech translation.

Description

Speech translation method and apparatus, electronic device, and storage medium
Technical field
Embodiments of the present disclosure relate generally to the technical field of information processing, and more specifically to a method and apparatus for speech translation, an electronic device, and a computer-readable storage medium.
Background
Speech translation refers to converting speech in one language (also referred to as the source language) into speech in another language (also referred to as the target language), and can solve the cross-language communication problem between people speaking different languages. A traditional speech translation apparatus works mainly by first performing speech recognition, then invoking a machine translation system to obtain a translation, and finally invoking speech synthesis to output the translation as speech.
However, such a traditional speech translation scheme uses only the speech information input by the user, without exploiting other potentially relevant information. This may degrade the quality of speech translation and fail to meet users' needs in many speech translation scenarios.
Summary of the invention
Embodiments of the present disclosure relate to a method and apparatus for speech translation, an electronic device, and a computer-readable storage medium.
In a first aspect of the present disclosure, a method for speech translation is provided. The method comprises: determining a phoneme sequence corresponding to a named entity in source speech data from a user, the source speech data being in a source-language form, the phoneme sequence comprising at least one phoneme of the source-language form. The method further comprises: determining, based on the phoneme sequence and the geographical location of the user, a target textual representation of the named entity in a target-language form. The method further comprises: generating, based on the target textual representation, target speech data in the target-language form corresponding to the source speech data.
In a second aspect of the present disclosure, an apparatus for speech translation is provided. The apparatus comprises: a first determining module configured to determine a phoneme sequence corresponding to a named entity in source speech data from a user in a source-language form, the phoneme sequence comprising at least one phoneme of the source-language form. The apparatus further comprises: a second determining module configured to determine, based on the phoneme sequence and the geographical location of the user, a target textual representation of the named entity in a target-language form. The apparatus further comprises: a generating module configured to generate, based on the target textual representation, target speech data in the target-language form corresponding to the source speech data.
In a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises one or more processors, and a storage device for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, implements the method of the first aspect.
It should be appreciated that the content described in this Summary is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understandable from the description below.
Brief description of the drawings
The above and other objects, features, and advantages of embodiments of the present disclosure will become readily understandable by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, in which:
Fig. 1 shows a schematic diagram of an example environment in which some embodiments of the present disclosure can be implemented;
Fig. 2 shows a schematic flow chart of a method for speech translation according to an embodiment of the present disclosure;
Fig. 3 shows a schematic block diagram of an apparatus for speech translation according to an embodiment of the present disclosure; and
Fig. 4 shows a schematic block diagram of a device that can be used to implement embodiments of the present disclosure.
Throughout the drawings, the same or similar reference numerals denote the same or similar components.
Detailed description
The principles and spirit of the present disclosure are described below with reference to several exemplary embodiments shown in the accompanying drawings. It should be understood that these specific embodiments are described merely to enable those skilled in the art to better understand and implement the present disclosure, and not to limit the scope of the present disclosure in any way.
As noted above, a traditional speech translation scheme uses only the speech information input by the user, without exploiting other potentially relevant information. This may degrade the quality of speech translation and fail to meet users' needs in many speech translation scenarios.
As a specific example, if the speech the user needs translated is "Excuse me, how do I get to Youming ('有明') subway station?", a traditional speech translation scheme is likely to produce a translation error. For example, lacking geographical location information and knowledge related to the user, a traditional scheme may mistakenly recognize the place name "youming" ('有明', Ariake) in the user's speech as the adjective "youming" ('有名', famous), leading to the erroneous translated speech "How to go to the famous station?".
As another example, in other possible speech translation scenarios, multiple places with identical pronunciations may in fact have different names, and therefore different speech translations. For example, for the Chinese pinyin pronunciation "dongchong", Shenzhen has a place named '东冲' (Dongchong), while Hong Kong also has a place named '东涌' (Tung Chung). Similarly, for the pinyin pronunciation "zhongguo", Japan has a region also called '中国' (Chūgoku). In yet another scenario, the same name may be rendered entirely differently in different geographical locations; for example, "Burger King" is traditionally known as "Hungry Jack's" in Australia, and so on.
The inventors have discovered through research that traditional speech translation schemes cannot obtain satisfactory translation results in the above speech translation scenarios. The main reason is that traditional schemes do not take into account geographical location information related to the user during speech translation, so that both the recognition and the translation of the user's speech suffer from inaccuracy.
The inventors have also found through research that the geographical location information of the user is of great help in improving the accuracy of speech translation. For example, for named entities such as person names, place names, organization names, and proper nouns, introducing the user's geographical location information during speech translation can improve the translation accuracy of the named entities mentioned in the user's speech. In the scenarios described above, taking into account the geographical location of the user can eliminate the ambiguity of the named entity and obtain a more accurate translation.
In view of the above analysis and research by the inventors, embodiments of the present disclosure propose a method and apparatus for speech translation, an electronic device, and a computer-readable storage medium, to improve the accuracy of speech translation. By using the geographical location information of the user, embodiments of the present disclosure can eliminate ambiguity present in speech recognition and speech translation, thereby improving translation accuracy. Embodiments of the present disclosure are particularly suitable for scenarios such as overseas travel, and can be applied in products such as translation applications on mobile phones and dedicated translators. Several embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an example environment 100 in which some embodiments of the present disclosure can be implemented. As shown in Fig. 1, in the example environment 100, a user 110 utters speech in one language (also referred to as the source language) to a computing device 120, producing speech data, also referred to as source speech data 115. In this example, the source language is Chinese, and the user 110 says in Chinese "Excuse me, how do I get to Youming subway station?". For example, the user 110 may be traveling abroad in Japan and need the computing device 120 to translate the above source speech data 115 into another language, also referred to as the target language.
The computing device 120 obtains the source speech data 115 and converts it into speech data in the target language, referred to as target speech data 125. In this example, the target language is English. It should be appreciated that the above example is merely for the purpose of illustration and is not intended to limit the scope of embodiments of the present disclosure. For example, in other embodiments, the source language may be any language such as English, French, or Japanese, and the target language may likewise be any language such as Chinese, French, or Japanese.
It will be understood that the computing device 120 may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, station, unit, device, multimedia computer, multimedia tablet, Internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system (PCS) device, personal navigation device, personal digital assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, e-book device, gaming device, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. It is also contemplated that the computing device 120 can support any type of user interface (such as "wearable" circuitry).
Additionally, it should be noted that in the context of the present disclosure, the term "speech" refers to audio with linguistic properties. Although Fig. 1 shows the speech being uttered by the user 110, this is merely exemplary. In other embodiments, the speech may also be emitted by an electronic device such as a loudspeaker. Therefore, unless the context clearly indicates otherwise, "speech" uttered by the user 110 is not limited to originating from the user 110, but may also be emitted by other devices or apparatuses.
Fig. 2 shows a schematic flow chart of a method 200 for speech translation according to an embodiment of the present disclosure. In some embodiments, the method 200 may be implemented by the computing device 120 of Fig. 1, for example by a processor or processing unit of the computing device 120. In other embodiments, all or part of the method 200 may also be implemented by a computing device independent of the computing device 120, or by other units in the example environment 100. For ease of discussion, the method 200 is described in conjunction with Fig. 1.
At 210, the computing device 120 determines a phoneme sequence corresponding to a named entity in the source speech data 115 from the user 110 in the source-language form, the phoneme sequence comprising at least one phoneme of the source-language form. For example, in the example scenario shown in Fig. 1, the computing device 120 determines the phoneme sequence "youming" of the named entity '有明' (Youming).
Generally, in the context of this document, a phoneme in the phoneme sequence may be a unit representing a sound of the source language. For example, when the source language is Chinese, a phoneme may correspond to pinyin; when the source language is English, a phoneme may correspond to a phonetic symbol, and so on. It should be appreciated that the above examples are merely for the purpose of illustration and are not intended to limit the scope of embodiments of the present disclosure.
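The text-to-phoneme step for Chinese, where each phoneme is a pinyin syllable, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the character-to-pinyin table is a toy assumption, whereas a production system would use a full pronunciation lexicon.

```python
# Minimal sketch of text-to-phoneme conversion for Chinese, where each
# phoneme is a pinyin syllable. The character-to-pinyin table below is a
# toy assumption covering only the running example.
PINYIN = {"有": "you", "明": "ming", "地": "di", "铁": "tie", "站": "zhan"}

def to_phoneme_sequence(text: str) -> list[str]:
    """Convert source-language text to its phoneme (pinyin) sequence."""
    return [PINYIN[ch] for ch in text if ch in PINYIN]

print(to_phoneme_sequence("有明"))  # ['you', 'ming']
```

A real system would also handle heteronyms (characters with several readings), which a flat table cannot express.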
In some embodiments, in order to determine the phoneme sequence corresponding to the named entity, the computing device 120 may first recognize the source speech data 115 as source text in the source-language form. For example, in the example scenario of Fig. 1, the computing device 120 may recognize the source speech data 115 uttered by the user 110 as the source text '请问,去有明地铁站怎么走?' in the source-language form.
The computing device 120 may then segment the source text into words to determine the source textual representation of the named entity in the source-language form. For example, through word segmentation the computing device 120 may identify '有明' in the source text '请问,去有明地铁站怎么走?'. The computing device 120 may then convert the source textual representation of the named entity into the phoneme sequence of the source-language form. For example, the computing device 120 may convert '有明' into its pinyin representation "youming", thereby obtaining the phoneme sequence corresponding to the named entity. In this way, the computing device 120 can effectively and accurately identify the phoneme sequence of the named entity included in the source speech data 115.
In other embodiments, the computing device 120 may instead first annotate the source speech data 115 with its pronunciation, obtaining "qingwen, quyoumingditiezhanzenmezou?", and then perform segmentation, thereby determining the phoneme sequence "youming" corresponding to the named entity. In further embodiments, the computing device 120 may determine the phoneme sequence corresponding to the named entity from the source speech data 115 of the user 110 in any suitable manner.
As mentioned above, using the geographical location information of the user 110 during speech recognition of the source speech data 115 can improve recognition accuracy. For example, from the geographical location information of the user 110, the computing device 120 can obtain the place where the user 110 is located, and that place may be associated with multiple named entities (for example, place names). The user 110 may mention these place names, such as '有明', in the source speech data 115 to be translated. In such a case, if the geographical location of the user 110 is not taken into account during the above segmentation, the place names mentioned by the user 110 may not be accurately recognized.
Therefore, in some embodiments, when segmenting the source text of the source speech data 115, the computing device 120 may determine a set of named entities in the source-language form associated with the geographical location of the user 110. For example, in the example scenario of Fig. 1, the geographical location of the user 110 is Japan, and the set of named entities associated with Japan may be "Tokyo, Osaka, Yokohama, Ariake ('有明'), ...", and so on. The computing device 120 may then segment the source text of the source speech data 115 based on this set of named entities, thereby avoiding mistakenly recognizing the place name '有明' as the adjective '有名' (famous). In this way, the computing device 120 can improve the accuracy of speech recognition (in particular, of the segmentation operation).
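A location-aware segmentation pass of this kind can be sketched as follows: place names associated with the user's location are merged into the segmentation lexicon so that a simple forward-maximum-matching segmenter keeps them whole. The lexicon contents are assumed for illustration and are not from the patent.

```python
# Sketch of location-aware word segmentation. Place names associated with
# the user's location are merged into the base lexicon, so that forward
# maximum matching prefers the whole place name over a character-by-
# character split. Lexicon contents are illustrative assumptions.
BASE_LEXICON = {"请问", "去", "地铁站", "怎么", "走"}
JAPAN_ENTITIES = {"有明", "东京", "大阪", "横滨"}

def segment(text: str, lexicon: set[str]) -> list[str]:
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary match first; fall back to one character.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = BASE_LEXICON | JAPAN_ENTITIES
print(segment("去有明地铁站", lexicon))        # ['去', '有明', '地铁站']
print(segment("去有明地铁站", BASE_LEXICON))   # ['去', '有', '明', '地铁站']
```

Without the location-specific entities, '有明' falls apart into single characters, which is exactly the failure mode that leads to the "famous" mistranslation.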
In some embodiments, the geographical location of the user 110 may be obtained, for example, by a mobile terminal with a positioning function or by another instrument equipped with a positioning apparatus. In other embodiments, the geographical location related to the user 110 may also be obtained in any other suitable manner, for example, a geographical location input by the user 110 into the computing device 120, or a geographical location the user 110 needs to query. In addition, it should be understood that the specific geographical locations mentioned in this document are merely for the purpose of illustration and are not intended to limit the scope of embodiments of the present disclosure.
At 220, the computing device 120 determines, based on the phoneme sequence of the named entity and the geographical location of the user 110, a target textual representation of the named entity in the target-language form. For example, in the example scenario shown in Fig. 1, the computing device 120 may determine through a positioning system (such as GPS) that the current geographical location of the user 110 is Japan. In turn, based on the phoneme sequence "youming" of the named entity and the geographical location "Japan", the computing device 120 determines the target textual representation "Ariake" of the named entity in the target-language (for example, English) form.
In some embodiments, in order to determine the target textual representation of the named entity in the target-language form, the computing device 120 may use the phoneme sequence of the named entity and the geographical location of the user 110 as an index to look up an entry associated with the named entity in a predetermined dictionary. An entry of the predetermined dictionary may include the phoneme sequence of the named entity, the target textual representation, and a geographical location. In other words, the predetermined dictionary establishes in advance a mapping between the geographical location of the named entity, the phoneme sequence in the source language, and the target textual representation in the target language. The computing device 120 can therefore obtain the target textual representation of the named entity from the entry.
By looking up the dictionary, the computing device 120 can quickly and accurately determine the target textual representation of the named entity in the target-language form. In some embodiments, an entry in the predetermined dictionary may also include other information, such as the source textual representation of the named entity in the source-language form and the type of the named entity, so that dictionary entries can be managed more effectively and additional indexing modes become available.
As a non-limiting example, each data entry in the predetermined dictionary may be a five-tuple (S, Y, T, P, K), where S is the textual representation of the named entity in the source language, Y is the phoneme sequence of the named entity in the source language, T is the textual representation of the named entity in the target language, P is the geographical location related to the named entity, and K is the type of the named entity.
For example, entries in the predetermined dictionary may be (有明, youming, Ariake, Japan|Tokyo, LOC), (汉堡王, hanbaowang, Hungry Jack's, Australia, RESTAURANT), and so on. The data types in the entries can be defined according to actual requirements, for example, location (LOC), restaurant (RESTAURANT), hotel (HOTEL), hospital (HOSPITAL), and so on.
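The five-tuple dictionary and its (phoneme sequence, location) lookup can be sketched as below. The two sample entries follow the patent's own example; the data structure and linear scan are illustrative assumptions, not the patent's actual storage format.

```python
# Sketch of the predetermined dictionary of five-tuples (S, Y, T, P, K),
# indexed by phoneme sequence and geographical location. A real system
# would likely use a hash index rather than a linear scan.
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    source_text: str   # S: textual representation in the source language
    phonemes: str      # Y: phoneme sequence in the source language
    target_text: str   # T: textual representation in the target language
    location: str      # P: geographical location related to the entity
    entity_type: str   # K: type of the named entity

DICTIONARY = [
    Entry("有明", "youming", "Ariake", "Japan", "LOC"),
    Entry("汉堡王", "hanbaowang", "Hungry Jack's", "Australia", "RESTAURANT"),
]

def lookup(phonemes: str, location: str) -> Optional[Entry]:
    """Find the entry matching both the phoneme sequence and the location."""
    for entry in DICTIONARY:
        if entry.phonemes == phonemes and entry.location == location:
            return entry
    return None

print(lookup("youming", "Japan").target_text)  # Ariake
```

Keying on both fields is what disambiguates: the same phoneme sequence with a different location yields a different (or no) entry.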
In this case, for the example scenario of Fig. 1, the computing device 120 can retrieve from the predetermined dictionary, using the phoneme sequence "youming" of the named entity '有明' and the geographical location information of the user 110 (for example, Japan), the entry (有明, youming, Ariake, Japan, LOC), thereby determining the target textual representation "Ariake" of the named entity '有明' in the target-language form.
It should be appreciated that the above-described manner of determining the target textual representation by looking up a predetermined dictionary is merely illustrative and is not intended to limit the scope of embodiments of the present disclosure. For example, in other embodiments, the computing device 120 may determine the textual representation of the named entity in the target-language form from the phoneme sequence of the named entity and the geographical location of the user 110 in any other suitable manner.
At 230, the computing device 120 generates, based on the target textual representation of the named entity, target speech data 125 in the target-language form corresponding to the source speech data 115. For example, in the example scenario of Fig. 1, the computing device 120 generates the target speech data 125, such as "Excuse me, how to go to the Ariake Subway Station?", based on the determined English representation "Ariake" of '有明'.
In some embodiments, in order to generate the target speech data 125 in the target-language form, the computing device 120 may translate the source text of the source speech data 115 into target text in the target-language form (for example, English text). The computing device 120 may then use the determined target textual representation of the named entity to adjust the target text, and convert the adjusted target text into target speech data. In this way, the computing device 120 can make full use of existing speech translation pipelines and tools, needing only limited adjustments, which saves operating burden on the computing device 120.
Continuing with the scenario of Fig. 1 as an example, the computing device 120 may initially translate the Chinese '请问,去有明地铁站怎么走?' as "Excuse me, how to go to the famous Subway Station?". The computing device 120 then adjusts this translation using the determined English representation "Ariake" of '有明'. For example, the adjustment may include replacing the word "famous" in the translation with "Ariake". The computing device 120 can then convert the adjusted English text into English speech through text-to-speech (TTS) conversion technology.
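The adjustment step above amounts to a targeted substitution in the draft translation. A minimal sketch, assuming the mistranslated word is known from the entity alignment:

```python
# Sketch of the post-translation adjustment: the mistranslation of the
# named entity is replaced with the target textual representation obtained
# from the dictionary lookup. Identifying "famous" as the mistranslated
# span is assumed here; in practice it would come from word alignment.
def adjust_translation(target_text: str, mistranslation: str, entity_target: str) -> str:
    """Replace the entity's mistranslation with its correct target form."""
    return target_text.replace(mistranslation, entity_target)

draft = "Excuse me, how to go to the famous Subway Station?"
print(adjust_translation(draft, "famous", "Ariake"))
# Excuse me, how to go to the Ariake Subway Station?
```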
In other embodiments, the computing device 120 may also use the target textual representation of the named entity to post-process the source text of the source speech data 115, for example replacing the corresponding word in the source text with the dictionary entry form of '有明' while simultaneously annotating the target text and type of the named entity, obtaining, for example, the text '请问,去 有明|Ariake|LOC 地铁站怎么走?'.
Thereafter, the computing device 120 may invoke a machine translation system to translate the annotated text. For example, with current neural-network-based machine translation, constrained (forced) decoding can be performed on the source-language sentence, forcing the annotated word or phrase to be output as the correct result, namely "Excuse me, how to go to the Ariake Subway Station?".
As the above examples show, by introducing geographical location information related to the user during speech translation, embodiments of the present disclosure can correctly recognize the named entity in the user's speech and thus obtain the correct speech translation associated with that named entity, thereby eliminating ambiguity in speech recognition and translation and improving speech translation accuracy.
Fig. 3 shows a schematic block diagram of an apparatus 300 for speech translation according to an embodiment of the present disclosure. In some embodiments, the apparatus 300 may be included in the computing device 120 of Fig. 1 or implemented as the computing device 120.
As shown in Fig. 3, the apparatus 300 includes a first determining module 310, a second determining module 320, and a generating module 330. The first determining module 310 is configured to determine a phoneme sequence corresponding to a named entity in source speech data from a user in a source-language form, the phoneme sequence comprising at least one phoneme of the source-language form. The second determining module 320 is configured to determine, based on the phoneme sequence and the geographical location of the user, a target textual representation of the named entity in a target-language form. The generating module 330 is configured to generate, based on the target textual representation, target speech data in the target-language form corresponding to the source speech data.
In some embodiments, the first determining module 310 further includes: a first recognition module configured to recognize the source speech data as source text in the source-language form; a segmentation module configured to segment the source text to determine the source textual representation of the named entity in the source-language form; and a first conversion module configured to convert the source textual representation into the phoneme sequence of the source-language form.
In some embodiments, the segmentation module further includes a third determining module configured to determine a set of named entities in the source-language form associated with the geographical location; and the segmentation module is further configured to segment the source text based on the set of named entities.
In some embodiments, the second determining module 320 further includes: a lookup module configured to look up, with the phoneme sequence and the geographical location as an index, an entry associated with the named entity in a predetermined dictionary, the entry including the phoneme sequence, the target textual representation, and the geographical location; and an obtaining module configured to obtain the target textual representation from the entry.
In some embodiments, the entry further includes at least one of the following: the source textual representation of the named entity in the source-language form, and the type of the named entity.
In some embodiments, the generating module 330 further includes: a second recognition module configured to recognize the source speech data as source text in the source-language form; a translation module configured to translate the source text into target text in the target-language form; an adjusting module configured to adjust the target text using the target textual representation; and a second conversion module configured to convert the adjusted target text into target speech data.
Fig. 4 schematically shows a block diagram of a device 400 that can be used to implement embodiments of the present disclosure. As shown in Fig. 4, the device 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processing according to computer program instructions stored in a read-only memory (ROM) 402 or computer program instructions loaded from a storage unit 408 into a random access memory (RAM) 403. Various programs and data required for the operation of the device 400 can also be stored in the RAM 403. The CPU 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Multiple components in the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and loudspeakers; a storage unit 408, such as a magnetic disk or an optical disc; and a communication unit 409, such as a network card, a modem, or a wireless communication transceiver. The communication unit 409 allows the device 400 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
Each of the processes and processing described above, such as the method 200, can be performed by the processing unit 401. For example, in some embodiments, the method 200 can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program can be loaded into and/or installed on the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the CPU 401, one or more steps of the method 200 described above can be performed.
As it is used herein, term " includes " and its similar term should be understood as that opening includes, i.e., " including but not It is limited to ".Term "based" should be understood as " being based at least partially on ".Term " one embodiment " or " embodiment " should manage Solution is " at least one embodiment ".Term " first ", " second " etc. may refer to different or identical object.May be used also herein It can include other specific and implicit definition.
As it is used herein, term " determination " covers various movements.For example, " determination " may include operation, It calculates, processing, export, investigation, searches (for example, searching in table, database or another data structure), finds out.In addition, " determination " may include receiving (for example, receiving information), access (for example, data in access memory) etc..In addition, " determination " It may include parsing, selection, selection, foundation etc..
It should be noted that embodiment of the disclosure can be realized by the combination of hardware, software or software and hardware.Firmly Part part can use special logic to realize;Software section can store in memory, by instruction execution system appropriate, Such as microprocessor or special designs hardware execute.It will be appreciated by those skilled in the art that above-mentioned device and method can It is realized with using computer executable instructions and/or being included in the processor control code, such as in programmable memory Or such code is provided in the data medium of such as optics or electrical signal carrier.
In addition, although describing the operation of disclosed method in the accompanying drawings with particular order, this do not require that or Person implies must execute these operations in this particular order, or has to carry out operation shown in whole and be just able to achieve expectation Result.On the contrary, the step of describing in flow chart can change and execute sequence.Additionally or alternatively, it is convenient to omit Mou Xiebu Suddenly, multiple step groups are combined into a step to execute, and/or a step is decomposed into execution of multiple steps.It shall also be noted that It can be embodied in one apparatus according to the feature and function of two or more devices of the disclosure.Conversely, above-described The feature and function of one device can be to be embodied by multiple devices with further division.
Although describing the disclosure by reference to several specific embodiments, but it is to be understood that it is public that the present disclosure is not limited to institutes The specific embodiment opened.The disclosure is intended to cover in spirit and scope of the appended claims included various modifications and equivalent Arrangement.

Claims (14)

1. A method for speech translation, comprising:
determining a phoneme sequence corresponding to a named entity in source speech data from a user in a source language, the phoneme sequence comprising at least one phoneme of the source language;
determining, based on the phoneme sequence and a geographical location of the user, a target text representation of the named entity in a target language; and
generating, based on the target text representation, target speech data in the target language corresponding to the source speech data.
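The three claimed steps can be sketched as follows. This is a minimal hypothetical illustration, not part of the claims: speech recognition and synthesis are represented by plain strings, the named entity is stubbed as the last word of the utterance, and all names here are assumptions.

```python
def translate_speech(source_text, user_location, entity_dict, translations):
    """Toy sketch of the claimed three-step pipeline.

    entity_dict maps (phoneme_sequence, location) -> target text representation;
    translations maps a source sentence to a target template with an <ENTITY> slot.
    """
    # Step 1: a (stubbed) phoneme sequence for the named entity in the source text.
    # Here we pretend the entity is the last word and its "phonemes" are its letters.
    entity = source_text.split()[-1]
    phoneme_seq = tuple(entity)
    # Step 2: look up the target-language representation using phonemes + location.
    target_repr = entity_dict.get((phoneme_seq, user_location), entity)
    # Step 3: translate the sentence and splice in the resolved entity text.
    return translations.get(source_text, source_text).replace("<ENTITY>", target_repr)

# Usage: the same phonemes resolve differently depending on the user's location.
entity_dict = {(tuple("wudaokou"), "Beijing"): "Wudaokou"}
translations = {"take me to wudaokou": "please drive to <ENTITY>"}
result = translate_speech("take me to wudaokou", "Beijing", entity_dict, translations)
```

If the user's location has no matching dictionary entry, the entity falls back to the generic translation, which is the situation the claimed dictionary is designed to avoid.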
2. The method according to claim 1, wherein determining the phoneme sequence corresponding to the named entity comprises:
recognizing the source speech data as source text in the source language;
segmenting the source text to determine a source text representation of the named entity in the source language; and
converting the source text representation into the phoneme sequence in the source language.
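The final step of claim 2, converting the entity's text into a source-language phoneme sequence, can be sketched with a toy pronunciation table. The dictionary contents and fallback strategy here are made-up assumptions, not a real lexicon.

```python
def text_to_phonemes(text, pronunciation_dict):
    """Look up each word's phonemes; fall back to spelling out unknown words."""
    phonemes = []
    for word in text.split():
        phonemes.extend(pronunciation_dict.get(word, list(word)))
    return phonemes

# Toy pronunciation dictionary (illustrative only).
prons = {"wudaokou": ["w", "u", "d", "au", "k", "ou"]}
seq = text_to_phonemes("wudaokou", prons)
```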
3. The method according to claim 2, wherein segmenting the source text comprises:
determining a set of named entities in the source language associated with the geographical location; and
segmenting the source text based on the set of named entities.
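The location-aware segmentation of claim 3 can be illustrated with a simple greedy longest-match over a location-specific named-entity set. This is a sketch; the entity names and the matching strategy are assumptions.

```python
def segment(text, entity_set):
    """Greedy longest-match segmentation that keeps known named entities intact."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest entity starting at position i.
        match = None
        for end in range(len(text), i, -1):
            if text[i:end] in entity_set:
                match = text[i:end]
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            tokens.append(text[i])  # fall back to single characters
            i += 1
    return tokens

# Hypothetical per-location entity sets: a string is only treated as an
# entity where it is locally meaningful.
entities_by_location = {"Beijing": {"五道口", "中关村"}}
tokens = segment("去五道口", entities_by_location["Beijing"])
```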
4. The method according to claim 1, wherein determining the target text representation of the named entity in the target language comprises:
looking up, in a predetermined dictionary and using the phoneme sequence and the geographical location as an index, an entry associated with the named entity, the entry comprising the phoneme sequence, the target text representation, and the geographical location; and
obtaining the target text representation from the entry.
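A dictionary keyed on (phoneme sequence, geographical location), as recited in claim 4, might be structured as below. The field names and the optional fields from claim 5 are illustrative assumptions.

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    phonemes: tuple       # phoneme sequence of the named entity in the source language
    location: str         # geographical location the entry applies to
    target_text: str      # target-language text representation
    source_text: str = ""   # optional: source-language text representation (claim 5)
    entity_type: str = ""   # optional: entity type, e.g. "POI" (claim 5)

def build_index(entries):
    # Index entries by (phoneme sequence, location) for O(1) lookup.
    return {(e.phonemes, e.location): e for e in entries}

def lookup(index, phonemes, location) -> Optional[str]:
    entry = index.get((tuple(phonemes), location))
    return entry.target_text if entry else None

index = build_index([
    Entry(("w", "u", "d", "ao", "k", "ou"), "Beijing", "Wudaokou", "五道口", "POI"),
])
```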
5. The method according to claim 4, wherein the entry further comprises at least one of the following:
a source text representation of the named entity in the source language, and a type of the named entity.
6. The method according to claim 1, wherein generating the target speech data in the target language comprises:
recognizing the source speech data as source text in the source language;
translating the source text into target text in the target language;
adjusting the target text using the target text representation; and
converting the adjusted target text into the target speech data.
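The adjustment step of claim 6, in which a generic machine-translation output is corrected with the dictionary-resolved entity text, can be sketched as a substitution over candidate renderings. This is a hypothetical approach; the candidate list and all names are assumptions.

```python
def adjust_translation(mt_output, entity_source, target_text_repr, transliterations):
    """Replace whatever the generic MT produced for the named entity with the
    dictionary-resolved target text representation.

    transliterations maps the source entity to strings the MT system might have
    emitted for it (illustrative only).
    """
    adjusted = mt_output
    for candidate in transliterations.get(entity_source, []):
        adjusted = adjusted.replace(candidate, target_text_repr)
    return adjusted

# A generic MT system might render "五道口" literally as "Five Road Mouth";
# the dictionary-resolved form "Wudaokou" is substituted instead.
out = adjust_translation(
    "Take me to Five Road Mouth",
    "五道口",
    "Wudaokou",
    {"五道口": ["Five Road Mouth", "Wu Dao Kou"]},
)
```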
7. An apparatus for speech translation, comprising:
a first determining module configured to determine a phoneme sequence corresponding to a named entity in source speech data from a user in a source language, the phoneme sequence comprising at least one phoneme of the source language;
a second determining module configured to determine, based on the phoneme sequence and a geographical location of the user, a target text representation of the named entity in a target language; and
a generating module configured to generate, based on the target text representation, target speech data in the target language corresponding to the source speech data.
8. The apparatus according to claim 7, wherein the first determining module further comprises:
a first recognition module configured to recognize the source speech data as source text in the source language;
a segmentation module configured to segment the source text to determine a source text representation of the named entity in the source language; and
a first conversion module configured to convert the source text representation into the phoneme sequence in the source language.
9. The apparatus according to claim 8, wherein the segmentation module further comprises:
a third determining module configured to determine a set of named entities in the source language associated with the geographical location; and
wherein the segmentation module is further configured to segment the source text based on the set of named entities.
10. The apparatus according to claim 7, wherein the second determining module further comprises:
a lookup module configured to look up, in a predetermined dictionary and using the phoneme sequence and the geographical location as an index, an entry associated with the named entity, the entry comprising the phoneme sequence, the target text representation, and the geographical location; and
an obtaining module configured to obtain the target text representation from the entry.
11. The apparatus according to claim 10, wherein the entry further comprises at least one of the following:
a source text representation of the named entity in the source language, and a type of the named entity.
12. The apparatus according to claim 7, wherein the generating module further comprises:
a second recognition module configured to recognize the source speech data as source text in the source language;
a translation module configured to translate the source text into target text in the target language;
an adjusting module configured to adjust the target text using the target text representation; and
a second conversion module configured to convert the adjusted target text into the target speech data.
13. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-6.
14. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN201810714043.4A 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation Active CN108986820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810714043.4A CN108986820B (en) 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation


Publications (2)

Publication Number Publication Date
CN108986820A true CN108986820A (en) 2018-12-11
CN108986820B CN108986820B (en) 2020-12-18

Family

ID=64539869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810714043.4A Active CN108986820B (en) 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation

Country Status (1)

Country Link
CN (1) CN108986820B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010034599A1 (en) * 2000-04-07 2001-10-25 Nec Corporation Method for providing translation service
CN101256558A (en) * 2007-02-26 2008-09-03 株式会社东芝 Apparatus and method for translating speech in source language into target language
JP2009064060A (en) * 2007-09-04 2009-03-26 Kyocera Mita Corp Information processor
CN101751387A (en) * 2008-12-19 2010-06-23 英特尔公司 Method, apparatus and system for location assisted translation
CN104899191A (en) * 2014-03-09 2015-09-09 上海能感物联网有限公司 Inquiry device for information of foreign language text field inquiry way director
CN105912532A (en) * 2016-04-08 2016-08-31 华南师范大学 Language translation method and system based on geographical location information
CN107306380A (en) * 2016-04-20 2017-10-31 中兴通讯股份有限公司 A kind of method and device of the object language of mobile terminal automatic identification voiced translation
CN107943796A (en) * 2017-11-17 2018-04-20 珠海市魅族科技有限公司 A kind of interpretation method and device, terminal, readable storage medium storing program for executing


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109830226A (en) * 2018-12-26 2019-05-31 出门问问信息科技有限公司 A kind of phoneme synthesizing method, device, storage medium and electronic equipment
CN110083846A (en) * 2019-04-28 2019-08-02 北京小米移动软件有限公司 Translated speech output method, device, storage medium and electronic equipment
CN110083846B (en) * 2019-04-28 2023-11-24 北京小米移动软件有限公司 Translation voice output method, device, storage medium and electronic equipment
CN112927676A (en) * 2021-02-07 2021-06-08 北京有竹居网络技术有限公司 Method, device, equipment and storage medium for acquiring voice information
CN113312928A (en) * 2021-06-01 2021-08-27 北京字跳网络技术有限公司 Text translation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108986820B (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN108986820A (en) For the method, apparatus of voiced translation, electronic equipment and storage medium
RU2357285C2 (en) Methods and systems for translating from one language to another
US8041555B2 (en) Language translation based on a location of a wireless device
EP3267328A1 (en) Automated interpretation method and apparatus
KR102449614B1 (en) Apparatus and method for evaluating machine translation quality using distributed representation, machine translation apparatus, and apparatus for constructing distributed representation model
CN103035240B (en) For the method and system using the speech recognition of contextual information to repair
CN111261144B (en) Voice recognition method, device, terminal and storage medium
JP2021089705A (en) Method and device for evaluating translation quality
US9484034B2 (en) Voice conversation support apparatus, voice conversation support method, and computer readable medium
US20080147380A1 (en) Method, Apparatus and Computer Program Product for Providing Flexible Text Based Language Identification
US20140297254A1 (en) Text data processing method and electronic device thereof
EP2518642A1 (en) Method and terminal device for updating word stock
US20140365215A1 (en) Method for providing service based on multimodal input and electronic device thereof
JP2010146563A (en) Method, apparatus and system for location assisted translation
CN110415679B (en) Voice error correction method, device, equipment and storage medium
CN114596861A (en) Display device and method for question and answer
JP2007141133A (en) Device, method and program of example translation
CN105117391A (en) Translating languages
JP2004070959A (en) Adaptive context sensitive analysis
JP2006525552A (en) Statistical language modeling method in speech recognition
WO2023045186A1 (en) Intention recognition method and apparatus, and electronic device and storage medium
JP2012063537A (en) Communication terminal, speech recognition method and speech recognition program
WO2021179703A1 (en) Sign language interpretation method and apparatus, computer device, and storage medium
KR20080063947A (en) Method for translating located of mobile terminal
JP5998298B1 (en) Speech translation device, speech translation method, and speech translation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant