CN108986820B - Method, device, electronic equipment and storage medium for speech translation - Google Patents

Method, device, electronic equipment and storage medium for speech translation

Info

Publication number
CN108986820B
CN108986820B (application number CN201810714043.4A)
Authority
CN
China
Prior art keywords
source
target
text
language
named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810714043.4A
Other languages
Chinese (zh)
Other versions
CN108986820A (en)
Inventor
何中军
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810714043.4A priority Critical patent/CN108986820B/en
Publication of CN108986820A publication Critical patent/CN108986820A/en
Application granted granted Critical
Publication of CN108986820B publication Critical patent/CN108986820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure provide a method, apparatus, electronic device, and computer-readable storage medium for speech translation. In the method, a phoneme sequence corresponding to a named entity in source speech data in a source language form from a user is determined, the phoneme sequence including at least one phoneme in the source language form; a target textual representation of the named entity in a target language form is determined based on the phoneme sequence and the geographic location of the user; and target speech data in the target language form corresponding to the source speech data is generated based on the target textual representation. Embodiments of the present disclosure can improve the accuracy of speech translation.

Description

Method, device, electronic equipment and storage medium for speech translation
Technical Field
Embodiments of the present disclosure relate generally to the field of information processing technology, and more particularly, to a method, apparatus, electronic device, and computer-readable storage medium for speech translation.
Background
Speech translation refers to converting speech in one language (also known as the source language) into speech in another language (also known as the target language), which addresses the problem of cross-language communication between people who speak different languages. A conventional speech translation device typically works by first performing speech recognition, then invoking a machine translation system to obtain the translated text, and finally invoking speech synthesis to output the translated text as speech.
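By way of illustration only, the following Python sketch outlines the conventional cascade described above; the recognize, translate, and synthesize helpers are hypothetical stubs standing in for real speech recognition, machine translation, and speech synthesis engines, and only the data flow between the three stages is shown.

    # Hypothetical stubs standing in for real ASR, MT and TTS engines.
    def recognize(audio: bytes, lang: str) -> str:
        return "..."  # speech recognition: audio -> source-language text

    def translate(text: str, src: str, tgt: str) -> str:
        return "..."  # machine translation: source text -> target text

    def synthesize(text: str, lang: str) -> bytes:
        return b"..."  # speech synthesis: target text -> audio

    def conventional_speech_translation(audio: bytes, src: str, tgt: str) -> bytes:
        # Conventional cascade: recognition, then translation, then synthesis.
        return synthesize(translate(recognize(audio, src), src, tgt), tgt)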
However, such a conventional speech translation scheme only uses the speech information input by the user and does not exploit other potentially relevant information. This can lead to poor translation quality and fails to meet users' needs in many speech translation scenarios.
Disclosure of Invention
Embodiments of the present disclosure relate to a method, apparatus, electronic device, and computer-readable storage medium for speech translation.
In a first aspect of the disclosure, a method for speech translation is provided. The method comprises: determining a phoneme sequence corresponding to a named entity in source speech data in a source language form from a user, the phoneme sequence including at least one phoneme in the source language form. The method further comprises: determining a target textual representation of the named entity in a target language form based on the phoneme sequence and the geographic location of the user. The method further comprises: generating target speech data in the target language form corresponding to the source speech data based on the target textual representation.
In a second aspect of the disclosure, an apparatus for speech translation is provided. The apparatus includes: a first determination module configured to determine a phoneme sequence corresponding to a named entity in source speech data in a source language form from a user, the phoneme sequence including at least one phoneme in the source language form. The apparatus also includes: a second determination module configured to determine a target textual representation of the named entity in a target language form based on the phoneme sequence and the geographic location of the user. The apparatus further comprises: a generation module configured to generate target speech data in the target language form corresponding to the source speech data based on the target textual representation.
In a third aspect of the disclosure, an electronic device is provided. The electronic device includes one or more processors; and a storage device for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect.
In a fourth aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, implements the method of the first aspect.
It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily apparent from the following description.
Drawings
The above and other objects, features and advantages of the embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 illustrates a schematic diagram of an example environment in which some embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic flow diagram of a method for speech translation according to an embodiment of the present disclosure;
FIG. 3 shows a schematic block diagram of an apparatus for speech translation according to an embodiment of the present disclosure; and
FIG. 4 shows a schematic block diagram of a device that may be used to implement embodiments of the present disclosure.
Throughout the drawings, the same or similar reference numerals are used to designate the same or similar components.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments shown in the drawings. It is understood that these specific embodiments are described merely to enable those skilled in the art to better understand and implement the present disclosure, and are not intended to limit the scope of the present disclosure in any way.
As noted above, conventional speech translation schemes only utilize the speech information input by the user and do not exploit other potentially relevant information. This can result in poor speech translation quality and fails to meet users' needs in many speech translation scenarios.
As a specific example, suppose the speech that the user needs translated is "请问有明地铁站怎么走？" ("Excuse me, how do I get to Youming (Ariake) Subway Station?"). A conventional speech translation scheme is prone to translation errors here. For example, without geographic location information and related knowledge about the user, a conventional scheme may misrecognize the place name "有明" (youming, the Chinese reading of the Japanese place name Ariake) in the user's speech as the homophonous adjective "有名" ("famous"), causing a translation error and ultimately producing an erroneous translated speech such as "Excuse me, how do I get to the famous subway station?".
As a further example, in other possible speech translation scenarios, multiple places with the same pronunciation may in fact have different names, and therefore different translations. For example, for the pinyin pronunciation "dongchong", Shenzhen has a place romanized as "Dongchong", while Hong Kong has a place with the same Chinese name that is conventionally rendered in English as "Tung Chung". For another example, the characters 中国, pronounced "zhongguo" in Chinese, usually refer to China but also name the Chūgoku region of Japan. In still other scenarios, the same entity may have entirely different names in different geographic locations; for example, the restaurant chain known elsewhere as "Burger King" (汉堡王, hanbaowang) operates in Australia as "Hungry Jack's", and so on.
The inventors have found through research that conventional speech translation schemes cannot achieve satisfactory results in the speech translation scenarios described above. The main reason is that the conventional schemes do not take the user's geographic location information into account during speech translation, so the recognition and translation of the user's speech may be inaccurate.
The inventors have also found through research that the user's geographic location information is very helpful for improving the accuracy of speech translation. For example, for named entities such as names of people, places, organizations, and proper nouns, introducing the user's geographic location information into the speech translation process can improve the accuracy with which the named entities mentioned in the user's speech are translated. In the scenarios described above, taking the user's geographic location into account makes it possible to disambiguate the named entity and obtain a more accurate translation.
In view of the above analysis and research by the inventors, embodiments of the present disclosure propose a method, apparatus, electronic device, and computer-readable storage medium for speech translation that improve the accuracy of speech translation. Embodiments of the present disclosure can use the user's geographic location information to eliminate ambiguity in speech recognition and speech translation, thereby improving the accuracy of speech translation. Embodiments of the present disclosure are particularly suitable for scenarios such as overseas travel, and can be applied to products such as mobile-phone translation applications and dedicated translator devices. Several embodiments of the present disclosure are described below in conjunction with the figures.
Fig. 1 illustrates a schematic diagram of an example environment 100 in which some embodiments of the present disclosure can be implemented. As shown in FIG. 1, in the example environment 100, a user 110 provides speech uttered in one language (also referred to as the source language) to a computing device 120 as speech data, also referred to as source speech data 115. In this example, the source language is Chinese, and the user 110 asks in Chinese "请问，有明地铁站怎么走？" ("Excuse me, how do I get to Youming (Ariake) Subway Station?"). For example, the user 110 may be traveling in Japan and may need the computing device 120 to translate the above source speech data 115 into another language, also referred to as the target language.
Computing device 120 obtains source speech data 115 and converts it into speech data in the target language, referred to as target speech data 125. In this example, the target language is English. It should be understood that the above example is for illustrative purposes only and is not intended to limit the scope of the embodiments of the present disclosure. For example, in other embodiments, the source language may be any language such as English, French, or Japanese, and the target language may likewise be any language such as Chinese, French, or Japanese.
It will be appreciated that the computing device 120 may be any type of mobile terminal, fixed terminal, or portable terminal including a mobile phone, station, unit, device, multimedia computer, multimedia tablet, internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, Personal Communication System (PCS) device, personal navigation device, Personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination thereof, including accessories and peripherals of these devices, or any combination thereof. It is also contemplated that computing device 120 can support any type of interface to the user (such as "wearable" circuitry, etc.).
Further, it should be noted that in the context of the present disclosure, the term "speech" refers to audio having linguistic properties. Although FIG. 1 shows speech being spoken by user 110, this is merely exemplary. In other embodiments, the speech may also be produced by an electronic device such as a loudspeaker. Thus, unless the context clearly indicates that the speech can only be uttered by user 110, the "speech" uttered by user 110 is not limited to originating from user 110 and may also be produced by other devices or apparatuses.
FIG. 2 shows a schematic flow diagram of a method 200 for speech translation according to an embodiment of the present disclosure. In some embodiments, the method 200 may be implemented by the computing device 120 of FIG. 1, for example by a processor or processing unit of the computing device 120. In other embodiments, all or part of the method 200 may also be implemented by a computing device separate from the computing device 120, or by other units in the example environment 100. For ease of discussion, the method 200 will be described in conjunction with FIG. 1.
At 210, computing device 120 determines a phoneme sequence corresponding to a named entity in source speech data 115 in a source language form from user 110, the phoneme sequence including at least one phoneme in the source language form. For example, in the example scenario shown in FIG. 1, the computing device 120 determines the phoneme sequence "youming" of the named entity "有明".
Generally, in the context of this document, a phoneme in the phoneme sequence is a unit representing a sound of the source language. For example, a phoneme may correspond to pinyin when the source language is Chinese, and to a phonetic symbol when the source language is English, and so on. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the embodiments of the present disclosure.
In some embodiments, to determine the phoneme sequence corresponding to the named entity, computing device 120 may first recognize source speech data 115 as source text in the source language form. For example, in the example scenario of FIG. 1, the computing device 120 may recognize the source speech data 115 uttered by the user 110 as the source text "请问，有明地铁站怎么走？".
Computing device 120 can then segment the source text into words to determine a source textual representation of the named entity in the source language form. For example, by word segmentation the computing device 120 may identify "有明" in the source text "请问，有明地铁站怎么走？". Computing device 120 may then convert the source textual representation of the named entity into a phoneme sequence in the source language form. For example, computing device 120 may convert "有明" into its pinyin representation "youming" to obtain the phoneme sequence corresponding to the named entity. In this manner, computing device 120 may efficiently and accurately identify the phoneme sequence of the named entity included in source speech data 115.
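By way of illustration only, the following Python sketch shows the three steps just described (recognized source text, word segmentation against a small entity lexicon, and conversion to pinyin). It assumes the pypinyin package for the text-to-phoneme conversion; the entity lexicon and the simple containment match stand in for a full word segmenter.

    from typing import Optional
    from pypinyin import lazy_pinyin  # assumed available for pinyin conversion

    # Illustrative lexicon of known named entities (source textual representations).
    NAMED_ENTITIES = {"有明", "东涌", "中国"}

    def find_named_entity(source_text: str) -> Optional[str]:
        # Toy "segmentation": return the first known entity found in the source text.
        for entity in NAMED_ENTITIES:
            if entity in source_text:
                return entity
        return None

    def entity_phoneme_sequence(source_text: str) -> Optional[str]:
        # Convert the named entity's source textual representation into its
        # phoneme (pinyin) sequence in the source language form.
        entity = find_named_entity(source_text)
        if entity is None:
            return None
        return "".join(lazy_pinyin(entity))

    print(entity_phoneme_sequence("请问，有明地铁站怎么走？"))  # expected: youming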
In other embodiments, the computing device 120 may instead first annotate the pronunciation of source speech data 115 to obtain "qingwen, youming ditiezhan zenme zou?", and then perform word segmentation on the annotation to determine the phoneme sequence "youming" corresponding to the named entity. In further embodiments, computing device 120 may determine the phoneme sequence corresponding to the named entity from the source speech data 115 of user 110 in any other suitable manner.
As mentioned above, using the geographic location information of the user 110 in the speech recognition of the source speech data 115 may improve the accuracy of speech recognition. For example, computing device 120 may obtain the location of user 110 from the geographic location information of user 110, and that location may be associated with a number of named entities (e.g., place names). The user 110 may mention these place names, e.g., "有明", in the source speech data 115 that needs to be translated. In this case, if the geographic location of the user 110 is not taken into account in the word segmentation described above, the place names mentioned by the user 110 may not be identified accurately.
Thus, in some embodiments, when segmenting the source text of the source speech data 115, the computing device 120 may determine a set of named entities in the source language form that is associated with the geographic location of the user 110. For example, in the example scenario of FIG. 1, the geographic location of user 110 is Japan, and the set of named entities associated with Japan may include "Tokyo", "Osaka", "Youming" (有明), and so on. The computing device 120 can then segment the source text of the source speech data 115 based on this set of named entities, thereby avoiding misrecognizing the place name "有明" as the homophonous adjective "有名" ("famous"). In this manner, the computing device 120 may improve the accuracy of speech recognition (particularly of the word segmentation operation).
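As a non-limiting sketch of this idea, the following example biases word segmentation with a location-specific set of named entities, assuming the jieba segmenter; the mapping from a geographic location to its entity set is illustrative and would in practice come from a much larger gazetteer.

    import jieba  # assumed available as the word segmenter

    # Illustrative mapping from a geographic location to associated named entities.
    ENTITIES_BY_LOCATION = {
        "Japan": ["有明", "东京", "大阪"],
        "Hong Kong": ["东涌"],
    }

    def segment_with_location(source_text: str, location: str) -> list:
        # Add the entities associated with the user's location to the segmenter's
        # dictionary so they are kept as single tokens during segmentation.
        for entity in ENTITIES_BY_LOCATION.get(location, []):
            jieba.add_word(entity)
        return jieba.lcut(source_text)

    print(segment_with_location("请问有明地铁站怎么走", "Japan"))
    # e.g. ['请问', '有明', '地铁站', '怎么', '走']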
In some embodiments, the geographic location of the user 110 may be obtained, for example, by a mobile terminal or other instrument with positioning capabilities. In other embodiments, a geographic location associated with the user 110 may also be obtained in any other suitable manner, for example a geographic location that the user 110 inputs to the computing device 120, or a geographic location that the user 110 wishes to query. Furthermore, it should be understood that the particular geographic locations mentioned in the context of this disclosure are for illustrative purposes only and are not intended to limit the scope of embodiments of the present disclosure.
At 220, the computing device 120 determines a target textual representation of the named entity in the target language form based on the phoneme sequence of the named entity and the geographic location of the user 110. For example, in the example scenario illustrated in FIG. 1, computing device 120 may determine via a positioning system (such as GPS) that the current geographic location of user 110 is Japan. The computing device 120 then determines the target textual representation "Ariake" of the named entity in the target language (e.g., English) based on the phoneme sequence "youming" of the named entity and the geographic location "Japan".
In some embodiments, to determine the target textual representation of the named entity in the target language form, the computing device 120 may look up an entry associated with the named entity in a predetermined dictionary, using the phoneme sequence of the named entity and the geographic location of the user 110 as indices. The entries of the predetermined dictionary may include the phoneme sequence, the target textual representation, and the geographic location of the named entity. In other words, the predetermined dictionary establishes in advance a mapping between the geographic location of the named entity, its phoneme sequence in the source language, and its textual representation in the target language. The computing device 120 can thus obtain the target textual representation of the named entity from the entry.
By looking up the dictionary, the computing device 120 can quickly and accurately determine the target textual representation of the named entity in the target language form. In some embodiments, the entries in the predetermined dictionary may also include other information, such as a source textual representation of the named entity in the source language and the type of the named entity, in order to manage the dictionary entries more efficiently and to provide additional ways of indexing them.
As a non-limiting example, each data entry in the predetermined dictionary may be a five-tuple (S, Y, T, P, K), where S is the textual representation of the named entity in the source language, Y is the phoneme sequence of the named entity in the source language, T is the textual representation of the named entity in the target language, P is the geographic location associated with the named entity, and K is the type of the named entity.
For example, entries in the predetermined dictionary may be (有明, youming, Ariake, Japan|Tokyo, LOC), (汉堡王, hanbaowang, Hungry Jack's, Australia, RESTAURANT), and so on. The type field of an entry may be defined according to actual needs and may be, for example, location (LOC), restaurant (RESTAURANT), hotel (HOTEL), hospital (HOSPITAL), and so on.
In such a case, for the example scenario of FIG. 1, computing device 120 may retrieve the matching entry (有明, youming, Ariake, Japan|Tokyo, LOC) from the predetermined dictionary using the phoneme sequence "youming" of the named entity and the geographic location information of user 110 (e.g., Japan), thereby determining that the target textual representation of the named entity "有明" in the target language is "Ariake".
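By way of illustration only, the following sketch models the predetermined dictionary as a list of (S, Y, T, P, K) entries and looks an entry up by phoneme sequence and geographic location, following the example entries above; the data and the exact matching policy are illustrative.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass(frozen=True)
    class Entry:
        source_text: str            # S: textual representation in the source language
        phonemes: str               # Y: phoneme sequence in the source language
        target_text: str            # T: textual representation in the target language
        locations: Tuple[str, ...]  # P: geographic location(s) associated with the entity
        entity_type: str            # K: type of the named entity

    DICTIONARY = [
        Entry("有明", "youming", "Ariake", ("Japan", "Tokyo"), "LOC"),
        Entry("汉堡王", "hanbaowang", "Hungry Jack's", ("Australia",), "RESTAURANT"),
    ]

    def lookup(phonemes: str, location: str) -> Optional[Entry]:
        # Return the entry whose phoneme sequence matches and whose associated
        # geographic locations include the user's location.
        for entry in DICTIONARY:
            if entry.phonemes == phonemes and location in entry.locations:
                return entry
        return None

    entry = lookup("youming", "Japan")
    print(entry.target_text if entry else None)  # expected: Ariake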
It should be understood that the above manner of determining the target textual representation by looking up the predetermined dictionary is merely illustrative and is not intended to limit the scope of embodiments of the present disclosure. For example, in other embodiments, the computing device 120 may determine the target textual representation of the named entity in the target language form from the phoneme sequence of the named entity and the geographic location of the user 110 in any other suitable manner.
At 230, the computing device 120 generates target speech data 125 in the target language form corresponding to the source speech data 115 based on the target textual representation of the named entity. For example, in the example scenario of FIG. 1, the computing device 120 generates the target speech data 125 based on the determined English representation "Ariake" of "有明", e.g., "Excuse me, how do I get to the Ariake Subway Station?".
In some embodiments, to generate the target speech data 125 in the target language form, the computing device 120 may translate the source text of the source speech data 115 into target text in the target language form (e.g., English text). The computing device 120 may then adjust the target text using the determined target textual representation of the named entity, and then convert the adjusted target text into the target speech data. In this manner, computing device 120 can make full use of existing speech translation processes and tools with only limited adjustments, thereby reducing the computational burden on computing device 120.
Continuing with the scenario of FIG. 1 as an example, the computing device 120 may first translate the Chinese source text "请问，有明地铁站怎么走？" into, for example, "Excuse me, how do I get to the famous subway station?", in which the place name has been mistranslated. Computing device 120 then adjusts the translation using the determined English representation "Ariake" of "有明". For example, the adjustment may include replacing the word "famous" in the translation with "Ariake". Computing device 120 may then convert the English text into English speech via a text-to-speech (TTS) conversion technique.
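A minimal sketch of this translate-then-adjust flow is given below; the baseline translation and the text-to-speech call are hypothetical stubs, and only the adjustment step (substituting the dictionary's target textual representation for the mistranslated named entity) is spelled out concretely.

    def baseline_translate(source_text: str) -> str:
        # Stub for a general-purpose MT system that mistranslates the place name
        # 有明 as the homophonous adjective "famous".
        return "Excuse me, how do I get to the famous subway station?"

    def adjust_translation(target_text: str, wrong_span: str, entity_target: str) -> str:
        # Replace the mistranslated span with the target textual representation
        # of the named entity obtained from the predetermined dictionary.
        return target_text.replace(wrong_span, entity_target)

    def text_to_speech(text: str) -> bytes:
        # Stub for a TTS engine.
        return text.encode("utf-8")

    target_text = baseline_translate("请问，有明地铁站怎么走？")
    adjusted = adjust_translation(target_text, "famous", "Ariake")
    audio = text_to_speech(adjusted)
    print(adjusted)  # Excuse me, how do I get to the Ariake subway station?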
In other embodiments, the computing device 120 may instead process the source text of the source speech data 115 using the target textual representation of the named entity, for example replacing the corresponding words in the source text with the matching entry from the predetermined dictionary while marking the target text and type of the named entity, so as to obtain an annotated text such as "请问，有明|Ariake|LOC 地铁站怎么走？".
Thereafter, computing device 120 may invoke a machine translation system to translate the annotated text. For example, with current neural-network-based machine translation, the source-language sentence can be decoded with constraints such that the tagged words or phrases are forced to appear in the output as the correct result, i.e., "Excuse me, how do I get to the Ariake Subway Station?".
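By way of illustration only, the following sketch shows this annotation-based alternative; a real system would enforce the tagged phrase inside the NMT decoder (forced or constrained decoding), so here the decoder is a stub and only the pre-translation annotation is concrete.

    def annotate_source(source_text: str, source_span: str,
                        target_text: str, entity_type: str) -> str:
        # Mark the named entity in the source text with its target textual
        # representation and type, e.g. 有明 -> 有明|Ariake|LOC.
        return source_text.replace(source_span,
                                   f"{source_span}|{target_text}|{entity_type}")

    def constrained_translate(annotated_text: str) -> str:
        # Stub: a constrained NMT decoder would be forced to emit the tagged
        # target text ("Ariake") at the corresponding position.
        return "Excuse me, how do I get to the Ariake Subway Station?"

    annotated = annotate_source("请问，有明地铁站怎么走？", "有明", "Ariake", "LOC")
    print(annotated)  # 请问，有明|Ariake|LOC地铁站怎么走？
    print(constrained_translate(annotated))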
As can be seen from the above examples, in embodiments of the present disclosure, by introducing geographic location information related to the user into the speech translation process, the named entity in the user's speech can be correctly identified and a correct translation associated with the named entity can then be obtained, so that ambiguity in speech recognition and translation is eliminated and the accuracy of speech translation is improved.
Fig. 3 shows a schematic block diagram of an apparatus 300 for speech translation according to an embodiment of the present disclosure. In some embodiments, the apparatus 300 may be included in the computing device 120 of fig. 1 or implemented as the computing device 120.
As shown in FIG. 3, the apparatus 300 includes a first determination module 310, a second determination module 320, and a generation module 330. The first determination module 310 is configured to determine a phoneme sequence corresponding to a named entity in source speech data in a source language form from a user, the phoneme sequence including at least one phoneme in the source language form. The second determination module 320 is configured to determine a target textual representation of the named entity in the target language form based on the phoneme sequence and the geographic location of the user. The generation module 330 is configured to generate target speech data in the target language form corresponding to the source speech data based on the target textual representation.
In some embodiments, the first determination module 310 further comprises: a first recognition module configured to recognize the source speech data as source text in the source language form; a word segmentation module configured to segment the source text to determine a source textual representation of the named entity in the source language form; and a first conversion module configured to convert the source textual representation into the phoneme sequence in the source language form.
In some embodiments, the word segmentation module further comprises: a third determination module configured to determine a set of named entities in the source language form associated with the geographic location; and the word segmentation module is further configured to segment the source text based on the set of named entities.
In some embodiments, the second determination module 320 further comprises: a lookup module configured to look up, with the phoneme sequence and the geographic location as indices, an entry associated with the named entity in a predetermined dictionary, the entry including the phoneme sequence, the target textual representation, and the geographic location; and an obtaining module configured to obtain the target textual representation from the entry.
In some embodiments, the entry further comprises at least one of: a source textual representation of the named entity in the source language form, and a type of the named entity.
In some embodiments, the generation module 330 further comprises: a second recognition module configured to recognize the source speech data as source text in the source language form; a translation module configured to translate the source text into target text in the target language form; an adjustment module configured to adjust the target text using the target textual representation; and a second conversion module configured to convert the adjusted target text into the target speech data.
Fig. 4 schematically illustrates a block diagram of a device 400 that may be used to implement embodiments of the present disclosure. As shown in FIG. 4, the device 400 includes a central processing unit (CPU) 401 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (ROM) 402 or loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 may also store various programs and data required for the operation of the device 400. The CPU 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A number of components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The various procedures and processing described above, such as the method 200, may be performed by the processing unit 401. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the CPU 401, one or more steps of the method 200 described above may be performed.
As used herein, the terms "comprises," "comprising," and the like are to be construed as open-ended, i.e., "including but not limited to." The term "based on" should be understood as "based at least in part on." The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment." The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included herein.
As used herein, the term "determining" encompasses a wide variety of actions. For example, "determining" can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Further, "determining" can include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory), and the like. Further, "determining" may include resolving, selecting, choosing, establishing, and the like.
It should be noted that the embodiments of the present disclosure can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, in programmable memory or on a data carrier such as an optical or electronic signal carrier.
Further, while the operations of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps. It should also be noted that the features and functions of two or more apparatuses according to the present disclosure may be embodied in one apparatus. Conversely, the features and functions of one apparatus described above may be further divided so as to be embodied by a plurality of apparatuses.
While the present disclosure has been described with reference to several particular embodiments, it is to be understood that the disclosure is not limited to the particular embodiments disclosed. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (14)

1. A method for speech translation, comprising:
determining a phoneme sequence corresponding to a named entity in source speech data from a user in a source language form, the phoneme sequence including at least one phoneme in the source language form;
determining a target textual representation in a target language of the named entity based on the phoneme sequence and the geographic location of the user; and
generating target speech data in a target language form corresponding to the source speech data based on the target textual representation.
2. The method of claim 1, wherein determining the sequence of phonemes corresponding to the named entity comprises:
identifying the source speech data as source text in a source language form;
segmenting the source text to determine a source textual representation of the named entity in the source language form; and
converting the source textual representation into the phoneme sequence in the source language form.
3. The method of claim 2, wherein segmenting the source text comprises:
determining a set of named entities in a source language associated with the geographic location; and
segmenting the source text based on the set of named entities.
4. The method of claim 1, wherein determining a target literal representation in a target language form of the named entity comprises:
looking up, with the phoneme sequence and the geographic location as indices, an entry associated with the named entity in a predetermined dictionary, the entry including the phoneme sequence, the target textual representation, and the geographic location; and
obtaining the target textual representation from the entry.
5. The method of claim 4, wherein the entry further comprises at least one of:
a source textual representation of the named entity in a source language, and a type of the named entity.
6. The method of claim 1, wherein generating target speech data in a target language comprises:
identifying the source speech data as source text in a source language form;
translating the source text into a target text in a target language form;
adjusting the target text using the target textual representation; and
converting the adjusted target text into the target speech data.
7. An apparatus for speech translation, comprising:
a first determination module configured to determine a phoneme sequence corresponding to a named entity in source speech data from a user in a source language form, the phoneme sequence including at least one phoneme in the source language form;
a second determination module configured to determine a target textual representation in a target language of the named entity based on the phoneme sequence and the geographic location of the user; and
a generating module configured to generate target speech data in a target language form corresponding to the source speech data based on the target textual representation.
8. The apparatus of claim 7, wherein the first determining module further comprises:
a first recognition module configured to recognize the source speech data as source text in a source language;
a word segmentation module configured to segment the source text to determine a source textual representation of the named entity in the source language form; and
a first conversion module configured to convert the source textual representation into the phoneme sequence in the source language form.
9. The apparatus of claim 8, wherein the word segmentation module further comprises:
a third determination module configured to determine a set of named entities in a source language associated with the geographic location; and is
The word segmentation module is further configured to segment the source text based on the set of named entities.
10. The apparatus of claim 7, wherein the second determining module further comprises:
a lookup module configured to look up, with the phoneme sequence and the geographic location as indices, an entry associated with the named entity in a predetermined dictionary, the entry including the phoneme sequence, the target textual representation, and the geographic location; and
an obtaining module configured to obtain the target textual representation from the entry.
11. The apparatus of claim 10, wherein the entry further comprises at least one of:
a source textual representation of the named entity in a source language, and a type of the named entity.
12. The apparatus of claim 7, wherein the generating module further comprises:
a second recognition module configured to recognize the source speech data as source text in a source language;
a translation module configured to translate the source text into a target text in a target language;
an adjustment module configured to adjust the target text using the target textual representation; and
a second conversion module configured to convert the adjusted target text into the target speech data.
13. An electronic device, comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1-6.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN201810714043.4A 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation Active CN108986820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810714043.4A CN108986820B (en) 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810714043.4A CN108986820B (en) 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation

Publications (2)

Publication Number Publication Date
CN108986820A CN108986820A (en) 2018-12-11
CN108986820B true CN108986820B (en) 2020-12-18

Family

ID=64539869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810714043.4A Active CN108986820B (en) 2018-06-29 2018-06-29 Method, device, electronic equipment and storage medium for speech translation

Country Status (1)

Country Link
CN (1) CN108986820B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109830226A (en) * 2018-12-26 2019-05-31 出门问问信息科技有限公司 A kind of phoneme synthesizing method, device, storage medium and electronic equipment
CN110083846B (en) * 2019-04-28 2023-11-24 北京小米移动软件有限公司 Translation voice output method, device, storage medium and electronic equipment
CN112927676A (en) * 2021-02-07 2021-06-08 北京有竹居网络技术有限公司 Method, device, equipment and storage medium for acquiring voice information
CN113312928A (en) * 2021-06-01 2021-08-27 北京字跳网络技术有限公司 Text translation method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001292234A (en) * 2000-04-07 2001-10-19 Nec Corp Translation service providing method
JP5121252B2 (en) * 2007-02-26 2013-01-16 株式会社東芝 Apparatus, method, and program for translating speech in source language into target language
JP4986234B2 (en) * 2007-09-04 2012-07-25 京セラドキュメントソリューションズ株式会社 Information processing device
US9323854B2 (en) * 2008-12-19 2016-04-26 Intel Corporation Method, apparatus and system for location assisted translation
CN104899191A (en) * 2014-03-09 2015-09-09 上海能感物联网有限公司 Inquiry device for information of foreign language text field inquiry way director
CN105912532B (en) * 2016-04-08 2020-11-20 华南师范大学 Language translation method and system based on geographic position information
CN107306380A (en) * 2016-04-20 2017-10-31 中兴通讯股份有限公司 A kind of method and device of the object language of mobile terminal automatic identification voiced translation
CN107943796B (en) * 2017-11-17 2022-03-04 珠海市魅族科技有限公司 Translation method and device, terminal and readable storage medium

Also Published As

Publication number Publication date
CN108986820A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108986820B (en) Method, device, electronic equipment and storage medium for speech translation
JP4058071B2 (en) Example translation device, example translation method, and example translation program
US9449599B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US6952665B1 (en) Translating apparatus and method, and recording medium used therewith
US10650810B2 (en) Determining phonetic relationships
KR102375115B1 (en) Phoneme-Based Contextualization for Cross-Language Speech Recognition in End-to-End Models
US11735184B2 (en) Translation and speech recognition method, apparatus, and device
US11437025B2 (en) Cross-lingual speech recognition
JP2017058865A (en) Machine translation device, machine translation method, and machine translation program
JPWO2018055983A1 (en) Translation apparatus, translation system, and evaluation server
JP2004070959A (en) Adaptive context sensitive analysis
US8805869B2 (en) Systems and methods for cross-lingual audio search
JP2008243080A (en) Device, method, and program for translating voice
EP3005152B1 (en) Systems and methods for adaptive proper name entity recognition and understanding
WO2023045186A1 (en) Intention recognition method and apparatus, and electronic device and storage medium
KR101709693B1 (en) Method for Web toon Language Automatic Translating Using Crowd Sourcing
Granell et al. Multimodality, interactivity, and crowdsourcing for document transcription
JP2015200860A (en) Dictionary database management device, api server, dictionary database management method, and dictionary database management program
JP2010186339A (en) Device, method, and program for interpretation
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium
JP7479249B2 (en) Unknown word detection method and unknown word detection device
JP2010257085A (en) Retrieval device, retrieval method, and retrieval program
JP2010197709A (en) Voice recognition response method, voice recognition response system and program therefore
CN113515952B (en) Combined modeling method, system and equipment for Mongolian dialogue model

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant