WO2019169686A1 - Speech translation method, apparatus and computer device - Google Patents

Speech translation method, apparatus and computer device

Info

Publication number
WO2019169686A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
accent
information
user
voice
Prior art date
Application number
PCT/CN2018/082039
Other languages
English (en)
French (fr)
Inventor
周毕兴
Original Assignee
深圳市沃特沃德股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市沃特沃德股份有限公司
Publication of WO2019169686A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present invention relates to the field of electronic technologies, and in particular to a speech translation method, apparatus and computer device.
  • The main object of the present invention is to provide a speech translation method, apparatus and computer device, which aim to improve the accuracy of translation and enhance the user experience.
  • An embodiment of the present invention provides a speech translation method, the method including the following steps: receiving first voice information; performing speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information; and performing translation processing on the first text information.
  • The step of performing speech recognition on the first voice information according to the preset correspondence between the first accent phonemes and words to generate the first text information includes: decomposing the first voice information into a plurality of first accent phonemes; converting the plurality of first accent phonemes into a plurality of words according to the correspondence; and combining the plurality of words into the first text information.
  • The step of performing translation processing on the first text information includes: translating the first text information into second text information; and synthesizing the second text information into second voice information.
  • The step of synthesizing the second text information into the second voice information includes: performing speech synthesis on the second text information according to a preset correspondence between second accent phonemes and words, to generate the second voice information.
  • The step of performing speech synthesis according to the preset correspondence between the second accent phonemes and words to generate the second voice information includes: decomposing the second text information into a plurality of words; converting the plurality of words into a plurality of second accent phonemes according to the correspondence; and synthesizing the plurality of second accent phonemes into the second voice information.
  • Before the step of receiving the first voice information, the method further includes: establishing a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
  • The step of establishing the correspondence between a user's accent phonemes and words includes: receiving test voice information; performing speech recognition on the test voice information according to a correspondence between standard accent phonemes and words, to generate standard text information; outputting the standard text information; receiving the user's modification of the standard text information, to obtain corrected text information; comparing the two texts to obtain a mapping between their words; and establishing, from that mapping and the standard accent phonemes of the words in the standard text information, a mapping between the standard accent phonemes and the words in the corrected text information, which is used as the correspondence between the user's accent phonemes and words.
  • The step of receiving the test voice information includes: prompting the user to read aloud a plurality of basic pronunciation units, and receiving the test voice information produced as the user reads them.
  • Alternatively, the step of receiving the test voice information includes: prompting the user to read aloud a passage of text containing a plurality of basic pronunciation units, and receiving the test voice information produced as the user reads the passage.
  • The method is applied to a terminal device or a server.
  • An embodiment of the invention simultaneously provides a speech translation apparatus, the apparatus including:
  • a voice receiving module configured to receive first voice information;
  • a speech recognition module configured to perform speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information;
  • a translation processing module configured to perform translation processing on the first text information.
  • The speech recognition module includes:
  • a decomposing unit configured to decompose the first voice information into a plurality of first accent phonemes;
  • a converting unit configured to convert the plurality of first accent phonemes into a plurality of words according to the correspondence between the first accent phonemes and words;
  • a combining unit configured to combine the plurality of words into the first text information.
  • The translation processing module includes:
  • a translating unit configured to translate the first text information into second text information;
  • a synthesizing unit configured to synthesize the second text information into second voice information.
  • The synthesizing unit is configured to perform speech synthesis on the second text information according to a preset correspondence between second accent phonemes and words, to generate the second voice information.
  • The synthesizing unit includes:
  • a decomposing subunit configured to decompose the second text information into a plurality of words;
  • a converting subunit configured to convert the plurality of words into a plurality of second accent phonemes according to the correspondence between the second accent phonemes and words;
  • a synthesizing subunit configured to synthesize the plurality of second accent phonemes into the second voice information.
  • The apparatus further includes a relationship establishing module configured to establish a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
  • The relationship establishing module includes:
  • a receiving unit configured to receive test voice information;
  • a recognition unit configured to perform speech recognition on the test voice information according to a correspondence between standard accent phonemes and words, to generate standard text information;
  • an output unit configured to output the standard text information;
  • a correcting unit configured to receive the user's modification of the standard text information, to obtain corrected text information;
  • a comparing unit configured to compare the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information;
  • an establishing unit configured to establish, according to the mapping relationship between the words in the standard text information and the words in the corrected text information and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, this mapping relationship being used as the correspondence between the user's accent phonemes and words.
  • The receiving unit includes: a first prompting subunit configured to prompt the user to read aloud a plurality of basic pronunciation units; and a first receiving subunit configured to receive the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
  • Alternatively, the receiving unit includes: a second prompting subunit configured to prompt the user to read aloud a passage of text containing a plurality of basic pronunciation units; and a second receiving subunit configured to receive the test voice information produced when the user reads the text aloud.
  • Embodiments of the present invention also provide a computer device including a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the aforementioned speech translation method.
  • In the speech translation method provided by an embodiment of the present invention, a correspondence between a user's accent phonemes (such as the first accent phonemes) and words is preset, and speech recognition is performed on the voice information to be translated using this correspondence. Voice information spoken with a non-standard accent can thus be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.
  • FIG. 1 is a flowchart of a first embodiment of the speech translation method of the present invention;
  • FIG. 2 is a flowchart of a second embodiment of the speech translation method of the present invention;
  • FIG. 3 is a detailed flowchart of step S10 in FIG. 2;
  • FIG. 4 is a block diagram of a first embodiment of the speech translation apparatus of the present invention;
  • FIG. 5 is a block diagram of the speech recognition module in FIG. 4;
  • FIG. 6 is a block diagram of the translation processing module in FIG. 4;
  • FIG. 7 is a block diagram of the synthesizing unit in FIG. 6;
  • FIG. 8 is a block diagram of a second embodiment of the speech translation apparatus of the present invention;
  • FIG. 9 is a block diagram of the relationship establishing module in FIG. 8;
  • FIG. 10 is a block diagram of the receiving unit in FIG. 9;
  • FIG. 11 is another block diagram of the receiving unit in FIG. 9.
  • The terms "terminal" and "terminal device" used herein include both devices with only a wireless signal receiver and no transmitting capability, and devices with receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such devices may include: cellular or other communication devices with a single-line or multi-line display, or without a multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistant), which may include a radio frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver.
  • A terminal may be portable, transportable, installed in a vehicle (air, sea and/or land), or adapted and/or configured to run locally and/or in distributed form at any other location on the earth and/or in space.
  • The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal or a music/video playing terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, or a device such as a smart TV or a set-top box.
  • The server used herein includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers.
  • Here, the cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
  • Communication between the server, the terminal device and the WNS server may be implemented by any communication means, including but not limited to mobile communication based on 3GPP, LTE or WIMAX, computer network communication based on the TCP/IP or UDP protocols, and short-range wireless transmission based on the Bluetooth or infrared transmission standards.
  • The speech translation method and apparatus of the embodiments of the present invention may be applied to a terminal device or to a server.
  • The terminal device may be a dedicated translation machine, a mobile terminal such as a mobile phone or tablet, or a computer terminal such as a personal computer or laptop.
  • The server mainly refers to a computer device in the network cloud that performs translation processing on the voice information sent by the terminal device. The following takes application to a server as an example.
  • The terminal device collects the first voice information uttered by the first user and sends it to the server, and the server receives the first voice information sent by the terminal device.
  • The first voice information uttered by the first user carries a first accent.
  • An accent database is preset in the server, and the accent database includes the correspondence between the first accent phonemes and words.
  • After receiving the first voice information, the server first decomposes it into a plurality of first accent phonemes, then queries the correspondence between the first accent phonemes and words in the accent database to convert the plurality of first accent phonemes in the first voice information into a plurality of words, and finally combines the plurality of words into the first text information. Speech with a non-standard accent can thus be recognized as accurate text, and the meaning the user wants to express can be accurately understood.
  • In some embodiments, the server translates the first text information into second text information in the target language and outputs the second text information through the terminal device; for example, the second text information is sent to the terminal device, and the terminal device displays it.
  • In other embodiments, the server translates the first text information into second text information in the target language, synthesizes the second text information into second voice information, and outputs the second voice information through the terminal device; for example, the second voice information is sent to the terminal device, and the terminal device outputs it through a sound unit.
  • The accent database preset in the server further includes a correspondence between second accent phonemes and words.
  • When synthesizing the second text information into the second voice information, the server may also perform speech synthesis on the second text information according to the preset correspondence between the second accent phonemes and words to generate the second voice information; the second voice information carries the accent of the second user, so that it sounds more familiar to the second user and is easier to understand.
  • Specifically, the server first decomposes the second text information into a plurality of words, then queries the correspondence between the second accent phonemes and words in the accent database to convert the plurality of words in the second text information into a plurality of second accent phonemes, and finally combines the plurality of second accent phonemes into the second voice information.
  • In the second embodiment of the speech translation method of the present invention, the following step is further included before step S11:
  • The server establishes the correspondence between a user's accent phonemes and words in advance and stores it in the accent database.
  • The user's accent phonemes include at least the first accent phonemes, i.e. the accent phonemes of the first user, and may further include second accent phonemes, third accent phonemes and so on; that is, correspondences between accent phonemes and words may be established separately for multiple different users.
  • The specific process by which the server establishes the correspondence between a user's accent phonemes and words is as follows:
  • The server receives, through the terminal device, test voice information containing a plurality of minimal pronunciation units of a given language; for example, the terminal device collects the test voice information uttered by the user and sends it to the server, and the server receives the test voice information sent by the terminal device.
  • Optionally, the server prompts the user through the terminal device to read aloud a plurality of basic pronunciation units, and receives the test voice information produced as the user reads them.
  • In a specific implementation, the terminal device displays a plurality of basic pronunciation units of the source language and prompts the user to read them aloud; as the user reads, the terminal device collects the test voice information uttered by the user and sends it to the server; the server receives the test voice information sent by the terminal device.
  • Optionally, the server prompts the user through the terminal device to read aloud a passage of text containing a plurality of basic pronunciation units, and receives the test voice information produced as the user reads the passage.
  • In a specific implementation, the terminal device displays a passage of text containing a plurality of basic pronunciation units of the source language and prompts the user to read it aloud; as the user reads the passage, the terminal device collects the test voice information uttered by the user and sends it to the server; the server receives the test voice information sent by the terminal device.
  • S102. Perform speech recognition on the test voice information according to the correspondence between standard accent phonemes and words, and generate standard text information.
  • After receiving the test voice information, the server performs speech recognition on it according to the correspondence between the standard accent phonemes of the corresponding language and words in the language database, and generates standard text information. Specifically, the server first decomposes the test voice information into a plurality of phonemes, then queries that correspondence to convert the plurality of phonemes in the test voice information into a plurality of words, and finally combines the plurality of words into the standard text information.
  • The standard text information is output through the terminal device. Specifically, the server sends the standard text information to the terminal device, and the terminal device displays it.
  • The server receives, through the terminal device, the user's modification of the standard text information and obtains the corrected text information.
  • The user may modify the standard text information displayed on the terminal device.
  • Once the modification is complete, the terminal device obtains the modified corrected text information and sends it to the server, and the server receives the corrected text information sent by the terminal device.
  • The server compares the standard text information with the corrected text information, matching up the corresponding words in the two texts, and obtains the mapping relationship between the words in the standard text information and the words in the corrected text information.
  • The server looks up the correspondence between standard accent phonemes and words in the language database, obtains the standard accent phonemes corresponding to each word in the standard text information and, according to the mapping relationship between the words in the standard text information and the words in the corrected text information, establishes a mapping relationship between the standard accent phonemes and the words in the corrected text information, using this mapping relationship as the correspondence between the user's accent phonemes and words.
  • In this way, the test voice information of different users can be collected, the correspondence between accent phonemes and words can be established separately for each user, and each correspondence can be generated as a corresponding user profile.
  • Besides being applied to a server, the speech translation method of the embodiments of the present invention can also be applied to a terminal device; that is, when the terminal device performs local translation, the speech translation method can be implemented on the device itself.
  • When the terminal device needs to perform cloud translation, the speech translation method is implemented through the server.
  • The aforementioned server may be a single server integrating speech recognition, text translation and speech synthesis, or may be three independent servers that implement speech recognition, text translation and speech synthesis respectively.
  • The speech recognition server receives the first voice information sent by the terminal device, performs speech recognition on it according to the preset correspondence between the first accent phonemes and words, generates the first text information, and sends the first text information to the text translation server;
  • the text translation server receives the first text information, translates it into the second text information, and sends the second text information to the speech synthesis server;
  • the speech synthesis server receives the second text information, synthesizes it into the second voice information, and sends the second voice information to the terminal device;
  • the terminal device receives the second voice information and outputs it through a sound unit.
  • In the speech translation method of the embodiments of the present invention, a correspondence between a user's accent phonemes (such as the first accent phonemes) and words is preset, and speech recognition is performed using this correspondence, so that voice information spoken with a non-standard accent can be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.
  • The apparatus includes a voice receiving module 10, a speech recognition module 20 and a translation processing module 30, wherein: the voice receiving module 10 is configured to receive first voice information; the speech recognition module 20 is configured to perform speech recognition on the first voice information according to the preset correspondence between the first accent phonemes and words, to generate first text information; and the translation processing module 30 is configured to perform translation processing on the first text information.
  • Application of the speech translation apparatus to a server is taken as an example.
  • The terminal device collects the first voice information uttered by the first user and sends it to the voice receiving module 10, and the voice receiving module 10 receives the first voice information sent by the terminal device.
  • The first voice information uttered by the first user carries a first accent.
  • An accent database is preset in the server, and the accent database includes the correspondence between the first accent phonemes and words.
  • The speech recognition module 20 includes a decomposing unit 11, a converting unit 12 and a combining unit 13, wherein: the decomposing unit 11 is configured to decompose the first voice information into a plurality of first accent phonemes; the converting unit 12 is configured to query the correspondence between the first accent phonemes and words in the accent database and convert the plurality of first accent phonemes in the first voice information into a plurality of words according to that correspondence; and the combining unit 13 is configured to combine the plurality of words into the first text information. Speech with a non-standard accent can thus be recognized as accurate text, and the meaning the user wants to express can be accurately understood.
  • After the speech recognition module 20 has recognized the first voice information as the first text information, the translation processing module 30 performs translation processing on the first text information.
  • In some embodiments, the translation processing module 30 translates the first text information into second text information in the target language and outputs the second text information through the terminal device; for example, the second text information is sent to the terminal device, and the terminal device displays it.
  • In other embodiments, the translation processing module 30 includes a translating unit 31 and a synthesizing unit 32, as shown in FIG. 6, wherein: the translating unit 31 is configured to translate the first text information into the second text information; and the synthesizing unit 32 is configured to synthesize the second text information into the second voice information and output the second voice information through the terminal device; for example, the second voice information is sent to the terminal device, and the terminal device outputs it through a sound unit.
  • The accent database preset in the server further includes the correspondence between the second accent phonemes and words.
  • The synthesizing unit 32 may also perform speech synthesis on the second text information according to the preset correspondence between the second accent phonemes and words to generate the second voice information.
  • The second voice information is voice information carrying the accent of the second user, so that it sounds more familiar to the second user and is easier to understand.
  • The synthesizing unit 32 includes a decomposing subunit 321, a converting subunit 322 and a synthesizing subunit 323, wherein: the decomposing subunit 321 is configured to decompose the second text information into a plurality of words; the converting subunit 322 is configured to query the correspondence between the second accent phonemes and words in the accent database and convert the plurality of words in the second text information into a plurality of second accent phonemes according to that correspondence; and the synthesizing subunit 323 is configured to synthesize the plurality of second accent phonemes into the second voice information.
  • When second voice information carrying the second user's accent is output, it sounds more familiar and intelligible to the second user, greatly improving the user experience.
  • The apparatus further includes a relationship establishing module 40, and the relationship establishing module 40 is configured to establish a correspondence between a user's accent phonemes and words.
  • The server establishes the correspondence between a user's accent phonemes and words in advance and stores it in the accent database.
  • The user's accent phonemes include at least the first accent phonemes, i.e. the accent phonemes of the first user, and may further include second accent phonemes, third accent phonemes and so on; that is, correspondences between accent phonemes and words may be established separately for multiple different users.
  • The relationship establishing module 40 includes a receiving unit 41, a recognition unit 42, an output unit 43, a correcting unit 44, a comparing unit 45 and an establishing unit 46, wherein: the receiving unit 41 is configured to receive test voice information; the recognition unit 42 is configured to perform speech recognition on the test voice information according to the correspondence between standard accent phonemes and words, to generate standard text information; the output unit 43 is configured to output the standard text information; the correcting unit 44 is configured to receive the user's modification of the standard text information, to obtain corrected text information; the comparing unit 45 is configured to compare the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information; and the establishing unit 46 is configured to establish, according to that mapping relationship and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, which is used as the correspondence between the user's accent phonemes and words.
  • The receiving unit 41 receives, through the terminal device, test voice information containing a plurality of minimal pronunciation units of a given language; for example, the terminal device collects the test voice information uttered by the user and sends it to the receiving unit 41, and the receiving unit 41 receives the test voice information sent by the terminal device.
  • The receiving unit 41 includes a first prompting subunit 411 and a first receiving subunit 412, wherein: the first prompting subunit 411 is configured to prompt the user, through the terminal device, to read aloud a plurality of basic pronunciation units;
  • the first receiving subunit 412 is configured to receive the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
  • In a specific implementation, the terminal device displays a plurality of basic pronunciation units of the source language and prompts the user to read them aloud; as the user reads, the terminal device collects the test voice information uttered by the user and sends it to the first receiving subunit 412; the first receiving subunit 412 receives the test voice information sent by the terminal device.
  • Alternatively, the receiving unit 41 includes a second prompting subunit 413 and a second receiving subunit 414, wherein: the second prompting subunit 413 is configured to prompt the user, through the terminal device, to read aloud a passage of text containing a plurality of basic pronunciation units; and the second receiving subunit 414 is configured to receive the test voice information produced when the user reads the text aloud.
  • In a specific implementation, the terminal device displays a passage of text containing a plurality of basic pronunciation units of the source language and prompts the user to read it aloud; as the user reads the passage, the terminal device collects the test voice information uttered by the user and sends it to the second receiving subunit 414; the second receiving subunit 414 receives the test voice information sent by the terminal device.
  • After the test voice information is received, the recognition unit 42 performs speech recognition on it according to the correspondence between the standard accent phonemes of the corresponding language and words in the language database, and generates standard text information. Specifically, the recognition unit 42 first decomposes the test voice information into a plurality of phonemes, then queries that correspondence to convert the plurality of phonemes in the test voice information into a plurality of words, and finally combines the plurality of words into the standard text information.
  • After the standard text information is generated, the output unit 43 outputs it through the terminal device. Specifically, the output unit 43 sends the standard text information to the terminal device, and the terminal device displays it.
  • The correcting unit 44 receives, through the terminal device, the user's modification of the standard text information and obtains the corrected text information. Specifically, the user may modify the standard text information displayed on the terminal device; once the modification is complete, the terminal device obtains the modified corrected text information and sends it to the correcting unit 44, and the correcting unit 44 receives the corrected text information sent by the terminal device.
  • The comparing unit 45 compares the standard text information with the corrected text information, matching up the corresponding words in the two texts, and obtains the mapping relationship between the words in the standard text information and the words in the corrected text information.
  • The establishing unit 46 looks up the correspondence between standard accent phonemes and words in the language database, obtains the standard accent phonemes corresponding to each word in the standard text information and, according to the mapping relationship between the words in the standard text information and the words in the corrected text information, establishes a mapping relationship between the standard accent phonemes and the words in the corrected text information, using this mapping relationship as the correspondence between the user's accent phonemes and words.
  • In this way, the relationship establishing module can collect test voice information of different users, establish the correspondence between accent phonemes and words separately for each user, and generate each correspondence as a corresponding user profile.
  • In the speech translation apparatus of the embodiments of the present invention, a correspondence between a user's accent phonemes (such as the first accent phonemes) and words is preset, and speech recognition is performed on the voice information to be translated using this correspondence, so that voice information spoken with a non-standard accent can be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.
  • The invention also proposes a computer device, which may be a terminal device or a server.
  • The computer device includes a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the aforementioned speech translation method.
  • The speech translation method includes the steps of: receiving first voice information; performing speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information; and performing translation processing on the first text information.
  • The speech translation method described in this embodiment is the speech translation method involved in the foregoing embodiments of the present invention, and details are not described here again.
  • The present invention includes apparatus directed to performing one or more of the operations described herein. These apparatus may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer, which stores computer programs that are selectively activated or reconfigured.
  • Such computer programs may be stored in a device (e.g., computer) readable medium or in any type of medium suitable for storing electronic instructions and respectively coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards.
  • A readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
  • Each block of the structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks therein, can be implemented by computer program instructions.
  • These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that the schemes specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by that processor.
  • The steps, measures and schemes in the various operations, methods and processes discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and processes discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art corresponding to the various operations, methods and processes disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.

Abstract

A speech translation method, apparatus and computer device, the method comprising the following steps: receiving first voice information; performing speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information; and performing translation processing on the first text information. Voice information spoken with a non-standard accent can thus be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can also be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.

Description

Speech translation method, apparatus and computer device
Technical Field
The present invention relates to the field of electronic technologies, and in particular to a speech translation method, apparatus and computer device.
Background Art
With the rapid development of the economy, foreign exchanges are becoming ever more extensive, and for many people the language barrier is a major obstacle to such exchanges. To solve this problem, a variety of translation devices have appeared on the market. With their powerful language translation functions, translation devices are popular among people who need language translation, and they are also a good aid for learning foreign languages. A translation device can translate during a conversation between two parties, so that users speaking different languages can communicate without barriers.
In real life, many people speak not with a standard accent but with their own particular accent, whereas a translation device performs speech recognition on the user's voice information according to a standard accent. This inevitably causes certain words or sentences to be recognized incorrectly, which leads to inaccurate translation, confuses the user and harms the user experience.
It can thus be seen that how to improve the accuracy of speech translation is a technical problem urgently in need of a solution.
Technical Problem
The main object of the present invention is to provide a speech translation method, apparatus and computer device, which aim to improve the accuracy of translation and enhance the user experience.
Technical Solution
To achieve the above object, an embodiment of the present invention provides a speech translation method, the method comprising the following steps:
receiving first voice information;
performing speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information;
performing translation processing on the first text information.
Optionally, the step of performing speech recognition on the first voice information according to the preset correspondence between the first accent phonemes and words to generate the first text information comprises:
decomposing the first voice information into a plurality of first accent phonemes;
converting the plurality of first accent phonemes into a plurality of words according to the correspondence between the first accent phonemes and words;
combining the plurality of words into the first text information.
Optionally, the step of performing translation processing on the first text information comprises:
translating the first text information into second text information;
synthesizing the second text information into second voice information.
Optionally, the step of synthesizing the second text information into the second voice information comprises:
performing speech synthesis on the second text information according to a preset correspondence between second accent phonemes and words, to generate the second voice information.
Optionally, the step of performing speech synthesis according to the preset correspondence between the second accent phonemes and words to generate the second voice information comprises:
decomposing the second text information into a plurality of words;
converting the plurality of words into a plurality of second accent phonemes according to the correspondence between the second accent phonemes and words;
synthesizing the plurality of second accent phonemes into the second voice information.
Optionally, before the step of receiving the first voice information, the method further comprises: establishing a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
Optionally, the step of establishing the correspondence between the user's accent phonemes and words comprises:
receiving test voice information;
performing speech recognition on the test voice information according to a correspondence between standard accent phonemes and words, to generate standard text information;
outputting the standard text information;
receiving the user's modification of the standard text information, to obtain corrected text information;
comparing the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information;
establishing, according to the mapping relationship between the words in the standard text information and the words in the corrected text information and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, and using this mapping relationship as the correspondence between the user's accent phonemes and words.
Optionally, the step of receiving the test voice information comprises:
prompting the user to read aloud a plurality of basic pronunciation units;
receiving the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
Optionally, the step of receiving the test voice information comprises:
prompting the user to read aloud a passage of text, the text containing a plurality of basic pronunciation units;
receiving the test voice information produced when the user reads the text aloud.
Optionally, the method is applied to a terminal device or a server.
An embodiment of the present invention simultaneously provides a speech translation apparatus, the apparatus comprising:
a voice receiving module configured to receive first voice information;
a speech recognition module configured to perform speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information;
a translation processing module configured to perform translation processing on the first text information.
Optionally, the speech recognition module comprises:
a decomposing unit configured to decompose the first voice information into a plurality of first accent phonemes;
a converting unit configured to convert the plurality of first accent phonemes into a plurality of words according to the correspondence between the first accent phonemes and words;
a combining unit configured to combine the plurality of words into the first text information.
Optionally, the translation processing module comprises:
a translating unit configured to translate the first text information into second text information;
a synthesizing unit configured to synthesize the second text information into second voice information.
Optionally, the synthesizing unit is configured to: perform speech synthesis on the second text information according to a preset correspondence between second accent phonemes and words, to generate the second voice information.
Optionally, the synthesizing unit comprises:
a decomposing subunit configured to decompose the second text information into a plurality of words;
a converting subunit configured to convert the plurality of words into a plurality of second accent phonemes according to the correspondence between the second accent phonemes and words;
a synthesizing subunit configured to synthesize the plurality of second accent phonemes into the second voice information.
Optionally, the apparatus further comprises a relationship establishing module configured to: establish a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
Optionally, the relationship establishing module comprises:
a receiving unit configured to receive test voice information;
a recognition unit configured to perform speech recognition on the test voice information according to a correspondence between standard accent phonemes and words, to generate standard text information;
an output unit configured to output the standard text information;
a correcting unit configured to receive the user's modification of the standard text information, to obtain corrected text information;
a comparing unit configured to compare the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information;
an establishing unit configured to establish, according to the mapping relationship between the words in the standard text information and the words in the corrected text information and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, and to use this mapping relationship as the correspondence between the user's accent phonemes and words.
Optionally, the receiving unit comprises:
a first prompting subunit configured to prompt the user to read aloud a plurality of basic pronunciation units;
a first receiving subunit configured to receive the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
Optionally, the receiving unit comprises:
a second prompting subunit configured to prompt the user to read aloud a passage of text, the text containing a plurality of basic pronunciation units;
a second receiving subunit configured to receive the test voice information produced when the user reads the text aloud.
An embodiment of the present invention also provides a computer device comprising a memory, a processor and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the aforementioned speech translation method.
Advantageous Effects
In the speech translation method provided by the embodiments of the present invention, a correspondence between a user's accent phonemes (such as the first accent phonemes) and words is preset, and speech recognition is performed on the voice information to be translated using this correspondence, so that voice information spoken with a non-standard accent can be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can also be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.
Brief Description of the Drawings
FIG. 1 is a flowchart of a first embodiment of the speech translation method of the present invention;
FIG. 2 is a flowchart of a second embodiment of the speech translation method of the present invention;
FIG. 3 is a detailed flowchart of step S10 in FIG. 2;
FIG. 4 is a block diagram of a first embodiment of the speech translation apparatus of the present invention;
FIG. 5 is a block diagram of the speech recognition module in FIG. 4;
FIG. 6 is a block diagram of the translation processing module in FIG. 4;
FIG. 7 is a block diagram of the synthesizing unit in FIG. 6;
FIG. 8 is a block diagram of a second embodiment of the speech translation apparatus of the present invention;
FIG. 9 is a block diagram of the relationship establishing module in FIG. 8;
FIG. 10 is a block diagram of the receiving unit in FIG. 9;
FIG. 11 is another block diagram of the receiving unit in FIG. 9.
The realization of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
Those skilled in the art will understand that, unless specifically stated, the singular forms "a", "an", "the" and "said" used here may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may also be present. Furthermore, "connected" or "coupled" as used here may include wireless connection or wireless coupling. The expression "and/or" as used here includes all or any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used here (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, are not to be interpreted in an idealized or overly formal sense.
Those skilled in the art will understand that "terminal" and "terminal device" as used here include both devices with only a wireless signal receiver and no transmitting capability, and devices with receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such devices may include: cellular or other communication devices with a single-line or multi-line display, or without a multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistant), which may include a radio frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver. The "terminal" or "terminal device" used here may be portable, transportable, installed in a vehicle (air, sea and/or land), or adapted and/or configured to run locally and/or in distributed form at any other location on the earth and/or in space. The "terminal" or "terminal device" used here may also be a communication terminal, an Internet terminal or a music/video playing terminal, for example a PDA, a MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, or a device such as a smart TV or a set-top box.
Those skilled in the art will understand that the server used here includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. Here, the cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. In the embodiments of the present invention, communication between the server, the terminal device and the WNS server may be implemented by any communication means, including but not limited to mobile communication based on 3GPP, LTE or WIMAX, computer network communication based on the TCP/IP or UDP protocols, and short-range wireless transmission based on the Bluetooth or infrared transmission standards.
The speech translation method and apparatus of the embodiments of the present invention may be applied to a terminal device or to a server. The terminal device may be a dedicated translation machine, a mobile terminal such as a mobile phone or tablet, or a computer terminal such as a personal computer or laptop. The server mainly refers to a computer device in the network cloud that performs translation processing on the voice information sent by the terminal device. The following description takes application to a server as an example.
Referring to FIG. 1, a first embodiment of the speech translation method of the present invention is proposed, the method comprising the following steps:
S11. Receive first voice information.
In this embodiment of the present invention, the terminal device collects the first voice information uttered by a first user and sends it to the server, and the server receives the first voice information sent by the terminal device.
S12. Perform speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information.
In this embodiment of the present invention, the first voice information uttered by the first user carries a first accent, and an accent database is preset in the server, the accent database including the correspondence between the first accent phonemes and words. After receiving the first voice information, the server first decomposes it into a plurality of first accent phonemes, then queries the correspondence between the first accent phonemes and words in the accent database to convert the plurality of first accent phonemes in the first voice information into a plurality of words, and finally combines the plurality of words into the first text information. Speech with a non-standard accent can thus be recognized as accurate text, and the meaning the user wants to express can be accurately understood.
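For illustration only (this sketch is not part of the original disclosure), the lookup in step S12 might look as follows in Python; the database entries, the phoneme notation and the assumption that an acoustic front end has already produced the phoneme groups are all hypothetical:

```python
# Minimal sketch of the S12 lookup. Assumes an acoustic front end has already
# decomposed the first voice information into first accent phoneme groups;
# the database entries below are invented for illustration.
FIRST_ACCENT_DB = {
    ("l", "i3"): "你",   # this speaker pronounces the standard "n" as "l"
    ("h", "ao3"): "好",
}

def phonemes_to_text(phoneme_groups):
    """Convert first accent phonemes into words and combine the words into
    the first text information, skipping groups the database does not cover."""
    words = [FIRST_ACCENT_DB[g] for g in phoneme_groups if g in FIRST_ACCENT_DB]
    return "".join(words)

print(phonemes_to_text([("l", "i3"), ("h", "ao3")]))  # prints 你好
```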
S13. Perform translation processing on the first text information.
In some embodiments, the server translates the first text information into second text information in the target language and outputs the second text information through the terminal device; for example, the second text information is sent to the terminal device, and the terminal device displays it.
In other embodiments, the server translates the first text information into second text information in the target language, synthesizes the second text information into second voice information, and outputs the second voice information through the terminal device; for example, the second voice information is sent to the terminal device, and the terminal device outputs it through a sound unit.
Further, the accent database preset in the server also includes a correspondence between second accent phonemes and words. When synthesizing the second text information into the second voice information, the server may also perform speech synthesis on the second text information according to the preset correspondence between the second accent phonemes and words to generate the second voice information; this second voice information carries the accent of a second user, so that it sounds more familiar to the second user and is easier to understand.
Specifically, the server first decomposes the second text information into a plurality of words, then queries the correspondence between the second accent phonemes and words in the accent database to convert the plurality of words in the second text information into a plurality of second accent phonemes, and finally synthesizes the plurality of second accent phonemes into the second voice information. When second voice information carrying the second user's accent is output, it sounds more familiar and intelligible to the second user, greatly improving the user experience.
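A corresponding sketch of the synthesis-side mapping, again purely illustrative and with invented entries; an actual waveform back end is out of scope here:

```python
# The words of the second text information are converted into the second
# user's accent phonemes, which a waveform back end (not shown) would render.
SECOND_ACCENT_DB = {
    "hello": ("HH", "EH", "L", "OW"),  # the second user's accented rendering
    "world": ("W", "OH", "L", "D"),
}

def text_to_accent_phonemes(second_text):
    """Decompose the text into words, then map each word to its phonemes."""
    phonemes = []
    for word in second_text.lower().split():
        phonemes.extend(SECOND_ACCENT_DB.get(word, ()))
    return phonemes

print(text_to_accent_phonemes("Hello world"))
```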
Further, as shown in FIG. 2, in a second embodiment of the speech translation method of the present invention, the following step is further included before step S11:
S10. Establish a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
In this embodiment, the server establishes the correspondence between a user's accent phonemes and words in advance and stores it in the accent database. The user's accent phonemes include at least the first accent phonemes, i.e. the accent phonemes of the first user, and may further include second accent phonemes, third accent phonemes and so on; that is, correspondences between accent phonemes and words may be established separately for multiple different users.
Optionally, as shown in FIG. 3, the specific process by which the server establishes the correspondence between a user's accent phonemes and words is as follows:
S101. Receive test voice information.
It is known that the pronunciation of the words of every language is composed of basic pronunciation units (the minimal pronunciation units), that a person's accent characteristics are well reflected in these basic pronunciation units, and that the number of minimal pronunciation units in each language is limited. In view of this, in this embodiment of the present invention, the server receives, through the terminal device, test voice information containing a plurality of minimal pronunciation units of a given language; for example, the terminal device collects the test voice information uttered by the user and sends it to the server, and the server receives the test voice information sent by the terminal device.
Optionally, the server prompts the user through the terminal device to read aloud a plurality of basic pronunciation units, and receives the test voice information produced as the user reads them. In a specific implementation, the terminal device displays a plurality of basic pronunciation units of the source language and prompts the user to read them aloud; as the user reads, the terminal device collects the test voice information uttered by the user and sends it to the server; the server receives the test voice information sent by the terminal device.
Optionally, the server prompts the user through the terminal device to read aloud a passage of text containing a plurality of basic pronunciation units, and receives the test voice information produced as the user reads the passage. In a specific implementation, the terminal device displays a passage of text containing a plurality of basic pronunciation units of the source language and prompts the user to read it aloud; as the user reads the passage, the terminal device collects the test voice information uttered by the user and sends it to the server; the server receives the test voice information sent by the terminal device.
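The collection flow of step S101 could be sketched as follows; the unit list is a small invented sample, and record_clip stands in for whatever audio-capture API the terminal device actually provides:

```python
# Prompt the user to read each basic pronunciation unit and record one clip
# per unit; the resulting clips form the test voice information.
BASIC_PRONUNCIATION_UNITS = ["ba", "po", "me", "fo", "de", "te", "ne", "le"]

def collect_test_voice(record_clip):
    clips = []
    for unit in BASIC_PRONUNCIATION_UNITS:
        print(f"Please read aloud: {unit}")
        clips.append(record_clip())  # capture API injected by the caller
    return clips  # sent to the server as the test voice information
```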
S102. Perform speech recognition on the test voice information according to the correspondence between standard accent phonemes and words, to generate standard text information.
After receiving the test voice information, the server performs speech recognition on it according to the correspondence between the standard accent phonemes of the corresponding language and words in the language database, and generates standard text information. Specifically, the server first decomposes the test voice information into a plurality of phonemes, then queries that correspondence to convert the plurality of phonemes in the test voice information into a plurality of words, and finally combines the plurality of words into the standard text information.
S103. Output the standard text information.
After generating the standard text information, the server outputs it through the terminal device. Specifically, the server sends the standard text information to the terminal device, and the terminal device displays it.
S104. Receive the user's modification of the standard text information, to obtain corrected text information.
The server receives, through the terminal device, the user's modification of the standard text information and obtains the corrected text information. Specifically, the user may modify the standard text information displayed on the terminal device; once the modification is complete, the terminal device obtains the modified corrected text information and sends it to the server, and the server receives the corrected text information sent by the terminal device.
S105. Compare the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information.
The server compares the standard text information with the corrected text information, matching up the corresponding words in the two texts, and obtains the mapping relationship between the words in the standard text information and the words in the corrected text information.
S106. According to the mapping relationship between the words in the standard text information and the words in the corrected text information, and the standard accent phonemes corresponding to the words in the standard text information, establish a mapping relationship between the standard accent phonemes and the words in the corrected text information, and use this mapping relationship as the correspondence between the user's accent phonemes and words.
The server looks up the correspondence between standard accent phonemes and words in the language database, obtains the standard accent phonemes corresponding to each word in the standard text information and, according to the mapping relationship between the words in the standard text information and the words in the corrected text information, establishes a mapping relationship between the standard accent phonemes and the words in the corrected text information, using this mapping relationship as the correspondence between the user's accent phonemes and words.
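Steps S105 and S106 might be sketched as follows, under two simplifying assumptions that the original text does not state: the standard and corrected texts align word for word, and STANDARD_PHONEME_DB stands in for the language database's standard phoneme entries:

```python
import json

# Invented sample data: the accented speaker said 你好, but recognition with
# the standard accent produced 李浩.
STANDARD_PHONEME_DB = {
    "李": ("l", "i3"),
    "浩": ("h", "ao4"),
}

def build_user_correspondence(standard_words, corrected_words):
    """Map the standard accent phonemes of each misrecognized word to the
    word the user actually meant (the user's accent correspondence)."""
    user_db = {}
    for std, corr in zip(standard_words, corrected_words):
        if std != corr and std in STANDARD_PHONEME_DB:
            user_db[STANDARD_PHONEME_DB[std]] = corr
    return user_db

profile = build_user_correspondence(["李", "浩"], ["你", "好"])
# Persist the correspondence as a per-user profile (keys joined for JSON):
print(json.dumps({"-".join(k): v for k, v in profile.items()}, ensure_ascii=False))
```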
With the above method, test voice information of different users can be collected, the correspondence between accent phonemes and words can be established separately for each user, and each correspondence can be generated as a corresponding user profile.
Besides being applied to a server, the speech translation method of the embodiments of the present invention can also be applied to a terminal device; that is, when the terminal device performs local translation, the above speech translation method can be implemented on the device itself. When the terminal device needs to perform cloud translation, the above speech translation method is implemented through the server.
In the embodiments of the present invention, the aforementioned server may be a single server integrating speech recognition, text translation and speech synthesis, or may be three independent servers that implement speech recognition, text translation and speech synthesis respectively. For example: the speech recognition server receives the first voice information sent by the terminal device, performs speech recognition on it according to the preset correspondence between the first accent phonemes and words, generates the first text information, and sends the first text information to the text translation server; the text translation server receives the first text information, translates it into the second text information, and sends the second text information to the speech synthesis server; the speech synthesis server receives the second text information, synthesizes it into the second voice information, and sends the second voice information to the terminal device; the terminal device receives the second voice information and outputs it through a sound unit.
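As a purely illustrative sketch of that split deployment, each stage can be modelled as a plain callable; in practice each call would be a network request to the corresponding server, whose endpoints and transport are not specified by this document:

```python
def translate_speech(first_voice, recognize, translate, synthesize):
    first_text = recognize(first_voice)      # speech recognition server
    second_text = translate(first_text)      # text translation server
    second_voice = synthesize(second_text)   # speech synthesis server
    return second_voice                      # returned to the terminal device

# Usage with trivial stand-ins for the three servers:
audio_out = translate_speech(
    b"...",                                  # first voice information
    recognize=lambda audio: "你好",
    translate=lambda text: "Hello",
    synthesize=lambda text: b"<waveform for 'Hello'>",
)
```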
In the speech translation method of the embodiments of the present invention, a correspondence between a user's accent phonemes (such as the first accent phonemes) and words is preset, and speech recognition is performed on the voice information to be translated using this correspondence, so that voice information spoken with a non-standard accent can be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can also be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.
Referring to FIG. 4, an embodiment of the speech translation apparatus of the present invention is proposed. The apparatus includes a voice receiving module 10, a speech recognition module 20 and a translation processing module 30, wherein: the voice receiving module 10 is configured to receive first voice information; the speech recognition module 20 is configured to perform speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information; and the translation processing module 30 is configured to perform translation processing on the first text information.
In this embodiment of the present invention, application of the speech translation apparatus to a server is taken as an example. The terminal device collects the first voice information uttered by the first user and sends it to the voice receiving module 10, and the voice receiving module 10 receives the first voice information sent by the terminal device.
In this embodiment of the present invention, the first voice information uttered by the first user carries a first accent, and an accent database is preset in the server, the accent database including the correspondence between the first accent phonemes and words.
As shown in FIG. 5, the speech recognition module 20 includes a decomposing unit 11, a converting unit 12 and a combining unit 13, wherein: the decomposing unit 11 is configured to decompose the first voice information into a plurality of first accent phonemes; the converting unit 12 is configured to query the correspondence between the first accent phonemes and words in the accent database and convert the plurality of first accent phonemes in the first voice information into a plurality of words according to that correspondence; and the combining unit 13 is configured to combine the plurality of words into the first text information. Speech with a non-standard accent can thus be recognized as accurate text, and the meaning the user wants to express can be accurately understood.
After the speech recognition module 20 has recognized the first voice information as the first text information, the translation processing module 30 performs translation processing on the first text information.
In some embodiments, the translation processing module 30 translates the first text information into second text information in the target language and outputs the second text information through the terminal device; for example, the second text information is sent to the terminal device, and the terminal device displays it.
In other embodiments, as shown in FIG. 6, the translation processing module 30 includes a translating unit 31 and a synthesizing unit 32, wherein: the translating unit 31 is configured to translate the first text information into the second text information; and the synthesizing unit 32 is configured to synthesize the second text information into the second voice information and output the second voice information through the terminal device; for example, the second voice information is sent to the terminal device, and the terminal device outputs it through a sound unit.
Further, the accent database preset in the server also includes the correspondence between second accent phonemes and words. When synthesizing the second text information into the second voice information, the synthesizing unit 32 may also perform speech synthesis on the second text information according to the preset correspondence between the second accent phonemes and words to generate the second voice information; this second voice information carries the accent of the second user, so that it sounds more familiar to the second user and is easier to understand.
As shown in FIG. 7, the synthesizing unit 32 includes a decomposing subunit 321, a converting subunit 322 and a synthesizing subunit 323, wherein: the decomposing subunit 321 is configured to decompose the second text information into a plurality of words; the converting subunit 322 is configured to query the correspondence between the second accent phonemes and words in the accent database and convert the plurality of words in the second text information into a plurality of second accent phonemes according to that correspondence; and the synthesizing subunit 323 is configured to synthesize the plurality of second accent phonemes into the second voice information. When second voice information carrying the second user's accent is output, it sounds more familiar and intelligible to the second user, greatly improving the user experience.
Further, as shown in FIG. 8, in a second embodiment of the speech translation apparatus of the present invention, the apparatus further includes a relationship establishing module 40, the relationship establishing module 40 being configured to establish a correspondence between a user's accent phonemes and words.
In this embodiment, the server establishes the correspondence between a user's accent phonemes and words in advance and stores it in the accent database. The user's accent phonemes include at least the first accent phonemes, i.e. the accent phonemes of the first user, and may further include second accent phonemes, third accent phonemes and so on; that is, correspondences between accent phonemes and words may be established separately for multiple different users.
Optionally, as shown in FIG. 9, the relationship establishing module 40 includes a receiving unit 41, a recognition unit 42, an output unit 43, a correcting unit 44, a comparing unit 45 and an establishing unit 46, wherein: the receiving unit 41 is configured to receive test voice information; the recognition unit 42 is configured to perform speech recognition on the test voice information according to the correspondence between standard accent phonemes and words, to generate standard text information; the output unit 43 is configured to output the standard text information; the correcting unit 44 is configured to receive the user's modification of the standard text information, to obtain corrected text information; the comparing unit 45 is configured to compare the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information; and the establishing unit 46 is configured to establish, according to that mapping relationship and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, which is used as the correspondence between the user's accent phonemes and words.
It is known that the pronunciation of the words of every language is composed of basic pronunciation units (the minimal pronunciation units), that a person's accent characteristics are well reflected in these basic pronunciation units, and that the number of minimal pronunciation units in each language is limited. In view of this, in this embodiment of the present invention, the receiving unit 41 receives, through the terminal device, test voice information containing a plurality of minimal pronunciation units of a given language; for example, the terminal device collects the test voice information uttered by the user and sends it to the receiving unit 41, and the receiving unit 41 receives the test voice information sent by the terminal device.
Optionally, as shown in FIG. 10, the receiving unit 41 includes a first prompting subunit 411 and a first receiving subunit 412, wherein: the first prompting subunit 411 is configured to prompt the user, through the terminal device, to read aloud a plurality of basic pronunciation units; and the first receiving subunit 412 is configured to receive the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
In a specific implementation, the terminal device displays a plurality of basic pronunciation units of the source language and prompts the user to read them aloud; as the user reads, the terminal device collects the test voice information uttered by the user and sends it to the first receiving subunit 412; the first receiving subunit 412 receives the test voice information sent by the terminal device.
Optionally, as shown in FIG. 11, the receiving unit 41 includes a second prompting subunit 413 and a second receiving subunit 414, wherein: the second prompting subunit 413 is configured to prompt the user, through the terminal device, to read aloud a passage of text containing a plurality of basic pronunciation units; and the second receiving subunit 414 is configured to receive the test voice information produced when the user reads the text aloud.
In a specific implementation, the terminal device displays a passage of text containing a plurality of basic pronunciation units of the source language and prompts the user to read it aloud; as the user reads the passage, the terminal device collects the test voice information uttered by the user and sends it to the second receiving subunit 414; the second receiving subunit 414 receives the test voice information sent by the terminal device.
After the test voice information is received, the recognition unit 42 performs speech recognition on it according to the correspondence between the standard accent phonemes of the corresponding language and words in the language database, and generates standard text information. Specifically, the recognition unit 42 first decomposes the test voice information into a plurality of phonemes, then queries that correspondence to convert the plurality of phonemes in the test voice information into a plurality of words, and finally combines the plurality of words into the standard text information.
After the standard text information is generated, the output unit 43 outputs it through the terminal device. Specifically, the output unit 43 sends the standard text information to the terminal device, and the terminal device displays it.
The correcting unit 44 receives, through the terminal device, the user's modification of the standard text information and obtains the corrected text information. Specifically, the user may modify the standard text information displayed on the terminal device; once the modification is complete, the terminal device obtains the modified corrected text information and sends it to the correcting unit 44, and the correcting unit 44 receives the corrected text information sent by the terminal device.
After the corrected text information is obtained, the comparing unit 45 compares the standard text information with the corrected text information, matching up the corresponding words in the two texts, and obtains the mapping relationship between the words in the standard text information and the words in the corrected text information.
The establishing unit 46 looks up the correspondence between standard accent phonemes and words in the language database, obtains the standard accent phonemes corresponding to each word in the standard text information and, according to the mapping relationship between the words in the standard text information and the words in the corrected text information, establishes a mapping relationship between the standard accent phonemes and the words in the corrected text information, using this mapping relationship as the correspondence between the user's accent phonemes and words.
With the above method, the relationship establishing module can collect test voice information of different users, establish the correspondence between accent phonemes and words separately for each user, and generate each correspondence as a corresponding user profile.
In the speech translation apparatus of the embodiments of the present invention, a correspondence between a user's accent phonemes (such as the first accent phonemes) and words is preset, and speech recognition is performed on the voice information to be translated using this correspondence, so that voice information spoken with a non-standard accent can be recognized as accurate text information and the meaning the user wants to express can be accurately understood; translating that text information then improves the accuracy of the translation. Further, the translated text information can also be synthesized, according to the correspondence between the user's accent phonemes and words, into voice information carrying the user's accent, so that it sounds more familiar to the second user and is easier to understand, improving the user experience.
The present invention also proposes a computer device, which may be a terminal device or a server. The computer device includes a memory, a processor and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the aforementioned speech translation method. The speech translation method includes the following steps: receiving first voice information; performing speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information; and performing translation processing on the first text information. The speech translation method described in this embodiment is the speech translation method involved in the foregoing embodiments of the present invention, and details are not repeated here.
Those skilled in the art will understand that the present invention includes apparatus directed to performing one or more of the operations described in this application. These apparatus may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer, which stores computer programs that are selectively activated or reconfigured. Such computer programs may be stored in a device (e.g., computer) readable medium or in any type of medium suitable for storing electronic instructions and respectively coupled to a bus, the computer-readable medium including but not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
Those skilled in the art will understand that each block of the structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that the schemes specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by the processor of the computer or other programmable data processing apparatus.
Those skilled in the art will understand that the steps, measures and schemes in the various operations, methods and processes discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and processes discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art corresponding to the various operations, methods and processes disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (20)

  1. A speech translation method, characterized by comprising the following steps:
    receiving first voice information;
    performing speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information;
    performing translation processing on the first text information.
  2. The speech translation method according to claim 1, characterized in that the step of performing speech recognition on the first voice information according to the preset correspondence between the first accent phonemes and words to generate the first text information comprises:
    decomposing the first voice information into a plurality of first accent phonemes;
    converting the plurality of first accent phonemes into a plurality of words according to the correspondence between the first accent phonemes and words;
    combining the plurality of words into the first text information.
  3. The speech translation method according to claim 1, characterized in that the step of performing translation processing on the first text information comprises:
    translating the first text information into second text information;
    synthesizing the second text information into second voice information.
  4. The speech translation method according to claim 3, characterized in that the step of synthesizing the second text information into the second voice information comprises:
    performing speech synthesis on the second text information according to a preset correspondence between second accent phonemes and words, to generate the second voice information.
  5. The speech translation method according to claim 4, characterized in that the step of performing speech synthesis according to the preset correspondence between the second accent phonemes and words to generate the second voice information comprises:
    decomposing the second text information into a plurality of words;
    converting the plurality of words into a plurality of second accent phonemes according to the correspondence between the second accent phonemes and words;
    synthesizing the plurality of second accent phonemes into the second voice information.
  6. The speech translation method according to claim 1, characterized in that, before the step of receiving the first voice information, the method further comprises: establishing a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
  7. The speech translation method according to claim 1, characterized in that the step of establishing the correspondence between the user's accent phonemes and words comprises:
    receiving test voice information;
    performing speech recognition on the test voice information according to a correspondence between standard accent phonemes and words, to generate standard text information;
    outputting the standard text information;
    receiving the user's modification of the standard text information, to obtain corrected text information;
    comparing the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information;
    establishing, according to the mapping relationship between the words in the standard text information and the words in the corrected text information and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, and using this mapping relationship as the correspondence between the user's accent phonemes and words.
  8. The speech translation method according to claim 7, characterized in that the step of receiving the test voice information comprises:
    prompting the user to read aloud a plurality of basic pronunciation units;
    receiving the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
  9. The speech translation method according to claim 7, characterized in that the step of receiving the test voice information comprises:
    prompting the user to read aloud a passage of text, the text containing a plurality of basic pronunciation units;
    receiving the test voice information produced when the user reads the text aloud.
  10. The speech translation method according to claim 1, characterized in that the method is applied to a terminal device or a server.
  11. A speech translation apparatus, characterized by comprising:
    a voice receiving module configured to receive first voice information;
    a speech recognition module configured to perform speech recognition on the first voice information according to a preset correspondence between first accent phonemes and words, to generate first text information;
    a translation processing module configured to perform translation processing on the first text information.
  12. The speech translation apparatus according to claim 11, characterized in that the speech recognition module comprises:
    a decomposing unit configured to decompose the first voice information into a plurality of first accent phonemes;
    a converting unit configured to convert the plurality of first accent phonemes into a plurality of words according to the correspondence between the first accent phonemes and words;
    a combining unit configured to combine the plurality of words into the first text information.
  13. The speech translation apparatus according to claim 11, characterized in that the translation processing module comprises:
    a translating unit configured to translate the first text information into second text information;
    a synthesizing unit configured to synthesize the second text information into second voice information.
  14. The speech translation apparatus according to claim 13, characterized in that the synthesizing unit is configured to: perform speech synthesis on the second text information according to a preset correspondence between second accent phonemes and words, to generate the second voice information.
  15. The speech translation apparatus according to claim 14, characterized in that the synthesizing unit comprises:
    a decomposing subunit configured to decompose the second text information into a plurality of words;
    a converting subunit configured to convert the plurality of words into a plurality of second accent phonemes according to the correspondence between the second accent phonemes and words;
    a synthesizing subunit configured to synthesize the plurality of second accent phonemes into the second voice information.
  16. The speech translation apparatus according to claim 11, characterized in that the apparatus further comprises a relationship establishing module, the relationship establishing module being configured to: establish a correspondence between a user's accent phonemes and words, the user's accent phonemes including the first accent phonemes.
  17. The speech translation apparatus according to claim 11, characterized in that the relationship establishing module comprises:
    a receiving unit configured to receive test voice information;
    a recognition unit configured to perform speech recognition on the test voice information according to a correspondence between standard accent phonemes and words, to generate standard text information;
    an output unit configured to output the standard text information;
    a correcting unit configured to receive the user's modification of the standard text information, to obtain corrected text information;
    a comparing unit configured to compare the standard text information with the corrected text information, to obtain a mapping relationship between the words in the standard text information and the words in the corrected text information;
    an establishing unit configured to establish, according to the mapping relationship between the words in the standard text information and the words in the corrected text information and the standard accent phonemes corresponding to the words in the standard text information, a mapping relationship between the standard accent phonemes and the words in the corrected text information, and to use this mapping relationship as the correspondence between the user's accent phonemes and words.
  18. The speech translation apparatus according to claim 17, characterized in that the receiving unit comprises:
    a first prompting subunit configured to prompt the user to read aloud a plurality of basic pronunciation units;
    a first receiving subunit configured to receive the test voice information produced when the user reads the plurality of basic pronunciation units aloud.
  19. The speech translation apparatus according to claim 17, characterized in that the receiving unit comprises:
    a second prompting subunit configured to prompt the user to read aloud a passage of text, the text containing a plurality of basic pronunciation units;
    a second receiving subunit configured to receive the test voice information produced when the user reads the text aloud.
  20. A computer device, comprising a memory, a processor and at least one application stored in the memory and configured to be executed by the processor, characterized in that the application is configured to perform the speech translation method according to any one of claims 1 to 10.
PCT/CN2018/082039 2018-03-06 2018-04-04 Speech translation method, apparatus and computer device WO2019169686A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810182915.7 2018-03-06
CN201810182915.7A CN108447473A (zh) 2018-03-06 2018-03-06 Speech translation method and apparatus

Publications (1)

Publication Number Publication Date
WO2019169686A1 (zh)

Family

ID=63193757

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/082039 WO2019169686A1 (zh) 2018-03-06 2018-04-04 Speech translation method, apparatus and computer device

Country Status (2)

Country Link
CN (1) CN108447473A (zh)
WO (1) WO2019169686A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109300469A * 2018-09-05 2019-02-01 满金坝(深圳)科技有限公司 Simultaneous interpretation method and apparatus based on machine learning
CN110232908B * 2019-07-30 2022-02-18 厦门钛尚人工智能科技有限公司 Distributed speech synthesis system
CN113628626A * 2020-05-09 2021-11-09 阿里巴巴集团控股有限公司 Speech recognition method, apparatus and system, and translation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1815551A * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Method for dialectal text processing in a dialect speech synthesis system
CN101976304A * 2010-10-16 2011-02-16 陈长江 Intelligent life butler system and method
CN103943109A * 2014-04-28 2014-07-23 深圳如果技术有限公司 Method and apparatus for converting speech into text
WO2015030340A1 * 2013-08-28 2015-03-05 한국전자통신연구원 Terminal device and hands-free device for hands-free automatic interpretation service, and hands-free automatic interpretation service method
CN106486125A * 2016-09-29 2017-03-08 安徽声讯信息技术有限公司 Simultaneous interpretation system based on speech recognition technology
CN107274886A * 2016-04-06 2017-10-20 中兴通讯股份有限公司 Speech recognition method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104407834A * 2014-11-13 2015-03-11 腾讯科技(成都)有限公司 Information input method and apparatus

Also Published As

Publication number Publication date
CN108447473A (zh) 2018-08-24

Similar Documents

Publication Publication Date Title
US7593842B2 (en) Device and method for translating language
JP2023022150A Bidirectional speech translation system, bidirectional speech translation method, and program
US20160048508A1 (en) Universal language translator
WO2018214314A1 Method and apparatus for implementing simultaneous interpretation
WO2019169686A1 Speech translation method, apparatus and computer device
US20070050188A1 (en) Tone contour transformation of speech
WO2008084476A2 (en) Vowel recognition system and method in speech to text applications
JP5628749B2 Interpretation terminal and interpretation method using mutual communication between interpretation terminals
GB2557714A (en) Determining phonetic relationships
WO2019075829A1 Speech translation method, apparatus and translation device
US20080300855A1 (en) Method for realtime spoken natural language translation and apparatus therefor
US9213693B2 (en) Machine language interpretation assistance for human language interpretation
KR101412657B1 Interpretation method and apparatus using mutual communication between two or more interpretation terminals
JP3473204B2 Translation device and portable terminal device
CN103853736A Traffic information voice query system and voice processing unit thereof
JP2009122989A Translation apparatus
WO2019000619A1 Translation method, translation device and translation system
WO2019071541A1 Speech translation method, apparatus and terminal device
CN107995624B Method for outputting sound data based on multi-path data transmission
KR100553437B1 Wireless communication terminal having voice message transmission by speech synthesis, and method therefor
Wang et al. Real-Time Voice-Call Language Translation
JP2002300259A Evaluation test method and system for voice communication devices
KR20090081046A Language learning system and method using the Internet
JP6680125B2 Robot and voice dialogue method
Ansari et al. Multilingual speech to speech translation system in bluetooth environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18908451

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18908451

Country of ref document: EP

Kind code of ref document: A1