WO2019075829A1 - Speech translation method, apparatus, and translation device - Google Patents

Speech translation method, apparatus, and translation device

Info

Publication number
WO2019075829A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
gender
original
information
Prior art date
Application number
PCT/CN2017/111961
Other languages
English (en)
French (fr)
Inventor
郑勇
王文祺
Original Assignee
深圳市沃特沃德股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市沃特沃德股份有限公司 filed Critical 深圳市沃特沃德股份有限公司
Publication of WO2019075829A1 publication Critical patent/WO2019075829A1/zh

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/40 — Processing or translation of natural language
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; text-to-speech systems
    • G10L13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination

Definitions

  • the present invention relates to the field of electronic technologies, and in particular, to a speech translation method and apparatus.
  • the specific implementation manner is as follows: the user presses a specific button of the translation device and speaks, and the translation device collects the voice information and performs translation processing. After the user finishes speaking, the user presses the button again, and the translation device outputs the translated voice information.
  • the voice gender of the voice information output after translation processing is preset: the user can set the translated voice to be a male voice or a female voice. Once set, the translated voice information always has the same voice gender, regardless of whether the speaker is male or female. For example, when the male voice is set, if the speaker is female, the translated voice information is still a male voice; when the female voice is set, if the speaker is male, the translated voice information is still a female voice.
  • the voice gender of the translated voice information may be inconsistent with the voice gender of the original voice information, resulting in uncoordinated original voice and translated voice, making the user feel strange.
  • the realism of communication is greatly reduced, and the user experience is not good.
  • the main object of the present invention is to provide a speech translation method and apparatus, which aims to solve the technical problem that the speech gender of the translated speech information is inconsistent with the speech gender of the original speech information, enhance the realism of communication, and enhance user experience.
  • an embodiment of the present invention provides a voice translation method, where the method includes the following steps.
  • the step of identifying the voice gender of the original voice information includes:
  • the step of acquiring a frequency of a pitch of the original voice information includes:
  • the voice frame has a length of 20-30 ms.
  • the sampling frequency is 8 kHz.
  • the threshold value is 180-220 Hz.
  • the step of performing translation processing on the original voice information according to the selected voice synthesized voiceprint includes:
  • the step of identifying the voice gender of the original voice information comprises: identifying a voice gender of the voice information whenever a voice information is detected.
  • the speech synthesis voiceprints include a male voiceprint and a female voiceprint
  • the step of selecting a corresponding voice synthesis voiceprint according to the voice gender includes:
  • the voice gender is a male voice
  • the male voiceprint is selected
  • the voice gender is a female voice
  • the female voiceprint is selected.
  • Embodiments of the present invention also provide a voice translation apparatus, where the apparatus includes:
  • a gender identification module configured to identify a voice gender of the original voice information
  • a voiceprint selection module configured to select a corresponding voice synthesis voiceprint according to the voice gender
  • a translation processing module configured to perform translation processing on the original voice information according to the selected voice synthesized voiceprint, so that the voice gender of the translated voice information is consistent with the voice gender of the original voice information.
  • the gender identification module includes:
  • an obtaining unit configured to acquire a frequency of a pitch of the original voice information
  • a comparing unit configured to compare the pitch frequency with a threshold value
  • a first identifying unit configured to determine, when the frequency of the pitch is less than or equal to a threshold value, a voice gender of the original voice information is a male voice
  • a second identifying unit configured to determine that the voice gender of the original voice information is a female voice when the frequency of the pitch is greater than a threshold value.
  • the acquiring unit includes:
  • a sampling subunit configured to continuously sample the original voice information at a preset sampling frequency to collect M (M>2) voice frames;
  • an extraction subunit configured to perform pitch frequency feature extraction on the collected speech frames
  • a statistical subunit configured to calculate a frequency of a pitch of the original voice information according to the extracted pitch frequency feature.
  • the translation processing module includes:
  • a first processing unit configured to perform voice recognition processing on the original voice information, to obtain a first character string in an original language
  • a second processing unit configured to perform a character translation process on the first character string to obtain a second character string of the target language
  • the third processing unit is configured to perform voice synthesis processing on the second character string by using the selected voice synthesis voiceprint to obtain voice information in the target language.
  • the gender identification module is configured to identify the voice gender of a piece of voice information each time one is detected.
  • the speech synthesis voiceprints include a male voiceprint and a female voiceprint
  • the voiceprint selection module includes
  • a first selecting unit configured to select the male voiceprint when the voice gender is a male voice
  • a second selecting unit configured to select the female voiceprint when the voice gender is a female voice.
  • Embodiments of the present invention further provide a translation apparatus, the translation apparatus including a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, the application program being configured to perform the aforementioned speech translation method.
  • A speech translation method provided by an embodiment of the present invention identifies the speech gender of the original speech information, selects a corresponding speech synthesis voiceprint according to that gender, and finally performs translation processing on the original speech information according to the selected voiceprint, so that the voice gender of the translated speech information is consistent with that of the original speech information, realizing adaptation to the voice gender.
  • When the speaker is male, the translated voice is a male voice; when the speaker is female, the translated voice is a female voice. This coordinates the original voice with the translated voice, greatly enhancing the realism of communication and improving the user experience.
  • FIG. 1 is a flow chart of an embodiment of a speech translation method of the present invention
  • FIG. 2 is a specific flowchart of step S11 in FIG. 1;
  • FIG. 3 is a block diagram showing an embodiment of a speech translation apparatus of the present invention.
  • FIG. 4 is a block diagram of the gender identification module of FIG. 3;
  • FIG. 5 is a block diagram of the acquisition unit of FIG. 4;
  • FIG. 6 is a schematic block diagram of the voiceprint selection module of FIG. 3;
  • FIG. 7 is a block diagram of the translation processing module of FIG. 3.
  • The terms "terminal" and "terminal device" used herein include devices having only a wireless signal receiver without transmitting capability, as well as devices having both receiving and transmitting hardware.
  • Such a device may comprise: a cellular or other communication device, with or without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, fax and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, pager, Internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device that includes a radio frequency receiver.
  • PCS Personal Communications Service
  • PDA Personal Digital Assistant
  • GPS Global Positioning System
  • A terminal may be portable, transportable, installed in a vehicle (air, sea and/or land), or adapted and/or configured to operate locally and/or run in a distributed fashion at any other location on Earth and/or in space.
  • The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, and may be, for example, a PDA, a MID (Mobile Internet Device), and/or a mobile phone with music/video playback functionality, or a device such as a smart TV or set-top box.
  • the server used herein includes, but is not limited to, a computer, a network host, a single network server, a plurality of network server sets, or a cloud composed of a plurality of servers.
  • The cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
  • Communication between the server, the terminal device, and the WNS server may be implemented by any communication means, including but not limited to mobile communication based on 3GPP, LTE or WiMAX, computer network communication based on the TCP/IP and UDP protocols, and short-range wireless transmission based on the Bluetooth and infrared transmission standards.
  • the speech translation method and apparatus of the embodiments of the present invention may be applied to a translation device, and may also be applied to a server.
  • the translation device can be a dedicated translation machine, a mobile terminal such as a mobile phone or a tablet, or a computer terminal such as a personal computer or a notebook computer.
  • a speech translation method of the present invention is proposed. The method includes the following steps:
  • The original voice information is the voice information to be translated according to the embodiment of the present invention.
  • The original voice information may be voice information collected on the spot, voice information stored locally, or voice information obtained from other devices.
  • the translation device can collect voice information sent by the user through a microphone, and the voice information is the original voice information.
  • the server receives the voice information sent by the translation device, and the voice information is the original voice information.
  • The pitch frequency may be used as the recognition basis, and the voice gender of the original voice information may be identified by a gender recognition algorithm such as VQ (Vector Quantization), HMM (Hidden Markov Model), or SVM (Support Vector Machine).
  • the voice gender of the original voice information may be identified in the following manner, including the following steps:
  • The original voice information is continuously sampled at a preset sampling frequency to collect M (M>2) voice frames; the pitch frequency feature is then extracted from the collected voice frames; finally, the pitch frequency of the original voice information is calculated from the extracted pitch frequency features.
  • the sampling frequency can be selected to be 8 kHz, and of course other frequencies can be selected.
  • the value of M is preferably around 25.
  • the length of each voice frame is preferably 20-30 ms.
  • The pitch frequencies of the acquired speech frames can be averaged, and the average value taken as the pitch frequency of the original speech information.
  • The pitch frequency of a male voice is lower than that of a female voice: the male pitch frequency distribution generally ranges from 0-200 Hz, while the female pitch frequency distribution generally ranges from 200-500 Hz. The threshold can therefore be set to 180-220 Hz, for example 200 Hz.
  • the voice gender of the voice information includes male voice and female voice.
  • When the pitch frequency is less than or equal to the threshold, the voice gender of the original voice information is recognized as a male voice.
  • When the pitch frequency is greater than the threshold, the voice gender of the original voice information is recognized as a female voice.
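The frame sampling, pitch averaging, and threshold comparison described above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the autocorrelation-based pitch estimator, function names, and default values (25 ms frames, 8 kHz sampling, 200 Hz threshold) are assumptions drawn from the preferred values stated in the text.

```python
import numpy as np

def frame_pitch_autocorr(frame, fs=8000, fmin=50.0, fmax=500.0):
    """Estimate the pitch (fundamental frequency) of one frame via
    autocorrelation; returns 0.0 if no peak exists in the search range."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)                      # smallest lag to search
    lag_max = min(int(fs / fmin), len(corr) - 1)  # largest lag to search
    if lag_max <= lag_min:
        return 0.0
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return fs / lag

def identify_voice_gender(samples, fs=8000, frame_ms=25, threshold_hz=200.0):
    """Classify a mono PCM signal as 'male' or 'female' by averaging
    per-frame pitch estimates and comparing to a threshold (200 Hz here,
    within the 180-220 Hz range suggested by the text)."""
    frame_len = int(fs * frame_ms / 1000)         # 20-30 ms frames
    pitches = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        f0 = frame_pitch_autocorr(samples[start:start + frame_len], fs)
        if f0 > 0:
            pitches.append(f0)
    if not pitches:
        return None                               # no usable frames
    mean_f0 = float(np.mean(pitches))
    return "male" if mean_f0 <= threshold_hz else "female"
```

A pure tone at 120 Hz would be classified as male and one at 300 Hz as female; real speech would first need voiced/unvoiced filtering, which this sketch omits.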
  • Each time a piece of voice information is detected, its voice gender is recognized once, so that each piece of voice information is matched with the corresponding speech synthesis voiceprint; after translation processing, the voice gender of each translated piece is consistent with the voice gender of the corresponding original piece.
  • VAD voice activity detection
  • Two speech synthesis voiceprints are preset: a male voiceprint and a female voiceprint.
  • When the voice gender of the original voice information is recognized as a male voice, the male voiceprint is selected; when it is recognized as a female voice, the female voiceprint is selected.
  • There may be at least two male voiceprints and at least two female voiceprints, each with a different pitch frequency, and the corresponding male or female voiceprint may be selected according to the pitch frequency of the original voice information.
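Selecting among several preset voiceprints per gender by nearest nominal pitch might look like the sketch below; the voiceprint table, identifiers, and pitch values are hypothetical, since the patent does not specify them.

```python
# Hypothetical voiceprint table: several male and female synthesis
# voiceprints, each tagged with a nominal pitch frequency in Hz.
VOICEPRINTS = {
    "male":   [{"id": "male_low", "pitch_hz": 110},
               {"id": "male_high", "pitch_hz": 160}],
    "female": [{"id": "female_low", "pitch_hz": 220},
               {"id": "female_high", "pitch_hz": 320}],
}

def select_voiceprint(gender, pitch_hz):
    """Pick the synthesis voiceprint of the matching gender whose
    nominal pitch is closest to the speaker's measured pitch."""
    candidates = VOICEPRINTS[gender]
    return min(candidates, key=lambda vp: abs(vp["pitch_hz"] - pitch_hz))
```

For example, a male speaker measured at 120 Hz would get the lower-pitched male voiceprint, so the synthesized voice tracks not just the gender but roughly the pitch of the original speaker.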
  • S13 Perform translation processing on the original voice information according to the selected voice synthesized voiceprint.
  • The original voice information is translated according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, thereby enhancing the realism of communication and improving the user experience.
  • The translation processing of voice information mainly includes three processes: speech recognition, text translation, and speech synthesis. Specifically, speech recognition processing is first performed on the original speech information to obtain a first character string in the original language.
  • Text translation processing is then performed on the first character string to obtain a second character string in the target language; finally, speech synthesis processing is performed on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
  • The translation device can perform translation processing locally; that is, the original speech information is sequentially subjected to the three processes of speech recognition, text translation, and speech synthesis to obtain the code stream of the target-language voice information. The translation device can also delegate translation processing to servers.
  • For example, the translation device first sends the original voice information to a speech recognition server, which performs speech recognition, recognizes the first character string, and returns it to the translation device. The translation device receives the first character string and sends it to a text translation server, which translates it into a second character string in the target language and returns it. The translation device then sends the second character string, together with the selected speech synthesis voiceprint, to a speech synthesis server, which performs speech synthesis processing on the second character string using the selected voiceprint to obtain the target-language voice information and returns it to the translation device in the form of a code stream. The translation device receives the code stream of the target-language voice information to obtain the translated voice information.
  • Alternatively, the translation device may send the original voice information and the selected speech synthesis voiceprint to a single server, which directly performs speech recognition and text translation processing on the original voice information and uses the selected speech synthesis voiceprint for speech synthesis, obtaining the code stream of the target-language voice information.
  • the server sequentially performs speech recognition, text translation, and speech synthesis on the original voice information to obtain voice information of the target language.
  • the voice information of the target language is sent to the translation device in the form of a code stream.
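Whether the three stages run locally or are split across servers, the pipeline described above is the same composition of steps. A minimal sketch follows; the three callables are purely illustrative stand-ins for whatever speech recognition, text translation, and speech synthesis engines are actually used.

```python
def translate_speech(original_audio, voiceprint, recognize, translate, synthesize):
    """Three-stage translation pipeline: speech recognition, then text
    translation, then speech synthesis with the selected voiceprint.
    `recognize`, `translate`, and `synthesize` are hypothetical engine
    callables (local functions or remote-server wrappers)."""
    first_string = recognize(original_audio)      # source-language text
    second_string = translate(first_string)       # target-language text
    return synthesize(second_string, voiceprint)  # target-language audio
```

Because the voiceprint is passed only to the synthesis stage, gender adaptation is independent of which recognition or translation backend is used.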
  • Finally, the voice information is output, for example by driving a speaker. Since the voice gender of the output voice information is consistent with that of the original voice information, the user experience feels more realistic.
  • The speech translation method of the embodiment of the present invention identifies the speech gender of the original speech information, selects the corresponding speech synthesis voiceprint according to that gender, and finally performs translation processing on the original speech information according to the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, realizing adaptation to the voice gender.
  • When the speaker is male, the translated voice is a male voice; when the speaker is female, the translated voice is a female voice. This coordinates the original voice with the translated voice, greatly enhancing the realism of communication and improving the user experience.
  • The apparatus includes a gender identification module 10, a voiceprint selection module 20, and a translation processing module 30. The gender identification module 10 is used to identify the voice gender of the original voice information; the voiceprint selection module 20 is used to select a corresponding speech synthesis voiceprint according to that voice gender; and the translation processing module 30 is configured to perform translation processing on the original voice information according to the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
  • The original voice information is the voice information to be translated according to the embodiment of the present invention.
  • The original voice information may be voice information collected on the spot, voice information stored locally, or voice information obtained from other devices.
  • the translation device can collect voice information sent by the user through a microphone, and the voice information is the original voice information.
  • the server receives the voice information sent by the translation device, and the voice information is the original voice information.
  • the gender recognition module 10 may use the pitch frequency as the recognition basis, and identify the voice gender of the original voice information by using a gender recognition algorithm such as VQ, HMM, SVM, etc. .
  • The gender identification module 10 includes an obtaining unit 11, a comparing unit 12, a first identifying unit 13, and a second identifying unit 14. The obtaining unit 11 is configured to acquire the pitch frequency of the original voice information; the comparing unit 12 is configured to compare the pitch frequency with a threshold value; the first identifying unit 13 is configured to determine that the voice gender of the original voice information is a male voice when the pitch frequency is less than or equal to the threshold; and the second identifying unit 14 is configured to determine that the voice gender of the original voice information is a female voice when the pitch frequency is greater than the threshold.
  • The obtaining unit 11 includes a sampling subunit 111, an extraction subunit 112, and a statistical subunit 113. The sampling subunit 111 is configured to continuously sample the original voice information at a preset sampling frequency to collect M (M>2) voice frames;
  • an extraction subunit 112 configured to perform pitch frequency feature extraction on the collected speech frames
  • a statistical sub-unit 113 configured to calculate a frequency of the pitch of the original speech information according to the extracted pitch frequency feature.
  • the sampling frequency can be selected to be 8 kHz, and of course other frequencies can be selected.
  • the length of each speech frame is preferably 20-30 ms.
  • the statistical sub-unit 113 may average the pitch frequency of the acquired speech frame as the frequency of the pitch of the original speech information.
  • the pitch frequency of the male voice is smaller than the pitch frequency of the female voice.
  • The pitch frequency distribution of a male voice generally ranges from 0-200 Hz, and that of a female voice from 200-500 Hz, so the threshold value can be set to 180-220 Hz, for example 200 Hz.
  • the voice gender of the voice information includes male voice and female voice.
  • When the pitch frequency is less than or equal to the threshold, the first identifying unit 13 recognizes that the voice gender of the original voice information is a male voice.
  • When the pitch frequency is greater than the threshold, the second identifying unit 14 recognizes that the voice gender of the original voice information is a female voice.
  • Each time a piece of voice information is detected, the gender identification module identifies its voice gender once, so that each piece of voice information is matched with the corresponding speech synthesis voiceprint; after translation processing, the voice gender of each translated piece is consistent with that of the corresponding original piece.
  • The gender identification module may determine the start and end of a piece of voice information from the time interval between two utterances; for example, when no voice information is detected within a preset length of time, the current utterance is determined to have ended, and when voice information is detected again, the next utterance is determined to have started.
  • voice activity detection (VAD) technology can be used to detect whether voice information is included in the sound signal.
  • The gender identification module may also detect the start and end of a piece of voice information by detecting whether a specific button is triggered; for example, when the button is triggered for the first time, a piece of voice information begins, and when it is triggered again, the piece of voice information ends.
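The silence-gap strategy above reduces to splitting a stream of audio frames into utterances whenever enough consecutive non-speech frames are seen. A minimal sketch follows; the per-frame VAD decision function `is_speech` and the gap length are hypothetical placeholders for whatever voice activity detector the device uses.

```python
def segment_utterances(frames, is_speech, max_silence_frames=20):
    """Split a frame stream into utterances: an utterance ends once
    `max_silence_frames` consecutive non-speech frames are observed.
    `is_speech` is a hypothetical per-frame VAD decision callable."""
    utterances, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= max_silence_frames:
                utterances.append(current)   # gap long enough: close utterance
                current, silence = [], 0
    if current:                              # flush a trailing utterance
        utterances.append(current)
    return utterances
```

Each returned utterance can then be passed through gender identification and translation independently, which is what lets the device track gender per sentence rather than per session.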
  • The voiceprint selection module 20 includes a first selection unit 21 and a second selection unit 22. The first selection unit 21 is configured to select the male voiceprint when the voice gender of the original voice information is a male voice;
  • the second selection unit 22 is configured to select the female voiceprint when the voice gender of the original voice information is a female voice.
  • There may be at least two male voiceprints and at least two female voiceprints, each with a different pitch frequency;
  • the voiceprint selection module 20 can then select the corresponding male or female voiceprint according to the pitch frequency of the original voice information.
  • In this way, the translated voice information matches the voiceprint of the original voice information more closely, further enhancing the realism.
  • The translation processing module 30 performs translation processing on the original voice information according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, enhancing the realism and improving the user experience.
  • the translation processing of voice information mainly includes three processes of voice recognition, text translation, and speech synthesis.
  • The translation processing module 30 includes a first processing unit 31, a second processing unit 32, and a third processing unit 33. The first processing unit 31 is configured to perform speech recognition processing on the original voice information to obtain a first character string in the original language; the second processing unit 32 is configured to perform text translation processing on the first character string to obtain a second character string in the target language; and the third processing unit 33 is configured to perform speech synthesis processing on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
  • the translation processing module 30 can perform translation processing locally on the translation device, that is, perform three processes of speech recognition, text translation, and speech synthesis on the original speech information to obtain a code stream of the speech information of the target language.
  • The translation processing module 30 can also delegate translation processing to servers. For example, the first processing unit 31 first sends the original voice information to a speech recognition server, which performs speech recognition, recognizes the first character string, and returns it to the translation device. The second processing unit 32 receives the first character string and sends it to a text translation server, which translates it into a second character string in the target language and returns it to the translation device. The third processing unit 33 receives the second character string and sends it, together with the selected speech synthesis voiceprint, to a speech synthesis server, which performs speech synthesis processing on the second character string using the selected voiceprint to obtain the target-language voice information.
  • the voice information of the target language is returned to the translation device in the form of a code stream, and the third processing unit 33 receives the code stream of the voice information of the target language to obtain the translated voice information.
  • the translation processing module 30 may also send the original voice information and the selected voice synthesized voiceprint to a server, and the server directly performs voice recognition and text translation processing on the original voice information.
  • the speech synthesis is performed by using the selected speech synthesis voiceprint to obtain the code stream of the speech information of the target language.
  • When applied to a server, the translation processing module 30 sequentially performs the three processes of speech recognition, text translation, and speech synthesis on the original voice information through the first processing unit 31, the second processing unit 32, and the third processing unit 33 to obtain the voice information of the target language.
  • the voice information of the target language is sent to the translation device in the form of a code stream.
  • Finally, the voice information is output, for example by driving a speaker. Since the voice gender of the output voice information is consistent with that of the original voice information, the user experience feels more realistic.
  • The speech translation apparatus of the embodiment of the present invention identifies the speech gender of the original speech information, selects the corresponding speech synthesis voiceprint according to that gender, and finally performs translation processing on the original speech information according to the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, realizing adaptation to the voice gender.
  • When the speaker is male, the translated voice is a male voice; when the speaker is female, the translated voice is a female voice. This coordinates the original voice with the translated voice, greatly enhancing the realism of communication and improving the user experience.
  • The speech translation method and apparatus are particularly suitable for a translation machine, exploiting the half-duplex interaction pattern of such a device: each time the user speaks a sentence, the user's gender is identified from the voice information, and the translated voice information is synthesized to match that gender, thereby enhancing the realism of communication and improving the user experience.
  • the present invention also proposes a translation device comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform a speech translation method.
  • the speech translation method comprises the steps of: identifying the voice gender of the original voice information; selecting a corresponding speech synthesis voiceprint according to that voice gender; and translating the original voice information according to the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
  • the speech translation method described in this embodiment is the speech translation method of the above embodiments of the present invention, and its details are not repeated here.
  • the present invention includes apparatus for performing one or more of the operations described herein.
  • These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in a general-purpose computer.
  • These devices store computer programs that are selectively activated or reconfigured.
  • Such computer programs may be stored in a device-readable (e.g., computer-readable) medium, or in any medium suitable for storing electronic instructions and coupled to a bus.
  • the computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards.
  • a readable medium includes any medium by which a device (e.g., a computer) stores or transmits information in a readable form.
  • each block of the structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks therein, can be implemented by computer program instructions.
  • Those skilled in the art will appreciate that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing method for execution.
  • In this way, the schemes specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by the processor of the computer or other programmable data processing method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A speech translation method, apparatus and translation device. The method comprises the following steps: identifying the voice gender of original voice information (S11); selecting a corresponding speech synthesis voiceprint according to the voice gender of the original voice information (S12); and translating the original voice information according to the selected speech synthesis voiceprint (S13), so that the voice gender of the translated voice information is consistent with that of the original voice information, achieving adaptation to the speaker's voice gender.

Description

Speech Translation Method, Apparatus and Translation Device

Technical Field
[0001] The present invention relates to the field of electronic technology, and in particular to a speech translation method and apparatus.
Background Art
[0002] At present, when two users who speak different languages communicate, a translation device can translate for them, enabling barrier-free communication. A typical implementation is as follows: the user presses a specific button on the translation device once when starting to speak, the device collects and translates the voice information, and after the user finishes a passage and presses the button again, the device outputs the translated voice information.
[0003] The voice gender of the translated voice output by the translation device is preset: the user may set it to a male or a female voice, but once it is set, the translated voice has the same voice gender regardless of whether the speaker is male or female. For example, when a male voice is set and the speaker is a woman, the translated voice is still male; when a female voice is set and the speaker is a man, the translated voice is still female.
[0004] Thus, in the prior art, the voice gender of the translated voice information may be inconsistent with that of the original voice information, making the original voice and the translated voice uncoordinated. The result sounds strange to the user, greatly reduces the realism of the communication, and gives a poor user experience.
Technical Problem
[0005] The main object of the present invention is to provide a speech translation method and apparatus that solve the technical problem of the voice gender of translated voice information being inconsistent with that of the original voice information, thereby enhancing the realism of communication and improving the user experience.
Solution to the Problem
Technical Solution
[0006] To achieve the above object, an embodiment of the present invention proposes a speech translation method comprising the following steps:
[0007] identifying the voice gender of original voice information;
[0008] selecting a corresponding speech synthesis voiceprint according to the voice gender; [0009] translating the original voice information according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
[0010] Optionally, the step of identifying the voice gender of the original voice information comprises:
[0011] obtaining the pitch frequency of the original voice information;
[0012] comparing the pitch frequency with a threshold value;
[0013] when the pitch frequency is less than or equal to the threshold value, identifying the voice gender of the original voice information as a male voice;
[0014] when the pitch frequency is greater than the threshold value, identifying the voice gender of the original voice information as a female voice.
[0015] Optionally, the step of obtaining the pitch frequency of the original voice information comprises:
[0016] continuously sampling M frames of the original voice information at a preset sampling frequency, M ≥ 2;
[0017] extracting pitch-frequency features from the collected voice frames;
[0018] computing the pitch frequency of the original voice information from the extracted pitch-frequency features.
[0019] Optionally, 25 ≤ M ≤ 35.
[0020] Optionally, the duration of each voice frame is 20-30 ms.
[0021] Optionally, the sampling frequency is 8 kHz.
[0022] Optionally, the threshold value is 180-220 Hz.
[0023] Optionally, the step of translating the original voice information according to the selected speech synthesis voiceprint comprises:
[0024] performing speech recognition on the original voice information to obtain a first character string in the original language;
[0025] performing text translation on the first character string to obtain a second character string in the target language;
[0026] performing speech synthesis on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
[0027] Optionally, the step of identifying the voice gender of the original voice information comprises: identifying the voice gender of a segment of voice information each time the start of that segment is detected.
[0028] Optionally, the speech synthesis voiceprints include a male voiceprint and a female voiceprint, and the step of selecting the corresponding speech synthesis voiceprint according to the voice gender comprises:
[0029] when the voice gender is a male voice, selecting the male voiceprint; [0030] when the voice gender is a female voice, selecting the female voiceprint.
[0031] An embodiment of the present invention also proposes a speech translation apparatus, the apparatus comprising:
[0032] a gender recognition module for identifying the voice gender of original voice information;
[0033] a voiceprint selection module for selecting a corresponding speech synthesis voiceprint according to the voice gender;
[0034] a translation processing module for translating the original voice information according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
[0035] Optionally, the gender recognition module comprises:
[0036] an obtaining unit for obtaining the pitch frequency of the original voice information;
[0037] a comparison unit for comparing the pitch frequency with a threshold value;
[0038] a first recognition unit for determining the voice gender of the original voice information to be a male voice when the pitch frequency is less than or equal to the threshold value;
[0039] a second recognition unit for determining the voice gender of the original voice information to be a female voice when the pitch frequency is greater than the threshold value.
[0040] Optionally, the obtaining unit comprises:
[0041] a sampling subunit for continuously sampling M frames of the original voice information at a preset sampling frequency, M ≥ 2;
[0042] an extraction subunit for extracting pitch-frequency features from the collected voice frames;
[0043] a statistics subunit for computing the pitch frequency of the original voice information from the extracted pitch-frequency features.
[0044] Optionally, the translation processing module comprises:
[0045] a first processing unit for performing speech recognition on the original voice information to obtain a first character string in the original language;
[0046] a second processing unit for performing text translation on the first character string to obtain a second character string in the target language;
[0047] a third processing unit for performing speech synthesis on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
[0048] Optionally, the gender recognition module is configured to identify the voice gender of a segment of voice information each time the start of that segment is detected.
[0049] Optionally, the speech synthesis voiceprints include a male voiceprint and a female voiceprint, and the voiceprint selection module comprises:
[0050] a first selection unit for selecting the male voiceprint when the voice gender is a male voice;
[0051] a second selection unit for selecting the female voiceprint when the voice gender is a female voice.
[0052] An embodiment of the present invention also proposes a translation device comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the aforementioned speech translation method.
Advantageous Effects of the Invention
Advantageous Effects
[0053] In the speech translation method provided by the embodiments of the present invention, the voice gender of the original voice information is identified, a corresponding speech synthesis voiceprint is selected according to that gender, and the original voice information is then translated using the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, achieving adaptation to the speaker's voice gender. When a man speaks the translated voice is a male voice, and when a woman speaks the translated voice is a female voice, so the original voice and the translated voice are coordinated, which greatly enhances the realism of communication and improves the user experience.
Brief Description of the Drawings
Description of the Drawings
[0054] Fig. 1 is a flowchart of an embodiment of the speech translation method of the present invention;
[0055] Fig. 2 is a detailed flowchart of step S11 in Fig. 1;
[0056] Fig. 3 is a module diagram of an embodiment of the speech translation apparatus of the present invention;
[0057] Fig. 4 is a module diagram of the gender recognition module in Fig. 3;
[0058] Fig. 5 is a module diagram of the obtaining unit in Fig. 4;
[0059] Fig. 6 is a module diagram of the voiceprint selection module in Fig. 3;
[0060] Fig. 7 is a module diagram of the translation processing module in Fig. 3.
[0061] The realization of the object, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
Best Embodiments of the Present Invention
[0062] It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.
[0063] Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, where identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and cannot be construed as limiting it.
[0064] Those skilled in the art will understand that, unless specifically stated, the singular forms "a", "an", "the" and "said" used here may also include the plural. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. Furthermore, "connected" or "coupled" as used here may include a wireless connection or wireless coupling. The wording "and/or" used here includes all or any unit and all combinations of one or more of the associated listed items.
[0065] Those skilled in the art will understand that, unless otherwise defined, all terms used here (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meaning in the context of the prior art and, unless specifically defined as here, will not be interpreted with idealized or overly formal meanings.
[0066] Those skilled in the art will understand that the "terminal" and "terminal device" used here include both devices having only a wireless signal receiver without transmitting capability, and devices with receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices having and/or including a radio-frequency receiver. The "terminal" or "terminal device" used here may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or in distributed form at any other location on the earth and/or in space. The "terminal" or "terminal device" used here may also be a communication terminal, an Internet terminal or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with music/video playback functions, or a device such as a smart TV or a set-top box.
[0067] Those skilled in the art will understand that the server used here includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers. Here, the cloud is composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computer sets. In the embodiments of the present invention, communication between the server, the terminal device and the WNS server may be achieved by any communication method, including but not limited to mobile communication based on 3GPP, LTE or WIMAX, computer network communication based on the TCP/IP or UDP protocol, and short-range wireless transmission based on Bluetooth or infrared transmission standards.
[0068] The speech translation method and apparatus of the embodiments of the present invention may be applied to a translation device or to a server. The translation device may be a dedicated translation machine, a mobile terminal such as a mobile phone or tablet, or a computer terminal such as a personal computer or laptop. Referring to Fig. 1, an embodiment of the speech translation method of the present invention is proposed, the method comprising the following steps:
[0069] S11. Identify the voice gender of the original voice information.
[0070] The original voice information in the embodiments of the present invention is the voice information to be translated. It may be voice information collected on the spot, voice information stored locally, or voice information obtained from another device.
[0071] Taking application to a translation device as an example, the translation device may collect the voice information uttered by the user through a microphone; this voice information is the original voice information.
[0072] Taking application to a server as an example, the server receives the voice information sent by the translation device; this voice information is the original voice information.
[0073] When identifying the voice gender of voice information, the pitch frequency can be used as the basis for recognition, and the voice gender of the original voice information can be identified by a gender recognition algorithm such as VQ (Vector Quantization), HMM (Hidden Markov Model) or SVM (Support Vector Machines).
[0074] As shown in Fig. 2, the voice gender of the original voice information can be identified in the following way, specifically comprising the following steps:
[0075] S111. Obtain the pitch frequency of the original voice information.
[0076] Specifically, first, M (M ≥ 2) frames of the original voice information are continuously sampled at a preset sampling frequency; then, pitch-frequency features are extracted from the collected voice frames; finally, the pitch frequency of the original voice information is computed from the extracted pitch-frequency features.
[0077] The sampling frequency may be 8 kHz, though other frequencies may of course be chosen. The value of M is preferably 25 ≤ M ≤ 35, for example M = 30, i.e., 30 consecutive voice frames are sampled. The duration of each voice frame is preferably 20-30 ms. When computing the pitch frequency, the pitch frequencies of the collected voice frames may be averaged, and the average taken as the pitch frequency of the original voice information.
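The frame-and-average scheme of paragraph [0077] can be sketched as follows. The patent does not mandate a specific pitch-extraction algorithm, so this sketch uses simple autocorrelation as one common stand-in; the function names and the synthetic test signal are illustrative, not from the patent.

```python
import numpy as np

def frame_pitch(frame, fs=8000, fmin=60.0, fmax=500.0):
    """Estimate the pitch of one voiced frame by autocorrelation
    (an assumed extraction method; any pitch tracker would do here)."""
    frame = frame - np.mean(frame)
    # Autocorrelation at non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # restrict lags to 60-500 Hz
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

def average_pitch(signal, fs=8000, n_frames=30, frame_ms=25):
    """Sample M consecutive frames (M = 30, 20-30 ms each, per [0077])
    and average the per-frame pitch estimates."""
    flen = int(fs * frame_ms / 1000)
    pitches = [frame_pitch(signal[i * flen:(i + 1) * flen], fs)
               for i in range(n_frames)]
    return sum(pitches) / len(pitches)

# A synthetic 150 Hz tone stands in for male-range speech.
fs = 8000
t = np.arange(fs) / fs
male_like = np.sin(2 * np.pi * 150 * t)
print(average_pitch(male_like, fs))   # close to 150 Hz
```

In a real system the frames would come from the microphone buffer, and unvoiced frames would be skipped before averaging.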
[0078] S112. Compare the pitch frequency with the threshold value to determine whether it is less than or equal to the threshold. If the pitch frequency is less than or equal to the threshold, go to step S113; if it is greater than the threshold, go to step S114.
[0079] The pitch frequency of a male voice is lower than that of a female voice: male pitch generally falls in the range 0-200 Hz and female pitch in the range 200-500 Hz, so the threshold may be set to 180-220 Hz, for example 200 Hz.
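The threshold decision of steps S112-S114 reduces to a single comparison; a minimal sketch (constant and function names are illustrative):

```python
MALE, FEMALE = "male", "female"

def classify_gender(pitch_hz, threshold_hz=200.0):
    """Pitch at or below the threshold is classified as a male voice,
    above it as a female voice. 200 Hz sits between the typical male
    (0-200 Hz) and female (200-500 Hz) pitch ranges."""
    return MALE if pitch_hz <= threshold_hz else FEMALE

print(classify_gender(120))   # typical male pitch -> male
print(classify_gender(240))   # typical female pitch -> female
```

Per the boundary rule in step S113, a pitch of exactly 200 Hz is classified as male.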
[0080] S113. Identify the voice gender of the original voice information as a male voice.
[0081] S114. Identify the voice gender of the original voice information as a female voice.
[0082] The voice gender of voice information in the embodiments of the present invention includes male and female. When the pitch frequency is less than or equal to the threshold, the voice gender of the original voice information is identified as male; when the pitch frequency is greater than the threshold, it is identified as female.
[0083] In the embodiments of the present invention, each time the start of a segment of voice information is detected, the voice gender of that segment is identified once, so that each segment is matched with its corresponding speech synthesis voiceprint, making the voice gender of each translated segment consistent with that of the corresponding original segment.
[0084] The start and end of a segment of voice information can be determined from the time interval between two segments of speech. For example: when no voice information is detected within a preset duration, the current segment is deemed to have ended; when voice information is detected again, the next segment is deemed to have started. When detecting voice information, Voice Activity Detection (VAD) technology can be used to detect whether a sound signal contains speech.
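The gap-based segmentation described in paragraph [0084] can be sketched as follows; the `is_speech` callable stands in for a real VAD, and the names and gap length are illustrative assumptions:

```python
def segment_utterances(frames, is_speech, max_gap=8):
    """Group frame indices into utterances: a run of non-speech frames
    longer than `max_gap` ends the current utterance; the next speech
    frame starts a new one, which would trigger a fresh gender check."""
    utterances, current, gap = [], [], 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            current.append(i)
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:          # silence exceeded preset duration
                utterances.append(current)
                current, gap = [], 0
    if current:                        # flush a trailing utterance
        utterances.append(current)
    return utterances

# 1 = speech frame, 0 = silence; the long gap splits two utterances.
frames = [1] * 5 + [0] * 12 + [1] * 4
print(segment_utterances(frames, lambda f: f == 1))
```

With real audio, `is_speech` would be an energy- or model-based VAD decision per 10-30 ms frame rather than a toy predicate.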
[0085] When applied to a translation device, the start and end of a segment of voice information can also be detected by checking whether a specific button is triggered. For example: when the button is triggered for the first time, a segment of voice information starts; when it is triggered again, the segment ends.
[0086] S12. Select the corresponding speech synthesis voiceprint according to the voice gender of the original voice information.
[0087] In the embodiments of the present invention, two kinds of speech synthesis voiceprints are preset: a male voiceprint and a female voiceprint. When the voice gender of the original voice information is identified as male, the male voiceprint is selected; when it is identified as female, the female voiceprint is selected.
[0088] Further, there may be at least two male voiceprints and at least two female voiceprints, each with a different pitch frequency, and the corresponding male or female voiceprint may be selected according to the pitch frequency of the original voice information. This makes the voiceprint of the translated voice information match that of the original voice information even more closely, further enhancing the realism of the communication.
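The refinement in paragraph [0088] amounts to a nearest-pitch lookup within the detected gender. A minimal sketch, with a hypothetical voiceprint inventory (names and nominal pitches are illustrative, not from the patent):

```python
# Hypothetical inventory: several synthesis voices per gender,
# each tagged with a nominal pitch in Hz.
VOICEPRINTS = {
    "male":   [("male_low", 110.0), ("male_high", 170.0)],
    "female": [("female_low", 220.0), ("female_high", 300.0)],
}

def select_voiceprint(gender, pitch_hz):
    """Pick the synthesis voiceprint of the detected gender whose
    nominal pitch is closest to the speaker's measured pitch."""
    name, _ = min(VOICEPRINTS[gender], key=lambda v: abs(v[1] - pitch_hz))
    return name

print(select_voiceprint("male", 120))    # nearest to male_low
print(select_voiceprint("female", 280))  # nearest to female_high
```

With only one voiceprint per gender this degenerates to the basic selection of paragraph [0087].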
[0089] S13. Translate the original voice information according to the selected speech synthesis voiceprint.
[0090] In this step S13, the original voice information is translated according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, enhancing the realism of communication and improving the user experience.
[0091] The translation of voice information mainly comprises three procedures: speech recognition, text translation and speech synthesis. Specifically: first, speech recognition is performed on the original voice information to obtain a first character string in the original language; then, text translation is performed on the first character string to obtain a second character string in the target language; finally, speech synthesis is performed on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
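The three-stage pipeline of paragraph [0091] can be sketched as a simple composition; the `asr`, `translate` and `tts` callables are placeholders for real engines or server calls, not APIs named by the patent:

```python
def translate_speech(audio, src_lang, dst_lang, voiceprint,
                     asr, translate, tts):
    """Speech recognition, then text translation, then synthesis with
    the selected voiceprint - the order fixed by paragraph [0091]."""
    text_src = asr(audio, src_lang)                      # 1st string
    text_dst = translate(text_src, src_lang, dst_lang)   # 2nd string
    return tts(text_dst, voiceprint)                     # target speech

# Toy stand-ins show the data flow; real engines would replace them.
out = translate_speech(
    b"...pcm...", "zh", "en", "male_low",
    asr=lambda audio, lang: "你好",
    translate=lambda text, s, d: "hello",
    tts=lambda text, vp: f"<{vp}:{text}>")
print(out)   # <male_low:hello>
```

Swapping the three callables for remote services reproduces the server-based variant described below, with a code stream returned by the synthesis stage.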
[0092] Taking application to a translation device as an example, the translation device may perform the translation locally, i.e., perform the three procedures of speech recognition, text translation and speech synthesis on the original voice information in sequence to obtain a code stream of voice information in the target language.
[0093] The translation device may also perform the translation through servers. For example: the translation device first sends the original voice information to a speech recognition server, which performs speech recognition on it, recognizes the first character string and returns it to the translation device; the translation device receives the first character string and sends it to a text translation server, which translates it into a second character string in the target language and returns it to the translation device; the translation device receives the second character string and sends it, together with the selected speech synthesis voiceprint, to a speech synthesis server, which performs speech synthesis on the second character string using the selected voiceprint, obtains voice information in the target language, and returns it to the translation device in the form of a code stream; the translation device receives the code stream and thereby obtains the translated voice information.
[0094] Of course, in other embodiments the translation device may also send the original voice information and the selected speech synthesis voiceprint to a single server, which directly performs speech recognition and text translation on the original voice information, performs speech synthesis with the selected voiceprint, and obtains a code stream of voice information in the target language.
[0096] After the translation device obtains the translated voice information, it outputs it, for example by driving a speaker. Since the voice gender of the output voice information is consistent with that of the original voice information, the result sounds more realistic to the user and improves the user experience.
[0097] In the speech translation method of the embodiments of the present invention, the voice gender of the original voice information is identified, a corresponding speech synthesis voiceprint is selected according to that gender, and the original voice information is then translated using the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, achieving adaptation to the speaker's voice gender. When a man speaks the translated voice is a male voice, and when a woman speaks the translated voice is a female voice, so the original voice and the translated voice are coordinated, which greatly enhances the realism of communication and improves the user experience.
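Putting the claimed steps together, the end-to-end flow is: measure pitch, decide gender, select a matching voiceprint, translate with it. A minimal sketch in which every callable is an illustrative stand-in, not an API from the patent:

```python
def speech_translate(audio, pitch_of, classify, pick_voiceprint, translate):
    """End-to-end flow of the claimed method: the gender decided from
    the measured pitch drives the voiceprint used for synthesis."""
    gender = classify(pitch_of(audio))
    return translate(audio, pick_voiceprint(gender))

result = speech_translate(
    audio=b"...",
    pitch_of=lambda a: 150.0,                      # assume male-range pitch
    classify=lambda f: "male" if f <= 200 else "female",
    pick_voiceprint=lambda g: {"male": "m1", "female": "f1"}[g],
    translate=lambda a, vp: ("hello", vp))
print(result)   # ('hello', 'm1')
```

Running this once per detected utterance, as paragraph [0083] requires, keeps each translated segment's voice gender matched to its speaker.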
[0098] Referring to Fig. 3, an embodiment of the speech translation apparatus of the present invention is proposed. The apparatus comprises a gender recognition module 10, a voiceprint selection module 20 and a translation processing module 30, where: the gender recognition module 10 identifies the voice gender of the original voice information; the voiceprint selection module 20 selects the corresponding speech synthesis voiceprint according to that gender; and the translation processing module 30 translates the original voice information according to the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
[0099] The original voice information in the embodiments of the present invention is the voice information to be translated. It may be voice information collected on the spot, voice information stored locally, or voice information obtained from another device.
[0100] Taking application to a translation device as an example, the translation device may collect the voice information uttered by the user through a microphone; this voice information is the original voice information.
[0101] Taking application to a server as an example, the server receives the voice information sent by the translation device; this voice information is the original voice information.
[0102] When identifying the voice gender of voice information, the gender recognition module 10 may use the pitch frequency as the basis for recognition and identify the voice gender of the original voice information by a gender recognition algorithm such as VQ, HMM or SVM.
[0103] Optionally, as shown in Fig. 4, the gender recognition module 10 comprises an obtaining unit 11, a comparison unit 12, a first recognition unit 13 and a second recognition unit 14, where: the obtaining unit 11 obtains the pitch frequency of the original voice information; the comparison unit 12 compares the pitch frequency with the threshold value; the first recognition unit 13 determines the voice gender of the original voice information to be male when the pitch frequency is less than or equal to the threshold; and the second recognition unit 14 determines it to be female when the pitch frequency is greater than the threshold.
[0104] As shown in Fig. 5, the obtaining unit 11 comprises a sampling subunit 111, an extraction subunit 112 and a statistics subunit 113, where: the sampling subunit 111 continuously samples M (M ≥ 2) frames of the original voice information at a preset sampling frequency; the extraction subunit 112 extracts pitch-frequency features from the collected voice frames; and the statistics subunit 113 computes the pitch frequency of the original voice information from the extracted features.
[0105] The sampling frequency may be 8 kHz, though other frequencies may of course be chosen. The value of M is preferably 25 ≤ M ≤ 35, for example M = 30, i.e., 30 consecutive voice frames are sampled. The duration of each voice frame is preferably 20-30 ms. When computing the pitch frequency, the statistics subunit 113 may average the pitch frequencies of the collected voice frames and take the average as the pitch frequency of the original voice information.
[0106] The pitch frequency of a male voice is lower than that of a female voice: male pitch generally falls in the range 0-200 Hz and female pitch in the range 200-500 Hz, so the threshold may be set to 180-220 Hz, for example 200 Hz.
[0107] The voice gender of voice information in the embodiments of the present invention includes male and female. When the pitch frequency is less than or equal to the threshold, the first recognition unit 13 identifies the voice gender of the original voice information as male; when it is greater than the threshold, the second recognition unit 14 identifies it as female.
[0108] In the embodiments of the present invention, each time the start of a segment of voice information is detected, the gender recognition module identifies the voice gender of that segment once, so that each segment is matched with its corresponding speech synthesis voiceprint, making the voice gender of each translated segment consistent with that of the corresponding original segment.
[0109] When detecting the start and end of a segment of voice information, the gender recognition module can determine them from the time interval between two segments of speech. For example: when no voice information is detected within a preset duration, the current segment is deemed to have ended; when voice information is detected again, the next segment is deemed to have started. When detecting voice information, Voice Activity Detection (VAD) technology can be used to detect whether a sound signal contains speech.
[0110] When applied to a translation device, the gender recognition module can also detect the start and end of a segment of voice information by checking whether a specific button is triggered. For example: when the button is triggered for the first time, a segment of voice information starts; when it is triggered again, the segment ends.
[0111] In the embodiments of the present invention, two kinds of speech synthesis voiceprints are preset: a male voiceprint and a female voiceprint. As shown in Fig. 6, the voiceprint selection module 20 comprises a first selection unit 21 and a second selection unit 22, where: the first selection unit 21 selects the male voiceprint when the voice gender of the original voice information is male; and the second selection unit 22 selects the female voiceprint when it is female.
[0112] Further, there may be at least two male voiceprints and at least two female voiceprints, each with a different pitch frequency, and the voiceprint selection module 20 may select the corresponding male or female voiceprint according to the pitch frequency of the original voice information. This makes the voiceprint of the translated voice information match that of the original voice information even more closely, further enhancing the realism.
[0113] The translation processing module 30 translates the original voice information according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, enhancing the realism and improving the user experience.
[0114] The translation of voice information mainly comprises three procedures: speech recognition, text translation and speech synthesis. As shown in Fig. 7, the translation processing module 30 comprises a first processing unit 31, a second processing unit 32 and a third processing unit 33: the first processing unit 31 performs speech recognition on the original voice information to obtain a first character string in the original language; the second processing unit 32 performs text translation on the first character string to obtain a second character string in the target language; and the third processing unit 33 performs speech synthesis on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
[0115] Taking application to a translation device as an example, the translation processing module 30 may perform the translation locally on the translation device, i.e., perform the three procedures of speech recognition, text translation and speech synthesis on the original voice information in sequence to obtain a code stream of voice information in the target language.
[0116] The translation processing module 30 may also perform the translation through servers. For example: the first processing unit 31 first sends the original voice information to a speech recognition server, which performs speech recognition on it, recognizes the first character string and returns it to the translation device; the second processing unit 32 receives the first character string and sends it to a text translation server, which translates it into a second character string in the target language and returns it to the translation device; the third processing unit 33 receives the second character string and sends it, together with the selected speech synthesis voiceprint, to a speech synthesis server, which performs speech synthesis on the second character string using the selected voiceprint, obtains voice information in the target language, and returns it to the translation device in the form of a code stream; the third processing unit 33 receives the code stream and thereby obtains the translated voice information.
[0117] Of course, in other embodiments the translation processing module 30 may also send the original voice information and the selected speech synthesis voiceprint to a single server, which directly performs speech recognition and text translation on the original voice information, performs speech synthesis with the selected voiceprint, and obtains a code stream of voice information in the target language.
[0118] Taking application to a server as an example, the translation processing module 30 performs the three procedures of speech recognition, text translation and speech synthesis on the original voice information in sequence through the first processing unit 31, the second processing unit 32 and the third processing unit 33, obtains voice information in the target language, and sends it to the translation device in the form of a code stream.
[0119] After the translation device obtains the translated voice information, it outputs it, for example by driving a speaker. Since the voice gender of the output voice information is consistent with that of the original voice information, the result sounds more realistic to the user and improves the user experience.
[0120] In the speech translation apparatus of the embodiments of the present invention, the voice gender of the original voice information is identified, a corresponding speech synthesis voiceprint is selected according to that gender, and the original voice information is then translated using the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information, achieving adaptation to the speaker's voice gender. When a man speaks the translated voice is a male voice, and when a woman speaks the translated voice is a female voice, so the original voice and the translated voice are coordinated, which greatly enhances the realism of communication and improves the user experience.
[0121] The speech translation method and apparatus of the embodiments of the present invention are particularly suitable for a translation machine: exploiting the half-duplex, turn-by-turn data transmission of such a device, each time the user speaks a sentence the user's gender is identified from the voice information, and voice information consistent with that gender is produced by the translation, enhancing the realism of the communication and improving the user experience.
[0122] The present invention also proposes a translation device comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform a speech translation method. The speech translation method comprises the following steps: identifying the voice gender of original voice information; selecting a corresponding speech synthesis voiceprint according to that voice gender; and translating the original voice information according to the selected voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information. The speech translation method described in this embodiment is the speech translation method of the above embodiments of the present invention and is not described again here.
[0123] Those skilled in the art will understand that the present invention includes devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in a general-purpose computer. These devices store computer programs that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (e.g., computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus; the computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards. That is, a readable medium includes any medium by which a device (e.g., a computer) stores or transmits information in a readable form.
[0124] Those skilled in the art will understand that each block of these structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks therein, can be implemented by computer program instructions. Those skilled in the art will understand that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing method for implementation, so that the schemes specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by the processor of the computer or other programmable data processing method.
[0125] Those skilled in the art will understand that the steps, measures and schemes in the various operations, methods and flows already discussed in the present invention can be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows already discussed in the present invention can also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art corresponding to the various operations, methods and flows disclosed in the present invention can also be alternated, changed, rearranged, decomposed, combined or deleted.
[0126] The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims

Claims
[Claim 1] A speech translation method, characterized by comprising the following steps:
identifying the voice gender of original voice information;
selecting a corresponding speech synthesis voiceprint according to the voice gender;
translating the original voice information according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
[Claim 2] The speech translation method according to claim 1, characterized in that the step of identifying the voice gender of the original voice information comprises:
obtaining the pitch frequency of the original voice information;
comparing the pitch frequency with a threshold value;
when the pitch frequency is less than or equal to the threshold value, identifying the voice gender of the original voice information as a male voice;
when the pitch frequency is greater than the threshold value, identifying the voice gender of the original voice information as a female voice.
[Claim 3] The speech translation method according to claim 2, characterized in that the step of obtaining the pitch frequency of the original voice information comprises:
continuously sampling M frames of the original voice information at a preset sampling frequency, M ≥ 2;
extracting pitch-frequency features from the collected voice frames;
computing the pitch frequency of the original voice information from the extracted pitch-frequency features.
[Claim 4] The speech translation method according to claim 3, characterized in that 25 ≤ M ≤ 35.
[Claim 5] The speech translation method according to claim 3, characterized in that the duration of each voice frame is 20-30 ms.
[Claim 6] The speech translation method according to claim 3, characterized in that the sampling frequency is 8 kHz.
[Claim 7] The speech translation method according to claim 2, characterized in that the threshold value is 180-220 Hz.
[Claim 8] The speech translation method according to any one of claims 1-7, characterized in that the step of translating the original voice information according to the selected speech synthesis voiceprint comprises:
performing speech recognition on the original voice information to obtain a first character string in the original language;
performing text translation on the first character string to obtain a second character string in the target language;
performing speech synthesis on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
[Claim 9] The speech translation method according to any one of claims 1-7, characterized in that the step of identifying the voice gender of the original voice information comprises:
identifying the voice gender of a segment of voice information each time the start of that segment is detected.
[Claim 10] The speech translation method according to any one of claims 2-7, characterized in that the speech synthesis voiceprints include a male voiceprint and a female voiceprint, and the step of selecting the corresponding speech synthesis voiceprint according to the voice gender comprises:
when the voice gender is a male voice, selecting the male voiceprint;
when the voice gender is a female voice, selecting the female voiceprint.
[Claim 11] A speech translation apparatus, characterized by comprising:
a gender recognition module for identifying the voice gender of original voice information;
a voiceprint selection module for selecting a corresponding speech synthesis voiceprint according to the voice gender;
a translation processing module for translating the original voice information according to the selected speech synthesis voiceprint, so that the voice gender of the translated voice information is consistent with that of the original voice information.
[Claim 12] The speech translation apparatus according to claim 11, characterized in that the gender recognition module comprises:
an obtaining unit for obtaining the pitch frequency of the original voice information;
a comparison unit for comparing the pitch frequency with a threshold value;
a first recognition unit for determining the voice gender of the original voice information to be a male voice when the pitch frequency is less than or equal to the threshold value;
a second recognition unit for determining the voice gender of the original voice information to be a female voice when the pitch frequency is greater than the threshold value.
[Claim 13] The speech translation apparatus according to claim 12, characterized in that the obtaining unit comprises:
a sampling subunit for continuously sampling M frames of the original voice information at a preset sampling frequency, M ≥ 2;
an extraction subunit for extracting pitch-frequency features from the collected voice frames;
a statistics subunit for computing the pitch frequency of the original voice information from the extracted pitch-frequency features.
[Claim 14] The speech translation apparatus according to claim 13, characterized in that 25 ≤ M ≤ 35.
[Claim 15] The speech translation apparatus according to claim 13, characterized in that the duration of each voice frame is 20-30 ms.
[Claim 16] The speech translation apparatus according to claim 13, characterized in that the sampling frequency is 8 kHz.
[Claim 17] The speech translation apparatus according to claim 11, characterized in that the translation processing module comprises:
a first processing unit for performing speech recognition on the original voice information to obtain a first character string in the original language;
a second processing unit for performing text translation on the first character string to obtain a second character string in the target language;
a third processing unit for performing speech synthesis on the second character string using the selected speech synthesis voiceprint to obtain voice information in the target language.
[Claim 18] The speech translation apparatus according to claim 11, characterized in that the gender recognition module is configured to identify the voice gender of a segment of voice information each time the start of that segment is detected.
[Claim 19] The speech translation apparatus according to claim 12, characterized in that the speech synthesis voiceprints include a male voiceprint and a female voiceprint, and the voiceprint selection module comprises:
a first selection unit for selecting the male voiceprint when the voice gender is a male voice;
a second selection unit for selecting the female voiceprint when the voice gender is a female voice.
[Claim 20] A translation device comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, characterized in that the application is configured to perform the speech translation method according to claim 1.
PCT/CN2017/111961 2017-10-17 2017-11-20 Speech translation method and apparatus, and translation device WO2019075829A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710967364.0 2017-10-17
CN201710967364.0A CN107731232A (zh) 2017-10-17 2017-10-17 Speech translation method and apparatus

Publications (1)

Publication Number Publication Date
WO2019075829A1 true WO2019075829A1 (zh) 2019-04-25

Family

ID=61211655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/111961 WO2019075829A1 (zh) 2017-10-17 2017-11-20 语音翻译方法、装置和翻译设备

Country Status (2)

Country Link
CN (1) CN107731232A (zh)
WO (1) WO2019075829A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447486B (zh) * 2018-02-28 2021-12-03 科大讯飞股份有限公司 Speech translation method and apparatus
CN108831436A (zh) * 2018-06-12 2018-11-16 深圳市合言信息科技有限公司 Method for optimizing the synthesized speech of translated text by simulating the speaker's emotion
CN112201224A (zh) * 2020-10-09 2021-01-08 北京分音塔科技有限公司 Method, device and system for simultaneous interpretation of instant calls
CN112614482A (zh) * 2020-12-16 2021-04-06 平安国际智慧城市科技股份有限公司 Foreign-language translation method and system for a mobile terminal, and storage medium
CN112989847A (zh) * 2021-03-11 2021-06-18 读书郎教育科技有限公司 Recording and translation system and method for a scanning pen

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100068965A (ko) * 2008-12-15 2010-06-24 한국전자통신연구원 Automatic interpretation apparatus and method therefor
US20130144595A1 (en) * 2011-12-01 2013-06-06 Richard T. Lord Language translation based on speaker-related information
CN103236259A (zh) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Speech recognition processing and feedback system, and speech reply method
CN103365837A (zh) * 2012-03-29 2013-10-23 株式会社东芝 Machine translation apparatus, method and computer-readable medium
CN103559180A (zh) * 2013-10-12 2014-02-05 安波 Conversation translation machine
CN106156009A (zh) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 Speech translation method and apparatus
CN106528547A (zh) * 2016-11-09 2017-03-22 王东宇 Translation method for a translation machine

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4492461B2 (ja) * 2005-06-24 2010-06-30 凸版印刷株式会社 Karaoke system, apparatus and program
US7860705B2 (en) * 2006-09-01 2010-12-28 International Business Machines Corporation Methods and apparatus for context adaptation of speech-to-speech translation systems
CN101359473A (zh) * 2007-07-30 2009-02-04 国际商业机器公司 Method and apparatus for automatic speech conversion
CN101175272B (zh) * 2007-09-19 2010-12-08 中兴通讯股份有限公司 Method for reading out a text short message aloud
JP5328703B2 (ja) * 2010-03-23 2013-10-30 三菱電機株式会社 Prosody pattern generation device
CN103956163B (zh) * 2014-04-23 2017-01-11 成都零光量子科技有限公司 System and method for mutual conversion between ordinary speech and encrypted speech
CN105208194A (zh) * 2015-08-17 2015-12-30 努比亚技术有限公司 Voice broadcast apparatus and method
CN105913854B (zh) * 2016-04-15 2020-10-23 腾讯科技(深圳)有限公司 Speech signal cascade processing method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100068965A (ko) * 2008-12-15 2010-06-24 한국전자통신연구원 Automatic interpretation apparatus and method therefor
US20130144595A1 (en) * 2011-12-01 2013-06-06 Richard T. Lord Language translation based on speaker-related information
CN103365837A (zh) * 2012-03-29 2013-10-23 株式会社东芝 Machine translation apparatus, method and computer-readable medium
CN103236259A (zh) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Speech recognition processing and feedback system, and speech reply method
CN103559180A (zh) * 2013-10-12 2014-02-05 安波 Conversation translation machine
CN106156009A (zh) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 Speech translation method and apparatus
CN106528547A (zh) * 2016-11-09 2017-03-22 王东宇 Translation method for a translation machine

Also Published As

Publication number Publication date
CN107731232A (zh) 2018-02-23

Similar Documents

Publication Publication Date Title
WO2019075829A1 (zh) Speech translation method and apparatus, and translation device
US20200265197A1 (en) Language translation device and language translation method
WO2020222928A1 (en) Synchronization of audio signals from distributed devices
US9552815B2 (en) Speech understanding method and system
CN110049270A (zh) 多人会议语音转写方法、装置、系统、设备及存储介质
WO2020222925A1 (en) Customized output to optimize for user preference in a distributed system
JP6469252B2 (ja) Account addition method, terminal, server, and computer storage medium
JP6139598B2 (ja) Speech recognition client system for processing online speech recognition, speech recognition server system, and speech recognition method
WO2016165590A1 (zh) Speech translation method and apparatus
WO2020222935A1 (en) Speaker attributed transcript generation
US8818797B2 (en) Dual-band speech encoding
CN107623614A (zh) 用于推送信息的方法和装置
WO2020222930A1 (en) Audio-visual diarization to identify meeting attendees
CN103514882B (zh) Speech recognition method and system
WO2020222929A1 (en) Processing overlapping speech from distributed devices
WO2018214314A1 (zh) Method and apparatus for implementing simultaneous interpretation
CN106713111B (zh) Processing method for adding a friend, terminal and server
WO2014173325A1 (zh) Guttural sound identification method and device
CN107749296A (zh) Speech translation method and apparatus
WO2019101099A1 (zh) Video program identification method, device, terminal, system, and storage medium
TW200304638A (en) Network-accessible speaker-dependent voice models of multiple persons
WO2019169686A1 (zh) Speech translation method and apparatus, and computer device
US20150325252A1 (en) Method and device for eliminating noise, and mobile terminal
EP3963575A1 (en) Distributed device meeting initiation
WO2019169685A1 (zh) Speech processing method and apparatus, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17929016

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17929016

Country of ref document: EP

Kind code of ref document: A1