WO2019071541A1 - Voice translation method, apparatus, and terminal device
Voice translation method, apparatus, and terminal device
- Publication number
- WO2019071541A1 (PCT application PCT/CN2017/105915)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- original
- information
- voice information
- translation
- voiceprint
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/725—Cordless telephones
Definitions
- a main object of the present invention is to provide a speech translation method, apparatus and terminal device, which aim to improve the authenticity and vividness of translated speech and enhance the user experience.
- an embodiment of the present invention provides a voice translation method, where the method includes the following steps.
- the translation information is target voice information
- the step of synthesizing the translation information and the original voiceprint into final voice information includes:
- the step of synthesizing the original voiceprint into the voiceprint-free target voice information and generating the final voice information includes:
- the translation information is a target language string
- the step of synthesizing the translation information and the original voiceprint into final voice information includes:
- the step of performing translation processing on the original voice information to obtain translation information includes: sending the original voice information to a second server, so that the second server translates the original voice information into a target language string;
- the step of performing translation processing on the original voice information to obtain translation information includes: performing voice recognition on the original voice information to generate an original language string;
- the method further includes: transmitting the final voice information to an external device.
- an extraction module configured to extract an original voiceprint from the original voice information
- a processing module configured to perform translation processing on the original voice information, to obtain translation information
- the translation information is target voice information
- the synthesizing module includes:
- a voiceprint removal unit configured to remove a preset voiceprint from the target voice information, to obtain voiceprint-free target voice information
- the voiceprint synthesis unit is configured to: perform signal addition between the original voiceprint and the voiceprint-free target voice information to obtain the final voice information.
- a first sending unit configured to send the original voice information to the first server, so that the first server translates the original voice information into target voice information
- the first receiving unit is configured to receive the target voice information returned by the first server.
- the translation information is a target language string
- the synthesizing module is configured to: perform speech synthesis on the target language string by using the original voiceprint to generate final speech information.
- a second receiving unit configured to receive the target language string returned by the second server.
- a voice recognition unit configured to perform voice recognition on the original voice information and generate an original language string
- a character translation unit configured to translate the original language string into a target language string.
- the device further includes an output module, configured to output the final voice information.
- the device further includes a sending module, configured to send the final voice information to an external device.
- Embodiments of the present invention further provide a terminal device, where the terminal device includes a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, and the application is configured to perform the aforementioned speech translation method.
- in the speech translation method provided by an embodiment of the present invention, the original voiceprint is extracted from the original voice information, and the translation information and the original voiceprint are synthesized into the final voice information, so that the final voice information has the same voiceprint as the original voice information.
- as a result, it sounds as if the other user had spoken the translated language themselves, achieving the effect of original-voice translation, turning the human-machine dialogue into a direct person-to-person dialogue, improving the vividness and authenticity of the translated speech, and enhancing the user experience.
- FIG. 1 is a flow chart of an embodiment of a speech translation method of the present invention
- FIG. 2 is a block diagram showing an embodiment of a speech translation apparatus of the present invention
- FIG. 3 is a block diagram of the processing module of FIG. 2;
- FIG. 4 is a block diagram of still another module of the processing module of FIG. 2;
- FIG. 5 is a block diagram of still another module of the processing module of FIG. 2;
- FIG. 6 is a block diagram of the synthesis module of FIG. 2;
- FIG. 7 is a block diagram of the voiceprint removal unit of FIG. 6.
- the "terminal" and "terminal device" used herein include both devices that have only a wireless signal receiver without transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a two-way communication link.
- Such a device may include: a cellular or other communication device with a single-line display or a multi-line display, or a cellular or other communication device without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, fax, and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar, and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device that has and/or includes a radio frequency receiver.
- the "terminal" or "terminal device" may be portable, transportable, installed in a vehicle (air, sea, and/or land), or adapted and/or configured to operate locally and/or to run in a distributed fashion at any other location on the earth and/or in space.
- the "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device), and/or a mobile phone with music/video playback functions, and may also be a smart TV, a set-top box, or a similar device.
- the server used herein includes, but is not limited to, a computer, a network host, a single network server, a cluster of multiple network servers, or a cloud composed of multiple servers.
- here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a super virtual computer is composed of a group of loosely coupled computers.
- communication between the server, the terminal device, and the WNS server may be implemented by any communication means, including but not limited to mobile communication based on 3GPP, LTE, or WIMAX, computer network communication based on the TCP/IP and UDP protocols, and short-range wireless transmission based on the Bluetooth and infrared transmission standards.
- the speech translation method of the embodiments of the present invention can be applied to terminal devices such as a translator, a mobile terminal (such as a mobile phone or a tablet), or a personal computer, and can also be applied to a server.
- the following is a detailed description of the application to the terminal device.
- referring to FIG. 1, an embodiment of the speech translation method of the present invention includes the following steps. S11: Extract an original voiceprint from the original voice information.
- the original voice information may be the user's voice information collected on the spot by the terminal device through a microphone, or may be voice information to be translated that is obtained from an external source (such as a peer device).
- when the terminal device collects the original voice information, it preferably does so through a microphone array composed of multiple microphones, and uses processing methods such as beamforming and noise reduction on the microphone array to reduce the influence of environmental noise on later processing and improve voice quality.
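- As a rough illustration of the kind of microphone-array processing mentioned above, the sketch below shows a basic delay-and-sum beamformer; the array layout, per-microphone steering delays, and the use of NumPy are illustrative assumptions rather than anything specified in this disclosure.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays_samples: list[int]) -> np.ndarray:
    """Very simple delay-and-sum beamformer.

    channels: array of shape (num_mics, num_samples), one row per microphone.
    delays_samples: per-microphone steering delays (in samples) toward the speaker.
    Averaging the aligned channels attenuates uncorrelated noise, which is the
    kind of noise reduction the description refers to.
    """
    num_mics, num_samples = channels.shape
    out = np.zeros(num_samples)
    for mic, delay in zip(range(num_mics), delays_samples):
        # Shift each microphone signal so speech from the target direction lines up.
        out += np.roll(channels[mic], -delay)
    return out / num_mics
```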
- after acquiring the original voice information, the terminal device immediately extracts the original voiceprint from it and stores the original voiceprint.
- the terminal device can perform voiceprint extraction on the original voice information by using a wavelet transform algorithm in the prior art, and extract feature information of the original voiceprint in the time domain and the frequency domain.
- the specific extraction method is the same as the prior art, and is not described here.
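- The disclosure only states that a prior-art wavelet transform algorithm is used; as one hedged illustration of what time/frequency-domain voiceprint features might look like, the sketch below computes per-level energy statistics from a discrete wavelet decomposition using PyWavelets (the wavelet family, decomposition level, and feature choice are assumptions made for illustration).

```python
import numpy as np
import pywt  # PyWavelets

def extract_voiceprint_features(signal: np.ndarray, wavelet: str = "db4", level: int = 5) -> np.ndarray:
    """Illustrative voiceprint feature vector from a discrete wavelet transform.

    Decomposes the speech signal into approximation/detail coefficients and
    summarizes each level by its log-energy and standard deviation, giving a
    compact time/frequency-domain signature of the speaker.
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    features = []
    for band in coeffs:
        energy = float(np.sum(band ** 2)) + 1e-12  # avoid log(0) for silent bands
        features.append(np.log(energy))
        features.append(float(np.std(band)))
    return np.array(features)
```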
- in other embodiments, when the method is applied to a server, the original voice information comes from a terminal device; the server receives the original voice information sent by the terminal device and extracts the original voiceprint from it.
- S12: Perform translation processing on the original voice information to obtain translation information.
- the terminal device may perform translation processing on the original voice information locally, or may perform translation processing on the original voice information through the server.
- the translation information obtained by the terminal device may be the target voice information or the target language string.
- the terminal device sends the original voice information to the first server, so that the first server translates the original voice information into the target voice information.
- after receiving the original voice information, the first server performs voice recognition on it to generate an original language string, then translates the original language string into a target language string, and finally performs speech synthesis on the target language string using a preset voiceprint to generate the target voice information, which it returns to the terminal device.
- the terminal device receives the target voice information returned by the first server.
- the terminal device sends the original voice information to the second server, so that the second server translates the original voice information into a target language string.
- after receiving the original voice information, the second server performs voice recognition on it to generate an original language string, then translates the original language string into a target language string and returns the target language string to the terminal device.
- the terminal device receives the target language string returned by the second server.
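- A minimal sketch of the terminal-to-server exchange described above is given below; the endpoint URL, payload fields, and response format are hypothetical placeholders, since the disclosure does not define a wire protocol.

```python
import requests

TRANSLATION_SERVER_URL = "https://example.com/translate"  # hypothetical endpoint

def translate_via_server(audio_bytes: bytes, source_lang: str, target_lang: str) -> str:
    """Send the original voice information to the (second) server and return the
    target language string it produces. All field names are assumptions."""
    response = requests.post(
        TRANSLATION_SERVER_URL,
        files={"audio": ("speech.wav", audio_bytes, "audio/wav")},
        data={"source_lang": source_lang, "target_lang": target_lang},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["target_text"]
```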
- the terminal device directly performs voice recognition on the original voice information, generates an original language string, and then translates the original language string into a target language string.
- in other embodiments, when the method is applied to a server, the server performs speech recognition on the original voice information to generate an original language string and then translates the original language string into a target language string.
- optionally, when the translation information is the target voice information, the terminal device first removes the preset voiceprint from the target voice information to obtain voiceprint-free target voice information, and then synthesizes the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
- when removing the preset voiceprint, the terminal device may first extract the preset voiceprint from the target voice information, for example by using a prior-art wavelet transform algorithm to extract the preset voiceprint's feature information in the time domain and the frequency domain, and then perform signal subtraction between the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information. Those skilled in the art will understand that voiceprint removal may also be performed by other prior-art methods, which are not enumerated here.
- when performing voiceprint synthesis, the terminal device may perform signal addition between the original voiceprint and the voiceprint-free target voice information to obtain the final voice information, so that the final voice information sounds like the user's own voice, achieving original-voice translation. Those skilled in the art will understand that voiceprint synthesis may also be performed by other prior-art methods, which are not enumerated here.
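- As a loose numerical illustration of the signal subtraction and addition steps, the sketch below treats the "voiceprint" as a time-domain component that can be subtracted from or added to the speech sample-by-sample; real systems would likely work on richer spectral representations, so this is only a schematic reading of the description.

```python
import numpy as np

def remove_preset_voiceprint(target_speech: np.ndarray, preset_voiceprint: np.ndarray) -> np.ndarray:
    """Signal subtraction: target voice information minus the preset voiceprint
    component, yielding voiceprint-free target voice information."""
    n = min(len(target_speech), len(preset_voiceprint))
    return target_speech[:n] - preset_voiceprint[:n]

def add_original_voiceprint(voiceprint_free_speech: np.ndarray, original_voiceprint: np.ndarray) -> np.ndarray:
    """Signal addition: voiceprint-free target voice information plus the original
    voiceprint component, yielding the final voice information."""
    n = min(len(voiceprint_free_speech), len(original_voiceprint))
    return voiceprint_free_speech[:n] + original_voiceprint[:n]
```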
- optionally, when the translation information is the target language string, the terminal device directly performs speech synthesis on the target language string using the original voiceprint to generate the final voice information.
- the terminal device may perform the speech synthesis using existing speech synthesis technology, which is not described here.
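- Where the translation information is a target language string, the synthesis step amounts to ordinary text-to-speech conditioned on the extracted voiceprint. The sketch below only shows the shape of such a call; `SpeechSynthesizer` is a hypothetical interface standing in for whatever existing speech synthesis engine is used, which the disclosure does not name.

```python
from typing import Callable
import numpy as np

# Hypothetical interface: any existing TTS engine that accepts speaker/voiceprint
# features could be plugged in here.
SpeechSynthesizer = Callable[[str, np.ndarray], np.ndarray]

def synthesize_final_voice(target_text: str,
                           original_voiceprint: np.ndarray,
                           synthesize: SpeechSynthesizer) -> np.ndarray:
    """Final voice information: the translated string rendered with the original voiceprint."""
    return synthesize(target_text, original_voiceprint)
```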
- after the final voice information is generated, the terminal device may directly output it, for example through a sound-producing device such as an earpiece or a speaker, or may send it to an external device, for example to the peer device.
- in other embodiments, when the method is applied to a server, the server directly performs speech synthesis on the target language string using the original voiceprint to generate the final voice information, and sends the final voice information to the terminal device.
- for example: the translator (terminal device) collects the original voice information, extracts the original voiceprint from it and stores the voiceprint locally, and transmits the original voice information to the server.
- the server translates the original voice information into target voice information and returns it to the translator.
- the translator receives the target voice information returned by the server, removes the preset voiceprint from it, synthesizes the original voiceprint into the voiceprint-free target voice information to generate the final voice information, and outputs the final voice information. Two users who speak different languages can thus hold a face-to-face conversation through the translator, and the translated final voice information output by the translator has the same voiceprint as the speaking user, as if the user had spoken the translated language, achieving the effect of original-voice translation. A simplified terminal-side pipeline along these lines is sketched below.
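- Putting the steps of this example together, a highly simplified terminal-side pipeline might look like the following; `record_audio`, `extract_voiceprint`, `send_to_translation_server`, `play_audio`, and the preset voiceprint source are hypothetical helpers, and the server is assumed to return target voice information synthesized with a known preset voiceprint, as described above.

```python
import numpy as np

def run_translator_turn(
    record_audio,                 # () -> np.ndarray, hypothetical microphone capture
    extract_voiceprint,           # (np.ndarray) -> np.ndarray, original voiceprint component (S11)
    send_to_translation_server,   # (np.ndarray) -> np.ndarray, returns target voice information (S12)
    preset_voiceprint: np.ndarray,
    play_audio,                   # (np.ndarray) -> None, earpiece/speaker output
) -> None:
    """One conversational turn on the translator (terminal device)."""
    original_speech = record_audio()
    original_voiceprint = extract_voiceprint(original_speech)    # stored locally
    target_speech = send_to_translation_server(original_speech)  # preset-voiceprint TTS on the server
    n = min(len(target_speech), len(preset_voiceprint), len(original_voiceprint))
    neutral_speech = target_speech[:n] - preset_voiceprint[:n]   # remove preset voiceprint
    final_speech = neutral_speech + original_voiceprint[:n]      # add the original voiceprint (S13)
    play_audio(final_speech)                                     # output final voice information
```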
- the mobile terminal (terminal device) collects the original voice information, extracts the original voiceprint from it and stores the voiceprint locally, and sends the original voice information to the server.
- the server translates the original voice information into the target voice information and returns it to the mobile terminal.
- the mobile terminal receives the target voice information returned by the server, removes the preset voiceprint from it, synthesizes the original voiceprint into the voiceprint-free target voice information to generate the final voice information, and sends the final voice information to the peer device.
- two users who speak different languages can thus hold a remote conversation through their mobile terminals, and the translated final voice information has the same voiceprint as the speaking user, as if the user had spoken the translated language, achieving the effect of original-voice translation.
- the server receives the original voice information sent by the terminal device, extracts the original voiceprint from the original voice information, performs voice recognition on the original voice information, translates the recognition result into a target language string, and performs speech synthesis on the target language string using the original voiceprint.
- the final voice information is generated and returned to the terminal device or to the terminal device's peer device (i.e., the device that has established a communication connection with the terminal device). Since the translated final voice information has the same voiceprint as the user, it is as if the user had spoken the translated language, achieving the effect of original-voice translation.
- the speech translation method of the embodiment of the present invention extracts the original voiceprint from the original voice information and then synthesizes the translation information and the original voiceprint into the final voice information, so that the final voice information has the same voiceprint as the original voice information. It sounds as if the other user had spoken the translated language themselves, achieving the effect of original-voice translation, turning the human-machine dialogue into a direct person-to-person dialogue, improving the vividness and authenticity of the translated speech, and enhancing the user experience.
- referring to FIG. 2, the apparatus includes an extraction module 10, a processing module 20, and a synthesis module 30, wherein: the extraction module 10 is configured to extract an original voiceprint from the original voice information; the processing module 20 is configured to perform translation processing on the original voice information to obtain translation information; and the synthesis module 30 is configured to synthesize the translation information and the original voiceprint into final voice information. A possible class-level sketch of this module structure follows.
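- One way to picture the module structure described here is a thin object wrapper, as in the sketch below; the constructor arguments and method names are illustrative assumptions rather than anything specified in the disclosure.

```python
class SpeechTranslationApparatus:
    """Mirrors the extraction module 10, processing module 20, and synthesis module 30."""

    def __init__(self, extraction_module, processing_module, synthesis_module):
        self.extraction_module = extraction_module   # extracts the original voiceprint
        self.processing_module = processing_module   # obtains the translation information
        self.synthesis_module = synthesis_module     # produces the final voice information

    def translate(self, original_speech):
        original_voiceprint = self.extraction_module.extract(original_speech)
        translation_info = self.processing_module.process(original_speech)
        return self.synthesis_module.synthesize(translation_info, original_voiceprint)
```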
- the extraction module 10 may perform voiceprint extraction on the original voice information by using a wavelet transform algorithm in the prior art, and extract feature information of the original voiceprint in the time domain and the frequency domain.
- the specific extraction method is the same as the prior art, and is not described here.
- the translation information obtained by the processing module 20 may be the target voice information, or may be the target language string.
- the processing module 20 includes a first sending unit 21 and a first receiving unit 22, wherein: the first sending unit 21 is configured to send the original voice information to the first server, so that the first server translates the original voice information into the target voice information.
- the first receiving unit 22 is configured to receive the target voice information returned by the first server.
- the processing module 20 includes a second sending unit 23 and a second receiving unit 24,
- the second sending unit 23 is configured to send the original voice information to the second server, so that the second server translates the original voice information into the target language string;
- the second receiving unit 24 is configured to receive the target language string returned by the second server.
- the processing module 20 includes a voice recognition unit 25 and a character translation unit 26, wherein: the voice recognition unit 25 is configured to perform voice recognition on the original voice information to generate an original language string; and the character translation unit 26 is configured to translate the original language string into a target language string.
- the synthesizing module 30 synthesizes the translation information and the original voiceprint into the final voice information.
- as shown in FIG. 6, the synthesizing module 30 includes a voiceprint removal unit 31 and a voiceprint synthesis unit 32, wherein: the voiceprint removal unit 31 is configured to remove the preset voiceprint from the target voice information to obtain voiceprint-free target voice information; and the voiceprint synthesis unit 32 is configured to synthesize the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
- as shown in FIG. 7, the voiceprint removal unit 31 includes a voiceprint extraction subunit 311 and a subtraction subunit 312, wherein: the voiceprint extraction subunit 311 is configured to extract the preset voiceprint from the target voice information, for example by using a prior-art wavelet transform algorithm to extract the preset voiceprint's feature information in the time domain and the frequency domain; and the subtraction subunit 312 is configured to perform signal subtraction between the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information.
- the voiceprint synthesis unit 32 may perform signal addition between the original voiceprint and the voiceprint-free target voice information to obtain the final voice information, so that the final voice information sounds like the user's own voice, achieving original-voice translation. Those skilled in the art will understand that voiceprint synthesis may also be performed by other prior-art methods, which are not described again here.
- when the translation information is the target language string, the synthesizing module 30 directly performs speech synthesis on the target language string using the original voiceprint to generate the final voice information.
- the synthesis module 30 can perform speech synthesis using the existing speech synthesis technology, and will not be described here.
- the apparatus may further include an output module for outputting final voice information.
- the output module outputs the final voice information through a sound-producing device such as an earpiece or a speaker.
- the apparatus further includes a sending module, configured to send the final voice information to an external device, such as a terminal device.
- the voice translation apparatus of the embodiments of the present invention can be applied to terminal devices such as a translator, a mobile terminal (such as a mobile phone or a tablet), or a personal computer, and can also be applied to a server; the present invention does not limit this.
- the speech translation apparatus of the embodiment of the present invention extracts the original voiceprint from the original voice information and then synthesizes the translation information and the original voiceprint into the final voice information, so that the final voice information has the same voiceprint as the original voice information. It sounds as if the other user had spoken the translated language themselves, achieving the effect of original-voice translation, turning the human-machine dialogue into a direct person-to-person dialogue, improving the vividness and authenticity of the translated speech, and enhancing the user experience.
- the present invention also provides a terminal device including a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform a speech translation method.
- the speech translation method comprises the steps of: extracting an original voiceprint from the original voice information; performing translation processing on the original voice information to obtain translation information; and synthesizing the translation information and the original voiceprint into final voice information.
- the speech translation method described in this embodiment is the speech translation method involved in the above embodiment of the present invention, and details are not described herein again.
- the present invention includes apparatus related to performing one or more of the operations described herein.
- These devices may be specially designed and manufactured for the required purposes, or may also include known devices in a general purpose computer.
- These devices have computer programs stored therein that are selectively activated or reconfigured.
- Such computer programs may be stored in a device-readable (e.g., computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, CDs, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), and EPROM (Erasable Programmable Read-Only Memory).
- a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
- each block of the block diagrams and/or flow diagrams, and each combination of blocks in the block diagrams and/or flow diagrams, can be implemented by computer program instructions.
- these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, implement the schemes specified in the blocks of the block diagrams and/or flow diagrams of the invention.
Abstract
Disclosed by the present invention are a voice translation method, apparatus, and terminal device, the method comprising the following steps: extracting an original voiceprint from original voice information; performing translation processing on the original voice information to obtain translation information; and combining the translation information and the original voiceprint into final voice information, so that the final voice information has the same voiceprint as the original voice information, thereby achieving the effect of original-voice translation and increasing the vividness and realism of the translated voice.
Description
Speech translation method, apparatus, and terminal device
Technical field
[0001] The present invention relates to the field of communications technologies, and in particular to a speech translation method, apparatus, and terminal device.
Background
[0002] A translator can translate voice information in one language into voice information in another language, so people who speak different languages can use a translator to communicate without barriers. The specific process by which the translator performs speech translation is as follows: it receives the user's original voice information and sends the original voice information to a server; the server performs a series of translation processes on the original voice information, such as voice recognition, character translation, and speech synthesis, to obtain target voice information, which it returns to the translator; and the translator outputs the target voice information.
[0003] The voiceprint of the target voice information generated by the server after translation is preset, so all translated speech sounds like the same person's voice. It is monotonous and makes the listener feel like they are talking to a robot rather than to a real person; it lacks realism and a human touch, easily causes listening fatigue, and gives a poor user experience.
Technical problem
[0004] A main object of the present invention is to provide a speech translation method, apparatus, and terminal device, which aim to improve the authenticity and vividness of translated speech and enhance the user experience.
Solution to the problem
Technical solution
[0005] In order to achieve the above objective, an embodiment of the present invention provides a speech translation method, which includes the following steps:
[0006] extracting an original voiceprint from original voice information;
[0007] performing translation processing on the original voice information to obtain translation information;
[0008] synthesizing the translation information and the original voiceprint into final voice information.
[0009] Optionally, the translation information is target voice information, and the step of synthesizing the translation information and the original voiceprint into final voice information includes:
[0010] removing a preset voiceprint from the target voice information to obtain voiceprint-free target voice information;
[0011] synthesizing the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
[0012] Optionally, the step of removing the preset voiceprint from the target voice information includes:
[0013] extracting the preset voiceprint from the target voice information;
[0014] performing signal subtraction between the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information.
[0015] Optionally, the step of synthesizing the original voiceprint into the voiceprint-free target voice information to generate the final voice information includes:
[0016] performing signal addition between the original voiceprint and the voiceprint-free target voice information to obtain the final voice information.
[0017] Optionally, the step of performing translation processing on the original voice information to obtain translation information includes:
[0018] sending the original voice information to a first server, so that the first server translates the original voice information into target voice information;
[0019] receiving the target voice information returned by the first server.
[0020] Optionally, the translation information is a target language string, and the step of synthesizing the translation information and the original voiceprint into final voice information includes:
[0021] performing speech synthesis on the target language string using the original voiceprint to generate the final voice information.
[0022] Optionally, the step of performing translation processing on the original voice information to obtain translation information includes:
[0023] sending the original voice information to a second server, so that the second server translates the original voice information into a target language string;
[0024] receiving the target language string returned by the second server.
[0025] Optionally, the step of performing translation processing on the original voice information to obtain translation information includes:
[0026] performing voice recognition on the original voice information to generate an original language string;
[0027] translating the original language string into a target language string.
[0028] Optionally, after the step of synthesizing the translation information and the original voiceprint into final voice information, the method further includes:
[0029] outputting the final voice information.
[0030] Optionally, after the step of synthesizing the translation information and the original voiceprint into final voice information, the method further includes:
[0031] transmitting the final voice information to an external device.
[0032] An embodiment of the present invention also provides a speech translation apparatus, which includes:
[0033] an extraction module, configured to extract an original voiceprint from original voice information;
[0034] a processing module, configured to perform translation processing on the original voice information to obtain translation information;
[0035] a synthesis module, configured to synthesize the translation information and the original voiceprint into final voice information.
[0036] Optionally, the translation information is target voice information, and the synthesis module includes:
[0037] a voiceprint removal unit, configured to remove a preset voiceprint from the target voice information to obtain voiceprint-free target voice information;
[0038] a voiceprint synthesis unit, configured to synthesize the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
[0039] Optionally, the voiceprint removal unit includes:
[0040] a voiceprint extraction subunit, configured to extract the preset voiceprint from the target voice information;
[0041] a subtraction subunit, configured to perform signal subtraction between the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information.
[0042] Optionally, the voiceprint synthesis unit is configured to perform signal addition between the original voiceprint and the voiceprint-free target voice information to obtain the final voice information.
[0043] Optionally, the processing module includes:
[0044] a first sending unit, configured to send the original voice information to a first server, so that the first server translates the original voice information into target voice information;
[0045] a first receiving unit, configured to receive the target voice information returned by the first server.
[0046] Optionally, the translation information is a target language string, and the synthesis module is configured to perform speech synthesis on the target language string using the original voiceprint to generate the final voice information.
[0047] Optionally, the processing module includes:
[0048] a second sending unit, configured to send the original voice information to a second server, so that the second server translates the original voice information into a target language string;
[0049] a second receiving unit, configured to receive the target language string returned by the second server.
[0050] Optionally, the processing module includes:
[0051] a voice recognition unit, configured to perform voice recognition on the original voice information to generate an original language string;
[0052] a character translation unit, configured to translate the original language string into a target language string.
[0053] Optionally, the apparatus further includes an output module, configured to output the final voice information.
[0054] Optionally, the apparatus further includes a sending module, configured to transmit the final voice information to an external device.
[0055] An embodiment of the present invention also provides a terminal device, which includes a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, where the application is configured to perform the aforementioned speech translation method.
Advantageous effects of the invention
Beneficial effects
[0056] In the speech translation method provided by an embodiment of the present invention, the original voiceprint is extracted from the original voice information, and the translation information and the original voiceprint are then synthesized into the final voice information, so that the final voice information has the same voiceprint as the original voice information. It sounds as if the other user had spoken the translated language themselves, achieving the effect of original-voice translation, turning the human-machine dialogue into a direct person-to-person dialogue, improving the vividness and authenticity of the translated speech, and enhancing the user experience.
Brief description of the drawings
DRAWINGS
[0057] FIG. 1 is a flow chart of an embodiment of the speech translation method of the present invention;
[0058] FIG. 2 is a block diagram of an embodiment of the speech translation apparatus of the present invention;
[0059] FIG. 3 is a block diagram of the processing module of FIG. 2;
[0060] FIG. 4 is a block diagram of still another module of the processing module of FIG. 2;
[0061] FIG. 5 is a block diagram of still another module of the processing module of FIG. 2;
[0062] FIG. 6 is a block diagram of the synthesis module of FIG. 2;
[0063] FIG. 7 is a block diagram of the voiceprint removal unit of FIG. 6.
[0064] The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
BEST MODE FOR CARRYING OUT THE INVENTION
[0065] It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
[0066] Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it.
[0067] Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the", and "said" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may also be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or a wireless coupling. The phrase "and/or" used herein includes all or any unit and all combinations of one or more of the associated listed items.
[0068] Those skilled in the art will understand that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless specifically defined as herein, will not be interpreted in an idealized or overly formal sense.
[0069] Those skilled in the art will understand that the "terminal" and "terminal device" used herein include both devices that have only a wireless signal receiver without transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device with a single-line display or a multi-line display, or a cellular or other communication device without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, fax, and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar, and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device that has and/or includes a radio frequency receiver. The "terminal" and "terminal device" used herein may be portable, transportable, installed in a vehicle (air, sea, and/or land), or adapted and/or configured to operate locally and/or to run in a distributed fashion at any other location on the earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device), and/or a mobile phone with music/video playback functions, and may also be a smart TV, a set-top box, or a similar device.
[0070] Those skilled in the art will understand that the server used herein includes, but is not limited to, a computer, a network host, a single network server, a cluster of multiple network servers, or a cloud composed of multiple servers. Here, the cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a super virtual computer is composed of a group of loosely coupled computers. In the embodiments of the present invention, communication between the server, the terminal device, and the WNS server may be implemented by any communication means, including but not limited to mobile communication based on 3GPP, LTE, or WIMAX, computer network communication based on the TCP/IP and UDP protocols, and short-range wireless transmission based on the Bluetooth and infrared transmission standards.
[0071] The speech translation method of the embodiments of the present invention can be applied to terminal devices such as a translator, a mobile terminal (such as a mobile phone or a tablet), or a personal computer, and can also be applied to a server. The following describes in detail the case of applying the method to a terminal device.
[0072] Referring to FIG. 1, an embodiment of the speech translation method of the present invention is proposed. The method includes the following steps.
[0073] S11: Extract an original voiceprint from original voice information.
[0074] In the embodiment of the present invention, the original voice information may be the user's voice information collected on the spot by the terminal device through a microphone, or may be voice information to be translated that is obtained from an external source (such as a peer device). When the terminal device collects the original voice information, it preferably does so through a microphone array composed of multiple microphones, and uses processing methods such as beamforming and noise reduction on the microphone array to reduce the influence of environmental noise on later processing and improve voice quality.
[0075] After acquiring the original voice information, the terminal device immediately extracts the original voiceprint from it and stores the original voiceprint. The terminal device may perform voiceprint extraction on the original voice information using a prior-art wavelet transform algorithm, extracting feature information of the original voiceprint in the time domain and the frequency domain. The specific extraction method is the same as in the prior art and is not described here.
[0076] In other embodiments, when the method is applied to a server, the original voice information comes from a terminal device; the server receives the original voice information sent by the terminal device and extracts the original voiceprint from it.
[0077] S12: Perform translation processing on the original voice information to obtain translation information.
[0078] The terminal device may perform the translation processing on the original voice information locally, or may have the original voice information translated by a server. The translation information obtained by the terminal device may be target voice information or a target language string.
[0079] Optionally, the terminal device sends the original voice information to a first server, so that the first server translates the original voice information into target voice information. After receiving the original voice information, the first server performs voice recognition on it to generate an original language string, then translates the original language string into a target language string, and finally performs speech synthesis on the target language string using a preset voiceprint to generate the target voice information, which it returns to the terminal device. The terminal device receives the target voice information returned by the first server.
[0080] Optionally, the terminal device sends the original voice information to a second server, so that the second server translates the original voice information into a target language string. After receiving the original voice information, the second server performs voice recognition on it to generate an original language string, then translates the original language string into a target language string and returns the target language string to the terminal device. The terminal device receives the target language string returned by the second server.
[0081] Optionally, the terminal device directly performs voice recognition on the original voice information to generate an original language string, and then translates the original language string into a target language string.
[0082] In other embodiments, when the method is applied to a server, the server performs voice recognition on the original voice information to generate an original language string and then translates the original language string into a target language string.
[0083] S13: Synthesize the translation information and the original voiceprint into final voice information.
[0084] 可选地, 当翻译信息为目标语音信息吋, 终端设备首先剔除目标语音信息中的 预设声纹, 得到无声纹的目标语音信息; 然后将原始声纹合成到无声纹的目标 语音信息中, 生成最终语音信息。 [0084] Optionally, when the translation information is the target voice information, the terminal device first rejects the preset voiceprint in the target voice information to obtain the target voice information of the voiceless pattern; and then synthesizes the original voiceprint into the target voice of the voiceless pattern. In the message, the final voice message is generated.
[0085] 在剔除预设声纹吋, 终端设备可以先从目标语音信息中提取出预设声纹, 如利 用现有技术中的小波变换算法对目标语音信息进行声纹提取, 提取出预设声纹 的吋域和频域的特征信息; 然后对目标语音信息和预设声纹做信号减法运算,
就能得到无声纹的目标语音信息。 本领域技术人员可以理解, 除此之外, 也可 以利用现有技术中的其它方式进行声纹剔除, 本发明对此不再一一列举赘述。 [0085] After the preset voiceprint is removed, the terminal device may first extract the preset voiceprint from the target voice information, for example, using the wavelet transform algorithm in the prior art to extract the voiceprint of the target voice information, and extract the preset. Characteristic information of the voice field and the frequency domain; then performing signal subtraction on the target voice information and the preset voiceprint, You can get the target voice information without voice. It can be understood by those skilled in the art that, besides this, the voiceprint culling can also be performed by other methods in the prior art, and the present invention will not be described again.
[0086] 在进行声纹合成吋, 终端设备可以对原始声纹和无声纹的目标语音信息做信号 加法运算, 得到最终语音信息, 从而使得最终语音信息听起来就像用户的原声 , 实现了原声翻译。 本领域技术人员可以理解, 除此之外, 也可以利用现有技 术中的其它方式进行声纹合成, 本发明对此不再一一列举赘述。 [0086] After the voiceprint synthesis, the terminal device can perform signal addition on the original voiceprint and the voiceless target voice information to obtain the final voice information, so that the final voice information sounds like the user's original sound, and the original sound is realized. translation. It can be understood by those skilled in the art that, besides this, the voiceprint synthesis can also be performed by other means in the prior art, and the present invention will not be repeated here.
[0087] Optionally, when the translation information is a target language string, the terminal device directly performs speech synthesis on the target language string using the original voiceprint to generate the final voice information. The terminal device may use existing speech synthesis technology for this, which is not described in detail here.
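For illustration only, a speech synthesis call conditioned on the original voiceprint might be organized as below; the tts_engine object and both of its methods are hypothetical placeholders rather than a real library API:

```python
# Hypothetical sketch: condition a TTS engine on speaker characteristics derived
# from the original voiceprint. build_speaker_profile() and synthesize() are
# placeholder names, not the API of any particular speech synthesis library.
def synthesize_from_string(target_string: str, original_voiceprint, tts_engine):
    speaker_profile = tts_engine.build_speaker_profile(original_voiceprint)  # assumed helper
    return tts_engine.synthesize(text=target_string, speaker=speaker_profile)
```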
[0088] After the final voice information is generated, the terminal device may output it directly, for example through a sounding device such as an earpiece or a loudspeaker, or may send it out, for example to a peer device.
[0089] In other embodiments, when the method is applied to a server, the server directly performs speech synthesis on the target language string using the original voiceprint to generate the final voice information, and sends the final voice information to the terminal device.
[0090] For example:
[0091] A translation machine (terminal device) collects the original voice information, extracts the original voiceprint from the original voice information and stores it locally, and sends the original voice information to a server. The server translates the original voice information into target voice information and returns it to the translation machine. The translation machine receives the target voice information returned by the server, removes the preset voiceprint from it, synthesizes the original voiceprint into the voiceprint-free target voice information, generates the final voice information, and outputs the final voice information. Two users who speak different languages can thus hold a face-to-face conversation through the translation machine, and the translated final voice information output by the machine carries the same voiceprint as the user, as if the user had spoken the translated language in person, achieving the effect of original-voice translation.
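A rough end-to-end sketch of this device-side flow, reusing the illustrative helper functions above, is given below; record_audio, server_translate, and play_audio are placeholders for the device's microphone, network, and loudspeaker facilities:

```python
# Illustrative device-side flow only, built on the sketch functions above.
def translate_and_speak(record_audio, server_translate, play_audio):
    original_speech = record_audio()
    original_vp = extract_voiceprint(original_speech)        # stored locally
    target_speech = server_translate(original_speech)        # target voice information
    preset_vp = extract_voiceprint(target_speech)
    stripped = remove_preset_voiceprint(target_speech, preset_vp)  # signal subtraction
    final_speech = add_original_voiceprint(stripped, original_vp)  # signal addition
    play_audio(final_speech)
```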
[0092] A mobile terminal (terminal device) collects the original voice information, extracts the original voiceprint from the original voice information and stores it locally, and sends the original voice information to a server. The server translates the original voice information into target voice information and returns it to the mobile terminal. The mobile terminal receives the target voice information returned by the server, removes the preset voiceprint from it, synthesizes the original voiceprint into the voiceprint-free target voice information, generates the final voice information, and sends the final voice information to the peer device. Two users who speak different languages can thus hold a remote conversation through their mobile terminals, and the translated final voice information carries the same voiceprint as the user, as if the user had spoken the translated language in person, achieving the effect of original-voice translation.
[0093] The server receives the original voice information sent by the terminal device, extracts the original voiceprint from the original voice information, performs speech recognition and translation processing on the original voice information to generate a target language string, performs speech synthesis on the target language string using the original voiceprint to generate the final voice information, and returns the final voice information to the terminal device or to the peer device of the terminal device (that is, a device that has established a communication connection with the terminal device). Since the translated final voice information carries the same voiceprint as the user, it is as if the user had spoken the translated language in person, achieving the effect of original-voice translation.
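As a sketch under the same caveats, and reusing the extract_voiceprint sketch above, this server-side embodiment could be arranged as follows; recognize, translate_text, and synthesize_speech are placeholders for whatever speech recognition, machine translation, and speech synthesis back ends are actually deployed:

```python
# Illustrative server-side flow for the string-based embodiment.
def handle_translation_request(original_speech, recognize, translate_text, synthesize_speech):
    original_vp = extract_voiceprint(original_speech)
    original_string = recognize(original_speech)
    target_string = translate_text(original_string)
    final_speech = synthesize_speech(target_string, voiceprint=original_vp)
    return final_speech  # returned to the terminal device or its peer device
```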
[0094] The speech translation method of the embodiments of the present invention extracts the original voiceprint from the original voice information and then synthesizes the translation information and the original voiceprint into the final voice information, so that the final voice information carries the same voiceprint as the original voice information. It sounds as if the other user had spoken the translated language in person, achieving the effect of original-voice translation, turning the human-machine dialogue into a direct person-to-person dialogue, improving the vividness and authenticity of the translated speech, and enhancing the user experience.
[0095] Referring to FIG. 2, an embodiment of the speech translation apparatus of the present invention is provided. The apparatus includes an extraction module 10, a processing module 20, and a synthesis module 30, wherein: the extraction module 10 is configured to extract the original voiceprint from the original voice information; the processing module 20 is configured to perform translation processing on the original voice information to obtain translation information; and the synthesis module 30 is configured to synthesize the translation information and the original voiceprint into the final voice information.
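Purely as an object-level sketch of how the three modules of FIG. 2 might fit together (the translate and synthesize callables are injected because this application leaves their back ends open, and extract_voiceprint is the illustrative helper from the sketch above):

```python
# Illustrative composition of the three modules; not the claimed implementation.
class SpeechTranslationApparatus:
    def __init__(self, translate, synthesize):
        self.translate = translate        # processing module 20 back end
        self.synthesize = synthesize      # synthesis module 30 back end

    def run(self, original_speech):
        original_vp = extract_voiceprint(original_speech)       # extraction module 10
        translation_info = self.translate(original_speech)      # target speech or string
        return self.synthesize(translation_info, original_vp)   # final voice information
```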
[0096] The extraction module 10 may perform voiceprint extraction on the original voice information using a prior-art wavelet transform algorithm to extract the time-domain and frequency-domain feature information of the original voiceprint. The specific extraction method is the same as in the prior art and is not described here.
[0097] The translation information obtained by the processing module 20 may be target voice information or a target language string.
[0098] Optionally, as shown in FIG. 3, the processing module 20 includes a first sending unit 21 and a first receiving unit 22, wherein: the first sending unit 21 is configured to send the original voice information to a first server, so that the first server translates the original voice information into target voice information; and the first receiving unit 22 is configured to receive the target voice information returned by the first server.
[0099] Optionally, as shown in FIG. 4, the processing module 20 includes a second sending unit 23 and a second receiving unit 24, wherein: the second sending unit 23 is configured to send the original voice information to a second server, so that the second server translates the original voice information into a target language string; and the second receiving unit 24 is configured to receive the target language string returned by the second server.
[0100] Optionally, as shown in FIG. 5, the processing module 20 includes a speech recognition unit 25 and a character translation unit 26, wherein: the speech recognition unit 25 is configured to perform speech recognition on the original voice information to generate an original language string; and the character translation unit 26 is configured to translate the original language string into a target language string.
[0101] After the processing module 20 obtains the translation information, the synthesis module 30 synthesizes the translation information and the original voiceprint into the final voice information.
[0102] Optionally, when the translation information is target voice information, the synthesis module 30, as shown in FIG. 6, includes a voiceprint removal unit 31 and a voiceprint synthesis unit 32, wherein: the voiceprint removal unit 31 is configured to remove the preset voiceprint from the target voice information to obtain voiceprint-free target voice information; and the voiceprint synthesis unit 32 is configured to synthesize the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
[0103] In this embodiment of the present invention, the voiceprint removal unit 31, as shown in FIG. 7, includes a voiceprint extraction subunit 311 and a subtraction subunit 312, wherein: the voiceprint extraction subunit 311 is configured to extract the preset voiceprint from the target voice information, for example by performing voiceprint extraction on the target voice information with a prior-art wavelet transform algorithm to extract the time-domain and frequency-domain feature information of the preset voiceprint; and the subtraction subunit 312 is configured to perform a signal subtraction on the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information.
[0104] Those skilled in the art will understand that other prior-art approaches may also be used for voiceprint removal; they are not enumerated here.
[0105] When performing voiceprint synthesis, the voiceprint synthesis unit 32 may perform a signal addition on the original voiceprint and the voiceprint-free target voice information to obtain the final voice information, so that the final voice information sounds like the user's own voice and original-voice translation is achieved. Those skilled in the art will understand that other prior-art approaches may also be used for voiceprint synthesis; they are not enumerated here.
[0106] Optionally, when the translation information is a target language string, the synthesis module 30 directly performs speech synthesis on the target language string using the original voiceprint to generate the final voice information. The synthesis module 30 may use existing speech synthesis technology for this, which is not described here.
[0107] Further, the apparatus may also include an output module configured to output the final voice information. For example, the output module outputs the final voice information through a sounding device such as an earpiece or a loudspeaker.
[0108] Further, the apparatus may also include a sending module configured to send the final voice information out, for example to a terminal device.
[0109] The speech translation apparatus of the embodiments of the present invention may be applied to terminal devices such as translation machines, mobile terminals (for example, mobile phones and tablets), and personal computers, and may also be applied to servers; the present invention does not limit this.
[0110] The speech translation apparatus of the embodiments of the present invention extracts the original voiceprint from the original voice information and then synthesizes the translation information and the original voiceprint into the final voice information, so that the final voice information carries the same voiceprint as the original voice information. It sounds as if the other user had spoken the translated language in person, achieving the effect of original-voice translation, turning the human-machine dialogue into a direct person-to-person dialogue, improving the vividness and authenticity of the translated speech, and enhancing the user experience.
[0111] The present invention also provides a terminal device, which includes a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to execute a speech translation method. The speech translation method includes the following steps: extracting the original voiceprint from the original voice information; performing translation processing on the original voice information to obtain translation information; and synthesizing the translation information and the original voiceprint into the final voice information. The speech translation method described in this embodiment is the speech translation method of the above embodiments of the present invention and is not described again here.
[0112] Those skilled in the art will understand that the present invention covers apparatuses for performing one or more of the operations described in this application. These apparatuses may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer. These apparatuses store computer programs that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (for example, computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, the computer-readable medium including but not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium in which a device (for example, a computer) stores or transmits information in a readable form.
[0113] Those skilled in the art will understand that each block of these structural diagrams and/or block diagrams and/or flow diagrams, and combinations of such blocks, can be implemented by computer program instructions. Those skilled in the art will understand that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the schemes specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by the processor of the computer or other programmable data processing apparatus.
[0114] Those skilled in the art will understand that the various operations, methods, and the steps, measures, and schemes in the processes that have been discussed in the present invention may be alternated, modified, combined, or deleted. Further, other steps, measures, and schemes in the various operations, methods, and processes that have been discussed in the present invention may also be alternated, modified, rearranged, decomposed, combined, or deleted. Further, steps, measures, and schemes in the prior art that correspond to the various operations, methods, and processes disclosed in the present invention may also be alternated, modified, rearranged, decomposed, combined, or deleted.
[0115] The above descriptions are only preferred embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Claims
[Claim 1] A speech translation method, comprising the following steps:
extracting an original voiceprint from original voice information;
performing translation processing on the original voice information to obtain translation information;
synthesizing the translation information and the original voiceprint into final voice information.
[Claim 2] The speech translation method according to claim 1, wherein the translation information is target voice information, and the step of synthesizing the translation information and the original voiceprint into final voice information comprises:
removing a preset voiceprint from the target voice information to obtain voiceprint-free target voice information;
synthesizing the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
[Claim 3] The speech translation method according to claim 2, wherein the step of removing the preset voiceprint from the target voice information comprises:
extracting the preset voiceprint from the target voice information;
performing a signal subtraction on the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information.
[Claim 4] The speech translation method according to claim 2, wherein the step of synthesizing the original voiceprint into the voiceprint-free target voice information to generate the final voice information comprises:
performing a signal addition on the original voiceprint and the voiceprint-free target voice information to obtain the final voice information.
[Claim 5] The speech translation method according to claim 1, wherein the step of performing translation processing on the original voice information to obtain translation information comprises:
sending the original voice information to a first server, so that the first server translates the original voice information into target voice information;
receiving the target voice information returned by the first server.
[Claim 6] The speech translation method according to claim 1, wherein the translation information is a target language string, and the step of synthesizing the translation information and the original voiceprint into final voice information comprises:
performing speech synthesis on the target language string using the original voiceprint to generate the final voice information.
[Claim 7] The speech translation method according to claim 6, wherein the step of performing translation processing on the original voice information to obtain translation information comprises:
sending the original voice information to a second server, so that the second server translates the original voice information into the target language string;
receiving the target language string returned by the second server.
[Claim 8] The speech translation method according to claim 6, wherein the step of performing translation processing on the original voice information to obtain translation information comprises:
performing speech recognition on the original voice information to generate an original language string;
translating the original language string into the target language string.
[Claim 9] The speech translation method according to claim 1, further comprising, after the step of synthesizing the translation information and the original voiceprint into final voice information:
outputting the final voice information.
[Claim 10] The speech translation method according to claim 1, further comprising, after the step of synthesizing the translation information and the original voiceprint into final voice information:
sending the final voice information out.
[Claim 11] A speech translation apparatus, comprising:
an extraction module, configured to extract an original voiceprint from original voice information;
a processing module, configured to perform translation processing on the original voice information to obtain translation information;
a synthesis module, configured to synthesize the translation information and the original voiceprint into final voice information.
[Claim 12] The speech translation apparatus according to claim 11, wherein the translation information is target voice information, and the synthesis module comprises:
a voiceprint removal unit, configured to remove a preset voiceprint from the target voice information to obtain voiceprint-free target voice information;
a voiceprint synthesis unit, configured to synthesize the original voiceprint into the voiceprint-free target voice information to generate the final voice information.
[Claim 13] The speech translation apparatus according to claim 12, wherein the voiceprint removal unit comprises:
a voiceprint extraction subunit, configured to extract the preset voiceprint from the target voice information;
a subtraction subunit, configured to perform a signal subtraction on the target voice information and the preset voiceprint to obtain the voiceprint-free target voice information.
[Claim 14] The speech translation apparatus according to claim 12, wherein the voiceprint synthesis unit is configured to perform a signal addition on the original voiceprint and the voiceprint-free target voice information to obtain the final voice information.
[Claim 15] The speech translation apparatus according to claim 12, wherein the processing module comprises:
a first sending unit, configured to send the original voice information to a first server, so that the first server translates the original voice information into the target voice information;
a first receiving unit, configured to receive the target voice information returned by the first server.
[Claim 16] The speech translation apparatus according to claim 11, wherein the translation information is a target language string, and the synthesis module is configured to perform speech synthesis on the target language string using the original voiceprint to generate the final voice information.
[Claim 17] The speech translation apparatus according to claim 16, wherein the processing module comprises:
a second sending unit, configured to send the original voice information to a second server, so that the second server translates the original voice information into the target language string;
a second receiving unit, configured to receive the target language string returned by the second server.
[Claim 18] The speech translation apparatus according to claim 16, wherein the processing module comprises:
a speech recognition unit, configured to perform speech recognition on the original voice information to generate an original language string;
a character translation unit, configured to translate the original language string into the target language string.
[Claim 19] The speech translation apparatus according to claim 11, wherein the apparatus further comprises an output module configured to output the final voice information.
[Claim 20] A terminal device, comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, wherein the application is configured to execute the speech translation method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/105915 WO2019071541A1 (en) | 2017-10-12 | 2017-10-12 | Voice translation method, apparatus, and terminal device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/105915 WO2019071541A1 (en) | 2017-10-12 | 2017-10-12 | Voice translation method, apparatus, and terminal device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019071541A1 true WO2019071541A1 (en) | 2019-04-18 |
Family
ID=66101102
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/105915 WO2019071541A1 (en) | 2017-10-12 | 2017-10-12 | Voice translation method, apparatus, and terminal device |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019071541A1 (en) |
2017
- 2017-10-12 WO PCT/CN2017/105915 patent/WO2019071541A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727904A (en) * | 2008-10-31 | 2010-06-09 | 国际商业机器公司 | Voice translation method and device |
CN105786801A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Speech translation method, communication method and related device |
US20170255616A1 (en) * | 2016-03-03 | 2017-09-07 | Electronics And Telecommunications Research Institute | Automatic interpretation system and method for generating synthetic sound having characteristics similar to those of original speaker's voice |
JP2017182394A (en) * | 2016-03-30 | 2017-10-05 | 株式会社リクルートライフスタイル | Voice translating device, voice translating method, and voice translating program |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112201224A (en) * | 2020-10-09 | 2021-01-08 | 北京分音塔科技有限公司 | Method, equipment and system for simultaneous translation of instant call |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8606249B1 | | Methods and systems for enhancing audio quality during teleconferencing |
JP2022137201A (en) | | Synthesis of speech from text in voice of target speaker using neural networks |
CN110049270A (en) | | Multi-person conference speech transcription method, apparatus, system, equipment and storage medium |
US20160048508A1 (en) | | Universal language translator |
WO2010000161A1 (en) | | Voice conversation method and apparatus based on instant communication system |
WO2016165590A1 (en) | | Speech translation method and device |
CN107749296A (en) | | Voice translation method and device |
WO2019075829A1 (en) | | Voice translation method and apparatus, and translation device |
CN110149805A (en) | | Double-directional speech translation system, double-directional speech interpretation method and program |
WO2016101571A1 (en) | | Voice translation method, communication method and related device |
EP4254979A1 (en) | | Active noise reduction method, device and system |
WO2019000515A1 (en) | | Voice call method and device |
WO2018209851A1 (en) | | Translation method and translation system |
WO2019169686A1 (en) | | Voice translation method and apparatus, and computer device |
CN113436609A (en) | | Voice conversion model and training method thereof, voice conversion method and system |
TW200304638A (en) | | Network-accessible speaker-dependent voice models of multiple persons |
Ding et al. | | Ultraspeech: Speech enhancement by interaction between ultrasound and speech |
CN110600045A (en) | | Sound conversion method and related product |
US20220157316A1 (en) | | Real-time voice converter |
US8768406B2 (en) | | Background sound removal for privacy and personalization use |
WO2019071541A1 (en) | 2019-04-18 | Voice translation method, apparatus, and terminal device |
TW202305783A (en) | | Personalized voice conversion system which includes a cloud server and a smart device and is capable of improving conversion effect without extra storage space and operation |
WO2019000619A1 (en) | | Translation method, translation device and translation system |
JP2022105982A (en) | | Automatic interpretation method based on speaker separation, user terminal providing automatic interpretation service based on speaker separation, and automatic interpretation service providing system based on speaker separation |
JP2002101203A (en) | | Speech processing system, speech processing method and storage medium storing the method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17928311; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 17928311; Country of ref document: EP; Kind code of ref document: A1 |