WO2020057102A1 - Speech translation method and translation apparatus - Google Patents

Speech translation method and translation apparatus (语音翻译方法及翻译装置)

Info

Publication number
WO2020057102A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
translation
language
processor
Prior art date
Application number
PCT/CN2019/081036
Other languages
English (en)
French (fr)
Inventor
张岩
熊涛
Original Assignee
深圳市合言信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市合言信息科技有限公司
Priority to CN201980001336.0A (CN110914828B)
Priority to JP2019563584A (JP2021503094A)
Priority to US16/470,560 (US20210343270A1)
Publication of WO2020057102A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/005 - Language recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/55 - Rule-based translation
    • G06F 40/56 - Natural language generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques

Definitions

  • the present application relates to the field of data processing technology, and in particular, to a speech translation method and a translation device.
  • Simultaneous interpretation, also known as "simultaneous translation" or "synchronous interpretation", refers to a mode of interpreting in which the interpreter renders the content to the audience continuously, without interrupting the speaker. Simultaneous interpreters provide real-time translation through dedicated equipment. This mode is suitable for large seminars and international conferences and is usually performed by two or three interpreters working in rotation. At present, simultaneous interpretation mainly relies on human translators who listen, then translate and speak. With the development of AI (Artificial Intelligence) technology, AI simultaneous interpretation will gradually replace manual translation. Although some conference translation machines are already on the market, each person needs a translation device during translation, which is costly; in addition, the speaker usually has to hold down a button to start speaking, after which an online translation service relays what the speaker said to the other participants. The operation is very tedious and requires considerable manual participation.
  • the embodiments of the present application provide a voice translation method and a translation device, which can be used to reduce translation costs and simplify translation operations.
  • An embodiment of the present application provides a voice translation method, which is applied to a translation device.
  • the translation device includes a processor, and a sound collection device and a sound playback device electrically connected to the processor.
  • the method includes:
  • when the translation task is triggered, sounds in the environment are collected by the sound collection device, and the processor detects, based on the collected sound, whether the user starts to speak;
  • the target voice is played by the sound playback device, and after the playback ends, the method returns to the step of detecting, by the processor based on the collected sound, whether the user starts to speak, until the translation task ends.
  • An embodiment of the present application further provides a translation device, including:
  • An endpoint detection module configured to collect sounds in the environment through the sound collection device when a translation task is triggered, and detect whether the user starts to speak based on the collected sounds;
  • a recognition module configured to enter a voice recognition state when the user is detected to start speaking, extract a user voice from the collected sound, determine a source language used by the user according to the extracted user voice, and determine, according to a preset language pair, a target language associated with the source language;
  • a tail point detection module configured to detect whether the user stops speaking for more than a preset delay duration, and when it is detected that the user stops speaking for more than the preset delay duration, exit the voice recognition state;
  • a translation and speech synthesis module configured to convert a user's speech extracted in the speech recognition state into a target speech of the target language; and
  • a playback module configured to play the target voice through the sound playback device, and after the playback ends, trigger the endpoint detection module to perform the step of detecting whether the user starts to speak based on the collected sound.
  • An aspect of the embodiments of the present application further provides a translation device, which includes a sound collection device, a sound playback device, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the sound collection device, the sound playback device, and the memory are electrically connected to the processor; when the processor runs the computer program, the following steps are performed:
  • when a translation task is triggered, sound in the environment is collected by the sound collection device, and whether the user starts to speak is detected based on the collected sound; when the user is detected to start speaking, a voice recognition state is entered, the user's voice is extracted from the collected sound, the source language used by the user is determined based on the extracted user voice, and the target language associated with the source language is determined according to a preset language pair; when it is detected that the user has stopped speaking for more than a preset delay duration, the voice recognition state is exited and the user voice extracted in the voice recognition state is converted into the target voice in the target language; the target voice is played through the sound playback device, and after the playback ends, the method returns to the step of detecting whether the user starts to speak based on the collected sound, until the translation task ends.
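  • For illustration only, the Python sketch below models this listen-recognize-translate-play loop as a simple state machine over audio frames; run_task and all callables passed to it (recognize, translate, synthesize, play) are hypothetical placeholders invented for this example, not components named in this application:

```python
PRESET_DELAY_FRAMES = 15  # ~1.5 s of silence at 100 ms per frame

def run_task(frames, recognize, translate, synthesize, play):
    """frames: iterable of (audio_chunk, has_voice) pairs from the sound collection device."""
    listening = False          # False: endpoint detection; True: voice recognition state
    utterance, silence = [], 0
    for chunk, has_voice in frames:
        if not listening:
            if has_voice:      # user detected to start speaking
                listening, utterance, silence = True, [chunk], 0
        elif has_voice:
            utterance.append(chunk)
            silence = 0
        else:
            silence += 1
            if silence > PRESET_DELAY_FRAMES:  # stopped speaking past the preset delay
                text, src, tgt = recognize(utterance)
                play(synthesize(translate(text, src, tgt), tgt))
                listening = False              # return to endpoint detection

# Toy run: five voiced frames, then enough silence to trigger translation and playback.
frames = [(b"...", True)] * 5 + [(b"", False)] * 20
run_task(frames,
         recognize=lambda u: ("你好", "zh", "en"),
         translate=lambda text, src, tgt: "hello",
         synthesize=lambda text, lang: "<%s audio: %s>" % (lang, text),
         play=print)
```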
  • During execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
  • FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a speech translation method provided by another embodiment of the present application.
  • FIG. 3 is a diagram illustrating a practical application example of the speech translation method provided by the embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a translation apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a translation apparatus according to another embodiment of the present application.
  • FIG. 6 is a schematic diagram of a hardware structure of a translation apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a hardware structure of a translation apparatus according to another embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application.
  • the speech translation method is applied to a translation device.
  • the translation device includes a processor, and a sound collection device and a sound playback device electrically connected to the processor.
  • the sound collection device may be, for example, a microphone or a pickup, and the sound playback device may be, for example, a speaker.
  • the speech translation method includes:
  • the translation task may be, for example but not limited to, triggered automatically after the translation device is started, triggered when the user is detected clicking a preset button for triggering a translation task, or triggered when a first preset voice of the user is detected.
  • the button may be a hardware button or a virtual button.
  • the first preset voice may be set according to a user-defined operation.
  • the first preset voice may be, for example, a phrase containing the semantics of “start translation” or another preset sound.
  • when the translation task is triggered, the sound in the environment is collected in real time by the sound collection device, and the processor analyzes the collected sound in real time to determine whether it contains a human voice; if it does, it is confirmed that the user has started to speak.
  • optionally, if the collected sound still contains no human voice after a preset detection duration, sound collection is stopped and a standby state is entered to reduce power consumption.
  • the translation device stores an association relationship between at least two languages included in a preset language pair.
  • This language pair can be used to determine the source and target languages.
  • When the user is detected to start speaking, the device enters a voice recognition state; the processor extracts the user's voice from the collected sound and performs speech recognition on it to determine the source language used by the user. According to the above association relationship, the other language(s) associated with the source language in the language pair are determined as the target language.
  • Optionally, a language-setting interactive interface is provided for the user. Before the user is detected to start speaking, in response to a language designation operation performed by the user on this interface, the processor configures, in the translation device, the at least two languages designated by the operation as the language pair used to determine the source and target languages.
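  • As a rough illustration of this configuration step, the sketch below keeps the language pair in a plain Python list and derives the target language(s) from a given source language; all names are invented for the example:

```python
language_pair = []  # e.g. ["en", "zh"] or ["en", "zh", "ru"]

def configure_language_pair(designated_languages):
    """Language designation operation: store at least two languages as the pair."""
    if len(designated_languages) < 2:
        raise ValueError("a language pair needs at least two languages")
    language_pair[:] = list(designated_languages)

def target_languages(source):
    """Every other language in the pair is a target language."""
    if source not in language_pair:
        raise LookupError(source + " is not in the configured pair")
    return [lang for lang in language_pair if lang != source]

configure_language_pair(["en", "zh", "ru"])
print(target_languages("en"))  # ['zh', 'ru']
```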
  • The processor analyzes in real time whether the human voice contained in the collected sound has disappeared. If it has, a timer is started; if the voice does not reappear before the preset delay duration elapses, it is confirmed that the user has stopped speaking, and the voice recognition state is exited. The processor then converts all user voice extracted in the voice recognition state into the target voice in the target language.
  • Step S105: play the target voice through the sound playback device, and return to step S102 after the playback ends, until the translation task ends.
  • the target voice is played by the sound playback device, and after its playback ends, the process returns to step S102, in which the processor detects, based on the collected sound, whether a user has started to speak, so that the words of the next speaker can be translated, and so on back and forth until the translation task ends.
  • the translation task may, for example but not limited to, end when the user is detected clicking a preset button for ending the translation task, or when a second preset voice of the user is detected.
  • the button may be a hardware button or a virtual button.
  • the second preset voice may be set according to a user-defined operation.
  • the second preset voice may be, for example, a phrase containing the semantics of “end translation” or another preset sound.
  • the sound collection can be paused during the playback of the target voice to avoid misjudgment of the user's voice and reduce power consumption.
  • During execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
  • FIG. 2 is a schematic flowchart of a speech translation method provided by another embodiment of the present application.
  • the speech translation method is applied to a translation device.
  • the translation device includes a processor, and a sound collection device and a sound playback device electrically connected to the processor.
  • the sound collection device may be, for example, a microphone or a pickup, and the sound playback device may be, for example, a speaker.
  • the speech translation method includes:
  • the translation task may be, for example but not limited to, triggered automatically after the translation device is started, triggered when the user is detected clicking a preset button for triggering a translation task, or triggered when a first preset voice of the user is detected.
  • the button may be a hardware button or a virtual button.
  • the first preset voice may be set according to a user-defined operation.
  • the first preset voice may be, for example, a phrase containing the semantics of “start translation” or another preset sound.
  • the sound in the environment is collected in real time by a sound acquisition device, and the collected sound is analyzed in real time by the processor to determine whether a human voice is included in the sound. If the human voice is included, it is confirmed that the user starts to speak.
  • optionally, to ensure translation quality, the processor periodically detects, based on the collected sound, whether the noise in the environment exceeds a preset noise level; if it does, prompt information is output.
  • the prompt information is used to inform the user that the translation environment is not good.
  • the prompt information may be output in a voice and / or text manner.
  • noise detection may be performed only before entering the speech recognition state.
  • optionally, to avoid translation errors, when the translation task is triggered, the sound in the environment is collected in real time by the sound collection device, and the processor analyzes in real time whether the collected sound contains a human voice and whether the volume of that human voice exceeds a preset decibel level; if the sound contains a human voice whose volume exceeds the preset decibel level, it is confirmed that the user has started to speak.
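  • A minimal sketch of such a dual check (a human voice is present and its volume exceeds a preset decibel level) might look as follows, assuming 16-bit PCM samples and an arbitrary threshold; a real device would use its own calibrated voice detector:

```python
import math

PRESET_DB = -35.0  # assumed threshold in dBFS, invented for this example

def rms_dbfs(samples):
    """Root-mean-square level of 16-bit PCM samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768) if rms else float("-inf")

def user_started_speaking(samples, voiced):
    """Confirm speech onset only if a human voice is detected AND it is loud enough."""
    return voiced and rms_dbfs(samples) > PRESET_DB

print(user_started_speaking([10, -12, 8, -9], voiced=True))              # False: too quiet
print(user_started_speaking([9000, -12000, 15000, -8000], voiced=True))  # True
```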
  • the translation device further includes a memory electrically connected to the processor.
  • the memory stores an association relationship between at least two languages included in a preset language pair. This language pair can be used to determine the source and target languages.
  • When the user is detected to start speaking, the device enters a voice recognition state; the processor extracts the user's voice from the collected sound and performs speech recognition on it to determine the source language used by the user. According to the above association relationship, the other language(s) associated with the source language in the language pair are determined as the target language. For example, if the language pair is English and Chinese and the source language is Chinese, the target language is English, and the user's voice needs to be converted into English speech; if the language pair is English-Chinese-Russian and the source language is English, the target languages are determined to be Chinese and Russian, that is, the user's voice needs to be converted into both Chinese speech and Russian speech.
  • Optionally, a language-setting interactive interface is provided for the user. Before the user is detected to start speaking, in response to a language designation operation performed by the user on this interface, the processor configures, in the translation device, the at least two languages designated by the operation as the language pair used to determine the source and target languages.
  • the memory also stores identification information of each language in the language pair, and the identification information may be generated by the processor for each language in the language pair when setting the language pair.
  • The above step of determining the source language used by the user based on the extracted user voice specifically includes: extracting, by the processor, the user's voiceprint features from the user voice, and determining whether identification information of the language corresponding to those voiceprint features is stored in the memory; if the identification information is stored in the memory, the language corresponding to it is determined as the source language; if it is not, the user's pronunciation features are extracted from the user voice, the source language is determined from the pronunciation features, and the correspondence between the user's voiceprint features and the identification information of the source language is stored in the memory for language recognition in the next translation.
  • the user's pronunciation characteristics can be matched with the pronunciation characteristics of each language in the language pair, and the language with the highest matching degree can be determined as the source language.
  • the above-mentioned pronunciation feature matching may be performed locally in the translation device, or may be implemented through a server.
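  • The caching logic described above can be sketched as follows, where voiceprint() and match_pronunciation() are placeholder callables standing in for real voiceprint-embedding and pronunciation-matching components:

```python
memory = {}  # voiceprint key -> stored language identification information

def identify_source_language(user_voice, language_pair, voiceprint, match_pronunciation):
    key = voiceprint(user_voice)
    if key in memory:        # known speaker: reuse the stored language id
        return memory[key]
    # Unknown speaker: fall back to pronunciation-feature matching, pick the
    # best-scoring language, and cache the result for the next translation.
    scores = {lang: match_pronunciation(user_voice, lang) for lang in language_pair}
    source = max(scores, key=scores.get)
    memory[key] = source
    return source
```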
  • the language of the first text is the source language.
  • the translation apparatus further includes a display screen electrically connected to the processor.
  • The processor analyzes in real time whether the human voice contained in the collected sound has disappeared. If it has, a timer is started; if the voice does not reappear before the preset delay duration elapses, it is confirmed that the user has stopped speaking, and the voice recognition state is exited. The processor then translates the first text in the source language, corresponding to the user voice extracted in the voice recognition state, into the second text in the target language, and displays the second text on the display screen. At the same time, a TTS (Text To Speech) speech synthesis system is used to convert the second text into the target voice in the target language.
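  • The first-text to second-text to target-voice path can be sketched as below; asr(), mt(), and tts() are hypothetical stand-ins for the speech recognition, machine translation, and speech synthesis engines:

```python
def speak_translation(user_voice, source, target, asr, mt, tts, display, play):
    first_text = asr(user_voice, source)          # first text, in the source language
    display(first_text)
    second_text = mt(first_text, source, target)  # second text, in the target language
    display(second_text)
    play(tts(second_text, target))                # TTS turns second text into target voice
```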
  • optionally, before exiting the voice recognition state upon detecting that the user has stopped speaking for more than the preset delay duration, the voice recognition state may be exited in response to a triggered translation instruction.
  • optionally, the translation device further includes a motion sensor electrically connected to the processor. In the voice recognition state, when the motion sensor detects that the motion amplitude of the translation device exceeds a preset amplitude, or the translation device is bumped, a translation instruction is triggered.
  • Since the initial value of the preset delay duration is a default and each speaker's patience differs, allowing the user to actively trigger the translation instruction by passing or bumping the translation device, and dynamically adjusting the preset delay duration according to the trigger time, improves the flexibility of the stop-speaking judgment and makes the timing of translation better match the user's needs.
  • The step of adjusting the preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered specifically includes: determining whether a preset delay duration corresponding to the voiceprint features of the user who stopped speaking is stored in the memory; if it is, adjusting that user's preset delay duration according to the time difference; if it is not, that is, only the default delay duration for triggering exit from the voice recognition state is configured, setting the time difference as the preset delay duration corresponding to that user's voiceprint features. In this way, different preset delay durations can be set for different speakers, improving the intelligence of the translation device.
  • optionally, adjusting the preset delay duration according to the time difference includes setting the preset delay duration to the value of the time difference, or taking the average of the time difference and the current preset delay duration as the new preset delay duration.
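  • A possible sketch of this per-speaker adjustment keeps one delay per voiceprint key (the key is assumed to come from the recognition module; all names are invented):

```python
DEFAULT_DELAY = 1.5  # seconds; default delay duration before exiting recognition
delays = {}          # voiceprint key -> personalised preset delay duration

def on_translation_instruction(voiceprint_key, stop_time, trigger_time):
    """Called when the user triggers translation early, e.g. by bumping the device."""
    diff = trigger_time - stop_time
    if voiceprint_key in delays:
        # Adjust the stored value, here by averaging it with the new difference.
        delays[voiceprint_key] = (delays[voiceprint_key] + diff) / 2
    else:
        delays[voiceprint_key] = diff  # first observation becomes this user's delay

def preset_delay(voiceprint_key):
    return delays.get(voiceprint_key, DEFAULT_DELAY)
```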
  • Step S207: play the target voice through the sound playback device, and return to step S202 after the playback ends, until the translation task ends.
  • the target voice is played by the sound playback device, and after its playback ends, the process returns to step S202, in which the processor detects, based on the collected sound, whether a user has started to speak, so that the words of the next speaker can be translated, and so on back and forth until the translation task ends.
  • the translation task may, for example but not limited to, end when the user is detected clicking a preset button for ending the translation task, or when a second preset voice of the user is detected.
  • the button may be a hardware button or a virtual button.
  • the second preset voice may be set according to a user-defined operation.
  • the second preset voice may be, for example, a phrase containing the semantics of “end translation” or another preset sound.
  • the sound collection can be paused during the playback of the target voice to avoid misjudgment of the user's voice and reduce power consumption.
  • all the first text and the second text obtained during the execution of the translation task may be stored in the memory as a conversation record, so as to facilitate subsequent query by the user.
  • the processor automatically clears the conversation records that exceed the storage period periodically or after each power-on to improve the utilization of storage space.
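  • One way to sketch this retention policy, with an assumed storage period of one week (the application itself fixes no period):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # assumed storage period, invented for the example
conversation_log = []              # (timestamp, first_text, second_text) records

def append_record(first_text, second_text):
    conversation_log.append((time.time(), first_text, second_text))

def purge_expired(now=None):
    """Run periodically or at power-on: drop records older than the storage period."""
    now = time.time() if now is None else now
    conversation_log[:] = [r for r in conversation_log if now - r[0] <= RETENTION_SECONDS]
```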
  • To further illustrate the speech translation method, with reference to FIG. 3, suppose user A speaks language A and user B speaks language B. Translation proceeds through the following steps:
  • 1. User A speaks, producing voice A;
  • 2. The translation device automatically detects, through the endpoint detection module, that user A has started to speak;
  • 3. Through the speech recognition module and the language judgment module, the device recognizes what user A says while judging the language user A uses;
  • 4. The language judgment module detects that user A speaks language A, and the first text corresponding to the currently recognized voice A is displayed on the display screen of the translation device;
  • 5. When user A stops talking, the translation device automatically judges, through the tail point detection module, that the user has finished speaking;
  • 6. The translation device then enters the translation stage, and the first text in language A is converted into the second text in language B through the translation module;
  • 7. After obtaining the translated text in language B, the translation device generates the corresponding target voice through the TTS speech synthesis module and broadcasts it automatically.
  • Thereafter, the translation device again automatically detects, through the endpoint detection module, that user B has started to speak, and performs the above steps 3-7 for user B, translating user B's speech in language B into target speech in language A and broadcasting it automatically, and so on back and forth until the conversation between users A and B ends.
  • Throughout the translation process, the user needs no additional operation on the device; the translation device itself completes the series of processes of listening, recognizing, ending, translating, and broadcasting.
  • optionally, to improve the speed of language recognition, the user's voiceprint features can be collected in advance on first use, and the collected voiceprint features can be bound to the language the user uses. On subsequent use, the language used by the user is quickly confirmed directly from the user's voiceprint features.
  • the translation device provides the user with an interface for binding the voiceprint feature with the corresponding language.
  • before the translation task is triggered, in response to a binding instruction triggered by the user through this interface, the user's target voice is collected through the sound collection device and subjected to speech recognition to obtain the user's voiceprint features and the language the user uses, and the identified voiceprint features are bound to that language in the translation device.
  • the language bound to the voiceprint feature may also be the language pointed to by the binding instruction.
  • The step of, when the user is detected to start speaking, entering the voice recognition state, extracting the user voice from the collected sound by the processor, and determining the source language used by the user based on the extracted user voice then specifically includes: when the user is detected to start speaking, entering the voice recognition state, extracting the user voice from the collected sound by the processor, performing voiceprint recognition on the extracted user voice to obtain the user's voiceprint features and the language bound to those features, and using that language as the user's source language.
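  • The bind-then-lookup flow might be sketched as follows; the voiceprint() embedding here is a toy placeholder (the first byte of the sample), used only to make the demo runnable:

```python
bindings = {}  # voiceprint key -> bound language

def bind(voice_sample, language, voiceprint):
    """Binding instruction: associate this speaker's voiceprint with a language."""
    bindings[voiceprint(voice_sample)] = language

def source_language(user_voice, voiceprint):
    """Voiceprint recognition replaces language identification for bound speakers."""
    return bindings.get(voiceprint(user_voice))  # None -> fall back to language id

vp = lambda sample: sample[0]    # toy embedding for the demo
bind(b"A-sample", "lang-A", vp)
bind(b"B-sample", "lang-B", vp)
print(source_language(b"A-utterance", vp))  # lang-A
```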
  • For example, before translation, user A and user B each bind their own voiceprint features to the language they use in the translation device, through the interface provided by the device. For instance, user A and user B in turn press the language setting button of the translation device to trigger a binding instruction and, following the prompt information output by the device, record a segment of speech into it.
  • the prompt information can be output by voice or text.
  • the language setting button may be a physical button or a virtual button.
  • The translation device performs speech recognition on the recorded voices of user A and user B, obtains user A's voiceprint features and the corresponding language A, associates them, and stores the association information in the memory, thereby binding user A's voiceprint features and the corresponding language A in the translation device.
  • Similarly, user B's voiceprint features and the corresponding language B are obtained and associated, and the association information is stored in the memory, thereby binding user B's voiceprint features and the corresponding language B in the translation device.
  • After the translation task is triggered, when user A is detected to start speaking, voiceprint recognition together with the stored association information confirms the language user A uses, and language identification is no longer required at this point.
  • Compared with language identification, voiceprint recognition requires less computation and occupies fewer system resources, so it can increase recognition speed and thus translation speed.
  • During execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
  • FIG. 4 is a schematic structural diagram of a translation apparatus according to an embodiment of the present application.
  • the translation device can be used to implement the speech translation method shown in FIG. 1.
  • the translation device includes: an endpoint detection module 401, a recognition module 402, a tail point detection module 403, a translation and speech synthesis module 404, and a playback module 405.
  • the endpoint detection module 401 is configured to collect sounds in the environment through a sound collection device when a translation task is triggered, and detect whether the user starts to speak according to the collected sounds.
  • the recognition module 402 is configured to, when detecting that the user starts to speak, enter a voice recognition state, extract a user voice from the collected sound, determine a source language used by the user based on the extracted user voice, and determine, according to a preset language pair, the target language associated with the source language.
  • the tail point detection module 403 is configured to detect whether the user stops speaking for more than a preset delay duration, and when it is detected that the user stops speaking for more than the preset delay duration, exit the voice recognition state.
  • the translation and speech synthesis module 404 is configured to convert the user speech extracted in the speech recognition state into the target speech of the target language.
  • the playback module 405 is configured to play the target voice through a sound playback device and trigger the endpoint detection module to perform the step of detecting whether the user starts to speak based on the collected sound after the playback ends.
  • the translation apparatus further includes:
  • the noise estimation module 501 is configured to detect whether the noise in the environment is greater than a preset noise according to the collected sound, and if the noise is greater than the preset noise, output a prompt message, which is used to prompt the user that the translation environment is not good.
  • the translation device further includes:
  • a configuration module 502 is configured to, in response to the user's language designation operation, configure at least two languages pointed to by the language designation operation as the language pair.
  • the recognition module 402 is further configured to convert the extracted user voice into a corresponding first text.
  • the translation device further includes:
  • the display module 503 is configured to display the first text on the display screen.
  • the translation and speech synthesis module 404 is further configured to translate the first text into a second text in the target language, and convert the second text into the target speech through a speech synthesis system.
  • the display module 503 is further configured to display the second text on the display screen.
  • the translation device further includes:
  • the processing module 504 is configured to exit the speech recognition state in response to a triggered translation instruction.
  • the configuration module 502 is further configured to adjust the preset delay time according to a time difference between a time when the user stops detecting the speech and a time when the translation instruction is triggered.
  • the processing module 504 is further configured to trigger the translation instruction when, in the voice recognition state, the motion sensor detects that the motion amplitude of the translation device exceeds a preset amplitude, or the translation device is bumped.
  • the recognition module 402 is further configured to extract the user's voiceprint features from the user voice and determine whether identification information of the language corresponding to those voiceprint features is stored in the memory; if the identification information is stored in the memory, the language corresponding to it is determined as the source language; if it is not, the user's pronunciation features are extracted from the user voice, the source language is determined from the pronunciation features, and the correspondence between the user's voiceprint features and the identification information of the source language is stored in the memory.
  • the configuration module 502 is further configured to determine whether a preset delay duration corresponding to the voiceprint features of the user who stopped speaking is stored in the memory; if it is, to adjust that preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered; if it is not, to set the time difference as the corresponding preset delay duration.
  • processing module 504 is further configured to store all the first text and the second text obtained during the execution of the translation task in a memory as a conversation record, so as to facilitate subsequent query by the user.
  • the processing module 504 is further configured to automatically clean up the conversation records exceeding the storage period periodically or after each booting, so as to improve the utilization of the storage space.
  • the recognition module 402 is further configured to collect a target voice of the user through the sound collection device in response to a binding instruction triggered by the user, and to perform speech recognition on the target voice to obtain the user's voiceprint features and the language the user uses.
  • the configuration module 502 is further configured to bind the identified voiceprint characteristics of the user and the language used in the translation device.
  • the recognition module 402 is further configured to, when detecting that the user starts to speak, enter the voice recognition state, extract the user voice from the collected sound, and perform voiceprint recognition on the extracted user voice to obtain the user's voiceprint features and the language bound to those features, using that language as the user's source language.
  • During execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
  • FIG. 6 is a schematic diagram of a hardware structure of a translation apparatus according to an embodiment of the present application.
  • the translation device described in this embodiment includes a sound collection device 601, a sound playback device 602, a memory 603, a processor 604, and a computer program stored in the memory 603 and executable on the processor 604.
  • the sound collection device 601, the sound playback device 602, and the memory are electrically connected to the processor 604.
  • the memory 603 may be a high-speed random access memory (RAM), or a non-volatile memory such as disk storage.
  • the memory 603 is configured to store a set of executable program code.
  • sounds in the environment are collected by the sound collection device 601, and whether the user starts speaking is detected based on the collected sounds.
  • when the user is detected to start speaking, the voice recognition state is entered, the user's voice is extracted from the collected sound, the source language used by the user is determined based on the extracted user voice, and the target language associated with the source language is determined according to the preset language pair.
  • when it is detected that the user has stopped speaking for more than the preset delay duration, the voice recognition state is exited, and the user voice extracted in the voice recognition state is converted into the target voice in the target language.
  • the target voice is played by the sound playback device 602, and after the playback ends, the step of detecting whether the user starts to speak based on the collected sound is returned until the translation task ends.
  • in another implementation of this embodiment, the translation apparatus further includes at least one input device 701, at least one output device 702, and at least one motion sensor 703, all electrically connected to the processor 604:
  • the input device 701 may be a camera, a touch panel, a physical button, or the like.
  • the output device 702 may be a display screen.
  • the motion sensor 703 may specifically be a gravity sensor, a gyroscope, an acceleration sensor, or the like.
  • the translation device further includes a signal transceiving device for receiving and sending a wireless network signal.
  • During execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
  • the disclosed apparatus and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the modules is only a logical function division; in actual implementation there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, which may be electrical, mechanical or other forms.
  • the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or software functional modules.
  • When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence the part that contributes beyond the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a readable storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing readable storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A speech translation method and a translation apparatus, the method comprising: when a translation task is triggered, collecting sound in the environment through a sound collection device, and detecting, based on the collected sound, whether a user starts to speak; when the user is detected to start speaking, entering a voice recognition state, extracting user voice from the collected sound, judging the source language used by the user based on the extracted user voice, and determining, according to a preset language pair, a target language associated with the source language; when it is detected that the user has stopped speaking for more than a preset delay duration, exiting the voice recognition state and converting the user voice extracted in the voice recognition state into target voice in the target language; playing the target voice through a sound playback device, and after the playback ends, returning to the step of detecting, by the processor based on the collected sound, whether the user starts to speak, until the translation task ends. The speech translation method and translation apparatus can reduce translation costs and simplify translation operations.

Description

Speech translation method and translation apparatus

Technical Field

The present application relates to the field of data processing technology, and in particular to a speech translation method and a translation apparatus.

Background

Simultaneous interpretation, also known as "simultaneous translation" or "synchronous interpretation", refers to a mode of interpreting in which the interpreter renders the content to the audience continuously, without interrupting the speaker. Simultaneous interpreters provide real-time translation through dedicated equipment. This mode is suitable for large seminars and international conferences and is usually performed by two or three interpreters working in rotation. At present, simultaneous interpretation mainly relies on human translators who listen, then translate and speak. With the development of AI (Artificial Intelligence) technology, AI simultaneous interpretation will gradually replace manual translation. Although some conference translation machines are already on the market, each person needs a translation device during translation, which is costly; in addition, the speaker usually has to hold down a button to start speaking, after which an online translation service relays what the speaker said to the other participants. The operation is very tedious and requires considerable manual participation.
Summary

Embodiments of the present application provide a speech translation method and a translation apparatus, which can be used to reduce translation costs and simplify translation operations.

In one aspect, an embodiment of the present application provides a speech translation method applied to a translation apparatus, where the translation apparatus includes a processor, and a sound collection device and a sound playback device electrically connected to the processor, and the method includes:

when a translation task is triggered, collecting sound in the environment through the sound collection device, and detecting, by the processor based on the collected sound, whether a user starts to speak;

when the user is detected to start speaking, entering a voice recognition state, extracting user voice from the collected sound by the processor, judging a source language used by the user based on the extracted user voice, and determining, according to a preset language pair, a target language associated with the source language;

when it is detected that the user has stopped speaking for more than a preset delay duration, exiting the voice recognition state, and converting, by the processor, the user voice extracted in the voice recognition state into target voice in the target language;

playing the target voice through the sound playback device, and after the playback ends, returning to the step of detecting, by the processor based on the collected sound, whether the user starts to speak, until the translation task ends.

In one aspect, an embodiment of the present application further provides a translation apparatus, including:

an endpoint detection module configured to, when a translation task is triggered, collect sound in the environment through the sound collection device and detect, based on the collected sound, whether a user starts to speak;

a recognition module configured to, when the user is detected to start speaking, enter a voice recognition state, extract user voice from the collected sound, judge a source language used by the user based on the extracted user voice, and determine, according to a preset language pair, a target language associated with the source language;

a tail point detection module configured to detect whether the user has stopped speaking for more than a preset delay duration, and to exit the voice recognition state when it is detected that the user has stopped speaking for more than the preset delay duration;

a translation and speech synthesis module configured to convert the user voice extracted in the voice recognition state into target voice in the target language; and

a playback module configured to play the target voice through the sound playback device and, after the playback ends, trigger the endpoint detection module to perform the step of detecting, based on the collected sound, whether the user starts to speak.

In one aspect, an embodiment of the present application further provides a translation apparatus, including: a sound collection device, a sound playback device, a memory, a processor, and a computer program stored in the memory and executable on the processor, where the sound collection device, the sound playback device, and the memory are electrically connected to the processor, and the processor, when running the computer program, performs the following steps:

when a translation task is triggered, collecting sound in the environment through the sound collection device, and detecting, based on the collected sound, whether a user starts to speak; when the user is detected to start speaking, entering a voice recognition state, extracting user voice from the collected sound, judging a source language used by the user based on the extracted user voice, and determining, according to a preset language pair, a target language associated with the source language; when it is detected that the user has stopped speaking for more than a preset delay duration, exiting the voice recognition state, and converting the user voice extracted in the voice recognition state into target voice in the target language; playing the target voice through the sound playback device, and after the playback ends, returning to the step of detecting, based on the collected sound, whether the user starts to speak, until the translation task ends.

In the above embodiments, during execution of a translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a speech translation method provided by another embodiment of the present application;

FIG. 3 is a diagram of a practical application example of the speech translation method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a translation apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a translation apparatus provided by another embodiment of the present application;

FIG. 6 is a schematic diagram of a hardware structure of a translation apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a hardware structure of a translation apparatus provided by another embodiment of the present application.
Detailed Description

To make the objectives, features, and advantages of the present application more apparent and understandable, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a speech translation method provided by an embodiment of the present application. The speech translation method is applied to a translation apparatus, which includes a processor, and a sound collection device and a sound playback device electrically connected to the processor. The sound collection device may be, for example, a microphone or a pickup, and the sound playback device may be, for example, a speaker. As shown in FIG. 1, the speech translation method includes:

S101: when a translation task is triggered, collect sound in the environment through the sound collection device;

S102: detect, by the processor based on the collected sound, whether the user starts to speak;

The translation task may, for example but not limited to, be triggered automatically after the translation apparatus starts, or be triggered when the user is detected clicking a preset button for triggering a translation task, or be triggered when a first preset voice of the user is detected. The button may be a hardware button or a virtual button. The first preset voice may be set through a user-defined operation and may be, for example, a phrase containing the semantics of "start translation" or another preset sound.

When the translation task is triggered, sound in the environment is collected in real time by the sound collection device, and the processor analyzes in real time whether the collected sound contains a human voice; if it does, it is confirmed that the user has started to speak.

Optionally, if the collected sound still contains no human voice after a preset detection duration, sound collection is stopped and a standby state is entered to reduce power consumption.

S103: when the user is detected to start speaking, enter a voice recognition state, extract user voice from the collected sound by the processor, judge the source language used by the user based on the extracted user voice, and determine, according to a preset language pair, the target language associated with the source language;

The translation apparatus stores an association relationship between at least two languages contained in a preset language pair. The language pair can be used to determine the source language and the target language. When the user is detected to start speaking, the voice recognition state is entered; the processor extracts the user voice from the collected sound and performs speech recognition on the extracted user voice to judge the source language used by the user. According to the above association relationship, the other language(s) associated with the source language in the language pair are determined as the target language.

Optionally, in another implementation of the present application, a language-setting interactive interface is provided for the user. Before the user is detected to start speaking, in response to a language designation operation performed by the user on the language-setting interactive interface, the processor configures, in the translation apparatus, the at least two languages designated by the operation as the language pair used to determine the source language and the target language.

S104: when it is detected that the user has stopped speaking for more than a preset delay duration, exit the voice recognition state, and convert, by the processor, the user voice extracted in the voice recognition state into target voice in the target language;

The processor analyzes in real time whether the human voice contained in the collected sound has disappeared. If it has, a timer is started; if the voice does not reappear before the preset delay duration elapses, it is confirmed that the user has stopped speaking, and the voice recognition state is exited. The processor then converts all user voice extracted in the voice recognition state into target voice in the target language.

S105: play the target voice through the sound playback device, and return to step S102 after the playback ends, until the translation task ends.

The target voice is played through the sound playback device, and after the playback of the target voice ends, the method returns to step S102: the processor detects, based on the collected sound, whether the user starts to speak, so as to translate what another speaker says, and so on back and forth until the translation task ends.

The translation task may, for example but not limited to, end when the user is detected clicking a preset button for ending the translation task, or when a second preset voice of the user is detected. The button may be a hardware button or a virtual button. The second preset voice may be set through a user-defined operation and may be, for example, a phrase containing the semantics of "end translation" or another sound.

Optionally, sound collection may be paused while the target voice is being played, to avoid misjudging speech and to reduce power consumption.

In this embodiment, during execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a speech translation method provided by another embodiment of the present application. The speech translation method is applied to a translation apparatus, which includes a processor, and a sound collection device and a sound playback device electrically connected to the processor. The sound collection device may be, for example, a microphone or a pickup, and the sound playback device may be, for example, a speaker. As shown in FIG. 2, the speech translation method includes:

S201: when a translation task is triggered, collect sound in the environment through the sound collection device;

S202: detect, by the processor based on the collected sound, whether the user starts to speak;

The translation task may, for example but not limited to, be triggered automatically after the translation apparatus starts, or be triggered when the user is detected clicking a preset button for triggering a translation task, or be triggered when a first preset voice of the user is detected. The button may be a hardware button or a virtual button. The first preset voice may be set through a user-defined operation and may be, for example, a phrase containing the semantics of "start translation" or another sound.

When the translation task is triggered, sound in the environment is collected in real time by the sound collection device, and the processor analyzes in real time whether the collected sound contains a human voice; if it does, it is confirmed that the user has started to speak.

Optionally, in another implementation of the present application, to ensure translation quality, the processor periodically detects, based on the collected sound, whether the noise in the environment exceeds a preset noise level; if it does, prompt information is output. The prompt information is used to remind the user that the translation environment is poor. The prompt information may be output as voice and/or text. Optionally, noise detection may be performed only before entering the voice recognition state.

Optionally, in another implementation of the present application, to avoid translation errors, when the translation task is triggered, sound in the environment is collected in real time by the sound collection device, and the processor analyzes in real time whether the collected sound contains a human voice and whether the volume of the contained human voice exceeds a preset decibel level; if the sound contains a human voice whose volume exceeds the preset decibel level, it is confirmed that the user has started to speak.

S203: when the user is detected to start speaking, enter a voice recognition state, extract user voice from the collected sound by the processor, judge the source language used by the user based on the extracted user voice, and determine, according to a preset language pair, the target language associated with the source language;

The translation apparatus further includes a memory electrically connected to the processor. The memory stores an association relationship between at least two languages contained in a preset language pair. The language pair can be used to determine the source language and the target language. When the user is detected to start speaking, the voice recognition state is entered; the processor extracts the user voice from the collected sound and performs speech recognition on it to judge the source language used by the user. According to the above association relationship, the other language(s) associated with the source language in the language pair are determined as the target language. For example, if the language pair is English and Chinese and the source language is Chinese, the target language is English, and the user's voice needs to be converted into English speech; if the language pair is English-Chinese-Russian and the source language is English, the target languages are determined to be Chinese and Russian, that is, the user's voice needs to be converted into both Chinese speech and Russian speech.

Optionally, in another implementation of the present application, a language-setting interactive interface is provided for the user. Before the user is detected to start speaking, in response to a language designation operation performed by the user on the language-setting interactive interface, the processor configures, in the translation apparatus, the at least two languages designated by the operation as the language pair used to determine the source language and the target language.

Optionally, in another implementation of the present application, the memory further stores identification information of each language in the language pair; the identification information may be generated by the processor for each language when the language pair is set. The above step of judging the source language used by the user based on the extracted user voice specifically includes: extracting, by the processor, the user's voiceprint features from the user voice, and judging whether identification information of the language corresponding to the voiceprint features is stored in the memory; if the identification information is stored in the memory, determining the language corresponding to the identification information as the source language; if it is not, extracting the user's pronunciation features from the user voice, determining the source language according to the pronunciation features, and storing the correspondence between the user's voiceprint features and the identification information of the source language in the memory for language recognition in the next translation.

Specifically, the user's pronunciation features may be matched against the pronunciation features of each language in the language pair, and the language with the highest matching degree may be determined as the source language. The above pronunciation feature matching may be performed locally in the translation apparatus or implemented through a server.

In this way, since pronunciation feature comparison consumes more system resources, automatically recording the correspondence between the user's voiceprint features and the identification information of the source language, and then determining the source language from the user's voiceprint features and this correspondence, improves the efficiency of language recognition.
S204: convert the extracted user voice into corresponding first text, and display the first text on the display screen;

The language of the first text is the source language.

S205: when it is detected that the user has stopped speaking for more than the preset delay duration, exit the voice recognition state, translate, by the processor, the first text into second text in the target language, and display the second text on the display screen;

S206: convert the second text into the target voice through a speech synthesis system;

Specifically, the translation apparatus further includes a display screen electrically connected to the processor. The processor analyzes in real time whether the human voice contained in the collected sound has disappeared. If it has, a timer is started; if the voice does not reappear before the preset delay duration elapses, it is confirmed that the user has stopped speaking, and the voice recognition state is exited. The processor then translates the first text in the source language, corresponding to the user voice extracted in the voice recognition state, into the second text in the target language, and displays the second text on the display screen. At the same time, a TTS (Text To Speech) speech synthesis system is used to convert the second text into the target voice in the target language.

Optionally, in another implementation of the present application, before the voice recognition state is exited upon detecting that the user has stopped speaking for more than the preset delay duration, the voice recognition state is exited in response to a triggered translation instruction. The preset delay duration is adjusted according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered; for example, the value of the time difference may be set as the value of the preset delay duration.

Optionally, in another implementation of the present application, the translation apparatus further includes a motion sensor electrically connected to the processor. In the voice recognition state, when the motion sensor detects that the motion amplitude of the translation apparatus exceeds a preset amplitude, or the translation apparatus is bumped, the translation instruction is triggered.

Since the initial value of the preset delay duration is a default and each speaker's patience differs, allowing the user to actively trigger the translation instruction by passing or bumping the translation apparatus, and dynamically adjusting the preset delay duration according to the trigger time, improves the flexibility of the stop-speaking judgment and makes the timing of translation better match the user's needs.

Optionally, in another implementation of the present application, the step of adjusting the preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered specifically includes: judging whether a preset delay duration corresponding to the voiceprint features of the user who stopped speaking is stored in the memory; if it is, adjusting that user's preset delay duration according to the time difference; if it is not, that is, only the default delay duration for triggering exit from the voice recognition state is configured, setting the time difference as the preset delay duration corresponding to that user's voiceprint features. Through the above steps, different preset delay durations can be set for different speakers, which improves the intelligence of the translation apparatus.

Optionally, adjusting the preset delay duration according to the time difference includes setting the preset delay duration to the value of the time difference, or taking the average of the time difference and the current preset delay duration as the new preset delay duration.

S207: play the target voice through the sound playback device, and return to step S202 after the playback ends, until the translation task ends.

The target voice is played through the sound playback device, and after the playback of the target voice ends, the method returns to step S202: the processor detects, based on the collected sound, whether the user starts to speak, so as to translate what another speaker says, and so on back and forth until the translation task ends.

The translation task may, for example but not limited to, end when the user is detected clicking a preset button for ending the translation task, or when a second preset voice of the user is detected. The button may be a hardware button or a virtual button. The second preset voice may be set through a user-defined operation and may be, for example, a phrase containing the semantics of "end translation" or another sound.

Optionally, sound collection may be paused while the target voice is being played, to avoid misjudging speech and to reduce power consumption.

Optionally, in another implementation of the present application, all first text and second text obtained during execution of the translation task may be stored in the memory as a conversation record for the user's later reference. Meanwhile, the processor automatically clears conversation records that exceed the storage period, periodically or after each power-on, to improve the utilization of storage space.
To further illustrate the speech translation method provided by this embodiment, with reference to FIG. 3, suppose for example that user A and user B are from different countries, user A uses language A, and user B uses language B. Translation can be completed through the following steps:

1. User A speaks, producing voice A;

2. The translation apparatus automatically detects, through the endpoint detection module, that user A has started to speak;

3. Through the speech recognition module and the language judgment module, the apparatus recognizes what user A says while judging the language (i.e., language type) used by user A;

4. The language judgment module detects that user A speaks language A, and the first text corresponding to the currently recognized voice A is displayed on the display screen of the translation apparatus;

5. When user A stops talking, the translation apparatus automatically judges, through the tail point detection module, that the user has finished speaking;

6. The translation apparatus then enters the translation stage and converts the first text in language A into the second text in language B through the translation module;

7. After obtaining the translated text in language B, the translation apparatus generates the corresponding target voice through the TTS speech synthesis module and broadcasts it automatically.

Thereafter, the translation apparatus again automatically detects, through the endpoint detection module, that user B has started to speak, and performs the above steps 3-7 for user B, translating user B's speech in language B into target speech in language A and broadcasting it automatically, and so on back and forth until the conversation between users A and B ends.

Throughout the translation process, user A needs no additional operation on the translation apparatus; the apparatus itself completes the series of processes of listening, recognizing, ending, translating, and broadcasting.
Optionally, in another implementation of the present application, to improve the speed of language recognition, the user's voiceprint features may be collected in advance on first use, and the collected voiceprint features may be bound to the language the user uses. On subsequent use, the language used by the user is quickly confirmed directly from the user's voiceprint features.

Specifically, the translation apparatus provides the user with an interface for binding voiceprint features to the corresponding language. Before the translation task is triggered, in response to a binding instruction triggered by the user through this interface, the user's target voice is collected through the sound collection device and subjected to speech recognition to obtain the user's voiceprint features and the language the user uses, and the identified voiceprint features are bound to that language in the translation apparatus. Alternatively, the language bound to the voiceprint features may be the language designated by the binding instruction.

Then the step of, when the user is detected to start speaking, entering the voice recognition state, extracting user voice from the collected sound by the processor, and judging the source language used by the user based on the extracted user voice specifically includes: when the user is detected to start speaking, entering the voice recognition state, extracting the user voice from the collected sound by the processor, performing voiceprint recognition on the extracted user voice to obtain the user's voiceprint features and the language bound to those features, and using that language as the source language used by the user.

For example, suppose user A uses language A and user B uses language B. Before translation, user A and user B each bind their own voiceprint features to the language they use in the translation apparatus, through the interface provided by the apparatus. For instance, user A and user B in turn press the language setting button of the translation apparatus to trigger a binding instruction and, following the prompt information output by the apparatus, record a segment of speech into it. The prompt information may be output as voice or text. The language setting button may be a physical button or a virtual button.

The translation apparatus performs speech recognition on the recorded voices of user A and user B, obtains user A's voiceprint features and the corresponding language A, associates them, and stores the association information in the memory, thereby binding user A's voiceprint features and the corresponding language A in the apparatus. Similarly, user B's voiceprint features and the corresponding language B are obtained and associated, and the association information is stored in the memory, thereby binding user B's voiceprint features and the corresponding language B in the apparatus.

After the translation task is triggered, when user A is detected to start speaking, voiceprint recognition together with the above association information confirms the language user A uses, and language-type recognition is no longer required. Compared with language-type recognition, voiceprint recognition requires less computation and occupies fewer system resources, so it can increase recognition speed and thus translation speed.

In this embodiment, during execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a translation apparatus provided by an embodiment of the present application. The translation apparatus can be used to implement the speech translation method shown in FIG. 1. The translation apparatus includes: an endpoint detection module 401, a recognition module 402, a tail point detection module 403, a translation and speech synthesis module 404, and a playback module 405.

The endpoint detection module 401 is configured to, when a translation task is triggered, collect sound in the environment through the sound collection device and detect, based on the collected sound, whether a user starts to speak.

The recognition module 402 is configured to, when the user is detected to start speaking, enter a voice recognition state, extract user voice from the collected sound, judge the source language used by the user based on the extracted user voice, and determine, according to a preset language pair, the target language associated with the source language.

The tail point detection module 403 is configured to detect whether the user has stopped speaking for more than a preset delay duration, and to exit the voice recognition state when it is detected that the user has stopped speaking for more than the preset delay duration.

The translation and speech synthesis module 404 is configured to convert the user voice extracted in the voice recognition state into target voice in the target language.

The playback module 405 is configured to play the target voice through the sound playback device and, after the playback ends, trigger the endpoint detection module to perform the step of detecting, based on the collected sound, whether the user starts to speak.

Further, as shown in FIG. 5, in another embodiment of the present application, the translation apparatus further includes:

a noise estimation module 501 configured to detect, based on the collected sound, whether the noise in the environment exceeds a preset noise level, and if so, output prompt information, where the prompt information is used to remind the user that the translation environment is poor.

Further, the translation apparatus further includes:

a configuration module 502 configured to, in response to a language designation operation of the user, configure the at least two languages designated by the language designation operation as the language pair.

Further, the recognition module 402 is further configured to convert the extracted user voice into corresponding first text.

Further, the translation apparatus further includes:

a display module 503 configured to display the first text on the display screen.

Further, the translation and speech synthesis module 404 is further configured to translate the first text into second text in the target language and to convert the second text into the target voice through a speech synthesis system.

The display module 503 is further configured to display the second text on the display screen.

Further, the translation apparatus further includes:

a processing module 504 configured to exit the voice recognition state in response to a triggered translation instruction.

The configuration module 502 is further configured to adjust the preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered.

Further, the processing module 504 is further configured to trigger the translation instruction when, in the voice recognition state, the motion sensor detects that the motion amplitude of the translation apparatus exceeds a preset amplitude, or the translation apparatus is bumped.

Further, the recognition module 402 is further configured to extract the user's voiceprint features from the user voice and judge whether identification information of the language corresponding to the voiceprint features is stored in the memory; if the identification information is stored in the memory, determine the language corresponding to it as the source language; if it is not, extract the user's pronunciation features from the user voice, determine the source language according to the pronunciation features, and store the correspondence between the user's voiceprint features and the identification information of the source language in the memory.

Further, the configuration module 502 is further configured to judge whether a preset delay duration corresponding to the voiceprint features of the user who stopped speaking is stored in the memory; if it is, adjust that preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered; if it is not, set the time difference as the corresponding preset delay duration.

Further, the processing module 504 is further configured to store all first text and second text obtained during execution of the translation task in the memory as a conversation record for the user's later reference.

The processing module 504 is further configured to automatically clear conversation records that exceed the storage period, periodically or after each power-on, to improve the utilization of storage space.

Further, the recognition module 402 is further configured to, in response to a binding instruction triggered by the user, collect the user's target voice through the sound collection device and perform speech recognition on the target voice to obtain the user's voiceprint features and the language the user uses.

The configuration module 502 is further configured to bind, in the translation apparatus, the identified voiceprint features of the user to the language used.

The recognition module 402 is further configured to, when the user is detected to start speaking, enter the voice recognition state, extract user voice from the collected sound, and perform voiceprint recognition on the extracted user voice to obtain the user's voiceprint features and the language bound to those features, using that language as the source language used by the user.

For the specific processes by which the above modules implement their functions, reference may be made to the related content of the embodiments shown in FIG. 1 to FIG. 3, which is not repeated here.

In this embodiment, during execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
Referring to FIG. 6, FIG. 6 is a schematic diagram of a hardware structure of a translation apparatus provided by an embodiment of the present application.

The translation apparatus described in this embodiment includes: a sound collection device 601, a sound playback device 602, a memory 603, a processor 604, and a computer program stored in the memory 603 and executable on the processor 604.

The sound collection device 601, the sound playback device 602, and the memory 603 are electrically connected to the processor 604. The memory 603 may be a high-speed random access memory (RAM), or a non-volatile memory such as disk storage. The memory 603 is configured to store a set of executable program code.

When running the computer program, the processor 604 performs the following steps:

when a translation task is triggered, collecting sound in the environment through the sound collection device 601 and detecting, based on the collected sound, whether the user starts to speak; when the user is detected to start speaking, entering a voice recognition state, extracting user voice from the collected sound, judging the source language used by the user based on the extracted user voice, and determining, according to the preset language pair, the target language associated with the source language; when it is detected that the user has stopped speaking for more than the preset delay duration, exiting the voice recognition state and converting the user voice extracted in the voice recognition state into target voice in the target language; playing the target voice through the sound playback device 602, and after the playback ends, returning to the step of detecting, based on the collected sound, whether the user starts to speak, until the translation task ends.

Further, as shown in FIG. 7, in another implementation of this embodiment, the translation apparatus further includes:

at least one input device 701, at least one output device 702, and at least one motion sensor 703, all electrically connected to the processor 604. The input device 701 may specifically be a camera, a touch panel, a physical button, or the like. The output device 702 may specifically be a display screen. The motion sensor 703 may specifically be a gravity sensor, a gyroscope, an acceleration sensor, or the like.

Further, the translation apparatus further includes a signal transceiver for receiving and sending wireless network signals.

For the specific processes by which the above components implement their functions, reference may be made to the related content of the embodiments shown in FIG. 1 to FIG. 3, which is not repeated here.

In this embodiment, during execution of the translation task, the device automatically and cyclically listens for when the user starts and stops speaking and translates what the user said into the target language for playback. On the one hand, multiple people can share one translation device for simultaneous interpretation, which reduces translation costs; on the other hand, the translation device truly perceives the users' conversation automatically and translates and broadcasts it, which simplifies translation operations.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; the division of the modules is only a logical function division, and there may be other divisions in actual implementation, for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or in other forms.

The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.

If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence the part that contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a readable storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned readable storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disc, and other media that can store program code.

It should be noted that, for brevity of description, each of the foregoing method embodiments is expressed as a series of action combinations; however, those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the present application.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.

The foregoing describes the speech translation method and translation apparatus provided by the present application. Those skilled in the art may make changes to the specific implementation and application scope according to the ideas of the embodiments of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

  1. A speech translation method, applied to a translation apparatus, the translation apparatus comprising a processor, and a sound collection device and a sound playback device electrically connected to the processor, wherein the method comprises:
    when a translation task is triggered, collecting sound in the environment through the sound collection device, and detecting, by the processor based on the collected sound, whether a user starts to speak;
    when the user is detected to start speaking, entering a voice recognition state, extracting user voice from the collected sound by the processor, judging a source language used by the user based on the extracted user voice, and determining, according to a preset language pair, a target language associated with the source language;
    when it is detected that the user has stopped speaking for more than a preset delay duration, exiting the voice recognition state, and converting, by the processor, the user voice extracted in the voice recognition state into target voice in the target language;
    playing the target voice through the sound playback device, and after the playback ends, returning to the step of detecting, by the processor based on the collected sound, whether the user starts to speak, until the translation task ends.
  2. The method according to claim 1, wherein before entering the voice recognition state when the user is detected to start speaking, the method further comprises:
    detecting, by the processor based on the collected sound, whether noise in the environment exceeds a preset noise level, and if so, outputting prompt information, wherein the prompt information is used to remind the user that the translation environment is poor.
  3. The method according to claim 1, wherein the method further comprises:
    in response to a language designation operation of the user, configuring, by the processor, at least two languages designated by the language designation operation as the language pair.
  4. The method according to claim 1, wherein the translation apparatus further comprises a display screen electrically connected to the processor, and after entering the voice recognition state when the user is detected to start speaking and extracting the user voice from the collected sound by the processor, the method further comprises:
    converting the extracted user voice into corresponding first text, and displaying the first text on the display screen;
    wherein exiting the voice recognition state when it is detected that the user has stopped speaking for more than the preset delay duration, and converting, by the processor, the user voice extracted in the voice recognition state into the target voice in the target language specifically comprises:
    when it is detected that the user has stopped speaking for more than the preset delay duration, exiting the voice recognition state, translating, by the processor, the first text into second text in the target language, and displaying the second text on the display screen; and
    converting the second text into the target voice through a speech synthesis system.
  5. The method according to claim 1, wherein before exiting the voice recognition state when it is detected that the user has stopped speaking for more than the preset delay duration, the method further comprises:
    exiting the voice recognition state in response to a triggered translation instruction; and
    adjusting the preset delay duration according to a time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered.
  6. The method according to claim 5, wherein the translation apparatus further comprises a motion sensor electrically connected to the processor, and the method further comprises:
    in the voice recognition state, triggering the translation instruction when the motion sensor detects that the motion amplitude of the translation apparatus exceeds a preset amplitude, or the translation apparatus is bumped.
  7. The method according to claim 5, wherein the translation apparatus further comprises a memory electrically connected to the processor, and judging the source language used by the user based on the extracted user voice specifically comprises:
    extracting, by the processor, the user's voiceprint features from the user voice, and judging whether identification information of a language corresponding to the voiceprint features is stored in the memory;
    if the identification information is stored in the memory, determining the language corresponding to the identification information as the source language;
    if the identification information is not stored in the memory, extracting the user's pronunciation features from the user voice, determining the source language according to the pronunciation features, and storing a correspondence between the user's voiceprint features and the identification information of the source language in the memory.
  8. The method according to claim 7, wherein adjusting the preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered specifically comprises:
    judging whether a preset delay duration corresponding to the voiceprint features of the user who stopped speaking is stored in the memory;
    if the corresponding preset delay duration is stored in the memory, adjusting the corresponding preset delay duration according to the time difference between the time the user was detected to stop speaking and the time the translation instruction was triggered;
    if the corresponding preset delay duration is not stored in the memory, setting the time difference as the corresponding preset delay duration.
  9. A translation apparatus, wherein the apparatus comprises:
    an endpoint detection module configured to, when a translation task is triggered, collect sound in the environment through the sound collection device and detect, based on the collected sound, whether a user starts to speak;
    a recognition module configured to, when the user is detected to start speaking, enter a voice recognition state, extract user voice from the collected sound, judge a source language used by the user based on the extracted user voice, and determine, according to a preset language pair, a target language associated with the source language;
    a tail point detection module configured to detect whether the user has stopped speaking for more than a preset delay duration, and to exit the voice recognition state when it is detected that the user has stopped speaking for more than the preset delay duration;
    a translation and speech synthesis module configured to convert the user voice extracted in the voice recognition state into target voice in the target language; and
    a playback module configured to play the target voice through the sound playback device and, after the playback ends, trigger the endpoint detection module to perform the step of detecting, based on the collected sound, whether the user starts to speak.
  10. A translation apparatus, wherein the apparatus comprises: a sound collection device, a sound playback device, a memory, a processor, and a computer program stored in the memory and executable on the processor;
    wherein the sound collection device, the sound playback device, and the memory are electrically connected to the processor;
    when running the computer program, the processor performs the following steps:
    when a translation task is triggered, collecting sound in the environment through the sound collection device, and detecting, based on the collected sound, whether a user starts to speak;
    when the user is detected to start speaking, entering a voice recognition state, extracting user voice from the collected sound, judging a source language used by the user based on the extracted user voice, and determining, according to a preset language pair, a target language associated with the source language;
    when it is detected that the user has stopped speaking for more than a preset delay duration, exiting the voice recognition state, and converting the user voice extracted in the voice recognition state into target voice in the target language;
    playing the target voice through the sound playback device, and after the playback ends, returning to the step of detecting, based on the collected sound, whether the user starts to speak, until the translation task ends.
PCT/CN2019/081036 2018-09-19 2019-04-02 Speech translation method and translation apparatus WO2020057102A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201980001336.0A CN110914828B (zh) 2018-09-19 2019-04-02 语音翻译方法及翻译装置
JP2019563584A JP2021503094A (ja) 2018-09-19 2019-04-02 音声翻訳方法及び翻訳装置
US16/470,560 US20210343270A1 (en) 2018-09-19 2019-04-02 Speech translation method and translation apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811094286.9 2018-09-19
CN201811094286.9A CN109344411A (zh) 2018-09-19 2018-09-19 一种自动侦听式同声传译的翻译方法

Publications (1)

Publication Number Publication Date
WO2020057102A1 true WO2020057102A1 (zh) 2020-03-26

Family

ID=65305959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/081036 2018-09-19 2019-04-02 Speech translation method and translation apparatus WO2020057102A1 (zh)

Country Status (4)

Country Link
US (1) US20210343270A1 (zh)
JP (1) JP2021503094A (zh)
CN (1) CN109344411A (zh)
WO (1) WO2020057102A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680522A (zh) * 2020-05-29 2020-09-18 刘于平 Method, system, and electronic device for implementing translation control based on an electronic terminal

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344411A (zh) * 2018-09-19 2019-02-15 深圳市合言信息科技有限公司 一种自动侦听式同声传译的翻译方法
CN112435690B (zh) * 2019-08-08 2024-06-04 百度在线网络技术(北京)有限公司 双工蓝牙翻译处理方法、装置、计算机设备和存储介质
CN111142822A (zh) * 2019-12-27 2020-05-12 深圳小佳科技有限公司 一种同声传译会议方法及系统
JP2022030754A (ja) * 2020-08-07 2022-02-18 株式会社東芝 入力支援システム、入力支援方法およびプログラム
CN112309370A (zh) * 2020-11-02 2021-02-02 北京分音塔科技有限公司 语音翻译方法、装置及设备、翻译机
CN113766510A (zh) * 2021-09-28 2021-12-07 安徽华米信息科技有限公司 设备绑定方法、装置、设备、系统及存储介质
CN115312029B (zh) * 2022-10-12 2023-01-31 之江实验室 一种基于语音深度表征映射的语音翻译方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1932807A (zh) * 2005-09-15 2007-03-21 株式会社东芝 Apparatus and method for translating speech and performing speech synthesis of the translation result
CN101154221A (zh) * 2006-09-28 2008-04-02 株式会社东芝 Apparatus for performing translation processing on input speech
CN108307659A (zh) * 2016-11-11 2018-07-20 松下知识产权经营株式会社 Control method of translation apparatus, translation apparatus, and program
CN109344411A (zh) * 2018-09-19 2019-02-15 深圳市合言信息科技有限公司 Translation method for automatic-listening simultaneous interpretation

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007272260A (ja) * 2004-06-23 2007-10-18 Matsushita Electric Ind Co Ltd Automatic translation apparatus
JP2007322523A (ja) * 2006-05-30 2007-12-13 Toshiba Corp Speech translation apparatus and method
JP2008077601A (ja) * 2006-09-25 2008-04-03 Toshiba Corp Machine translation apparatus, machine translation method, and machine translation program
WO2013163293A1 (en) * 2012-04-25 2013-10-31 Kopin Corporation Instant translation system
CN103617801B (zh) * 2013-12-18 2017-09-29 联想(北京)有限公司 Voice detection method and apparatus, and electronic device
JP2015118710A (ja) * 2015-01-09 2015-06-25 株式会社東芝 Dialogue apparatus, method, and program
CN104780263A (zh) * 2015-03-10 2015-07-15 广东小天才科技有限公司 Method and apparatus for judging voice breakpoint extension
CN107305541B (zh) * 2016-04-20 2021-05-04 科大讯飞股份有限公司 Method and apparatus for segmenting speech recognition text
JP6916664B2 (ja) * 2016-09-28 2021-08-11 Panasonic Intellectual Property Corporation of America Speech recognition method, mobile terminal, and program
CN106486125A (zh) * 2016-09-29 2017-03-08 安徽声讯信息技术有限公司 Simultaneous interpretation system based on speech recognition technology
CN107910004A (zh) * 2017-11-10 2018-04-13 科大讯飞股份有限公司 Speech translation processing method and apparatus
CN108009159A (zh) * 2017-11-30 2018-05-08 上海与德科技有限公司 Simultaneous interpretation method and mobile terminal
CN108257616A (zh) * 2017-12-05 2018-07-06 苏州车萝卜汽车电子科技有限公司 Detection method and apparatus for human-machine dialogue
CN207851812U (zh) * 2017-12-28 2018-09-11 中译语通科技(青岛)有限公司 Novel simultaneous interpretation translation device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1932807A (zh) * 2005-09-15 2007-03-21 株式会社东芝 Apparatus and method for translating speech and performing speech synthesis of the translation result
CN101154221A (zh) * 2006-09-28 2008-04-02 株式会社东芝 Apparatus for performing translation processing on input speech
CN108307659A (zh) * 2016-11-11 2018-07-20 松下知识产权经营株式会社 Control method of translation apparatus, translation apparatus, and program
CN109344411A (zh) * 2018-09-19 2019-02-15 深圳市合言信息科技有限公司 Translation method for automatic-listening simultaneous interpretation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680522A (zh) * 2020-05-29 2020-09-18 刘于平 Method, system, and electronic device for implementing translation control based on an electronic terminal
CN111680522B (zh) * 2020-05-29 2024-04-23 刘于平 Method, system, and electronic device for implementing translation control based on an electronic terminal

Also Published As

Publication number Publication date
CN109344411A (zh) 2019-02-15
US20210343270A1 (en) 2021-11-04
JP2021503094A (ja) 2021-02-04

Similar Documents

Publication Publication Date Title
WO2020057102A1 (zh) Speech translation method and translation apparatus
CN110914828B (zh) Speech translation method and translation apparatus
WO2019237806A1 (zh) Speech recognition and translation method and translation apparatus
CN110049270B (zh) Multi-person conference speech transcription method, apparatus, system, device, and storage medium
CN110517689B (zh) Voice data processing method, apparatus, and storage medium
CN109147784B (zh) Voice interaction method, device, and storage medium
JP6139598B2 (ja) Speech recognition client system, speech recognition server system, and speech recognition method for processing online speech recognition
US11164571B2 (en) Content recognizing method and apparatus, device, and computer storage medium
CN110853615B (zh) Data processing method, apparatus, and storage medium
WO2016187910A1 (zh) Voice-to-text conversion method and device, and storage medium
JPWO2020222925A5 (zh)
CN111883168A (zh) Voice processing method and apparatus
JP2000207170A (ja) Information processing apparatus and information processing method
CN117253478A (zh) Voice interaction method and related apparatus
JP7400364B2 (ja) Speech recognition system and information processing method
CN114064943A (zh) Conference management method, apparatus, storage medium, and electronic device
JP7417272B2 (ja) Terminal device, server device, distribution method, learner acquisition method, and program
CN111540357A (zh) Voice processing method, apparatus, terminal, server, and storage medium
WO2019150708A1 (ja) Information processing apparatus, information processing system, information processing method, and program
CN110197663A (zh) Control method, apparatus, and electronic device
CN112435690B (zh) Duplex Bluetooth translation processing method and apparatus, computer device, and storage medium
KR102181583B1 (ko) Voice-recognition interactive robot, interactive robot voice-recognition system, and method therefor
CN113066513B (zh) Voice data processing method, apparatus, electronic device, and storage medium
WO2024032111A1 (zh) Data processing method, apparatus, device, medium, and product for online conferences
WO2024125032A1 (zh) Voice control method and terminal device

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019563584

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 12.05.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19862886

Country of ref document: EP

Kind code of ref document: A1