CN111862940A - Earphone-based translation method, device, system, equipment and storage medium


Info

Publication number
CN111862940A
Authority
CN
China
Prior art keywords
source
text
target
voice
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010682940.9A
Other languages
Chinese (zh)
Inventor
张铭阳
熊雷
黄荣毕
胡文波
蒋峰
张志达
许云飞
李宏亮
张国旺
邢仁泰
蒋习旺
王婷
席晓宁
高聪
高栋
常镶石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010682940.9A
Publication of CN111862940A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 - Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/80 - Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W76/00 - Connection management
    • H04W76/10 - Connection setup
    • H04W76/14 - Direct-mode setup

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an earphone-based translation method, apparatus, system, device, and storage medium, relating to the technical fields of voice technology and voice interaction. The specific implementation scheme is as follows: establishing a communication connection with an earphone end; acquiring a source text to be translated; translating the source text to obtain a target text; performing speech synthesis on the target text to generate target speech; and displaying the target text on a display interface of a client while transmitting the target speech to the earphone end for playing. Embodiments of the application achieve the technical effect of performing speech translation conveniently and in real time.

Description

Earphone-based translation method, device, system, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an earphone-based translation method, apparatus, system, device, and storage medium.
Background
In human life and work, language expression is an important mode of communication and information exchange. Translation is required when different languages are involved.
With the development of artificial intelligence and related technologies, various application software has appeared that provides language translation functions for users. However, existing language translation functions are limited in practical operation, and for scenarios that require convenient, real-time access to translation information, the existing schemes still leave room for optimization.
Disclosure of Invention
The disclosure provides a translation method, apparatus, device, and storage medium implemented based on an earphone.
According to an aspect of the present disclosure, there is provided an earphone-based translation method, performed by a client, the method including:
establishing a communication connection with the earphone end;
acquiring a source text to be translated;
translating the source text to obtain a target text;
performing voice synthesis on the target text to generate target voice;
and displaying the target text on a display interface of a client, and transmitting the target voice to the earphone end for playing.
According to another aspect of the present disclosure, there is provided an earphone-based translation apparatus configured in a client, the apparatus including:
the communication connection establishing module is used for establishing communication connection with the earphone end;
the source text acquisition module is used for acquiring a source text to be translated;
the target text acquisition module is used for translating the source text to acquire a target text;
the voice synthesis module is used for carrying out voice synthesis on the target text to generate target voice;
and the target voice processing module is used for displaying the target text on a display interface of a client and transmitting the target voice to the earphone end for playing.
According to another aspect of the present disclosure, a translation system implemented based on an earphone is provided, where the system includes an earphone end, a terminal, and a cloud server; the terminal is in communication connection with the earphone end and the cloud server;
the earphone end is used for collecting source voice;
the terminal is used for acquiring source voice collected by the earphone end and transmitting the source voice to the cloud server;
the cloud server is used for performing at least one processing operation of voice recognition, language translation and voice synthesis on the source voice and feeding back a processing result to the terminal.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the earphone-based translation method described in any of the embodiments of the present application.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the earphone-based translation method according to any one of the embodiments of the present disclosure.
According to the technology of the present application, the technical effect of performing speech translation conveniently and in real time is achieved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flowchart of an earphone-based translation method according to an embodiment of the present application;
FIG. 2A is a flowchart of an earphone-based translation method according to an embodiment of the present application;
FIG. 2B is a schematic diagram of an abnormal-connection prompt information interface according to an embodiment of the present application;
FIG. 2C is a schematic diagram of a configuration interface according to an embodiment of the present application;
FIG. 2D is a schematic diagram of a blocking prompt information interface according to an embodiment of the present application;
FIG. 2E is a schematic diagram of a display interface according to an embodiment of the present application;
FIG. 2F is a schematic diagram of a translation mode interface according to an embodiment of the present application;
FIG. 3 is a flowchart of an earphone-based translation method according to an embodiment of the present application;
FIG. 4 is a flowchart of an earphone-based translation method according to an embodiment of the present application;
FIG. 5A is a schematic diagram of an earphone-based translation system according to an embodiment of the present application;
FIG. 5B is a schematic diagram of an earphone-based translation system according to an embodiment of the present application;
FIG. 5C is a schematic diagram of an earphone-based translation system according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an earphone-based translation apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing an earphone-based translation method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Fig. 1 is a flowchart of an earphone-based translation method according to an embodiment of the present application, which may be applied to a case where a user listens to the translation result of a source text through an earphone. The method of this embodiment may be performed by an earphone-based translation apparatus; the apparatus may be configured in a client, may be implemented by software and/or hardware, and may be integrated on any electronic device with computing capability. The client comprises system software, application software, or the like installed on a terminal, and the terminal includes but is not limited to a smartphone, a smart watch, a tablet computer, a notebook computer, or any device configured with an intelligent operating system.
As shown in fig. 1, the earphone-based translation method disclosed in this embodiment may include:
s101, establishing communication connection with an earphone end.
In one embodiment, the terminal to which the client belongs establishes a communication connection with the earphone end in a preset mode, so that the client installed on the terminal is communicatively connected with the earphone end. The preset communication connection mode comprises at least one of the following: 1) communication connection through a data line; 2) communication connection through Bluetooth; 3) communication connection through WiFi (Wireless Fidelity); 4) communication connection through NFC (Near Field Communication); 5) communication connection through RFID (Radio Frequency Identification).
Optionally, the terminal to which the client belongs and the earphone end establish the communication connection through A2DP (Advanced Audio Distribution Profile), a Bluetooth audio transmission protocol.
By establishing communication connection with the earphone end, a foundation is laid for transmitting the translated target voice to the earphone end for playing subsequently.
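As an illustration of S101 only, and not part of the disclosure, the following minimal Python sketch shows a client trying the preset connection modes in priority order; all transport functions are hypothetical placeholders, and NFC/RFID are omitted for brevity:

```python
from typing import Callable, List, Optional

def connect_data_line() -> Optional[str]:
    """Hypothetical wired (data line) connection attempt."""
    return "usb-handle"  # placeholder: a handle on success, None on failure

def connect_bluetooth() -> Optional[str]:
    """Hypothetical Bluetooth connection attempt (e.g. over A2DP)."""
    return None

def connect_wifi() -> Optional[str]:
    """Hypothetical WiFi connection attempt."""
    return None

def establish_connection(modes: List[Callable[[], Optional[str]]]) -> str:
    """Try each preset communication mode in order; return the first handle."""
    for mode in modes:
        handle = mode()
        if handle is not None:
            return handle
    raise ConnectionError("no communication mode available; prompt the user")

earphone = establish_connection([connect_data_line, connect_bluetooth, connect_wifi])
```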
S102, acquiring a source text to be translated.
Wherein the source text represents the textual information to be translated.
In an embodiment, the client may obtain source speech to be translated and determine the corresponding source text through speech recognition, or may obtain the source text directly.
Optionally, the obtaining of the source text to be translated includes at least one of:
1) acquiring source speech collected by the earphone end, and determining the source text by recognizing the source speech.
In one embodiment, a user dictates the source speech to be translated, the earphone end worn by the user collects the source speech and transmits it to the terminal with which a communication connection has been established, and the client on the terminal obtains the source speech and performs speech recognition on it to determine the corresponding source text.
2) acquiring source speech input by a user through a microphone of the terminal to which the client belongs, and determining the source text by recognizing the source speech.
In one embodiment, after the user starts the sound-receiving function of the microphone of the terminal to which the client belongs, the user inputs source speech to the terminal; the source speech may be dictated by the user or come from another sound source, such as an audio file. The client obtains the source speech collected by the terminal and performs speech recognition on it to determine the corresponding source text.
Optionally, "determining the source text by recognizing the source speech" in 1) and 2) above includes two implementations, A and B:
A. performing local speech recognition on the source speech to determine the source text.
In one embodiment, the client performs speech recognition on the acquired source speech according to a preset speech recognition algorithm to determine text information, i.e., a source text, corresponding to the source speech.
The speech recognition algorithm includes, but is not limited to, a hidden Markov model algorithm based on a parametric model, a vector quantization algorithm based on a non-parametric model, or a deep neural network algorithm.
B. transmitting the source speech to a speech recognition cloud server based on a long voice interface to request speech recognition, and receiving the source text fed back by the speech recognition cloud server.
The long voice interface is essentially a speech SDK (Software Development Kit) used to establish data transmission between the client and the speech recognition cloud server.
In one implementation, the client generates a speech recognition request based on the source speech and sends it to the speech recognition cloud server through the preset long voice interface; the speech recognition cloud server responds to the request, obtains the source speech, and invokes a speech recognition module to determine the source text corresponding to the source speech. Finally, the speech recognition cloud server feeds the obtained source text back to the client through the long voice interface.
Determining the source text through local speech recognition, or transmitting the source speech to the speech recognition cloud server through the long voice interface and receiving the source text fed back by it, realizes recognition of the source speech in two ways: when network conditions are good, the source speech is preferentially transmitted to the speech recognition cloud server for recognition; when network conditions are poor, local speech recognition is performed. This improves the reliability of speech recognition of the source speech and adds redundancy.
3) acquiring a source text input by a user through an input interface of the client.
In one embodiment, a user invokes a text input function through an input interface of a client, manually inputs a source text to be translated, and the client correspondingly acquires the source text input by the user.
By acquiring source speech collected by the earphone end and recognizing it, by acquiring source speech input through the microphone of the terminal to which the client belongs and recognizing it, or by acquiring a source text entered through the input interface of the client, the source text can be obtained in at least three ways. This improves the adaptability of the earphone-based translation method of this embodiment to different scenarios and improves the user experience.
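To make implementations A and B above concrete, here is a minimal Python sketch of cloud-first speech recognition with local fallback; it is an illustration under stated assumptions, not the patented implementation, and network_is_good, cloud_asr, and local_asr are hypothetical stand-ins rather than a real speech SDK:

```python
def network_is_good() -> bool:
    """Placeholder for a real connectivity/quality check."""
    return True

def cloud_asr(source_speech: bytes) -> str:
    """Would send the speech to the speech recognition cloud server
    over the long voice interface and await the recognized text."""
    return "recognized source text (cloud)"

def local_asr(source_speech: bytes) -> str:
    """Would run an on-device recognizer (e.g. HMM- or neural-network-based)."""
    return "recognized source text (local)"

def recognize(source_speech: bytes) -> str:
    """Prefer the speech recognition cloud server; fall back to local recognition."""
    if network_is_good():
        try:
            return cloud_asr(source_speech)
        except Exception:
            pass  # degrade gracefully on network or server errors
    return local_asr(source_speech)
```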
By obtaining the source text to be translated, a foundation is laid for subsequently translating the source text and obtaining the target text.
S103, translating the source text to obtain a target text.
The language of the target text and the language of the source text can be set in the client as required; for example, the language of the source text may be "Chinese" and the language of the target text "English", or conversely the language of the source text may be "English" and the language of the target text "Chinese".
In one embodiment, according to a target language set by a user, a preset translation processing algorithm is adopted to translate a source text to obtain a target text in a target language form.
Optionally, translating the source text to obtain the target text includes two implementations, A and B:
A. performing local language translation on the source text to obtain the target text.
In one embodiment, the client translates the source text using a locally preset translation processing algorithm to obtain the target text in the target language. The translation processing algorithm includes, but is not limited to, a semi-supervised machine translation algorithm, a multi-strategy analysis translation algorithm, a translation template extraction algorithm, and the like.
B. transmitting the source text to a translation cloud server based on a long-connection interface to request language translation, and receiving the target text fed back by the translation cloud server.
The long-connection interface is essentially an SDK used to establish data transmission between the client and the translation cloud server.
In one implementation, the client generates a language translation request based on the source text and the target language set by the user, and sends it to the translation cloud server through the preset long-connection interface; the translation cloud server responds to the request, obtains the source text, and invokes a language translation module to translate it into the target language. Finally, the translation cloud server feeds the obtained target text back to the client through the long-connection interface.
Translating the source text locally, or transmitting the source text to the translation cloud server through the long-connection interface and receiving the target text fed back by it, realizes translation of the source text in two ways: when network conditions are good, the source text is preferentially transmitted to the translation cloud server for translation; when network conditions are poor, local translation is performed. This improves the reliability of language translation of the source text and adds redundancy.
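The same cloud-first, local-fallback pattern can be sketched for translation, reusing the hypothetical network_is_good() from the speech recognition sketch above; again an illustration, not the patented implementation:

```python
def cloud_translate(source_text: str, target_lang: str) -> str:
    """Would send a language translation request over the long-connection interface."""
    return f"<{target_lang}> {source_text}"

def local_translate(source_text: str, target_lang: str) -> str:
    """Would run a locally preset translation processing algorithm."""
    return f"<{target_lang}> {source_text}"

def translate(source_text: str, target_lang: str) -> str:
    """Prefer the translation cloud server; fall back to local translation."""
    if network_is_good():
        try:
            return cloud_translate(source_text, target_lang)
        except Exception:
            pass
    return local_translate(source_text, target_lang)
```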
The target text is obtained by translating the source text, so that the language translation effect is realized, and a foundation is laid for generating the target voice by subsequently performing voice synthesis.
S104, performing speech synthesis on the target text to generate target speech.
In one embodiment, the target text is speech-synthesized using TTS (Text to Speech) technology to generate the target speech.
Optionally, performing speech synthesis on the target text to generate the target speech includes two implementations, A and B:
A. performing local speech synthesis on the target text to generate the target speech.
In one embodiment, the client locally performs speech synthesis on the target text using TTS technology, which optionally includes linear predictive coding, pitch-synchronous overlap-add, and vocal-tract-model-based speech synthesis.
B. transmitting the target text to a speech synthesis cloud server based on a long-connection interface to request speech synthesis, and receiving the target speech fed back by the speech synthesis cloud server.
In one implementation, the client generates a speech synthesis request based on the target text and sends it to the speech synthesis cloud server through the preset long-connection interface; the speech synthesis cloud server responds to the request, obtains the target text, and invokes a speech synthesis module to generate the corresponding target speech. Finally, the speech synthesis cloud server feeds the generated target speech back to the client through the long-connection interface.
Generating the target speech through local speech synthesis, or transmitting the target text to the speech synthesis cloud server through the long-connection interface and receiving the target speech fed back by it, realizes speech synthesis of the target text in two ways: when network conditions are good, the target text is preferentially transmitted to the speech synthesis cloud server for synthesis; when network conditions are poor, local speech synthesis is performed. This improves the reliability of speech synthesis of the target text and adds redundancy.
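Speech synthesis follows the same pattern, which allows S102-S104 to be assembled into a single pipeline. Below is a hedged sketch reusing recognize() and translate() from the sketches above, with hypothetical cloud_tts/local_tts stand-ins:

```python
from typing import Tuple

def cloud_tts(target_text: str) -> bytes:
    """Would request speech synthesis over the long-connection interface."""
    return target_text.encode("utf-8")  # placeholder for synthesized audio

def local_tts(target_text: str) -> bytes:
    """Would run an on-device TTS engine (e.g. LPC, PSOLA, or a vocal-tract model)."""
    return target_text.encode("utf-8")

def synthesize(target_text: str) -> bytes:
    """Prefer the speech synthesis cloud server; fall back to local synthesis."""
    if network_is_good():
        try:
            return cloud_tts(target_text)
        except Exception:
            pass
    return local_tts(target_text)

def translation_pipeline(source_speech: bytes, target_lang: str) -> Tuple[str, bytes]:
    """S102-S104 end to end: recognize, translate, then synthesize."""
    source_text = recognize(source_speech)
    target_text = translate(source_text, target_lang)
    return target_text, synthesize(target_text)
```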
By performing speech synthesis on the target text, the target speech corresponding to the target text is generated, laying a foundation for subsequently transmitting the target speech to the earphone end for the user to listen to.
S105, displaying the target text on a display interface of a client, and transmitting the target speech to the earphone end for playing.
In one embodiment, the client displays the target text on a preset display interface, and the user can view the translated target text through the display interface of the client. Meanwhile, the client transmits the target speech corresponding to the target text to the earphone end with which a communication connection has been established; the earphone end converts the target speech from an electrical signal into mechanical vibration through its sound-generating units, thereby playing the target speech, and the user wearing the earphone end can listen to the target speech corresponding to the target text.
By displaying the target text on the display interface of the client and transmitting the target speech to the earphone end for playing, the translated target text and its corresponding target speech are fed back to the user.
According to the technical solution of this embodiment, a communication connection is established with the earphone end, the source text to be translated is translated, the translated target text is synthesized into target speech, and the target text is displayed on the display interface of the client while the target speech is transmitted to the earphone end for playing. The user can thus view the translated target text in the client and listen to the corresponding target speech through the earphone end, achieving the technical effect of performing speech translation conveniently and in real time.
On the basis of the foregoing embodiment, the step S105 of "transmitting the target voice to the earphone end for playing" includes:
transmitting the target speeches corresponding to the source texts input by the two users to the two earphone ends respectively for playing.
In one embodiment, the client translates the source text input by each user, obtains the corresponding target text, performs speech synthesis on the target text, and transmits each generated target speech to the earphone end worn by the other user for playing.
Illustratively, user A and user B each wear one earphone end. User A inputs the source text "你好" in the client; the target speech corresponding to the translated target text is "Hello", and the client transmits the target speech "Hello" to the earphone end worn by user B. Correspondingly, user B inputs the source text "Hello" in the client; the target speech corresponding to the translated target text is "你好", and the client transmits the target speech "你好" to the earphone end worn by user A.
By transmitting the target speeches corresponding to the source texts input by the two users to the two earphone ends respectively for playing, the technical effect of translated communication between two users is achieved.
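A minimal sketch of the cross-routing described above, under the assumption that each earphone end is reachable through a play callable; the transports and user labels are hypothetical:

```python
from typing import Callable, Dict

def route_cross(target_speech_by_speaker: Dict[str, bytes],
                earbud_by_user: Dict[str, Callable[[bytes], None]]) -> None:
    """Send each user's translated speech to the other user's earphone end."""
    user_a, user_b = earbud_by_user  # exactly two users in this chat mode
    earbud_by_user[user_b](target_speech_by_speaker[user_a])  # A's words to B's ear
    earbud_by_user[user_a](target_speech_by_speaker[user_b])  # B's words to A's ear

route_cross(
    {"A": "Hello".encode("utf-8"), "B": "你好".encode("utf-8")},
    {"A": lambda audio: print("A hears:", audio),
     "B": lambda audio: print("B hears:", audio)},
)
```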
Fig. 2A is a flowchart of a translation method implemented based on an earphone according to an embodiment of the present application, which is further optimized and expanded based on the foregoing technical solution, and can be combined with the foregoing optional embodiments. As shown in fig. 2A, the method may include:
S201, establishing a communication connection with the earphone end.
Optionally, the client automatically detects whether the current communication connection with the earphone end is intact; if the connection is interrupted, abnormal-connection prompt information is displayed on the display interface of the client to inform the user that the communication connection between the client and the earphone end has been interrupted. As shown in fig. 2B, fig. 2B is a schematic diagram of an abnormal-connection prompt information interface disclosed in an embodiment of the present application.
S202, acquiring translation configuration information input by a user through a configuration interface of the client, wherein the translation configuration information comprises the source language and the target language corresponding to each earphone end.
In one embodiment, if the user is using the translation function for the first time, or has not used it within a week, the client automatically pops up the configuration interface, and the user inputs the translation configuration information as required, including but not limited to the source language and target language corresponding to each earphone end, the target speech speed, the target speech volume, and the like.
The source language represents the language used by the user, and the target language represents the language the user wants to translate the source language into, for example, if the user uses Chinese and wants to translate into English, the source language is Chinese and the target language is English.
As shown in fig. 2C, fig. 2C is a schematic diagram of a configuration interface disclosed in an embodiment of the present application, wherein 20 denotes configuration scheme I, in which the source language of the left earphone end is "English" and the target language is "Chinese"; correspondingly, the source language of the right earphone end is "Chinese" and the target language is "English". 21 denotes configuration scheme II, in which the source language of the left earphone end is "Chinese" and the target language is "English"; correspondingly, the source language of the right earphone end is "English" and the target language is "Chinese".
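One possible representation of the translation configuration information gathered by this interface, sketched in Python; the field names are illustrative assumptions, not taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class EarbudConfig:
    source_language: str  # language the wearer speaks
    target_language: str  # language the wearer hears

@dataclass
class TranslationConfig:
    left: EarbudConfig
    right: EarbudConfig
    speech_rate: float = 1.0  # target speech speed
    volume: float = 0.8       # target speech volume

# Configuration scheme I of FIG. 2C:
scheme_one = TranslationConfig(
    left=EarbudConfig(source_language="English", target_language="Chinese"),
    right=EarbudConfig(source_language="Chinese", target_language="English"),
)
```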
S203, acquiring a test sound instruction input by a user, transmitting preset test sound audio to each earphone end according to the source language and the target language configured for each earphone end in the current translation configuration information, and playing according to the sequence of the source language and the target language.
In one embodiment, if the user is not using the translation function for the first time and has used it within a week, the client automatically pops up a test sound guide interface. The user inputs a test sound instruction on the test sound guide page, optionally by clicking a "test sound" button; the client then transmits the preset test tone audio to each earphone end according to the source language and target language configured for that earphone end in the current translation configuration information, and plays it in the order of source language first, then target language.
For example, assume that in the current translation configuration information the source language of the left earphone end is "English" and its target language is "Chinese", while the source language of the right earphone end is "Chinese" and its target language is "English". When the user clicks the "test sound" button, the preset test tone audio is transmitted to the left and right earphone ends respectively for playing; for example, the left earphone end plays "This left earbud is for the English-speaking user", and the right earphone end plays a Chinese prompt indicating that the right earbud is to be worn by the Chinese-speaking user.
As another example, assume that the source language of the left earphone end is "Chinese" and its target language is "English", while the source language of the right earphone end is "English" and its target language is "Chinese". When the user clicks the "test sound" button, the preset test tone audio is transmitted to the left and right earphone ends respectively for playing; for example, the left earphone end plays a Chinese prompt indicating that the left earbud is to be worn by the Chinese-speaking user, and the right earphone end plays "This right earbud is for the English-speaking user".
Optionally, after the user inputs the translation configuration information and clicks "save and use", the client transmits the preset test tone audio to each earphone end according to the source language and target language configured for that earphone end in the newly input translation configuration information, and plays it in the order of source language first, then target language.
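A minimal sketch of the test tone playback of S203, reusing the assumed TranslationConfig from the configuration sketch above; test_audio and play are hypothetical stand-ins for the preset test tone store and the earbud transport:

```python
def play_test_tones(config: TranslationConfig, test_audio: dict, play) -> None:
    """Play each earbud's preset test tone in source-then-target language order."""
    for side, cfg in (("left", config.left), ("right", config.right)):
        for language in (cfg.source_language, cfg.target_language):
            play(side, test_audio[(side, language)])

play_test_tones(
    scheme_one,
    test_audio={(s, l): f"test tone ({s}, {l})".encode("utf-8")
                for s in ("left", "right")
                for l in ("English", "Chinese")},
    play=lambda side, audio: print(side, audio),
)
```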
Optionally, after the user inputs the translation configuration information, the "return" button is clicked without clicking the "save use" button, and the client automatically pops up the blocking prompt information, for example, "please save the current translation configuration information" or "does not save the current translation configuration information yet", or the like.
As shown in fig. 2D, fig. 2D is a schematic diagram of a blocking prompt information interface disclosed according to an embodiment of the present application, where after the user clicks "no save", the client holds and returns the translation configuration information input by the previous user, and when the user clicks "i know", the blocking prompt information disappears and stops at the configuration interface.
S204, acquiring the first source speech and the second source speech collected by the two earphone ends respectively, and performing speech recognition on each to obtain a first source text and a second source text.
In one embodiment, the client obtains a first collected source speech from the left earphone end and a second collected source speech from the right earphone end, wherein the source language of the first source speech is the target language of the second source speech, and the target language of the first source speech is the source language of the second source speech.
S205, translating the first source text and the second source text to obtain a first target text and a second target text.
S206, performing voice synthesis on the first target text and the second target text to generate first target voice and second target voice.
S207, displaying the two translated target texts and the two corresponding source texts on the display interface of the client, and transmitting the two target speeches crosswise to the earphone ends that did not collect the corresponding source speeches, for playing.
For example, assume that the source speech collected by the left earphone end is the Chinese sentence "你好，我饿了" with corresponding target speech "Hello, I'm starving", and the source speech collected by the right earphone end is "What do you want to eat" with corresponding target speech "你想吃什么". The client then transmits the target speech "Hello, I'm starving" to the right earphone end and the target speech "你想吃什么" to the left earphone end.
Optionally, "displaying the target text on a display interface of a client" includes:
synchronously displaying the source text and the target text on a display interface of a client based on a set on-screen display rule;
wherein the on-screen display rules include at least one of the following A, B, C and D:
A. the source text and the target text are displayed in different regions of the display interface.
Optionally, if the source text is displayed in the upper area of the display interface, the target text is displayed in the lower area of the display interface. For example, the upper area of the display interface is a white area, the source text is displayed in a black font, the lower area of the display interface is a blue area, and the target text is displayed in a white font. Correspondingly, if the source text is displayed in the lower area of the display interface, the target text is displayed in the upper area of the display interface.
As shown in fig. 2E, fig. 2E is a schematic diagram of a display interface disclosed in an embodiment of the present application, where 22 denotes an area where source text is displayed, and 23 denotes an area where target text is displayed.
B. the source text and the target text are displayed in association in the same area of the display interface.
Optionally, the source text and the target text are displayed in association in the same dialog box in the same region of the display interface.
C. the source text and the target text corresponding to the currently played target speech are displayed distinctively.
Optionally, the source text and target text corresponding to the currently played target speech are displayed in bold and enlarged form to distinguish them.
D. the source texts and target texts input by different users are displayed distinctively.
Optionally, identifiers are added before the source texts and target texts input by different users to distinguish them; for example, a triangular identifier is added before the source text and target text input by user A, and a circular identifier before those input by user B. As another example, a circular identifier is added before the source text and target text input by user A, and no identifier before those input by user B.
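For illustration, a minimal sketch of applying rules B-D to one transcript entry; the markers and text-based emphasis merely stand in for the visual styling described above:

```python
from dataclasses import dataclass

USER_MARKS = {"A": "▲", "B": "●"}  # rule D: per-user identifiers

@dataclass
class TranscriptEntry:
    user: str
    source_text: str
    target_text: str
    playing: bool = False  # rule C: set while this entry's target speech plays

def render(entry: TranscriptEntry) -> str:
    mark = USER_MARKS.get(entry.user, "")
    emphasis = "**" if entry.playing else ""  # stands in for bold/enlarged display
    # Rule B: source and target are shown together in one dialog box;
    # rule A would instead place them in separate regions of the interface.
    return (f"{mark} {emphasis}{entry.source_text}{emphasis}\n"
            f"   {emphasis}{entry.target_text}{emphasis}")

print(render(TranscriptEntry("A", "你好", "Hello", playing=True)))
```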
According to the technical solution of this embodiment, the source text and the target text are displayed synchronously on the display interface of the client based on the set on-screen display rules, so that the user can view both in real time. The first source speech and the second source speech collected by the two earphone ends are acquired respectively and recognized to obtain a first source text and a second source text, and the two translated target texts together with the two corresponding source texts are displayed on the display interface of the client. Acquiring the translation configuration information input by the user through the configuration interface of the client meets the user's requirements for the source and target languages of different earphone ends and improves adaptability to different scenarios. By acquiring the test sound instruction input by the user and transmitting the preset test tone audio to each earphone end according to the source language and target language configured for that earphone end, played in source-then-target order, the correctness of the language played at each earphone end is ensured and the reliability of the translation result is improved.
On the basis of the above embodiment, the method further includes:
and displaying at least two translation modes on the interface of the client, and determining the current translation mode according to the selection of the user on the translation modes.
In one embodiment, a user can select one of at least two translation modes displayed on the interface of the client as a current translation mode according to requirements.
As shown in fig. 2F, fig. 2F is a schematic diagram of a translation mode interface disclosed in an embodiment of the present application, where 24 denotes a "venue real-time listening translation mode", 25 denotes an "earphone-mobile phone interactive translation mode", and 26 denotes a "split-earbud chat mode for two users".
At least two translation modes are displayed on an interface of a client, and the current translation mode is determined according to the selection of a user on the translation mode, so that the effect of providing multiple selectable translation modes for the user is realized, and multiple translation requirements of the user are met.
On the basis of the above embodiment, the method further includes:
and saving the source text and the target text as a history.
In one embodiment, the user clicks a 'save' button on the client display interface, and the client responds to the save operation of the user to save the source text and the target text in the display interface as a history.
In another embodiment, the user clicks an "empty" button on the client display interface, and the client responds to the empty operation of the user to empty the source text and the target text in the display interface and store the empty source text and the target text as a history.
The source text and the target text are stored as a history record, so that the user can view the history record subsequently.
On the basis of the above embodiment, the method further includes:
when an abnormal event is detected, suspending the acquisition operation of the source text and suspending the operation of transmitting the target voice to the earphone end for playing;
wherein the abnormal event includes at least one of A, B, and C:
A. the environmental noise of the source speech collected by the earphone end is larger than the noise threshold value.
In one embodiment, the client detects the environmental noise of the source speech collected by the earphone end according to a preset period, optionally 5 seconds. If the environmental noise is detected to be greater than the noise threshold, indicating that the quality of the source speech is poor, the client suspends the operation of obtaining the source text and the operation of transmitting the target speech to the earphone end for playing, and generates prompt information at the client, for example, "The current environmental noise is high; the translation may be inaccurate" or "Translation may be interrupted due to noise", etc.
B. the user performs a return operation or a clearing operation on the display interface of the client.
In one embodiment, when the user clicks a "return" button or a "clear" button on the display interface of the client, the client suspends the operation of acquiring the source text and the operation of transmitting the target speech to the earphone end for playing, and generates prompt information, for example, "Some information has not been translated yet and may be lost after returning" or "Some information has not been translated yet and may be lost after clearing".
C. the communication connection between the earphone end and the terminal to which the client belongs is disconnected, and/or the communication connection between the two earphone ends is disconnected.
In one embodiment, when the client detects that the communication connection between the earphone end and the terminal to which the client belongs is disconnected, and/or that the communication connection between the two earphone ends is disconnected, it suspends the acquisition of the source text and the transmission of the target speech to the earphone end for playing, and generates prompt information, such as "The earphone has been disconnected and translation has stopped; to continue, please reconnect manually" or "Device connection failed; please retry".
Suspending the acquisition of the source text and the transmission of the target speech to the earphone end for playing when an abnormal event is detected ensures the reliability and accuracy of the translation result.
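A minimal sketch of the abnormal event handling described above; the session object, its probes, and the noise scale are hypothetical assumptions:

```python
from typing import Optional

NOISE_THRESHOLD = 0.6  # assumed scale; the embodiment only requires some threshold

def detect_abnormal_event(noise_level: float,
                          return_or_clear_pressed: bool,
                          earbud_connected: bool) -> Optional[str]:
    """Return a prompt message if any of the abnormal events A-C is present."""
    if noise_level > NOISE_THRESHOLD:
        return "Ambient noise is high; the translation may be inaccurate."
    if return_or_clear_pressed:
        return "Untranslated content may be lost after returning or clearing."
    if not earbud_connected:
        return "Earphone disconnected; translation stopped. Please reconnect."
    return None

def on_poll(session) -> None:
    """Called every preset period (optionally 5 seconds per the embodiment)."""
    message = detect_abnormal_event(session.noise_level(),
                                    session.return_or_clear_pressed(),
                                    session.earbud_connected())
    if message is not None:
        session.pause_source_acquisition()  # suspend obtaining the source text
        session.pause_target_playback()     # suspend sending speech to the earbud
        session.show_prompt(message)
```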
Fig. 3 is a flowchart of a translation method implemented based on a headset according to an embodiment of the present application, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments. As shown in fig. 3, the method may include:
S301, establishing a communication connection with the earphone end.
S302, acquiring a first source speech input by a user and identifying and determining a first source text according to the first source speech.
Optionally, user A wears an earphone end. Assuming the source language of user A is "Chinese", after user A clicks a "Speak Chinese" button in the client, the microphone of the earphone end is turned on; user A speaks into the earphone end, the earphone end collects the first source speech input by user A, and "Please speak Chinese" and "Please speak into the earphone" are displayed on the client interface.
S303, acquiring second source speech input by a user and collected by the client, and determining a second source text by recognizing the second source speech.
Optionally, continuing from S302, assume the source language of user B is "English". After user A finishes speaking, user B clicks a "Speak English" button in the client, the microphone of the terminal to which the client belongs is turned on, user B starts speaking, the client collects the second source speech input by user B, and the client interface displays prompts such as "Speak".
S304, translating the first source text and the second source text to obtain a first target text and a second target text.
S305, performing voice synthesis on the first target text and the second target text to generate target voice.
S306, displaying the first target text and the second target text on a display interface of a client, and transmitting the target voice corresponding to the second source voice to the earphone end for playing.
Optionally, on the basis of the contents described in S302 and S303, the client displays the first target text and the second target text on its display interface, and transmits the target speech corresponding to the second source speech of user B to the earphone end worn by user A for playing, so that user A can listen to the target speech.
According to the technical solution of this embodiment, first source speech input by a user is collected through the earphone end and the first source text is determined by recognition, second source speech input by a user is collected through the client and the second source text is determined by recognition, and finally the target speech corresponding to the second source speech is transmitted to the earphone end for playing, thereby enabling translated interaction between the user wearing the earphone end and the user speaking into the terminal.
Fig. 4 is a flowchart of a translation method implemented based on a headset according to an embodiment of the present application, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments. As shown in fig. 4, the method may include:
S401, establishing a communication connection with the earphone end.
S402, obtaining source speech input by a user through a microphone of a terminal to which the client belongs, and identifying and determining the source text according to the source speech.
In one embodiment, the user clicks a "start listening translation" button in the client, and the client automatically checks the communication connection with the earphone end; if the connection is interrupted, an interruption prompt is displayed in the client. If communication is normal, the microphone sound-receiving function of the terminal to which the client belongs is started, the current source language and target language are displayed at the top of the client interface, and prompt information is displayed, for example, "Translating English into Chinese; please bring the phone microphone close to the English sound source". The user may select a new source language and target language as required, in which case the languages displayed at the top of the client interface and the prompt information are updated accordingly. The sound source may be speech spoken by the user or any sound source containing speech information, such as an audio file or a video file with sound.
S403, translating the source text to obtain a target text.
S404, performing voice synthesis on the target text to generate target voice.
S405, displaying the target text on a display interface of a client, and transmitting the target voice to the earphone end for playing.
Optionally, when the target text is displayed on the display interface of the client, the user may choose to play the target speech of the target text in the client as required, or may select a silent mode in which only the target text is displayed.
Optionally, while the client displays the target text on its display interface and transmits the target speech to the earphone end for playing, the user may tap a "pause" button in the client interface at any time. After it is tapped, the microphone stops receiving sound, the target text not yet displayed continues to be shown on the display interface, and the target speech not yet transmitted continues to be sent to the earphone end. After being tapped, the "pause" button changes into a "start listening translation" button, so the user can tap it again to restart the sound-receiving function of the microphone.
Optionally, after the user clicks the "pause" button, once the target texts have all been displayed on the display interface and the target speeches have all been transmitted to the earphone end, the user may reselect the source language and the target language. After reselecting and starting the listening translation function again, the previous target texts are cleared and saved to the translation history, and the display interface continues with the new target texts.
According to the technical solution of this embodiment, source speech input by the user through the microphone of the terminal to which the client belongs is obtained, the source text is determined by recognizing the source speech, the target text is displayed on the display interface of the client, and the target speech is transmitted to the earphone end for playing, so that the user can translate any sound source to be translated, such as songs or movie lines, anytime and anywhere, which greatly improves the flexibility and convenience of speech translation.
Fig. 5A is a schematic diagram of a translation system implemented based on a headset according to an embodiment of the present application, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments. As shown in fig. 5A, the system may include:
earphone end 50, terminal 51 and cloud server 52, wherein:
the earphone end 50 is used for collecting source speech;
the terminal 51 is configured to obtain the source speech collected by the earphone end 50, and transmit the source speech to the cloud server 52;
the cloud server 52 is configured to perform at least one processing operation of speech recognition, language translation, and speech synthesis on the source speech, and feed back a processing result to the terminal 51.
Optionally, fig. 5B is a schematic diagram of a translation system implemented based on an earphone according to an embodiment of the present disclosure, and includes an earphone end 50, a terminal 51, a long voice interface 52, a voice recognition cloud server 53, a translation cloud server 54, and a voice synthesis cloud server 55, where:
the earphone end 50 is used for collecting source speech;
the terminal 51 is configured to obtain source speech acquired by the earphone end 50, and transmit the source speech to the speech recognition cloud server 53 through the long speech interface 52;
the voice recognition cloud server 53 is configured to perform voice recognition on the source voice to obtain a source text, and transmit the source text to the translation cloud server 54;
the translation cloud server 54 is configured to perform language translation on the source text to obtain a target text corresponding to the source text, and transmit the target text to the speech synthesis cloud server 55;
the voice synthesis cloud server 55 is configured to perform voice synthesis on the target text to obtain a target voice, and feed the target voice back to the terminal 51 through the long voice interface 52.
Optionally, fig. 5C is a schematic diagram of an earphone-based translation system according to an embodiment of the present application, and includes an earphone end 50, a terminal 51, a long voice interface 52, a speech recognition cloud server 53, a translation cloud server 54, a speech synthesis cloud server 55, and a long-connection interface 56, where:
the earphone end 50 is used for collecting source speech;
the terminal 51 is configured to obtain source speech acquired by the earphone end 50, and transmit the source speech to the speech recognition cloud server 53 through the long speech interface 52;
the voice recognition cloud server 53 is configured to perform voice recognition on the source voice to obtain a source text, and feed back the source text to the terminal 51 through the long voice interface 52;
the terminal 51 is configured to transmit the source text to the translation cloud server 54 through the long-connection interface 56;
the translation cloud server 54 is configured to perform language translation on the source text to obtain a target text, and feed the target text back to the terminal 51 through the long-connection interface 56;
the terminal 51 is configured to transmit the target text to the speech synthesis cloud server 55 through the long-connection interface 56;
the speech synthesis cloud server 55 is configured to perform speech synthesis on the target text to generate target speech, and feed the target speech back to the terminal 51 through the long-connection interface 56.
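A minimal sketch of the terminal-mediated flow of fig. 5C, in which the terminal receives every intermediate result and issues the next request itself; the earbud and interface objects are hypothetical stand-ins for the long voice interface 52 and the long-connection interface 56:

```python
def run_fig_5c(earbud, long_voice_iface, long_conn_iface) -> None:
    """Terminal-mediated round trip of FIG. 5C."""
    source_speech = earbud.collect()                       # earphone end 50
    source_text = long_voice_iface.asr(source_speech)      # server 53, via interface 52
    target_text = long_conn_iface.translate(source_text)   # server 54, via interface 56
    target_speech = long_conn_iface.tts(target_text)       # server 55, via interface 56
    earbud.play(target_speech)                             # back to the earphone end
```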
Fig. 6 is a schematic structural diagram of a translation apparatus implemented based on an earphone according to an embodiment of the present application, which may be applied to a case where a user listens to a translation result of a source text through an earphone. The device of the embodiment is configured in the client, can be implemented by software and/or hardware, and can be integrated on any electronic device with computing capability.
As shown in fig. 6, the earphone-based translation apparatus 60 disclosed in this embodiment may include a communication connection establishing module 61, a source text acquiring module 62, a target text acquiring module 63, a speech synthesis module 64, and a target speech processing module 65, where:
a communication connection establishing module 61 for establishing a communication connection with the earphone end;
a source text obtaining module 62, configured to obtain a source text to be translated;
a target text obtaining module 63, configured to translate the source text to obtain a target text;
a speech synthesis module 64, configured to perform speech synthesis on the target text to generate a target speech;
and a target speech processing module 65, configured to display the target text on a display interface of the client, and to transmit the target voice to the earphone end for playing.
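A minimal structural sketch of how these five modules could cooperate in a client, reusing the illustrative stubs above; the class shape and the earphone/display interfaces (connect, record, play, show) are assumptions, not the patent's implementation:

class EarphoneTranslationApparatus:
    """Illustrative counterpart of apparatus 60 in fig. 6."""

    def __init__(self, earphone, display, source_lang: str, target_lang: str):
        self.earphone = earphone
        self.display = display
        self.source_lang = source_lang
        self.target_lang = target_lang
        # Communication connection establishing module 61.
        self.earphone.connect()

    def translate_once(self) -> None:
        # Source text acquisition module 62: speech from the earphone end.
        source_text = recognize(self.earphone.record(), self.source_lang)
        # Target text acquisition module 63.
        target_text = translate(source_text, self.source_lang, self.target_lang)
        # Speech synthesis module 64.
        target_voice = synthesize(target_text, self.target_lang)
        # Target speech processing module 65: show the texts, play the voice.
        self.display.show(source_text, target_text)
        self.earphone.play(target_voice)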
Optionally, the source text acquisition module 62 is specifically configured to perform at least one of:
acquiring source speech collected by the earphone end, and recognizing the source speech to determine the source text;
acquiring source speech input by a user through a microphone of the terminal to which the client belongs, and recognizing the source speech to determine the source text;
and acquiring a source text input by a user through an input interface of the client.
Optionally, the source text acquisition module 62 is further specifically configured to:
performing local speech recognition on the source speech to determine the source text; or
transmitting the source speech to a speech recognition cloud server based on a long voice interface to request speech recognition, and receiving the source text fed back by the speech recognition cloud server.
Optionally, the target text acquisition module 63 is specifically configured to:
performing local language translation on the source text to obtain the target text; or
transmitting the source text to a translation cloud server based on a long connection interface to request language translation, and receiving the target text fed back by the translation cloud server.
Optionally, the speech synthesis module 64 is specifically configured to:
performing local speech synthesis on the target text to generate the target voice; or
transmitting the target text to a speech synthesis cloud server based on a long connection interface to request speech synthesis, and receiving the target voice fed back by the speech synthesis cloud server; a sketch of this local-or-cloud choice, common to all three stages, follows.
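The local-or-cloud decision is identical for recognition, translation, and synthesis, so it can be expressed once as a small helper; here local_impl being None stands for "no on-device engine available" (the names and shape are assumptions):

from typing import Callable, Optional, TypeVar

In_ = TypeVar("In_")
Out = TypeVar("Out")

def run_stage(payload: In_,
              local_impl: Optional[Callable[[In_], Out]],
              cloud_impl: Callable[[In_], Out]) -> Out:
    # Prefer local processing when an on-device engine exists; otherwise send
    # the request over the long voice / long connection interface.
    if local_impl is not None:
        return local_impl(payload)
    return cloud_impl(payload)

For example, run_stage(speech_bytes, None, lambda s: recognize(s, "zh")) would route recognition to the cloud stub defined earlier.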
Optionally, the target speech processing module 65 is specifically configured to:
synchronously display the source text and the target text on the display interface of the client based on a set on-screen display rule (a sketch of these rules follows the list below);
wherein the on-screen display rules include at least one of:
the source text and the target text are displayed in different areas of the display interface;
the source text and the target text are displayed in association in the same area of the display interface;
the source text and the target text corresponding to the target voice currently being played are displayed distinctively;
and the source texts and the target texts input by different users are displayed distinctively.
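A sketch of how a client might apply these rules when rendering the conversation; the rule names and entry fields are assumptions:

from enum import Enum, auto

class OnScreenRule(Enum):
    SEPARATE_AREAS = auto()     # source and target in different areas
    ASSOCIATED = auto()         # source and target paired in one area
    HIGHLIGHT_PLAYING = auto()  # mark the pair whose voice is playing
    PER_USER = auto()           # distinguish entries from different users

def render(entries: list, rules: set) -> str:
    """entries: dicts with 'source', 'target', 'user' and a 'playing' flag."""
    lines = []
    for e in entries:
        if OnScreenRule.ASSOCIATED in rules:
            text = f"{e['source']} | {e['target']}"             # paired in one area
        else:
            text = f"[src] {e['source']}  [tgt] {e['target']}"  # separate areas
        if OnScreenRule.PER_USER in rules:
            text = f"(user {e['user']}) {text}"
        if OnScreenRule.HIGHLIGHT_PLAYING in rules and e.get("playing"):
            text = ">>> " + text
        lines.append(text)
    return "\n".join(lines)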
Optionally, the target speech processing module 65 is further specifically configured to:
transmitting the target voices corresponding to the source texts input by two users to the two earphone ends respectively for playing.
Optionally, the apparatus further includes a saving module, specifically configured to:
saving the source text and the target text as a history record.
Optionally, the apparatus further includes an anomaly detection module, specifically configured to:
suspending, when an abnormal event is detected, the acquisition of the source text and the transmission of the target voice to the earphone end for playing (a sketch of this check follows the list below);
wherein the abnormal event includes at least one of:
the ambient noise of the source speech collected by the earphone end is greater than a noise threshold;
the user performs a back operation or a clear operation on the display interface of the client;
and the communication connection between the earphone end and the terminal to which the client belongs is disconnected, and/or the communication connection between the two earphone ends is disconnected.
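The detection side of this module reduces to checking whether any listed event holds; a sketch in which the threshold value and flag names are assumptions:

NOISE_THRESHOLD = 0.5  # illustrative value; the disclosure only names a threshold

def abnormal_event_detected(ambient_noise: float,
                            back_or_clear_pressed: bool,
                            earphone_to_terminal_connected: bool,
                            earphones_interconnected: bool) -> bool:
    # Any single event suffices to suspend source-text acquisition and
    # target-voice playback.
    return (ambient_noise > NOISE_THRESHOLD
            or back_or_clear_pressed
            or not earphone_to_terminal_connected
            or not earphones_interconnected)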
Optionally, the source text acquisition module 62 is further specifically configured to:
respectively acquiring a first source speech and a second source speech collected by two earphone ends, and respectively performing speech recognition on the first source speech and the second source speech to obtain a first source text and a second source text;
correspondingly, the target speech processing module 65 is further specifically configured to:
displaying the two target texts obtained by translation and the two corresponding source texts on the display interface of the client, and cross-transmitting the two target voices, each to the earphone end that did not collect the corresponding source speech, for playing.
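The cross-transmission amounts to swapping playback targets between the two earphone ends; a sketch assuming each earphone object exposes play():

def cross_transmit(earphone_a, earphone_b,
                   voice_from_a: bytes, voice_from_b: bytes) -> None:
    # voice_from_a is the target voice translated from speech collected by
    # earphone A, so it plays on earphone B, and vice versa.
    earphone_b.play(voice_from_a)
    earphone_a.play(voice_from_b)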
Optionally, the apparatus further includes a configuration information acquisition module, specifically configured to:
acquiring translation configuration information input by a user through a configuration interface of the client, wherein the translation configuration information comprises the source language and the target language corresponding to each earphone end.
Optionally, the apparatus further includes a test tone instruction acquisition module, specifically configured to:
acquiring a test tone instruction input by a user, and transmitting preset test audio to each earphone end according to the source language and the target language configured for that earphone end in the current translation configuration information, to be played in the order of the source language first and then the target language.
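A sketch of the test-tone playback, assuming the preset test audio is keyed by language code and each earphone object exposes play():

def play_test_tones(earphone_langs: dict, test_audio: dict) -> None:
    """earphone_langs: {earphone: (source_lang, target_lang)} taken from the
    current translation configuration; test_audio: {lang_code: audio bytes}."""
    for earphone, (source_lang, target_lang) in earphone_langs.items():
        earphone.play(test_audio[source_lang])  # source language first
        earphone.play(test_audio[target_lang])  # then the target language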
Optionally, the source text acquisition module 62 is further specifically configured to:
acquiring a first source speech input by a user, and recognizing the first source speech to determine a first source text;
acquiring a second source speech input by a user and collected by the client, and recognizing the second source speech to determine a second source text;
correspondingly, the target speech processing module 65 is further specifically configured to:
transmitting the target voice corresponding to the second source speech to the earphone end for playing.
The earphone-based translation apparatus 60 disclosed in this embodiment of the present application can execute the earphone-based translation method disclosed in the embodiments of the present application, and has the functional modules and beneficial effects corresponding to that method. For details not explicitly described in this embodiment, reference may be made to the description in any method embodiment of the present application.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device for the earphone-based translation method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 7, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the earphone-based translation method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the earphone-based translation method provided herein.
The memory 702, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the earphone-based translation method in the embodiments of the present application (for example, the communication connection establishing module 61, the source text acquisition module 62, the target text acquisition module 63, the speech synthesis module 64, and the target speech processing module 65 shown in fig. 6). The processor 701 executes various functional applications and data processing of the server, i.e., implements the earphone-based translation method in the above method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 702.
The memory 702 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the electronic device for the earphone-based translation method, and the like. Further, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 702 may optionally include memory located remotely from the processor 701, which may be connected over a network to the electronic device for the earphone-based translation method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the earphone-based translation method may further include an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703, and the output device 704 may be connected by a bus or in other manners; fig. 7 takes a bus connection as an example.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the earphone-based translation method, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or another input device. The output device 704 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services.
According to the technical solution of the embodiments of the present application, a communication connection is established with the earphone end, the source text to be translated is translated, speech synthesis is performed on the translated target text to generate a target voice, and finally the target text is displayed on the display interface of the client while the target voice is transmitted to the earphone end for playing. A user can thus view the translated target text in the client and listen to the corresponding target voice through the earphone end, achieving the technical effect of convenient and timely speech translation.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this respect as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (31)

1. An earphone-based translation method, performed by a client, the method comprising:
establishing a communication connection with the earphone end;
acquiring a source text to be translated;
translating the source text to obtain a target text;
performing speech synthesis on the target text to generate a target voice;
and displaying the target text on a display interface of a client, and transmitting the target voice to the earphone end for playing.
2. The method of claim 1, wherein acquiring the source text to be translated comprises at least one of:
acquiring source speech collected by an earphone end, and recognizing the source speech to determine the source text;
acquiring source speech input by a user through a microphone of a terminal to which the client belongs, and recognizing the source speech to determine the source text;
and acquiring a source text input by a user through an input interface of the client.
3. The method of claim 2, wherein recognizing the source speech to determine the source text comprises:
performing local speech recognition on the source speech to determine the source text; or
transmitting the source speech to a speech recognition cloud server based on a long voice interface to request speech recognition, and receiving the source text fed back by the speech recognition cloud server.
4. The method of claim 1, wherein translating the source text to obtain target text comprises:
performing local language translation on the source text to obtain a target text; or
transmitting the source text to a translation cloud server based on a long connection interface to request language translation, and receiving the target text fed back by the translation cloud server.
5. The method of claim 1, wherein performing speech synthesis on the target text to generate the target voice comprises:
performing local speech synthesis on the target text to generate the target voice; or
transmitting the target text to a speech synthesis cloud server based on a long connection interface to request speech synthesis, and receiving the target voice fed back by the speech synthesis cloud server.
6. The method of claim 1, wherein displaying the target text on a display interface of a client comprises:
synchronously displaying the source text and the target text on a display interface of a client based on a set on-screen display rule;
wherein the on-screen display rules include at least one of:
the source text and the target text are displayed in different areas of the display interface;
the source text and the target text are displayed in association in the same area of the display interface;
the source text and the target text corresponding to the target voice currently being played are displayed distinctively;
and the source texts and the target texts input by different users are displayed distinctively.
7. The method of claim 1, wherein transmitting the target voice to the earphone end for playing comprises:
transmitting the target voices corresponding to the source texts input by two users to the two earphone ends respectively for playing.
8. The method of claim 1, wherein the method further comprises:
displaying at least two translation modes on the interface of the client, and determining the current translation mode according to the user's selection among the translation modes.
9. The method of claim 1, wherein the method further comprises:
saving the source text and the target text as a history record.
10. The method of claim 1, wherein the method further comprises:
when an abnormal event is detected, suspending the acquisition of the source text and suspending the transmission of the target voice to the earphone end for playing;
wherein the abnormal event comprises at least one of:
the ambient noise of the source speech collected by the earphone end is greater than a noise threshold;
the user performs a back operation or a clear operation on the display interface of the client;
and the communication connection between the earphone end and the terminal to which the client belongs is disconnected, and/or the communication connection between the two earphone ends is disconnected.
11. The method of claim 1, wherein:
the acquiring of the source text to be translated includes: respectively acquiring a first source speech and a second source speech collected by two earphone ends, and respectively performing speech recognition on the first source speech and the second source speech to obtain a first source text and a second source text;
correspondingly, the displaying the target text on a display interface of a client and the transmitting the target voice to the earphone end for playing include:
displaying the two target texts obtained by translation and the two corresponding source texts on the display interface of the client, and cross-transmitting the two target voices, each to the earphone end that did not collect the corresponding source speech, for playing.
12. The method of claim 11, wherein before acquiring the source text to be translated, the method further comprises:
acquiring translation configuration information input by a user through a configuration interface of the client, wherein the translation configuration information comprises the source language and the target language corresponding to each earphone end.
13. The method of claim 12, wherein before acquiring the source text to be translated, the method further comprises:
acquiring a test tone instruction input by a user, and transmitting preset test audio to each earphone end according to the source language and the target language configured for that earphone end in the current translation configuration information, to be played in the order of the source language first and then the target language.
14. The method of claim 1, wherein acquiring the source text to be translated comprises:
acquiring a first source speech input by a user, and recognizing the first source speech to determine a first source text;
acquiring a second source speech input by a user and collected by the client, and recognizing the second source speech to determine a second source text;
correspondingly, transmitting the target voice to the earphone end for playing comprises:
transmitting the target voice corresponding to the second source speech to the earphone end for playing.
15. An earphone-based translation apparatus, configured in a client, the apparatus comprising:
the communication connection establishing module is used for establishing communication connection with the earphone end;
the source text acquisition module is used for acquiring a source text to be translated;
the target text acquisition module is used for translating the source text to acquire a target text;
the voice synthesis module is used for carrying out voice synthesis on the target text to generate target voice;
and the target speech processing module is used for displaying the target text on a display interface of a client and transmitting the target voice to the earphone end for playing.
16. The apparatus of claim 15, wherein the source text acquisition module is specifically configured to perform at least one of:
acquiring source speech collected by an earphone end, and recognizing the source speech to determine the source text;
acquiring source speech input by a user through a microphone of a terminal to which the client belongs, and recognizing the source speech to determine the source text;
and acquiring a source text input by a user through an input interface of the client.
17. The apparatus of claim 16, wherein the source text acquisition module is further specifically configured to:
performing local speech recognition on the source speech to determine the source text; or
transmitting the source speech to a speech recognition cloud server based on a long voice interface to request speech recognition, and receiving the source text fed back by the speech recognition cloud server.
18. The apparatus of claim 15, wherein the target text acquisition module is specifically configured to:
performing local language translation on the source text to obtain a target text; or
transmitting the source text to a translation cloud server based on a long connection interface to request language translation, and receiving the target text fed back by the translation cloud server.
19. The apparatus of claim 15, wherein the speech synthesis module is specifically configured to:
performing local speech synthesis on the target text to generate the target voice; or
transmitting the target text to a speech synthesis cloud server based on a long connection interface to request speech synthesis, and receiving the target voice fed back by the speech synthesis cloud server.
20. The apparatus according to claim 15, wherein the target speech processing module is specifically configured to:
synchronously displaying the source text and the target text on a display interface of a client based on a set on-screen display rule;
wherein the on-screen display rules include at least one of:
the source text and the target text are displayed in different areas of the display interface;
the source text and the target text are displayed in association in the same area of the display interface;
the source text and the target text corresponding to the target voice currently being played are displayed distinctively;
and the source texts and the target texts input by different users are displayed distinctively.
21. The apparatus of claim 15, wherein the target speech processing module is further specifically configured to:
transmitting the target voices corresponding to the source texts input by two users to the two earphone ends respectively for playing.
22. The apparatus according to claim 15, wherein the apparatus further comprises a translation mode selection module, specifically configured to:
displaying at least two translation modes on the interface of the client, and determining the current translation mode according to the user's selection among the translation modes.
23. The apparatus according to claim 15, wherein the apparatus further comprises a saving module, specifically configured to:
saving the source text and the target text as a history record.
24. The apparatus according to claim 15, wherein the apparatus further comprises an anomaly detection module, specifically configured to:
suspending, when an abnormal event is detected, the acquisition of the source text and the transmission of the target voice to the earphone end for playing;
wherein the abnormal event comprises at least one of:
the ambient noise of the source speech collected by the earphone end is greater than a noise threshold;
the user performs a back operation or a clear operation on the display interface of the client;
and the communication connection between the earphone end and the terminal to which the client belongs is disconnected, and/or the communication connection between the two earphone ends is disconnected.
25. The apparatus of claim 15, wherein:
the source text acquisition module is specifically configured to:
respectively acquiring a first source speech and a second source speech collected by two earphone ends, and respectively performing speech recognition on the first source speech and the second source speech to obtain a first source text and a second source text;
correspondingly, the target speech processing module is further specifically configured to:
displaying the two target texts obtained by translation and the two corresponding source texts on the display interface of the client, and cross-transmitting the two target voices, each to the earphone end that did not collect the corresponding source speech, for playing.
26. The apparatus according to claim 25, wherein the apparatus further comprises a configuration information acquisition module, specifically configured to:
acquiring translation configuration information input by a user through a configuration interface of the client, wherein the translation configuration information comprises the source language and the target language corresponding to each earphone end.
27. The apparatus according to claim 26, wherein the apparatus further includes a test tone instruction acquisition module, specifically configured to:
acquiring a test tone instruction input by a user, and transmitting preset test audio to each earphone end according to the source language and the target language configured for that earphone end in the current translation configuration information, to be played in the order of the source language first and then the target language.
28. The apparatus of claim 15, wherein the source text acquisition module is further specifically configured to:
acquiring a first source speech input by a user, and recognizing the first source speech to determine a first source text;
acquiring a second source speech input by a user and collected by the client, and recognizing the second source speech to determine a second source text;
correspondingly, the target speech processing module is further specifically configured to:
transmitting the target voice corresponding to the second source speech to the earphone end for playing.
29. An earphone-based translation system, comprising an earphone end, a terminal, and a cloud server, wherein the terminal is in communication connection with the earphone end and the cloud server;
the earphone end is used for collecting source speech;
the terminal is used for acquiring the source speech collected by the earphone end and transmitting the source speech to the cloud server;
and the cloud server is used for performing at least one of speech recognition, language translation, and speech synthesis on the source speech, and feeding a processing result back to the terminal.
30. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the earphone-based translation method of any one of claims 1-14.
31. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the earphone-based translation method of any one of claims 1-14.
CN202010682940.9A 2020-07-15 2020-07-15 Earphone-based translation method, device, system, equipment and storage medium Pending CN111862940A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010682940.9A CN111862940A (en) 2020-07-15 2020-07-15 Earphone-based translation method, device, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111862940A true CN111862940A (en) 2020-10-30

Family

ID=72984036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010682940.9A Pending CN111862940A (en) 2020-07-15 2020-07-15 Earphone-based translation method, device, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111862940A (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978980A (en) * 2015-07-03 2015-10-14 上海斐讯数据通信技术有限公司 Method for controlling sound playing and sound playing system
CN105893019A (en) * 2015-12-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 User manual calling-out method and apparatus for smartphone
CN105979421A (en) * 2016-06-24 2016-09-28 陈灿伟 Bluetooth headset based on simultaneous interpretation and simultaneous interpretation system using the same
CN106060643A (en) * 2016-06-28 2016-10-26 乐视控股(北京)有限公司 Method and device for playing multimedia file and earphones
CN206639220U (en) * 2017-01-05 2017-11-14 陈伯妤 A kind of portable simultaneous interpretation equipment
CN107168959A (en) * 2017-05-17 2017-09-15 深圳市沃特沃德股份有限公司 Interpretation method and translation system
CN107832309A (en) * 2017-10-18 2018-03-23 广东小天才科技有限公司 A kind of method, apparatus of language translation, wearable device and storage medium
CN108090051A (en) * 2017-12-20 2018-05-29 深圳市沃特沃德股份有限公司 The interpretation method and translator of continuous long voice document
CN108509428A (en) * 2018-02-26 2018-09-07 深圳市百泰实业股份有限公司 Earphone interpretation method and system
CN209103286U (en) * 2018-02-26 2019-07-12 深圳市百泰实业股份有限公司 Earphone translation system
CN109376363A (en) * 2018-09-04 2019-02-22 出门问问信息科技有限公司 A kind of real-time voice interpretation method and device based on earphone
CN110457716A (en) * 2019-07-22 2019-11-15 维沃移动通信有限公司 A kind of speech output method and mobile terminal

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685115A (en) * 2020-12-29 2021-04-20 平安普惠企业管理有限公司 International cue language generating method, system, computer equipment and storage medium
CN112820272A (en) * 2021-01-27 2021-05-18 上海淇玥信息技术有限公司 Instant multi-language translation method and device and electronic equipment
CN113286217A (en) * 2021-04-23 2021-08-20 北京搜狗智能科技有限公司 Call voice translation method and device and earphone equipment
CN113362818A (en) * 2021-05-08 2021-09-07 山西三友和智慧信息技术股份有限公司 Voice interaction guidance system and method based on artificial intelligence
CN113377276A (en) * 2021-05-19 2021-09-10 深圳云译科技有限公司 System, method and device for quick recording and translation, electronic equipment and storage medium
CN113345440A (en) * 2021-06-08 2021-09-03 北京有竹居网络技术有限公司 Signal processing method, device and equipment and Augmented Reality (AR) system
CN113641439A (en) * 2021-08-16 2021-11-12 百度在线网络技术(北京)有限公司 Text recognition and display method, device, electronic equipment and medium
CN113641439B (en) * 2021-08-16 2023-08-29 百度在线网络技术(北京)有限公司 Text recognition and display method, device, electronic equipment and medium
CN113709558A (en) * 2021-10-09 2021-11-26 立讯电子科技(昆山)有限公司 Multimedia processing method and multimedia interaction system
CN114095906A (en) * 2021-10-14 2022-02-25 华为技术有限公司 Short-distance communication method and system

Similar Documents

Publication Publication Date Title
CN111862940A (en) Earphone-based translation method, device, system, equipment and storage medium
KR102108500B1 (en) Supporting Method And System For communication Service, and Electronic Device supporting the same
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN112533041A (en) Video playing method and device, electronic equipment and readable storage medium
CN110675873B (en) Data processing method, device and equipment of intelligent equipment and storage medium
CN104598443B (en) Language service providing method, apparatus and system
CN110501918B (en) Intelligent household appliance control method and device, electronic equipment and storage medium
CN110580904A (en) Method and device for controlling small program through voice, electronic equipment and storage medium
CN112382279B (en) Voice recognition method and device, electronic equipment and storage medium
CN111755002B (en) Speech recognition device, electronic apparatus, and speech recognition method
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
KR20220011083A (en) Information processing method, device, electronic equipment and storage medium in user dialogue
KR20210033873A (en) Speech recognition control method, apparatus, electronic device and readable storage medium
CN109195016B (en) Intelligent terminal equipment-oriented voice interaction method and terminal system for video barrage and intelligent terminal equipment
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN110600039B (en) Method and device for determining speaker attribute, electronic equipment and readable storage medium
CN112382294A (en) Voice recognition method and device, electronic equipment and storage medium
CN113810814B (en) Earphone mode switching control method and device, electronic equipment and storage medium
JP7331044B2 (en) Information processing method, device, system, electronic device, storage medium and computer program
CN113160782B (en) Audio processing method and device, electronic equipment and readable storage medium
CN112037794A (en) Voice interaction method, device, equipment and storage medium
CN113556649A (en) Broadcasting control method and device of intelligent sound box
CN111986682A (en) Voice interaction method, device, equipment and storage medium
US20230344669A1 (en) Remote control method and apparatus, electronic device and medium
JP2015036826A (en) Communication processor, communication processing method and communication processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210517

Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Applicant after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.