CN114239613B - Real-time voice translation method, device, equipment and storage medium - Google Patents

Real-time voice translation method, device, equipment and storage medium

Info

Publication number
CN114239613B
Authority
CN
China
Prior art keywords
language text, time, target language, real, moment
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN202210164989.4A
Other languages
Chinese (zh)
Other versions
CN114239613A (en)
Inventor
葛正晗
罗维
黄忠强
Current Assignee (the listed assignees may be inaccurate)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority claimed from CN202210164989.4A
Publication of CN114239613A
Application granted
Publication of CN114239613B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/40 — Processing or translation of natural language
    • G06F 40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/26 — Speech to text systems

Abstract

The disclosure relates to a real-time speech translation method, apparatus, device and storage medium. A first prefix portion of the target language text at a historical time is obtained, and speech recognition is performed on the original language speech acquired at the current time to obtain the original language text at the current time. The original language text at the current time is then translated into the target language text at the current time according to the first prefix portion, which is kept unchanged in the target language text at the current time. In other words, the translation of the current original language text is produced from the current original language text together with a fixed-length prefix of the translation at the historical time, which guarantees that this prefix of the current translation is consistent with the corresponding prefix of the historical translation and thereby effectively alleviates the instability of already-output translations in real-time speech translation.

Description

Real-time voice translation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of information technology, and in particular, to a real-time speech translation method, apparatus, device, and storage medium.
Background
In real-time speech translation, the original language speech needs to be recognized as the original language text in real time, and then the original language text needs to be translated into the target language text in real time.
However, the inventors of the present application found that, as time goes on, more content is spoken in the original language and the recognized original language text grows accordingly, so the target language texts translated at different times can differ greatly from one another, degrading the user experience.
Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, the present disclosure provides a real-time speech translation method, apparatus, device and storage medium that keep a fixed-length prefix of the translation at the current time consistent with the corresponding prefix of the translation at the historical time, thereby effectively alleviating the instability of already-output translations in real-time speech translation.
In a first aspect, an embodiment of the present disclosure provides a real-time speech translation method, where the method includes:
acquiring real-time voice;
performing real-time voice recognition on the real-time voice to obtain a real-time original language text;
and translating the real-time original language text into a real-time target language text according to a first prefix part in the target language text translated at the historical moment, wherein the first prefix part is kept unchanged in the real-time target language text.
In a second aspect, an embodiment of the present disclosure provides a real-time subtitle generating method, where the method includes:
acquiring real-time voice;
performing real-time voice recognition on the real-time voice to obtain a real-time original language text;
translating the real-time original language text into a real-time target language text according to a first prefix part in the target language text translated at a historical moment, wherein the first prefix part is kept unchanged in the real-time target language text;
and generating a real-time caption according to the real-time original language text and the real-time target language text, wherein the real-time caption comprises at least one of the real-time original language text and the real-time target language text.
In a third aspect, an embodiment of the present disclosure provides a real-time translation method, including:
acquiring a first prefix part in a target language text at a historical moment;
performing voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment;
and translating the original language text at the current moment into the target language text at the current moment according to the first prefix part, wherein the first prefix part is kept unchanged in the target language text at the current moment.
In a fourth aspect, an embodiment of the present disclosure provides a real-time translation apparatus, including:
the acquisition module is used for acquiring a first prefix part in a target language text at a historical moment;
the voice recognition module is used for carrying out voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment;
and the translation module is used for translating the original language text at the current moment into the target language text at the current moment according to the first prefix part, and the first prefix part is kept unchanged in the target language text at the current moment.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of the first, second or third aspect.
In a sixth aspect, the disclosed embodiments provide a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the method of the first, second or third aspect.
According to the real-time speech translation method, apparatus, device and storage medium provided above, a first prefix portion of the target language text at a historical time is obtained, and speech recognition is performed on the original language speech acquired at the current time to obtain the original language text at the current time. The original language text at the current time is then translated into the target language text at the current time according to the first prefix portion, which is kept unchanged in the target language text at the current time. In other words, the translation of the current original language text is produced from the current original language text together with a fixed-length prefix of the translation at the historical time, which guarantees that this prefix of the current translation is consistent with the corresponding prefix of the historical translation and thereby effectively alleviates the instability of already-output translations in real-time speech translation.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a flowchart of a real-time translation method provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of a real-time translation method provided by another embodiment of the present disclosure;
FIG. 3 is a flowchart of a real-time speech translation method provided by an embodiment of the present disclosure;
fig. 4 is a flowchart of a real-time subtitle generating method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a real-time translation apparatus provided in an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a real-time speech translation apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a real-time subtitle generating apparatus according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Speech translation refers to Artificial Intelligence (AI) technology that translates the content of speech into another language; real-time speech translation usually integrates Automatic Speech Recognition (ASR) technology and Machine Translation (MT) technology. ASR recognizes the original language speech as original language text in real time, and MT translates the original language text into target language text in real time.
However, as time goes on, more content is spoken in the original language and the recognized original language text grows accordingly, so the target language texts obtained by translation at different times can change greatly, affecting the user experience.
The general idea of implementing real-time speech translation is that, within the same sentence, each time a new ASR result is output it is re-translated in full, where the new ASR result refers to the latest original language text output by ASR. For example, the correspondence between new ASR results and translation results is shown in Table 1 below.
TABLE 1

New ASR result | Translation result
In one year | In a year
Before one year | A year ago
Clouds one year ago | In the cloud a year ago
In a cloud-dwelling congress a year ago, | At the Yunqi Conference a year ago,
In a cloud-inhabited congress before one year, I | At the Yunqi Conference a year ago, I
In the cloud-dwelling congress of one year ago, we announced | At the Yunqi Conference a year ago, we announced
On a cloud-dwelling congress one year ago, we announced the DAMO Academy | We announced the DAMO Academy at the Yunqi Conference a year ago
As can be seen from Table 1, the original language text is largely append-only: earlier output usually remains fixed as the context for later output. For example, "In one year" is the context relative to "Before one year", and "Before one year" is the context relative to "Clouds one year ago". The translation results, however, keep changing in both word order and word choice. In some scenarios, the earlier part of the original text can also change: for example, the ASR result at time t1 is "cloud rise a year ago" while the ASR result at time t2 is "at the Yunqi Conference a year ago"; that is, once ASR recognizes "Conference", "cloud rise" is revised to "Yunqi", so the context changes as well, which further aggravates the change in the translation. If the client displays the translation result in real time, the continual changes in wording and word order seriously affect the user experience.
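The instability shown in Table 1 can be quantified by counting how many leading words consecutive translations share. A minimal sketch, with the word-level comparison and the sample strings being illustrative rather than part of the patented method:

```python
def common_prefix_len(prev: str, curr: str) -> int:
    """Number of leading words shared by two successive translations."""
    n = 0
    for a, b in zip(prev.split(), curr.split()):
        if a != b:
            break
        n += 1
    return n

# Successive translation results from Table 1 (naive full re-translation).
translations = [
    "In a year",
    "A year ago",
    "In the cloud a year ago",
    "At the Yunqi Conference a year ago,",
]

# Every consecutive pair shares zero leading words: nothing the user has
# already read is stable under naive re-translation.
for prev, curr in zip(translations, translations[1:]):
    print(common_prefix_len(prev, curr))
```

Running this prints 0 for every pair, which is exactly the flickering-subtitle problem the method below addresses.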
To solve this problem, embodiments of the present disclosure provide a real-time translation method, which is described below with reference to specific embodiments.
Fig. 1 is a flowchart of a real-time translation method provided in an embodiment of the present disclosure. The method can be executed by a real-time translation apparatus, which can be implemented in software and/or hardware, and the apparatus can be configured in an electronic device, such as a server or a terminal, where the terminal specifically includes a mobile phone, a computer, or a tablet computer. As shown in fig. 1, the method comprises the following specific steps:
s101, acquiring a first prefix part in a target language text at historical time.
For example, the source language is Chinese and the target language is English. The ASR result at time t1 is "all have no problem", and the corresponding translation result is "There is no problem". Here, time t1 is taken as the historical time, and "There is no problem" is the target language text at the historical time. The first n characters of "There is no problem" are used as the first prefix portion of the target language text, where the value of n is not particularly limited; for example, the whole of "There is no problem" may be taken as the first prefix portion.
S102, performing voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment.
For example, the current time is denoted as time t2, and speech recognition is performed on the original language speech acquired at the current time to obtain the original language text at the current time; for example, the ASR result at time t2 is "all have no problem, our babies".
S103, according to the first prefix part, translating the original language text at the current moment into the target language text at the current moment, wherein the first prefix part is kept unchanged in the target language text at the current moment.
At this time, the ASR result at time t2 may be translated into the target language text at time t2 according to the first prefix portion, i.e., "There is no problem"; for example, the target language text at time t2 is "There is no problem, our fans". That is, the first prefix portion "There is no problem" remains unchanged in the target language text at time t2, so the first n characters of the target language text at time t2 are kept consistent with the first n characters of the target language text at time t1, and the target language text at time t2 is prevented from changing greatly compared with the target language text at time t1.
In this embodiment, the specific translation process may be implemented by a translation model, whose input includes not only the ASR result to be translated but also the first prefix portion, and whose output is the target language text. For example, when "all have no problem, our babies" is translated at time t2, the input of the translation model includes the original text "all have no problem, our babies" and the first prefix portion "There is no problem"; the output of the translation model is the translation result "There is no problem, our fans", in which the first prefix portion is kept unchanged.
The embodiment of the disclosure introduces the Interactive Machine Translation (IMT) concept into the real-time speech translation scene. The IMT calling mode is as follows: the inputs of the translation model include the prefix, the original text (i.e., the original language text), the original language and the target language; the output of the translation model is a translation whose prefix is consistent with the prefix input to the model. The original calling mode is: the inputs of the translation model include the original text, the original language and the target language, and the output is the translation. Compared with the original calling mode, the IMT calling mode not only enriches the input of the translation model but also guarantees the stability of the translated text. Therefore, this embodiment can keep the translation model itself unchanged and only improve the decoding part, for example by performing incremental decoding under the constraint that the prefix of the translation stays fixed.
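Keeping the translation model unchanged and constraining only the decoder, as described above, can be sketched as follows. The `model_step` callable here is a hypothetical stand-in for one decoding step of a real translation model; only the prefix-forcing logic reflects the IMT calling mode:

```python
def imt_translate(model_step, source_tokens, prefix_tokens, max_len=50, eos="</s>"):
    """Prefix-constrained (IMT-style) decoding sketch.

    model_step(source_tokens, output_so_far) -> next target token.
    The first len(prefix_tokens) output positions are forced to the given
    translation prefix; decoding then continues incrementally from there.
    """
    output = []
    output.extend(prefix_tokens)          # force the translation prefix verbatim
    while len(output) < max_len:          # incremental decoding after the prefix
        tok = model_step(source_tokens, output)
        if tok == eos:
            break
        output.append(tok)
    return output
```

By construction, the returned translation always starts with `prefix_tokens`, which is exactly the stability guarantee the IMT calling mode provides.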
According to the embodiment of the disclosure, a first prefix portion of the target language text at a historical time is obtained, and speech recognition is performed on the original language speech acquired at the current time to obtain the original language text at the current time. The original language text at the current time is then translated into the target language text at the current time according to the first prefix portion, which is kept unchanged in the target language text at the current time. In other words, the translation of the current original language text is produced from the current original language text together with a fixed-length prefix of the translation at the historical time, which guarantees that this prefix of the current translation is consistent with the corresponding prefix of the historical translation and thereby effectively alleviates the instability of already-output translations in real-time speech translation.
Fig. 2 is a flowchart of a real-time translation method according to another embodiment of the disclosure. In this embodiment, the method specifically includes the following steps:
s201, when the length of the target language text at the historical time is larger than a first threshold value, acquiring a first prefix part in the target language text at the historical time.
This embodiment involves a first threshold and a second threshold. The first threshold, denoted b (the block size), applies to the length of the translation: when the translation length does not reach the first threshold, no prefix portion of the translation is fixed; otherwise the prefix portion is determined in combination with the second threshold, its length being computed from the translation length and the second threshold. The second threshold, denoted k, represents the length of the incremental (still mutable) portion. For example, b = 10 and k = 5.
For example, the original language text at time t1 is "all have no problem", and the translation model may translate it into "There is no problem", i.e., the target language text at time t1 is "There is no problem". Since the length of the target language text at time t1 is 4 (the total number of characters) and 4 is less than b = 10, no prefix portion of the translation is fixed; that is, the translation prefix input to the translation model at the next time t2 remains empty, so at time t2 the input of the translation model includes only the original language text at time t2.
For example, the original language text at time t2 is "all have no problem, our babies", and the translation model may translate it into "There is no problem, our fans", i.e., the target language text at time t2 is "There is no problem, our fans". Since the length of the target language text at time t2 is 6 and 6 is less than b = 10, no prefix portion of the translation is fixed; that is, the translation prefix input to the translation model at the next time t3 remains empty, so at time t3 the input of the translation model includes only the original language text at time t3.
For example, the original language text at time t3 is "all have no problem, our babies with any question", and the translation model may translate it into "There is no problem, our fans, if there is any problem", i.e., the target language text at time t3 is "There is no problem, our fans, if there is any problem". Since the length of the target language text at time t3 is 11 and 11 is greater than b = 10, the prefix portion of the translation is determined to be the first 11 - 5 = 6 characters of the target language text at time t3; that is, the translation prefix input to the translation model at the next time t4 is "There is no problem, our fans". At time t4, the input of the translation model therefore includes not only the original language text at time t4 but also the translation prefix "There is no problem, our fans". The characters in this embodiment may be understood as words.
For example, time t1, time t2 and time t3 are each historical times. The lengths of the target language texts at times t1 and t2 are smaller than the first threshold, while the length of the target language text at time t3 is larger than the first threshold, so the first 6 characters, namely "There is no problem, our fans", are obtained from the target language text at time t3 as the first prefix portion.
Optionally, the first character of the first prefix portion is the same as the first character of the target language text at the historical time, and the length of the first prefix portion is determined by the length of the target language text at the historical time and a second threshold.
For example, the first character of the first prefix portion is the same as the first character of the target language text at time t3, i.e., the first prefix portion is counted from the first character of the target language text at time t 3. In addition, the length of the first prefix portion is determined by the length of the text in the target language at time t3 and a second threshold value.
Optionally, the length of the first prefix portion is a difference between the length of the target language text at the historical time and a second threshold.
For example, the length of the target language text at time t3 is 11 and the second threshold is 5, so the length of the first prefix portion is the difference between 11 and 5, i.e., 6.
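Under the example values b = 10 and k = 5, the prefix-selection rule above can be sketched as follows. Word-level tokenization via `split()` is an assumption for illustration (the embodiment itself notes that "characters" may be understood as words):

```python
def translation_prefix(target_text: str, b: int = 10, k: int = 5) -> str:
    """Return the translation prefix to fix for the next time step.

    b: first threshold (block size); no prefix is fixed until the
       translation is longer than b.
    k: second threshold; the trailing k words stay free to change.
    """
    words = target_text.split()
    if len(words) <= b:
        return ""                          # translation too short: prefix stays empty
    return " ".join(words[: len(words) - k])
```

For a translation of length 11 this fixes the first 11 - 5 = 6 words, matching the time-t3 example.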
S202, performing voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment.
For example, the current time is time t4; speech recognition is performed on the original language speech acquired at the current time to obtain the original language text at the current time, and the original language text at time t4 is "all have no problem, our babies with any question can tell me directly".
S203, according to the first prefix part, translating the original language text at the current moment into the target language text at the current moment, wherein the first prefix part is kept unchanged in the target language text at the current moment.
For example, the original language text at time t4 is "all have no problem, our babies with any question can tell me directly". According to the translation prefix "There is no problem, our fans", the translation model may translate it into "There is no problem, our fans, if there is any problem, you can tell me directly", i.e., the target language text at time t4, in which the translation prefix "There is no problem, our fans" remains unchanged. It can be understood that, since only the first 6 characters of the target language text at time t3 were determined as the translation prefix, the portion "if there is any problem" of "There is no problem, our fans, if there is any problem" was not fixed as part of the prefix; therefore "if there is any problem" may remain unchanged in the target language text at time t4, or may change.
And S204, if the length of the target language text at the current moment is greater than a first threshold value, taking a character with a preset length in front of the target language text at the current moment as a second prefix part.
Optionally, the preset length is a difference between the length of the target language text at the current time and a second threshold.
Similarly, since the length of the target language text at time t4 is 16 and 16 is greater than b = 10, the prefix portion of the translation is updated to the first 16 - 5 = 11 characters of the target language text at time t4; that is, the translation prefix input to the translation model at the next time t5 is "There is no problem, our fans, if there is any problem", which can be denoted the second prefix portion. At time t5, the input of the translation model therefore includes not only the original language text at time t5 but also the translation prefix "There is no problem, our fans, if there is any problem".
And S205, according to the second prefix part, translating the original language text at the next moment into the target language text at the next moment, wherein the second prefix part is kept unchanged in the target language text at the next moment.
For example, the original language text at time t5 is "all have no problem, our babies with any question can tell me directly, and type it under our comment section". According to the updated translation prefix "There is no problem, our fans, if there is any problem", the translation model may translate it into "There is no problem, our fans, if there is any problem, you can tell me directly, and type it in the lower part of our comment section", i.e., the target language text at time t5, in which the updated translation prefix "There is no problem, our fans, if there is any problem" remains unchanged. Similarly, since the length of the target language text at time t5 is 27 and 27 is greater than b = 10, the prefix portion of the translation would be updated to the first 27 - 5 = 22 characters of the target language text at time t5; that is, the translation prefix input to the translation model at the next time t6 would be "There is no problem, our fans, if there is any problem, you can tell me directly, and type it in the lower". However, since the original language text at time t5 is already a complete sentence, i.e., the end of the sentence has been reached, the processing of this sentence ends with the translation at time t5.
For example, from time t1 to time t5, the correspondence between the text in the original language, the prefix of the translated version, and the text in the target language is shown in table 2 below:
TABLE 2

Time | Original language text | Translation prefix | Target language text | Remarks
t1 | All have no problem | (empty) | There is no problem | Translation length is 4, less than 10, so the translation prefix input to the translation model at time t2 remains empty
t2 | All have no problem, our babies | (empty) | There is no problem, our fans | Translation length is 6, less than 10, so the translation prefix input to the translation model at time t3 remains empty
t3 | All have no problem, our babies with any question | (empty) | There is no problem, our fans, if there is any problem | Translation length is 11, greater than 10, so the translation prefix input to the translation model at time t4 is the first 11 - 5 = 6 characters of the translation
t4 | All have no problem, our babies with any question can tell me directly | There is no problem, our fans | There is no problem, our fans, if there is any problem, you can tell me directly | Translation length is 16, greater than 10, so the translation prefix input to the translation model at time t5 is the first 16 - 5 = 11 characters of the translation
t5 | All have no problem, our babies with any question can tell me directly, and type it under our comment section | There is no problem, our fans, if there is any problem | There is no problem, our fans, if there is any problem, you can tell me directly, and type it in the lower part of our comment section | Translation length is 27, greater than 10, so the translation prefix input to the translation model at time t6 would be the first 27 - 5 = 22 characters of the translation; however, the end of the sentence has been reached, so processing of the sentence ends here
Optionally, the original language texts at different times belong to the same sentence.
For example, from time t1 to time t5, the original language text at each moment belongs to the same sentence. That is, original language texts belonging to the same sentence at different moments can all be processed by the method described in the embodiment of the present disclosure.
In this embodiment, a translation prefix of a certain length is determined from the target language text at the previous moment, so that when the original language text at the current moment is translated, that prefix remains unchanged in the target language text at the current moment. This ensures the stability of the translation.
Fig. 3 is a flowchart of a real-time speech translation method according to an embodiment of the present disclosure. The method can be executed by a real-time speech translation apparatus, which can be implemented in software and/or hardware, and the apparatus can be configured in an electronic device, such as a server or a terminal, where the terminal specifically includes a mobile phone, a computer, or a tablet computer. As shown in fig. 3, the method comprises the following specific steps:
S301, acquiring real-time voice.
For example, the terminal may collect speech in real time; for instance, the speech collected in real time by the terminal is speech in Chinese.
And S302, performing real-time voice recognition on the real-time voice to obtain a real-time original language text.
Specifically, the terminal can perform real-time speech recognition on the speech acquired in real time using automatic speech recognition (ASR) technology, thereby obtaining real-time Chinese text, where the Chinese text can be recorded as the original language text.
And S303, translating the real-time original language text into a real-time target language text according to a first prefix part in the target language text translated at the historical moment, wherein the first prefix part is kept unchanged in the real-time target language text.
It is assumed that at a certain historical time, for example the previous moment, the recognized Chinese text is "all have no problem" and the corresponding translation result is "There is no problem". Here, English is used as the target language, so "There is no problem" is the target language text translated at the historical time. Further, the first n characters of "There is no problem" are used as the first prefix part of the target language text, where the value of n is not particularly limited. For example, "There is no problem" may be taken as a whole as the first prefix part. Suppose the Chinese text recognized at the current moment is "all have no problem, our babies". The Chinese text at the current moment may then be translated into English text at the current moment according to the first prefix part "There is no problem"; for example, the English text at the current moment is "There is no problem, our fans". Because the first prefix part is kept unchanged in the English text at the current moment, the first n characters of the English text at the current moment stay consistent with the first n characters of the English text at the historical moment, which prevents the English text at the current moment from changing greatly compared with the English text at the historical moment. Therefore, the problem of instability of the translated text output in real-time speech translation is effectively alleviated.
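The constraint in S303 can be made concrete with a toy example. The `translate` function below uses canned outputs as a stand-in for a real prefix-constrained machine translation model (an assumption for illustration, not the patent's implementation); the point is that the new translation is required to start with the first prefix part.

```python
# Canned outputs stand in for a real prefix-constrained MT model.
def translate(source_text: str, prefix: str) -> str:
    # A real decoder would force-decode `prefix` token by token and then
    # continue generation freely; here we simply look the result up.
    canned = {
        "all have no problem": "There is no problem",
        "all have no problem, our babies": "There is no problem, our fans",
    }
    result = canned[source_text]
    assert result.startswith(prefix)  # the first prefix part is unchanged
    return result

prev = translate("all have no problem", "")
curr = translate("all have no problem, our babies", prefix=prev)
print(curr)  # the head of the previous output survives verbatim
```

Because `curr` starts with `prev`, a subtitle rendered from these outputs never rewrites text the viewer has already read.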
Fig. 4 is a flowchart of a real-time subtitle generating method according to an embodiment of the present disclosure. The method may be executed by a real-time subtitle generating apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device, such as a server or a terminal, where the terminal specifically includes a mobile phone, a computer, or a tablet computer. As shown in fig. 4, the method comprises the following specific steps:
S401, acquiring real-time voice.
S402, performing real-time voice recognition on the real-time voice to obtain a real-time original language text.
And S403, translating the real-time original language text into a real-time target language text according to a first prefix part in the target language text translated at the historical moment, wherein the first prefix part is kept unchanged in the real-time target language text.
Specifically, the implementation process and the specific principle of S401 to S403 are consistent with the implementation process and the specific principle of S301 to S303 in the above embodiment, and are not described herein again.
S404, generating a real-time caption according to the real-time original language text and the real-time target language text, wherein the real-time caption comprises at least one of the real-time original language text and the real-time target language text.
For example, at the previous historical time, the original language text is "all have no problem", and the target language text into which it is translated is "There is no problem". At that time, the terminal may generate a caption including at least one of "all have no problem" and "There is no problem".

At the current moment, the original language text is "all have no problem, our babies", and the target language text into which it is translated is "There is no problem, our fans". At this time, the terminal may generate a caption including at least one of "all have no problem, our babies" and "There is no problem, our fans".
It can be understood that, because the target language text at the current moment is not greatly changed compared with the target language text at the historical moment, when the subtitle displayed in real time by the terminal includes the target language text, the target language text translated in real time can be effectively prevented from being greatly changed, and thus, the user experience can be improved.
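The caption assembly in S404 can be sketched as a small helper. The function name `make_caption` and the `mode` parameter are assumptions for illustration; the patent only requires that the caption contain at least one of the two texts.

```python
def make_caption(source_text: str, target_text: str, mode: str = "both") -> str:
    """Assemble a real-time caption per S404: the caption includes at least
    one of the original language text and the target language text."""
    if mode == "source":
        return source_text
    if mode == "target":
        return target_text
    # default: bilingual two-line subtitle, original above translation
    return source_text + "\n" + target_text

print(make_caption("all have no problem", "There is no problem"))
```

A display layer would re-render this caption each time the texts update; since the prefix part of the target text is frozen, only the tail of the subtitle changes between moments.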
Fig. 5 is a schematic structural diagram of a real-time translation apparatus according to an embodiment of the present disclosure. The real-time translation apparatus provided in the embodiment of the present disclosure may execute the processing procedure provided in the embodiment of the real-time translation method, as shown in fig. 5, the real-time translation apparatus 50 includes:
an obtaining module 51, configured to obtain a first prefix portion in a target language text at a historical time;
the speech recognition module 52 is configured to perform speech recognition on the original language speech acquired at the current time to obtain an original language text at the current time;
a translation module 53, configured to translate the original language text at the current time into the target language text at the current time according to the first prefix portion, where the first prefix portion remains unchanged in the target language text at the current time.
Optionally, the obtaining module 51 is specifically configured to: and when the length of the target language text at the historical moment is larger than a first threshold value, acquiring a first prefix part in the target language text at the historical moment.
Optionally, the first character of the first prefix portion is the same as the first character of the target language text at the historical time, and the length of the first prefix portion is determined by the length of the target language text at the historical time and a second threshold.
Optionally, the length of the first prefix portion is a difference between the length of the target language text at the historical time and a second threshold.
Optionally, the translation module 53 is further configured to: after the original language text at the current moment is translated into the target language text at the current moment according to the first prefix part, if the length of the target language text at the current moment is greater than a first threshold value, taking a character with a preset length in front of the target language text at the current moment as a second prefix part; and translating the original language text at the next moment into the target language text at the next moment according to the second prefix part, wherein the second prefix part is kept unchanged in the target language text at the next moment.
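The rolling update described above (and in claim 4) can be sketched as a loop that carries the second prefix part from one moment to the next. The helper names and the `translate(source, prefix)` callable are assumptions standing in for a real prefix-constrained MT model; translations are handled as word-token lists.

```python
FIRST_THRESHOLD = 10   # minimum translation length before a prefix is fixed
SECOND_THRESHOLD = 5   # number of trailing tokens left free to change

def run(hypotheses, translate):
    """Translate a stream of growing original-language hypotheses.

    `hypotheses` yields the original language text at each successive
    moment; `translate(source, prefix)` returns a token list that starts
    with `prefix`. After each moment, the prefix for the next moment is
    re-derived from the current target text.
    """
    prefix = []
    for source in hypotheses:
        target = translate(source, prefix)
        if len(target) > FIRST_THRESHOLD:
            # second prefix part: all but the last SECOND_THRESHOLD tokens
            prefix = target[:len(target) - SECOND_THRESHOLD]
        yield target
```

The design choice here is that the prefix is recomputed after every translation rather than accumulated, so a single pass of the rule governs every pair of adjacent moments.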
Optionally, the preset length is a difference between the length of the target language text at the current time and a second threshold.
Optionally, the original language texts at different times belong to the same sentence.
The real-time translation apparatus in the embodiment shown in fig. 5 can be used to execute the real-time translation method described above, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 6 is a schematic structural diagram of a real-time speech translation apparatus according to an embodiment of the present disclosure. The real-time speech translation apparatus provided in the embodiment of the present disclosure may execute the processing procedure provided in the embodiment of the real-time speech translation method, as shown in fig. 6, the real-time speech translation apparatus 60 includes:
an obtaining module 61, configured to obtain real-time speech;
a real-time speech recognition module 62, configured to perform real-time speech recognition on the real-time speech to obtain a real-time original language text;
a translation module 63, configured to translate the real-time original language text into a real-time target language text according to a first prefix portion in the target language text translated at a historical time, where the first prefix portion is kept unchanged in the real-time target language text.
The real-time speech translation apparatus of the embodiment shown in fig. 6 can be used for executing the real-time speech translation method as described above, and the implementation principle and the technical effect are similar, and are not described herein again.
Fig. 7 is a schematic structural diagram of a real-time subtitle generating apparatus according to an embodiment of the present disclosure. The real-time subtitle generating apparatus provided by the embodiment of the present disclosure may execute the processing procedure provided by the real-time subtitle generating method embodiment, as shown in fig. 7, the real-time subtitle generating apparatus 70 includes:
an obtaining module 71, configured to obtain real-time speech;
a real-time speech recognition module 72, configured to perform real-time speech recognition on the real-time speech to obtain a real-time original language text;
a translation module 73, configured to translate the real-time original language text into a real-time target language text according to a first prefix portion in a target language text translated at a historical time, where the first prefix portion is kept unchanged in the real-time target language text;
a generating module 74, configured to generate a real-time subtitle according to the real-time original language text and the real-time target language text, where the real-time subtitle includes at least one of the real-time original language text and the real-time target language text.
The real-time caption generating device in the embodiment shown in fig. 7 may be configured to execute the real-time caption generating method described above, and the implementation principle and the technical effect are similar, which are not described herein again.
Described above are the internal functions and structures of the real-time translation apparatus, the real-time speech translation apparatus, and the real-time subtitle generating apparatus, each of which can be implemented as an electronic device. Fig. 8 is a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present disclosure. As shown in fig. 8, the electronic device includes a memory 81 and a processor 82.
The memory 81 is used to store programs. In addition to the above-described programs, the memory 81 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 81 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 82 is coupled to the memory 81 and executes the program stored in the memory 81 for:
acquiring a first prefix part in a target language text at a historical moment;
performing voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment;
and translating the original language text at the current moment into the target language text at the current moment according to the first prefix part, wherein the first prefix part is kept unchanged in the target language text at the current moment.
Further, as shown in fig. 8, the electronic device may further include: communication components 83, power components 84, audio components 85, a display 86, and the like. Only some of the components are schematically shown in fig. 8, and the electronic device is not meant to include only the components shown in fig. 8.
The communication component 83 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 83 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 83 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 84 provides power to the various components of the electronic device. The power components 84 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 85 is configured to output and/or input audio signals. For example, the audio component 85 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 81 or transmitted via the communication component 83. In some embodiments, audio assembly 85 also includes a speaker for outputting audio signals.
The display 86 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In addition, the embodiment of the present disclosure also provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the real-time translation method described in the above embodiment.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A real-time speech translation method, wherein the method comprises:
acquiring real-time voice;
performing real-time voice recognition on the real-time voice to obtain a real-time original language text;
when the length of a target language text at a historical moment is larger than a first threshold value, acquiring a first prefix part in the target language text at the historical moment, and translating the real-time original language text into a real-time target language text according to the first prefix part in the target language text translated at the historical moment, wherein the first prefix part is kept unchanged in the real-time target language text;
the length of the first prefix part is the difference between the length of the target language text at the historical time and a second threshold, and the first character of the first prefix part is the same as the first character of the target language text at the historical time.
2. A real-time subtitle generating method, wherein the method comprises:
acquiring real-time voice;
performing real-time voice recognition on the real-time voice to obtain a real-time original language text;
when the length of a target language text at a historical moment is larger than a first threshold value, acquiring a first prefix part in the target language text at the historical moment, and translating the real-time original language text into a real-time target language text according to the first prefix part in the target language text translated at the historical moment, wherein the first prefix part is kept unchanged in the real-time target language text;
generating a real-time caption according to the real-time original language text and the real-time target language text, wherein the real-time caption comprises at least one of the real-time original language text and the real-time target language text;
the length of the first prefix part is the difference between the length of the target language text at the historical time and a second threshold, and the first character of the first prefix part is the same as the first character of the target language text at the historical time.
3. A real-time translation method, wherein the method comprises:
acquiring a first prefix part in a target language text at a historical moment;
performing voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment;
according to the first prefix part, translating the original language text at the current moment into a target language text at the current moment, wherein the first prefix part is kept unchanged in the target language text at the current moment;
the length of the first prefix part is the difference value between the length of the target language text at the historical moment and a second threshold value, and the first character of the first prefix part is the same as the first character of the target language text at the historical moment;
acquiring a first prefix part in a target language text at a historical moment, wherein the first prefix part comprises the following steps:
and when the length of the target language text at the historical moment is larger than a first threshold value, acquiring a first prefix part in the target language text at the historical moment.
4. The method of claim 3, wherein after translating the original language text at the current time to the target language text at the current time according to the first prefix portion, the method further comprises:
if the length of the target language text at the current moment is larger than a first threshold value, taking a character with a preset length in front of the target language text at the current moment as a second prefix part;
and translating the original language text at the next moment into the target language text at the next moment according to the second prefix part, wherein the second prefix part is kept unchanged in the target language text at the next moment.
5. The method according to claim 4, wherein the preset length is a difference value between a length of the target language text at the current time and a second threshold value.
6. The method of claim 3, wherein the native language texts at different times belong to the same sentence.
7. A real-time translation apparatus, comprising:
the acquisition module is used for acquiring a first prefix part in a target language text at a historical moment;
the voice recognition module is used for carrying out voice recognition on the original language voice acquired at the current moment to obtain an original language text at the current moment;
a translation module, configured to translate the original language text at the current time into the target language text at the current time according to the first prefix portion, where the first prefix portion remains unchanged in the target language text at the current time;
the length of the first prefix part is the difference value between the length of the target language text at the historical moment and a second threshold value, and the first character of the first prefix part is the same as the first character of the target language text at the historical moment;
the acquisition module is specifically configured to: and when the length of the target language text at the historical moment is larger than a first threshold value, acquiring a first prefix part in the target language text at the historical moment.
8. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-6.
CN202210164989.4A 2022-02-23 2022-02-23 Real-time voice translation method, device, equipment and storage medium Active CN114239613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210164989.4A CN114239613B (en) 2022-02-23 2022-02-23 Real-time voice translation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210164989.4A CN114239613B (en) 2022-02-23 2022-02-23 Real-time voice translation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114239613A CN114239613A (en) 2022-03-25
CN114239613B true CN114239613B (en) 2022-08-02

Family

ID=80747887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210164989.4A Active CN114239613B (en) 2022-02-23 2022-02-23 Real-time voice translation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114239613B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059313A (en) * 2019-04-03 2019-07-26 百度在线网络技术(北京)有限公司 Translation processing method and device
CN113362810A (en) * 2021-05-28 2021-09-07 平安科技(深圳)有限公司 Training method, device and equipment of voice processing model and storage medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102195685B (en) * 2011-05-20 2013-12-11 惠州Tcl移动通信有限公司 Processing system and method for translating displayed text information
WO2013083132A1 (en) * 2011-12-05 2013-06-13 Copenhagen Business School Translation method and computer programme for assisting the same
JP2013206253A (en) * 2012-03-29 2013-10-07 Toshiba Corp Machine translation device, method and program
CN106156009A (en) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 Voice translation method and device
JP2015201215A (en) * 2015-05-25 2015-11-12 株式会社東芝 Machine translation device, method, and program
JP2017167659A (en) * 2016-03-14 2017-09-21 株式会社東芝 Machine translation device, method, and program
US10346548B1 (en) * 2016-09-26 2019-07-09 Lilt, Inc. Apparatus and method for prefix-constrained decoding in a neural machine translation system
CN106776534B (en) * 2016-11-11 2020-02-11 北京工商大学 Incremental learning method of word vector model
CN107632980B (en) * 2017-08-03 2020-10-27 北京搜狗科技发展有限公司 Voice translation method and device for voice translation
CN108304388B (en) * 2017-09-12 2020-07-07 腾讯科技(深圳)有限公司 Machine translation method and device
CN110175335B (en) * 2019-05-08 2023-05-09 北京百度网讯科技有限公司 Translation model training method and device
CN110162800B (en) * 2019-05-08 2021-02-05 北京百度网讯科技有限公司 Translation model training method and device
CN110765787A (en) * 2019-10-21 2020-02-07 深圳传音控股股份有限公司 Information interaction real-time translation method, medium and terminal
CN112507729A (en) * 2020-12-15 2021-03-16 康键信息技术(深圳)有限公司 Method and device for translating text in page, computer equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059313A (en) * 2019-04-03 2019-07-26 百度在线网络技术(北京)有限公司 Translation processing method and device
CN113362810A (en) * 2021-05-28 2021-09-07 平安科技(深圳)有限公司 Training method, device and equipment of voice processing model and storage medium

Also Published As

Publication number Publication date
CN114239613A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN106569800B (en) Front-end interface generation method and device
KR101756042B1 (en) Method and device for input processing
CN107423106B (en) Method and apparatus for supporting multi-frame syntax
US9262399B2 (en) Electronic device, character conversion method, and storage medium
CN106547547B (en) data acquisition method and device
WO2018024116A1 (en) Card-based information displaying method and apparatus, and information displaying service processing method and apparatus
US9471567B2 (en) Automatic language recognition
CN111857903A (en) Display page processing method, device, equipment and storage medium
JP2018504865A (en) Information processing method and apparatus, program, and recording medium
CN105468606B (en) Webpage saving method and device
CN111381819A (en) List creation method and device, electronic equipment and computer-readable storage medium
CN112233669A (en) Speech content prompting method and system
CN114239613B (en) Real-time voice translation method, device, equipment and storage medium
CN107402756B (en) Method, device and terminal for drawing page
CN111324214A (en) Statement error correction method and device
CN110781689B (en) Information processing method, device and storage medium
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN114298227A (en) Text duplicate removal method, device, equipment and medium
CN108469913B (en) Method, apparatus and storage medium for modifying input information
CN108241438B (en) Input method, input device and input device
CN113051235A (en) Document loading method and device, terminal and storage medium
CN112182449A (en) Page loading method and device, electronic equipment and storage medium
CN106354749B (en) Information display method and device
CN111460836B (en) Data processing method and device for data processing
CN109408623B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant