WO2019223437A1

WO2019223437A1 - Speech translation method and apparatus

Info

Publication number: WO2019223437A1
Application number: PCT/CN2019/082040
Authority: WO
Inventors: 占萌萌; 刘俊华
Original assignee: 科大讯飞股份有限公司
Priority date: 2018-05-23
Filing date: 2019-04-10
Publication date: 2019-11-28
Also published as: CN108710616A

Abstract

Disclosed in the present application are a speech translation method and apparatus. The method comprises: translating original speech data of a user to obtain a first translation text, the language of the first translation text being different from the language of the original speech data; and then determining, by means of interaction with the user, whether the first translation text is a correct translation result of the original speech data. By determining whether the first translation text is a correct translation result of the original speech data, the present application can process the first translation text on the basis of the determination result and thus improve the accuracy of the translation result.

Description

Method and device for speech translation

This application claims priority to Chinese patent application No. 201810503163.X, filed with the Chinese Patent Office on May 23, 2018 and entitled "A Voice Translation Method and Device", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of artificial intelligence technology, and in particular, to a method and device for speech translation.

Background

Speech translation refers to the process of automatically translating the speech data of the source language into the speech data of the target language, where the source language and the target language belong to different languages. In the existing speech translation technology, the speech data of the source language is directly translated and a translation result is obtained, but the translation result may not be accurate.

For example, the speech data in the source language is the Chinese speech "Does luggage have to go through security?", and the speech data in the target language is the English speech "Does Lee has to go through security?", whose actual meaning is "Did Mr. Li pass the security check?". The meaning of the Chinese speech before translation therefore differs from the actual meaning of the translated English speech; that is, the translation result is inaccurate.

Summary of the Invention

The main purpose of the embodiments of the present application is to provide a speech translation method and device, which can improve the accuracy of speech translation results.

An embodiment of the present application provides a voice translation method, including:

Translating the user's source speech data to obtain a first translated text, wherein the language of the first translated text is different from the language of the source speech data;

By interacting with the user, it is determined whether the translation result of the first translated text as the source speech data is correct.

Optionally, after determining whether the translation result of the first translated text as the source speech data is correct, the method further includes:

If it is determined that the translation result of the first translated text as the source speech data is incorrect, the first translated text is corrected, and the corrected text is used as the translation result of the source speech data.

Optionally, before interacting with the user, the method further includes:

Determine whether the translation quality of the first translated text is greater than a preset quality threshold, wherein the translation quality of the first translated text is used to characterize the correctness of the translation result of the first translated text as the source speech data;

If not, perform the step of interacting with the user.

Optionally, the determining whether the translation quality of the first translated text is greater than a preset quality threshold includes:

Translating the first translated text to obtain a second translated text, wherein the language of the second translated text is the same as the language of the source speech data;

Determining whether the translation quality of the first translated text is greater than a preset quality threshold according to the second translated text.

Optionally, determining whether the translation quality of the first translated text is greater than a preset quality threshold based on the second translated text includes:

Determining whether the translation quality of the first translated text is greater than a preset quality threshold according to the recognized text of the source speech data and the second translated text.

Optionally, determining whether the first translation text is the correct translation result of the source voice data by interacting with the user includes:

Interact with the user by using the second translated text to determine whether the translation result of the first translated text as the source speech data is correct.

Optionally, the using the second translated text to interact with the user to determine whether the translation result of the first translated text as the source voice data is correct includes:

Outputting a first query voice to the user, wherein the first query voice is used to query whether the source voice data is similar to the semantics of the second translated text;

If a positive answer to the first query voice is received from the user, the first translation text is correct as a translation result of the source voice data;

If a negative answer is received from the user to the first query voice, the first translation text is incorrect as a translation result of the source voice data.

Optionally, the correcting the first translated text includes:

The first translation text is corrected by using a text matching method.

Optionally, the correcting the first translated text by using a text matching method includes:

Match the recognized text of the source speech data with text data in a database, wherein the database stores at least one sentence pair, the sentence pair including a first sample text and a second sample text that is a correct translation of the first sample text, the language of the first sample text is the same as the language of the source speech data, and the language of the second sample text is the same as the language of the first translated text;

Obtaining the first sample text most similar to the recognition text of the source speech data through the matching operation;

Correct the first translated text based on the most similar first sample text.

Optionally, the correcting the first translated text based on the most similar first sample text includes:

Interacting with the user by using the most similar first sample text to achieve correction of the first translated text.

Optionally, interacting with the user by using the most similar first sample text to achieve the correction of the first translated text includes:

Outputting a second query voice to the user, wherein the second query voice is used to query whether the source voice data is semantically similar to the most similar first sample text;

If a positive answer to the second query voice is received from the user, a second sample text is obtained from the sentence pair to which the most similar first sample text belongs, and is used as the successfully corrected text of the first translated text.

Optionally, the method further includes:

If a negative answer to the second query voice is received from the user, a prompt voice is output, wherein the prompt voice is used to prompt the user to repeat the source voice data or to rephrase the source voice data.
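The text-matching correction described above can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the similarity measure (`difflib.SequenceMatcher.ratio`), the function names, the `user_confirms` callback, and the sample sentence pairs are all assumptions, since the text does not say how similarity is computed or how the database is organized.

```python
from difflib import SequenceMatcher


def most_similar_pair(recognized_text, sentence_pairs):
    """Return the (first_sample, second_sample) pair whose first sample
    text is most similar to the recognized text."""
    # SequenceMatcher.ratio() is an assumed similarity measure.
    return max(
        sentence_pairs,
        key=lambda pair: SequenceMatcher(None, recognized_text, pair[0]).ratio(),
    )


def correct_translation(recognized_text, sentence_pairs, user_confirms):
    """Return the corrected text, or None if the user rejects the match."""
    first_sample, second_sample = most_similar_pair(recognized_text, sentence_pairs)
    # Second query: is the source speech semantically similar to the
    # most similar first sample text?
    if user_confirms(first_sample):
        return second_sample
    # Negative answer: the caller should prompt the user to repeat or rephrase.
    return None


# Hypothetical sentence pairs; the second element stands in for the
# stored correct translation of the first.
pairs = [
    ("does luggage go through security", "SAMPLE TRANSLATION A"),
    ("where is the boarding gate", "SAMPLE TRANSLATION B"),
]
corrected = correct_translation("does lee go through security", pairs, lambda s: True)
```

Returning `None` on a negative answer keeps the prompt to repeat or rephrase outside this function, so the caller decides how the next round of interaction starts.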

An embodiment of the present application further provides a voice translation device, including:

A voice translation unit, configured to translate a user's source voice data to obtain a first translated text, wherein a language of the first translated text is different from a language of the source voice data;

A user interaction unit is configured to determine whether the translation result of the first translated text as the source voice data is correct by interacting with the user.

Optionally, the device further includes:

A text correction unit, configured to correct the first translated text if it is determined that the translation result of the first translated text as the source speech data is incorrect, and to use the corrected text as the translation result of the source speech data.

Optionally, the device further includes:

A quality judgment unit, configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold, wherein the translation quality of the first translated text is used to characterize the correctness of the first translated text as the translation result of the source speech data; and, if not, to trigger the user interaction unit to determine, by interacting with the user, whether the translation result of the first translated text as the source speech data is correct.

Optionally, the quality judgment unit includes:

A reverse translation subunit, configured to translate the first translated text to obtain a second translated text, wherein the language of the second translated text is the same as the language of the source speech data;

A quality judging subunit, configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold according to the second translated text.

Optionally, the quality judging subunit is specifically configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold based on the recognized text of the source speech data and the second translated text.

Optionally, the user interaction unit is specifically configured to use the second translated text to interact with the user to determine whether the translation result of the first translated text as the source voice data is correct.

Optionally, the user interaction unit includes:

A first query subunit, configured to output a first query voice to the user, wherein the first query voice is used to query whether the source voice data is semantically similar to the second translated text;

A result determining subunit, configured to: determine that the first translated text is correct as a translation result of the source voice data if a positive answer to the first query voice is received from the user; and determine that the first translated text is incorrect as a translation result of the source voice data if a negative answer to the first query voice is received from the user.

Optionally, the text correction unit is specifically configured to correct the first translated text in a text matching manner.

Optionally, the text correction unit includes:

A text matching subunit, configured to match the recognized text of the source speech data with text data in a database, wherein the database stores at least one sentence pair, the sentence pair includes a first sample text and a second sample text that is a correct translation of the first sample text, the language of the first sample text is the same as the language of the source speech data, and the language of the second sample text is the same as the language of the first translated text;

A text obtaining subunit, configured to obtain, through the matching operation, a first sample text that is most similar to the recognized text of the source speech data;

A text correction subunit is configured to correct the first translated text according to the most similar first sample text.

Optionally, the text correction subunit is specifically configured to interact with the user by using the most similar first sample text to implement correction on the first translated text.

Optionally, the text correction subunit includes:

A second query subunit, configured to output a second query voice to the user, wherein the second query voice is used to query whether the source voice data is semantically similar to the most similar first sample text;

A correction completion subunit, configured to obtain, if a positive answer to the second query voice is received from the user, a second sample text from the sentence pair to which the most similar first sample text belongs, as the successfully corrected text of the first translated text.

Optionally, the text correction subunit further includes:

A voice prompting subunit, configured to output a prompt voice if a negative answer to the second query voice is received from the user, wherein the prompt voice is used to prompt the user to repeat the source voice data or to rephrase the source voice data.

An embodiment of the present application further provides a voice translation device, including: a processor, a memory, and a system bus;

The processor and the memory are connected through the system bus;

The memory is configured to store one or more programs, where the one or more programs include instructions, and the instructions, when executed by the processor, cause the processor to execute any one of the implementations of the speech translation method described above.

An embodiment of the present application further provides a computer-readable storage medium, including instructions, which, when running on a computer, cause the computer to execute any one of the above-mentioned voice translation methods.

An embodiment of the present application further provides a computer program product, which, when the computer program product runs on a terminal device, causes the terminal device to execute any one of the above-mentioned voice translation methods.

The voice translation method and device provided in the embodiments of the present application translate a user's source voice data to obtain a first translated text, the language of the first translated text being different from the language of the source voice data, and then determine, by interacting with the user, whether the translation result of the first translated text as the source speech data is correct. It can be seen that by judging whether the translation result of the first translated text as the source speech data is correct, the first translated text can be processed based on the judgment result, thereby improving the accuracy of the translation result.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a speech translation method according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a method for determining translation quality according to an embodiment of the present application;

FIG. 3 is a schematic flowchart of a method for determining whether a translation result is credible according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of a method for correcting a translated text according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a speech translation apparatus according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a hardware structure of a speech translation apparatus according to an embodiment of the present application.

Detailed description

Speech translation refers to the process of automatically translating the speech data of the source language (that is, the speech data before translation) into the speech data of the target language (that is, the translated speech data). Generally, speech translation technology involves three main components: speech recognition, machine translation, and speech synthesis. Speech recognition refers to recognizing speech data in the source language through speech recognition technology to generate source language text; machine translation refers to translating the source language text into target language text through machine translation technology; speech synthesis refers to synthesizing the target language text into speech data of the target language through speech synthesis technology.

As speech translation technology is applied more and more widely, people place higher and higher requirements on the accuracy of translation results. One method of speech translation realizes speech translation through a single round of man-machine dialogue, that is, through one input and one output, where the input is the voice data of the source language and the output is the voice data of the target language. The voice data of the source language to be translated is input into the voice translation device, which automatically translates it into the voice data of the target language through steps such as speech recognition, machine translation and speech synthesis, and feeds the result back to the user. In this process, however, the results of speech recognition and machine translation may deviate, so that the finally output speech data of the target language is inaccurate. Because the user can only passively accept the translation result of the speech translation device, the translation is a one-time translation: when the translation result is wrong, the speech translation device cannot correct it in time, which reduces the accuracy of the translation result.

For this reason, the embodiments of the present application provide a speech translation method that adds a correction function for the translation result; that is, the accuracy of the one-time translation result can be evaluated, and when the evaluation indicates that the translation result is not accurate enough, the translation result can be corrected. Specifically, the translation result can be corrected according to the result of interaction with the user, thereby improving the accuracy of the translation result.

It should be noted that the speech translation method provided in the embodiments of the present application is not limited to particular application scenarios. For example, the method can be used in scenarios where a user needs translation, such as traveling abroad or passing through entry and exit security checks.

In order to make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

First embodiment

Referring to FIG. 1, a schematic flowchart of a speech translation method according to this embodiment is provided. The speech translation method includes the following steps:

S101: Translate the user's source speech data to obtain a first translated text, wherein the language of the first translated text is different from the language of the source speech data.

This embodiment refers to the speech data before translation (that is, the speech to be translated) as the source speech data. This embodiment does not limit the language of the source speech data; for example, the source speech data may be Chinese speech or English speech.

In this embodiment, the translated text data is referred to as the first translated text. This embodiment does not limit the language of the first translated text, as long as the first translated text and the source speech data are in different languages. For example, the source speech data is Chinese speech and the first translated text is English text; for another example, the source speech data is English speech and the first translated text is Chinese text.

In this embodiment, speech recognition can be performed on the source speech data by speech recognition technology to obtain the recognition text A1 of the source speech data, and the recognition text A1 is then machine-translated by machine translation technology to obtain the first translated text B1. It should be noted that the speech recognition technology in this embodiment may be any existing or future speech recognition technology; similarly, the machine translation technology in this embodiment may be any existing or future machine translation technology.

For example, when passing through an entry or exit security check, the user wants to talk with the security officer through a voice translation device. Assume that the source voice data spoken by the user is "Does the baggage have to pass security check?" and that the recognition text A1 obtained by speech recognition is "Does Lee have to go through security?". The recognition text A1 is then translated (Chinese to English), and the first translated text B1 is "Does Lee has to go through security?". It can be seen that a recognition error occurred in the recognition text A1 when speech recognition was performed on the source speech data.

S102: Determine whether the translation result of the first translated text as the source voice data is correct by interacting with the user.

In this embodiment, the voice translation device may interact with the user, specifically by voice interaction or by text interaction, and determine according to the interaction result whether the translation result of the first translated text as the source voice data is correct. If it is determined that the translation result of the first translated text as the source speech data is correct, the first translated text B1 may be used as the translation result of the source speech data.

At this time, the first translation text B1 can be further speech synthesized to obtain target speech data, and the target speech data is directly fed back to the user, thereby ending the current round of translation. Of course, after the first translation text B1 is used as the text translation result of the source speech data, other processing may also be performed on it, and this embodiment does not limit the subsequent processing manner.

It should be noted that if it is determined that the translation result of the first translated text as the source speech data is incorrect, the first translated text B1 may be corrected as described in the subsequent fourth embodiment, or the user may be requested to repeat the source speech data or to express it in another way that is semantically similar, so as to start a new round of translation interaction.

In summary, the voice translation method provided in this embodiment translates a user's source voice data to obtain a first translated text, the language of the first translated text being different from the language of the source voice data, and then determines, by interacting with the user, whether the translation result of the first translated text as the source speech data is correct. It can be seen that by judging whether the translation result of the first translated text as the source speech data is correct, the first translated text can be processed based on the judgment result, thereby improving the accuracy of the translation result.
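The flow of steps S101 and S102 can be sketched as below. This is only an outline under stated assumptions: `recognize`, `translate`, and `confirm_with_user` are hypothetical callables standing in for whatever speech recognition, machine translation, and interaction components a device uses, and the stub values in the usage lines are placeholders rather than real engine output.

```python
def translate_with_confirmation(source_speech, recognize, translate, confirm_with_user):
    """Run S101 (translate) and then S102 (confirm with the user)."""
    recognized_text = recognize(source_speech)           # recognition text A1
    first_translated_text = translate(recognized_text)   # first translated text B1
    # S102: the interaction result decides whether B1 is accepted.
    is_correct = confirm_with_user(source_speech, first_translated_text)
    return first_translated_text, is_correct


# Usage with stub components (all hypothetical).
b1, ok = translate_with_confirmation(
    b"raw-audio-bytes",
    recognize=lambda speech: "recognized text",
    translate=lambda text: "translated: " + text,
    confirm_with_user=lambda speech, text: False,
)
```

When `ok` is false, the caller would go on to correct the first translated text or ask the user to repeat or rephrase, as the embodiment describes.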

Second embodiment

In this embodiment, before the judgment step S102 in the first embodiment, that is, before judging through human-computer interaction whether the translation result of the first translated text as the source speech data is correct, the machine (that is, the voice translation device) may first determine whether the translation result of the first translated text as the source voice data is correct.

Therefore, before the determining step S102 in the first embodiment, the method may further include: determining whether the translation quality of the first translated text is greater than a preset quality threshold, wherein the translation quality of the first translated text is used to characterize the correctness of the first translated text as the translation result of the source speech data; if not, the determining step S102 in the first embodiment is performed.

In this embodiment, the translation quality of the first translated text B1 can be evaluated. If the translation quality is not higher than a threshold set in advance, referred to herein as the preset quality threshold, the translation result of the first translated text B1 as the source speech data is considered unreliable, that is, incorrect. In this case, step S102 can be performed to further judge the correctness of the first translated text B1 as the translation result.

Conversely, if the translation quality of the first translated text B1 is higher than the preset quality threshold, the translation result of the first translated text B1 as the source speech data is considered credible, that is, correct. In this case, the first translated text B1 can be used as the translation result of the source speech data. Further, speech synthesis can be performed on the first translated text B1 to obtain the target speech data, and the target speech data is fed back directly to the user, thereby ending this round of translation. Of course, after the first translated text B1 is used as the text translation result of the source speech data, other processing may also be performed on it, and this embodiment does not limit the subsequent processing manner.
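The threshold check that decides whether step S102 runs can be written as a one-line predicate. The sketch below assumes the translation quality is available as a number on the same scale as the threshold; the function name is hypothetical.

```python
def needs_user_confirmation(translation_quality, quality_threshold):
    """Return True when step S102 (interaction with the user) should run."""
    # Quality greater than the threshold: B1 is treated as credible.
    # Otherwise, including the case of equality, the user is consulted.
    return not (translation_quality > quality_threshold)
```

Because the condition in the text is "greater than", a quality exactly equal to the threshold still triggers interaction with the user.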

In the following, a specific implementation manner of the above-mentioned translation quality judgment step (that is, "determining whether the translation quality of the first translated text is greater than a preset quality threshold") will be described.

Referring to FIG. 2, a flowchart of a method for determining translation quality according to this embodiment is provided. The method for determining translation quality includes the following steps:

S201: Translate the first translated text to obtain a second translated text, wherein the language of the second translated text is the same as the language of the source speech data.

In this embodiment, the first translated text B1 may be reversely translated to obtain a second translated text A2. The language of the first translated text B1 is the post-translation language, such as English; the language of the second translated text A2 is the pre-translation language, such as Chinese.

For example, following the above example, assume that the first translated text B1 is "Does Lee has to go through security?"; the second translated text A2 obtained after reverse translation is then "Did Mr. Li pass the security check?".

S202: Determine whether the translation quality of the first translated text is greater than a preset quality threshold according to the second translated text.

In this embodiment, the translation quality of the first translated text B1 may be determined based on the second translated text A2. In an implementation manner, step S202 may specifically include: judging whether the translation quality of the first translated text is greater than a preset quality threshold according to the recognized text of the source speech data and the second translated text.

In a specific implementation of step S202, the BLEU (bilingual evaluation understudy) algorithm may be used to determine whether the translation quality of the first translated text is greater than the preset quality threshold.

Specifically, the BLEU algorithm is an evaluation algorithm for machine translation results, which is used to evaluate the translation quality of one natural language into another natural language. The specific algorithm is as follows:

First, in order to comprehensively consider the translation effect of the first translated text B1, the number of basic units that can be matched between the recognition text A1 and the second translated text A2 needs to be counted at several orders, from basic units of one word (1-gram) up to basic units of multiple words (n-gram). In the counting process, the position of each basic unit in the text is not considered. Then, according to the number of matched basic units, the matching precision of the second translated text A2 under each order of basic unit is calculated.

The following formula can be used to calculate the matching precision of the second translated text A2 under each order of basic unit i-gram (i = 1, 2 ... n):

precision = correct / output_length (1)

Here, correct is the number of basic units of the given order in the second translated text A2 that correctly match the recognition text A1, and output_length is the total number of basic units of that order in the second translated text A2.

For example, following the above example, assuming that the recognition text A1 is "Does Lee have to go through security?" and the second translated text A2 is "Did Mr. Li pass the security check?", the calculated matching precision is shown in Table 1 below.

Table 1

Basic unit	Correctly matched basic units (Chinese characters)	Matching precision
1-gram	李、过、安、检、吗、？	6/10 = 0.6
2-gram	过安、安检、吗？	3/9 = 0.33
3-gram	过安检	1/8 = 0.125
4-gram	(none)	0/7 = 0
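The counting behind the matching precision can be sketched as follows. This is a minimal illustration under assumptions: a basic unit is taken to be one character, matches are counted with clipped multiset intersection (the text only says that position is ignored), and the two strings are hypothetical stand-ins for A1 and A2 rather than the example of Table 1.

```python
from collections import Counter


def ngrams(units, n):
    """All contiguous n-unit sequences, in order of appearance."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def matching_precision(reference_units, candidate_units, n):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    candidate = Counter(ngrams(candidate_units, n))
    reference = Counter(ngrams(reference_units, n))
    total = sum(candidate.values())  # output_length for this order
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each n-gram, so
    # positions are ignored and repeated units are not over-counted.
    correct = sum((candidate & reference).values())
    return correct / total


reference = list("abcde")  # stands in for recognition text A1
candidate = list("abxde")  # stands in for second translated text A2
precisions = [matching_precision(reference, candidate, n) for n in (1, 2, 3)]
```

For these stand-in strings the 1-gram, 2-gram and 3-gram precisions are 4/5, 2/4 and 0/3 respectively.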

Then, redundant words in the second translated text A2 also need to be penalized. A length penalty factor is therefore introduced, on the principle that the longer the second translated text A2 is, the larger the penalty deducted. The formula for calculating the length penalty factor C is as follows:

C = min (1, L1 / L2) (2)

Among them, L1 is the length of the recognition text A1, and L2 is the length of the second translation text A2.

In formula (2), if the recognized text A1 and the second translated text A2 are Chinese text, the text length can be counted in characters. For example, when the recognition text A1 is "Does Lee have to go through security?", its length in the original Chinese is 9; when the second translated text A2 is "Did Mr. Li pass the security check?", its length in the original Chinese is 10.
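As a worked instance of formula (2), the lengths in the example give C = min(1, 9/10) = 0.9. The same calculation as a small function (the function and parameter names are illustrative only):

```python
def length_penalty(reference_length, candidate_length):
    """Formula (2): C = min(1, L1 / L2)."""
    # L1 is the length of recognition text A1, L2 the length of the
    # second translated text A2; C drops below 1 only when A2 is longer.
    return min(1.0, reference_length / candidate_length)


c = length_penalty(9, 10)
```

When A2 is no longer than A1 the factor stays at 1, so no penalty is applied.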

Finally, after the matching precision and the length penalty factor C are calculated according to formulas (1) and (2) respectively, the BLEU score of the second translated text A2 can be calculated. Specifically, the BLEU score corresponding to a certain order of basic unit can be selected, for example the BLEU score corresponding to 4-gram. The calculation formula is as follows:

bleu_4-gram = C * f(4-gram) (3)

Here, bleu_4-gram is the BLEU score of the second translated text A2, C is the length penalty factor, and f is a processing function applied to the matching precisions corresponding to 1-gram, 2-gram, 3-gram, and 4-gram.
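The structure of formula (3) can be illustrated as below. The text does not define the processing function f, so the sketch takes it as a parameter and plugs in an arithmetic mean scaled to 0-100 purely as a hypothetical stand-in; with that stand-in the result is not expected to reproduce any score quoted in this document. The precisions are those of Table 1 and C = 0.9 follows from formula (2) with lengths 9 and 10.

```python
def bleu_score(precisions, length_penalty, combine):
    """Formula (3): length penalty C times f applied to the precisions."""
    return length_penalty * combine(precisions)


def arithmetic_mean_percent(precisions):
    # Hypothetical stand-in for the unspecified function f.
    return 100.0 * sum(precisions) / len(precisions)


table_1_precisions = [6 / 10, 3 / 9, 1 / 8, 0 / 7]
score = bleu_score(table_1_precisions, 0.9, arithmetic_mean_percent)
```

With this stand-in the score comes out at about 23.8. Passing f in as an argument keeps the length penalty and the precision counting unchanged if a different combination is used.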

For example, when the recognition text A1 is "Does Lee have to go through security?" and the second translated text A2 is "Did Mr. Li pass the security check?", substituting the matching precisions calculated by formula (1) (see Table 1 above) and the length penalty factor calculated by formula (2) into formula (3) gives a BLEU score of 20.56 for the second translated text A2.

In this embodiment, a translation scoring threshold may be set in advance and used as the preset quality threshold. For example, the threshold is set to 50. Since the calculated score of 20.56 is smaller than the threshold of 50, it may be determined that the first translated text B1 (in the example above, "Does Lee Lee Has To Go Through Security?") is not a credible translation result of the source speech data. Conversely, when the score is greater than the threshold, it is determined that the first translated text B1 is a credible translation result of the source speech data.

In summary, the method for determining translation quality provided in this embodiment reverse-translates the first translated text to obtain the second translated text, and uses the BLEU algorithm to score the second translated text against the recognized text of the source speech data. The translation quality of the first translated text can then be judged from the scoring result, so that the translation quality is evaluated.

Third embodiment

In this embodiment, if the second embodiment determines that the translation quality of the first translated text is not greater than the preset quality threshold, that is, that the first translated text B1 is not a credible translation result of the source speech data, the judgment of the speech translation device may itself be inaccurate. Therefore, the speech translation device may interact with the user through step S102 of the first embodiment and determine, based on the user's feedback, whether the first translated text B1 is the correct translation result of the source speech data.

In an implementation manner of this embodiment, step S102 in the first embodiment may specifically include: using the second translated text to interact with the user to determine whether the first translated text is correct as the translation result of the source speech data. In this implementation manner, the second translated text A2 may be used as the content of the interaction with the user, and the determination may be made according to the user's feedback.

Specifically, this determination step can be implemented in the following manner.

FIG. 3 is a schematic flowchart of the method provided in this embodiment for determining whether a translation result is credible. The method may include the following steps:

S301: Output a first query voice to the user, where the first query voice is used to query whether the source voice data is semantically similar to the second translated text.

In this embodiment, the second translated text A2 can be synthesized into speech and used to interact with the user. The purpose of the interaction is to ask whether the sentence the user wants to translate is the second translated text A2 (that is, whether the source speech data and the second translated text A2 are semantically similar). For convenience of description and differentiation, this embodiment calls the speech used to query the user the first query voice, which may specifically be "Is what you want to translate 'second translated text A2'?".

For example, assume that the first translated text B1 is "Does Lee, through security?" and that reverse-translating it through step S201 of the second embodiment yields the second translated text A2 "Did Mr. Li pass the security inspection?". When the BLEU algorithm is used to score the second translated text A2, a score of, for example, 20.56 is obtained. Since this is lower than the preset quality threshold of 50, the reverse-translated second translated text A2 is synthesized into the first query voice, for example, "Is what you want to translate 'Mr. Li passed the security check?'?".

At this time, the voice translation device feeds back the first inquiry voice to the user, and waits for a response from the user.

S302: If a positive answer to the first query voice is received from the user, the first translated text is correct as the translation result of the source voice data.

The user can give a positive answer to the first query voice by voice or by key press. For example, the user can say "yes" to the voice translation device, or press a confirmation key such as "OK" on the voice translation device. In this case, the speech translation device considers the first translated text B1 to be a credible, that is, correct, translation result of the source speech data. Step S103 may then use the first translated text B1 as the translation result of the source speech data.

S303: If a negative answer to the first query voice is received from the user, the first translated text is incorrect as the translation result of the source voice data.

The user can give a negative answer to the first query voice by voice or by key press. For example, the user can say "No" to the voice translation device, or press the "NO" key on the voice translation device. In this case, the speech translation device considers the first translated text B1 to be an unreliable, that is, incorrect, translation result of the source speech data.

In summary, the method for judging whether a translation result is credible provided in this embodiment outputs a first query voice to the user, where the first query voice is used to query whether the source voice data is semantically similar to the second translated text. When a positive answer is received, the first translated text is considered a credible translation result of the source speech data; when a negative answer is received, it is considered unreliable. Through this human-computer interaction with the user, it is possible to confirm whether the first translated text is correct, which helps ensure the accuracy of the translation result.
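The decision flow of the second and third embodiments can be sketched as follows. The function name `confirm_first_translation`, the wording of the query, and the `ask_user` callback (which stands in for speech synthesis plus the user's voice or key answer) are all hypothetical and chosen only for illustration.

```python
from typing import Callable


def confirm_first_translation(
    score: float,
    threshold: float,
    second_translated_text: str,
    ask_user: Callable[[str], bool],
) -> bool:
    """Return True if the first translated text is accepted as correct.

    ask_user receives the first query and returns True for a positive
    answer ("yes" or a confirmation key) and False for a negative one.
    """
    # Second embodiment: a score above the threshold is treated as credible.
    if score > threshold:
        return True
    # S301: ask whether the user meant the reverse-translated sentence.
    query = f'Is what you want to translate "{second_translated_text}"?'
    # S302 / S303: a positive answer means correct, a negative one incorrect.
    return ask_user(query)


# A scripted user who rejects the reverse translation.
print(confirm_first_translation(
    20.56, 50, "Did Mr. Li pass the security inspection?", lambda query: False
))  # False
```

Passing the interaction in as a callback keeps the decision logic independent of how the device actually plays the query and collects the answer.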

Fourth embodiment

In this embodiment, when the first embodiment determines through step S102 that the first translated text is incorrect as the translation result of the source voice data, the first translated text may further be corrected, and the corrected text is used as the translation result of the source speech data.

When the correction is successful, the corrected text data can be used as the text translation result of the source speech data. The corrected text data can then be speech-synthesized to obtain the target speech data, and the target speech data is fed back directly to the user, thus ending this round of translation. Of course, after the corrected text data is used as the text translation result of the source speech data, other processing may also be performed on it; this embodiment does not limit the subsequent processing manner.

It can be seen that this embodiment adds a correction function for the translation result: the translation quality of the first translated text can be evaluated, and when the evaluation indicates that the translation quality of the first translated text as a translation result is low, the translation result can be corrected to improve its accuracy.

It should be noted that, based on any of the foregoing embodiments, the present application may modify the first translated text B1 according to the correction method provided by this embodiment.

In an implementation manner of this embodiment, a text matching manner may be specifically used to modify the first translated text B1. Next, a specific implementation manner of this correction step will be described.

Referring to FIG. 4, a flowchart of a method for correcting a translated text according to this embodiment is provided. The method for correcting a translated text includes the following steps:

S401: Perform a matching operation on the recognized text of the source voice data and the text data in a database.

In this embodiment, a database may be constructed in advance. The database stores at least one set of sentence pairs, where each sentence pair includes a first sample text and a second sample text obtained by correctly translating the first sample text. The language of the first sample text is the same as the language of the source speech data (the language before translation), and the language of the second sample text is the same as the language of the first translated text (the language after translation).

Specifically, a large number of first sample texts and the second sample texts obtained by correctly translating them can be collected in advance, each first sample text and its corresponding second sample text are formed into a sentence pair, and the database is constructed from these sentence pairs. The database may be a local database of the speech translation device, or a cloud server-side database that communicates with the speech translation device.

In this embodiment, the database can be constructed according to specific application requirements; that is, the database can store only sentence pairs related to a specific application scenario. For example, if a user needs to use the voice translation device during immigration security checks, sentence pairs commonly used in immigration and security checks are stored in the database in advance. Of course, the database can also store sentence pairs for multiple application scenarios. In actual applications, the application scenario can be determined automatically from the user's source voice data, and the set of sentence pairs for the corresponding application scenario is then selected.

It should be noted that this embodiment does not limit the number of sentence pairs in a given application scenario (for example, about 1 to 40,000 sentence pairs); however, to achieve the correction effect, the database needs to cover the sentence pairs commonly used in that scenario.

Taking the immigration security check scenario as an example, the storage format of a sentence pair in the database is as follows:

{"cn": "Do my luggage have to go through security?",
 "update_time": "20171018T173941",
 "en": "Must the luggage checked security?",
 "create_time": "20171018T173941",
 "id": "00000001"}

Where:

cn: the Chinese sentence;

en: the corresponding English sentence;

update_time: the time at which the sentence pair was uploaded to the database;

create_time: the time at which the sentence pair was created;

id: the unique identifier of the sentence pair in the database.
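As an illustration of this storage format, the record above could be loaded into a typed structure as sketched below. The class name `SentencePair`, the function name `parse_sentence_pair`, and the reading of the timestamps as year-month-day "T" hour-minute-second are assumptions made here, not details given in the application.

```python
import json
from dataclasses import dataclass
from datetime import datetime

# ASSUMPTION: "20171018T173941" is read as YYYYMMDD"T"HHMMSS.
TIME_FORMAT = "%Y%m%dT%H%M%S"


@dataclass(frozen=True)
class SentencePair:
    id: str                 # unique identifier of the sentence pair
    cn: str                 # first sample text (source language)
    en: str                 # second sample text (correct translation)
    create_time: datetime   # when the sentence pair was created
    update_time: datetime   # when the sentence pair was uploaded


def parse_sentence_pair(raw: str) -> SentencePair:
    """Parse one JSON record in the storage format shown above."""
    record = json.loads(raw)
    return SentencePair(
        id=record["id"],
        cn=record["cn"],
        en=record["en"],
        create_time=datetime.strptime(record["create_time"], TIME_FORMAT),
        update_time=datetime.strptime(record["update_time"], TIME_FORMAT),
    )


raw = (
    '{"cn": "Do my luggage have to go through security?", '
    '"update_time": "20171018T173941", '
    '"en": "Must the luggage checked security?", '
    '"create_time": "20171018T173941", "id": "00000001"}'
)
pair = parse_sentence_pair(raw)
print(pair.id, pair.create_time.isoformat())  # 00000001 2017-10-18T17:39:41
```

Keeping `id` as a string preserves the leading zeros of the identifier.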

In this embodiment, the recognized text A1 of the source speech data is matched with the text data in the database. For example, the Doc2Vec algorithm can be used for matching. Doc2Vec, also known as paragraph2vec or sentence embeddings, is an unsupervised algorithm.

S402: A first sample text that is most similar to the recognized text of the source speech data is obtained through the matching operation.

By matching the recognized text A1 with the text data in the database, the first sample text in the database that is most similar to the recognized text A1 is obtained; it is referred to as sample text A3 for short. During matching, the recognized text A1 can be vectorized to obtain its sentence vector. Then, for each first sample text in the database that is in the same language as the recognized text A1, the distance between the sentence vector of the recognized text A1 and the sentence vector of that first sample text is calculated, and the first sample text with the smallest distance is selected as the sample text A3 most similar to the recognized text A1.

For example, when the Doc2Vec algorithm is used for matching, suppose the recognized text A1 is "Does Lee have to go through security?" and it is matched against the database. If the sentence vector of the first sample text with id "00000001", "Do my luggage have to go through security?", has the shortest distance to the sentence vector of "Does Lee have to go through security?", that first sample text is taken as the sample text A3 most similar to the recognized text A1.
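The nearest-sentence lookup of steps S401 and S402 can be sketched as below. To stay self-contained, the sketch swaps in character-bigram count vectors with cosine distance as a stand-in for Doc2Vec sentence vectors; this substitution, the second sample sentence, and all function names are assumptions for illustration only.

```python
import math
from collections import Counter


def sentence_vector(text: str) -> Counter:
    """Stand-in vectorizer: character-bigram counts (NOT Doc2Vec)."""
    lowered = text.lower()
    return Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 1.0 when either vector is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def most_similar_id(recognized_text: str, first_sample_texts: dict[str, str]) -> str:
    """Return the id of the first sample text closest to the recognized text.

    Raises ValueError if the database is empty.
    """
    query = sentence_vector(recognized_text)
    return min(
        first_sample_texts,
        key=lambda sid: cosine_distance(query, sentence_vector(first_sample_texts[sid])),
    )


# Hypothetical database content; only id "00000001" echoes the example above.
samples = {
    "00000001": "Does luggage have to go through security?",
    "00000002": "Where is the boarding gate?",
}
print(most_similar_id("Does Lee have to go through security?", samples))  # 00000001
```

With a trained Doc2Vec model, only `sentence_vector` would change; the distance computation and the selection of the closest sample text stay the same.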

S403: Correct the first translated text according to the most similar first sample text.

In this embodiment, once the first sample text most similar to the recognized text A1 of the source speech data, that is, the sample text A3, has been obtained, the first translated text B1 may be corrected by using the sample text A3.

In an implementation manner of this embodiment, the second sample text in the sentence pair to which the sample text A3 belongs can be used directly as the successfully corrected text of the first translated text.

In another implementation manner of this embodiment, step S403 may specifically use the most similar first sample text to interact with the user to implement correction of the first translated text. In this implementation manner, the most similar first sample text, that is, sample text A3, may be used as the content for interaction with the user, and the first translated text may be modified according to the user feedback result.

The specific implementation of step S403 may include the following steps A-B:

Step A: output a second query voice to the user, wherein the second query voice is used to query whether the source voice data is semantically similar to the most similar first sample text.

The sample text A3 matched from the database can be synthesized into speech and used to interact with the user. The purpose of the interaction is to ask whether the sentence the user wants to translate is the sample text A3 (that is, whether the source speech data and the sample text A3 are semantically similar). For convenience of description and differentiation, this embodiment calls the speech used to query the user the second query voice, which may specifically be "Is what you want to translate 'sample text A3'?".

For example, assuming the sample text A3 is "Does luggage have to go through security?", the second query voice may be "Is what you want to translate 'Does luggage have to go through security?'?".

At this time, the voice translation device feeds back the second query voice to the user, and waits for a response from the user.

Step B: if a positive answer to the second query voice is received from the user, obtain the second sample text from the sentence pair to which the most similar first sample text belongs, as the successfully corrected text of the first translated text.

The user can give a positive answer to the second query voice by voice or by key press. For example, the user can say "yes" to the voice translation device, or press a confirmation key such as "OK" on the voice translation device. In this case, the second sample text, referred to here as sample text B3, can be obtained by querying the database for the sentence pair to which the sample text A3 belongs, and the sample text B3 is used as the successfully corrected text of the first translated text.

For example, suppose the user hears the second query voice of the voice translation device, "Is what you want to translate 'Does luggage have to go through security?'?", and answers "yes". The speech translation device then considers that the user wants to translate the sample text A3, "Does luggage have to go through security?", and takes the corresponding sample text B3 of the sentence pair, "Must the luggage checked security?", as the successfully corrected text of the first translated text B1. The correction is thus successful.

Further, the user may also make a negative answer to the second query voice. Therefore, this embodiment may further include:

Step C: if a negative answer to the second query voice is received from the user, a prompt voice is output, wherein the prompt voice is used to prompt the user to repeat the source voice data or to rephrase the source voice data.

The user can give a negative answer to the second query voice by voice or by key press. For example, the user can say "No" to the voice translation device, or press the "NO" key on the voice translation device. In this case, the correction is considered to have failed. The speech translation device may then ask the user, by voice, to repeat the source speech data or to rephrase it with a semantically similar wording, so as to start a new round of translation interaction.
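Steps A to C can be sketched as a single function. The names `CorrectionResult` and `correct_with_user`, the wording of the query and prompt, and the `ask_user` callback (standing in for speech output plus the user's voice or key answer) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CorrectionResult:
    corrected_text: Optional[str]  # sample text B3 on success, else None
    prompt: Optional[str]          # prompt voice content on failure, else None


def correct_with_user(
    sample_a3: str,
    sample_b3: str,
    ask_user: Callable[[str], bool],
) -> CorrectionResult:
    """Correct the first translated text by confirming sample text A3 with the user."""
    # Step A: second query voice, asking whether A3 is what the user meant.
    query = f'Is what you want to translate "{sample_a3}"?'
    if ask_user(query):
        # Step B: positive answer, so the paired translation B3 is the corrected text.
        return CorrectionResult(corrected_text=sample_b3, prompt=None)
    # Step C: negative answer, so ask the user to repeat or rephrase.
    return CorrectionResult(
        corrected_text=None,
        prompt="Please repeat or rephrase what you want to translate.",
    )


result = correct_with_user(
    "Does luggage have to go through security?",
    "Must the luggage checked security?",
    lambda query: True,  # scripted positive answer
)
print(result.corrected_text)  # Must the luggage checked security?
```

Returning either a corrected text or a prompt makes the two outcomes explicit, so the caller can end the round of translation or start a new one.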

In summary, the translated text correction method provided in this embodiment performs a matching operation between the recognized text of the source speech data and the text data in the database to obtain the sentence most similar to the recognized text, and then corrects the first translated text according to that most similar sentence. In this embodiment, sentence pairs for each translation direction and application scenario can be accumulated in advance and stored in the database. A matching algorithm is then used to find the sentence in the database most similar to the recognized text, and the translation of that sentence is used as the corrected text, thus achieving text correction.

Fifth Embodiment

This embodiment will introduce a speech translation device. For related content, refer to the foregoing method embodiment. It should be noted that the voice translation device may be the above-mentioned voice translation device, or may be a part of the above-mentioned voice translation device.

Referring to FIG. 5, a composition diagram of a speech translation apparatus according to this embodiment is provided. The apparatus 500 includes:

A voice translation unit 501, configured to translate a user's source voice data to obtain a first translated text, wherein a language of the first translated text is different from a language of the source voice data;

The user interaction unit 502 is configured to determine whether the translation result of the first translated text as the source voice data is correct by interacting with the user.

In an implementation manner of this embodiment, the apparatus 500 may further include:

A quality determining unit, configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold, wherein the translation quality of the first translated text is used to characterize the first translated text as the source speech data The correctness of the translation result; if not, triggering the user interaction unit 502 to determine whether the translation result of the first translated text as the source speech data is correct by interacting with the user.

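In an implementation manner of this embodiment, the quality determination unit includes:

A reverse translation subunit, configured to translate the first translated text to obtain a second translated text, wherein the language of the second translated text is the same as the language of the source speech data;

A quality judging subunit, configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold according to the second translated text.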

In an implementation manner of this embodiment, the quality judging subunit is specifically configured to determine whether the translation quality of the first translated text is greater than the preset quality threshold according to the recognized text of the source voice data and the second translated text.

In an implementation manner of this embodiment, the user interaction unit 502 may be specifically configured to use the second translated text to interact with the user to determine whether the translation result of the first translated text as the source speech data is correct.

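In an implementation manner of this embodiment, the user interaction unit 502 may include:

A first query subunit, configured to output a first query voice to the user, wherein the first query voice is used to query whether the source voice data is semantically similar to the second translated text;

A result determining subunit, configured to: if a positive answer to the first query voice is received from the user, determine that the first translated text is correct as a translation result of the source voice data; if a negative answer to the first query voice is received from the user, determine that the first translated text is incorrect as a translation result of the source voice data.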

In an implementation manner of this embodiment, the text correction unit may be specifically configured to correct the first translated text in a text matching manner.

In an implementation manner of this embodiment, the text correction unit may include:

In an implementation manner of this embodiment, the text correction subunit may be specifically configured to interact with the user by using the most similar first sample text to implement correction on the first translated text.

In an implementation manner of this embodiment, the text correction subunit may include:

In an implementation manner of this embodiment, the text correction subunit may further include:

Sixth embodiment

This embodiment will introduce another speech translation device. For related content, refer to the foregoing method embodiment.

Referring to FIG. 6, a schematic diagram of a hardware structure of a speech translation apparatus according to this embodiment is provided. The speech translation apparatus 600 includes a memory 601, a receiver 602, and a processor 603 connected to the memory 601 and the receiver 602 respectively. The memory 601 is configured to store a set of program instructions, and the processor 603 is configured to call the program instructions stored in the memory 601 to perform the following operations:

In an implementation manner of this embodiment, the processor 603 is further configured to call a program instruction stored in the memory 601 to perform the following operations:

If not, perform the step of interacting with the user.

If a negative answer to the first query voice is received by the user, the first translation text is incorrect as a translation result of the source voice data.

The first translation text is corrected by using a text matching method.

Correct the first translated text based on the most similar first sample text.

In some embodiments, the processor 603 may be a central processing unit (CPU), the memory 601 may be an internal memory of random access memory (RAM) type, and the receiver 602 may include a common physical interface, and the physical interface may be an Ethernet interface or an Asynchronous Transfer Mode (ATM) interface. The processor 603, the receiver 602, and the memory 601 may be integrated into one or more independent circuits or hardware, such as: Application Specific Integrated Circuit (ASIC).

Further, this embodiment also provides a computer-readable storage medium, which includes instructions that, when run on a computer, cause the computer to perform any one of the above-mentioned voice translation methods.

Further, this embodiment also provides a computer program product, which, when the computer program product runs on a terminal device, causes the terminal device to execute any one of the above-mentioned voice translation methods.

From the description of the foregoing embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the existing technology, can be embodied in the form of a software product. The software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of this application.

It should be noted that each embodiment in this specification is described in a progressive manner. Each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments may refer to each other. As for the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part may refer to the description of the method.

It should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order among these entities or operations. Moreover, the terms "including", "comprising", or any other variation thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, this application will not be limited to the embodiments shown herein, but should conform to the widest scope consistent with the principles and novel features disclosed herein.

Claims

A speech translation method, comprising:

Translating the user's source speech data to obtain a first translated text, wherein the language of the first translated text is different from the language of the source speech data;

By interacting with the user, it is determined whether the translation result of the first translated text as the source speech data is correct.
The method according to claim 1, wherein after determining whether the translation result of the first translated text as the source speech data is correct, further comprising:

If it is determined that the translation result of the first translated text as the source speech data is incorrect, the first translated text is corrected, and the corrected text is used as the translation result of the source speech data.
The method according to claim 1, before the interacting with the user, further comprising:

Determine whether the translation quality of the first translated text is greater than a preset quality threshold, wherein the translation quality of the first translated text is used to characterize the correctness of the translation result of the first translated text as the source speech data;

If not, perform the step of interacting with the user.
The method according to claim 3, wherein the determining whether the translation quality of the first translated text is greater than a preset quality threshold comprises:

Translating the first translated text to obtain a second translated text, wherein the language of the second translated text is the same as the language of the source speech data;

Determining whether the translation quality of the first translated text is greater than a preset quality threshold according to the second translated text.
The method according to claim 4, wherein determining whether the translation quality of the first translated text is greater than a preset quality threshold based on the second translated text, comprises:

Determining whether the translation quality of the first translated text is greater than a preset quality threshold according to the recognized text of the source speech data and the second translated text.
The method according to claim 4, wherein the determining whether the first translation text is the correct translation result of the source speech data by interacting with the user comprises:

Interact with the user by using the second translated text to determine whether the translation result of the first translated text as the source speech data is correct.
The method according to claim 6, wherein the using the second translated text to interact with the user to determine whether the translation result of the first translated text as the source speech data is correct comprises:

Outputting a first query voice to the user, wherein the first query voice is used to query whether the source voice data is similar to the semantics of the second translated text;

If a positive answer to the first query voice is received from the user, the first translated text is correct as a translation result of the source voice data;

If a negative answer to the first query voice is received from the user, the first translated text is incorrect as a translation result of the source voice data.
The method according to any one of claims 2 to 7, wherein the modifying the first translated text comprises:

The first translation text is corrected by using a text matching method.
The method according to claim 8, wherein the modifying the first translated text in a text matching manner comprises:

Match the recognized text of the source speech data with text data in a database, wherein the database stores at least one set of sentence pairs, the sentence pairs including a first sample text and a second sample text obtained by correctly translating the first sample text, the language of the first sample text is the same as the language of the source speech data, and the language of the second sample text is the same as the language of the first translated text;

Obtaining the first sample text most similar to the recognition text of the source speech data through the matching operation;

Correct the first translated text based on the most similar first sample text.
The method according to claim 9, wherein the modifying the first translated text based on the most similar first sample text comprises:

Interacting with the user by using the most similar first sample text to achieve correction of the first translated text.
The method according to claim 10, wherein the interacting with the user by using the most similar first sample text to achieve the correction of the first translated text comprises:

Outputting a second query voice to the user, wherein the second query voice is used to query whether the source voice data is semantically similar to the most similar first sample text;

If a positive answer to the second query voice is received from the user, a second sample text is obtained from the sentence pair to which the most similar first sample text belongs, as the successfully corrected text of the first translated text.
The method according to claim 11, further comprising:

If a negative answer to the second query voice is received from the user, a prompt voice is output, wherein the prompt voice is used to prompt the user to repeat the source voice data or to rephrase the source voice data.
A speech translation device, comprising:

A voice translation unit, configured to translate a user's source voice data to obtain a first translated text, wherein a language of the first translated text is different from a language of the source voice data;

A user interaction unit is configured to determine whether the translation result of the first translated text as the source voice data is correct by interacting with the user.
The apparatus according to claim 13, further comprising:

A text correction unit, configured to correct the first translated text if it is determined that the translation result of the first translated text as the source speech data is incorrect, and use the corrected text as the translation result of the source speech data.
The apparatus according to claim 13, further comprising:

A quality determining unit, configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold, wherein the translation quality of the first translated text characterizes how correct the first translated text is as a translation result of the source speech data; and, if not, to trigger the user interaction unit to determine, by interacting with the user, whether the first translated text is correct as a translation result of the source speech data.
The device according to claim 15, wherein the quality determining unit comprises:

A reverse translation subunit, configured to translate the first translated text to obtain a second translated text, wherein the language of the second translated text is the same as the language of the source speech data;

A quality judging subunit, configured to determine whether the translation quality of the first translated text is greater than a preset quality threshold according to the second translated text.
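The reverse-translation check above can be illustrated with a minimal Python sketch. The claims only state that quality is judged according to the second translated text; comparing it with the recognized text of the source speech, the use of `difflib.SequenceMatcher` as the similarity measure, the function names, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def translation_quality(source_text: str, back_translated_text: str) -> float:
    """Hypothetical quality score in [0, 1].

    Assumption: quality is approximated by the character-level similarity
    between the recognized source text and the second (back-translated) text.
    """
    return SequenceMatcher(None, source_text, back_translated_text).ratio()


def needs_user_confirmation(source_text: str,
                            back_translated_text: str,
                            threshold: float = 0.8) -> bool:
    """Return True when quality is NOT greater than the preset threshold,
    i.e. when the user interaction unit should be triggered."""
    return not translation_quality(source_text, back_translated_text) > threshold
```

Under these assumptions, a back-translation identical to the source text scores 1.0 and skips the user query, while an unrelated back-translation scores near 0 and triggers it.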
The device according to claim 16, wherein the user interaction unit is specifically configured to interact with the user by using the second translated text, so as to determine whether the first translated text is correct as a translation result of the source speech data.
The apparatus according to claim 17, wherein the user interaction unit comprises:

A first query subunit, configured to output a first query voice to the user, wherein the first query voice is used to query whether the source voice data is semantically similar to the second translated text;

A result determining subunit, configured to: if a positive answer to the first query voice is received from the user, determine that the first translated text is correct as a translation result of the source voice data; and if a negative answer to the first query voice is received from the user, determine that the first translated text is incorrect as a translation result of the source voice data.
The device according to any one of claims 14 to 18, wherein the text correction unit is specifically configured to correct the first translated text in a text matching manner.
The apparatus according to claim 19, wherein the text correction unit comprises:

A text matching subunit, configured to match the recognized text of the source speech data with text data in a database, wherein the database stores at least one sentence pair, the sentence pair includes a first sample text and a second sample text obtained by correctly translating the first sample text, the language of the first sample text is the same as the language of the source speech data, and the language of the second sample text is the same as the language of the first translated text;

A text obtaining subunit, configured to obtain, through the matching operation, a first sample text that is most similar to the recognized text of the source speech data;

A text correction subunit, configured to correct the first translated text according to the most similar first sample text.
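As an illustration of the text matching described above, the following Python sketch looks up the sentence pair whose first sample text is most similar to the recognized text. The in-memory list standing in for the database, the `SentencePair` alias, the function name, and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions, not details taken from the claims.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical representation of one database entry:
# (first sample text in the source language, second sample text in the target language)
SentencePair = Tuple[str, str]


def most_similar_pair(recognized_text: str,
                      database: List[SentencePair]) -> Optional[SentencePair]:
    """Return the pair whose first sample text best matches the recognized text.

    Assumption: similarity is the character-level SequenceMatcher ratio.
    Returns None when the database is empty.
    """
    if not database:
        return None
    return max(
        database,
        key=lambda pair: SequenceMatcher(None, recognized_text, pair[0]).ratio(),
    )
```

The second element of the returned pair is the candidate corrected text. A production system would likely use a semantic similarity measure rather than character overlap.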
The device according to claim 20, wherein the text correction subunit is specifically configured to interact with the user by using the most similar first sample text to implement correction on the first translated text.
The apparatus according to claim 21, wherein the text correction subunit comprises:

A second query subunit, configured to output a second query voice to the user, wherein the second query voice is used to query whether the source voice data is semantically similar to the most similar first sample text;

A modification completion subunit, configured to, if a positive answer to the second query voice is received from the user, obtain a second sample text from the sentence pair to which the most similar first sample text belongs, as the successfully corrected text of the first translated text.
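The confirmation step of the second query subunit and the modification completion subunit could be sketched as follows. The `ask_user` callback stands in for outputting the second query voice and receiving the user's yes/no answer. The callback, the function name, and the query wording are hypothetical.

```python
from typing import Callable, Optional, Tuple


def confirm_and_correct(candidate_pair: Tuple[str, str],
                        ask_user: Callable[[str], bool]) -> Optional[str]:
    """Ask whether the source speech means the same as the first sample text.

    Returns the second sample text (the corrected translation) on a positive
    answer, or None on a negative answer, in which case the caller would
    prompt the user to repeat or replace the source speech.
    """
    first_sample, second_sample = candidate_pair
    query = f'Did you mean: "{first_sample}"?'  # hypothetical query wording
    if ask_user(query):
        return second_sample
    return None
```

Passing the interaction in as a callback keeps the sketch independent of any particular speech synthesis or recognition interface.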
A speech translation device, comprising: a processor, a memory, and a system bus;

The processor and the memory are connected through the system bus;

The memory is configured to store one or more programs, and the one or more programs include instructions that, when executed by the processor, cause the processor to execute the method according to any one of claims 1-12.
A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1-12.
A computer program product, wherein when the computer program product is run on a terminal device, the terminal device executes the method according to any one of claims 1-12.