CN113378586B - Speech translation method, translation model training method, device, medium, and apparatus

Info

Publication number: CN113378586B
Application number: CN202110801927.5A
Authority: CN
Inventors: 刘志成; 童剑; 王明轩; 李磊
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2021-07-15
Filing date: 2021-07-15
Publication date: 2023-03-28
Anticipated expiration: 2041-07-15
Also published as: CN113378586A

Abstract

The disclosure relates to a speech translation method, a translation model training method, an apparatus, a medium, and a device. The speech translation method includes: acquiring target speech data to be translated; performing speech recognition on the target speech data to obtain a target speech recognition text; and inputting the target speech recognition text into a pre-trained translation model to obtain a target translation text, wherein the translation model is trained on punctuation perturbation text. This improves the robustness of the translation model to punctuation, so punctuation errors in the target translation text can be effectively avoided and the accuracy of speech translation is improved, allowing users to understand the target translation text quickly and accurately and to communicate more easily. In addition, because the target speech recognition text is input directly into the translation model and no punctuation correction of that text is needed beforehand, speech translation takes little time and has good real-time performance, which makes the method suitable for simultaneous interpretation scenarios.

Description

Speech translation method, translation model training method, device, medium, and apparatus

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a speech translation method, a translation model training method, an apparatus, a medium, and a device.

Background

With the development of artificial intelligence, natural language processing, and related technologies, speech translation is now widely used in scenarios such as simultaneous interpretation and foreign language teaching. In simultaneous interpretation, for example, speech translation technology can synchronously convert a speaker's speech into text in another language, which facilitates communication. In the related art, however, a speech translation method usually performs speech recognition on the speaker's speech and then machine-translates the speech recognition text to obtain text in the corresponding language. The accuracy of such speech translation is not high: punctuation errors, for example, arise easily, which hinders the user's understanding of the translated text and causes inconvenience.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a speech translation method, including: acquiring target speech data to be translated; performing speech recognition on the target speech data to obtain a target speech recognition text; and inputting the target speech recognition text into a pre-trained translation model to obtain a target translation text, wherein the translation model is trained on punctuation perturbation text, the punctuation perturbation text is obtained by performing punctuation perturbation processing on historical speech recognition text, and the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition.

In a second aspect, the present disclosure provides a translation model training method, including: acquiring a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; performing punctuation perturbation processing on some or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, wherein the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition; and performing model training by taking the plurality of historical speech recognition texts, the punctuation perturbation texts, and the sample texts in general training samples as the input of a model, and taking each historical translation text and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, to obtain the translation model.

In a third aspect, the present disclosure provides a speech translation apparatus, including: a first acquisition module, configured to acquire target speech data to be translated; a recognition module, configured to perform speech recognition on the target speech data acquired by the first acquisition module to obtain a target speech recognition text; and a translation module, configured to input the target speech recognition text obtained by the recognition module into a pre-trained translation model to obtain a target translation text, wherein the translation model is trained on punctuation perturbation text, the punctuation perturbation text is obtained by performing punctuation perturbation processing on historical speech recognition text, and the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition.

In a fourth aspect, the present disclosure provides a translation model training apparatus, including: a second acquisition module, configured to acquire a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; a perturbation processing module, configured to perform punctuation perturbation processing on some or all of the historical speech recognition texts acquired by the second acquisition module to obtain punctuation perturbation texts, wherein the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition; and a training module, configured to perform model training by taking the plurality of historical speech recognition texts acquired by the second acquisition module, the punctuation perturbation texts obtained by the perturbation processing module, and the sample texts in general training samples as the input of a model, and taking each historical translation text acquired by the second acquisition module and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, to obtain the translation model.

In a fifth aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method provided by the first or second aspect of the present disclosure.

In a sixth aspect, the present disclosure provides an electronic device, including: a storage apparatus having a computer program stored thereon; and a processing apparatus configured to execute the computer program in the storage apparatus to implement the steps of the method provided by the first or second aspect of the present disclosure.

In the above technical solution, speech recognition is first performed on the target speech data to be translated to obtain a target speech recognition text; the target speech recognition text is then input into a pre-trained translation model to obtain a target translation text, wherein the translation model is trained on punctuation perturbation text, and the punctuation perturbation text is obtained by performing punctuation perturbation processing on historical speech recognition text. This improves the robustness of the translation model to punctuation, so punctuation errors in the target translation text can be effectively avoided and the accuracy of speech translation is improved, allowing users to understand the target translation text quickly and accurately and to communicate more easily. In addition, because the target speech recognition text is input directly into the translation model and no punctuation correction of that text is needed beforehand, speech translation takes little time and has good real-time performance, which makes the method suitable for simultaneous interpretation scenarios.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flow diagram illustrating a method of speech translation according to an exemplary embodiment.

FIG. 2 is a flow diagram illustrating a translation model training method in accordance with an exemplary embodiment.

FIG. 3 is a flowchart illustrating a translation model training method according to another exemplary embodiment.

FIG. 4 is a flowchart illustrating a translation model training method according to another exemplary embodiment.

FIG. 5 is a block diagram illustrating a speech translation apparatus according to an exemplary embodiment.

FIG. 6 is a block diagram illustrating a translation model training apparatus in accordance with an exemplary embodiment.

FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein is intended to be open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art should understand them to mean "one or more" unless the context clearly indicates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

FIG. 1 is a flow diagram illustrating a method of speech translation according to an exemplary embodiment. As shown in FIG. 1, the method includes S101 to S103.

In S101, target speech data to be translated is acquired.

In the present disclosure, the target speech data to be translated may be speech data in any language, such as Chinese, English, or French.

In S102, speech recognition is performed on the target speech data to obtain a target speech recognition text.

In the present disclosure, speech recognition may be performed on the target speech data by a speech recognition model to obtain the target speech recognition text. The speech recognition model may be chosen according to actual conditions; for example, it may be an automatic speech recognition (ASR) model.

In S103, the target speech recognition text is input into a pre-trained translation model to obtain a target translation text, wherein the translation model is trained on punctuation perturbation text.

In the present disclosure, the punctuation perturbation text is obtained by performing punctuation perturbation processing on historical speech recognition texts (i.e., training samples), where the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition. Punctuation deletion means randomly deleting a first preset proportion (for example, 5%) of the punctuation marks in the historical speech recognition text. Punctuation modification means randomly replacing a second preset proportion (for example, 6%) of the punctuation marks in the historical speech recognition text with any other punctuation marks. Punctuation addition means randomly inserting, at arbitrary positions in the historical speech recognition text, punctuation marks that already occur in that text.
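The three perturbation operations can be sketched in Python as follows. This is only an illustration under stated assumptions: the punctuation inventory `PUNCTUATION`, the function names, and the per-mark random draw used to approximate the preset proportion are not taken from the disclosure.

```python
import random

# Hypothetical punctuation inventory; the disclosure does not enumerate one.
PUNCTUATION = set(",.?!;:，。？！；：、")

def delete_punctuation(text, ratio=0.05, rng=random):
    """Drop each punctuation mark with probability `ratio` (about 5% overall)."""
    return "".join(
        ch for ch in text
        if not (ch in PUNCTUATION and rng.random() < ratio)
    )

def modify_punctuation(text, ratio=0.06, rng=random):
    """Replace each punctuation mark, with probability `ratio`, by a different mark."""
    out = []
    for ch in text:
        if ch in PUNCTUATION and rng.random() < ratio:
            out.append(rng.choice(sorted(PUNCTUATION - {ch})))
        else:
            out.append(ch)
    return "".join(out)

def add_punctuation(text, count=1, rng=random):
    """Insert `count` marks, drawn from those already in `text`, at random positions."""
    present = [ch for ch in text if ch in PUNCTUATION]
    if not present:
        return text  # nothing to draw from, leave the text unchanged
    chars = list(text)
    for _ in range(count):
        chars.insert(rng.randint(0, len(chars)), rng.choice(present))
    return "".join(chars)
```

Passing a seeded `random.Random` instance as `rng` makes the perturbation reproducible, which may be useful when regenerating a training set.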

It should be noted that the language of the target translation text may be any other language different from the language of the target voice data to be translated, and this disclosure is not limited in particular.
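As a minimal sketch, S101 to S103 can be written as a two-stage pipeline. The function `translate_speech` and its `recognize` and `translate` parameters are hypothetical stand-ins for the speech recognition model and the translation model; the disclosure does not prescribe any particular interface.

```python
from typing import Callable

def translate_speech(
    audio: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
) -> str:
    """Run S101-S103: take audio, recognize it, translate the raw transcript."""
    transcript = recognize(audio)  # S102: speech recognition
    # S103: the transcript is passed on as-is; no punctuation-correction
    # step sits between recognition and translation.
    return translate(transcript)

# Usage with stub models (placeholders, not real ASR or MT systems).
result = translate_speech(
    b"\x00\x01",
    recognize=lambda audio: "hello , world",
    translate=lambda text: text.upper(),
)
```

The point of the sketch is the absence of an intermediate correction stage, which is what the disclosure credits for the low latency.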

The following describes a specific training method for the translation model. Specifically, the translation model can be trained through S201 to S203 shown in FIG. 2.

In S201, a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text are acquired.

In S202, punctuation perturbation processing is performed on some or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts.

For example, a punctuation modification operation is performed on 30% of the plurality of historical speech recognition texts, a punctuation addition operation is performed on 10% of them, and a punctuation deletion operation is performed on 20% of them.
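One way to realize the 30% / 10% / 20% example is to shuffle the corpus and cut it into slices. The sketch below assumes the three subsets are disjoint, which the disclosure does not state; `split_for_perturbation` is a hypothetical helper name.

```python
import random

def split_for_perturbation(texts, fractions, rng=random):
    """Assign disjoint random subsets of `texts` to each operation in `fractions`."""
    if sum(fractions.values()) > 1.0:
        raise ValueError("fractions must sum to at most 1")
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    plan, start = {}, 0
    for operation, fraction in fractions.items():
        size = round(len(texts) * fraction)
        plan[operation] = [texts[i] for i in indices[start:start + size]]
        start += size
    return plan

# Usage: the proportions from the example in the text.
corpus = [f"sentence {i}." for i in range(10)]
plan = split_for_perturbation(
    corpus, {"modify": 0.3, "add": 0.1, "delete": 0.2}, rng=random.Random(1)
)
```

Texts that fall outside every slice are left unperturbed and still serve as ordinary training inputs.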

In S203, model training is performed by taking the plurality of historical speech recognition texts, the punctuation perturbation texts, and the sample texts in general training samples as the input of the model, and taking each historical translation text and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, so as to obtain the translation model.

In the present disclosure, the general training samples may be general-purpose corpus samples, that is, samples that apply across industry fields and therefore meet the needs of general domains. Each general training sample includes a sample text and a sample translation text corresponding to the sample text.

Because relatively few historical speech recognition texts can be acquired, the historical speech recognition texts, the punctuation perturbation texts, and the historical translation texts alone cannot satisfy the translation model's demand for training samples. Therefore, during model training, the general training samples are used as training samples together with the historical speech recognition texts, the punctuation perturbation texts, and the historical translation texts.
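The assembly of the training set in S203 can be sketched as below. The sketch assumes that each perturbed text is paired with the historical translation text of the text it was derived from, which is how the input/target lists in the disclosure are read here; `build_training_pairs` and `perturb` are hypothetical names.

```python
def build_training_pairs(history_pairs, perturb, general_pairs):
    """Combine original, perturbed, and general-corpus (source, target) pairs."""
    pairs = list(history_pairs)
    for source, target in history_pairs:
        noisy = perturb(source)
        if noisy != source:
            # The perturbed input keeps the translation of the original text.
            pairs.append((noisy, target))
    pairs.extend(general_pairs)
    return pairs

# Usage with a toy perturbation that strips commas.
history = [("hi, there.", "salut."), ("ok", "d'accord")]
general = [("general sentence", "phrase générale")]
training_pairs = build_training_pairs(
    history, lambda s: s.replace(",", ""), general
)
```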

Specific embodiments of acquiring the historical translation text corresponding to each historical speech recognition text in S201 are described in detail below.

Specifically, the historical translation text may be obtained in a number of ways. In one embodiment, each historical speech recognition text is translated directly (by human or machine translation) to obtain the historical translation text corresponding to that historical speech recognition text.

In another embodiment, for each historical speech recognition text, spoken language smoothing processing is performed on the historical speech recognition text, and the smoothed text is then translated to obtain the historical translation text corresponding to that historical speech recognition text.

In the present disclosure, the spoken language smoothing processing includes at least one of deduplication, stop-word filtering, and correction. Stop words are words with no actual meaning, such as modal particles (e.g., "kah", "o", and the like), conjunctions (e.g., "do" and the like), and habitual filler phrases (e.g., "o" and the like).

Stop words may appear in the historical speech recognition text. They have little influence on semantic expression and do not need to be translated, so the stop words in the historical speech recognition text can be filtered out.

Because a user may repeat a word or restate a phrase while speaking, repeated content may also appear in the historical speech recognition text (for example, in "important information of common interest and attention of everyone is important for all people", "important information of common interest" is repeated content). In this case, the historical speech recognition text may be deduplicated, which avoids repeated translation by the translation model and improves its translation quality.

In addition, recognition errors may occur in the historical speech recognition text, such as homophone errors, near-homophone errors, and omissions. In this case, the historical speech recognition text may be corrected to eliminate the recognition errors, which avoids poor downstream translation quality caused by inaccurate upstream speech recognition results, safeguards the translation quality of the translation model, and improves the accuracy of speech translation.

Preferably, the spoken language smoothing processing includes deduplication, stop-word filtering, and correction. Performing spoken language smoothing on the historical speech recognition text in this way both avoids poor downstream translation quality caused by inaccurate upstream speech recognition results and removes noise data such as repeated content and stop words, thereby improving the translation quality of the translation model.
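Two of the three smoothing operations, stop-word filtering and deduplication, can be sketched on whitespace-tokenized text as follows. The stop-word set, the n-gram limit `max_len`, and all function names are assumptions made for illustration; the correction step is omitted because it would need a separate error-correction model that the disclosure does not specify.

```python
def filter_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

def remove_repeats(tokens, max_len=4):
    """Collapse immediately repeated n-grams (n <= max_len)."""
    out = []
    i = 0
    while i < len(tokens):
        skipped = False
        for n in range(max_len, 0, -1):
            # Skip the next n tokens if they repeat the last n kept tokens.
            if len(out) >= n and tokens[i:i + n] == out[-n:]:
                i += n
                skipped = True
                break
        if not skipped:
            out.append(tokens[i])
            i += 1
    return out

def smooth(text, stop_words=frozenset({"um", "uh"})):
    """Filter stop words first, then deduplicate the remaining tokens."""
    tokens = filter_stop_words(text.split(), stop_words)
    return " ".join(remove_repeats(tokens))

smoothed = smooth("um we need uh we need to go go now")
```

Filtering before deduplication is a design choice of this sketch: a filler word sitting between two copies of a phrase would otherwise hide the repetition from the n-gram comparison.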

In the translation model training stage, compared with directly using the translation of the historical speech recognition text as the target output of the model, using the translation of the smoothed historical speech recognition text as the target output reduces the translation model's dependence on spoken language smoothing. The translation model is therefore more robust, the accuracy of speech translation is improved, and the method is better suited to simultaneous interpretation scenarios.

In yet another embodiment, for each historical speech recognition text, an annotated text corresponding to the historical speech data is acquired, where the historical speech data is the speech data corresponding to the historical speech recognition text; spoken language smoothing processing is performed on the annotated text; and the smoothed annotated text is translated to obtain the historical translation text corresponding to the historical speech recognition text.

In the translation model training stage, compared with directly using the translation of the historical speech recognition text as the target output of the model, using the translation of the smoothed annotated text as the target output reduces the translation model's dependence on spoken language smoothing. The translation model is therefore more robust, the accuracy of speech translation is improved, and the method is better suited to simultaneous interpretation scenarios.

FIG. 3 is a flowchart illustrating a translation model training method according to another exemplary embodiment. As shown in FIG. 3, before S203, the training method further includes S204.

In S204, it is determined whether noise data exists in each of the historical speech recognition texts.

The noise data may include erroneous punctuation, repeated content, stop words, and the like.

In this case, in S203, model training may be performed by taking the plurality of historical speech recognition texts, an identifier indicating whether noise data exists in each historical speech recognition text, the punctuation perturbation texts, and the sample texts in the general training samples as the input of the model, and taking each historical translation text and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, so as to obtain the translation model. Because the identifier indicating whether noise data exists in each historical speech recognition text is added during training, the translation model can effectively distinguish training samples that contain noise data from those that do not. Training samples without noise data can therefore be fully utilized (the general training samples are samples without noise data, so the general training samples in particular are fully utilized), sample attenuation during fine-tuning can be reduced, the training efficiency of the translation model can be improved, and the translation quality of the translation model can be improved at the same time.
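The disclosure does not say how the noise identifier is encoded in the model input. One common approach, assumed here purely for illustration, is to prepend a tag token to the source text; the tag strings and function names below are hypothetical.

```python
NOISY_TAG = "<noisy>"  # hypothetical tag tokens, not taken from the disclosure
CLEAN_TAG = "<clean>"

def tag_source(text, has_noise):
    """Prepend a tag that tells the model whether the input contains noise data."""
    return f"{NOISY_TAG if has_noise else CLEAN_TAG} {text}"

def tag_dataset(history_texts, noise_flags, general_texts):
    """Tag historical texts by their flags; general samples count as noise-free."""
    tagged = [tag_source(t, flag) for t, flag in zip(history_texts, noise_flags)]
    tagged += [tag_source(t, False) for t in general_texts]
    return tagged

inputs = tag_dataset(
    ["um so we we start", "we start now"], [True, False], ["a general sentence"]
)
```

With such a scheme the tag tokens would need to be added to the model vocabulary, and an appropriate tag would also have to be supplied at inference time.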

Specific embodiments of determining, in S204, whether noise data exists in each historical speech recognition text are described in detail below.

Specifically, whether noise data exists in a historical speech recognition text may be determined in a number of ways. In one embodiment, for each historical speech recognition text, spoken language smoothing processing is performed on the historical speech recognition text; a first similarity between the smoothed text and the original historical speech recognition text is calculated; and if the first similarity is greater than a first preset similarity threshold, it is determined that noise data exists in the historical speech recognition text.

In the present disclosure, the first similarity between the smoothed text and the original historical speech recognition text can be measured by, for example, the Euclidean distance or the cosine distance. The closer the smoothed text is to the original historical speech recognition text, the less likely it is that noise data exists in the original; the more the two differ, the more likely it is that noise data exists. Therefore, with the first similarity measured as such a distance, if it is greater than the first preset similarity threshold, it is determined that noise data exists in the historical speech recognition text.

In another embodiment, for each historical speech recognition text, an annotated text corresponding to the historical speech data is acquired, where the historical speech data is the speech data corresponding to the historical speech recognition text; a second similarity between the annotated text and the historical speech recognition text is calculated; and if the second similarity is greater than a second preset similarity threshold, it is determined that noise data exists in the historical speech recognition text.

In the present disclosure, the second similarity between the annotated text and the historical speech recognition text can likewise be measured by, for example, the Euclidean distance or the cosine distance. The closer the annotated text is to the historical speech recognition text, the less likely it is that noise data exists in the historical speech recognition text; the more the two differ, the more likely it is that noise data exists. Therefore, with the second similarity measured as such a distance, if it is greater than the second preset similarity threshold, it is determined that noise data exists in the historical speech recognition text.
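The noise check can be sketched as follows. The disclosure measures the score with the Euclidean or cosine distance and flags noise when the score exceeds the threshold; the sketch assumes the score is a cosine distance over bag-of-words counts, so a larger value means the two texts differ more. The vectorization, the threshold value, and the function names are assumptions. The `reference_text` argument stands for either the smoothed text (first embodiment) or the annotated text (second embodiment).

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """Cosine distance between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (
        math.sqrt(sum(c * c for c in va.values()))
        * math.sqrt(sum(c * c for c in vb.values()))
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm

def has_noise(recognized_text, reference_text, threshold=0.05):
    """Flag noise when the distance to the reference exceeds the threshold."""
    return cosine_distance(recognized_text, reference_text) > threshold

noisy = has_noise("um we need uh we need to go go now", "we need to go now")
clean = has_noise("we need to go now", "we need to go now")
```

In practice the threshold would be tuned on held-out data, and sentence embeddings could replace the word counts.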

In addition, to further improve the translation quality of the translation model, the training samples of the translation model may be screened. Specifically, as shown in FIG. 4, before S202, the training method further includes S205.

In S205, historical translation texts with poor translation quality are removed from all the historical translation texts, and the historical speech recognition texts corresponding to those historical translation texts are removed from the plurality of historical speech recognition texts.

In this case, in S202, punctuation perturbation processing may be performed on some or all of the historical speech recognition texts remaining after the removal to obtain the punctuation perturbation texts. In S203, model training may be performed by taking the historical speech recognition texts remaining after the removal, the punctuation perturbation texts, and the sample texts in the general training samples as the input of the model, and taking the historical translation texts remaining after the removal and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, so as to obtain the translation model.
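The screening in S205 can be sketched as a filter over (recognized text, translation) pairs. The disclosure does not say how translation quality is judged, so the `quality` scorer below is a hypothetical placeholder; the length-ratio heuristic is only a toy stand-in for a real quality estimate or human review.

```python
def drop_poor_translations(pairs, quality, min_score):
    """Keep only (recognized_text, translation) pairs scoring at least `min_score`."""
    return [(src, tgt) for src, tgt in pairs if quality(src, tgt) >= min_score]

def length_ratio(src, tgt):
    """Toy scorer (assumption): ratio of the shorter length to the longer one."""
    longer = max(len(src), len(tgt))
    return 0.0 if longer == 0 else min(len(src), len(tgt)) / longer

kept = drop_poor_translations(
    [("hello world", "bonjour le monde"), ("a long sentence here", "x")],
    quality=length_ratio,
    min_score=0.5,
)
```

Filtering pairs together keeps the historical speech recognition texts and the historical translation texts aligned, which matches the requirement that both sides of a poor pair be removed.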

The present disclosure further provides a translation model training method. As shown in FIG. 2, the training method includes S201 to S203.

Optionally, acquiring the historical translation text corresponding to each historical speech recognition text includes: for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text, and translating the smoothed text to obtain the historical translation text corresponding to the historical speech recognition text; or

for each historical speech recognition text, acquiring an annotated text corresponding to the historical speech data, where the historical speech data is the speech data corresponding to the historical speech recognition text, performing spoken language smoothing processing on the annotated text, and translating the smoothed annotated text to obtain the historical translation text corresponding to the historical speech recognition text.

Optionally, the spoken language smoothing processing includes at least one of deduplication, stop-word filtering, and correction.

Optionally, before the step of performing model training by taking the plurality of historical speech recognition texts, the punctuation perturbation texts, and the sample texts in general training samples as the input of a model, and taking each historical translation text and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, to obtain the translation model, the translation model training method further includes: determining whether noise data exists in each historical speech recognition text. In this case, the step of performing model training to obtain the translation model includes: performing model training by taking the plurality of historical speech recognition texts, an identifier indicating whether noise data exists in each historical speech recognition text, the punctuation perturbation texts, and the sample texts in the general training samples as the input of the model, and taking each historical translation text and the sample translation texts in the general training samples that correspond to the sample texts as the target output of the model, to obtain the translation model.

Optionally, determining whether noise data exists in each historical speech recognition text includes: for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text, calculating a first similarity between the smoothed text and the original historical speech recognition text, and, if the first similarity is greater than a first preset similarity threshold, determining that noise data exists in the historical speech recognition text; or

for each historical speech recognition text, acquiring an annotated text corresponding to the historical speech data, where the historical speech data is the speech data corresponding to the historical speech recognition text, calculating a second similarity between the annotated text and the historical speech recognition text, and, if the second similarity is greater than a second preset similarity threshold, determining that noise data exists in the historical speech recognition text.

Optionally, before the step of performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, the training method of the translation model further includes: removing historical translation texts with poor translation quality from all the historical translation texts, and removing, from the plurality of historical speech recognition texts, the historical speech recognition texts corresponding to the historical translation texts with poor translation quality; the performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts comprises: performing punctuation perturbation processing on part or all of the historical speech recognition texts remaining after the removal to obtain punctuation perturbation texts; and the performing model training to obtain the translation model comprises: performing model training by using the historical recognition texts remaining after the removal, the punctuation perturbation texts, and the sample texts in the general training sample as inputs of the model, and using the historical translation texts remaining after the removal and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

With regard to the model training method in the above embodiment, the specific manner in which each step performs the operation has been described in detail in the embodiment related to the speech translation method, and will not be elaborated here.

Based on the same inventive concept, the disclosure also provides a speech translation apparatus. As shown in fig. 5, the speech translation apparatus 500 may include: a first obtaining module 501, configured to obtain target speech data to be translated; a recognition module 502, configured to perform speech recognition on the target speech data obtained by the first obtaining module 501 to obtain a target speech recognition text; and a translation module 503, configured to input the target speech recognition text obtained by the recognition module 502 into a pre-trained translation model to obtain a target translation text, where the translation model is obtained by training according to a punctuation perturbation text, the punctuation perturbation text is obtained by performing punctuation perturbation processing on a historical speech recognition text, and the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition.

In the technical scheme, firstly, voice recognition is carried out on target voice data to be translated to obtain a target voice recognition text; and then, inputting the target voice recognition text into a pre-trained translation model to obtain a target translation text, wherein the translation model is obtained by training according to a punctuation disturbance text, and the punctuation disturbance text is obtained by performing punctuation disturbance processing on the historical voice recognition text. Therefore, the robustness of the translation model to the punctuation can be improved, punctuation errors of the target translation text can be effectively avoided, the accuracy of voice translation is improved, a user can quickly and accurately understand the target translation text, and the user can conveniently communicate. In addition, the target speech recognition text is directly input into the translation model, so that punctuation errors of the target translation text can be effectively avoided, punctuation correction of the target speech recognition text before input is not needed, the time consumption of speech translation is short, the real-time performance of speech translation is good, and the method is suitable for simultaneous interpretation scenes.

Optionally, the translation model is obtained by training through a translation model training apparatus. As shown in fig. 6, the translation model training apparatus 600 includes: a second obtaining module 601, configured to obtain a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; a perturbation processing module 602, configured to perform the punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts; and a training module 603, configured to perform model training by using the plurality of historical recognition texts obtained by the second obtaining module 601, the punctuation perturbation texts obtained by the perturbation processing module 602, and sample texts in a general training sample as inputs of a model, and using each of the historical translation texts obtained by the second obtaining module 601 and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

It should be noted that the translation model training apparatus 600 may be provided independently of the speech translation apparatus 500, or may be integrated into the speech translation apparatus 500, and is not particularly limited in this disclosure.

Optionally, the second obtaining module 601 includes: a first processing submodule, configured to perform spoken language smoothing processing on each historical speech recognition text; and a first translation submodule, configured to translate the historical speech recognition text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text; or a first obtaining submodule, configured to obtain, for each historical speech recognition text, a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; a second processing submodule, configured to perform spoken language smoothing processing on the labeled text; and a second translation submodule, configured to translate the labeled text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text.

Optionally, the translation model training apparatus 600 further includes: a determining module, configured to determine whether noise data exists in each of the historical speech recognition texts before the training module 603 performs model training by using the plurality of historical recognition texts, the punctuation perturbation texts, and sample texts in a general training sample as inputs of a model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model; and the training module 603 is configured to perform model training by using the plurality of historical recognition texts, an identifier indicating whether noise data exists in each historical recognition text, the punctuation perturbation texts, and the sample texts in the general training sample as inputs of the model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

Optionally, the determining module includes: a third processing submodule, configured to perform, for each historical speech recognition text, spoken language smoothing processing on the historical speech recognition text; a first calculation submodule, configured to calculate a first similarity between the historical speech recognition text obtained after the spoken language smoothing processing and the original historical speech recognition text; and a first determining submodule, configured to determine that noise data exists in the historical speech recognition text if the first similarity is larger than a first preset similarity threshold; or a second obtaining submodule, configured to obtain, for each historical speech recognition text, a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; a second calculation submodule, configured to calculate a second similarity between the labeled text and the historical speech recognition text; and a second determining submodule, configured to determine that noise data exists in the historical speech recognition text if the second similarity is larger than a second preset similarity threshold.

Optionally, the translation model training apparatus 600 further includes: a removing module, configured to, before the perturbation processing module 602 performs the punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, remove historical translation texts with poor translation quality from all the historical translation texts, and remove, from the plurality of historical speech recognition texts, the historical speech recognition texts corresponding to the historical translation texts with poor translation quality; the perturbation processing module 602 is configured to perform the punctuation perturbation processing on part or all of the historical speech recognition texts remaining after the removal to obtain punctuation perturbation texts; and the training module 603 is configured to perform model training by using the historical recognition texts remaining after the removal, the punctuation perturbation texts, and the sample texts in the general training sample as inputs of the model, and using the historical translation texts remaining after the removal and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

The present disclosure also provides a translation model training apparatus, as shown in fig. 6, the translation model training apparatus 600 includes: a second obtaining module 601, configured to obtain a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; a perturbation processing module 602, configured to perform punctuation perturbation processing on part or all of the historical speech recognition texts acquired by the second acquisition module 601 to obtain punctuation perturbation texts, where the punctuation perturbation processing includes at least one of deleting punctuation marks, modifying punctuation marks, and adding punctuation marks; a training module 603, configured to perform model training by using the multiple history recognition texts acquired by the second acquisition module 601, the punctuation perturbation text acquired by the perturbation processing module 602, and a sample text in a general training sample as inputs of a model, and using each of the history translation texts acquired by the second acquisition module 601 and a sample translation text in the general training sample, which corresponds to the sample text, as a target of the model, so as to obtain the translation model.

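Optionally, the second obtaining module 601 includes: a first processing submodule, configured to perform, for each historical speech recognition text, spoken language smoothing processing on the historical speech recognition text; and a first translation submodule, configured to translate the historical speech recognition text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text; or a first obtaining submodule, configured to obtain, for each historical speech recognition text, a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; a second processing submodule, configured to perform spoken language smoothing processing on the labeled text; and a second translation submodule, configured to translate the labeled text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text.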

Optionally, the translation model training apparatus 600 further includes: a determining module, configured to determine whether noise data exists in each of the historical speech recognition texts before the training module 603 performs model training by using the plurality of historical recognition texts, the punctuation perturbation texts, and sample texts in a general training sample as inputs of a model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model; and the training module 603 is configured to perform model training by using the plurality of historical recognition texts, an identifier indicating whether noise data exists in each historical recognition text, the punctuation perturbation texts, and the sample texts in the general training sample as inputs of the model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

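Optionally, the determining module includes: a third processing submodule, configured to perform, for each historical speech recognition text, spoken language smoothing processing on the historical speech recognition text; a first calculation submodule, configured to calculate a first similarity between the historical speech recognition text obtained after the spoken language smoothing processing and the original historical speech recognition text; and a first determining submodule, configured to determine that noise data exists in the historical speech recognition text if the first similarity is larger than a first preset similarity threshold; or a second obtaining submodule, configured to obtain, for each historical speech recognition text, a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; a second calculation submodule, configured to calculate a second similarity between the labeled text and the historical speech recognition text; and a second determining submodule, configured to determine that noise data exists in the historical speech recognition text if the second similarity is larger than a second preset similarity threshold.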

Referring now to fig. 7, a schematic diagram of an electronic device (e.g., a terminal device or server) 700 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (e.g., a car navigation terminal), and stationary terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Generally, the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various devices, it is to be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target speech data to be translated; perform speech recognition on the target speech data to obtain a target speech recognition text; and input the target speech recognition text into a pre-trained translation model to obtain a target translation text, wherein the translation model is obtained by training according to a punctuation perturbation text, the punctuation perturbation text is obtained by performing punctuation perturbation processing on a historical speech recognition text, and the punctuation perturbation processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition.

Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; perform punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, wherein the punctuation perturbation processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition; and perform model training by using the plurality of historical recognition texts, the punctuation perturbation texts, and sample texts in a general training sample as inputs of a model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, the first obtaining module may also be described as a "module that obtains target speech data to be translated".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Example 1 provides a speech translation method according to one or more embodiments of the present disclosure, including: acquiring target voice data to be translated; performing voice recognition on the target voice data to obtain a target voice recognition text; inputting the target voice recognition text into a pre-trained translation model to obtain a target translation text, wherein the translation model is obtained by training according to a punctuation perturbation text, the punctuation perturbation text is obtained by performing punctuation perturbation processing on a historical voice recognition text, and the punctuation perturbation processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition.
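As an illustration of the punctuation perturbation processing named in Example 1, the following is a minimal sketch and not the implementation of the disclosure. It assumes space-separated tokens, a small punctuation inventory, and a single perturbation probability `rate`; the function name `perturb_punctuation` and all of these parameters are hypothetical.

```python
import random

# Assumed punctuation inventory; the disclosure does not enumerate one.
PUNCTUATION = (",", ".", "?", "!")


def perturb_punctuation(text: str, rng: random.Random, rate: float = 0.3) -> str:
    """Randomly delete, modify, or add punctuation in a space-tokenized text."""
    out = []
    for token in text.split():
        if token in PUNCTUATION:
            if rng.random() < rate:
                if rng.random() < 0.5:
                    continue  # deletion: drop the mark entirely
                # modification: replace the mark with a different one
                token = rng.choice([p for p in PUNCTUATION if p != token])
            out.append(token)
        else:
            out.append(token)
            if rng.random() < rate / 3:
                # addition: insert a spurious mark after a word
                out.append(rng.choice(PUNCTUATION))
    return " ".join(out)


if __name__ == "__main__":
    # A seeded generator keeps the perturbation reproducible.
    print(perturb_punctuation("hello , how are you ?", random.Random(0)))
```

Only punctuation tokens are touched, so the words and their order are preserved, which is what lets a perturbed text keep the translation of the text it was derived from.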

Example 2 provides the method of example 1, the translation model being trained in the following manner, in accordance with one or more embodiments of the present disclosure: acquiring a plurality of historical voice recognition texts and a historical translation text corresponding to each historical voice recognition text; performing punctuation disturbance processing on part or all of the historical speech recognition texts in the plurality of historical speech recognition texts to obtain punctuation disturbance texts; and performing model training by taking the plurality of historical recognition texts, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model and taking the sample translation texts corresponding to the sample texts in each of the historical translation texts and the universal training sample as the target output of the model to obtain the translation model.
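The way the three kinds of input in Example 2 might be combined into one training set can be sketched as follows. This is a sketch under stated assumptions: it assumes that each punctuation perturbation text is paired with the historical translation text of the recognition text it was derived from, which the text above implies but does not state outright. The helper `build_training_pairs` and the `perturb` callable are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (model input, target output)


def build_training_pairs(
    asr_texts: List[str],
    translations: List[str],
    perturb: Callable[[str], str],
    general_pairs: Iterable[Pair],
) -> List[Pair]:
    """Combine historical, perturbed, and general samples into (input, target) pairs."""
    if len(asr_texts) != len(translations):
        raise ValueError("each historical recognition text needs one translation")
    # Historical recognition texts with their historical translation texts.
    pairs = list(zip(asr_texts, translations))
    # Assumption: a perturbed text keeps the translation of its source text.
    pairs += [(perturb(src), tgt) for src, tgt in zip(asr_texts, translations)]
    # Sample texts and sample translation texts from the general training sample.
    pairs += list(general_pairs)
    return pairs


if __name__ == "__main__":
    pairs = build_training_pairs(
        ["how are you ?"],
        ["comment allez-vous ?"],
        perturb=lambda s: s.replace(" ?", ""),  # toy perturbation: drop "?"
        general_pairs=[("good morning", "bonjour")],
    )
    for source, target in pairs:
        print(repr(source), "->", repr(target))
```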

Example 3 provides the method of example 2, the obtaining of the historical translation text corresponding to each of the historical speech recognition texts including: for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; and translating the historical speech recognition text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text; or, for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; performing spoken language smoothing processing on the labeled text; and translating the labeled text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text.

Example 4 provides the method of example 3, the spoken word smoothing process including at least one of deduplication, filtering stop words, and correcting, in accordance with one or more embodiments of the present disclosure.
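Two of the smoothing operations listed in Example 4, deduplication and stop-word filtering, can be sketched as below. The sketch assumes space-separated tokens, treats deduplication as collapsing immediately repeated words, and uses a small hypothetical stop-word list; the correction step is omitted because the text does not say what it involves.

```python
# Hypothetical stop-word list; the disclosure does not specify one.
STOP_WORDS = frozenset({"um", "uh", "er"})


def smooth_spoken_text(text: str, stop_words: frozenset = STOP_WORDS) -> str:
    """Filter stop words and collapse immediately repeated words."""
    tokens = []
    for token in text.split():
        if token.lower() in stop_words:
            continue  # filter stop words
        if tokens and tokens[-1].lower() == token.lower():
            continue  # deduplicate consecutive repetitions
        tokens.append(token)
    return " ".join(tokens)


if __name__ == "__main__":
    print(smooth_spoken_text("um I I think uh we we should go"))
    # prints "I think we should go"
```

Stop words are filtered before the repetition check, so a repetition interrupted by a filler ("I um I") also collapses to a single word.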

Example 5 provides the method of example 2, where before the step of performing model training by using the plurality of historical recognition texts, the punctuation perturbation texts, and sample texts in a general training sample as inputs of a model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model, the training method of the translation model further includes: determining whether noise data is present in each of the historical speech recognition texts; and the performing model training to obtain the translation model comprises: performing model training by using the plurality of historical recognition texts, an identifier indicating whether noise data exists in each historical recognition text, the punctuation perturbation texts, and the sample texts in the general training sample as inputs of the model, and using each of the historical translation texts and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.
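Example 5 states only that an identifier indicating whether noise data exists is supplied as a model input; it does not say how the identifier is encoded. One common option, shown here purely as an assumption, is to prepend a reserved tag token to the source text. The tag strings and the function name are hypothetical.

```python
# Hypothetical reserved tokens; the encoding of the identifier is an assumption.
NOISY_TAG = "<noisy>"
CLEAN_TAG = "<clean>"


def tag_with_noise_flag(text: str, has_noise: bool) -> str:
    """Prepend a tag token so the model can condition on the noise identifier."""
    tag = NOISY_TAG if has_noise else CLEAN_TAG
    return f"{tag} {text}"


if __name__ == "__main__":
    print(tag_with_noise_flag("um how are are you", has_noise=True))
    # prints "<noisy> um how are are you"
```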

Example 6 provides the method of example 5, wherein determining whether noise data is present in each of the historical speech recognition texts includes: for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; calculating a first similarity between the historical speech recognition text obtained after the spoken language smoothing processing and the original historical speech recognition text; and if the first similarity is larger than a first preset similarity threshold, determining that noise data exists in the historical speech recognition text; or, for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; calculating a second similarity between the labeled text and the historical speech recognition text; and if the second similarity is larger than a second preset similarity threshold, determining that noise data exists in the historical speech recognition text.

In accordance with one or more embodiments of the present disclosure, example 7 provides the method of example 2, where before the step of performing the punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, the training method of the translation model further includes: removing historical translation texts with poor translation quality from all the historical translation texts, and removing, from the plurality of historical speech recognition texts, the historical speech recognition texts corresponding to the historical translation texts with poor translation quality; the performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts comprises: performing punctuation perturbation processing on part or all of the historical speech recognition texts remaining after the removal to obtain punctuation perturbation texts; and the performing model training to obtain the translation model comprises: performing model training by using the historical recognition texts remaining after the removal, the punctuation perturbation texts, and the sample texts in the general training sample as inputs of the model, and using the historical translation texts remaining after the removal and the sample translation texts in the general training sample corresponding to the sample texts as target outputs of the model, to obtain the translation model.

In accordance with one or more embodiments of the present disclosure, example 8 provides a translation model training method, including: acquiring a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, wherein the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification and punctuation addition; and performing model training by taking the plurality of historical speech recognition texts, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.
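The punctuation perturbation step of example 8 can be illustrated with a minimal Python sketch. The punctuation inventory, the probabilities, the whitespace tokenization, and the names `perturb_punctuation` and `make_perturbed_pairs` are all assumptions made for illustration; the disclosure only requires at least one of punctuation deletion, modification, and addition.

```python
import random

# Assumed punctuation inventory; the disclosure does not fix a particular set.
PUNCTUATION = [",", ".", "?", "!"]

def perturb_punctuation(text, rng, p_delete=0.2, p_modify=0.2, p_add=0.1):
    """Randomly delete, modify, or add token-final punctuation marks."""
    out = []
    for token in text.split():
        word, mark = token, ""
        if token[-1] in PUNCTUATION:
            word, mark = token[:-1], token[-1]
        if mark:
            r = rng.random()
            if r < p_delete:
                mark = ""  # punctuation deletion
            elif r < p_delete + p_modify:
                # punctuation modification: swap in a different mark
                mark = rng.choice([p for p in PUNCTUATION if p != mark])
        elif rng.random() < p_add:
            mark = rng.choice(PUNCTUATION)  # punctuation addition
        out.append(word + mark)
    return " ".join(out)

def make_perturbed_pairs(pairs, rng):
    """Pair each perturbed recognition text with its unchanged historical translation."""
    return [(perturb_punctuation(src, rng), tgt) for src, tgt in pairs]
```

In this sketch the perturbed source keeps the original historical translation text as its target, so the model sees several punctuation variants of one sentence mapped to the same output.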

Example 9 provides the method of example 8, wherein obtaining the historical translation text corresponding to each of the historical speech recognition texts includes: for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; and translating the historical speech recognition text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text; or, for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; performing spoken language smoothing processing on the labeled text; and translating the labeled text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text.

In accordance with one or more embodiments of the present disclosure, example 10 provides the method of example 9, wherein the spoken language smoothing processing includes at least one of deduplication, filtering stop words, and correction.
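Two of the smoothing operations named in example 10, deduplication and filtering stop words, can be sketched as follows. The stop-word list, the whitespace tokenization, and the function name `smooth_spoken_text` are hypothetical choices for illustration; correction is not sketched because the disclosure does not describe how it is performed.

```python
# Hypothetical stop-word list for spoken English; a real system would use its own.
STOP_WORDS = {"um", "uh", "er"}
TRAILING_MARKS = ",.?!"

def smooth_spoken_text(text, stop_words=STOP_WORDS):
    """Drop stop words and immediately repeated words from a recognition text."""
    out = []
    for token in text.split():
        key = token.lower().strip(TRAILING_MARKS)
        if key in stop_words:
            continue  # filtering stop words
        if out and out[-1].lower().strip(TRAILING_MARKS) == key:
            continue  # deduplication of an immediately repeated word
        out.append(token)
    return " ".join(out)
```

For instance, `smooth_spoken_text("um I I think uh we we should go")` yields `"I think we should go"`, which would then be translated to obtain the historical translation text.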

Example 11 provides the method of example 8, wherein before the step of performing model training by taking the plurality of historical speech recognition texts, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model, the method further includes: determining whether noise data is present in each of the historical speech recognition texts; and the step of performing model training to obtain the translation model includes: performing model training by taking the plurality of historical speech recognition texts, an identifier indicating whether noise data is present in each historical speech recognition text, the punctuation perturbation texts and the sample texts in the universal training sample as the input of the model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.
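One possible way to supply the noise identifier of example 11 as model input is to prepend a tag token to each historical speech recognition text. The disclosure does not specify the form of the identifier, so the tag strings, the function names, and the choice to leave perturbation texts and universal samples untagged are assumptions of this sketch.

```python
# Hypothetical tag tokens; the disclosure does not specify how the identifier is encoded.
NOISY_TAG = "<noisy>"
CLEAN_TAG = "<clean>"

def tag_source(text, has_noise):
    """Prefix a recognition text with an identifier of whether it contains noise data."""
    return f"{NOISY_TAG if has_noise else CLEAN_TAG} {text}"

def build_training_pairs(historical, perturbed, universal):
    """Assemble (model input, target output) pairs from the three data sources.

    historical: list of (recognition_text, translation_text, has_noise)
    perturbed:  list of (punctuation_perturbation_text, translation_text)
    universal:  list of (sample_text, sample_translation_text)
    """
    pairs = [(tag_source(src, noisy), tgt) for src, tgt, noisy in historical]
    pairs.extend(perturbed)   # assumption: left untagged in this sketch
    pairs.extend(universal)   # assumption: left untagged in this sketch
    return pairs
```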

Example 12 provides the method of example 11, wherein determining whether noise data is present in each of the historical speech recognition texts includes: for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; calculating a first similarity between the historical speech recognition text obtained after the spoken language smoothing processing and the original historical speech recognition text; and if the first similarity is larger than a first preset similarity threshold, determining that noise data is present in the historical speech recognition text; or, for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; calculating a second similarity between the labeled text and the historical speech recognition text; and if the second similarity is larger than a second preset similarity threshold, determining that noise data is present in the historical speech recognition text.

Example 13 provides the method of example 8, wherein before the step of performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, the method further includes: removing historical translation texts with poor translation quality from all the historical translation texts, and removing the historical speech recognition texts corresponding to the historical translation texts with poor translation quality from the plurality of historical speech recognition texts; the step of performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts includes: performing punctuation perturbation processing on part or all of the historical speech recognition texts remaining after the removal to obtain punctuation perturbation texts; and the step of performing model training to obtain the translation model includes: performing model training by taking the historical speech recognition texts remaining after the removal, the punctuation perturbation texts and the sample texts in the universal training sample as the input of the model, and taking the historical translation texts remaining after the removal and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.
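The removal step of example 13 can be sketched as a filter over (recognition text, translation text) pairs. The disclosure does not say how translation quality is judged, so the scoring function `length_ratio_quality`, the threshold, and the function names below are purely hypothetical stand-ins.

```python
def remove_poor_pairs(pairs, quality_fn, min_quality):
    """Keep only (recognition_text, translation_text) pairs scoring at least min_quality."""
    return [(src, tgt) for src, tgt in pairs if quality_fn(src, tgt) >= min_quality]

def length_ratio_quality(src, tgt):
    """Toy stand-in score in [0, 1]: ratio of the shorter to the longer token count."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)
```

Because each pair is dropped as a unit, removing a poor historical translation text also removes its corresponding historical speech recognition text, and punctuation perturbation is then applied only to what remains.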

Example 14 provides, in accordance with one or more embodiments of the present disclosure, a speech translation apparatus comprising: the first acquisition module is used for acquiring target voice data to be translated; the recognition module is used for carrying out voice recognition on the target voice data acquired by the first acquisition module to obtain a target voice recognition text; and the translation module is used for inputting the target voice recognition text obtained by the recognition module into a pre-trained translation model to obtain a target translation text, wherein the translation model is obtained by training according to a punctuation disturbance text, the punctuation disturbance text is obtained by performing punctuation disturbance processing on a historical voice recognition text, and the punctuation disturbance processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition.
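The module chain of example 14 (acquire, recognize, translate) can be sketched as a simple function composition. The callables `recognize` and `translate` are hypothetical placeholders for a speech recognition system and the pre-trained translation model; no particular library is implied.

```python
from typing import Callable

def translate_speech(
    audio: bytes,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
) -> str:
    """Recognize target speech data, then translate the recognition text directly."""
    recognition_text = recognize(audio)
    # The recognition text is passed to the translation model as is,
    # without a separate punctuation correction step in between.
    return translate(recognition_text)
```

A caller would supply its own recognizer and model, for example `translate_speech(audio, my_asr, my_model)`, where both names are placeholders.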

Example 15 provides, in accordance with one or more embodiments of the present disclosure, a translation model training apparatus, comprising: a second acquisition module, configured to acquire a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text; a perturbation processing module, configured to perform punctuation perturbation processing on part or all of the historical speech recognition texts acquired by the second acquisition module to obtain punctuation perturbation texts, wherein the punctuation perturbation processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition; and a training module, configured to perform model training by taking the plurality of historical speech recognition texts acquired by the second acquisition module, the punctuation perturbation texts obtained by the perturbation processing module and sample texts in a universal training sample as the input of a model, and taking each historical translation text acquired by the second acquisition module and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.

Example 16 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-13, in accordance with one or more embodiments of the present disclosure.

Example 17 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to carry out the steps of the method of any of examples 1-13.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the scope also covers technical solutions formed by interchanging the above features with technical features having similar functions that are disclosed in (but not limited to) the present disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims

1. A method of speech translation, comprising:

acquiring target voice data to be translated;

performing voice recognition on the target voice data to obtain a target voice recognition text;

inputting the target voice recognition text into a pre-trained translation model to obtain a target translation text, wherein the translation model is obtained by training according to a punctuation disturbance text, the punctuation disturbance text is obtained by performing punctuation disturbance processing on a historical voice recognition text, and the punctuation disturbance processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition;

wherein the translation model is obtained by training in the following way:

acquiring a plurality of historical voice recognition texts and a historical translation text corresponding to each historical voice recognition text;

performing punctuation disturbance processing on part or all of the historical speech recognition texts in the plurality of historical speech recognition texts to obtain punctuation disturbance texts;

determining whether noise data is present in each of the historical speech recognition texts;

and performing model training by taking the plurality of historical speech recognition texts, an identifier indicating whether noise data is present in each historical speech recognition text, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.

2. The method of claim 1, wherein obtaining historical translation text corresponding to each of the historical speech recognition texts comprises:

for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; and translating the historical speech recognition text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text; or

for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; performing spoken language smoothing processing on the labeled text; and translating the labeled text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text.

3. The method of claim 2, wherein the spoken language smoothing processing comprises at least one of deduplication, filtering stop words, and correction.

4. The method of claim 1, wherein the determining whether noise data is present in each of the historical speech recognition texts comprises:

for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; calculating a first similarity between the historical speech recognition text obtained after the spoken language smoothing processing and the original historical speech recognition text; and if the first similarity is larger than a first preset similarity threshold, determining that noise data is present in the historical speech recognition text; or

for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; calculating a second similarity between the labeled text and the historical speech recognition text; and if the second similarity is larger than a second preset similarity threshold, determining that noise data is present in the historical speech recognition text.

5. The method according to claim 1, wherein before the step of performing the punctuation perturbation processing on part or all of the historical speech recognition texts in the plurality of historical speech recognition texts to obtain punctuation perturbation texts, the training mode of the translation model further comprises:

removing historical translation texts with poor translation quality from all the historical translation texts, and removing historical speech recognition texts corresponding to the historical translation texts with poor translation quality from the plurality of historical speech recognition texts;

the performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts comprises:

performing punctuation perturbation processing on part or all of the historical speech recognition texts remaining after the removal to obtain punctuation perturbation texts;

the performing model training by taking the plurality of historical speech recognition texts, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model comprises:

performing model training by taking the historical speech recognition texts remaining after the removal, the punctuation perturbation texts and the sample texts in the universal training sample as the input of the model, and taking the historical translation texts remaining after the removal and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.

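6. A translation model training method, comprising:

acquiring a plurality of historical speech recognition texts and a historical translation text corresponding to each historical speech recognition text;

performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, wherein the punctuation perturbation processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition;

determining whether noise data is present in each of the historical speech recognition texts;

and performing model training by taking the plurality of historical speech recognition texts, an identifier indicating whether noise data is present in each historical speech recognition text, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.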

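7. The method of claim 6, wherein obtaining historical translation text corresponding to each of the historical speech recognition texts comprises:

for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; and translating the historical speech recognition text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text; or

for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; performing spoken language smoothing processing on the labeled text; and translating the labeled text obtained after the spoken language smoothing processing to obtain the historical translation text corresponding to the historical speech recognition text.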

8. The method of claim 7, wherein the spoken language smoothing processing comprises at least one of deduplication, filtering stop words, and correction.

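9. The method of claim 6, wherein the determining whether noise data is present in each of the historical speech recognition texts comprises:

for each historical speech recognition text, performing spoken language smoothing processing on the historical speech recognition text; calculating a first similarity between the historical speech recognition text obtained after the spoken language smoothing processing and the original historical speech recognition text; and if the first similarity is larger than a first preset similarity threshold, determining that noise data is present in the historical speech recognition text; or

for each historical speech recognition text, acquiring a labeled text corresponding to historical speech data, wherein the historical speech data is the speech data corresponding to the historical speech recognition text; calculating a second similarity between the labeled text and the historical speech recognition text; and if the second similarity is larger than a second preset similarity threshold, determining that noise data is present in the historical speech recognition text.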

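10. The method according to claim 6, wherein before the step of performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts, the method further comprises:

removing historical translation texts with poor translation quality from all the historical translation texts, and removing the historical speech recognition texts corresponding to the historical translation texts with poor translation quality from the plurality of historical speech recognition texts;

the performing punctuation perturbation processing on part or all of the plurality of historical speech recognition texts to obtain punctuation perturbation texts comprises:

performing punctuation perturbation processing on part or all of the historical speech recognition texts remaining after the removal to obtain punctuation perturbation texts;

the performing model training by taking the plurality of historical speech recognition texts, the punctuation perturbation texts and sample texts in a universal training sample as the input of a model, and taking each historical translation text and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model comprises:

performing model training by taking the historical speech recognition texts remaining after the removal, the punctuation perturbation texts and the sample texts in the universal training sample as the input of the model, and taking the historical translation texts remaining after the removal and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.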

11. A speech translation apparatus, comprising:

the first acquisition module is used for acquiring target voice data to be translated;

the recognition module is used for carrying out voice recognition on the target voice data acquired by the first acquisition module to obtain a target voice recognition text;

the translation module is used for inputting the target voice recognition text obtained by the recognition module into a pre-trained translation model to obtain a target translation text, wherein the translation model is obtained by training according to a punctuation disturbance text, the punctuation disturbance text is obtained by performing punctuation disturbance processing on a historical voice recognition text, and the punctuation disturbance processing comprises at least one of punctuation deletion, punctuation modification and punctuation addition;

the translation model is obtained through training of a translation model training device, and the translation model training device comprises:

the second acquisition module is used for acquiring a plurality of historical voice recognition texts and a historical translation text corresponding to each historical voice recognition text;

the disturbance processing module is used for carrying out punctuation disturbance processing on part or all of the historical voice recognition texts in the plurality of historical voice recognition texts to obtain punctuation disturbance texts;

a determining module for determining whether noise data exists in each of the historical speech recognition texts;

and the training module is used for performing model training by taking the plurality of historical speech recognition texts acquired by the second acquisition module, an identifier indicating whether noise data is present in each historical speech recognition text, the punctuation perturbation texts obtained by the disturbance processing module and sample texts in a universal training sample as the input of a model, and taking each historical translation text acquired by the second acquisition module and the sample translation texts corresponding to the sample texts in the universal training sample as the target output of the model, to obtain the translation model.

12. A translation model training apparatus, comprising:

the second acquisition module is used for acquiring a plurality of historical voice recognition texts and historical translation texts corresponding to each historical voice recognition text;

a disturbance processing module, configured to perform punctuation disturbance processing on part or all of the historical speech recognition texts acquired by the second acquisition module to obtain punctuation disturbance texts, where the punctuation disturbance processing includes at least one of punctuation deletion, punctuation modification, and punctuation addition;

and the training module is used for performing model training by taking the plurality of historical speech recognition texts acquired by the second acquisition module, the identifier used for representing whether noise data exists in each historical speech recognition text, the punctuation perturbation text acquired by the perturbation processing module and a sample text in a universal training sample as input of a model, and taking each historical translation text acquired by the second acquisition module and a sample translation text corresponding to the sample text in the universal training sample as target output of the model to acquire the translation model.

13. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1-10.

14. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 10.