WO2020087655A1

WO2020087655A1 - Translation method, apparatus and device, and readable storage medium

Info

Publication number: WO2020087655A1
Application number: PCT/CN2018/119329
Authority: WO
Inventors: 孔常青; 高建清; 刘俊华; 胡国平
Original assignee: 科大讯飞股份有限公司
Priority date: 2018-10-30
Filing date: 2018-12-05
Publication date: 2020-05-07
Also published as: CN109408833A

Abstract

Disclosed are a translation method, apparatus and device, and a readable storage medium. The method comprises: obtaining a source language text to be translated; performing sentence segmentation on the source language text according to the current translation scene, so that the segmented source language text better conforms to the current translation scene; and translating the segmented source language text. Compared with existing translation methods, the present application adds a sentence segmentation optimization step for the obtained source language text: the text is re-segmented with the current translation scene taken into account, so that its sentence segmentation is better optimized. The segmented source language text is then translated, so that the resulting target language text is of higher quality.

Description

Translation method, device, equipment and readable storage medium

Technical field

This application claims priority to Chinese patent application No. 201811276866.X, filed with the China Patent Office on October 30, 2018 and titled "A translation method, device, equipment and readable storage medium", the entire contents of which are incorporated by reference in this application.

Background

Text translation is the process of translating the source language text to be translated into target language text. The sentence breaking of the source language text to be translated is not standardized and depends on where the text comes from. For example, source language text obtained through speech recognition is mainly broken into sentences according to the pauses in the speech, which are often influenced by the speaker's habits.

In the prior art, machine translation is performed directly on source language text whose sentence segmentation has not been optimized, which greatly affects the quality of the machine translation.

Summary of the invention

In view of this, the present application provides a translation method, device, equipment, and readable storage medium, which are used to solve the problem that the existing source language text to be translated is not optimized for sentence segmentation, resulting in low quality of machine translation.

In order to achieve the above purpose, the proposed scheme is as follows:

A translation method, including:

Obtain the source language text to be translated;

Segment the source language text according to the current translation scenario to obtain the source language text after sentence segmentation;

Translate the source language text after the sentence segmentation to obtain the target language text.

Preferably, the sentence segmentation of the source language text according to the translation scenario to obtain the source language text after sentence segmentation includes:

Input the source language text into a preset text segmentation model to obtain the source language text after the segmentation output by the text segmentation model;

Wherein, the text segmentation model is obtained by training with the source language training text as the training data and with the sentence segmentation result of the source language training text that matches the current translation scenario as the training label.

Preferably, the process of determining the text segmentation model includes:

Get the source language training text;

Determining a sentence segmentation result of the source language training text that matches the current translation scenario, as a target sentence segmentation result;

Use the source language training text as training data and the target sentence segmentation result as a training label to train a text sentence segmentation model.

Preferably, the determination of the sentence segmentation result of the source language training text that matches the current translation scenario, as the target sentence segmentation result, includes:

Acquiring the translated target language training text of the source language training text in the current translation scenario;

With reference to the set sentence-breaking modification method, the sentence breaking method of the source language training text is changed to obtain the changed source language training text, and the changed source language training text and the source language training text together constitute the candidate source language training texts;

Using a preset machine translation model to translate each candidate source language training text to obtain a machine translation result of each candidate source language training text;

Determining the similarity between the machine translation result of each candidate source language training text and the target language training text, and using the candidate source language training text with the highest similarity as the target sentence segmentation result.

Preferably, changing the sentence breaking method of the source language training text with reference to the set sentence-breaking modification method to obtain the changed source language training text includes:

Determine the non-terminating punctuation included in the source language training text;

Each non-terminating punctuation included in the source language training text is replaced with terminating punctuation to obtain the modified source language training text.

Preferably, the use of a preset machine translation model to translate each candidate source language training text to obtain a machine translation result of each candidate source language training text includes:

Divide each candidate source language training text into clauses according to the termination punctuation it contains to obtain a divided clause sequence;

Use a preset machine translation model to separately translate each clause in the clause sequence of the candidate source language training text to obtain a machine translation result for each clause;

According to the order of the clauses in the clause sequence, the machine translation results of the clauses are combined to obtain the machine translation results of the candidate source language training text.

Preferably, before the text segmentation model is trained using the source language training text as training data and the target sentence segmentation result as a training label, the method further includes:

Obtaining the result of manually punctuating the source language training text, and obtaining the source language training text after manual labeling;

Using the source language training text as training data and the artificially labeled source language training text as training labels to train a text segmentation model to obtain a preliminary text segmentation model;

Then, training the text segmentation model using the source language training text as training data and the target sentence segmentation result as a training label includes:

Using the source language training text as training data and the target sentence segmentation result as a training label, train the preliminary text sentence segmentation model.

Preferably, the translation of the source language text after the sentence segmentation to obtain the target language text includes:

Divide the source language text after the sentence segmentation into clauses according to the termination punctuation it contains to obtain the divided clause sequence;

Use a preset machine translation model to translate each clause in the clause sequence of the source language text after the sentence segmentation separately to obtain a machine translation result for each clause;

According to the order of the clauses in the clause sequence, the machine translation results of the clauses are combined to obtain the target language text.
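
The split, translate, and merge steps above can be sketched as follows. This is a minimal illustration, not the application's implementation: the set of terminating punctuation marks, the function names, and the `translate` callable standing in for the preset machine translation model are all assumptions.

```python
import re
from typing import Callable, List

# Assumed set of terminating punctuation (Chinese and English forms).
TERMINATORS = "。？！.?!"

# One clause = a run of non-terminators followed by its terminating mark(s).
_CLAUSE = re.compile(
    "[^" + re.escape(TERMINATORS) + "]+[" + re.escape(TERMINATORS) + "]*"
)


def split_clauses(text: str) -> List[str]:
    """Split text into clauses, keeping each terminating mark with its clause."""
    clauses = (m.group().strip() for m in _CLAUSE.finditer(text))
    return [c for c in clauses if c]


def translate_by_clause(
    text: str, translate: Callable[[str], str], joiner: str = " "
) -> str:
    """Translate each clause separately and merge the results in clause order."""
    results = [translate(clause) for clause in split_clauses(text)]
    return joiner.join(results)


if __name__ == "__main__":
    # str.upper is a placeholder for a real machine translation model.
    print(translate_by_clause("Hello there. How are you?", str.upper))
```

Splitting on every period is a simplification; a real system would need to handle cases such as decimal numbers and abbreviations.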

A translation device, including:

Source language text acquisition unit, used to obtain the source language text to be translated;

Text segmentation unit, used to segment the source language text according to the current translation scenario, to obtain the source language text after sentence segmentation;

The source language text translation unit is used to translate the source language text after the sentence segmentation to obtain the target language text.

Preferably, the text segmentation unit includes:

A model reference unit, used to input the source language text into a preset text segmentation model to obtain the source language text after the sentence segmentation output by the text segmentation model;

Wherein, the text segmentation model is obtained by training with the source language training text as training data and with the sentence segmentation result of the source language training text that matches the current translation scenario as the training label.

Preferably, the device further includes a text segmentation model determination unit, which is used to determine the text segmentation model; the text segmentation model determination unit includes:

Source language training text acquisition unit, used to obtain source language training text;

A sentence segmentation result determination unit, configured to determine a sentence segmentation result of the source language training text that matches the current translation scenario, as a target sentence segmentation result;

The first model training unit is used to train the text segmentation model by using the source language training text as training data and the target sentence segmentation result as a training label.

Preferably, the sentence segmentation result determination unit includes:

A target language training text obtaining unit, configured to obtain the translated target language training text of the source language training text in the current translation scenario;

The sentence-breaking modification unit is used to change the sentence breaking method of the source language training text with reference to the set sentence-breaking modification method to obtain the changed source language training text, the changed source language training text and the source language training text together constituting the candidate source language training texts;

A source language training text translation unit, configured to use a preset machine translation model to translate each candidate source language training text to obtain a machine translation result of each candidate source language training text;

The similarity determination unit is used to determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and to use the candidate source language training text with the highest similarity as the target sentence segmentation result.

Preferably, the sentence-breaking modification unit includes:

A non-terminating punctuation determining unit, configured to determine the non-terminating punctuation included in the source language training text;

A non-terminating punctuation replacement unit is used to replace each non-terminating punctuation included in the source language training text with a terminating punctuation to obtain a modified source language training text.

Preferably, the source language training text translation unit includes:

A first clause dividing unit, configured to divide each of the candidate source language training texts according to the terminating punctuation it contains to obtain the divided clause sequence;

The first clause translation unit is configured to use a preset machine translation model to separately translate each clause in the clause sequence of the candidate source language training text to obtain a machine translation result of each clause;

The first translation result merging unit is used to merge the machine translation results of each clause in the order of each clause in the clause sequence to obtain the machine translation result of the candidate source language training text.

Preferably, the text segmentation model determination unit further includes:

A manual labeling result obtaining unit, which is used to obtain a result of manually punctuating the source language training text to obtain the source language training text after manual labeling;

A second model training unit, configured to use the source language training text as training data and the artificially labeled source language training text as training labels to train a text segmentation model to obtain a preliminary text segmentation model;

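Then the first model training unit is specifically used for: training the preliminary text segmentation model using the source language training text as training data and the target sentence segmentation result as a training label.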

Preferably, the source language text translation unit includes:

A second clause dividing unit, configured to divide the source language text after the sentence segmentation into clauses according to the terminating punctuation it contains to obtain the divided clause sequence;

The second clause translation unit is used to translate each clause in the clause sequence of the source language text after the sentence segmentation by using a preset machine translation model to obtain a machine translation result of each clause;

The second translation result merging unit is used to merge the machine translation results of each clause in the order of each clause in the clause sequence to obtain the target language text.

A translation device, including memory and processor;

The memory is used to store programs;

The processor is configured to execute the program and implement the steps of the translation method as described above.

A readable storage medium on which a computer program is stored, which when executed by a processor, implements the steps of the translation method as described above.

As can be seen from the above technical solution, after the translation method provided in the embodiments of the present application obtains the source language text to be translated, it further segments the source language text according to the current translation scenario, so that the segmented source language text better fits the current translation scenario. Compared with existing translation methods, this application adds a sentence segmentation optimization step for the obtained source language text: the text is re-segmented with the current translation scenario taken into account, so that its sentence breaking is better optimized. The segmented source language text is then translated, and the resulting target language text is of higher quality.

Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention. Those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a flowchart of a translation method disclosed in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a translation device disclosed in an embodiment of the present application;

FIG. 3 is a block diagram of a hardware structure of a translation device disclosed in an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all the embodiments. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of the present application.

Text translation is the process of translating the source language text to be translated into target language text. Depending on the source of the source language text to be translated, its sentence breaking is not unique; source language text obtained through speech recognition is one example. In different translation scenarios, different ways of breaking the source language text into sentences affect the quality of the target language text translated from it. For example, the same source language text may have different translation results in different contexts. As another example, the translation results of the source language text may differ between translation occasions: in a meeting, the translation needs to be more rigorous and standardized, while on other, less formal occasions it may be more casual and colloquial.

In the prior art, the source language text to be translated is sent directly to the machine translation model, even though its sentence breaking is not standardized. For example, source language text obtained through speech recognition may be affected by the speaker's speaking habits. Because the sentence breaking is not optimized and the current translation scenario is not taken into account, the quality of the translation result is low. To this end, this application provides an optimized translation method, which can be applied to electronic devices with data processing capabilities.

Next, the translation method of this application is introduced with reference to FIG. 1. The method may include:

Step S100: Acquire the source language text to be translated.

Specifically, the source language text to be translated can be obtained in several ways, for example as text uploaded by the user or as text obtained by performing speech recognition on the user's voice data. Taking speech translation as an example, voice endpoint detection can be used to process the acquired real-time speech and obtain speech segments. Each speech segment is then recognized, and the recognized text is used as the source language text to be translated.

Here, the source language is the language used for the text to be translated. Correspondingly, the translated language is defined as the target language, and the purpose of this application is to translate the source language text to obtain the target language text.

Step S110: Perform sentence segmentation on the source language text according to the current translation scenario to obtain the source language text after sentence segmentation.

It is understandable that the sentence breaking (that is, the punctuation) of the source language text obtained in the previous step may be affected by the speaker's speaking habits: it is not standardized and does not take the current translation scenario into account. If the obtained source language text is translated directly, the quality of the translation result is low.

Therefore, this step adds sentence segmentation processing of the source language text. The processing takes the current translation scenario into account, so that the sentence breaking of the segmented source language text better fits the current translation scenario. The sentence segmentation process is described in detail below.

Step S120: Translate the source language text after the sentence segmentation to obtain the target language text.

Generally, you can use the machine translation model to translate the source language text after the sentence segmentation obtained in the previous step to obtain the translated target language text.

On this basis, the embodiments of the present application can also choose to synthesize the target language text into speech according to the needs of the user, and then perform speech broadcasting to realize the conversion process from the source language speech to the target language speech.

According to the translation method provided in the embodiment of the present application, after the source language text to be translated is obtained, it is further segmented according to the current translation scenario, so that the segmented source language text better fits the current translation scenario. Compared with existing translation methods, this application adds a sentence segmentation optimization step for the obtained source language text: the text is re-segmented with the current translation scenario taken into account, so that its sentence breaking is better optimized. The segmented source language text is then translated, and the resulting target language text is of higher quality.

Another embodiment of the present application describes the process in step S110 above of segmenting the source language text according to the current translation scenario to obtain the segmented source language text.

First, it can be understood that sentence breaking has certain characteristics in each translation scenario. Therefore, rules of the sentence-breaking mode corresponding to each translation scenario can be set in advance. For example, a meeting may call for short sentences wherever possible, that is, for as many terminating punctuation marks as possible. The rule for the meeting scenario can then be set so that more terminating punctuation marks are used than non-terminating ones.

Here, punctuation is divided into two types, terminating punctuation and non-terminating punctuation, according to whether it marks a fully expressed sentence. Terminating punctuation indicates that the meaning of the sentence has been completely expressed, such as a period, question mark, or exclamation mark. Non-terminating punctuation indicates that the meaning of the sentence has not yet been completely expressed, such as a comma.

Based on this, during the current translation, the preset correspondence can be queried to determine the rules of the sentence-breaking mode corresponding to the current translation scenario. After obtaining the source language text to be translated, the source language text is subjected to sentence segmentation processing according to the determined sentence segmentation mode rules to obtain the source language text after sentence segmentation. Obviously, the source language text after the sentence segmentation can meet the needs of the current translation scenario.
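
The lookup of a preset rule by translation scenario can be sketched as below. The scenario names, the rule functions, and the concrete "meeting" rule (turning every comma into a period) are hypothetical choices made only for illustration; the application does not specify the form of the rules.

```python
from typing import Callable, Dict

# Assumed punctuation sets for this sketch.
TERMINATING = "。？！.?!"
NON_TERMINATING = "，、,;；"


def shorten_sentences(text: str) -> str:
    """Hypothetical 'meeting' rule: prefer short sentences by ending at every comma."""
    return text.replace("，", "。").replace(",", ".")


def keep_as_is(text: str) -> str:
    """Hypothetical 'casual' rule: leave the sentence breaking unchanged."""
    return text


# Preset correspondence between translation scenario and sentence-breaking rule.
SCENARIO_RULES: Dict[str, Callable[[str], str]] = {
    "meeting": shorten_sentences,
    "casual": keep_as_is,
}


def segment_by_rule(text: str, scenario: str) -> str:
    """Query the preset correspondence and apply the rule for the scenario."""
    rule = SCENARIO_RULES.get(scenario, keep_as_is)
    return rule(text)


def count_marks(text: str, marks: str) -> int:
    """Count how many characters of `text` belong to `marks`."""
    return sum(text.count(mark) for mark in marks)


if __name__ == "__main__":
    out = segment_by_rule("先讨论预算，再讨论计划。", "meeting")
    print(out, count_marks(out, TERMINATING), count_marks(out, NON_TERMINATING))
```

After the hypothetical meeting rule is applied, the text contains more terminating marks than non-terminating ones, which is the property the meeting rule described above asks for.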

Further, the embodiments of the present application also provide another processing method for sentence segmentation of the source language text, that is, a process of sentence segmentation of the source language text can be performed using a machine learning model. The detailed process is as follows:

The machine learning model used for sentence segmentation in this embodiment is defined as a text segmentation model. It can use existing machine learning models of various structures, such as a BLSTM model or a Self-Attention model under the sequence labeling framework, or a sequence generation model under the encoder-decoder (Encode-Decode) framework. A combination of several existing model structures can also be used.

If a model under the sequence labeling framework is used, the input of the model is each word in the text sequence, and the output is the punctuation category corresponding to each word. The punctuation category can be null, comma, period, question mark, and so on, where the null value means that no punctuation is added after the word.
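
To make the sequence labeling output concrete, the sketch below rebuilds punctuated text from a word sequence and one predicted punctuation category per word. The function name is an assumption, and `None` is used here to represent the null category.

```python
from typing import List, Optional


def apply_punctuation_labels(
    words: List[str], labels: List[Optional[str]], sep: str = ""
) -> str:
    """Append each word's predicted punctuation mark; None means no mark is added."""
    if len(words) != len(labels):
        raise ValueError("each word needs exactly one punctuation label")
    pieces = [word + (label or "") for word, label in zip(words, labels)]
    return sep.join(pieces)


if __name__ == "__main__":
    words = ["今天", "开会", "先", "讨论", "预算"]
    labels = [None, "。", None, None, "。"]
    print(apply_punctuation_labels(words, labels))
```

For space-delimited languages, `sep=" "` can be passed so that words are joined with spaces.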

If the model under the Encode-Decode framework is used, the input of the model may be a text sequence without punctuation, and the output of the model is a text sequence containing punctuation information, that is, the result of the model adding punctuation to the input text sequence. The specific form of machine learning model can be selected according to the needs of the application, and this application is not strictly limited.

Further, after determining the structure of the text segmentation model, it is further necessary to obtain training data of the model to train the text segmentation model. In this embodiment of the present application, a large number of source language training texts can be collected as training data. Define the set of source language training text as T1. Further, it is also necessary to determine the sentence segmentation result of each source language training text in T1 that matches the current translation scenario, as a training label corresponding to the source language training text, and the training label and the training data are used to train the text segmentation model together. It can be understood that the training data acquired in this embodiment may be extracted from the source language text to be translated. In addition, training data can also be obtained through other means, for example, selecting a part of text from existing material text as training data.
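
One way to pair training data with training labels for a sequence labeling model is sketched below: a punctuated sentence segmentation result is converted into a token sequence and, for each token, the punctuation mark that follows it. Character-level tokens, the punctuation set, and the function name are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple

# Assumed punctuation inventory for this sketch.
PUNCTUATION = set("，。？！、,.?!")


def to_training_pair(punctuated: str) -> Tuple[List[str], List[Optional[str]]]:
    """Turn a punctuated text into (tokens, labels) for sequence labeling.

    Each non-space character is one token; its label is the punctuation mark
    that directly follows it, or None when no mark follows.
    """
    tokens: List[str] = []
    labels: List[Optional[str]] = []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if tokens:  # attach the mark to the preceding token
                labels[-1] = ch
        elif not ch.isspace():
            tokens.append(ch)
            labels.append(None)
    return tokens, labels


if __name__ == "__main__":
    print(to_training_pair("开会。好，"))
```

If several marks follow one token, this sketch keeps only the last one.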

The text segmentation model after training based on the above training data and the training label has the ability to process the input samples according to the needs of the current translation scenario and output the sentence segmentation results that match the current translation scenario. Based on this, the obtained source language text can be input into the text segmentation model to obtain the source language text after the sentence segmentation output by the text segmentation model, that is, the source language text after the sentence segmentation optimization processing.

In another embodiment of the present application, the determination process of the foregoing text segmentation model is expanded and described. The determination process of the text segmentation model may include:

A1. Obtain the source language training text.

As described above, the set consisting of source language training texts is defined as T1.

A2. Determine a sentence segmentation result of the source language training text that matches the current translation scene as a target sentence segmentation result.

The set of target sentence segmentation results of the source language training texts that match the current translation scenario is defined as T2. T2 is the result of performing sentence segmentation on each source language training text in T1.

A3. Use the source language training text as training data and the target sentence segmentation result as a training label to train a text sentence segmentation model.

Based on the determination process of the text segmentation model in the above example, the embodiment of the present application provides another method for determining the text segmentation model, that is, before the above A3, the following steps are added:

A4. Obtain the result of manually punctuating the source language training text, and obtain the source language training text after manual labeling.

Specifically, after the source language training text is obtained, the source language training text may be manually punctuated to obtain the manually labeled source language training text.

A5. Use the source language training text as training data and the artificially labeled source language training text as training labels to train a text segmentation model to obtain a preliminary text segmentation model.

On the basis of step A4, the source language training text can be used as the training data, and the artificially labeled source language training text can be used as the training label to train the text segmentation model to obtain a preliminary text segmentation model.

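On this basis, the above step A3 may specifically include: using the source language training text as training data and the target sentence segmentation result as a training label to train the preliminary text segmentation model.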

Specifically, a model adaptive update method may be used, the source language training text is used as training data, and the target sentence segmentation result is used as a training label to update the parameters of the preliminary text sentence segmentation model.

Using this model update method increases the amount of model training data, which in turn improves the text segmentation model obtained by training.
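
The two-stage procedure (train a preliminary model on manual labels, then update the same parameters with the target sentence segmentation results) can be illustrated with a toy model. The count-based `CountSegmenter` class below is only a stand-in for a real model such as a BLSTM, and the `weight` given to the second stage is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class CountSegmenter:
    """Toy text segmentation model: per-token counts of the following mark."""

    def __init__(self) -> None:
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def update(
        self, tokens: List[str], labels: List[Optional[str]], weight: int = 1
    ) -> None:
        """Update the model parameters (counts) with one labeled example."""
        for token, label in zip(tokens, labels):
            self.counts[token][label] += weight

    def predict(self, tokens: List[str]) -> List[Optional[str]]:
        """Predict the most frequent label per token; None for unseen tokens."""
        return [
            self.counts[t].most_common(1)[0][0] if t in self.counts else None
            for t in tokens
        ]


if __name__ == "__main__":
    tokens = ["预", "算", "计", "划"]
    manual_labels = [None, "，", None, "。"]  # manually punctuated text (A4)
    target_labels = [None, "。", None, "。"]  # target segmentation result (A2)

    model = CountSegmenter()
    model.update(tokens, manual_labels)            # A5: preliminary model
    print(model.predict(tokens))
    model.update(tokens, target_labels, weight=2)  # A3: adaptive update
    print(model.predict(tokens))
```

The second call reuses the parameters learned in the first stage instead of starting from scratch, which is the point of the adaptive update.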

Another embodiment of the present application describes the process in A2 above of determining the sentence segmentation result of the source language training text that matches the current translation scenario as the target sentence segmentation result.

It can be understood that, as described above, the rules of the sentence segmentation mode corresponding to each translation scenario can be preset. Then, in this embodiment, the source language training text may be subjected to sentence segmentation processing according to the rules of the sentence segmentation mode corresponding to the current translation scenario, to obtain a sentence segmentation result for each source language training text.

In addition, another optional implementation manner is provided in this embodiment, which may specifically include:

A21. Acquire the translated target language training text of the source language training text in the current translation scenario.

Specifically, the translated target language training text of the source language training text in the current translation scenario can be determined by manual translation. That is, the source language training text can be translated manually to obtain the target language training text according to the current translation scenario.

A22. With reference to the set sentence-breaking modification method, change the sentence breaking method of the source language training text to obtain the changed source language training text; the changed source language training text and the source language training text together constitute the candidate source language training texts.

Specifically, the embodiment of the present application may set a sentence breaking modification method in advance, and then may change the sentence breaking method of the source language training text according to the set sentence breaking modification method.

It is understandable that, by appropriately setting the sentence-breaking modification method, multiple changed source language training texts can be generated. The sentence breaking that conforms to the current translation scenario is then either that of the source language training text itself or that of one of the changed source language training texts.

That is, this step expands the source language training text into a set of candidate source language training texts, one of which has the sentence breaking that matches the current translation scenario.

A23. Translate each candidate source language training text by using a preset machine translation model to obtain a machine translation result of each candidate source language training text.

A24. Determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and use the candidate source language training text with the highest similarity as the target sentence segmentation result.

Specifically, the target language training text is the translated result of the source language training text in the current translation scenario. Based on this, the target language training text is used as the standard in this step to determine the similarity between the machine translation results of each candidate source language training text and the target language training text. It is understandable that the higher the similarity of the candidate source language training text, the higher the degree of conformance with the current translation scene. Based on this, the candidate source language training text with the highest similarity can be selected as the target sentence segmentation result of the source language training text that matches the current translation scenario.

Optionally, the similarity in this step can be calculated with the BLEU scoring method: taking the target language training text as the reference, the machine translation result of each candidate source language training text is scored separately. The higher the score, the higher the similarity between the translation of that candidate source language training text and the target language training text.
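As a rough illustration of the scoring in A24, the sketch below computes a simplified BLEU-style score (clipped n-gram precision up to bigrams, add-one smoothing, and a brevity penalty) for each machine translation result against the target language training text. It is not the standard BLEU implementation, and the source does not specify tokenization or n-gram order; the names `simple_bleu` and `score_candidates`, the whitespace tokenization, and `max_n=2` are assumptions made only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=2):
    """Simplified BLEU-style score in (0, 1]; illustrative only."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the logarithm finite when nothing matches.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def score_candidates(machine_translations, target_text):
    """Score every candidate's machine translation against the target language text."""
    return [simple_bleu(mt, target_text) for mt in machine_translations]
```

In practice an established BLEU implementation would normally be preferred; the point here is only that the target language training text serves as the single reference for every candidate.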

Further, this embodiment introduces an optional way to change the sentence breaking method of the source language training text to obtain the changed source language training text, which may specifically include:

A221. Determine the non-terminating punctuation included in the source language training text.

Specifically, for each source language training text T1_j (j = 1 ... n) in the source language training text set T1, where n is the number of source language training texts in T1, the number M of non-terminating punctuation marks included in T1_j is determined.

A222. Each non-terminating punctuation included in the source language training text is replaced with terminating punctuation to obtain the modified source language training text.

It can be understood that any non-terminating punctuation mark in T1_j can be replaced with a terminating punctuation mark.

According to the replacement method introduced in this step, if the number of non-terminating punctuation marks included in T1_j is M, each mark is either kept or replaced, so the source language training text before replacement and the modified source language training texts obtained through replacement together constitute a candidate source language training text set containing 2^M candidate source language training texts.

The following is an example:

Specifically, take the source language training text: "The weather is good today, I want to go climbing, do you go?" It contains two commas, which are non-terminating punctuation, and each comma can be replaced with a terminating punctuation mark such as a period. In the end, 2^2 = 4 candidate source language training texts are obtained, as follows:

1. The weather is good today, I want to go climbing, do you go?

2. The weather is good today. I want to go climbing, do you go?

3. The weather is good today, I want to go climbing. Do you go?

4. The weather is good today. I want to go climbing. Do you go?

It can be understood that, of the four candidate source language training texts obtained, the first is the source language training text itself, and the second through fourth are the modified source language training texts obtained after punctuation replacement.
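The enumeration in A221 and A222 can be sketched as follows. This is a minimal illustration, assuming that the comma is the only non-terminating mark and the period is the replacement; the names `NON_TERMINATING`, `TERMINATING_REPLACEMENT`, and `generate_candidates` are hypothetical, and the sketch does not re-capitalize the word after an inserted period.

```python
from itertools import product

NON_TERMINATING = {","}          # assumed set of non-terminating marks
TERMINATING_REPLACEMENT = "."    # assumed terminating mark used as the replacement


def generate_candidates(text):
    """Return all 2^M variants of text, where M is the number of non-terminating marks."""
    positions = [i for i, ch in enumerate(text) if ch in NON_TERMINATING]
    candidates = []
    for choices in product((False, True), repeat=len(positions)):
        chars = list(text)
        # reversed() makes the first mark vary fastest, so the second candidate
        # replaces only the first comma, as in the example above.
        for pos, replace in zip(positions, reversed(choices)):
            if replace:
                chars[pos] = TERMINATING_REPLACEMENT
        candidates.append("".join(chars))
    return candidates


source = "The weather is good today, I want to go climbing, do you go?"
for index, candidate in enumerate(generate_candidates(source), start=1):
    print(index, candidate)
```

The first candidate returned is always the unchanged source text, and the last one has every non-terminating mark replaced.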

Based on the implementation of A22 introduced in the above embodiment, this embodiment further introduces an optional implementation of step A23 above, that is, translating each candidate source language training text by using a preset machine translation model, which may specifically include:

A231. Divide each candidate source language training text into clauses according to the terminating punctuation it contains, to obtain a clause sequence.

Specifically, for each candidate source language training text, the terminating punctuation it contains is traversed from the beginning, each terminating punctuation mark is used as a dividing point, and the candidate source language training text is divided into several clauses. These clauses form a clause sequence in the order in which they appear in the candidate source language training text.

A232. Use a preset machine translation model to separately translate each clause in the clause sequence of the candidate source language training text to obtain a machine translation result of each clause.

A233. Combine the machine translation results of the clauses according to the order of the clauses in the clause sequence to obtain the machine translation result of the candidate source language training text.

It can be understood that the number of candidate source language training texts is 2^M; each candidate source language training text is translated in the above manner, and finally 2^M machine translation results are obtained.
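Steps A231 to A233 can be sketched as below. The machine translation model is represented by an arbitrary callable `translate_clause`, which is a placeholder rather than any real model API; the set of terminating marks (`.`, `?`, `!`) and the use of a single space when merging results are likewise assumptions for illustration.

```python
import re

# Assumed terminating marks; a real system would use the marks of the source language.
CLAUSE_PATTERN = re.compile(r"[^.?!]+[.?!]+|[^.?!]+$")


def split_clauses(text):
    """A231: cut the text after each terminating mark, keeping the original order."""
    return [part.strip() for part in CLAUSE_PATTERN.findall(text) if part.strip()]


def translate_by_clause(text, translate_clause):
    """A232 and A233: translate each clause separately, then merge the results in order."""
    clauses = split_clauses(text)
    results = [translate_clause(clause) for clause in clauses]
    return " ".join(results)


# Usage with a stand-in "model" that only upper-cases its input.
print(translate_by_clause(
    "The weather is good today. I want to go climbing, do you go?", str.upper))
```

Text that contains no terminating mark at all is treated as a single clause, which corresponds to the case where a candidate cannot be split further.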

According to the processing described above, some non-terminating punctuation is converted into terminating punctuation, so terminating punctuation occurs more often. Because machine translation is performed on the content preceding each terminating punctuation mark, the scheme of this application shortens the wait for a terminating punctuation mark, which increases the output speed of translation results, reduces the time users perceive themselves waiting for translation results, and improves the user experience.

The implementation process of A23 is illustrated below, continuing with the above example:

For ease of expression, the four candidate source language training texts of the above examples are defined as candidate texts 1-4, respectively.

For candidate text 1: terminating punctuation appears only at the end of the sentence and not within it, so the sentence cannot be split further; in other words, the only clause is candidate text 1 itself. Therefore, candidate text 1 can be sent to the machine translation model as a single sentence for translation.

For candidate text 2: a period follows "today" within the sentence, so the sentence can be split. Candidate text 2 is split into two clauses:

Clause 21: The weather is good today.

Clause 22: I want to go climbing, do you go?

The two clauses after splitting are sent to the machine translation model for translation, and the machine translation results are merged to obtain the machine translation result of candidate text 2.

For candidate text 3: a period follows "climbing" within the sentence, so the sentence can be split. Candidate text 3 is split into two clauses:

Clause 31: The weather is good today, I want to go climbing.

Clause 32: Do you go?

The two clauses after splitting are sent to the machine translation model for translation, and the machine translation results are merged to obtain the machine translation result of candidate text 3.

For candidate text 4: periods follow both "today" and "climbing" within the sentence, so the sentence can be split. Candidate text 4 is split into three clauses:

Clause 41: The weather is good today.

Clause 42: I want to go climbing.

Clause 43: Do you go?

The three clauses after splitting are sent to the machine translation model for translation, and the machine translation results are merged to obtain the machine translation result of the candidate text 4.

Furthermore, suppose that for the above candidate texts 1-4, the BLEU method is used for scoring, and the score values are: 0.1, 0.2, 0.3, 0.4 in order. Then, the candidate text 4 with the highest score can be selected as the target sentence segmentation result of the source language training text that matches the current translation scenario.

Then, for the source language training text: "The weather is good today, I want to go climbing, do you go?"

the target sentence segmentation result is: "The weather is good today. I want to go climbing. Do you go?"

The source language training text and its target sentence segmentation result can then be used as training data and training label, respectively, to train the text segmentation model.
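The selection step of this example can be sketched as follows, using the hypothetical scores 0.1, 0.2, 0.3 and 0.4 assumed above. The function name `build_training_pair` and the representation of a training sample as a `(data, label)` tuple are illustrative assumptions; the source does not prescribe a data format.

```python
def build_training_pair(source_text, candidates, scores):
    """Pick the highest-scoring candidate and pair it with the source text as its label."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and of equal length")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return source_text, candidates[best_index]


source = "The weather is good today, I want to go climbing, do you go?"
candidates = [
    "The weather is good today, I want to go climbing, do you go?",
    "The weather is good today. I want to go climbing, do you go?",
    "The weather is good today, I want to go climbing. Do you go?",
    "The weather is good today. I want to go climbing. Do you go?",
]
data, label = build_training_pair(source, candidates, [0.1, 0.2, 0.3, 0.4])
print(label)
```

With these scores the fourth candidate is selected, so the pair consists of the original comma-separated sentence as training data and the fully split sentence as training label.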

In still another embodiment of the present application, in the above step S120, the process of translating the source language text after the sentence segmentation to obtain the target language text is introduced.

Based on the introduction of the above embodiment, it can be known that the source language text after sentence segmentation can be translated using a machine translation model. In the specific translation process, the source language text after sentence segmentation can first be divided into clauses according to the terminating punctuation it contains, to obtain a clause sequence. Then, using a preset machine translation model, each clause in the clause sequence is translated separately to obtain a machine translation result for each clause. Finally, the machine translation results of the clauses are combined according to the order of the clauses in the clause sequence to obtain the machine translation result of the source language text after sentence segmentation, that is, the target language text.

Based on the introduction of the above embodiments, it can be seen that the present application segments the source language text with the current translation scenario in mind, so that the resulting source language text after sentence segmentation better conforms to the current translation scenario. Translating the source language text after sentence segmentation on this basis then yields target language text of higher quality.

Further, in this application, when sentences are broken by means of punctuation, some non-terminating punctuation can be converted into terminating punctuation, so terminating punctuation occurs more often. Because machine translation is performed on the content preceding each terminating punctuation mark, the scheme of this application shortens the wait for a terminating punctuation mark, thereby increasing the output speed of translation results, reducing the time users perceive themselves waiting for translation results, and improving the user experience.
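The latency argument can be illustrated with a small sketch in which text arrives in chunks (for example from speech recognition) and a clause is released for translation as soon as a terminating mark is seen. The generator name `stream_clauses`, the chunked input, and the set of terminating marks are assumptions for illustration and are not part of the described scheme.

```python
def stream_clauses(chunks, terminating=".?!"):
    """Yield each clause as soon as its terminating mark arrives in the input stream."""
    buffer = ""
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if ch in terminating:
                clause = buffer.strip()
                if clause:
                    yield clause
                buffer = ""
    # Any trailing text without a terminating mark is released at the end of the stream.
    if buffer.strip():
        yield buffer.strip()


# With periods inside the sentence, the first clause is available after the first chunk.
segmented = ["The weather is good today. I want", " to go climbing. Do you go?"]
print(list(stream_clauses(segmented)))

# With commas only, nothing can be released until the final question mark arrives.
unsegmented = ["The weather is good today, I want", " to go climbing, do you go?"]
print(list(stream_clauses(unsegmented)))
```

The first call releases three short clauses one after another, while the second releases a single clause only at the very end, which is the waiting time the scheme aims to shorten.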

The translation device provided by the embodiments of the present application will be described below. The translation device described below and the translation method described above can be referred to each other.

Refer to FIG. 2, which is a schematic structural diagram of a translation device disclosed in an embodiment of the present application.

As shown in FIG. 2, the device may include:

The source language text obtaining unit 11 is used to obtain the source language text to be translated;

The text segmentation unit 12 is configured to segment the source language text according to the current translation scenario to obtain the source language text after the sentence segmentation;

The source language text translation unit 13 is configured to translate the source language text after the sentence segmentation to obtain the target language text.
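The cooperation of the three units can be sketched as a simple composition. The class name `TranslationApparatus`, its method names, and the stand-in callables used in the usage example are hypothetical; in particular, the segmentation and translation callables merely stand in for the text segmentation model and the machine translation model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationApparatus:
    segment: Callable[[str], str]    # stands in for the text segmentation unit 12
    translate: Callable[[str], str]  # stands in for the source language text translation unit 13

    def obtain(self, raw_text: str) -> str:
        """Stands in for the source language text obtaining unit 11."""
        return raw_text.strip()

    def run(self, raw_text: str) -> str:
        source = self.obtain(raw_text)
        segmented = self.segment(source)
        return self.translate(segmented)


# Usage with toy stand-ins: replace commas with periods, then upper-case as "translation".
apparatus = TranslationApparatus(
    segment=lambda text: text.replace(",", "."),
    translate=str.upper,
)
print(apparatus.run("  the weather is good today, i want to go climbing  "))
```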

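Optionally, the above text segmentation unit may include:

A model reference unit, configured to input the source language text into a preset text segmentation model to obtain the source language text after sentence segmentation output by the text segmentation model;

Wherein, the text segmentation model is obtained by training with the source language training text as training data and the sentence segmentation result of the source language training text that matches the current translation scenario as the training label.

Optionally, the translation device of the present application may further include: a text segmentation model determination unit for determining the text segmentation model. The text segmentation model determination unit may include:

A source language training text acquisition unit, configured to obtain the source language training text;

A sentence segmentation result determination unit, configured to determine the sentence segmentation result of the source language training text that matches the current translation scenario, as the target sentence segmentation result;

A first model training unit, configured to train the text segmentation model by using the source language training text as training data and the target sentence segmentation result as a training label.

Optionally, the above sentence segmentation result determination unit may include:

A target language training text obtaining unit, configured to obtain the translated target language training text of the source language training text in the current translation scenario;

A sentence changing unit, configured to change the sentence breaking method of the source language training text with reference to the set sentence breaking modification method to obtain the changed source language training texts, the changed source language training texts and the source language training text together constituting the candidate source language training texts;

A source language training text translation unit, configured to translate each candidate source language training text by using a preset machine translation model to obtain a machine translation result of each candidate source language training text;

A similarity determination unit, configured to determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and to use the candidate source language training text with the highest similarity as the target sentence segmentation result.

Optionally, the above sentence changing unit may include:

A non-terminating punctuation determining unit, configured to determine the non-terminating punctuation included in the source language training text;

A non-terminating punctuation replacement unit, configured to replace each non-terminating punctuation mark included in the source language training text with a terminating punctuation mark to obtain the modified source language training text.

Optionally, the above source language training text translation unit may include:

A first clause dividing unit, configured to divide each candidate source language training text into clauses according to the terminating punctuation it contains, to obtain a clause sequence;

A first clause translation unit, configured to use a preset machine translation model to separately translate each clause in the clause sequence of the candidate source language training text to obtain a machine translation result of each clause;

A first translation result merging unit, configured to merge the machine translation results of the clauses according to the order of the clauses in the clause sequence to obtain the machine translation result of the candidate source language training text.

Optionally, the above text segmentation model determination unit may also include:

A manual labeling result obtaining unit, configured to obtain the result of manually punctuating the source language training text, to obtain the manually labeled source language training text;

A second model training unit, configured to use the source language training text as training data and the manually labeled source language training text as training labels to train a text segmentation model to obtain a preliminary text segmentation model. Based on this, the above-mentioned first model training unit can be specifically used for:

Training the preliminary text segmentation model by using the source language training text as training data and the target sentence segmentation result as a training label.

Optionally, the above source language text translation unit may include:

A second clause dividing unit, configured to divide the source language text after sentence segmentation into clauses according to the terminating punctuation it contains, to obtain a clause sequence;

A second clause translation unit, configured to use a preset machine translation model to separately translate each clause in the clause sequence of the source language text after sentence segmentation to obtain a machine translation result of each clause;

A second translation result merging unit, configured to merge the machine translation results of the clauses according to the order of the clauses in the clause sequence to obtain the target language text.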

The translation apparatus provided in the embodiments of the present application may be applied to translation equipment, such as PC terminals, cloud platforms, servers, and server clusters. Optionally, FIG. 3 shows a block diagram of the hardware structure of the translation device. Referring to FIG. 3, the hardware structure of the translation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;

In the embodiment of the present application, the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 complete communication with each other through the communication bus 4;

The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;

The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one magnetic disk memory;

Among them, the memory stores a program, and the processor can call the program stored in the memory, and the program is used for:

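Obtaining the source language text to be translated;

Segmenting the source language text according to the current translation scenario to obtain the source language text after sentence segmentation;

Translating the source language text after sentence segmentation to obtain the target language text.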

Optionally, the detailed functions and extended functions of the program may refer to the above description.

An embodiment of the present application further provides a readable storage medium, where the readable storage medium may store a program suitable for execution by a processor, and the program is used to:

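Obtain the source language text to be translated;

Segment the source language text according to the current translation scenario to obtain the source language text after sentence segmentation;

Translate the source language text after sentence segmentation to obtain the target language text.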

Finally, it should also be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments, and the same or similar parts between the embodiments may refer to each other.

The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be apparent to those skilled in the art. The general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application will not be limited to the embodiments shown in this document, but should conform to the widest scope consistent with the principles and novel features disclosed in this document.

Claims

1. A translation method, characterized by comprising:

Obtain the source language text to be translated;

Segment the source language text according to the current translation scenario to obtain the source language text after sentence segmentation;

Translate the source language text after the sentence segmentation to obtain the target language text.
2. The method according to claim 1, wherein segmenting the source language text according to the current translation scenario to obtain the source language text after sentence segmentation includes:

Input the source language text into a preset text segmentation model to obtain the source language text after the segmentation output by the text segmentation model;

Wherein, the text segmentation model is obtained by training the source language training text as training data and using the sentence segmentation result of the source language training text that matches the current translation scene as the training label.
3. The method according to claim 2, wherein the process of determining the text segmentation model includes:

Get the source language training text;

Determining a sentence segmentation result of the source language training text that matches the current translation scenario, as a target sentence segmentation result;

Use the source language training text as training data and the target sentence segmentation result as a training label to train a text sentence segmentation model.
4. The method according to claim 3, wherein determining the sentence segmentation result of the source language training text that matches the current translation scenario as the target sentence segmentation result includes:

Acquiring the translated target language training text of the source language training text in the current translation scenario;

With reference to the set sentence breaking modification method, changing the sentence breaking method of the source language training text to obtain the changed source language training texts, the changed source language training texts and the source language training text together constituting the candidate source language training texts;

Using a preset machine translation model to translate each candidate source language training text to obtain a machine translation result of each candidate source language training text;

Determining the similarity between the machine translation result of each candidate source language training text and the target language training text, and using the candidate source language training text with the highest similarity as the target sentence segmentation result.
5. The method according to claim 4, wherein changing the sentence breaking method of the source language training text with reference to the set sentence breaking modification method to obtain the changed source language training text includes:

Determine the non-terminating punctuation included in the source language training text;

Each non-terminating punctuation included in the source language training text is replaced with terminating punctuation to obtain the modified source language training text.
6. The method according to claim 5, wherein using the preset machine translation model to translate each candidate source language training text to obtain a machine translation result of each candidate source language training text includes:

Divide each candidate source language training text into clauses according to the termination punctuation it contains to obtain a divided clause sequence;

Use a preset machine translation model to separately translate each clause in the clause sequence of the candidate source language training text to obtain a machine translation result for each clause;

According to the order of the clauses in the clause sequence, the machine translation results of the clauses are combined to obtain the machine translation results of the candidate source language training text.
7. The method according to claim 3, characterized in that, before the text segmentation model is trained by using the source language training text as training data and the target sentence segmentation result as a training label, the method further includes:

Obtaining the result of manually punctuating the source language training text, to obtain the manually labeled source language training text;

Using the source language training text as training data and the manually labeled source language training text as training labels to train a text segmentation model to obtain a preliminary text segmentation model;

Then, training the text segmentation model by using the source language training text as training data and the target sentence segmentation result as a training label includes:

Using the source language training text as training data and the target sentence segmentation result as a training label, the preliminary text sentence segmentation model is trained.
8. The method according to claim 1, wherein translating the source language text after the sentence segmentation to obtain the target language text includes:

Divide the source language text after the sentence segmentation into clauses according to the termination punctuation it contains to obtain the divided clause sequence;

Use a preset machine translation model to translate each clause in the clause sequence of the source language text after the sentence segmentation separately to obtain a machine translation result for each clause;

According to the order of the clauses in the clause sequence, the machine translation results of the clauses are combined to obtain the target language text.
9. A translation device, characterized in that it includes:

Source language text acquisition unit, used to obtain the source language text to be translated;

Text segmentation unit, used to segment the source language text according to the current translation scenario, to obtain the source language text after sentence segmentation;

The source language text translation unit is used to translate the source language text after the sentence segmentation to obtain the target language text.
10. The device according to claim 9, wherein the text segmentation unit comprises:

A model reference unit, used to input the source language text into a preset text segmentation model to obtain the source language text after the sentence segmentation output by the text segmentation model;

Wherein, the text segmentation model is obtained by training the source language training text as training data and using the sentence segmentation result of the source language training text that matches the current translation scene as the training label.
11. The device according to claim 10, further comprising: a text segmentation model determination unit for determining the text segmentation model; the text segmentation model determination unit includes:

Source language training text acquisition unit, used to obtain source language training text;

A sentence segmentation result determination unit, configured to determine a sentence segmentation result of the source language training text that matches the current translation scenario, as a target sentence segmentation result;

The first model training unit is used to train the text segmentation model by using the source language training text as training data and the target sentence segmentation result as a training label.
12. The device according to claim 11, wherein the sentence segmentation result determination unit comprises:

A target language training text obtaining unit, configured to obtain the translated target language training text of the source language training text in the current translation scenario;

A sentence changing unit, configured to change the sentence breaking method of the source language training text with reference to the set sentence breaking modification method to obtain the changed source language training texts, the changed source language training texts and the source language training text together constituting the candidate source language training texts;

A source language training text translation unit, configured to use a preset machine translation model to translate each candidate source language training text to obtain a machine translation result of each candidate source language training text;

A similarity determination unit, configured to determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and to use the candidate source language training text with the highest similarity as the target sentence segmentation result.
13. The device according to claim 12, wherein the sentence changing unit comprises:

A non-terminating punctuation determining unit, configured to determine the non-terminating punctuation included in the source language training text;

A non-terminating punctuation replacement unit is used to replace each non-terminating punctuation included in the source language training text with a terminating punctuation to obtain a modified source language training text.
14. The device according to claim 13, wherein the source language training text translation unit includes:

A first clause dividing unit, configured to divide each of the candidate source language training texts according to the terminating punctuation it contains to obtain the divided clause sequence;

The first clause translation unit is configured to use a preset machine translation model to separately translate each clause in the clause sequence of the candidate source language training text to obtain a machine translation result of each clause;

The first translation result merging unit is used to merge the machine translation results of each clause in the order of each clause in the clause sequence to obtain the machine translation result of the candidate source language training text.
15. The device according to claim 11, wherein the text segmentation model determination unit further comprises:

A manual labeling result obtaining unit, which is used to obtain a result of manually punctuating the source language training text to obtain the source language training text after manual labeling;

A second model training unit, configured to use the source language training text as training data and the artificially labeled source language training text as training labels to train a text segmentation model to obtain a preliminary text segmentation model;

Then the first model training unit is specifically used for:

Using the source language training text as training data and the target sentence segmentation result as a training label, train the preliminary text sentence segmentation model.
16. The device according to claim 9, wherein the source language text translation unit includes:

A second clause dividing unit, configured to divide the source language text after the sentence segmentation into clauses according to the terminating punctuation it contains to obtain the divided clause sequence;

The second clause translation unit is used to translate each clause in the clause sequence of the source language text after the sentence segmentation by using a preset machine translation model to obtain a machine translation result of each clause;

The second translation result merging unit is used to merge the machine translation results of each clause in the order of each clause in the clause sequence to obtain the target language text.
17. A translation device, characterized in that it includes a memory and a processor;

The memory is used to store programs;

The processor is configured to execute the program and implement the steps of the translation method according to any one of claims 1-8.
18. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, each step of the translation method according to any one of claims 1-8 is realized.