CN114254658A - Method, device, equipment and storage medium for generating translation evaluation training data - Google Patents

Publication number
CN114254658A
Authority
CN
China
Prior art keywords
translation
sample
sentences
sentence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111527223.XA
Other languages
Chinese (zh)
Inventor
王永杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Liulishuo Information Technology Co ltd
Original Assignee
Shanghai Liulishuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Liulishuo Information Technology Co ltd filed Critical Shanghai Liulishuo Information Technology Co ltd
Priority to CN202111527223.XA priority Critical patent/CN114254658A/en
Publication of CN114254658A publication Critical patent/CN114254658A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/51 Translation evaluation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment

Abstract

A method, a device, equipment and a storage medium for generating translation evaluation training data are provided. The method comprises the following steps: obtaining a sample original sentence to be translated; obtaining a sample translated sentence corresponding to the sample original sentence; obtaining a reference translation set corresponding to the sample original sentence, wherein the reference translation set comprises a plurality of reference translated sentences; selecting, from the reference translation set, the reference translated sentence with the highest similarity to the sample translated sentence as a sample post-editing sentence; obtaining a sample translation quality label for each word in the sample original sentence by using the sample post-editing sentence and the sample translated sentence; and establishing a training set according to the sample translation quality labels. Because the sample post-editing sentence is generated from the reference translated sentences of the original sentence, post-editing sentences and sample translation quality labels can still be obtained when no public post-editing data is available, without manual post-editing or labeling, so that training data for the translation evaluation model is obtained while the training efficiency of the model is ensured.

Description

Method, device, equipment and storage medium for generating translation evaluation training data
Technical Field
The embodiment of the invention relates to the technical field of machine translation, in particular to a method, a device, equipment and a storage medium for generating translation evaluation training data.
Background
Machine Translation (MT) refers to a technique for translating an original text in one natural language (generally referred to as a source language) into a translated text in another natural language (generally referred to as a target language) using a computer. The translation process is completed through a machine translation system, so that the translation efficiency is higher compared with manual translation.
In machine translation, the accuracy of the translation result must be ensured along with translation efficiency. It is therefore important to evaluate the translation quality of machine translation (for example, the evaluation result can be fed back to optimize the machine translation system). Translation quality is generally evaluated either manually or automatically. Automatic evaluation uses a translation evaluation model to assess the quality of the translation result output by machine translation; it is fast, gives timely feedback and keeps evaluation results consistent, so automatic evaluation by machine has gradually become the mainstream evaluation mode.
At present, the training of the evaluation model still has certain limitations.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for generating translation evaluation training data, which can ensure the training efficiency of a translation evaluation model while obtaining the training data for training the translation evaluation model.
In order to solve the above problem, an embodiment of the present invention provides a method for generating translation evaluation training data, including: obtaining one or more sample original sentences to be translated; obtaining a sample translation statement corresponding to the sample original statement, wherein the sample translation statement is obtained by translating the sample original statement; acquiring a reference translation set corresponding to the sample original sentence, wherein the reference translation set comprises a plurality of reference translation sentences; selecting the reference translation statement with the highest similarity with the sample translation statement from the reference translation set as a sample post-editing statement; utilizing the sample post-editing sentences and the sample translation sentences to obtain sample translation quality labels of all words in the sample original sentences; and establishing a training set according to the sample translation quality label, wherein the training set comprises the sample original sentence, the sample translation sentence and the sample translation quality label, and the training set is used for training a translation evaluation model.
Correspondingly, an embodiment of the present invention further provides a device for generating translation evaluation training data, including: a first sentence acquisition module, configured to obtain one or more sample original sentences to be translated; a sample translated sentence acquisition module, configured to obtain a sample translated sentence corresponding to the sample original sentence, the sample translated sentence being obtained by translating the sample original sentence; a reference translation set acquisition module, configured to obtain a reference translation set corresponding to the sample original sentence, where the reference translation set includes a plurality of reference translated sentences; a second sentence acquisition module, configured to select, from the reference translation set, the reference translated sentence with the highest similarity to the sample translated sentence as a sample post-editing sentence; a label acquisition module, configured to obtain a sample translation quality label for each word in the sample original sentence by using the sample post-editing sentence and the sample translated sentence; and a training set establishing module, configured to establish a training set according to the sample translation quality labels, where the training set includes the sample original sentence, the sample translated sentence and the sample translation quality labels, and is used for training a translation evaluation model.
Accordingly, an apparatus according to an embodiment of the present invention is further provided, which includes at least one memory and at least one processor, where the memory stores one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method for generating the translation evaluation training data according to the embodiment of the present invention.
Correspondingly, an embodiment of the present invention further provides a storage medium storing one or more computer instructions, where the one or more computer instructions are used to implement the method for generating translation evaluation training data according to the embodiment of the present invention.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following advantages:
in the embodiment of the invention, a plurality of reference translation sentences of an original sentence are utilized to generate a sample post-editing sentence, and the sample post-editing sentence is utilized to generate a sample translation quality label (label) of each vocabulary in the sample original sentence for training a translation evaluation model.
Drawings
FIG. 1 is a flow chart of an embodiment of a method for generating translation evaluation training data according to the present invention;
FIG. 2 is a functional block diagram of an embodiment of an apparatus for generating translation evaluation training data according to the present invention;
FIG. 3 is a hardware structure diagram of a device according to an embodiment of the present invention.
Detailed Description
As can be seen from the background art, there are still limitations to the training of evaluation models.
Specifically, translation quality evaluation for machine translation covers three levels: words, sentences and documents. For a word-level translation evaluation task, when a training set is established, obtaining the quality labels of words depends on manually post-edited (Post-Edit, PE) sentences of the machine translation result.
However, in some translation evaluation tasks (for example, evaluating Chinese-to-English translation), no public post-editing data is available, so specific quality labels cannot be constructed and the model cannot be trained; that is, the approach requires publicly available post-editing sentences. If post-editing is instead performed manually, or the quality labels of specific words are annotated manually, considerable effort is required, and the efficiency of model training is therefore low.
In order to solve this technical problem, an embodiment of the present invention provides a method for generating translation evaluation training data. Referring to FIG. 1, a flow chart of an embodiment of the method for generating translation evaluation training data of the present invention is shown.
In the embodiment of the invention, the method for generating the translation evaluation training data comprises the following basic steps:
Step S1: obtaining one or more sample original sentences to be translated;
Step S2: obtaining a sample translated sentence corresponding to the sample original sentence, wherein the sample translated sentence is obtained by translating the sample original sentence;
Step S3: obtaining a reference translation set corresponding to the sample original sentence, wherein the reference translation set comprises a plurality of reference translated sentences;
Step S4: selecting, from the reference translation set, the reference translated sentence with the highest similarity to the sample translated sentence as a sample post-editing sentence;
Step S5: obtaining a sample translation quality label for each word in the sample original sentence by using the sample post-editing sentence and the sample translated sentence;
Step S6: establishing a training set according to the sample translation quality labels, wherein the training set comprises the sample original sentence, the sample translated sentence and the sample translation quality labels, and the training set is used for training a translation evaluation model.
In the embodiment of the invention, a plurality of reference translation sentences of an original sentence are utilized to generate a sample post-editing sentence, and the sample post-editing sentence is utilized to generate a sample translation quality label of each vocabulary in the sample original sentence for training a translation evaluation model.
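Steps S4 to S6 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the use of the standard library's `difflib.SequenceMatcher` as the similarity measure, and the OK/BAD tagging of the tokens of the sample translated sentence. This section does not specify the similarity metric or the labeling procedure, and it describes labels for the words of the sample original sentence, so the tagging below is only a simplified stand-in.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity ratio in [0, 1] (assumed metric, not from the source)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def select_post_edit(translation: str, references: list[str]) -> str:
    """Step S4: pick the reference sentence most similar to the sample translation."""
    return max(references, key=lambda ref: similarity(translation, ref))


def label_tokens(translation: str, post_edit: str) -> list[str]:
    """Step S5 (simplified assumption): tag each translation token OK or BAD.

    A token is OK when it falls inside a block that matches the post-edit.
    """
    mt, pe = translation.split(), post_edit.split()
    labels = ["BAD"] * len(mt)
    for block in SequenceMatcher(None, mt, pe).get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = "OK"
    return labels


def build_example(source: str, translation: str, references: list[str]) -> dict:
    """Steps S4 to S6 for one sample: returns one training-set entry."""
    post_edit = select_post_edit(translation, references)
    return {
        "source": source,
        "translation": translation,
        "post_edit": post_edit,
        "labels": label_tokens(translation, post_edit),
    }
```

Calling `build_example` once per sample original sentence and collecting the returned entries gives a training set in the sense of step S6; the entry layout (a dictionary with these four keys) is likewise hypothetical.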
In order to make the aforementioned objects, features and advantages of the embodiments of the present invention comprehensible, specific embodiments accompanied with figures are described in detail below.
Referring to fig. 1, step S1 is executed to obtain one or more sample original sentences to be translated.
The sample original sentence refers to a sentence that needs to be translated. In machine translation, a sentence in a source language is translated into a sentence in a specified target language; for example, in a Chinese-to-English translation task, the source language is Chinese and the target language is English. When the translation evaluation model is trained, the sample original sentences are used as part of the training set.
In this embodiment, the sample original sentence is a source-language sentence. For example, the sample original sentence is "I like cats because cats are elegant.". It should be noted that the double quotation marks are used only to delimit the content of the example and are not part of the sample original sentence; those skilled in the art may use other symbols that are not easily confused to delimit the content, and the double quotation marks used below serve the same purpose. It should also be noted that punctuation marks may be regarded as special words.
The number of sample original sentences may be one or more. In one embodiment, the sample original sentences come from a single source-language corpus that includes one or more sample original sentences. In another embodiment, the sample original sentences come from different source-language corpora. In other embodiments, the sample original sentences are multiple independent sentences of different origin. A corpus here generally refers to a paragraph or an article composed of a plurality of sentences.
With continued reference to FIG. 1, step S2 is executed to obtain a sample translated sentence corresponding to the sample original sentence, where the sample translated sentence is obtained by translating the sample original sentence.
The sample translated sentences correspond to the sample original sentences, and the sample translated sentences are also used as part of the training set when the translation evaluation model is trained.
In addition, after the sample post-editing sentence is subsequently obtained, the sample translation quality label of each word in the sample original sentence is obtained by using the sample post-editing sentence and the sample translated sentence. Post-editing generally refers to manually correcting (for example, modifying and polishing) the output of a machine translation system so that it becomes an acceptable translation result, thereby improving translation quality.
In this embodiment, post-editing also covers correcting a manual translation result (for example, in a scenario where students answer translation questions, correcting a student's translation). Therefore, in this embodiment, the sample translated sentence is obtained by machine translation or manual translation of the sample original sentence.
As an example, when the sample original sentence is "I like cats because cats are elegant.", the corresponding sample translated sentence is "I lik cats because the same area algorithm."
With continuing reference to fig. 1, step S3 is executed to obtain a reference translation set corresponding to the sample original sentence, where the reference translation set includes a plurality of reference translation sentences.
The reference translation sentence refers to a sentence for comparison with the sample translation sentence, and the reference translation sentence is generally a translation with higher translation quality, that is, the reference translation sentence meets the requirement of confidence.
The sample original sentences are in one-to-one correspondence with the reference translation set, the reference translation set comprises a plurality of reference translation sentences, and the plurality of reference translation sentences in the same reference translation set are used as candidate sample post-editing sentences corresponding to the sample original sentences, so that the sample post-editing sentences corresponding to the sample original sentences can be selected from the reference translation set subsequently.
In a word-level translation evaluation task, training the translation evaluation model relies on post-editing sentences. For some translation evaluation tasks, however, no public post-editing data exists, so the translation evaluation model cannot be trained. In this embodiment, the sample post-editing sentence is generated from a plurality of reference translated sentences of the original sentence. Post-editing sentences can therefore be generated even when no public post-editing data is available (that is, automatic post-editing is achieved without relying on public post-editing data), and they are obtained without manual annotation. Training data for the translation evaluation model can thus be constructed without public post-editing data, and the training efficiency of the translation evaluation model is ensured while the training data is obtained.
It should be noted that each reference translation set includes a plurality of reference translated sentences. Increasing the number of reference translated sentences makes it possible to obtain a more accurate sample post-editing sentence subsequently, which in turn improves the evaluation accuracy of the translation evaluation model trained later.
It will be appreciated that when a sentence in a source language is translated into a sentence in a specified target language, a word in the source-language sentence generally has several possible translations in the target language, so a plurality of reference translated sentences can be obtained for one sample original sentence. For example, the reference translated sentences are T1, T2, T3, T4 and T5, and together they constitute a reference translation set S = {T1, T2, T3, T4, T5}. As an example, when the sample original sentence is "I like cats because they are elegant.", the plurality of reference translated sentences includes "I like cs best of the area vertical algorithm", "I local cs best of the area vertical algorithm", "I like cs best of the area vertical algorithm", and "I local cs best of the area vertical algorithm", and these reference translated sentences constitute the reference translation set S.
In this embodiment, the reference translation set corresponding to the sample original sentence is obtained by machine translation, by manual translation, or by both.
Taking manual translation as an example, in scenarios involving translation questions, meta information about the questions is usually stored; that is, reference translated sentences are stored in advance and can be used directly. It is understood that obtaining the reference translation set by manual translation is not limited to the translation-question scenario. It should be noted that obtaining the reference translation set by manual translation helps improve the accuracy of the reference translated sentences.
When the reference translation set corresponding to the sample original sentence is obtained by means of machine translation, the step of obtaining the reference translation set correspondingly comprises: obtaining a plurality of candidate machine translation sets from different machine translation tools, each candidate machine translation set comprising one or more candidate machine translation statements; performing first screening processing on the candidate machine translation sets, and selecting a plurality of candidate machine translation statements meeting a first preset condition, wherein the first preset condition comprises a confidence coefficient; and acquiring a reference translation statement from a plurality of candidate machine translation statements meeting a first preset condition, wherein the reference translation statement forms a reference translation set.
With machine translation, the reference translation set can be obtained automatically, which is more efficient than manual translation. It should be noted that when the reference translation set corresponding to the sample original sentence is obtained by machine translation, the machine translation tools used need to have high accuracy.
In this embodiment, the machine translation tools correspond to the candidate machine translation sets one to one, and therefore, when a plurality of machine translation tools are used, the number of candidate machine translation sets is correspondingly multiple.
In this embodiment, the sample post-editing sentence is subsequently generated from a plurality of reference translated sentences of the original sentence. To improve the accuracy of the sample post-editing sentence, the first preset condition therefore includes a confidence requirement; that is, candidate machine-translated sentences with higher confidence are selected as reference translated sentences, which improves the evaluation accuracy of the translation evaluation model trained later. The higher the confidence, the higher the translation quality.
To this end, the first screening process is performed on the plurality of candidate machine translation sets in one or more preset manners. In this embodiment, the preset manners include: selecting the candidate machine-translated sentences that are common to a plurality of candidate machine translation sets; or selecting the candidate machine-translated sentences that come from machine translation tools meeting an accuracy requirement; or using the confidence score that the machine translation tool outputs for each word to compute the average confidence score over all words in a candidate machine-translated sentence, taking this average as the translation result score, and selecting the candidate machine-translated sentences whose translation result score is greater than or equal to a preset score threshold.
When a plurality of candidate machine translation sets share a common candidate machine-translated sentence, this indicates that the common sentence has high accuracy.
A machine translation tool that meets the accuracy requirement is a tool with higher reliability, so selecting candidate machine-translated sentences from such tools also helps ensure the accuracy of the candidates. For example, machine translation tools that meet the accuracy requirement include the Google translation system and the like.
The confidence score for each vocabulary output by the machine translation tool facilitates a more direct assessment of the accuracy of the candidate machine translated sentences. It should be noted that, in the actual operation process, the preset score threshold may be set according to the actual requirement.
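The three preset manners of the first screening process can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names and data layout (a dictionary from a tool name to its candidate sentences, and a dictionary from a sentence to its per-word confidence scores) are hypothetical, and "common" is interpreted here as "present in every candidate set", which the text does not state explicitly.

```python
from statistics import mean


def common_candidates(candidate_sets: dict[str, list[str]]) -> set[str]:
    """Manner 1 (assumed reading): sentences that appear in every tool's set."""
    sets = [set(sentences) for sentences in candidate_sets.values()]
    return set.intersection(*sets) if sets else set()


def from_trusted_tools(candidate_sets: dict[str, list[str]],
                       trusted: set[str]) -> set[str]:
    """Manner 2: sentences from tools assumed to meet the accuracy requirement."""
    return {
        sentence
        for tool, sentences in candidate_sets.items()
        if tool in trusted
        for sentence in sentences
    }


def translation_score(word_confidences: list[float]) -> float:
    """Translation result score: mean of the per-word confidence scores."""
    return mean(word_confidences)


def above_threshold(word_scores: dict[str, list[float]],
                    threshold: float) -> set[str]:
    """Manner 3: keep sentences whose score is >= the preset score threshold."""
    return {
        sentence
        for sentence, confidences in word_scores.items()
        if translation_score(confidences) >= threshold
    }
```

The manners can be combined, for example by taking the union of the three result sets, when a single manner leaves too few candidates.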
In this embodiment, obtaining the reference translated sentences from the candidate machine-translated sentences that satisfy the first preset condition includes: judging whether the number of candidate machine-translated sentences satisfying the first preset condition meets a first number threshold condition, where the first number threshold condition is that this number is greater than or equal to a first preset number; when the first number threshold condition is not met, selecting all candidate machine-translated sentences satisfying the first preset condition as reference translated sentences; and when the first number threshold condition is met, selecting, from the candidate machine-translated sentences satisfying the first preset condition, the first preset number of sentences with the highest confidence as reference translated sentences.
Checking the number of qualifying candidates against the first number threshold condition ensures that enough reference translated sentences are obtained and that the selected reference translated sentences have high confidence, so that the reference translated sentence with the highest similarity to the sample translated sentence can later be selected from the reference translation set as the sample post-editing sentence. Accordingly, when the number of candidate machine-translated sentences satisfying the first preset condition is less than the first preset number, all of them are selected as reference translated sentences so that there are enough reference translated sentences; when the number is greater than or equal to the first preset number, the first preset number of candidates with the highest confidence are selected, which satisfies both the quantity requirement and the confidence requirement.
It should be noted that the first preset number should be neither too small nor too large. If it is too small, not enough reference translated sentences are provided, the reference translated sentence most similar to the sample translated sentence may not be found later, and the accuracy of the sample post-editing sentence suffers. If it is too large, the amount of computation easily becomes excessive, which lowers the efficiency of model training, and candidate machine-translated sentences with low accuracy are more likely to be used as reference translated sentences. For this reason, in this embodiment, the first preset number is 5 to 20. For example, the first preset number is 10 or 15.
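The selection rule governed by the first number threshold condition can be sketched as below. The function name, the dictionary input mapping each qualifying candidate sentence to a confidence value, and the default of 10 (taken from the 5 to 20 range given in the text) are assumptions for illustration.

```python
def select_references(confidence: dict[str, float],
                      first_preset_number: int = 10) -> list[str]:
    """Choose reference sentences from candidates that passed the first screening.

    `confidence` maps each qualifying candidate sentence to its confidence.
    """
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    if len(ranked) < first_preset_number:
        # Threshold condition not met: keep every qualifying candidate.
        return ranked
    # Threshold condition met: keep the top `first_preset_number` by confidence.
    return ranked[:first_preset_number]
```

Both branches reduce to `ranked[:first_preset_number]`; they are written out separately only to mirror the two cases described in the text.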
It should be noted that any combination of the above preset modes may be selected according to actual conditions, so that after the first screening process is performed, the number of candidate machine translation statements meeting the first preset condition is sufficient, and further, a sufficient number of reference translation statements with higher confidence are obtained.
In this embodiment, after the reference translation set corresponding to the sample original sentence is obtained and before the sample post-editing sentence is selected from the reference translation set, the method further includes: performing synonymous expansion on the reference translated sentences to obtain synonymous sentences of the reference translated sentences; and adding the synonymous sentences to the reference translation set as new reference translated sentences.
Here, a synonymous sentence is a sentence that expresses the same meaning as a reference translated sentence in a different way; that is, the reference translated sentence is converted into a different expression without changing its semantics. It will be understood that the synonymous sentences and the reference translated sentences are in the same language. Obtaining more synonymous expressions increases the number of reference translated sentences and thus expands the reference translation set, which makes it easier to find a reference translated sentence highly similar to the sample translated sentence later and further improves the accuracy of the sample post-editing sentence.
Specifically, the synonymous expansion includes: obtaining a candidate synonymous sentence set for each reference translated sentence, where each candidate synonymous sentence set includes one or more candidate synonymous sentences; removing, from the plurality of candidate synonymous sentence sets, the candidate synonymous sentences that are identical to any reference translated sentence; then performing a second screening process on the remaining candidate synonymous sentences and selecting a plurality of candidate synonymous sentences that satisfy a second preset condition, where the second preset condition includes a confidence requirement; and obtaining the synonymous sentences from the candidate synonymous sentences that satisfy the second preset condition.
The candidate synonymous sentence set of one reference translated sentence may contain a candidate that is identical to another reference translated sentence. Because synonymous expansion is meant to convert a reference translated sentence into a different expression, candidates identical to any reference translated sentence must be removed first. In addition, performing the second screening process on the remaining candidates ensures that the selected synonymous sentences have high accuracy.
Specifically, performing the second screening process and selecting the candidate synonymous sentences that satisfy the second preset condition includes: selecting, from the plurality of candidate synonymous sentence sets, the candidate synonymous sentences that are repeated. That is, confidence is characterized by the repetition rate: when a candidate synonymous sentence appears repeatedly across the candidate synonymous sentence sets, this indicates that it has higher confidence.
In this embodiment, acquiring the synonymous sentences from the plurality of candidate synonymous sentences that meet the second preset condition includes: judging whether the number of candidate synonymous sentences that meet the second preset condition meets a second number threshold condition, wherein the second number threshold condition is that the number of candidate synonymous sentences meeting the second preset condition is greater than or equal to a second preset number; when the second number threshold condition is not met, selecting all the candidate synonymous sentences that meet the second preset condition as the synonymous sentences; and when the second number threshold condition is met, selecting, from the candidate synonymous sentences that meet the second preset condition, the second preset number of candidate synonymous sentences with the highest repetition rates as the synonymous sentences.
Checking the number of candidate synonymous sentences that meet the second preset condition against the second number threshold condition ensures that a sufficient number of synonymous sentences is obtained while the selected synonymous sentences retain a high confidence level. This improves the accuracy of the reference translation sentences, so that the reference translation sentence with the highest similarity to the sample translation sentence can then be selected as the sample post-editing sentence. Accordingly, when the number of candidate synonymous sentences meeting the second preset condition is less than the second preset number, all of them are selected as synonymous sentences so that enough synonymous sentences are available; when the number is greater than or equal to the second preset number, the second preset number of candidates with the highest confidence levels are selected as synonymous sentences, so that the requirements on both number and confidence level are met.
It should be noted that the second preset number should be neither too small nor too large. If it is too small, too few synonymous sentences are provided, and the synonymous expansion processing of the reference translation sentences has little effect. If it is too large, on the one hand the amount of computation tends to become excessive, which lowers the efficiency of model training; on the other hand, candidate synonymous sentences with low accuracy are more likely to be taken as synonymous sentences. For this reason, in this embodiment, the second preset number is 5 to 20, for example 10 or 15.
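The selection of synonymous sentences described above can be illustrated with a minimal Python sketch. It is not the implementation of this embodiment: the function name `select_synonyms`, the parameter `max_count` (standing for the second preset number), and the reading of "repetition rate" as the number of candidate sets in which a sentence occurs, with "appears repeatedly" meaning at least two sets, are assumptions made for illustration.

```python
from collections import Counter


def select_synonyms(reference_sentences, candidate_sets, max_count=10):
    """Select synonymous sentences from per-reference candidate sets.

    candidate_sets holds one list of candidate synonymous sentences per
    reference translation sentence. max_count plays the role of the
    second preset number (the text suggests 5 to 20).
    """
    references = set(reference_sentences)

    # Assumption: the repetition rate of a candidate is the number of
    # candidate sets in which it occurs.
    counts = Counter()
    for candidates in candidate_sets:
        for sentence in set(candidates):
            # Drop candidates identical to any reference translation sentence.
            if sentence not in references:
                counts[sentence] += 1

    # Second screening: keep only candidates that appear repeatedly
    # (assumed here to mean in at least two candidate sets).
    repeated = [(s, c) for s, c in counts.items() if c >= 2]

    # Highest repetition rate first; ties broken alphabetically so the
    # result is deterministic.
    repeated.sort(key=lambda item: (-item[1], item[0]))

    # Slicing covers both branches of the threshold condition: fewer than
    # max_count candidates are all kept, otherwise only the top max_count.
    return [sentence for sentence, _ in repeated[:max_count]]
```

With this sketch, a candidate that occurs in only one candidate set is never selected, and the cap `max_count` only takes effect when more repeated candidates exist than the second preset number allows.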
In this embodiment, the synonymous sentences of each reference translation sentence are acquired by a synonymous transcription system. The synonymous transcription system has a transcription model that, given an input sentence, outputs sentences with the same or a similar meaning. Using the synonymous transcription system improves the efficiency of obtaining the synonymous sentences.
Continuing to refer to fig. 1, step S4 is executed, and the reference translation sentence with the highest similarity to the sample translation sentence is selected from the reference translation set as a sample post-editing sentence.
The reference translation statement with the highest similarity is selected as the sample post-editing statement, so that the accuracy of the sample post-editing statement is improved.
In this embodiment, the reference translation sentence with the minimum edit distance from the sample translation sentence is selected as the sample post-editing sentence. The smaller the edit distance, the higher the similarity between the two sentences. The edit distance quantifies the similarity between the sample translation sentence and each reference translation sentence, so that the reference translation sentence with the highest similarity can easily be selected from the plurality of reference translation sentences.
Specifically, the edit distance between the sample translation sentence and each reference translation sentence is obtained; after the plurality of edit distances are obtained, the reference translation sentence corresponding to the minimum edit distance is selected as the sample post-editing sentence. The edit distance is the minimum number of operations needed to make the sample translation sentence identical to the reference translation sentence, where one operation is inserting a word, deleting a word, or replacing a word.
For example, if the sample translation sentence is "I lik cats because they are very elegant.", the sample post-editing sentence selected from the reference translation sentences is "I like cats because they are very elegant.". Only one operation is needed: replacing "lik" with "like".
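Step S4 can be sketched in Python as follows. This is an illustrative sketch only: the function names `word_edit_distance` and `pick_post_edit` are hypothetical, whitespace tokenization is assumed, and ties between equally distant reference translation sentences are resolved in favor of the first one, which the text does not specify.

```python
def word_edit_distance(sample_translation, reference):
    """Word-level edit distance: inserting, deleting, or replacing
    one word each counts as one operation."""
    hyp = sample_translation.split()
    ref = reference.split()

    # prev[j] is the distance between the processed prefix of hyp and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # keep or replace a word
        prev = curr
    return prev[-1]


def pick_post_edit(sample_translation, reference_set):
    """Return the reference translation sentence with the minimum edit
    distance to the sample translation sentence."""
    return min(reference_set,
               key=lambda ref: word_edit_distance(sample_translation, ref))
```

For the toy input `pick_post_edit("I lik tea", ["I enjoy drinking tea", "I like tea"])`, the second reference is one replacement away and the first is two operations away, so the second is returned.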
With continued reference to FIG. 1, step S5 is executed: the sample post-editing sentence and the sample translation sentence are used to obtain a sample translation quality label for each word in the sample original sentence.
In training the translation evaluation model, the sample translation quality labels are also used as part of the training set.
Because the sample post-editing sentence is obtained from the plurality of reference translation sentences of the sample original sentence, it can be used to generate the sample translation quality labels of the words in the sample original sentence.
In this embodiment, obtaining the sample translation quality tags of each vocabulary in the sample original sentence by using the post-sample editing sentence and the sample translation sentence includes: matching degree detection is carried out on the sample translation sentences and the sample post-editing sentences, and confidence labels are added to all words in the sample post-editing sentences according to the matching degree detection result, wherein the confidence labels are used for indicating that the translation quality is qualified or the translation quality is unqualified; and aligning words in the sample original sentence and the words in the sample post-editing sentence, adding the confidence label to the corresponding words in the sample original sentence, adding a confidence label for representing that the translation quality is qualified to the words without the corresponding relation in the sample original sentence, and taking the confidence label added to the sample original sentence as a sample translation quality label.
Adding confidence labels to the words through matching degree detection makes it possible to judge accurately whether the translation quality of each word is qualified, thereby realizing translation evaluation at the word level. For example, the confidence label "OK" is used to indicate that the translation quality is qualified, and the confidence label "BAD" is used to indicate that the translation quality is unqualified.
Specifically, the matching degree detection of the sample translation statement and the sample post-editing statement includes: acquiring a corresponding relation between a sample translation statement and a vocabulary in a sample post-editing statement; and detecting the matching degree of the words with the corresponding relation.
In this embodiment, the minimum edit distance principle is used to obtain the correspondence between the words in the sample translation sentence and the words in the sample post-editing sentence. For example, if the sample translation sentence is "I lik cats because they are very elegant." and the sample post-editing sentence selected from the plurality of reference translation sentences is "I like cats because they are very elegant.", then according to the minimum edit distance principle only one operation is needed, namely replacing "lik" with "like", so "lik" and "like" have a correspondence.
Correspondingly, in the step of adding the confidence labels to the words in the sample post-editing sentence, the confidence labels corresponding to the words (including punctuation marks) in the sample post-editing sentence are: OK BAD OK OK OK OK OK OK OK OK OK OK. It should be noted that a punctuation mark can be regarded as a special word.
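The matching degree detection can be sketched as an edit-distance alignment followed by tagging. The sketch below is an assumption-laden illustration rather than the method of this embodiment: the function name `tag_post_edit` is hypothetical, tokens are assumed to be separated by whitespace, and a word of the sample post-editing sentence that has no counterpart in the sample translation sentence is tagged "BAD", a case the text does not address.

```python
def tag_post_edit(sample_translation, post_edit):
    """Return one "OK"/"BAD" confidence label per word of post_edit."""
    hyp = sample_translation.split()
    ref = post_edit.split()
    n, m = len(hyp), len(ref)

    # dist[i][j]: word-level edit distance between hyp[:i] and ref[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Walk back along one minimum-cost path to recover which words correspond.
    tags = ["BAD"] * m
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if hyp[i - 1] == ref[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost == 0:
                tags[j - 1] = "OK"  # corresponding words are identical
            i, j = i - 1, j - 1     # otherwise a replacement: stays "BAD"
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1                  # extra word in the sample translation
        else:
            j -= 1                  # post-edit word with no counterpart
    return tags
```

For the toy input `tag_post_edit("I lik tea .", "I like tea .")` the only replaced word is the second one, so the labels are OK BAD OK OK.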
Word alignment is a natural language processing technique for identifying the correspondence between words in two languages: when a set of mutually translated sentences is input, the word alignment is generated automatically to obtain the word correspondences. A common notation is i → j, which indicates that the target word at position i corresponds to the source word at position j. Here, the target words are the words of the sample post-editing sentence, and the source words are the words of the sample original sentence.
In the sample original sentence, some words need not be translated literally. In this case, word alignment may leave words in the sample original sentence without any correspondence, and this lack of correspondence is not caused by poor translation quality. Therefore, a confidence label indicating that the translation quality is qualified is added to each word in the sample original sentence that has no correspondence. For example, the confidence label "OK" is added to such words.
If a word in the sample original sentence corresponds to a word in the sample post-editing sentence, the word in the sample original sentence is given the same confidence label as the corresponding word in the sample post-editing sentence. For example, if the confidence label of a word in the sample post-editing sentence is "OK", the corresponding word in the sample original sentence is also given the confidence label "OK"; similarly, if the confidence label of a word in the sample post-editing sentence is "BAD", the corresponding word in the sample original sentence is also given the confidence label "BAD".
For example, the sample original sentence is "i like cats because they are elegant. ", the sample translation sentence is "I lik cats because they are very elegant.", and the sample post-editing sentence is "I like cats because they are very elegant.". After the matching degree detection described above, the confidence labels corresponding to the words (including punctuation marks) in the sample post-editing sentence are OK BAD OK; correspondingly, the correspondences between the words in the sample original sentence and the words in the sample post-editing sentence are as follows: I → I, like → like, cat → cats, because → because, they → they, very → very, elegant → elegant, while the remaining words have no correspondence in the sample original sentence. The confidence labels of the words (including punctuation marks) in the sample original sentence are therefore: OK BAD OK OK OK OK OK OK OK OK OK OK OK.
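The projection of confidence labels onto the sample original sentence can be sketched as below. The function name `project_tags` and the data layout are hypothetical; the alignment itself is assumed to come from an external word alignment tool and is passed in as (i, j) pairs following the i → j notation. The text does not say what happens when several post-editing words align to the same source word; the sketch assumes that one "BAD" label is enough to mark the source word "BAD".

```python
def project_tags(num_source_words, post_edit_tags, alignment):
    """Project "OK"/"BAD" labels from the post-editing sentence onto
    the words of the sample original sentence.

    alignment is an iterable of (i, j) pairs meaning that the
    post-editing (target) word at position i corresponds to the
    source word at position j; positions are 0-based here.
    """
    # Source words without any correspondence default to "OK".
    source_tags = ["OK"] * num_source_words
    for i, j in alignment:
        # Assumption: a single aligned "BAD" word marks the source word "BAD".
        if post_edit_tags[i] == "BAD":
            source_tags[j] = "BAD"
    return source_tags
```

For instance, with post-editing labels `["OK", "BAD", "OK", "OK"]`, a five-word source sentence and the alignment `[(0, 0), (1, 1), (2, 3)]`, only the source word at position 1 receives "BAD"; the unaligned source words at positions 2 and 4 keep the default "OK".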
It should be noted that the confidence labels are not limited to "OK" and "BAD". In other embodiments, other labels may be used, for example the number "1" for qualified translation quality and the number "0" for unqualified translation quality.
Continuing to refer to fig. 1, step S6 is executed to establish a training set according to the sample translation quality labels, where the training set includes the sample original sentences, the sample translated sentences and the sample translation quality labels, and the training set is used to train the translation evaluation model.
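One possible shape for the training set of step S6 is sketched below. The record type `TrainingExample`, its field names, and the function `build_training_set` are hypothetical, and the sample original sentence is assumed to be already split into words so that each word can carry exactly one sample translation quality label.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingExample:
    source_words: List[str]   # words of the sample original sentence
    translation: str          # sample translation sentence
    quality_tags: List[str]   # one label per source word, e.g. "OK"/"BAD"


def build_training_set(rows):
    """Build training examples from (source_words, translation, tags) rows."""
    training_set = []
    for source_words, translation, tags in rows:
        # Every word of the sample original sentence needs exactly one label.
        if len(source_words) != len(tags):
            raise ValueError("each source word needs exactly one quality tag")
        training_set.append(
            TrainingExample(list(source_words), translation, list(tags)))
    return training_set
```

The length check is a design choice of this sketch: it rejects rows whose labels do not line up with the source words before they reach model training.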
Based on the above description, even in the case of no published post-editing data, the present embodiment can generate the sample post-editing sentences by using a plurality of reference translated sentences of the original sentences, thereby generating sample translation quality labels of each vocabulary in the sample original sentences by using the sample post-editing sentences, and further establishing the training set for training the translation evaluation model, thereby obtaining the training data for training the translation evaluation model and ensuring the training efficiency of the translation evaluation model.
Correspondingly, an embodiment of the present invention further provides a device for generating translation evaluation training data. FIG. 2 is a functional block diagram of an embodiment of the device for generating translation evaluation training data according to the present invention.
The generation device of the translation evaluation training data comprises: a first sentence obtaining module 10, configured to obtain one or more sample original sentences to be translated; a sample translation statement acquisition module 20, configured to acquire a sample translation statement corresponding to a sample original statement, where the sample translation statement is obtained by translating the sample original statement; a reference translation set obtaining module 30, configured to obtain a reference translation set corresponding to the sample original sentence, where the reference translation set includes a plurality of reference translation sentences; the second sentence acquisition module 40 is configured to select, from the reference translation set, a reference translation sentence with the highest similarity to the sample translation sentence as a sample post-editing sentence; the label obtaining module 50 is configured to obtain sample translation quality labels of each vocabulary in the sample original sentence by using the sample post-edit sentence and the sample translation sentence; and a training set establishing module 60, configured to establish a training set according to the sample translation quality label, where the training set includes a sample original sentence, a sample translation sentence, and a sample translation quality label, and the training set is used for training a translation evaluation model.
The generation device of the translation evaluation training data comprises a second sentence acquisition module 40 and a label acquisition module 50, so that a sample post-editing sentence is generated by using a plurality of reference translation sentences of an original sentence, and a sample translation quality label of each vocabulary in the sample original sentence is generated by using the sample post-editing sentence for training the translation evaluation model.
The sample original sentence is a sentence that needs to be translated, i.e., a source-language sentence. In training the translation evaluation model, the sample original sentences are used as part of the training set. For a detailed description of the sample original sentence, reference may be made to the corresponding description in the foregoing embodiment, and details are not repeated here.
The sample translation sentence is obtained by translating the sample original sentence. The sample translation sentences correspond to the sample original sentences, and in training the translation evaluation model, the sample translation sentences are also used as part of the training set. In addition, the sample translation sentence is used as an input of the label obtaining module 50, which obtains the sample translation quality label of each word in the sample original sentence.
In this embodiment, the sample translation statement is obtained through machine translation or manual translation, that is, the sample translation statement may be a result of manual translation or a result of machine translation.
The reference translation set obtaining module 30 is configured to obtain a plurality of reference translation sentences corresponding to the sample original sentence. The reference translation sentence refers to a sentence used for comparison with the sample translation sentence, and the reference translation sentence is generally a translation with higher translation quality, that is, the reference translation sentence meets the requirement of confidence.
The sample original sentences correspond to the reference translation sets in a one-to-one mode, the reference translation sets comprise a plurality of reference translation sentences, the reference translation sentences in the same reference translation set are used as candidate sample post-editing sentences corresponding to the sample original sentences, and therefore proper reference translation sentences are selected from the reference translation sentences to serve as sample post-editing sentences.
In the translation evaluation task of the vocabulary level, the training of the translation evaluation model depends on post-editing sentences, but on some translation evaluation tasks, the situation that post-editing data is not disclosed exists, so that the translation evaluation model cannot be trained.
It should be noted that each reference translation set includes a plurality of reference translation sentences. The larger number of reference translation sentences makes it possible to subsequently obtain a more accurate sample post-editing sentence, which in turn improves the evaluation accuracy of the translation evaluation model trained later.
It is understood that when a sentence in a source language is translated into a sentence in another specified target language, a word in the source language sentence will usually have a plurality of translation results in the target language, so that a plurality of reference translated sentences can be obtained based on the sample original sentence.
In this embodiment, the reference translation statement in the reference translation set is one or both of a machine translation result and a manual translation result.
Taking the case where the reference translation sentences are manual translation results as an example, in some translation-question scenarios the meta information of the questions is usually stored, that is, the reference translation sentences are stored in advance, and the stored reference translation sentences can be used directly. It is understood that the case where the reference translation sentences are manual translation results is not limited to translation-question scenarios. It should be noted that using manual translation results as the reference translation sentences helps improve the accuracy of the reference translation sentences.
When the reference translation sentences are machine translation results, the reference translation set obtaining module 30 includes: a candidate machine translation set acquisition unit, configured to acquire a plurality of candidate machine translation sets from different machine translation tools, each candidate machine translation set including one or more candidate machine translation sentences; a first screening unit, configured to perform first screening processing on the candidate machine translation sets and select a plurality of candidate machine translation sentences that meet a first preset condition, wherein the first preset condition includes a confidence level; and a reference translation set acquisition unit, configured to acquire reference translation sentences from the plurality of candidate machine translation sentences that meet the first preset condition, the reference translation sentences constituting the reference translation set.
Under the condition that the reference translation statement is a machine translation result, the reference translation set acquisition module can automatically acquire the reference translation set, and the efficiency of acquiring the reference translation set is improved.
In this embodiment, the machine translation tools correspond to the candidate machine translation sets one to one, and therefore, when a plurality of machine translation tools are used, the number of candidate machine translation sets is correspondingly multiple.
In this embodiment, in order to improve the accuracy of the sample post-editing sentence, the first preset condition includes a confidence level; that is, candidate machine translation sentences with a higher confidence level are selected as reference translation sentences, which improves the evaluation accuracy of the translation evaluation model trained later. The higher the confidence level, the higher the translation quality.
For this purpose, the first screening unit performs a first screening process on a plurality of candidate machine translation sets in one or more preset manners. In this embodiment, the preset manner includes: selecting a common candidate machine translation statement from a plurality of candidate machine translation sets as a reference translation statement; or selecting candidate machine translation sentences from machine translation tools meeting the accuracy requirement as reference translation sentences; or obtaining the average value of the confidence scores of all the vocabularies in the candidate machine translation sentences by using the confidence score of each vocabulary output by the machine translation tool, wherein the average value is used as the translation result score, and selecting the candidate machine translation sentences with the translation result scores larger than or equal to the preset score threshold value as the reference translation sentences.
When a plurality of candidate machine translation sets share a common candidate machine translation sentence, this indicates that the common candidate machine translation sentence is highly accurate.
A machine translation tool that meets the accuracy requirement is a machine translation tool with high reliability, so selecting candidate machine translation sentences from such a tool also helps ensure the accuracy of the selected candidate machine translation sentences. For example, machine translation tools that meet the accuracy requirement include the Google translation system.
The confidence score for each vocabulary output by the machine translation tool facilitates a more direct assessment of the accuracy of the candidate machine translated sentences. It should be noted that, in the actual operation process, the preset score threshold may be set according to the actual requirement.
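Two of the preset manners of the first screening processing can be sketched in Python as follows. The function names `common_candidates` and `filter_by_average_score` are hypothetical, "common" is assumed to mean that a sentence occurs in every candidate machine translation set, and the per-word confidence scores are assumed to be supplied by the machine translation tool as a list of numbers; none of these details are fixed by the text.

```python
def common_candidates(candidate_sets):
    """Candidate sentences that occur in every candidate machine
    translation set (assumed meaning of a "common" candidate)."""
    if not candidate_sets:
        return set()
    common = set(candidate_sets[0])
    for candidates in candidate_sets[1:]:
        common &= set(candidates)
    return common


def filter_by_average_score(scored_candidates, score_threshold):
    """Keep candidates whose mean per-word confidence score reaches
    the preset score threshold.

    scored_candidates is a list of (sentence, word_scores) pairs,
    where word_scores holds one confidence score per word.
    """
    selected = []
    for sentence, word_scores in scored_candidates:
        if not word_scores:
            continue  # no scores available, so no translation result score
        translation_score = sum(word_scores) / len(word_scores)
        if translation_score >= score_threshold:
            selected.append(sentence)
    return selected
```

The two functions can be used separately or combined, for example by taking the union of their results when one manner alone yields too few candidate machine translation sentences.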
In this embodiment, the reference translation set acquisition unit includes: a first determining subunit, configured to determine whether the number of candidate machine translation sentences that meet the first preset condition meets a first number threshold condition, wherein the first number threshold condition is that the number of candidate machine translation sentences meeting the first preset condition is greater than or equal to a first preset number; and a first selecting subunit, configured to select all the candidate machine translation sentences that meet the first preset condition as reference translation sentences when the first number threshold condition is not met, and to select, from the candidate machine translation sentences that meet the first preset condition, the first preset number of candidate machine translation sentences with the highest confidence levels as reference translation sentences when the first number threshold condition is met.
In this embodiment, the first preset number is 5 to 20.
It should be noted that, according to the actual situation, the first screening unit may further select any combination of the preset manners, so that after the first screening processing is performed, the number of candidate machine translation statements meeting the first preset condition is sufficient, and further, a sufficient number of reference translation statements with higher confidence are obtained.
The device for generating translation evaluation training data further includes a synonymous expansion module 35 arranged between the reference translation set obtaining module 30 and the second sentence acquisition module 40. The synonymous expansion module 35 is configured to perform synonymous expansion processing on the reference translation sentences to obtain synonymous sentences of the reference translation sentences, and to add the synonymous sentences to the reference translation set as newly added reference translation sentences.
Obtaining more synonymous expressions increases the number of reference translation sentences and thus expands the reference translation set, which makes it easier to subsequently find a reference translation sentence that is highly similar to the sample translation sentence and further improves the accuracy of the sample post-editing sentence.
In this embodiment, the synonymous expansion module 35 includes: a candidate synonymous sentence set acquisition unit, configured to acquire a candidate synonymous sentence set for each reference translation sentence, each candidate synonymous sentence set including one or more candidate synonymous sentences; a second screening unit, configured to remove, from the plurality of candidate synonymous sentence sets, any candidate synonymous sentence identical to any reference translation sentence; a third screening unit, configured to perform, after the removal, second screening processing on the remaining candidate synonymous sentences and select a plurality of candidate synonymous sentences that meet a second preset condition, wherein the second preset condition includes a confidence level; and a synonymous sentence obtaining unit, configured to obtain the synonymous sentences from the plurality of candidate synonymous sentences that meet the second preset condition.
Specifically, the third screening unit is configured to select, from the plurality of candidate synonymous sentence sets, the candidate synonymous sentences that appear repeatedly. That is, the confidence level is characterized by the repetition rate.
In this embodiment, the third screening unit includes: a first judging subunit, configured to judge whether the number of candidate synonymous sentences that meet the second preset condition meets a second number threshold condition, wherein the second number threshold condition is that the number of candidate synonymous sentences meeting the second preset condition is greater than or equal to a second preset number; and a first selecting subunit, configured to select all the candidate synonymous sentences that meet the second preset condition as synonymous sentences when the second number threshold condition is not met, and to select, from the candidate synonymous sentences that meet the second preset condition, the second preset number of candidate synonymous sentences with the highest repetition rates as synonymous sentences when the second number threshold condition is met.
In this embodiment, the second predetermined number is 5 to 20.
In this embodiment, the candidate synonym sentence set obtaining unit is a synonym transcription system. Wherein the synonymy transcription system has a transcription model for outputting sentences of the same meaning or similar meaning after obtaining input sentences. Through the synonymy transcription system, the efficiency of obtaining the candidate synonymy sentences is improved.
The reference translation sentence with the highest similarity is selected as the sample post-editing sentence through the second sentence acquisition module 40, so that the accuracy of the sample post-editing sentence is improved.
In this embodiment, the second sentence acquisition module 40 is configured to select the reference translation sentence with the minimum edit distance from the sample translation sentence as the sample post-editing sentence. The smaller the edit distance, the higher the similarity between the two sentences. The edit distance quantifies the similarity between the sample translation sentence and each reference translation sentence, so that the reference translation sentence with the highest similarity can easily be selected from the plurality of reference translation sentences.
Specifically, the second sentence acquisition module 40 includes: an edit distance acquisition unit, configured to acquire the edit distance between the sample translation sentence and each reference translation sentence; and a fourth screening unit, configured to select the reference translation sentence corresponding to the minimum edit distance as the sample post-editing sentence. Here, the edit distance is the minimum number of operations needed to make the sample translation sentence identical to the reference translation sentence, where one operation is inserting a word, deleting a word, or replacing a word.
The label obtaining module 50 is configured to obtain a sample translation quality label of each vocabulary in the sample original sentence by using the sample post-edit sentence and the sample translation sentence, where the translation quality label is also used as a part of a training set in the process of training the translation evaluation model.
Because the sample post-editing sentence is obtained from the plurality of reference translation sentences of the sample original sentence, it can be used to generate the sample translation quality labels of the words in the sample original sentence.
In this embodiment, the tag obtaining module 50 includes: the matching degree detection unit is used for detecting the matching degree of the sample translation sentences and the sample post-editing sentences, and adding confidence labels to all the words in the sample post-editing sentences according to the matching degree detection result, wherein the confidence labels are used for indicating that the translation quality is qualified or the translation quality is unqualified; and the labeling unit is used for aligning words in the sample original sentence and the words in the sample post-editing sentence, adding the confidence label to the corresponding word in the sample original sentence, adding the confidence label for indicating that the translation quality is qualified to the word without the corresponding relation in the sample original sentence, and taking the confidence label added to the sample original sentence as the sample translation quality label.
Confidence labels are added to the words by performing matching degree detection, so that whether the translation quality of each word is qualified can be accurately judged, thereby realizing translation evaluation at the word level. For example, the confidence label "OK" is used to indicate that the translation quality is qualified, and the confidence label "BAD" is used to indicate that the translation quality is unqualified.
In this embodiment, the matching degree detection unit is configured to obtain the correspondence between the words in the sample translation sentence and the words in the sample post-editing sentence, and to perform matching degree detection on the words having a correspondence. Specifically, the matching degree detection unit obtains this correspondence by using the minimum edit distance principle. That is, between the sample translation sentence and the sample post-editing sentence, the words paired under the minimum edit distance have a correspondence.
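As a rough illustration of matching degree detection under the minimum edit distance principle, the sketch below aligns the words of the two sentences by backtracking through the edit distance table and then labels each word of the sample post-editing sentence. Several points are assumptions made only for this sketch: the function name `tag_post_edit` is hypothetical, a word is treated as matching only when it is identical to its counterpart, and a word of the post-editing sentence with no counterpart is labeled "BAD".

```python
from typing import List


def tag_post_edit(mt: List[str], pe: List[str]) -> List[str]:
    """Label each post-editing word "OK" or "BAD" via a minimum edit distance alignment."""
    n, m = len(mt), len(pe)
    # dist[i][j] is the word-level edit distance between mt[:i] and pe[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if mt[i - 1] == pe[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a translation word
                             dist[i][j - 1] + 1,         # insert a post-editing word
                             dist[i - 1][j - 1] + cost)  # keep or replace

    # Assumption: words that are replaced or inserted stay "BAD".
    tags = ["BAD"] * m
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if mt[i - 1] == pe[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost == 0:
                tags[j - 1] = "OK"  # identical aligned words
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # translation word with no counterpart
        else:
            j -= 1  # post-editing word with no counterpart
    return tags


if __name__ == "__main__":
    print(tag_post_edit("I like apple".split(), "I like apples".split()))
```

When several alignments share the same minimum edit distance, the backtracking order above picks one of them; a different tie-breaking order could yield different labels.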
Word alignment is a natural language processing technique used to identify the correspondence between words in two languages; that is, given a pair of mutually translated sentences as input, word alignment automatically produces the correspondence between their words. A common representation is i → j, which indicates that the target word at position i corresponds to the source word at position j. Here, the target words are the words of the sample post-editing sentence, and the source words are the words of the sample original sentence.
In the sample original sentence, some words do not need to be translated literally. In this case, word alignment may leave words in the sample original sentence without a correspondence. Because this lack of correspondence is not caused by poor translation quality, a confidence label indicating that the translation quality is qualified is added to the words without a correspondence in the sample original sentence. For example, the confidence label "OK" is added to words without a correspondence in the sample original sentence.
If a word in the sample original sentence has a correspondence with a word in the sample post-editing sentence, the word in the sample original sentence is given the same confidence label as the corresponding word in the sample post-editing sentence. For example, if the confidence label of a word in the sample post-editing sentence is "OK", the corresponding word in the sample original sentence is also given the confidence label "OK"; similarly, if the confidence label of a word in the sample post-editing sentence is "BAD", the corresponding word in the sample original sentence is also given the confidence label "BAD".
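The labeling rules above can be sketched as a projection of labels through the word alignment. The function name `project_labels` is hypothetical, the alignment is assumed to be given as zero-based (i, j) pairs meaning that the target word at position i corresponds to the source word at position j, and the rule that "BAD" takes precedence when one source word corresponds to several target words is an assumption, since that case is not addressed in the description.

```python
from typing import Iterable, List, Tuple


def project_labels(num_source_words: int,
                   pe_tags: List[str],
                   alignment: Iterable[Tuple[int, int]]) -> List[str]:
    """Copy confidence labels from post-editing words onto source words.

    alignment holds (i, j) pairs: target (post-editing) position i
    corresponds to source position j, both zero-based.
    """
    # Source words without any correspondence keep the qualified label "OK".
    labels = ["OK"] * num_source_words
    for tgt_i, src_j in alignment:
        # Assumption: if any corresponding target word is "BAD", the source word is "BAD".
        if pe_tags[tgt_i] == "BAD":
            labels[src_j] = "BAD"
    return labels


if __name__ == "__main__":
    # Source position 2 has no correspondence and therefore stays "OK".
    print(project_labels(4, ["OK", "OK", "BAD"], [(0, 0), (1, 1), (2, 3)]))
```

The alignment itself would come from a word alignment tool; no particular tool is assumed here.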
It should be noted that the confidence labels are not limited to "OK" and "BAD". In other embodiments, other marks may be used, such as the number "1" for qualified translation quality and the number "0" for unqualified translation quality.
The training set creation module 60 is configured to establish a training set according to the sample translation quality labels. Based on the above description, even when no published post-editing data is available, this embodiment can generate the sample post-editing sentences by using a plurality of reference translation sentences of the original sentences, use the sample post-editing sentences to generate the sample translation quality labels of each word in the sample original sentences, and then establish the training set. Training data for the translation evaluation model is thereby obtained, and the training efficiency of the translation evaluation model is ensured.
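One possible shape for a training set entry is sketched below, assuming that each entry holds the sample original sentence (as a list of words), the sample translation sentence, and one sample translation quality label per source word. The names `TrainingExample`, `build_example`, and `to_numeric` are hypothetical and serve only as illustration; the numeric encoding follows the alternative "1"/"0" marking mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    source_words: List[str]    # words of the sample original sentence
    translation: str           # sample translation sentence
    quality_labels: List[str]  # one "OK"/"BAD" label per source word


def build_example(source_words: List[str],
                  translation: str,
                  quality_labels: List[str]) -> TrainingExample:
    """Assemble one training set entry, checking that every source word has a label."""
    if len(source_words) != len(quality_labels):
        raise ValueError("each source word needs exactly one quality label")
    return TrainingExample(source_words, translation, quality_labels)


def to_numeric(labels: List[str]) -> List[int]:
    """Encode "OK" as 1 and any other label as 0."""
    return [1 if label == "OK" else 0 for label in labels]


if __name__ == "__main__":
    example = build_example(["wo", "xihuan", "pingguo"], "I like apple", ["OK", "OK", "BAD"])
    print(example, to_numeric(example.quality_labels))
```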
The embodiment of the invention also provides a device, which can implement the method for generating translation evaluation training data provided by the embodiment of the invention by loading that method in the form of a program.
Referring to fig. 3, a hardware structure diagram of a device provided by an embodiment of the present invention is shown. The device of the embodiment comprises: at least one processor 01, at least one communication interface 02, at least one memory 03, and at least one communication bus 04.
In this embodiment, the number of the processor 01, the communication interface 02, the memory 03 and the communication bus 04 is at least one, and the processor 01, the communication interface 02 and the memory 03 complete mutual communication through the communication bus 04.
The communication interface 02 may be an interface of a communication module for performing network communication, for example, an interface of a GSM module.
The processor 01 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the method of the present embodiment.
The memory 03 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one disk memory.
The memory 03 stores one or more computer instructions, which are executed by the processor 01 to implement the method for generating the translation evaluation training data provided in the foregoing embodiments.
It should be noted that the above device may further include other components (not shown). These components are not necessary for understanding the disclosure of the embodiments of the present invention and are therefore not described individually herein.
The embodiment of the present invention further provides a storage medium, where one or more computer instructions are stored in the storage medium, and the one or more computer instructions are used to implement the method for generating the translation evaluation training data provided in the foregoing embodiment.
In the method for generating translation evaluation training data provided by this embodiment, a plurality of reference translation sentences of an original sentence are used to generate a sample post-editing sentence, and the sample post-editing sentence is used to generate a sample translation quality label of each vocabulary in the sample original sentence, so as to train a translation evaluation model.
The embodiments of the present invention described above are combinations of elements and features of the present invention. Unless otherwise mentioned, the elements or features may be considered optional. Each element or feature may be practiced without being combined with other elements or features. In addition, the embodiments of the present invention may be configured by combining some elements and/or features. The order of operations described in the embodiments of the present invention may be rearranged. Some configurations of any embodiment may be included in another embodiment, and may be replaced with corresponding configurations of the other embodiment. It is obvious to those skilled in the art that claims that do not explicitly cite each other in the appended claims may be combined to form an embodiment of the present invention, or may be included as new claims by amendment after the application is filed.
Embodiments of the invention may be implemented by various means, such as hardware, firmware, software, or a combination thereof. In a hardware configuration, the method according to an exemplary embodiment of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and the like.
In a firmware or software configuration, embodiments of the present invention may be implemented in the form of modules, procedures, functions, and the like. The software codes may be stored in memory units and executed by processors. The memory unit is located inside or outside the processor, and may transmit and receive data to and from the processor via various known means.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the embodiments of the present invention have been disclosed, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (20)

1. A method for generating translation evaluation training data is characterized by comprising the following steps:
obtaining one or more sample original sentences to be translated;
obtaining a sample translation statement corresponding to the sample original statement, wherein the sample translation statement is obtained by translating the sample original statement;
acquiring a reference translation set corresponding to the sample original sentence, wherein the reference translation set comprises a plurality of reference translation sentences;
selecting the reference translation statement with the highest similarity with the sample translation statement from the reference translation set as a sample post-editing statement;
utilizing the sample post-editing sentences and the sample translation sentences to obtain sample translation quality labels of all words in the sample original sentences;
and establishing a training set according to the sample translation quality label, wherein the training set comprises the sample original sentence, the sample translation sentence and the sample translation quality label, and the training set is used for training a translation evaluation model.
2. The method for generating translation evaluation training data according to claim 1, wherein the reference translation set corresponding to the sample original sentence is obtained by one or both of machine translation and manual translation.
3. The method for generating translation evaluation training data according to claim 1, wherein a reference translation set corresponding to the sample original sentence is obtained by a machine translation;
obtaining the reference translation set comprises:
obtaining a plurality of candidate machine translation sets from different machine translation tools, each of the candidate machine translation sets comprising one or more candidate machine translation statements;
performing first screening processing on the candidate machine translation sets, and selecting a plurality of candidate machine translation statements meeting a first preset condition, wherein the first preset condition comprises a confidence coefficient;
and acquiring a reference translation statement from a plurality of candidate machine translation statements meeting the first preset condition, wherein the reference translation statement forms a reference translation set.
4. The method for generating translation evaluation training data according to claim 3, wherein the first screening process is performed on the candidate machine translation sets by using one or more preset manners, where the preset manners include:
selecting a common candidate machine translation statement from the plurality of candidate machine translation sets;
or selecting the candidate machine translation sentences from machine translation tools meeting the accuracy requirement;
or, obtaining an average value of the confidence scores of all the vocabularies in the candidate machine translation sentence by using the confidence score of each vocabulary output by the machine translation tool, wherein the average value is used as a translation result score; and selecting the candidate machine translation sentences of which the translation result scores are greater than or equal to a preset score threshold.
5. The method for generating translation evaluation training data according to claim 3, wherein the obtaining of the reference translated sentence from the plurality of candidate machine translated sentences satisfying the first preset condition comprises: judging whether the number of candidate machine translation sentences meeting a first preset condition meets a first number threshold condition, wherein the first number threshold condition comprises the following steps: the number of the candidate machine translation sentences meeting the first preset condition is greater than or equal to a first preset number;
under the condition that the number of the candidate machine translation sentences meeting the first preset condition does not meet the first number threshold condition, selecting all the candidate machine translation sentences meeting the first preset condition as reference translation sentences;
under the condition that the number of the candidate machine translation sentences meeting the first preset condition meets the first number threshold condition, selecting, from the candidate machine translation sentences meeting the first preset condition, the first preset number of candidate machine translation sentences with the highest confidence as reference translation sentences.
6. The method for generating translation evaluation training data according to claim 5, wherein the first predetermined number is 5 to 20.
7. The method for generating translation evaluation training data according to claim 1, wherein after obtaining a reference translation set corresponding to the sample original sentence, before selecting the reference translation sentence with the highest similarity to the sample translation sentence as a sample post-editing sentence from the reference translation set, the method further comprises: and performing synonymy expansion processing on the reference translation sentences to obtain synonymy sentences of the reference translation sentences, taking the synonymy sentences as newly added reference translation sentences, and adding the newly added reference translation sentences into the reference translation set.
8. The method for generating translation evaluation training data according to claim 7, wherein performing synonymous expansion processing on the reference translation statement, and acquiring the synonymous statement of the reference translation statement comprises:
obtaining a candidate synonym sentence set of each reference translation sentence, wherein each candidate synonym sentence set comprises one or more candidate synonym sentences;
removing candidate synonymous sentences identical to any of the reference translation sentences from the plurality of candidate synonymous sentence sets;
after removing the candidate synonymous sentences the same as any reference translation sentence, performing second screening processing on the remaining candidate synonymous sentences in the candidate synonymous sentence set, and selecting a plurality of candidate synonymous sentences meeting a second preset condition, wherein the second preset condition comprises a confidence coefficient;
and obtaining the synonym sentences from the candidate synonym sentences meeting the second preset condition.
9. The method for generating translation evaluation training data according to claim 8, wherein performing a second screening process on remaining candidate synonyms in the candidate synonym set, and selecting a plurality of candidate synonyms that satisfy a second preset condition comprises: and selecting candidate synonyms with repetition rates from the multiple candidate synonym sets.
10. The method for generating translation evaluation training data according to claim 8, wherein obtaining synonymous sentences from the plurality of candidate synonymous sentences satisfying the second preset condition comprises:
judging whether the number of candidate synonymous sentences meeting a second preset condition meets a second number threshold condition, wherein the second number threshold condition comprises the following steps: the number of the candidate synonyms meeting the second preset condition is greater than or equal to a second preset number;
under the condition that the number of the candidate synonyms meeting the second preset condition does not meet the second number threshold condition, selecting all the candidate synonyms meeting the second preset condition as synonyms;
and under the condition that the number of the candidate synonymous sentences meeting the second preset condition meets the second number threshold condition, selecting, from the candidate synonymous sentences meeting the second preset condition, the second preset number of candidate synonymous sentences with the highest repetition rate as the synonymous sentences.
11. The method for generating translation evaluation training data according to claim 10, wherein the second predetermined number is 5 to 20.
12. The method for generating translation evaluation training data according to claim 8, wherein the synonymous sentences of the reference translation sentence are obtained by a synonymous transcription system.
13. The method for generating translation evaluation training data according to claim 1, wherein selecting the reference translation sentence with the highest similarity to the sample translation sentence as a sample post-editing sentence from the reference translation set comprises: and selecting the reference translation statement with the minimum editing distance with the sample translation statement as a sample post-editing statement.
14. The method for generating translation evaluation training data according to claim 1, wherein the sample translated sentences are obtained by performing machine translation or manual translation on the sample original sentences.
15. The method for generating translation evaluation training data according to claim 1, wherein the obtaining of the sample translation quality label of each vocabulary in the sample original sentence by using the sample post-editing sentence and the sample translation sentence comprises:
matching degree detection is carried out on the sample translation sentences and the sample post-editing sentences, and confidence labels are added to all vocabularies in the sample post-editing sentences according to matching degree detection results, wherein the confidence labels are used for indicating that the translation quality is qualified or the translation quality is unqualified;
and performing word alignment on the vocabulary in the sample original sentence and the vocabulary in the sample post-editing sentence, adding the confidence label to the corresponding vocabulary in the sample original sentence, adding a confidence label for indicating that the translation quality is qualified to the vocabulary without the corresponding relation in the sample original sentence, and taking the confidence label added to the sample original sentence as a sample translation quality label.
16. The method for generating translation evaluation training data according to claim 15, wherein the detecting the matching degree of the sample translated sentences and the sample post-editing sentences comprises: acquiring the corresponding relation between the sample translation statement and the vocabulary in the sample post-editing statement;
and detecting the matching degree of the words with the corresponding relation.
17. The method for generating translation evaluation training data according to claim 16, wherein a minimum edit distance is used to obtain a correspondence between the sample translated sentence and a vocabulary in the sample post-edit sentence.
18. A translation evaluation training data generation apparatus, comprising:
the first sentence acquisition module is used for acquiring one or more sample original sentences to be translated;
the sample translation statement acquisition module is used for acquiring a sample translation statement corresponding to the sample original statement, and the sample translation statement is obtained by translating the sample original statement;
a reference translation set obtaining module, configured to obtain a reference translation set corresponding to the sample original sentence, where the reference translation set includes a plurality of reference translation sentences;
the second sentence acquisition module is used for selecting the reference translation sentence with the highest similarity with the sample translation sentence from the reference translation set as a sample post-editing sentence;
the label obtaining module is used for obtaining sample translation quality labels of all vocabularies in the sample original sentences by utilizing the sample post-editing sentences and the sample translation sentences;
and the training set establishing module is used for establishing a training set according to the sample translation quality label, the training set comprises the sample original sentence, the sample translation sentence and the sample translation quality label, and the training set is used for training a translation evaluation model.
19. An apparatus comprising at least one memory and at least one processor, the memory storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method of generating translation evaluation training data according to any of claims 1 to 17.
20. A storage medium having stored thereon one or more computer instructions for implementing a method of generating translation evaluation training data according to any one of claims 1 to 17.
CN202111527223.XA 2021-12-14 2021-12-14 Method, device, equipment and storage medium for generating translation evaluation training data Pending CN114254658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111527223.XA CN114254658A (en) 2021-12-14 2021-12-14 Method, device, equipment and storage medium for generating translation evaluation training data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111527223.XA CN114254658A (en) 2021-12-14 2021-12-14 Method, device, equipment and storage medium for generating translation evaluation training data

Publications (1)

Publication Number Publication Date
CN114254658A true CN114254658A (en) 2022-03-29

Family

ID=80792202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111527223.XA Pending CN114254658A (en) 2021-12-14 2021-12-14 Method, device, equipment and storage medium for generating translation evaluation training data

Country Status (1)

Country Link
CN (1) CN114254658A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818748A (en) * 2022-05-10 2022-07-29 北京百度网讯科技有限公司 Method for generating translation model, translation method and device
CN115965018A (en) * 2023-01-04 2023-04-14 北京百度网讯科技有限公司 Training method of information generation model, information generation method and device
CN115965018B (en) * 2023-01-04 2024-04-26 北京百度网讯科技有限公司 Training method of information generation model, information generation method and device

Similar Documents

Publication Publication Date Title
CN108763510B (en) Intention recognition method, device, equipment and storage medium
KR101678787B1 (en) Method for automatic question-answering and apparatus therefor
JP6909832B2 (en) Methods, devices, equipment and media for recognizing important words in audio
JP5901001B1 (en) Method and device for acoustic language model training
CN111259873B (en) Table data extraction method and device
CN103823794A (en) Automatic question setting method about query type short answer question of English reading comprehension test
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
CN109460558B (en) Effect judging method of voice translation system
CN107748744A (en) A kind of method for building up and device for sketching the contours frame knowledge base
CN112016271A (en) Language style conversion model training method, text processing method and device
CN110874536A (en) Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
CN112101032A (en) Named entity identification and error correction method based on self-distillation
CN114254658A (en) Method, device, equipment and storage medium for generating translation evaluation training data
CN110751234A (en) OCR recognition error correction method, device and equipment
KR102251554B1 (en) Method for generating educational foreign language text by adjusting text difficulty
CN110008807A (en) A kind of training method, device and the equipment of treaty content identification model
CN111354354A (en) Training method and device based on semantic recognition and terminal equipment
KR20190090636A (en) Method for automatically editing pattern of document
CN113822052A (en) Text error detection method and device, electronic equipment and storage medium
CN111492364B (en) Data labeling method and device and storage medium
CN111178098A (en) Text translation method, device and equipment and computer readable storage medium
CN113553833B (en) Text error correction method and device and electronic equipment
CN114580391A (en) Chinese error detection model training method, device, equipment and storage medium
CN115017876A (en) Method and terminal for automatically generating emotion text
CN112733517B (en) Method for checking requirement template conformity, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination