CN110874536A - Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method - Google Patents

Publication number
CN110874536A
Authority
CN
China
Prior art keywords
bilingual
sentence
corpus
quality
bilingual sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810995294.4A
Other languages
Chinese (zh)
Other versions
CN110874536B (en)
Inventor
陆军
汪嘉怿
施杨斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201810995294.4A
Publication of CN110874536A
Application granted
Publication of CN110874536B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a method for generating a corpus quality evaluation model, a method for evaluating the inter-translation quality of bilingual sentence pairs, and a corresponding device, equipment and storage medium. The method for generating the corpus quality evaluation model comprises the following steps: constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs; and training a preset corpus quality evaluation network by taking the bilingual sentence pairs and their corresponding inter-translation quality labels as training samples to generate a corpus quality evaluation model, wherein the corpus quality evaluation model is suitable for evaluating the inter-translation quality of a given bilingual sentence pair. The embodiment of the invention makes it possible to evaluate the inter-translation quality of bilingual sentence pairs.

Description

Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
Technical Field
The invention relates to the technical field of machine translation, and in particular to a method for generating a corpus quality evaluation model, a method for evaluating the inter-translation quality of bilingual sentence pairs, and a corresponding device, equipment and storage medium.
Background
Machine translation refers to the technique of translating text from one natural language (the source language) into another natural language (the target language) using a computer program. Currently, the mainstream approach in this field is corpus-based machine translation, such as Statistical Machine Translation (SMT) and Neural Machine Translation (NMT), which relies on a corpus containing a large amount of training data to train a translation model. For both SMT and NMT, translation quality is closely related to the quality and size of the corpus. It is therefore important to evaluate the quality of the data in the corpus.
A bilingual corpus, sometimes called a bilingual parallel corpus, is one kind of data in such a corpus and is the key training data for a machine translation model. Bilingual corpora generally consist of texts that are translations of each other, at word level, phrase level, sentence level, or document level. For example, a Chinese sentence meaning "the weather is nice today" paired with the English sentence "It's a nice day today" is a Chinese-English bilingual corpus entry at sentence level.
Most previous schemes for evaluating the quality of a bilingual corpus compute the translation probability of a sentence pair from the translation probabilities of its words, and use that probability to assess quality. The approximate procedure is as follows:
1) Construct a bilingual vocabulary and calculate the translation probability of each word to obtain vocabulary entries. For example, an entry that pairs the English word "apple" with the Chinese word for "apple" and carries the values 0.8 and 0.6 indicates that the probability of translating the English word into the Chinese word is 0.8, and the probability of translating the Chinese word into the English word is 0.6.
2) If the bilingual corpus is at phrase level or sentence level, segment the original text and the translated text into words and perform word alignment to obtain word-pair relationships. Word alignment associates original words and translated words that are likely to be translations of each other.
3) Combine the word-pair relationships with the vocabulary translation probabilities obtained in step 1) and calculate the overall translation probability of the bilingual corpus with a suitable algorithm (for example, the statistically weighted proportion of mutually translated words).
Here, the quality of the bilingual corpus is reflected in the overall translation probability: the higher the overall translation probability, the better the quality of the bilingual corpus is considered to be.
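The word-based procedure described above can be sketched roughly as follows. This is only an illustration and not the scheme of any particular system: the toy lexicon, the romanized placeholder tokens, the greedy best-match alignment, and the simple averaging in `pair_score` are all assumptions chosen for brevity.

```python
# Hypothetical toy lexicon: (source word, target word) -> lexical translation probability.
# The romanized source tokens are placeholders, not data from the patent.
LEXICON = {
    ("jintian", "today"): 0.9,
    ("tianqi", "weather"): 0.8,
    ("bucuo", "nice"): 0.7,
}

def best_alignment_prob(src_word, tgt_words, lexicon):
    """Greedy word alignment: best lexical probability of src_word against any target word."""
    return max((lexicon.get((src_word, t), 0.0) for t in tgt_words), default=0.0)

def pair_score(src_words, tgt_words, lexicon):
    """Overall score of a sentence pair: mean of the best lexical probability per source word."""
    if not src_words:
        return 0.0
    total = sum(best_alignment_prob(s, tgt_words, lexicon) for s in src_words)
    return total / len(src_words)

src = ["jintian", "tianqi", "bucuo"]
good = pair_score(src, ["today", "weather", "nice"], LEXICON)
bad = pair_score(src, ["what", "is", "your", "name"], LEXICON)
```

Even in this small sketch the score depends on the lexicon coverage, the tokenization, and the choice of alignment and aggregation rule, which are the dependencies the following paragraph criticizes.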
Although this scheme reflects the quality of a bilingual corpus to some degree, it is fundamentally word-based. First, it depends on a constructed bilingual vocabulary. Second, it requires word segmentation and word alignment of the original text and the translated text. Third, a further algorithm must be introduced to calculate the final overall translation probability. The uncertainty in each of these steps affects the calculated result, so the overall translation probability cannot accurately reflect the quality of the bilingual corpus.
Disclosure of Invention
In view of the above, the present invention provides a training method, a quality assessment method, a device, an apparatus and a computer storage medium based on bilingual corpus, so as to solve the problem that it is difficult to perform quality assessment on bilingual corpus.
In a first aspect, the present invention provides a method for generating a corpus quality assessment model, where the method includes:
constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and translation quality labels corresponding to the bilingual sentence pairs;
and training a preset corpus quality evaluation network by taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples to generate a corpus quality evaluation model, wherein the corpus quality evaluation model is suitable for evaluating the inter-translation quality of the given bilingual sentence pairs.
In a second aspect, the present invention further provides an apparatus for generating a corpus quality assessment model, where the apparatus includes:
the corpus construction module is used for constructing a bilingual corpus which comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs;
and the corpus quality evaluation model training module is used for training a preset corpus quality evaluation network by taking the bilingual sentence pair and the inter-translation quality label corresponding to the bilingual sentence pair as a training sample to generate a corpus quality evaluation model, and the corpus quality evaluation model is suitable for evaluating the inter-translation quality of the given bilingual sentence pair.
In a third aspect, the present invention further provides a generating device of a corpus quality assessment model, including:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method as described above.
In a fourth aspect, the present invention also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
In a fifth aspect, the present invention further provides a method for evaluating the inter-translation quality of a bilingual sentence pair, where the method includes:
acquiring a bilingual sentence pair to be evaluated;
inputting the bilingual sentence pair into a trained corpus quality evaluation model;
and determining the inter-translation quality of the bilingual sentence pairs according to the output of the corpus quality evaluation model.
In a sixth aspect, the present invention further provides a device for evaluating the mutual translation quality of bilingual sentence pairs, where the device includes:
the bilingual sentence pair acquisition module is used for acquiring a bilingual sentence pair to be evaluated;
the bilingual sentence pair input module is used for inputting the bilingual sentence pair into a trained corpus quality evaluation model;
and the inter-translation quality determination module is used for determining the inter-translation quality of the bilingual sentence pair according to the output of the corpus quality evaluation model.
In a seventh aspect, the present invention further provides a device for evaluating the mutual translation quality of bilingual sentence pairs, which includes:
a memory for storing a program;
and the processor is used for operating the program stored in the memory to execute the method for evaluating the inter-translation quality of the bilingual sentence pairs.
In an eighth aspect, the present invention further provides a computer-readable storage medium, on which computer program instructions are stored, which when executed by a processor implement the method for evaluating the inter-translation quality of a bilingual sentence pair as described above.
By constructing a bilingual training corpus containing both inter-translated and non-inter-translated bilingual sentence pairs, the embodiment of the invention can train the corpus quality evaluation network as intended, so that a stable mapping from bilingual sentence pairs to inter-translation quality labels is formed. The resulting model can be used to evaluate the inter-translation quality of bilingual sentence pairs, and the evaluation result has high accuracy.
Drawings
Fig. 1 is a schematic flow chart of a corpus quality assessment model generation method according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a method for evaluating the translation quality of bilingual sentence pairs according to an embodiment of the present invention.
Fig. 3 is a diagram illustrating a training process of a corpus quality assessment network according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a corpus quality assessment model generation apparatus according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an apparatus for evaluating the translation quality of bilingual sentence pairs according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a hardware structure of the apparatus according to the embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and examples. It should be understood that the specific embodiments described are merely illustrative of the invention and are not intended to limit the invention. Terms such as first, second, etc. in this document are only used for distinguishing one entity (or operation) from another entity (or operation), and do not indicate any relationship or order between these entities (or operations); in addition, terms such as upper, lower, left, right, front, rear, and the like in the text denote directions or orientations, and only relative directions or orientations, not absolute directions or orientations. Without additional limitation, an element defined by the phrase "comprising" does not exclude the presence of other elements in a process, method, article, or apparatus that comprises the element.
The invention aims to train the constructed corpus quality evaluation network by constructing a brand-new bilingual corpus which is used as training data to generate a corpus quality evaluation model, and the model can realize the evaluation of the inter-translation quality of target bilingual corpus. Various aspects of the invention are described in detail below.
< bilingual corpus >
In order to achieve quality assessment of bilingual corpora, especially bilingual corpora at phrase level or sentence level, the bilingual corpus constructed according to the embodiments of the present invention includes bilingual sentence pairs. A bilingual sentence pair consists of two phrases or sentences in different languages (a source language and a target language) that are presented as translations of each other, such as Chinese-English, Chinese-Russian, or English-French phrases or sentences. The number of bilingual sentence pairs in the bilingual corpus can be set according to actual conditions and requirements, for example on the order of ten thousand, one hundred thousand, one million, or ten million; the greater the number of bilingual sentence pairs, the better the training effect on the corpus quality evaluation network.
In order to realize quality evaluation of bilingual corpora, the bilingual corpus constructed in the embodiment of the invention comprises "positive examples" and "negative examples". A "positive example" is a completely inter-translated bilingual sentence pair; such a pair is considered to have high inter-translation quality and is given a high-quality label, so that it serves as a positive sample when the corpus quality evaluation model is subsequently trained. A "negative example" is an incompletely inter-translated bilingual sentence pair; such a pair is considered to have low inter-translation quality and is given a low-quality label, so that it serves as a negative sample when the corpus quality evaluation model is subsequently trained. Therefore, the bilingual corpus constructed in the embodiment of the present invention includes completely inter-translated bilingual sentence pairs and incompletely inter-translated bilingual sentence pairs, which correspond to the high-quality label and the low-quality label, respectively. The two kinds of bilingual sentence pairs are described in detail below.
For convenience of description, the two sentences included in the bilingual sentence pair are hereinafter referred to as an original sentence and a translated sentence, respectively. It will be understood by those skilled in the art that the original sentence and the translated sentence are only used to distinguish two sentences in the bilingual sentence pair, and are not specific to a sentence in a certain language. The original sentence may be any one of the bilingual sentence pair, and accordingly, the other sentence of the bilingual sentence pair is the translated sentence.
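As a rough illustration of the corpus structure described above, each training item can be held as a record containing the original sentence, the translated sentence, and the quality label. The class name `LabeledPair`, the constant names, and the placeholder sentences are assumptions made for this sketch; the label values 1 (high quality) and 0 (low quality) follow the labeling used later in the description.

```python
from dataclasses import dataclass

HIGH_QUALITY = 1  # completely inter-translated pair (positive example)
LOW_QUALITY = 0   # incompletely inter-translated pair (negative example)

@dataclass(frozen=True)
class LabeledPair:
    original: str    # either sentence of the pair may play this role
    translated: str  # the other sentence of the pair
    label: int       # HIGH_QUALITY or LOW_QUALITY

# Placeholder strings stand in for real Chinese sentences.
corpus = [
    LabeledPair("It's a nice day today", "<Chinese translation of the sentence>", HIGH_QUALITY),
    LabeledPair("It's a nice day today", "<unrelated Chinese sentence>", LOW_QUALITY),
]

positives = [p for p in corpus if p.label == HIGH_QUALITY]
negatives = [p for p in corpus if p.label == LOW_QUALITY]
```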
< completely inter-translated bilingual sentence pairs >
In one embodiment of the present invention, a completely inter-translated bilingual sentence pair refers to a bilingual parallel sentence pair with perfect word alignment. For example, the pair formed by the English sentence "It's a nice day today" and its Chinese translation is a completely inter-translated bilingual sentence pair.
< incompletely inter-translated bilingual sentence pairs >
In one embodiment of the present invention, an incompletely inter-translated bilingual sentence pair refers to any bilingual sentence pair that cannot be called completely inter-translated, i.e., any bilingual sentence pair that does not have a perfect word alignment relationship. For example, words may be randomly deleted from the original sentence and/or the translated sentence, as in "today's nice, It's a nice day today" (the first sentence being a Chinese sentence, rendered here in English): the Chinese sentence lacks a word aligned with "day", so the pair is incompletely inter-translated. As another example, new words may be randomly inserted into the original sentence and/or the translated sentence; the inserted words have no aligned translation, so the pair is incompletely inter-translated. As a further example, if the word order of the original sentence and/or the translated sentence is randomly shuffled, the word alignment relationship becomes wrong, so the pair is also incompletely inter-translated.
For the above situations, in an actual application scenario, selection and necessary combination can be performed according to actual conditions and requirements (such as the source of the training data, the amount of the training data, the accuracy of model training, and the like).
As an example, an incompletely inter-translated bilingual sentence pair may consist of two phrases or two sentences that have no inter-translation relationship at all. For example, in "weather is good, tell your name" the two phrases are not translations of each other; likewise, in "what we eat at night, It's a nice day today" the two sentences are not translations of each other, so the pair is incompletely inter-translated.
In addition, in the embodiment of the present invention, a separator mark (shown here as a comma) is placed between the two phrases or sentences of a bilingual sentence pair to indicate that they form a pair. Other symbols, such as "|" or "- - -", may also be used to indicate this relationship in different embodiments or in different operating environments.
< Source of training data >
The training data of the embodiment of the invention comprises all bilingual sentence pairs in the bilingual corpus, and the sources of the bilingual sentence pairs are different according to the mutual translation conditions of the bilingual sentence pairs.
In one embodiment of the present invention, for the "positive example" bilingual sentence pairs, the inter-translated bilingual parallel sentence pairs already accumulated in the field can be used, and this part of the data is easy to obtain.
For example, if only completely inter-translated bilingual sentence pairs are to be used as high-quality sentence pairs, the incompletely inter-translated sentence pairs may be eliminated from such data, and the remaining genuinely inter-translated bilingual sentence pairs are used as positive samples in the training data.
In addition, high-quality bilingual sentence pairs translated manually can be used directly as the "positive example" bilingual sentence pairs, but the amount of such training data is usually small because of the high cost of manual processing.
There are many ways to acquire the "negative example" incompletely inter-translated bilingual sentence pairs: they can be constructed directly from existing sentences, or constructed by processing completely inter-translated bilingual sentence pairs. Some example ways of obtaining incompletely inter-translated bilingual sentence pairs are listed below:
a) On the basis of a completely inter-translated bilingual sentence pair, the original sentence and/or the translated sentence is replaced with another sentence by manual or computer means, for example "today is a good weather, What is your name?" or "what we eat today, It's a nice day today".
b) On the basis of a completely inter-translated bilingual sentence pair, one or more words are randomly deleted from the original sentence and/or the translated sentence by manual or computer means, such as "today's nice, It's nice day today".
c) On the basis of a completely inter-translated bilingual sentence pair, the word order of the original sentence and/or the translated sentence is randomly shuffled by manual or computer means, such as "today's weather, It's nice today day".
d) On the basis of the complete translation of the bilingual sentence pairs, other words are randomly inserted into the original sentence and/or the translated sentence through manual or computer means.
e) On the basis of a completely inter-translated bilingual sentence pair, at least part of the original sentence and/or the translated sentence is replaced with machine-translated text by manual or computer means. Here, the quality of machine translation is generally considered to be poor.
f) The sentences of the two languages are arbitrarily selected and arbitrarily paired.
g) Any other method that can reduce the quality of the translation.
For a) to g) above, the incompletely inter-translated bilingual sentence pairs may be constructed using any one of the methods, or any combination of two or more of them, so that "negative example" incompletely inter-translated bilingual sentence pairs are obtained as the negative samples in the training data.
For a) to f) above, in order to avoid labor costs, the operations involved (adding, deleting, modifying, and pairing data) can be carried out mainly by computer means. These operations are easy to implement for those skilled in the computer field, so their implementation and principles are not described in detail here.
It is noted that, for a) to e) above, in the embodiment of the present invention an incompletely inter-translated "negative example" bilingual sentence pair is considered to be formed when the number of words affected by the operation exceeds 10% of the total number of words in the sentence. Of course, other proportional thresholds, such as 20% or 30%, may be set as the criterion for forming an incompletely inter-translated bilingual sentence pair.
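A minimal sketch of how some of the listed operations might be automated is given below, covering random deletion (b), random insertion (d), and arbitrary re-pairing (f). The function names, the whitespace tokenization, the filler vocabulary, and the rotation used for re-pairing are assumptions made for illustration; only the 10% threshold comes from the text above.

```python
import random

THRESHOLD = 0.10  # proportion from the text; 0.2 or 0.3 could be used instead

def num_affected(n_words, ratio=THRESHOLD):
    """Smallest word count that strictly exceeds ratio * n_words, capped at n_words."""
    return min(n_words, int(n_words * ratio) + 1)

def delete_words(words, rng, ratio=THRESHOLD):
    """Method b): drop randomly chosen words."""
    drop = set(rng.sample(range(len(words)), num_affected(len(words), ratio)))
    return [w for i, w in enumerate(words) if i not in drop]

def insert_words(words, vocab, rng, ratio=THRESHOLD):
    """Method d): insert randomly chosen extra words at random positions."""
    out = list(words)
    for _ in range(num_affected(len(words), ratio)):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out

def mispair(pairs):
    """Method f), simplified: pair each original with the next pair's translation."""
    originals = [o for o, _ in pairs]
    translations = [t for _, t in pairs]
    rotated = translations[1:] + translations[:1]
    return list(zip(originals, rotated))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "it is a nice day today".split()
shorter = delete_words(sentence, rng)
longer = insert_words(sentence, ["blue", "seven"], rng)
```

Each function returns a new word list, so the untouched positive pair can be kept alongside the derived negative pair. Note that `mispair` only yields true negatives when the list holds at least two pairs with different translations.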
< corpus quality assessment network >
It can be understood that after the corpus quality assessment network and the training samples are set, the training samples are input into the corpus quality assessment network, the network outputs the labels corresponding to the samples, based on the labels output by the network and the real labels of the training samples, the loss function value of the network can be calculated, and the network parameters are adjusted according to the loss function value. Based on the updated parameters, the training samples are input into the network again, the loss function values are calculated and the network parameters are updated according to the labels output by the network and the real labels of the training samples, and by analogy, the network parameters are continuously updated to enable the loss function to reach the minimum (in practice, when the loss function is converged or is smaller than a preset threshold value, the loss function is considered to reach the minimum). The group of parameters with the minimum loss function is the optimal parameters of the network, and after the optimal parameters are determined, the trained model is obtained.
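The iterative procedure described above (forward pass, loss calculation, parameter update, stop on convergence) can be illustrated with a deliberately tiny stand-in. In this sketch a one-feature logistic model replaces the corpus quality assessment network, and the feature values, learning rate, and tolerance are assumptions chosen only so the loop is runnable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, samples):
    """Mean cross-entropy loss and its gradients for a one-feature logistic model."""
    loss = gw = gb = 0.0
    for x, y in samples:
        p = sigmoid(w * x + b)  # predicted probability of label 1
        loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
        gw += (p - y) * x
        gb += p - y
    n = len(samples)
    return loss / n, gw / n, gb / n

def train(samples, lr=0.5, tol=1e-6, max_steps=10000):
    """Gradient descent; stops when the loss changes by less than tol between steps."""
    w = b = 0.0
    prev = float("inf")
    loss = prev
    for _ in range(max_steps):
        loss, gw, gb = loss_and_grads(w, b, samples)
        if abs(prev - loss) < tol:  # convergence is treated as reaching the minimum
            break
        w -= lr * gw
        b -= lr * gb
        prev = loss
    return w, b, loss

# (feature, true label) pairs; the feature is a made-up scalar per sentence pair.
samples = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]
w, b, final_loss = train(samples)
```

A real network would have many parameters and would typically be trained with mini-batches in a deep learning framework, but the control flow is the same.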
In an embodiment of the present invention, the corpus quality assessment network includes a word embedding layer, a sentence embedding layer, a concatenation layer, and a classification layer, which are connected in sequence: the word embedding layer is used for generating the word vector sequences of the words included in the two sentences of the bilingual sentence pair; the sentence embedding layer is used for generating the sentence vector of each of the two sentences from its word vector sequence; the concatenation layer is used for concatenating the sentence vectors of the two sentences to obtain a concatenated vector; and the classification layer is used for outputting an inter-translation quality label according to the concatenated vector.
In one embodiment of the present invention, the input of the word embedding layer is the two sentences (or the two word-segmented sentences) of a bilingual sentence pair, and the output is the word vector sequences of the words included in the two sentences. The word embedding layer may be, for example, a word vector model such as word2vec or GloVe. In one embodiment, the word embedding layer further includes an Attention module for capturing information about the inter-translation relationship between the words of the original sentence and the words of the translated sentence in the bilingual sentence pair, so that the trained corpus quality assessment model can predict the inter-translation quality of the bilingual sentence pair more effectively.
In one embodiment of the present invention, the input of the sentence embedding layer is the word vector sequences output by the word embedding layer, and the output is the sentence vectors corresponding to the two sentences of the bilingual sentence pair. The sentence embedding layer can be implemented with network structures such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a Long Short-Term Memory (LSTM) network. The original sentence and the translated sentence of a single bilingual sentence pair may be processed with the same network structure, for example both with a CNN, or with different network structures, for example the original sentence with a CNN and the translated sentence with an RNN. In addition, each of these neural networks may be extended, for example by adding an RNN to the CNN.
In an embodiment of the present invention, the input of the concatenation layer is a sentence vector of two sentences output by the sentence embedding layer, and the output is a concatenation vector obtained by concatenating the sentence vectors of the two sentences.
In one embodiment of the present invention, the input of the classification layer is the concatenation vector output by the concatenation layer, the output is the probability that the bilingual sentence pair belongs to each translation quality label, and the label with the maximum probability is used as the translation quality of the bilingual sentence pair.
In one embodiment, the classification layer includes a fully connected layer and a Softmax layer connected in series. The input of the fully connected layer is the concatenated vector output by the concatenation layer, the output of the fully connected layer is the input of the Softmax layer, and the output of the Softmax layer is the probability that the bilingual sentence pair belongs to each inter-translation quality label; the label with the maximum probability is taken as the inter-translation quality of the bilingual sentence pair.
The number of fully-connected layers may be set by one skilled in the art without limitation. In one embodiment, to take the classification effect and training efficiency of the model into account, the number of fully connected layers is set to 2.
The dimension of the output of the Softmax layer equals the number of quality label categories. For example, if the quality labels are high quality and low quality, the output of the Softmax layer is a two-dimensional vector whose components are the probabilities that the bilingual sentence pair belongs to the high-quality label and the low-quality label, respectively. As another example, if the quality labels are high quality, medium quality, and low quality, the output of the Softmax layer is a three-dimensional vector whose components are the probabilities that the bilingual sentence pair belongs to the high-, medium-, and low-quality labels, respectively. The label with the highest probability is the inter-translation quality of the bilingual sentence pair.
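The layer sequence described above (word embedding, sentence embedding, concatenation, two fully connected layers, Softmax) can be sketched as an untrained forward pass. Everything concrete here is an assumption for illustration: random vectors stand in for word2vec or GloVe embeddings, mean pooling stands in for the CNN/RNN/LSTM sentence encoder, the ReLU activation and the layer sizes are arbitrary, and no attention module is included.

```python
import math
import random

rng = random.Random(0)
DIM, HIDDEN, LABELS = 4, 3, 2  # illustrative sizes; LABELS = 2 means low/high quality

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = {}  # word -> vector, filled lazily with random values

def embed_words(words):
    """Word embedding layer: one DIM-dimensional vector per word."""
    for w in words:
        embeddings.setdefault(w, [rng.uniform(-1, 1) for _ in range(DIM)])
    return [embeddings[w] for w in words]

def embed_sentence(word_vectors):
    """Sentence embedding layer, simplified to mean pooling (needs a non-empty sentence)."""
    n = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / n for i in range(DIM)]

def dense(matrix, vector):
    """Fully connected layer without bias: matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W1 = rand_matrix(HIDDEN, 2 * DIM)  # first fully connected layer
W2 = rand_matrix(LABELS, HIDDEN)   # second fully connected layer

def forward(original, translated):
    src = embed_sentence(embed_words(original.split()))
    tgt = embed_sentence(embed_words(translated.split()))
    joined = src + tgt                                 # concatenation layer
    hidden = [max(0.0, h) for h in dense(W1, joined)]  # ReLU (assumed activation)
    return softmax(dense(W2, hidden))                  # one probability per label

probs = forward("it is a nice day today", "what is your name")
label = max(range(LABELS), key=lambda i: probs[i])     # index of the most probable label
```

Changing `LABELS` to 3 gives the three-class (high, medium, low) variant; the weights are random here, so the predicted label carries no meaning until the network is trained.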
For the training sample, the training sample is a data sample marked with a classification label, and the label is the true category to which the data sample belongs. In the invention, the training sample is a bilingual sentence pair marked with a mutual translation quality label, specifically, the bilingual sentence pair completely translated is a positive sample, and the label is 1 (representing high quality); pairs of incompletely translated sentences are negative examples, labeled 0 (indicating low quality).
Based on the constructed bilingual corpus including the completely inter-translated bilingual sentence pairs and the incompletely inter-translated bilingual sentence pairs and the inter-translation quality labels corresponding to the sentence pairs, the corpus quality evaluation network is trained to generate a corpus quality evaluation model for evaluating the inter-translation quality of the given bilingual sentence pairs.
Based on the above, an embodiment of the present invention may provide a method for generating a corpus quality assessment model, and with reference to fig. 1, the method includes:
S101, constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs;
S102, taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples, and training a preset corpus quality evaluation network to generate a corpus quality evaluation model, wherein the corpus quality evaluation model is suitable for evaluating the inter-translation quality of a given bilingual sentence pair.
By utilizing the scheme provided by the invention, the preset corpus quality evaluation network can be trained based on the constructed bilingual corpus, so that the corpus quality evaluation model is generated, the model can be used for evaluating the inter-translation quality of the given bilingual sentence pair, and the evaluation result is stable and reliable.
Referring to fig. 2, the present invention further provides a method for evaluating the inter-translation quality of bilingual sentence pairs, which performs quality evaluation on the bilingual corpus to be evaluated by using the corpus quality evaluation model trained by the method shown in fig. 1, where the evaluation method includes:
S201, acquiring a bilingual sentence pair to be evaluated;
S202, inputting the bilingual sentence pair into a trained corpus quality evaluation model;
S203, determining the inter-translation quality of the bilingual sentence pair according to the output of the corpus quality evaluation model.
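Steps S201 to S203 amount to a thin wrapper around the trained model. In the sketch below, `evaluate_pair` is a hypothetical function name, and `length_ratio_model` is only a placeholder that returns two probabilities from sentence lengths; it is not the disclosed network and is there solely so the example runs end to end.

```python
def length_ratio_model(src_tokens, tgt_tokens):
    """Placeholder for a trained model: returns (P(label 0), P(label 1)).

    This toy scorer only compares sentence lengths.
    """
    longer = max(len(src_tokens), len(tgt_tokens), 1)
    ratio = min(len(src_tokens), len(tgt_tokens)) / longer
    return (1.0 - ratio, ratio)


def evaluate_pair(model, src_tokens, tgt_tokens):
    """S202 and S203: feed the pair to the model and map its output to a label."""
    p_low, p_high = model(src_tokens, tgt_tokens)
    return 1 if p_high > p_low else 0


# S201: acquire the pairs to be evaluated (hypothetical, already segmented).
good = (["today", "weather", "good"], ["nice", "weather", "today"])
bad = (["today", "weather", "good"], ["today"])

print(evaluate_pair(length_ratio_model, *good))  # prints 1
print(evaluate_pair(length_ratio_model, *bad))   # prints 0
```

Because the model is passed in as a callable, the same wrapper would work unchanged once a trained network replaces the placeholder.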
By using the method for evaluating the inter-translation quality of the bilingual sentence pairs, the evaluation result is stable and reliable.
The embodiments of the invention apply to most scenarios in which the quality of a bilingual corpus needs to be, or can be, evaluated. For example, in a user-led project for mining bilingual data resources, the user can use an embodiment of the invention to evaluate the quality of the mined bilingual data, so as to grasp the mining effect qualitatively or quantitatively and optimize the mining scheme accordingly. As another example, when selecting bilingual corpora for machine translation, the method can be used to evaluate candidate bilingual corpora and eliminate low-quality ones, thereby optimizing the bilingual corpus.
The following describes, by way of specific examples, alternative specific processes of embodiments of the present invention. It should be noted that the scheme of the present invention does not depend on a specific algorithm, and in practical applications, any known or unknown hardware, software, algorithm, program, or any combination thereof may be used to implement the scheme of the present invention, and the scheme of the present invention is within the protection scope of the present invention as long as the essential idea of the scheme of the present invention is adopted.
Fig. 3 shows a schematic diagram of a training process of a corpus quality assessment network according to an embodiment of the present invention, where the corpus quality assessment network includes a word embedding layer, a CNN layer, a concatenation layer, two fully connected layers, and a Softmax layer, which are connected in sequence.
Here, SRC and TGT denote the original text and the translated text, respectively, for example, SRC: "today the weather is good"; TGT: "It's a nice day today."
① The original text and the translated text are first segmented into words to obtain word sequences, for example, SRC: "today", "weather", "good".
② The words of the original and translated sentences are input into the word-embedding module of the word embedding layer, so that each word in a sentence is converted into a vector, for example, "today" in SRC becomes [0.13, 0.21, 0.101, …, 0.28]; the vector dimension may be 200 or 300.
③ The word vector sequences of the original and translated sentences are input into the CNN network of the sentence embedding layer. The CNN network includes a convolutional layer and a pooling layer and can extract the information of a sentence; the CNN network module outputs a vector representing the semantics of the sentence, such as [0.280, 0.116, …, 0.101].
CNN is a classical neural network structure that can accurately represent a sentence as a vector, and this vector represents the semantics of the sentence.
④ The sentence vectors of the original text and the translated text are input into the concatenation layer, which splices the two together to obtain a higher-dimensional spliced vector representing the sentence pair.
⑤ The spliced vector contains the semantics of both the original text and the translated text. It passes through two fully connected layers and a Softmax layer, which finally output a prediction result, namely the quality score or quality label of the sentence pair.
The two fully connected layers mainly model the degree of semantic matching between the original text and the translated text, and the Softmax layer outputs the final label.
The prediction result is expressed as the probabilities of labels 0 and 1. If the probability of label 1 is greater than the probability of label 0, the label of the sentence pair is determined to be 1 (high quality label); if the probability of label 1 is less than or equal to the probability of label 0, the label of the sentence pair is determined to be 0 (low quality label).
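The forward pass described in steps ① to ⑤ and the label decision can be sketched in plain Python as follows. This is only an illustrative sketch with randomly initialised, untrained parameters: the toy layer sizes, the ReLU activations, the omission of bias terms, and the use of separate convolution filters for the original and translated sentences are all assumptions that the text does not specify.

```python
import math
import random

rng = random.Random(0)
EMB, FILTERS, WIDTH, HIDDEN = 8, 4, 2, 6  # toy sizes; the text mentions 200 or 300 for EMB


def rand_vec(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


def embed(tokens, table):
    """Word embedding layer: one vector per word (zeros for unknown words)."""
    return [table.get(t, [0.0] * EMB) for t in tokens]


def cnn_sentence_vector(word_vecs, filters):
    """Sentence embedding layer: convolution over WIDTH-word windows,
    ReLU, then max pooling over all window positions."""
    pooled = []
    for f in filters:
        best = 0.0  # starting at 0 applies the ReLU; too-short sentences give 0
        for i in range(len(word_vecs) - WIDTH + 1):
            window = [v for vec in word_vecs[i:i + WIDTH] for v in vec]
            best = max(best, sum(a * b for a, b in zip(f, window)))
        pooled.append(best)
    return pooled


def dense(x, weights, relu):
    """Fully connected layer without bias terms."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return [max(0.0, o) for o in out] if relu else out


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def make_params(src_vocab, tgt_vocab):
    """Randomly initialised (untrained) parameters."""
    return {
        "src_emb": {w: rand_vec(EMB) for w in src_vocab},
        "tgt_emb": {w: rand_vec(EMB) for w in tgt_vocab},
        "src_filters": [rand_vec(EMB * WIDTH) for _ in range(FILTERS)],
        "tgt_filters": [rand_vec(EMB * WIDTH) for _ in range(FILTERS)],
        "fc1": [rand_vec(2 * FILTERS) for _ in range(HIDDEN)],
        "fc2": [rand_vec(HIDDEN) for _ in range(2)],
    }


def forward(src_tokens, tgt_tokens, p):
    src_vec = cnn_sentence_vector(embed(src_tokens, p["src_emb"]), p["src_filters"])
    tgt_vec = cnn_sentence_vector(embed(tgt_tokens, p["tgt_emb"]), p["tgt_filters"])
    spliced = src_vec + tgt_vec                    # concatenation of the two sentence vectors
    hidden = dense(spliced, p["fc1"], relu=True)   # first fully connected layer
    logits = dense(hidden, p["fc2"], relu=False)   # second fully connected layer
    probs = softmax(logits)                        # [P(label 0), P(label 1)]
    label = 1 if probs[1] > probs[0] else 0        # ties fall to label 0
    return probs, label


src = ["today", "weather", "good"]
tgt = ["it", "is", "a", "nice", "day", "today"]
params = make_params(src, tgt)
probs, label = forward(src, tgt, params)
print(probs, label)
```

Since the parameters are random, the printed label carries no meaning here; the sketch only shows how data flows through the layers and how the two output probabilities are turned into a label.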
Further, the embodiment of FIG. 3 may be implemented using the TensorFlow tool. In this process, an attention-mechanism-based Attention module can be constructed between the word vector sequences of the original text and the translated text to capture inter-translation relationship information between the words of the original text and the translated text in the bilingual sentence pair, so that the quality of the sentence pair can be predicted effectively.
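One possible form of such an Attention module is scaled dot-product attention from each original-text word vector to the translated-text word vectors. The text does not state which attention variant is used, so this choice, the function name `cross_attention`, and the toy two-dimensional vectors below are assumptions; the sketch is written in plain Python rather than TensorFlow so that it stays self-contained.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def cross_attention(src_vectors, tgt_vectors):
    """For each source word vector, return (weights, context).

    `weights` is a distribution over target words and `context` is the
    weighted sum of the target word vectors.
    """
    dim = len(src_vectors[0])
    results = []
    for q in src_vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tgt_vectors]
        weights = softmax(scores)
        context = [sum(w * k[d] for w, k in zip(weights, tgt_vectors))
                   for d in range(dim)]
        results.append((weights, context))
    return results


# Toy word vectors: source word 0 resembles target word 1, and
# source word 1 resembles target word 0.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.0, 2.0], [2.0, 0.0], [0.1, 0.1]]
attended = cross_attention(src, tgt)
for weights, context in attended:
    print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

In this toy input, the first source word places most of its weight on the second target word and the second source word on the first target word, which is the kind of word-level correspondence the module is meant to capture.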
Based on the above example, it can be seen that, first, the embodiment of the present invention does not require a bilingual vocabulary, so there is no vocabulary dependency problem. In addition, the original text and the translated text are each modeled (by the word embedding module and the CNN network module), which represents their semantics well; therefore, if the quality of the original text or of the translated text itself is poor (or good), this is also captured by the relevant modules and reflected in the finally output quality label. The final prediction result is thus an evaluation that fuses the quality of the original text itself, the quality of the translated text itself, and the degree of inter-translation between the original text and the translated text.
Corresponding to the method for generating the corpus quality assessment model in the embodiment of the invention, the invention also provides a device, equipment and computer storage medium for generating the corpus quality assessment model.
Referring to fig. 4, the apparatus for generating a corpus quality assessment model includes:
a corpus construction module 100, configured to construct a bilingual corpus, where the bilingual corpus includes a plurality of bilingual sentence pairs and translation quality labels corresponding to the bilingual sentence pairs;
and the corpus quality evaluation model training module 200 is configured to train a preset corpus quality evaluation network by using the bilingual sentence pair and the inter-translation quality label corresponding to the bilingual sentence pair as a training sample to generate a corpus quality evaluation model, where the corpus quality evaluation model is suitable for evaluating the inter-translation quality of the given bilingual sentence pair.
The generating device of the corpus quality evaluation model comprises:
a memory for storing a program;
and the processor is used for operating the program stored in the memory so as to execute each step in the corpus quality assessment model generation method according to the embodiment of the invention.
The present invention further provides a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the computer program instructions implement the steps in the corpus quality assessment model generation method according to the embodiment of the present invention.
The method can realize the expected training of the corpus quality evaluation network, and the generated model is used for the quality evaluation of the bilingual corpus.
Corresponding to the method for evaluating the mutual translation quality of the bilingual sentence pairs in the embodiment of the invention, the invention also provides a device, equipment and computer storage medium for evaluating the mutual translation quality of the bilingual sentence pairs.
Referring to fig. 5, the apparatus for evaluating the mutual translation quality of bilingual sentence pairs includes:
a bilingual sentence pair obtaining module 10, configured to obtain a bilingual sentence pair to be evaluated;
a bilingual sentence pair input module 20, configured to input the bilingual sentence pair into a trained corpus quality assessment model;
the corpus quality evaluation model 30 is configured to determine the inter-translation quality of the bilingual sentence pairs according to the output of the corpus quality evaluation model.
The mutual translation quality evaluation equipment for the bilingual sentence pairs comprises:
a memory for storing a program;
and the processor is used for operating the program stored in the memory to execute each step in the method for evaluating the inter-translation quality of the bilingual sentence pairs according to the embodiment of the invention.
The present invention further provides a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the computer program instructions implement the steps in the method for evaluating the inter-translation quality of bilingual sentence pairs according to the embodiment of the present invention.
The device, the equipment and the computer storage medium for evaluating the inter-translation quality of the bilingual sentence pairs can realize the quality evaluation of bilingual corpora, and the accuracy of an evaluation result is high.
It should be noted that the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product that includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer program instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
FIG. 6 is a block diagram illustrating an exemplary hardware architecture capable of implementing methods and apparatus according to embodiments of the present invention, such as bilingual corpus-based training apparatus and bilingual corpus quality assessment apparatus according to embodiments of the present invention. Computing device 1000 includes, among other things, input device 1001, input interface 1002, processor 1003, memory 1004, output interface 1005, and output device 1006.
The input interface 1002, the processor 1003, the memory 1004, and the output interface 1005 are connected to each other via a bus 1010, and the input device 1001 and the output device 1006 are connected to the bus 1010 via the input interface 1002 and the output interface 1005, respectively, and further connected to other components of the computing device 1000.
Specifically, the input device 1001 receives input information from the outside and transmits the input information to the processor 1003 via the input interface 1002; the processor 1003 processes the input information based on computer-executable instructions stored in the memory 1004 to generate output information, stores the output information temporarily or permanently in the memory 1004, and then transmits the output information to the output device 1006 through the output interface 1005; output device 1006 outputs the output information external to computing device 1000 for use by a user.
The computing device 1000 may perform the steps of the methods of the present invention described above.
Processor 1003 may be one or more Central Processing Units (CPUs). When the processor 1003 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory 1004 may be, but is not limited to, one or more of Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM), compact disc read only memory (CD-ROM), a hard disk, and the like. The memory 1004 is used to store program codes.
It is understood that the functions of any module or all modules provided in the embodiment of the present invention may be implemented by the central processing unit 1003 shown in fig. 6.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may refer to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.

Claims (22)

1. A method for generating a corpus quality assessment model, the method comprising:
constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and translation quality labels corresponding to the bilingual sentence pairs;
and training a preset corpus quality evaluation network by taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples to generate a corpus quality evaluation model, wherein the corpus quality evaluation model is suitable for evaluating the inter-translation quality of the given bilingual sentence pairs.
2. The method according to claim 1, wherein the step of training the preset corpus quality assessment network to generate the corpus quality assessment model comprises:
training a preset corpus quality evaluation network to determine the optimal parameters of the corpus quality evaluation network;
and taking the corpus quality evaluation network under the optimal parameters as a corpus quality evaluation model.
3. The method of claim 1, wherein the inter-translation quality labels comprise high quality labels and low quality labels, the constructing a bilingual corpus comprising:
obtaining a plurality of bilingual sentence pairs, wherein the bilingual sentence pairs comprise completely inter-translated bilingual sentence pairs and incompletely inter-translated bilingual sentence pairs; and
the bilingual sentence pairs that are completely translated are marked as high-quality labels, and the bilingual sentence pairs that are not completely translated are marked as low-quality labels.
4. The method according to claim 3, wherein the incompletely translated bilingual sentence pairs are obtained based on the completely translated bilingual sentence pairs, and in the incompletely translated bilingual sentence pairs, the ratio of the number of incompletely translated words to the total number of words of the corresponding sentence is greater than or equal to a preset threshold.
5. The method of claim 3 or 4, wherein said bilingual sentence pair comprises an original sentence and a translated sentence, said incompletely translated bilingual sentence pair being obtained by at least one of the following ways:
deleting at least one word in the original sentence and/or the translated sentence in the complete translation bilingual sentence pair;
adding at least one word in the original sentence and/or the translated sentence in the complete translation bilingual sentence pair;
changing the word sequence of the original sentence and/or the translated sentence in the completely inter-translated bilingual sentence pair;
replacing at least one part of the original sentence and/or the translated sentence in the complete inter-translated bilingual sentence pair with a machine translation result;
the original sentence and/or the translated sentence in the completely translated bilingual sentence pair are replaced by other sentences except the sentence.
6. The method according to claim 1, wherein the preset corpus quality assessment network comprises a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer which are connected in sequence; wherein,
the word embedding layer is used for generating a word vector sequence of words included by two sentences in the bilingual sentence pair;
the sentence embedding layer is used for respectively generating sentence vectors corresponding to the two sentences according to the word vector sequences of the words included in the two sentences;
the splicing layer is used for splicing sentence vectors corresponding to the two sentences to obtain a spliced vector;
and the classification layer is used for outputting a translation quality label according to the splicing vector.
7. The method of claim 6, wherein the word embedding layer further comprises an attention module for capturing inter-translation relationship information between words of the two sentences in the bilingual sentence pair.
8. The method of claim 6, wherein the sentence embedding layer is a convolutional neural network and/or a recurrent neural network.
9. The method of claim 6, wherein the classification layer comprises a fully connected layer and a Softmax layer connected in series.
10. The method according to claim 6 or 9, wherein the classification layer outputs the probability that the bilingual sentence pair belongs to each translation quality label, and takes the translation quality label with the highest probability as the translation quality of the bilingual sentence pair.
11. An apparatus for generating a corpus quality assessment model, the apparatus comprising:
the language database construction module is used for constructing a bilingual language database which comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs;
and the corpus quality evaluation model training module is used for training a preset corpus quality evaluation network by taking the bilingual sentence pair and the inter-translation quality label corresponding to the bilingual sentence pair as a training sample to generate a corpus quality evaluation model, and the corpus quality evaluation model is suitable for evaluating the inter-translation quality of the given bilingual sentence pair.
12. A generating device of a corpus quality assessment model comprises:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method of any one of claims 1 to 10.
13. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 10.
14. A method for evaluating the inter-translation quality of bilingual sentence pairs comprises the following steps:
acquiring a bilingual sentence pair to be evaluated;
inputting the bilingual sentence pair into a trained corpus quality evaluation model;
and determining the inter-translation quality of the bilingual sentence pairs according to the output of the corpus quality evaluation model.
15. The method according to claim 14, wherein said corpus quality assessment model comprises a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer, which are connected in sequence; wherein,
the word embedding layer is used for generating a word vector sequence of words included by two sentences in the bilingual sentence pair;
the sentence embedding layer is used for respectively generating sentence vectors corresponding to the two sentences according to the word vector sequences of the words included in the two sentences;
the splicing layer is used for splicing sentence vectors corresponding to the two sentences to obtain a spliced vector;
and the classification layer is used for outputting a translation quality label according to the splicing vector.
16. The method of claim 15, wherein the word embedding layer further comprises an attention module for capturing inter-translation relationship information between words of the two sentences in the bilingual sentence pair.
17. The method of claim 15, wherein the sentence embedding layer is a convolutional neural network and/or a recurrent neural network.
18. The method of claim 15, wherein the classification layer comprises a fully connected layer and a Softmax layer connected in series.
19. The method according to claim 15 or 18, wherein the classification layer outputs the probability that the bilingual sentence pair belongs to each translation quality label, and takes the translation quality label with the highest probability as the translation quality of the bilingual sentence pair.
20. An apparatus for evaluating a mutual translation quality of a bilingual sentence pair, the apparatus comprising:
the bilingual sentence pair acquisition module is used for acquiring a bilingual sentence pair to be evaluated;
the bilingual sentence pair input module is used for inputting the bilingual sentence pair into a trained corpus quality evaluation model;
and the corpus quality evaluation model is used for determining the inter-translation quality of the bilingual sentence pairs according to the output of the corpus quality evaluation model.
21. An apparatus for evaluating the translation quality of a bilingual sentence pair, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method of any of claims 14-19.
22. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 14-19.
CN201810995294.4A 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method Active CN110874536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810995294.4A CN110874536B (en) 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method

Publications (2)

Publication Number Publication Date
CN110874536A true CN110874536A (en) 2020-03-10
CN110874536B CN110874536B (en) 2023-06-27

Family

ID=69714634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810995294.4A Active CN110874536B (en) 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method

Country Status (1)

Country Link
CN (1) CN110874536B (en)

Also Published As

Publication number Publication date
CN110874536B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant