CN115587599A

CN115587599A - Quality detection method and device for machine translation corpus

Info

Publication number: CN115587599A
Application number: CN202211128975.3A
Authority: CN
Inventors: 黄嘉鑫; 谢育涛; 尹曦; 谢凯
Original assignee: International Digital Economy Academy IDEA
Current assignee: International Digital Economy Academy IDEA
Priority date: 2022-09-16
Filing date: 2022-09-16
Publication date: 2023-01-10
Anticipated expiration: 2042-09-16
Also published as: CN115587599B

Abstract

The invention provides a method and a device for detecting the quality of a machine translation corpus. The method comprises: when a machine translation corpus to be detected is received, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence; and performing problem detection on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result. Because problem detection runs automatically on the preprocessed standard text and word segmentation sequence using the preset multi-language detection rule, the method applies to multiple languages, has a wide range of application, requires no manual quality inspection by translators, improves detection efficiency, and ensures the quality of the detection result.

Description

Quality detection method and device for machine translation corpus

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a device for detecting the quality of machine translation corpora.

Background

Machine translation has decades of development history. It has moved from early hand-written rule-based translation, through statistical translation, to the current neural machine translation stage, and is one of the most active directions in deep-learning-based natural language processing. With the rapid development of neural machine translation, extensive practice has shown that machine translation effectively reduces the time and cost of translation, particularly when the translation volume is very large or time is pressing. Adding parallel sentence pairs and improving the quality of the training set improve the translation quality of the final machine learning model. Parallel sentence pairs obtained by mining massive web data, by manual annotation, by back-translation, and by similar means need further checking and correction, so a scheme for doing this has great application value.

Because of language barriers, data verification is not readily supported by language experts, and it is difficult to work directly with translators of low-resource languages. Furthermore, because very few translators are available, the annotation process is more susceptible to variation in each translator's proficiency, and the quality of a translator's work is not easy to check. The existing literature on low-resource translation offers little guidance on constructing a parallel data set and ensuring its quality.

Therefore, the prior art has defects and needs to be improved and developed.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method and an apparatus for detecting quality of machine-translated corpus, aiming at solving the problem that manual quality inspection of translators is required for quality detection of machine-translated corpus in the prior art.

The technical scheme adopted by the invention for solving the technical problem is as follows:

a quality detection method for machine translation corpora comprises the following steps:

when a machine translation corpus to be detected is received, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence;

and carrying out problem detection processing on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result.

In one implementation, the machine-translated corpus is a monolingual text or a bilingual text, the bilingual text including an original text and a translated text; when receiving a machine translation corpus to be detected, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence, wherein the method comprises the following steps of:

if the language material translated by the machine is a monolingual text, preprocessing the monolingual text to obtain a standard text and a word segmentation sequence;

and if the language material translated by the machine is a bilingual text, respectively preprocessing an original text and a translated text in the bilingual text to obtain a standard text and a word segmentation sequence corresponding to the original text and a standard text and a word segmentation sequence corresponding to the translated text.

In one implementation, when receiving a machine translation corpus to be detected, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence, including:

when a machine translation corpus to be detected is received, determining the target language of the machine translation corpus;

removing invisible characters in the machine translation corpus, combining a plurality of blank characters into a blank space, and performing character format unified processing on the machine translation corpus to obtain a standard text;

and inputting the standard text into a word segmentation tool corresponding to the target language to obtain a word segmentation sequence.

In one implementation, performing text format unified processing on the machine-translated corpus includes:

if the machine translation corpus is in an alphabetic language, converting the letters of the machine translation corpus to lowercase;

if the machine translation corpus is in a non-alphabetic language, unifying the character format of the machine translation corpus to either full-width or half-width.

In one implementation, the multi-language detection rules include monolingual detection rules and bilingual detection rules; performing problem detection processing on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result, wherein the problem detection processing comprises the following steps:

if the language material translated by the machine is a monolingual text, performing problem detection processing on the standard text and the participle sequence according to the monolingual detection rule to obtain a monolingual detection result;

if the machine translation corpus is a bilingual text, performing problem detection processing on the standard text and the participle sequence corresponding to the original text and the standard text and the participle sequence corresponding to the translated text respectively according to the single-language detection rule to obtain a single-language detection result of the original text and a single-language detection result of the translated text;

obtaining bilingual parallel sentence pairs corresponding to the bilingual texts according to the standard texts and the segmentation sequences corresponding to the original texts and the standard texts and the segmentation sequences corresponding to the translated texts;

and carrying out problem detection processing on the bilingual parallel sentence pairs according to the bilingual detection rule to obtain a bilingual detection result.

In one implementation, the monolingual detection rule includes: language identification detection items, sentence length limitation detection items, sensitive word detection items, sentence confusion degree detection items, messy code character detection items, punctuation digit proportion detection items and word frequency limitation detection items;

the detection step of the language identification detection item comprises the following steps: inputting the standard text into a pre-trained language identification classification model and a preset rule engine, performing language discrimination on the language of the standard text, comparing a language discrimination result with the target language, and determining a matching result of the language discrimination result and the target language;

the step of detecting the sentence length limitation detection item comprises the following steps: comparing the total number of the characters of the standard text with a preset character threshold value to determine whether the total number of the characters is a qualified result; or comparing the number of the sequence elements of the word segmentation sequence with a preset sequence element threshold value to determine whether the number of the sequence elements is qualified;

the detection step of the sensitive word detection item comprises the following steps: detecting the sensitive words in the standard text according to a pre-constructed sensitive word dictionary of each language and a multi-mode matching algorithm;

the sentence confusion detection item detection step includes: evaluating the confusion degree of the standard text by using a multilingual statement confusion degree calculation model, and determining whether the semantics of the standard text are smooth or not;

the detection step of the messy code character detection item comprises the following steps: looking up the Unicode code range corresponding to the target language, using a regular expression to extract initial characters in the standard text that do not belong to that range, filtering general characters out of the initial characters to obtain messy code characters, calculating the number of the messy code characters, and determining a language semantic result of the standard text;

the detection step of the punctuation number proportion detection item comprises the following steps: acquiring the total number of characters of the standard text, respectively extracting punctuations and numbers in the standard text, counting the number of the punctuations and the number of the numbers to obtain a first ratio of the punctuations to the total number of the characters and a second ratio of the numbers to the total number of the characters, and determining a limit relation between the first ratio and the second ratio and a preset ratio range;

the detection step of the word frequency limitation detection item comprises the following steps: and counting each word group in the word segmentation sequence to obtain the highest word frequency, comparing the highest word frequency with a preset word frequency threshold value, and determining whether the sentence corresponding to the word segmentation sequence is invalid or not.

In one implementation, the monolingual detection result includes: the matching result obtained from the language identification detection item; the result of whether the total number of characters, or the number of sequence elements, is qualified, obtained from the sentence length limitation detection item; the sensitive words found by the sensitive word detection item; the result of whether the semantics are smooth, obtained from the sentence confusion detection item; the language semantic result obtained from the messy code character detection item; the limit relation obtained from the punctuation digit proportion detection item; and the result of whether the sentence is invalid, obtained from the word frequency limitation detection item.
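The punctuation digit proportion item and the word frequency limitation item can be sketched briefly in Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, "numbers" are counted as individual digit characters, and punctuation is taken to mean Unicode general categories beginning with "P" (so currency and math symbols are not counted). The preset ratio range and word frequency threshold are left to the caller.

```python
import unicodedata
from collections import Counter
from typing import Sequence, Tuple


def punct_digit_ratios(standard_text: str) -> Tuple[float, float]:
    """Return (punctuation ratio, digit ratio) over the total character count."""
    total = len(standard_text)
    if total == 0:
        return 0.0, 0.0
    # Assumption: punctuation = Unicode general categories Pc, Pd, Ps, Pe, Pi, Pf, Po.
    puncts = sum(unicodedata.category(ch).startswith("P") for ch in standard_text)
    # Assumption: each digit character counts once.
    digits = sum(ch.isdigit() for ch in standard_text)
    return puncts / total, digits / total


def max_word_frequency(tokens: Sequence[str]) -> int:
    """Return the count of the most frequent token in a word segmentation sequence."""
    return max(Counter(tokens).values(), default=0)
```

A caller would compare both ratios with the preset ratio range, and compare the highest word frequency with the preset word frequency threshold to decide whether the sentence is invalid.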

In one implementation, the bilingual detection rule includes: a sentence length ratio detection item, a word frequency distance detection item, a digital alignment detection item, a URL alignment detection item, a word alignment detection item, a semantic similarity detection item and a simulation scoring detection item;

the detection step of the sentence length ratio detection item comprises the following steps: acquiring the bilingual parallel sentence pair, acquiring a first total number of characters of a standard text corresponding to an original text and a second total number of characters of the standard text corresponding to a translated text, calculating a character ratio between the first total number of characters and the second total number of characters, and comparing the character ratio with a preset character ratio threshold value;

or acquiring the bilingual parallel sentence pair to obtain the number of first sequence elements of a word segmentation sequence corresponding to the original text and the number of second sequence elements of a word segmentation sequence corresponding to the translated text, calculating the element number ratio between the number of the first sequence elements and the number of the second sequence elements, and comparing the element number ratio with a preset element number ratio threshold;

the detection step of the word frequency distance detection item comprises the following steps: acquiring the bilingual parallel sentence pair, acquiring a first highest word frequency of a standard text corresponding to an original text and a second highest word frequency of the standard text corresponding to a translated text, calculating a difference between the first highest word frequency and the second highest word frequency, and comparing the difference with a preset difference threshold;

the detection step of the digital alignment detection item comprises the following steps: acquiring the bilingual parallel sentence pair, using a regular expression to extract the numbers in the standard text corresponding to the original text to obtain a first number set and the numbers in the standard text corresponding to the translated text to obtain a second number set, calculating the number intersection and the number union of the first number set and the second number set, calculating the number proportion value of the number intersection to the number union, and comparing the number proportion value with a preset number proportion value;

the detection step of the URL alignment detection item comprises the following steps: acquiring the bilingual parallel sentence pair, using a regular expression to extract the URLs in the standard text corresponding to the original text to obtain a first URL set and the URLs in the standard text corresponding to the translated text to obtain a second URL set, calculating the URL intersection and the URL union of the first URL set and the second URL set, calculating the URL proportion value of the URL intersection to the URL union, and comparing the URL proportion value with a preset URL ratio;

the detection step of the word alignment detection item comprises the following steps: acquiring the bilingual parallel sentence pairs, matching word groups in the word segmentation sequence corresponding to the original text and the word groups in the word segmentation sequence corresponding to the translated text in pairs through a bilingual dictionary, counting word alignment results, and calculating sentence-level alignment scores;

the detection step of the semantic similarity detection item comprises the following steps: acquiring the bilingual parallel sentence pair, mapping the standard text corresponding to the original text to an embedding in a shared vector space to obtain a first sentence characterization vector, mapping the standard text corresponding to the translated text to an embedding in the same space to obtain a second sentence characterization vector, and calculating the cosine value between the first sentence characterization vector and the second sentence characterization vector;

the detection step of the simulation scoring detection item comprises the following steps: inputting the bilingual parallel sentence pair into a pre-trained machine learning model to obtain a translation confidence score, the machine learning model being trained on bilingual parallel sentence pairs that carry manual scoring results.

In one implementation, the bilingual detection result includes: the comparison result between the character ratio and the preset character ratio threshold, or between the element number ratio and the preset element number ratio threshold, obtained from the sentence length ratio detection item; the comparison result between the difference and the preset difference threshold, obtained from the word frequency distance detection item; the comparison result between the number proportion value and the preset number proportion value, obtained from the digital alignment detection item; the comparison result between the URL proportion value and the preset URL ratio, obtained from the URL alignment detection item; the sentence-level alignment score obtained from the word alignment detection item; the cosine value between the first sentence characterization vector and the second sentence characterization vector, obtained from the semantic similarity detection item; and the translation confidence score obtained from the simulation scoring detection item.
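Three of the bilingual items reduce to short computations: a length ratio, an intersection-over-union of extracted sets, and a cosine between two vectors. The Python sketch below illustrates them under stated assumptions: the function names and the number pattern are hypothetical, an empty union is treated as fully aligned, and the sentence vectors are assumed to come from some multilingual embedding model that is not shown.

```python
import math
import re
from typing import Sequence, Set

# Assumption: a "number" is a run of digits with an optional decimal part.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def length_ratio(src_len: int, tgt_len: int) -> float:
    """Source length divided by target length (characters or sequence elements)."""
    if tgt_len == 0:
        raise ValueError("target length must be positive")
    return src_len / tgt_len


def jaccard(first: Set[str], second: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = first | second
    if not union:
        return 1.0  # assumption: nothing to align on either side counts as aligned
    return len(first & second) / len(union)


def number_alignment(src_text: str, tgt_text: str) -> float:
    """Proportion value for the numbers found in the two standard texts."""
    return jaccard(set(_NUMBER.findall(src_text)), set(_NUMBER.findall(tgt_text)))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two sentence vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

The URL alignment item can reuse `jaccard` with a URL pattern in place of the number pattern; each returned value is then compared with its preset threshold.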

In one implementation, the method for detecting the quality of the machine-translated corpus further includes:

storing the standard text to a pre-established corpus;

and performing data duplicate checking on the standard texts in the corpus.

In one implementation, performing data duplicate checking on the standard texts in the corpus includes:

removing symbols from the standard text using a regular expression, and merging whitespace characters into a single space;

dividing all the standard texts into buckets, with standard texts that have the same first word placed in the same bucket, and encrypting each standard text in each bucket, wherein standard texts with the same ciphertext are repeated sentences;

dividing all the standard texts into buckets, with standard texts that have the same first word placed in the same bucket and standard texts that have the same last word placed in the same bucket;

assigning a unique node to each standard text in a bucket, and calculating the edit distance between every two standard texts in the bucket;

comparing the edit distance with a preset distance threshold, connecting the nodes of standard texts whose comparison result is similar to obtain connected subgraphs, and obtaining a clustering result from the connected subgraphs, wherein standard texts belonging to the same class in the clustering result are repeated sentences.
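The duplicate checking steps can be illustrated with a small single-machine Python sketch. It rests on several assumptions that are not in the source: an MD5 digest stands in for "encrypting" a text, the exact-duplicate pass skips the first-word bucketing (which mainly matters for distributed execution), the connected subgraphs are computed with a union-find structure, and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def exact_duplicates(texts: List[str]) -> List[List[int]]:
    """Group indices of texts whose digests are identical."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for i, text in enumerate(texts):
        # Assumption: a hash digest plays the role of the "ciphertext".
        groups[hashlib.md5(text.encode("utf-8")).hexdigest()].append(i)
    return [g for g in groups.values() if len(g) > 1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def near_duplicate_clusters(texts: List[str], max_distance: int) -> List[List[int]]:
    """Cluster texts that are linked by pairs within max_distance edits."""
    parent = list(range(len(texts)))  # one node per text

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Bucket by first word and by last word so only likely pairs are compared.
    buckets: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for i, text in enumerate(texts):
        words = text.split()
        if words:
            buckets[("first", words[0])].append(i)
            buckets[("last", words[-1])].append(i)

    for members in buckets.values():
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                i, j = members[x], members[y]
                if edit_distance(texts[i], texts[j]) <= max_distance:
                    parent[find(i)] = find(j)  # connect the two nodes

    clusters: Dict[int, List[int]] = defaultdict(list)
    for i in range(len(texts)):
        clusters[find(i)].append(i)
    return [c for c in clusters.values() if len(c) > 1]
```

Comparing only within buckets keeps the pairwise edit distance computation manageable; two near-duplicates that differ in both their first and last word would be missed by this bucketing.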

In one implementation, the method for detecting the quality of the machine translation corpus runs in a Spark environment, and during development and debugging the input text consists of parallel sentence pairs of Chinese and other languages.

The invention also discloses a quality detection device of the machine translation corpus, which comprises:

the preprocessing module is used for preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence when the machine translation corpus to be detected is received;

and the detection module is used for carrying out problem detection processing on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result.

The invention also discloses a terminal, comprising a processor and a quality detection program of the machine translation corpus, wherein the quality detection program, when executed by the processor, implements the steps of the quality detection method of the machine translation corpus described above.

The invention also discloses a computer readable storage medium, which stores a computer program that can be executed for implementing the steps of the quality detection method of the machine translation corpus as described above.

Drawings

FIG. 1 is a flowchart illustrating a method for quality inspection of machine-translated corpus according to a preferred embodiment of the present invention.

FIG. 2 is a functional block diagram of a preferred embodiment of the apparatus for quality detection of machine-translated corpus in accordance with the present invention.

Fig. 3 is a functional block diagram of a preferred embodiment of the terminal of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Some quality inspection tools exist in the prior art, but they are scattered and one-sided: each implements only part of the quality inspection logic, and together they do not form a complete solution. Most of them are English-centric and focus on European and American languages written in Latin script, and they lack support for some other languages. Against this background, the present scheme aims to provide an automatic corpus quality evaluation system with a high confidence level, to support rapid delivery of high-quality multilingual data, and to offer an approach to removing this obstacle in machine translation research.

Referring to fig. 1, fig. 1 is a flowchart of a method for detecting quality of machine-translated corpus in accordance with the present invention. As shown in fig. 1, the method for detecting quality of machine-translated corpus according to the embodiment of the present invention includes the following steps:

Step S100, when the machine translation corpus to be detected is received, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence.

Specifically, when receiving the machine translation corpus to be detected, the embodiment first preprocesses the machine translation corpus, and then performs problem detection processing on the standard text and the participle sequence obtained by preprocessing, so as to realize automatic quality detection and evaluation of the machine translation corpus.

In one implementation, the machine translation corpus is a monolingual text or a bilingual text, the bilingual text including both an original text and a translated text. If the machine translation corpus is a monolingual text, the monolingual text is preprocessed to obtain a standard text and a word segmentation sequence. If the machine translation corpus is a bilingual text, the original text and the translated text in the bilingual text are preprocessed separately to obtain a standard text and a word segmentation sequence corresponding to the original text, and a standard text and a word segmentation sequence corresponding to the translated text.

Specifically, the embodiment is provided with a preprocessing module, and the preprocessing module is used for performing word segmentation and normalization processing on the machine translation corpus. The machine translation corpus may be monolingual text or bilingual text, the monolingual text may be original text, and the bilingual text may include both original text and translated text. For any text, pre-processing is performed. Namely, the monolingual text correspondingly obtains a standard text and a word segmentation sequence, and the bilingual text correspondingly obtains two standard texts and two word segmentation sequences. Therefore, the method and the device are suitable for both the monolingual text and the bilingual text, and have a wide application range.

In an embodiment, the step S100 specifically includes:

step S110, when a machine translation corpus to be detected is received, determining the target language of the machine translation corpus;

step S120, removing invisible characters in the machine translation corpus, combining a plurality of blank characters into a blank space, and performing character format unified processing on the machine translation corpus to obtain a standard text;

and S130, inputting the standard text into a word segmentation tool corresponding to the target language to obtain a word segmentation sequence.

Specifically, the preprocessing step comprises: removing invisible characters, and merging runs of whitespace characters (including but not limited to spaces, tabs, and carriage returns) into a single space; each language is then segmented with its corresponding word segmentation tool, and the resulting standard text and word segmentation sequence are output to the other modules. Here, whitespace characters are characters that the naked eye perceives as occupying a position; invisible characters are characters that the naked eye cannot see but that occupy space when processed by a program.

In a further embodiment, performing unified character format processing on the machine translation corpus specifically includes: if the machine translation corpus is in an alphabetic language, converting its letters to lowercase; if it is in a non-alphabetic language, unifying its characters to either full-width or half-width. Specifically, for text in alphabetic scripts such as Latin and Cyrillic, all letters are converted to lowercase; for non-alphabetic languages such as Chinese, Japanese, and Korean, full-width and half-width forms are unified. The invention can thus preprocess many languages and is comprehensive.
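The preprocessing of steps S110 to S130 can be sketched in Python as follows. This is a hedged illustration only: the function names are hypothetical, invisible characters are approximated by the Unicode control and format categories, NFKC normalization is used as one possible way to unify full-width and half-width forms, and a whitespace split stands in for the per-language word segmentation tool, which the source does not name.

```python
import re
import unicodedata
from typing import List

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_text(text: str, alphabetic: bool) -> str:
    """Return the standard text for one sentence."""
    # Assumption: "invisible characters" are Unicode categories Cc (control)
    # and Cf (format); whitespace is kept here so it can be merged below.
    visible = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Merge runs of spaces, tabs, and line breaks into a single space.
    collapsed = _WHITESPACE_RUN.sub(" ", visible).strip()
    if alphabetic:
        # Alphabetic scripts such as Latin or Cyrillic: convert to lowercase.
        return collapsed.lower()
    # Non-alphabetic languages: NFKC maps full-width Latin letters, digits,
    # and ASCII punctuation variants to their half-width forms.
    return unicodedata.normalize("NFKC", collapsed)


def whitespace_tokenize(standard_text: str) -> List[str]:
    """Placeholder segmenter; a real system would pick a tool per language."""
    return standard_text.split()
```

For example, `normalize_text("Hello\u200b \t World\n", alphabetic=True)` drops the zero-width space, merges the whitespace run, and lowercases the result.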

Step S100 is followed by step S200: performing problem detection processing on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result.

Specifically, the invention presets multi-language detection rules that apply to many languages and can comprehensively detect many kinds of problems.

In one implementation, the multi-language detection rules include monolingual detection rules and bilingual detection rules. If the machine translation corpus is a monolingual text, problem detection is performed on the standard text and the word segmentation sequence according to the monolingual detection rules to obtain a monolingual detection result. If the machine translation corpus is a bilingual text, problem detection is first performed, according to the monolingual detection rules, on the standard text and word segmentation sequence of the original text and on those of the translated text, giving a monolingual detection result for the original text and one for the translated text. A bilingual parallel sentence pair is then formed from the standard texts and word segmentation sequences of the original and translated texts, and problem detection is performed on it according to the bilingual detection rules to obtain a bilingual detection result. If the original text corresponds to a first standard text and a first word segmentation sequence, and the translated text corresponds to a second standard text and a second word segmentation sequence, the bilingual parallel sentence pair consists of the pair of the first and second standard texts and the pair of the first and second word segmentation sequences.

In one embodiment, the monolingual detection rule includes: language identification detection items, sentence length limitation detection items, sensitive word detection items, sentence confusion degree detection items, messy code character detection items, punctuation digit proportion detection items and word frequency limitation detection items.

First, the language identification detection item: the standard text is input into a pre-trained language identification classification model and a preset rule engine, which determine the language of the standard text; the result is compared with the target language to determine whether the two match. Specifically, the language of the standard text is judged by the pre-trained language identification classification model together with a rule engine designed from statistical results, the language of the current text is checked against the expected language, and data whose language does not match is screened out. The language identification classification model may be trained on monolingual corpora and their language labels, or may be a ready-made open-source language identification model such as Google's CLD3, Facebook (Meta)'s fastText, or Lingua. The rules of the rule engine are obtained by statistical induction over monolingual sentence data and the corresponding language labels. For example, if sentences containing a certain character a are all in language A, a rule is added: if a appears in a sentence, the sentence is judged to be in language A. If two words appear together and do so only in sentences of language B, a rule is added: if both words appear in a sentence, the sentence is judged to be in language B. Checking whether the language of the current text meets expectations means the following: the corpus is known in advance to be an A-to-B language pair when quality inspection is performed, so the check is whether the original text is in language A and whether the translated text is in language B; if not, the text does not meet expectations. Screening out data with unmatched languages means deleting sentences whose original text is not in the expected language A or whose translated text is not in the expected language B.
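The interplay between the rule engine and the classification model can be sketched as below. This is an assumption-laden illustration: the rule table is hypothetical and far too small for real use, the classifier is passed in as a plain callable rather than tied to any particular library, and letting a fired rule take precedence over the classifier is a design choice made here, not something the source specifies.

```python
from typing import Callable, Dict, Optional

# Hypothetical character rules of the kind obtained by statistical induction:
# "if this character appears, judge the sentence to be in this language".
CHAR_RULES: Dict[str, str] = {
    "ß": "de",
    "ё": "ru",
}


def rule_engine(standard_text: str) -> Optional[str]:
    """Return a language code if any character rule fires, else None."""
    for ch in standard_text:
        if ch in CHAR_RULES:
            return CHAR_RULES[ch]
    return None


def language_matches(
    standard_text: str,
    target_lang: str,
    classifier: Callable[[str], str],
) -> bool:
    """Compare the predicted language with the expected target language."""
    # Assumption: a fired rule wins; otherwise the classifier decides.
    predicted = rule_engine(standard_text) or classifier(standard_text)
    return predicted == target_lang
```

In use, `classifier` would wrap whichever language identification model is available; sentences for which `language_matches` returns False are the ones screened out.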

Second, the sentence length limitation detection item: the total number of characters of the standard text is compared with a preset character threshold to determine whether the total number of characters is qualified; or the number of sequence elements of the word segmentation sequence is compared with a preset sequence element threshold to determine whether the number of sequence elements is qualified.

Specifically, the length of a single sentence is limited according to the needs of the scenario. The total number of characters of the standard text (suitable for Chinese, Japanese, and similar languages) or the number of sequence elements of the word segmentation sequence (suitable for European and American languages) is calculated directly, and the count is checked against a preset threshold interval; if the count is not within the interval, the sentence fails this quality check.
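As a minimal Python sketch of this check, assuming a closed threshold interval and a hypothetical flag that selects character counting or sequence element counting:

```python
from typing import Sequence


def length_ok(
    standard_text: str,
    tokens: Sequence[str],
    min_len: int,
    max_len: int,
    count_chars: bool,
) -> bool:
    """Check the sentence length against a preset threshold interval."""
    # count_chars=True suits languages such as Chinese or Japanese;
    # count_chars=False counts elements of the word segmentation sequence.
    length = len(standard_text) if count_chars else len(tokens)
    return min_len <= length <= max_len
```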

Third, the sensitive word detection item: sensitive words in the standard text are detected according to a pre-constructed sensitive word dictionary for each language and a multi-pattern matching algorithm.

Specifically, this item checks whether the standard text contains words with sensitive political tendencies (or tendencies against a political party), violent tendencies, or pornographic or vulgar content; such words are called sensitive words. A common scheme is to construct a sensitive word dictionary for each language in advance and then use a multi-pattern matching algorithm (e.g., an AC automaton) to detect whether any sensitive word is matched in the input standard text.
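A compact Aho-Corasick (AC) automaton, as one possible multi-pattern matcher, might look like the following sketch. The function names and the dictionary contents are placeholders chosen for illustration; a real system would load the per-language sensitive word dictionaries.

```python
from collections import deque

def build_automaton(words):
    """Build goto/fail/output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first traversal to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]  # inherit matches ending at the suffix
            queue.append(child)
    return goto, fail, out

def find_sensitive(text, automaton):
    """Return (start_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            hits.append((i - len(word) + 1, word))
    return hits
```

The text is scanned once regardless of dictionary size, which is why this family of algorithms suits large sensitive word lists.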

Fourth, the sentence confusion (perplexity) detection item: the perplexity of the standard text is evaluated using a multi-language sentence perplexity calculation model to determine whether the semantics of the standard text are fluent.

Specifically, a multi-language sentence perplexity calculation model is used to evaluate the perplexity of the standard text, from which it is inferred whether the sentence is semantically coherent.
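The patent does not specify the perplexity model. Assuming a language model that returns a natural-log probability for each token, the usual definition of perplexity (the exponential of the negative mean token log-probability) can be computed as in this sketch; the function name and input format are assumptions.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of one sentence."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A lower value indicates a more fluent sentence under the model; the acceptance threshold would have to be set per language.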

Fifth, the messy code (garbled) character detection item: the Unicode code range corresponding to the target language is looked up; characters in the standard text that do not belong to this range are extracted with a regular expression as initial characters; general characters are filtered out of the initial characters to obtain the garbled characters; the number of garbled characters is counted; and the language semantic result of the standard text is determined.

Specifically, the text of every language in the world is composed of its own characters or letters, and in modern computers each character or letter has a one-to-one corresponding Unicode code point. According to the Unicode code interval of the language in question, characters outside the interval are extracted with a regular expression, and general characters such as digits, symbols and English letters are filtered out; the remainder are regarded as garbled characters and their number is counted. If a monolingual sentence contains too many garbled characters, the sentence is considered to be in a mismatched language or semantically unclear.
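The two-step extraction described above can be sketched as follows. The range table `TARGET_RANGES` (only the CJK Unified Ideographs block for Chinese is shown) and the set of "general" characters are assumptions made for illustration.

```python
import re

# Assumed Unicode interval per target language (character-class fragment).
TARGET_RANGES = {"zh": r"\u4e00-\u9fff"}
# Assumed general characters: printable ASCII, CJK punctuation, full-width forms.
GENERAL_CHARS = re.compile(r"[\u0020-\u007e\u3000-\u303f\uff00-\uffef]")

def garbled_characters(text, lang):
    """Return the characters of text treated as garbled for the given language."""
    # Step 1: characters outside the Unicode interval of the target language.
    outside = re.findall("[^" + TARGET_RANGES[lang] + "]", text)
    # Step 2: drop general characters (digits, symbols, English letters).
    return [ch for ch in outside if not GENERAL_CHARS.fullmatch(ch)]
```

The length of the returned list would then be compared with a preset limit to decide whether the sentence passes.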

Sixth, the punctuation and digit proportion detection item: the total number of characters of the standard text is obtained; the punctuation marks and the digits in the standard text are extracted separately and counted to obtain a first proportion, that of punctuation marks in the total number of characters, and a second proportion, that of digits in the total number of characters; and the limit relation between the first and second proportions and a preset proportion range is determined.

Specifically, punctuation marks and digits are extracted separately and counted, and the proportion of each in the total number of characters of the whole sentence is calculated; these proportions must also stay within a reasonable range. If the limit relation shows that the first proportion and the second proportion are not within the preset proportion range, this item fails.
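A sketch of the proportion calculation is given below. It assumes that punctuation means any character in a Unicode "P" category and that digits are decimal digits (category "Nd"); the upper limits and the decision to fail when either proportion is out of range are illustrative assumptions.

```python
import unicodedata

def punct_digit_ratios(text):
    """Return (punctuation proportion, digit proportion) of the text."""
    total = len(text)
    if total == 0:
        return 0.0, 0.0
    punct = sum(1 for ch in text if unicodedata.category(ch).startswith("P"))
    digits = sum(1 for ch in text if unicodedata.category(ch) == "Nd")
    return punct / total, digits / total

def ratios_ok(text, max_punct=0.3, max_digit=0.3):
    """Pass only if both proportions stay within the assumed upper limits."""
    punct_ratio, digit_ratio = punct_digit_ratios(text)
    return punct_ratio <= max_punct and digit_ratio <= max_digit
```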

Seventh, the word frequency limit detection item: the occurrences of each word in the word segmentation sequence are counted to obtain the highest word frequency, and the highest word frequency is compared with a preset word frequency threshold to determine whether the sentence corresponding to the word segmentation sequence is invalid.

Specifically, the occurrences of each word in the word segmentation sequence are counted to obtain the highest word frequency, and it is determined whether the highest word frequency exceeds a preset word frequency threshold; if it does, excessive repetition is considered to have occurred and the sentence is invalid.
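Counting the highest word frequency is straightforward; the sketch below uses hypothetical function names and leaves the threshold to the caller.

```python
from collections import Counter

def max_word_frequency(tokens):
    """Highest number of occurrences of any single token (0 for an empty list)."""
    return max(Counter(tokens).values(), default=0)

def sentence_invalid(tokens, freq_threshold):
    """A sentence is treated as invalid when one token repeats too often."""
    return max_word_frequency(tokens) > freq_threshold
```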

In addition to the general processing logic above, grammar rules provided by language experts for individual languages can be added to the automated process wherever they can be expressed in code, which to some extent makes up for the lack of existing quality inspection guidance for certain less widely used languages.

It can be understood that the above items are parallel to one another and may be run in any order or in a preset order; that is, the logic to be executed and its order can be arranged freely, and it can be configured whether to stop early once a problem is found.

In one embodiment, the monolingual detection result includes: the matching result obtained by the language identification detection item; the result, obtained by the sentence length restriction detection item, of whether the total number of characters or the number of sequence elements is acceptable; the sensitive words obtained by the sensitive word detection item; the result, obtained by the sentence confusion detection item, of whether the semantics are fluent; the language semantic result obtained by the messy code character detection item; the limit relation obtained by the punctuation and digit proportion detection item; and the result, obtained by the word frequency limit detection item, of whether the sentence is invalid.

Specifically, the detection results of all items are presented, so that problems in the monolingual standard text can be detected comprehensively.

In one embodiment, the bilingual detection rule comprises: a sentence length ratio detection item, a word frequency distance detection item, a digital (number) alignment detection item, a URL alignment detection item, a word alignment detection item, a semantic similarity detection item and an analog (simulated) scoring detection item.

First, the sentence length ratio detection item: the bilingual parallel sentence pair is acquired; a first total number of characters, of the standard text corresponding to the original text, and a second total number of characters, of the standard text corresponding to the translated text, are obtained; the character ratio between the first and second totals is calculated and compared with a preset character ratio threshold. Alternatively, the bilingual parallel sentence pair is acquired; the number of first sequence elements, of the word segmentation sequence corresponding to the original text, and the number of second sequence elements, of the word segmentation sequence corresponding to the translated text, are obtained; the element number ratio between the two is calculated and compared with a preset element number ratio threshold.

Specifically, the ratio of the sentence lengths of a translated sentence pair in two languages should stay within a certain range; if the ratio is too large or too small, the sentence pair may be misaligned. The sentence length restriction detection item is used to obtain the length of the sentence in one language and the length of its translation in the other language, and the two lengths are divided to obtain the ratio; that is, the first total number of characters is divided by the second total number of characters, or the number of first sequence elements is divided by the number of second sequence elements, and it is determined whether the result lies within the interval. The sentence length ratio generally lies in the interval [1/3, 3], but the interval should be adjusted to the characteristics of the input language pair, since sentences in some languages (Lao and Thai) are particularly short and in others (Arabic and Komago) particularly long.
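A sketch of the ratio check follows, using the general interval [1/3, 3] as the default. The function name and the handling of an empty translation are assumptions; the bounds would be tuned for each language pair.

```python
def length_ratio_ok(src_len, tgt_len, low=1 / 3, high=3.0):
    """Check that src_len / tgt_len lies within [low, high]."""
    if tgt_len == 0:
        return False  # an empty translation can never be aligned
    return low <= src_len / tgt_len <= high
```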

Second, the word frequency distance detection item: the bilingual parallel sentence pair is acquired; a first highest word frequency, of the standard text corresponding to the original text, and a second highest word frequency, of the standard text corresponding to the translated text, are obtained; the difference between the first and second highest word frequencies is calculated and compared with a preset difference threshold.

Specifically, the number of occurrences of high-frequency words should be similar on both sides of a translated sentence pair; if the highest word frequencies on the two sides differ greatly, a missing translation can be suspected. The highest word frequency in each of the two standard texts is calculated using the word frequency limit detection item, the two are subtracted to obtain the difference, and it is determined whether the difference exceeds the configured maximum acceptable difference.
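The comparison can be sketched as below, assuming the absolute difference is what is compared with the threshold; the function names are hypothetical.

```python
from collections import Counter

def max_word_frequency(tokens):
    """Highest number of occurrences of any single token (0 for an empty list)."""
    return max(Counter(tokens).values(), default=0)

def word_freq_distance_ok(src_tokens, tgt_tokens, max_diff):
    """Pass when the highest word frequencies of both sides are close enough."""
    diff = abs(max_word_frequency(src_tokens) - max_word_frequency(tgt_tokens))
    return diff <= max_diff
```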

Third, the digital (number) alignment detection item: the bilingual parallel sentence pair is acquired; the numbers in the standard text corresponding to the original text are extracted one by one with a regular expression to obtain a first number set, and the numbers in the standard text corresponding to the translated text are extracted likewise to obtain a second number set; the intersection and the union of the first and second number sets are calculated; the ratio of the size of the intersection to the size of the union is calculated as the number proportion value; and the number proportion value is compared with a preset number ratio.

Fourth, the URL alignment detection item: the bilingual parallel sentence pair is acquired; the URLs in the standard text corresponding to the original text are extracted one by one with a regular expression to obtain a first URL set, and the URLs in the standard text corresponding to the translated text are extracted likewise to obtain a second URL set; the intersection and the union of the first and second URL sets are calculated; the ratio of the size of the intersection to the size of the union is calculated as the URL proportion value; and the URL proportion value is compared with a preset URL ratio.

Specifically, when translating between languages, keywords such as numbers and URLs are generally left untranslated and kept as they are, so it can be checked whether they appear in pairs; if they differ between the two sides, anomalies such as mistranslation or misalignment can be suspected. The numbers and URLs in the bilingual parallel sentence pair are extracted one by one with regular expressions and collected into sets; then, separately for numbers and for URLs, the ratio of intersection to union is calculated and it is determined whether the ratio reaches the minimum acceptable ratio. This rule is not universally applicable, however. For example, Lao counts years by the Buddhist calendar, so the year value differs from that of the many languages that use the Gregorian calendar; in that case identical year values would in fact indicate an error. In actual use, therefore, a reasonable threshold is set, or the rule is enabled or disabled, according to the situation.
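The intersection-over-union check for numbers and URLs can be sketched as follows. The two regular expressions are simplified assumptions (plain decimal numbers and `http`/`https` URLs), and treating two empty sets as fully aligned is an illustrative choice.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")  # simplified number pattern (assumption)
URL_RE = re.compile(r"https?://\S+")      # simplified URL pattern (assumption)

def overlap_ratio(first, second):
    """Size of the intersection divided by size of the union of two sets."""
    union = first | second
    if not union:
        return 1.0  # nothing to align on either side
    return len(first & second) / len(union)

def number_alignment(src_text, tgt_text):
    return overlap_ratio(set(NUMBER_RE.findall(src_text)),
                         set(NUMBER_RE.findall(tgt_text)))

def url_alignment(src_text, tgt_text):
    return overlap_ratio(set(URL_RE.findall(src_text)),
                         set(URL_RE.findall(tgt_text)))
```

Each returned value would be compared with its preset minimum ratio; for language pairs with calendar differences, the number check could simply be skipped.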

Fifth, the word alignment detection item: the bilingual parallel sentence pair is acquired; the words in the word segmentation sequence corresponding to the original text and the words in the word segmentation sequence corresponding to the translated text are matched pairwise through a bilingual dictionary; the word alignment results are counted; and a sentence-level alignment score is calculated.

Specifically, the words in a bilingual parallel sentence pair should have a corresponding mapping relationship, that is, each word should be translated into a synonym in the other language. The words in the word segmentation sequences of the bilingual parallel sentence pair can therefore be matched pairwise through the bilingual dictionary, the word alignment results counted, and a sentence-level alignment score calculated.
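The patent does not give the scoring formula. One plausible sketch, assuming the score is the fraction of source words that have at least one dictionary translation present in the target sequence, is shown below; the dictionary format (source word mapped to a set of translations) is also an assumption.

```python
def alignment_score(src_tokens, tgt_tokens, bilingual_dict):
    """Fraction of source tokens with a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    target = set(tgt_tokens)
    aligned = sum(1 for word in src_tokens
                  if bilingual_dict.get(word, set()) & target)
    return aligned / len(src_tokens)
```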

Sixth, the semantic similarity detection item: the bilingual parallel sentence pair is acquired; the standard text corresponding to the original text is mapped to an embedding in a shared space to obtain a first sentence representation vector; the standard text corresponding to the translated text is mapped to an embedding in the same space to obtain a second sentence representation vector; and the cosine value between the first and second sentence representation vectors is calculated.

Specifically, with the help of a neural network model (such as LASER, or a model trained with a self-designed scheme), the two sentences of the input bilingual parallel sentence pair are each mapped to an embedding in the same space to obtain the corresponding sentence representation vectors, and the cosine value of the two vectors, that is, their cosine similarity, is calculated to represent the semantic similarity of the two sentences. During inspection, corpora with low similarity scores are filtered out.
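The cosine computation itself is shown in the sketch below. The sentence encoder is out of scope here: the vectors are assumed to have been produced already by a multilingual embedding model, and returning 0.0 for a zero vector is an illustrative choice.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```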

Seventh, the analog (simulated) scoring detection item: the bilingual parallel sentence pair is input into a pre-trained machine learning model to obtain a translation confidence score; the machine learning model is trained on bilingual parallel sentence pairs that carry manual scoring results.

Specifically, the simulated scoring detection item lets a machine make an initial judgment of the translation confidence of each parallel sentence pair over large batches of data. The basic principle is to train a machine learning model that imitates manual scoring on previously accumulated bilingual parallel sentence pairs with manual scoring results, and then to score new bilingual parallel sentence pairs with the trained model. Data with poor scores are screened out; the remaining data are subjected to different manual verification strategies according to their score intervals, and the manual results are fed back to retrain the machine learning model.

In addition, the invention also performs copy verification on the test set used in model training. Generally, a data set for machine learning is divided by function into a training set, a development (validation) set and a test (evaluation) set, known in English as train, dev (valid) and test. The training set is usually large (e.g., 5 million sentence pairs) and its quality requirement is not especially high (e.g., >90% accuracy). The test set contains relatively few sentences (e.g., 10,000 sentence pairs), but its quality requirement is very high (e.g., >98% accuracy), since it is used to evaluate the effect of the model.

For a test set, whose quality requirements are higher than those of the training data, the data must be extremely fine-grained. Usually the original text is provided and annotators translate it into other languages. During manual annotation, to keep the translations natural and authentic, annotators are usually forbidden to consult translations produced by external translation engines; otherwise the evaluation would be biased toward the engine consulted. It is therefore also necessary to detect whether a translation is a copy of the output of a public translation engine. Specifically, literal similarity scores (such as BM25, edit distance, Jaccard similarity, BLEU, etc.) are computed between the translation produced by the translation expert and the outputs of one or more translation engines. If several translation engines support the language pair and exactly one score is conspicuously high, differing greatly from the remaining scores, or if only a single engine supports the language pair and its score is higher than a preset threshold, the translation is determined to be a copy made with reference to that translation engine.
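The decision rule can be sketched as below. This is an illustration under several assumptions: `difflib.SequenceMatcher.ratio()` from the Python standard library stands in for the literal similarity measures named above, and the thresholds `high` and `margin` as well as the function names are hypothetical.

```python
from difflib import SequenceMatcher

def literal_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for BLEU, BM25, etc."""
    return SequenceMatcher(None, a, b).ratio()

def is_engine_copy(human_translation, engine_outputs, high=0.9, margin=0.3):
    """Flag a translation that is suspiciously close to one engine's output."""
    scores = sorted((literal_similarity(human_translation, out)
                     for out in engine_outputs), reverse=True)
    if not scores:
        return False
    if len(scores) == 1:
        # Only one engine supports the language pair: plain threshold.
        return scores[0] > high
    # Several engines: one score must stand out clearly from the rest.
    return scores[0] > high and scores[0] - scores[1] > margin
```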

In one embodiment, the bilingual detection result comprises: the result, obtained by the sentence length ratio detection item, of comparing the character ratio with the preset character ratio threshold, or of comparing the element number ratio with the preset element number ratio threshold; the result, obtained by the word frequency distance detection item, of comparing the difference with the preset difference threshold; the result, obtained by the digital alignment detection item, of comparing the number proportion value with the preset number ratio; the result, obtained by the URL alignment detection item, of comparing the URL proportion value with the preset URL ratio; the sentence-level alignment score obtained by the word alignment detection item; the cosine value between the first and second sentence representation vectors obtained by the semantic similarity detection item; and the translation confidence score obtained by the simulated scoring detection item.

Specifically, the detection results of all items are presented, so that problems in the bilingual standard text can be detected comprehensively.

In one implementation, the method for detecting the quality of the machine translation corpus further includes: storing the standard text in a pre-established corpus; and performing duplicate checking on the standard texts in the corpus. That is, using the standard text and the word segmentation sequence obtained after preprocessing, similarity is calculated over all the data and duplicates are found, so that highly similar data are, as far as possible, not stored in the database more than once.

In one embodiment, performing duplicate checking on the standard texts in the corpus comprises: removing symbols from the standard text with a regular expression and merging whitespace characters into a single space; dividing all standard texts into buckets, with standard texts that share the same first word placed in the same bucket; and encrypting (hashing) each standard text in each bucket, standard texts with the same ciphertext being duplicate sentences.

Specifically, this embodiment detects literal matches. Symbols are removed from the standard text with a regular expression, and redundant whitespace characters (including spaces, tabs, line feeds and the like) are merged into a single space. To reduce computational complexity, the texts are divided into buckets by their first word; each text is then processed with MD5 or another such algorithm, texts with the same ciphertext are grouped together, and the texts within one group are duplicate sentences. For example, given the sentences "Today the weather is good.", "Today what is for lunch?", "Tomorrow it rains" and "The day after tomorrow there is thunder" (whose first words in the original Chinese are "today", "today", "tomorrow" and "the day after tomorrow"), the bucketing result is: (1) "Today the weather is good." and "Today what is for lunch?"; (2) "Tomorrow it rains"; (3) "The day after tomorrow there is thunder". Each piece of data in bucket (1) is then encrypted independently and items with the same ciphertext are grouped together, and the same is done for bucket (2) and bucket (3). Data whose ciphertexts are identical are classified as literal duplicates, because their underlying text is identical.
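The literal duplicate check can be sketched as follows. It assumes whitespace-delimited text (for languages such as Chinese, the first element of the word segmentation sequence would be used as the bucket key instead), and the function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

def normalize(text):
    """Remove symbols and merge runs of whitespace into a single space."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def exact_duplicates(texts):
    """Return groups of texts that are identical after normalization."""
    buckets = defaultdict(list)
    for text in texts:
        norm = normalize(text)
        first_word = norm.split(" ", 1)[0]
        buckets[first_word].append((text, norm))  # bucket by first word
    groups = []
    for items in buckets.values():
        by_digest = defaultdict(list)
        for original, norm in items:
            digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
            by_digest[digest].append(original)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

In a distributed setting the bucket key is what allows each bucket to be processed independently.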

In another embodiment, performing duplicate checking on the standard texts in the corpus comprises: removing symbols from the standard text with a regular expression and merging whitespace characters into a single space; dividing all standard texts into buckets, with standard texts that share the same first word placed in the same bucket and standard texts that share the same last word placed in the same bucket; assigning a unique node to each standard text in a bucket and calculating the edit distance between every two pieces of data in the bucket; comparing the edit distance with a preset distance threshold, connecting the nodes of standard texts found to be similar to obtain connected subgraphs, and obtaining a clustering result from the connected subgraphs, standard texts belonging to the same cluster being duplicate sentences.

Specifically, this embodiment performs fuzzy similarity detection. After symbols are removed and the words are segmented, the texts are divided into buckets by identical first word and, separately, by identical last word, and each piece of data is assigned a unique node. Within each bucket, the edit distance (or another literal similarity measure) is calculated for every pair of data; with a threshold set in advance, nodes judged to be similar are connected in a graph, a clustering result is obtained from the connected subgraphs, and texts in the same cluster are duplicate sentences. In the graph, edges join the nodes of similar data to form connected subgraphs; it suffices to extract the connected subgraphs one by one, each being a group of similar data. Isolated single nodes that remain have no similar corpus entries.
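The bucketing, pairwise comparison and connected-subgraph clustering can be sketched as below. Two assumptions are made for illustration: the edit distance is computed over tokens rather than characters, and a union-find structure is used to extract the connected subgraphs.

```python
from collections import defaultdict
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (item_a != item_b)))
        prev = cur
    return prev[-1]

def fuzzy_duplicates(token_lists, max_dist=1):
    """Cluster sentence indices whose edit distance is at most max_dist."""
    parent = list(range(len(token_lists)))  # one node per sentence

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for idx, tokens in enumerate(token_lists):
        if tokens:
            buckets[("first", tokens[0])].append(idx)
            buckets[("last", tokens[-1])].append(idx)
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if edit_distance(token_lists[i], token_lists[j]) <= max_dist:
                parent[find(i)] = find(j)  # connect similar nodes
    clusters = defaultdict(list)
    for idx in range(len(token_lists)):
        clusters[find(idx)].append(idx)
    return [c for c in clusters.values() if len(c) > 1]
```

Only pairs inside the same bucket are compared, which avoids the quadratic cost of comparing every sentence with every other.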

In one implementation, the quality detection method for machine translation corpus runs in a Spark environment. The invention is developed on Spark, which makes it easy to schedule the computing power of multiple machines and ensures corpus processing efficiency. Specifically, every module is developed on Spark, so the whole quality inspection tool can be deployed in a multi-machine environment to increase concurrency and improve overall performance. The code is highly extensible, which makes it easy to add new check logic or support for new languages under each major module. In actual use it is possible to use only some of the modules; for example, if only monolingual corpus quality inspection is performed, the parallel corpus quality inspection module is not needed.

When debugging is carried out, the results of all stages are integrated. If there is a fatal error or a score is really too low, the data are manually re-annotated or discarded; the remaining data are divided into intervals according to how they perform, and are either sent for manual re-inspection or, for high-confidence corpora, stored directly in the database for production use. Here, a fatal error refers to a problem such as a sensitive word or a sentence that is too short or too long; because such checks involve no model error, anything they detect is certainly a problem and is therefore defined as a fatal error. A score refers to the value output by certain quality inspection stages, such as semantic similarity and perplexity.

In addition, development of the invention started with Chinese as the focus of quality inspection and was then extended to foreign languages. This embodiment therefore integrates and optimizes the useful ideas and capabilities of multiple quality inspection tools, is developed with Chinese at its center, covers many languages, and supports continued extension of the quality inspection tool.

In summary, first, this embodiment provides a comprehensive quality detection system for the corpora required by multi-language machine translation models. On the one hand, it integrates and optimizes the useful ideas and capabilities of several quality inspection tools; on the other hand, it covers a wider range of languages and is suitable for a variety of scenarios. The embodiment thus provides more comprehensive and reliable automatic detection and gives an effective guarantee for the corpora that ultimately empower the translation model. Second, the embodiment is implemented on Spark; compared with single-machine quality inspection tools, it adds multi-machine processing capability and greatly improves corpus processing efficiency.

Further, as shown in fig. 2, based on the method for detecting the quality of the machine-translated corpus, the present invention further provides a device for detecting the quality of the machine-translated corpus, including:

the preprocessing module 100 is configured to, when a machine translation corpus to be detected is received, preprocess the machine translation corpus to obtain a standard text and a word segmentation sequence;

and the detection module 200 is configured to perform problem detection processing on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result.

Further, as shown in fig. 3, based on the quality detection method for machine-translated corpus, the present invention further provides a terminal, including: a memory 20, a processor 10 and a quality detection program 30 of machine translation corpus stored on the memory 20 and operable on the processor 10, wherein the quality detection program 30 of machine translation corpus implements the steps of the method for detecting quality of machine translation corpus as described above when executed by the processor.

The present invention also provides a computer-readable storage medium storing a computer program executable for implementing the steps of the method for quality detection of machine-translated corpus as described above.

In summary, the method and apparatus for detecting the quality of machine translation corpus disclosed in the present invention include: when a machine translation corpus to be detected is received, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence; and performing problem detection on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result. After the machine translation corpus to be detected is preprocessed into the standard text and the word segmentation sequence, problem detection is performed on them automatically using the preset multi-language detection rule. The method is therefore suitable for multiple languages and has a wide application range, requires no manual quality inspection by translators, improves detection efficiency, and ensures the quality of the detection result.

It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims

1. A quality detection method for machine translation corpora is characterized by comprising the following steps:

2. The method for detecting the quality of the machine-translated corpus according to claim 1, wherein the machine-translated corpus is a monolingual text or a bilingual text, and the bilingual text comprises an original text and a translated text; when receiving a machine translation corpus to be detected, preprocessing the machine translation corpus to obtain a standard text and a word segmentation sequence, wherein the method comprises the following steps:

3. The method for detecting the quality of the machine translation corpus according to claim 1, wherein the step of preprocessing the machine translation corpus to obtain a canonical text and a word segmentation sequence when the machine translation corpus to be detected is received comprises:

when a machine translation corpus to be detected is received, determining a target language to which the machine translation corpus belongs;

4. The method for detecting the quality of the machine-translated corpus according to claim 3, wherein the step of performing unified text format processing on the machine-translated corpus comprises:

5. The method of claim 2, wherein said multi-lingual detection rules comprise monolingual detection rules and bilingual detection rules; performing problem detection processing on the standard text and the word segmentation sequence according to a preset multi-language detection rule to obtain a detection result, wherein the problem detection processing comprises the following steps:

6. The method of claim 5, wherein the single-language detection rule comprises: language identification detection items, sentence length limitation detection items, sensitive word detection items, sentence confusion degree detection items, messy code character detection items, punctuation digit proportion detection items and word frequency limitation detection items;

the sentence confusion detection item detection step includes: evaluating the confusion degree of the standard text by using a multi-language sentence confusion degree calculation model, and determining whether the semantics of the standard text are smooth;

the detection step of the messy code character detection item comprises the following steps: searching Unicode codes corresponding to the target language, extracting initial characters which do not belong to the Unicode codes in the standard text by using a regular mode, filtering general characters in the initial characters to obtain messy code characters, calculating the number of the messy code characters, and determining a language semantic result of the standard text;

the detection step of the punctuation number proportion detection item comprises the following steps: acquiring the total number of characters of the standard text, respectively extracting punctuations and numbers in the standard text, counting the number of the punctuations and the number of the numbers to obtain a first proportion of the punctuations in the total number of the characters and a second proportion of the numbers in the total number of the characters, and determining a limit relation between the first proportion and the second proportion and a preset proportion range;

7. The method for quality inspection of machine-translated corpus according to claim 6, wherein said monolingual inspection result comprises: matching results obtained by detecting the language identification detection items; whether the total number of characters obtained by the sentence length limitation detection item is qualified or not or whether the number of sequence elements is qualified or not is detected; the sensitive words are obtained by detecting the sensitive word detection items; detecting whether the semantics obtained by the sentence confusion detection item are smooth or not; the language semantic result obtained by the detection of the messy code character detection item; the limit relation obtained by detecting the punctuation digit ratio detection item; and detecting whether the obtained statement is invalid or not by the word frequency limitation detection item.

8. The method of claim 5, wherein the bilingual detection rule comprises: a sentence length ratio detection item, a word frequency distance detection item, a digital alignment detection item, a URL alignment detection item, a word alignment detection item, a semantic similarity detection item and an analog scoring detection item;

the detecting step of the digital alignment detection item comprises the following steps: acquiring the bilingual parallel sentence pairs, extracting numbers in a standard text corresponding to an original text one by using a regular mode to obtain a first number set and numbers in a standard text corresponding to a translated text to obtain a second number set, calculating a number intersection and a number union between the first number set and the second number set, calculating a number proportion value of the number intersection and the number union, and comparing the number proportion value with a preset number ratio;

the detection step of the semantic similarity detection item comprises: acquiring the bilingual parallel sentence pairs; for each pair, mapping the standard text corresponding to the original text to an embedding in a shared space to obtain a first sentence characterization vector, and mapping the standard text corresponding to the translated text to an embedding in the same space to obtain a second sentence characterization vector; and calculating the cosine value between the first sentence characterization vector and the second sentence characterization vector;
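
The two detection steps above reduce to small computations, sketched below under stated assumptions. The regular expression, the function names, and the convention of returning 1.0 when neither sentence contains a number are illustrative choices, not taken from the claims. The embedding model is out of scope here: `cosine_value` simply takes two vectors that are assumed to come from the same embedding space.

```python
import math
import re

# Assumption: a "number" is a run of digits with an optional decimal part.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_proportion_value(source_text, target_text):
    """Size of the intersection of the two number sets divided by the size of their union."""
    first_set = set(NUMBER_RE.findall(source_text))
    second_set = set(NUMBER_RE.findall(target_text))
    union = first_set | second_set
    if not union:
        # Assumption: a pair with no numbers on either side counts as fully aligned.
        return 1.0
    return len(first_set & second_set) / len(union)


def cosine_value(first_vector, second_vector):
    """Cosine of the angle between two equal-length sentence vectors."""
    dot = sum(x * y for x, y in zip(first_vector, second_vector))
    norm_first = math.sqrt(sum(x * x for x in first_vector))
    norm_second = math.sqrt(sum(y * y for y in second_vector))
    if norm_first == 0 or norm_second == 0:
        return 0.0
    return dot / (norm_first * norm_second)
```

A pair would then be flagged when `number_proportion_value` falls below the preset number ratio; how the cosine value is thresholded is not stated in this claim.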

9. The method for quality inspection of machine-translated corpus according to claim 8, wherein the bilingual detection result comprises: a comparison result between the character ratio obtained by the sentence length ratio detection item and a preset character ratio threshold, or between the element number ratio and a preset element number ratio threshold; a comparison result between the difference obtained by the word frequency distance detection item and a preset difference threshold; a comparison result between the number proportion value obtained by the digital alignment detection item and the preset number ratio; a comparison result between the URL proportion value obtained by the URL alignment detection item and a preset URL ratio; the sentence-level alignment score obtained by the word alignment detection item; the cosine value between the first sentence characterization vector and the second sentence characterization vector obtained by the semantic similarity detection item; and the translation confidence score obtained by the simulation scoring detection item.

10. The method for detecting quality of machine-translated corpus according to claim 1, wherein said method for detecting quality of machine-translated corpus further comprises:

storing the standard text to a pre-established corpus;

11. The method for detecting quality of machine-translated corpus according to claim 10, wherein performing data retrieval on the standard texts in the corpus comprises:

dividing all the standard texts into buckets, with standard texts that share the same first word placed in the same bucket, and encrypting each standard text in each bucket, wherein standard texts with the same ciphertext are repeated sentences.
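
The bucket-then-compare idea of this claim might look like the sketch below. Several points are assumptions: the claim speaks of encrypting each text, and a SHA-256 digest is swapped in here as a stand-in for whatever scheme is actually used; the first word is taken by whitespace splitting rather than from the word segmentation sequence; and the function name is hypothetical.

```python
import hashlib
from collections import defaultdict


def find_repeated_sentences(texts):
    """Return groups of indices whose texts are exact duplicates."""
    # Step 1: bucket texts by their first word (assumption: whitespace tokenization).
    buckets = defaultdict(list)
    for index, text in enumerate(texts):
        words = text.split()
        first_word = words[0] if words else ""
        buckets[first_word].append(index)

    # Step 2: within each bucket, texts with the same digest are duplicates.
    groups = []
    for indices in buckets.values():
        by_digest = defaultdict(list)
        for index in indices:
            digest = hashlib.sha256(texts[index].encode("utf-8")).hexdigest()
            by_digest[digest].append(index)
        groups.extend(group for group in by_digest.values() if len(group) > 1)
    return groups
```

Bucketing keeps each comparison group small, which suits partitioned processing where every bucket can be handled independently.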

12. The method for detecting quality of machine-translated corpus according to claim 10, wherein performing data retrieval on the standard texts in the corpus comprises:

comparing the edit distance with a preset distance threshold, connecting the points corresponding to standard texts that the comparison finds to be similar to obtain connected subgraphs, and obtaining a clustering result from the connected subgraphs, wherein standard texts belonging to the same class in the clustering result are repeated sentences.
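
The near-duplicate clustering in this claim can be illustrated as follows. This sketch assumes Levenshtein distance as the edit distance, compares all pairs directly, and uses a union-find structure to extract connected subgraphs; the function names and the default threshold of 2 are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                         # deletion
                               current[j - 1] + 1,                      # insertion
                               previous[j - 1] + (char_a != char_b)))   # substitution
        previous = current
    return previous[-1]


def cluster_repeated_sentences(texts, max_distance=2):
    """Group indices of texts linked by edit distance <= max_distance."""
    parent = list(range(len(texts)))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # An edge joins two texts when they are within the distance threshold.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if edit_distance(texts[i], texts[j]) <= max_distance:
                parent[find(i)] = find(j)

    # Each connected subgraph becomes one cluster; singletons are dropped.
    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return [members for members in clusters.values() if len(members) > 1]
```

Because clusters are connected subgraphs, two texts can share a cluster through a chain of similar neighbours even when their own distance exceeds the threshold. The all-pairs loop is quadratic, so a large corpus would need some form of candidate pruning first.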

13. The method for detecting the quality of the machine-translated corpus according to claim 1, wherein the method is implemented in a Spark environment, and during development and debugging the input text is a parallel sentence pair of Chinese and another language.

14. A quality detection device for a machine translation corpus is characterized by comprising:

15. A terminal, comprising: a memory, a processor and a program for detecting quality of machine-translated corpus stored in the memory and executable on the processor, wherein the program for detecting quality of machine-translated corpus realizes the steps of the method for detecting quality of machine-translated corpus as claimed in any one of claims 1 to 13 when executed by the processor.

16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which can be executed to implement the steps of the method for detecting the quality of machine-translated corpus according to any one of claims 1 to 13.