CN111178089B

CN111178089B - Bilingual parallel data consistency detection and correction method

Info

Publication number: CN111178089B
Application number: CN201911324133.3A
Authority: CN
Inventors: 杜权; 李自荐
Original assignee: Shenyang Yayi Network Technology Co ltd
Current assignee: Shenyang Yayi Network Technology Co ltd
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2023-03-14
Anticipated expiration: 2039-12-20
Also published as: CN111178089A

Abstract

The invention discloses a bilingual parallel data consistency detection and correction method, which comprises the following steps: performing word segmentation on the source-language and target-language monolingual data sets in a basic data set to form a segmented bilingual parallel data set; acquiring word alignment information for the bilingual parallel data set to obtain the word correspondence of the sentences in the data set, and using it as an auxiliary judgment to obtain an entity correspondence table; correcting sentence pairs whose sequence numbers are inconsistent; detecting, on the basis of the segmented bilingual parallel data set, the consistency of the parenthesis content in each sentence pair; evaluating any detected inconsistency of parenthesis content and applying a correction or deletion operation; detecting word adhesion in the data and correcting it by splitting; and finally obtaining the processed bilingual parallel data set. The method accurately identifies and corrects inconsistencies, word adhesion and similar conditions in sentences, improves the quality of bilingual data, and improves neural machine translation performance.

Description

Bilingual parallel data consistency detection and correction method

Technical Field

The invention relates to the field of machine translation, in particular to a bilingual parallel data consistency detection and correction method.

Background

In recent years, machine translation systems built on deep neural networks, referred to as neural machine translation for short, have become the mainstream approach in this area of artificial intelligence. However, as a neural network model with very strong learning ability, a neural machine translation model usually relies on a large-scale bilingual parallel corpus during training, and the quality of that corpus seriously affects the performance of the model. With the development of internet technology, a large amount of unstructured bilingual parallel data is available on the network, but its quality cannot be guaranteed. In existing bilingual parallel corpora, apart from sentence pairs that are very poor mutual translations, for example pairs with language errors or seriously missing content, the noise in the data comes mainly from small problems, and these also seriously affect the training of a neural machine translation model.

In past work on consistency processing of bilingual parallel corpora for neural machine translation, conditions such as inconsistent entity correspondence, word adhesion and inconsistent sequence numbers have been difficult to identify correctly with general detection methods, because these problems have little influence on the automatic metrics that evaluate whether two sentences are mutual translations. Handling these small problems is nevertheless important: when they are present in the training data, the neural machine translation model may reproduce the same problems in its output, which degrades the final translations.

In the field of machine translation, improving the quality of the training corpus for neural machine translation models has always been a key research direction. One of the main reasons is that, because corpora currently come from diverse sources, bilingual data may contain a variety of problems. In the field of computer vision, removing noise from training data so that the data is of high quality is important for training a neural network model, and cleaning image data to improve model learning has become essential basic work. Similarly, training a neural machine translation model with reliable parameter estimates requires a large number of high-quality parallel sentence pairs, so insufficient data quality often leads to deficient model performance. Because a deep neural machine translation model has very strong learning ability, it is very sensitive to detail problems in its training data; if the training data contains too many such problems, the performance of the final model suffers, which affects translation quality and user experience.

Disclosure of Invention

To address the problem that, in the prior art, the bilingual corpora used to train neural machine translation models are of uneven quality, so that the various problems appearing in existing data need to be corrected at their root, the invention provides a bilingual parallel data consistency detection and correction method that improves data quality.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

The invention discloses a bilingual parallel data consistency detection and correction method, which comprises the following steps:

1) Acquiring a public bilingual parallel corpus data set in the specified language direction from a public data set website as the basic data set for data consistency correction, and forming monolingual data sets from the source language sentences and the target language sentences in the basic data set, respectively, as the main data sets for subsequent learning of sentence consistency correction rules;

2) Performing word segmentation on the source-language and target-language monolingual data sets in the basic data set using an open-source word segmentation technique, and forming the segmented bilingual parallel data set from the segmented data sets;

3) Acquiring word alignment information for the segmented bilingual parallel data set using the fast_align word alignment technique to obtain the word correspondence of the sentences in the data set;

4) Counting, on the segmented bilingual parallel data set, the occurrence frequency of the correspondences of the named entities appearing in each sentence, and using the word correspondence as an auxiliary judgment to obtain an entity correspondence table; using the entity correspondence table to correct the consistency of named entities in sentences;

5) Judging, on the segmented bilingual parallel data set, whether a sentence pair has inconsistent sequence numbers, and if so, correcting the sentence pair to ensure that the finally processed data are mutual translations;

6) Detecting, according to the obtained word correspondence and on the basis of the segmented bilingual parallel data set, the consistency of the parenthesis content in each sentence pair; evaluating any detected inconsistency, determining from the final evaluation result how the parenthesis contents correspond, and applying a correction or deletion operation;

7) Starting from the actual data in the segmented bilingual parallel data set, detecting word adhesion in the data and correcting it by splitting; before correction, judging whether the current problem position can be split, so as to maximize splitting accuracy and ensure data quality;

8) After the consistency detection method has been applied to all sentence pairs in the basic data set, obtaining the final processed bilingual parallel data set, which has less data noise and higher data quality than the original basic data set.

In step 4), the word correspondence and its occurrence frequency in the data set are used to generate a named entity correspondence frequency table; a standard entity correspondence frequency table is generated from the high-frequency correspondences; and sentences with inconsistent correspondence are normalized according to the standard entity correspondence frequency table.

In step 5), sentence sequence numbers are matched using the word alignment information and the actual condition of the sentence pair, a specific processing mode is selected for the case that occurs, and the sequence number part of the current sentence pair is corrected to ensure that the sentences correspond. Specifically, the sentence at one end is taken as the standard, and the sequence number part of the sentence at the other end is replaced with the sequence number content of the standard sentence, so that the sequence number parts are consistent.

In step 6), the word alignment information is used to evaluate the consistency of the parenthesis content in the data, ensuring that parenthesis content in a sentence pair is mutually translated and consistent.

In step 7), the word alignment information and the word correspondence frequency index are used to judge whether word adhesion exists at a given position in a sentence, and the processability of each position that may have a word adhesion problem is evaluated so that the most reasonable processing mode is chosen.

The invention has the following beneficial effects and advantages:

1. Building on the original data processing task, the invention applies a method based on word alignment and the idea of new word discovery to the bilingual sentence pairs in the basic sentence set, so that inconsistencies, word adhesion and similar conditions in sentences can be identified and corrected effectively and accurately, the quality of bilingual data is improved, and neural machine translation performance can be improved indirectly.

2. The invention improves data quality by means of word alignment and correspondence frequency statistics. Its architecture is transparent, it can effectively process sentences with inconsistent correspondence, word adhesion and similar conditions, and it is a general data cleaning method that corrects sentence problems quickly and chooses the processing operation according to the actual situation, giving relatively efficient and accurate data cleaning. The program structure is simple and it runs fast.

Drawings

FIG. 1 is a flow chart of the method for detecting and correcting the consistency of entity parts according to the present invention;

FIG. 2 is a flowchart of a method for detecting and correcting correspondence between parenthesis contents according to the present invention;

Detailed Description

The invention is further elucidated with reference to the accompanying drawings.

The invention provides an effective, portable and fast data correction method that uses several technical means to correct problems in bilingual data.

The invention discloses a bilingual parallel data consistency detection and correction method, which comprises the following steps:

In step 1), a public bilingual data set whose sentence pairs are mutual translations is used as the basic data set for data consistency correction. In practice, machine translation training data is obtained mainly by collecting and collating public data sets, but because the data comes from various sources its quality cannot be guaranteed.

In general, training a high-quality neural machine translation model requires high-quality, large-scale bilingual parallel data. However, the quality of large-scale data sets is difficult to guarantee, so automatic, computer-based processing of data noise is a very important operation. To address this, the invention takes an existing public bilingual data set and processes the various complex noise conditions in it to obtain mutually translated bilingual sentences of higher quality, so that the resulting data set is better suited to training a neural machine translation model and the final translation model performs better.

In step 2), an open-source word segmentation method is used to segment the source language data and the target language data, each according to its language, so that the original sentences are split into word sequences.

Word segmentation is applied to the existing bilingual sentences using an open-source word segmentation method. Its main function is to provide data support for downstream tasks of neural machine translation model training, because several subsequent operation steps are carried out on word sequences and can proceed smoothly only after this operation has been performed.
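As an illustration of step 2), the following minimal sketch splits both sides of a corpus into word sequences. The source does not name a specific segmentation tool, so the regular-expression tokenizer here is only a stand-in for whichever open-source segmenter is used; the names `segment` and `segment_corpus` are hypothetical.

```python
import re
from typing import List, Tuple

# Stand-in tokenizer (an assumption): words, optionally with an internal
# apostrophe, or single punctuation marks.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def segment(sentence: str) -> List[str]:
    """Split one sentence into a word sequence."""
    return TOKEN.findall(sentence)

def segment_corpus(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Segment both sides and keep the sentence pairs aligned by position."""
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of sentences")
    return [(segment(s), segment(t)) for s, t in zip(src, tgt)]

pairs = segment_corpus(["1. Scope (WHO)"], ["1. Champ (OMS)"])
```

Note that this stand-in separates brackets and sequence number punctuation into their own tokens, which is convenient for the later bracket checks.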

In step 3), word alignment is performed on the segmented data set obtained in step 2) using the open-source word alignment tool fast_align, yielding the word correspondence information of each bilingual sentence pair for use in subsequent processing.
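For reference, fast_align reads one sentence pair per line in the form `source ||| target` and writes one line of `i-j` index pairs per sentence pair. The sketch below only prepares that input and parses that output; it does not run the aligner, and the helper names are hypothetical.

```python
from typing import List, Tuple

def to_fast_align_line(src_tokens: List[str], tgt_tokens: List[str]) -> str:
    """Format one segmented sentence pair as 'source ||| target'."""
    return " ".join(src_tokens) + " ||| " + " ".join(tgt_tokens)

def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse a line such as '0-0 1-2' into (source index, target index) pairs."""
    links = []
    for item in line.split():
        i, j = item.split("-")
        links.append((int(i), int(j)))
    return links

pair_line = to_fast_align_line(["the", "WHO", "said"], ["l'", "OMS", "a", "dit"])
links = parse_alignment("0-0 1-1 2-2 2-3")
```

Each parsed pair says that the source word at index `i` corresponds to the target word at index `j`; the later steps consume exactly this structure.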

In step 4), the word correspondence and its occurrence frequency in the data set are used to generate a named entity correspondence frequency table; a standard entity correspondence frequency table is generated from the high-frequency correspondences; and sentences with inconsistent correspondence are normalized according to the standard entity correspondence frequency table. The main purpose of this step is to correct inconsistent named entities in the data, and the objects processed are organization name abbreviations. Such abbreviations are short, are generally written as a run of English capital letters, have few special cases and generally need to be kept consistent at both ends of the bilingual data, so processing them introduces little ambiguity and preserves data quality. During processing, the method first extracts words consisting of consecutive capital letters from the source language sentence and sets a length threshold L; extracted words longer than the threshold are discarded, because the method considers that processing an overly long organization name carries a higher risk. The corresponding part of the target language sentence is then obtained from the word correspondence information produced in step 3); a correspondence is accepted as correct only if the current source language position and the corresponding target language position are aligned in both directions. The step also counts the occurrence frequency of the entity correspondences obtained and takes the high-frequency correspondences as the standard, with the frequency threshold set to θ: correspondences whose frequency is higher than θ are combined into a standard entity correspondence table. Entity correspondences that are not in the table are normalized according to the standard correspondence table, as shown in FIG. 1.
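The entity step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each example carries a forward and a reverse alignment as sets of (source index, target index) pairs, "bidirectional correspondence" is taken to mean a link present in both sets, and the names `entity_links`, `build_table`, `normalize`, `max_len` (the threshold L) and `theta` are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

Links = Set[Tuple[int, int]]                      # (source index, target index)
Example = Tuple[List[str], List[str], Links, Links]
ABBREV = re.compile(r"[A-Z]{2,}")                 # run of capital letters

def entity_links(src: List[str], fwd: Links, rev: Links, max_len: int) -> List[Tuple[int, int]]:
    """Find abbreviations no longer than max_len with exactly one bidirectional link."""
    both = fwd & rev
    found = []
    for i, word in enumerate(src):
        if ABBREV.fullmatch(word) and len(word) <= max_len:
            targets = [j for s, j in both if s == i]
            if len(targets) == 1:
                found.append((i, targets[0]))
    return found

def build_table(corpus: List[Example], max_len: int = 6, theta: int = 2) -> Dict[str, str]:
    """Keep, per abbreviation, its most frequent correspondence seen more than theta times."""
    counts: Counter = Counter()
    for src, tgt, fwd, rev in corpus:
        for i, j in entity_links(src, fwd, rev, max_len):
            counts[(src[i], tgt[j])] += 1
    table: Dict[str, str] = {}
    for (s_word, t_word), n in counts.most_common():
        if n > theta and s_word not in table:
            table[s_word] = t_word
    return table

def normalize(example: Example, table: Dict[str, str], max_len: int = 6) -> List[str]:
    """Rewrite the target-side entity to the standard form from the table."""
    src, tgt, fwd, rev = example
    fixed = list(tgt)
    for i, j in entity_links(src, fwd, rev, max_len):
        if src[i] in table:
            fixed[j] = table[src[i]]
    return fixed

good = (["WHO", "said"], ["OMS", "dit"], {(0, 0), (1, 1)}, {(0, 0), (1, 1)})
bad = (["WHO", "said"], ["OSM", "dit"], {(0, 0), (1, 1)}, {(0, 0), (1, 1)})
table = build_table([good, good, good, bad], max_len=6, theta=2)
```

With these toy inputs the frequent pairing is kept as the standard and the rare variant in `bad` is rewritten to it by `normalize(bad, table)`.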

In step 5), sentence sequence numbers are matched using the word alignment information and the actual condition of the sentence pair, a specific processing mode is selected for the case that occurs, and the sequence number part of the current sentence pair is corrected to ensure that the sentences correspond. Specifically, the sentence at one end is taken as the standard, and the sequence number part of the sentence at the other end is replaced with the sequence number content of the standard sentence, so that the sequence number parts are consistent. Whether a bilingual parallel sentence pair has inconsistent sequence numbers is judged from the segmented bilingual parallel sentence pairs obtained in step 2); sentence pairs with inconsistent sequence numbers are corrected to ensure that the finally processed data are mutual translations.

In bilingual data, a sequence number often appears at the beginning of a sentence, but errors in data processing can leave the sequence numbers at the beginning of the two sentences mismatched or cause a sequence number to be lost. If a large number of such sentence pairs appear in the training data, the model may produce translations whose sequence numbers do not correspond when it translates sentences that contain sequence numbers. In this step, the method extracts the sentence pairs in which a number appears within a span of length R at the beginning of a sentence, obtaining a data set T_s of sentence pairs with sequence numbers. According to analysis, the cases in T_s in which the sequence numbers do not correspond can be handled efficiently and at low cost, mainly by the following means:

(1) Both ends have a well-formed sequence number part and the two correspond, but their formats differ and need to be corrected. The method removes the sequence number part of the target language sentence and inserts the sequence number of the source language sentence at its original position.

(2) Only one of the source and target language sentences has a sequence number, and the sequence number at the other end is partially or completely lost, so it needs to be supplemented. The sentence that has a sequence number is taken as the standard; its sequence number part is extracted and added at the corresponding position of the sentence at the other end, which ensures that the sequence numbers correspond.

(3) The leading span of length R in the source and target sentences contains a number that is not a sequence number. Sentence pairs in this case are not processed.
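The three cases above can be sketched as follows. This is an illustrative sketch under assumptions the source does not specify: it works on raw strings, R is measured in characters, a sequence number is taken to be a short number followed by `.` or `)` or enclosed in brackets, and only complete loss is handled in case (2). The names `SEQ`, `leading_seq` and `fix_sequence_numbers` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed sequence number shapes: "(1)", "1.", "2)", "1.2."
SEQ = re.compile(r"^\s*(\(\d{1,3}\)|\d{1,3}(?:\.\d{1,3})*[.)])\s+")

def leading_seq(sentence: str, r: int = 8) -> Optional[str]:
    """Return the sequence number found within the first r characters, if any."""
    m = SEQ.match(sentence)
    if m and m.end(1) <= r:
        return m.group(1)
    return None

def fix_sequence_numbers(src: str, tgt: str, r: int = 8) -> Tuple[str, str]:
    s, t = leading_seq(src, r), leading_seq(tgt, r)
    if s and t:
        if s != t:                       # case (1): unify on the source format
            tgt = f"{s} {SEQ.sub('', tgt, count=1)}"
        return src, tgt
    if s:                                # case (2): target lost its number
        return src, f"{s} {tgt.lstrip()}"
    if t:                                # case (2): source lost its number
        return f"{t} {src.lstrip()}", tgt
    return src, tgt                      # case (3) or no number: unchanged

fixed = fix_sequence_numbers("(1) Scope of the rules", "1. Champ d'application")
```

A leading number such as a year does not match the assumed shapes, so the pair falls through to the last branch and is left untouched, which corresponds to case (3).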

In step 6), the word alignment information is used to evaluate the consistency of the parenthesis content in the data, ensuring that parenthesis content in a sentence pair is mutually translated and consistent. Specifically, the parenthesis content of the sentence at one end is extracted, the probability that it corresponds to the content of the sentence at the other end is calculated from the word alignment information, and the consistency of the parenthesis content is evaluated from this probability value. In bilingual parallel data sets, many sentences contain a parenthesized part, which mainly plays an explanatory role in the sentence. In existing training data sets, the parenthesis content of the source language sentence and that of the target language sentence often fail to correspond correctly, that is, the parenthesis content at the two ends is not translated or is missing; training a neural machine translation model on a data set containing such sentence pairs greatly affects its final translation quality.

In this step, the method first selects, from the basic bilingual data set obtained in step 1), the sentence pairs in which the source or target language sentence has parenthesis content, forming a data subset T_b. Parenthesis content is then extracted from each sentence pair in T_b. If both the source language sentence and the target language sentence contain parenthesis content, the contents are extracted separately, and the word alignment result obtained in step 3) is used to determine whether the parenthesis contents at the two ends are mutual translations. If they are, the current sentence pair is left unchanged. If they do not correspond, the method considers that the current sentence pair contains a mistranslation or an omitted translation, and the pair is discarded. In addition, if only the sentence at one end contains parenthesis content, the current sentence pair is discarded directly.

The method regards sentences whose inter-translation probability is higher than a certain threshold θ as being mutual translations. The formula for the inter-translation probability is shown below:

where N_m is the number of words that correspond to each other between the source and target languages, and N_s and N_t are the numbers of words in the source and target language sentences, respectively.

From the above, the following cases can be distinguished: (1) both the source language sentence and the target language sentence have parenthesis content, and the contents correspond to each other; (2) both the source language sentence and the target language sentence have parenthesis content, but the contents do not correspond; (3) only one of the source and target language sentences has a parenthesized part. The specific implementation of this step is shown in FIG. 2.
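The bracket check can be sketched as below. The exact probability formula is not reproduced in this text, so the score used here is an assumption: the share of words inside the two brackets that are linked to a word inside the other bracket, with counts restricted to the bracket content. The names `bracket_span`, `bracket_score`, `judge_pair` and the default `theta` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # token indices [start, end) strictly inside the brackets

def bracket_span(tokens: List[str]) -> Optional[Span]:
    """Locate the content of the first '(' ... ')' pair, if there is one."""
    if "(" not in tokens:
        return None
    start = tokens.index("(")
    try:
        end = tokens.index(")", start)
    except ValueError:
        return None
    return start + 1, end

def bracket_score(s: Span, t: Span, links: Set[Tuple[int, int]]) -> float:
    """Assumed score: linked words inside both brackets / all words inside them."""
    n_s, n_t = s[1] - s[0], t[1] - t[0]
    if n_s + n_t == 0:
        return 0.0
    inside = [(i, j) for i, j in links if s[0] <= i < s[1] and t[0] <= j < t[1]]
    n_m = len({i for i, _ in inside}) + len({j for _, j in inside})
    return n_m / (n_s + n_t)

def judge_pair(src: List[str], tgt: List[str], links: Set[Tuple[int, int]],
               theta: float = 0.5) -> str:
    s, t = bracket_span(src), bracket_span(tgt)
    if s is None and t is None:
        return "keep"                    # no brackets: not part of T_b
    if s is None or t is None:
        return "discard"                 # case (3): brackets at one end only
    # cases (1) and (2): compare the score with the threshold
    return "keep" if bracket_score(s, t, links) > theta else "discard"

src = ["World", "Health", "Organization", "(", "WHO", ")"]
tgt = ["Organisation", "mondiale", "de", "la", "sante", "(", "OMS", ")"]
verdict = judge_pair(src, tgt, {(0, 1), (1, 4), (2, 0), (4, 6)})
```

Any other bounded score built from N_m, N_s and N_t could be substituted in `bracket_score` without changing the decision logic.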

In step 7), the word alignment information and the word correspondence frequency index are used to judge whether word adhesion exists at a given position in a sentence, and the processability of each position that may have a word adhesion problem is evaluated so that the most reasonable processing mode is chosen. Adhered words in the source language sentence and the target language sentence are detected and split. In this step, the method automatically extracts the sentences in which word adhesion occurs and splits the adhered words according to a standard vocabulary.

Based on the segmented bilingual parallel data set obtained in step 2), the frequency of every word is counted for each of the two languages, and the words whose frequency is higher than a threshold φ form the standard high-frequency word list for that language. The frequency threshold φ ensures that the word list contains only common words that appear in the data with high frequency, so that adhered-word splitting is not applied to rare words. The method then uses the high-frequency word list of the corresponding language to detect whether a sentence contains adhered words. Specifically, it checks whether the sentence contains a continuous character substring, within a certain length range, that is composed of several words from the high-frequency word list of that language. The length range for processing adhered words is limited to [10, 20]; this limit avoids ambiguous splits when the adhered string is too long or too short.
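The detection and splitting of adhered words can be sketched as follows. The assumptions are stated plainly: the check is applied per token rather than per arbitrary substring, pieces shorter than two characters are ignored, and "splittability" is interpreted as the token having exactly one decomposition into high-frequency words. The names `build_vocab`, `segmentations`, `split_sticky` and `phi` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set

def build_vocab(sentences: Iterable[List[str]], phi: int) -> Set[str]:
    """High-frequency word list: words seen more than phi times."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, n in counts.items() if n > phi}

def segmentations(token: str, vocab: Set[str], min_piece: int = 2,
                  limit: int = 2) -> List[List[str]]:
    """Return up to `limit` ways to write token as two or more vocab words."""
    results: List[List[str]] = []

    def walk(pos: int, parts: List[str]) -> None:
        if len(results) >= limit:
            return
        if pos == len(token):
            if len(parts) >= 2:
                results.append(list(parts))
            return
        for end in range(pos + min_piece, len(token) + 1):
            piece = token[pos:end]
            if piece in vocab:
                parts.append(piece)
                walk(end, parts)
                parts.pop()

    walk(0, [])
    return results

def split_sticky(tokens: List[str], vocab: Set[str], lo: int = 10, hi: int = 20) -> List[str]:
    """Split tokens of length lo..hi that decompose unambiguously into vocab words."""
    out: List[str] = []
    for tok in tokens:
        if lo <= len(tok) <= hi and tok not in vocab:
            segs = segmentations(tok, vocab)
            if len(segs) == 1:           # unambiguous: safe to split
                out.extend(segs[0])
                continue
        out.append(tok)
    return out

vocab = {"the", "machine", "translation", "model"}
fixed = split_sticky(["the", "machinetranslation", "model"], vocab)
```

Leaving ambiguous tokens untouched trades recall for precision, in line with the stated aim of avoiding splitting ambiguity.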

In step 8), after consistency detection has been performed on all sentence pairs in the basic data set, the final processed bilingual parallel data set is obtained; compared with the original basic data set, it has less data noise and higher data quality.
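One way to picture the overall flow is a small driver that applies each check to every sentence pair in turn. This is only a sketch: the `Step` signature, in which a step returns a corrected pair or `None` to discard it, is an assumption, and the two toy steps are placeholders rather than the corrections described above.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]
Step = Callable[[Pair], Optional[Pair]]   # None means "discard this pair"

def run_pipeline(pairs: Iterable[Pair], steps: List[Step]) -> List[Pair]:
    """Apply the steps in order; keep only the pairs that survive every step."""
    cleaned: List[Pair] = []
    for pair in pairs:
        current: Optional[Pair] = pair
        for step in steps:
            current = step(current)
            if current is None:
                break
        if current is not None:
            cleaned.append(current)
    return cleaned

# Placeholder steps for demonstration only.
def strip_spaces(pair: Pair) -> Optional[Pair]:
    return pair[0].strip(), pair[1].strip()

def drop_empty(pair: Pair) -> Optional[Pair]:
    return pair if pair[0] and pair[1] else None

result = run_pipeline([(" a ", "b"), ("", "c")], [strip_spaces, drop_empty])
```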

In this embodiment, an OPUS English-French bilingual data set is used as the basic data set and newtest2015 is used as the test set. After the basic data set is processed with the data consistency detection and correction method of the invention, sentences with inconsistent content are corrected, yielding a high-quality data set whose data quality is higher than that of the basic data set. The neural machine translation model trained on the high-quality data set achieves a higher BLEU score than the model trained on the basic data set. The experimental results are as follows.

Base-Method denotes the BLEU score on the newtest2015 test set of the neural machine translation model trained on the basic data set; Check-Method denotes the BLEU score on the newtest2015 test set of the neural machine translation model trained on the high-quality data set produced by the data consistency detection and correction method.

According to the experimental results, the method can accurately detect and correct sentences with inconsistent content in the data set, thereby improving the machine translation model. Using techniques and ideas such as word alignment and new word discovery, the method improves the quality of the original data set and corrects several of the main non-correspondence problems in the data, contributing substantially to the overall quality of the final model training data.

Claims

1. A bilingual parallel data consistency detection and correction method is characterized by comprising the following steps:

5) Judging, on the segmented bilingual parallel data set, whether a sentence pair has inconsistent sequence numbers, and if so, correcting the sentence pair to ensure that the finally processed data are mutual translations;

2. The bilingual parallel data consistency detection and correction method according to claim 1, characterized in that: in step 4), the word correspondence and its occurrence frequency in the data set are used to generate a named entity correspondence frequency table, a standard entity correspondence frequency table is generated from the high-frequency correspondences, and sentences with inconsistent correspondence are normalized according to the standard entity correspondence frequency table.

3. The bilingual parallel data consistency detection and correction method according to claim 1, characterized in that: in step 5), sentence sequence numbers are matched using the word alignment information and the actual condition of the sentence pair, a specific processing mode is selected for the case that occurs, and the sequence number part of the current sentence pair is corrected to ensure that the sentences correspond; specifically, the sentence at one end is taken as the standard, and the sequence number part of the sentence at the other end is replaced with the sequence number content of the standard sentence, so that the sequence number parts are consistent.

4. The bilingual parallel data consistency detection and correction method according to claim 1, characterized in that: in step 6), the word alignment information is used to evaluate the consistency of the parenthesis content in the data, ensuring that parenthesis content in a sentence pair is mutually translated and consistent.

5. The bilingual parallel data consistency detection and correction method according to claim 1, characterized in that: in step 7), the word alignment information and the word correspondence frequency index are used to judge whether word adhesion exists at a given position in a sentence, and the processability of each position that may have a word adhesion problem is evaluated so that the most reasonable processing mode is chosen.