CN114742077A - Generation method of domain parallel corpus and training method of translation model - Google Patents

Generation method of domain parallel corpus and training method of translation model

Info

Publication number
CN114742077A
Authority
CN
China
Prior art keywords
parallel
corpus
translation model
level
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210394030.XA
Other languages
Chinese (zh)
Inventor
杨露
黄细凤
代翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 10 Research Institute filed Critical CETC 10 Research Institute
Priority to CN202210394030.XA priority Critical patent/CN114742077A/en
Publication of CN114742077A publication Critical patent/CN114742077A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a method for generating a domain parallel corpus and a method for training a translation model, belonging to the field of machine translation in natural language processing, and comprising the following steps: aligning the chapter-level corpora and sentence-level corpora in a parallel corpus material library by using a machine translation model, and forming the domain parallel corpus from the chapter-level and sentence-level parallel corpora generated by the alignment. The method can generate a domain parallel corpus, realizes self-updating of the supervised machine translation model, has universality, improves the content quality of the domain parallel corpus, ensures the correctness of domain-term translation during translation, reduces cost, realizes the self-circulation of domain parallel corpus generation and machine translation in supervised machine translation, and improves efficiency.

Description

Method for generating domain parallel corpus and method for training translation model
Technical Field
The invention relates to the field of machine translation in natural language processing, in particular to a method for generating domain parallel corpora and a method for training a translation model.
Background
Machine translation belongs to computational linguistics and studies techniques for translating text from one natural language into another by means of a computer program, i.e., a machine translation model. Machine translation models are classified as supervised or unsupervised. With parallel corpus generation technology, a supervised translation model can perform more complex automatic text translation and can handle the correspondence of different grammatical structures, vocabulary recognition, and idioms.
A parallel corpus refers to original texts placed side by side with their translations. Parallel text alignment refers to determining the original text and the translated text of a parallel text. The original text is the text to be translated, and the translated text is the text in the corresponding language whose content matches the original; for example, in Korean-Chinese translation, the Korean text is the original text and the Chinese text is the translated text. Parallel corpus generation refers to performing parallel text alignment at the sentence level to produce a set of parallel corpora, i.e., a parallel corpus. Translation model training refers to training a supervised machine translation model with the parallel corpora in a domain parallel corpus so that original texts can be translated accurately.
A domain parallel corpus is a parallel corpus for a specific domain, such as the military or scientific domain. Compared with an open-domain parallel corpus, a domain parallel corpus usually contains more domain knowledge, such as domain terms, domain text expression methods, and domain writing conventions. A supervised machine translation model needs to learn domain translation knowledge from parallel corpora, so a translation model trained on a domain parallel corpus performs better in its domain than one trained on an open-domain parallel corpus.
Parallel corpora are mainly acquired in two ways: first, by manually mining parallel corpora from various databases or documents, such as the legal documents and patent databases of various countries; second, by collecting bilingual website resources with a web crawler and generating parallel corpora after processing.
Previous research on parallel corpus generation has mainly focused on the scale and quality of the corpus and has rarely addressed domain parallel corpus generation. Moreover, because domain corpora are difficult to collect and process, parallel corpus generation for a specific domain is usually completed by manual translation, so existing domain corpora are scarce, and some domains have no corpora adequate for training a machine translation model.
At present, the prior art has the following technical problems: 1) domain parallel corpora are scarce and cannot meet the requirements of machine translation models; 2) existing domain parallel corpora have poor universality; 3) existing parallel corpus generation processes cannot guarantee the correct translation of domain terms; 4) manually verifying and generating domain parallel corpora is costly and inefficient.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for generating a domain parallel corpus and a method for training a translation model. The method generates a domain parallel corpus, realizes self-updating of a supervised machine translation model, has universality, improves the content quality of the domain parallel corpus, ensures the correctness of domain-term translation during translation, reduces cost, realizes the self-circulation of domain parallel corpus generation and machine translation in supervised machine translation, and improves efficiency.
The purpose of the invention is realized by the following scheme:
a method for generating domain parallel corpora comprises the following steps:
aligning the chapter-level corpora and sentence-level corpora in a parallel corpus material library by using a machine translation model, and forming the domain parallel corpus from the chapter-level parallel corpora and sentence-level parallel corpora generated by the alignment.
Further, the method comprises the following sub-steps:
initially training a supervised machine translation model by using open parallel corpora;
collecting bilingual website content, parsing the material titles, contents and reporting times to generate corpus materials, and storing them in a parallel corpus material library;
a chapter-level parallel corpus alignment sub-step: calculating the reporting time difference between an original text material and a translated text material in the parallel corpus material library and matching the domain terms in the original material title; if the reporting time difference is greater than a preset time difference threshold, the original material and the translated material are not chapter-level parallel corpora; if the reporting time difference is less than the preset time difference threshold, comparing the similarity of the two materials' title contents by using the initially trained supervised machine translation model; if the similarity is greater than a preset title content similarity threshold, judging the materials to be chapter-level parallel corpora, otherwise judging them to be non-chapter-level parallel corpora and stopping processing;
a sentence-level parallel corpus alignment sub-step, entered only for materials judged to be chapter-level parallel corpora: splitting the original text and the translated text of the chapter-level parallel corpora into sentences, matching the domain terms in each original sentence, and comparing the content similarity of every pair of an original sentence and a translated sentence by using the initially trained supervised machine translation model; if the similarity is higher than a preset sentence translation performance threshold, judging the pair to be a sentence-level parallel corpus, otherwise stopping processing.
A method for training a translation model, comprising the following steps: updating the machine translation model by using the sentence-level parallel corpora generated by the above method, and generating a domain parallel corpus by using the updated machine translation model; and cyclically repeating the domain parallel corpus generation process and the machine translation model updating process.
Further, the open parallel corpus comprises an open domain public parallel corpus and an open domain translation interface, and the supervised machine translation model comprises a Bert-Transformer translation model.
Further, collecting bilingual website content, parsing the material titles, contents and reporting times to generate corpus materials, and storing them in the parallel corpus material library comprises the following sub-steps:
calling the corpus material table, judging whether a material with the same reporting time and title already exists, and discarding the collected material if it does; otherwise, adding the collected material to the corpus material table.
Further, for the Bert-Transformer translation model, the ROUGE value between the model translation and the Chinese material title text is used as the parameter for content similarity comparison.
Further, the open domain public parallel corpus is obtained by a crawler.
Further, the method comprises the following sub-step: providing a sentence-level parallel corpus list for storing the sentence-level parallel corpora generated as described above.
The beneficial effects of the invention include:
the method realizes the generation of the field parallel linguistic data through the steps of field bilingual website content acquisition, chapter-level field parallel linguistic data alignment, sentence-level field parallel linguistic data acquisition and the like, has certain universality, and can adapt to a plurality of fields under the condition that a field corpus is lacked.
In the invention, the problem that the correct translation of the domain terms cannot be ensured in the parallel corpus generation process is considered, the domain terms related to the original text are automatically matched by means of the domain original translation term comparison table, the content quality of the domain parallel corpus is improved, and the correctness of the domain term translation in the translation process is ensured.
The method can automatically complete the updating training of the machine translation model based on the generated field parallel corpus, and the updated translation model can support the generation of the parallel corpus at the same time, thereby reducing the cost of manually determining the quality of the generated field parallel corpus, realizing the self-circulation of the field parallel corpus generation and the machine translation in the supervised machine translation, and simultaneously improving the corpus generation and the machine translation efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart illustrating a method for domain parallel corpus generation and translation model training according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a field parallel corpus material collection process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process of generating a chapter-level domain parallel corpus according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a sentence-level domain parallel corpus generation flow according to an embodiment of the present invention;
fig. 5 is a schematic operation diagram of a Bert-Transformer-based machine translation model to which the corpus generation method and the training method according to the embodiment of the present invention are applied.
Detailed Description
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
To address the problems in the background and obtain high-quality domain parallel corpora, the embodiment of the invention, on the one hand, collects bilingual corpus materials from domain bilingual websites by means of a machine translation model and a domain term comparison table and generates domain parallel corpora through processing; on the other hand, during domain parallel corpus generation, it updates and further optimizes the domain machine translation model.
In a specific implementation, the method generates the domain parallel corpus and realizes self-updating of the supervised machine translation model through steps including machine translation model initial training, domain bilingual website corpus material collection, chapter-level corpus alignment based on time and title content, and sentence-level corpus alignment based on domain terms and content.
To describe the implementation more specifically, the embodiment of the present invention takes the construction of a Korean-Chinese parallel corpus in the military field as an example; that is, the original text is Korean and the translated text is Chinese, and the translation model using the corpus is a Bert-Transformer-based machine translation model for translating Korean into Chinese. The translation model uses a Bert model as the encoder and a Transformer model as the decoder.
As shown in fig. 1, an embodiment of the present invention provides a method for domain parallel corpus generation and translation model training, including the following steps:
Step 1: translation model initialization: train a supervised translation model (such as a statistical translation model) based on open-domain public parallel corpora and an open-domain translation interface, and initially train the Bert-Transformer translation model. Step 1 specifically comprises the following sub-steps:
Step 1.1: prepare a batch of domain Chinese corpus materials, call an open Chinese-Korean translation interface, such as the Korean translation interface of Baidu Translate, translate the batch of Chinese corpora into Korean corpora, and combine the Korean and Chinese corpora to form Chinese-Korean parallel corpora;
step 1.2: preparing open source parallel corpora, such as Korean parallel corpora provided by a world machine translation competition;
step 1.3: the translation model is trained using the parallel corpora prepared in step 1.1 and step 1.2.
Step 2: as shown in fig. 2, based on the crawler technology, domain bilingual website content is collected, webpage content analysis is realized, corpus materials are generated, and the corpus materials are stored in a corpus material table. In step 2, the method specifically comprises the following substeps:
Step 2.1: set the bilingual website urls for crawling materials. For example, it can be determined that Korean materials in the military field are to be crawled from websites such as The Dong-A Ilbo and JoongAng Ilbo; The Dong-A Ilbo is selected as the bilingual website for obtaining materials, and the crawled material website urls are http://www.donga.com and http://www.donga.com/cn respectively.
Step 2.2: call the crawler to obtain the home page contents of the two bilingual webpages respectively, parse the home page material urls, and store them in a material urls list.
Step 2.3: traversing url in the urls list, calling a crawler technology, acquiring the webpage content of the material, and analyzing the title, the report time and the text of the material; calling a corpus material table, judging whether materials with the same material reporting time and the same title exist, and discarding the newly collected materials if the materials exist; and if not, adding the newly collected materials into the corpus material table.
Step 2.4: and regularly executing the steps 2.1 to 2.3 to update the corpus material table.
Step 3: as shown in fig. 3, based on the Korean and Chinese corpus materials crawled in step 2, judge whether two materials are chapter-level parallel corpora based on the reporting time and the title translation performance. Step 3 specifically comprises the following sub-steps:
Step 3.1: set the chapter-level parallel corpus reporting time difference threshold to t_p and the title content similarity threshold to rouge_p.
Step 3.2: inputting Korean materials (Sz, Sf) to be aligned;
Step 3.3: obtain the reporting times of materials Sz and Sf and calculate their difference; if the time difference is greater than the threshold t_p, the two materials cannot be chapter-level parallel corpora; if the time difference is smaller than the threshold, go to step 3.4;
This embodiment recognizes that two materials are unlikely to describe the same event when the difference in their reporting times is large.
Step 3.4: and judging the content similarity of the two materials. In step 3.4, the method specifically comprises the following substeps:
Step 3.4.1: call the domain Chinese-Korean term comparison table, match the terms appearing in the Korean material title, and replace them with their Chinese equivalents;
Step 3.4.2: call the Bert-Transformer translation model initially trained in step 1, translate the Korean material title into Chinese, and calculate the ROUGE value between the translation and the Chinese material title text;
Step 3.4.3: if the ROUGE value is larger than the title content similarity threshold, the two materials are chapter-level parallel corpora; otherwise, they are non-chapter-level parallel corpora.
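Steps 3.3 and 3.4 can be sketched as below. The patent does not specify the ROUGE variant, the timestamp format, or the threshold values, so the character-level ROUGE-1 F1, the time format, and the defaults for t_p and rouge_p here are all assumptions; `translated_title` stands in for the output of the initially trained Bert-Transformer model after term replacement:

```python
from datetime import datetime

def rouge1_f1(candidate, reference):
    """Character-level ROUGE-1 F1, a simple stand-in for the patent's
    title-similarity 'rouge value' (the exact variant is unspecified)."""
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    ref_counts = {}
    for ch in ref:
        ref_counts[ch] = ref_counts.get(ch, 0) + 1
    overlap = 0
    for ch in cand:  # clipped unigram overlap
        if ref_counts.get(ch, 0) > 0:
            overlap += 1
            ref_counts[ch] -= 1
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def is_chapter_level_pair(time_z, time_f, translated_title, zh_title,
                          t_p=86400.0, rouge_p=0.5):
    """Steps 3.3-3.4: reject if the reporting-time gap exceeds t_p seconds,
    otherwise accept when the title ROUGE exceeds rouge_p."""
    fmt = "%Y-%m-%d %H:%M"
    gap = abs((datetime.strptime(time_z, fmt)
               - datetime.strptime(time_f, fmt)).total_seconds())
    if gap > t_p:
        return False
    return rouge1_f1(translated_title, zh_title) > rouge_p
```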
Step 4: as shown in fig. 4, based on the chapter-level parallel corpora (Sz, Sf) generated in step 3, generate sentence-level parallel corpora based on content similarity. Step 4 specifically comprises the following sub-steps:
Step 4.1: set a sentence translation performance threshold bleu_p, set the maximum lengths of Chinese and Korean sentences to len_z and len_f respectively, and set up a sentence-level parallel corpus list.
Step 4.2: the chapter-level parallel corpora (Sz, Sf) are input, and the Chinese corpora are according to'. | A The punctuations such as "" and the like are used for dividing sentences, sentences with the length exceeding len _ z are divided into a plurality of sentences, English is aligned with the clauses according to the punctuations such as ". |'", the sentences with the length exceeding len _ f are divided into a plurality of sentences, and Chinese and Korean clause results are sequentially stored in listz and listf.
Step 4.3: and calculating the content similarity of any two sentences a and b of listz and listf, and judging whether the sentences are sentence-level parallel corpora or not. In step 4.3, the method specifically comprises the following substeps:
Step 4.3.1: call the domain Chinese-Korean term comparison table, match the terms appearing in sentence b, and replace them with their Chinese equivalents;
Step 4.3.2: call the Bert-Transformer translation model initially trained in step 1, translate sentence b into Chinese, and calculate the BLEU value bleu_b between the translation and sentence a;
Step 4.3.3: compare bleu_b with the sentence translation performance threshold bleu_p; if bleu_b is greater than bleu_p, sentence a and sentence b are parallel corpora, and the sentence pair (a, b) is added to the sentence-level parallel corpus list.
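Step 4.3 can be sketched as below. The patent does not fix the BLEU variant, so this character-level BLEU with a brevity penalty is an illustrative stand-in; `translate` is a placeholder for the initially trained Bert-Transformer model, and the default bleu_p is an assumption:

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=2):
    """A minimal character-level sentence BLEU (geometric mean of n-gram
    precisions with brevity penalty); the exact variant is unspecified."""
    cand, ref = list(candidate), list(reference)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        p = overlap / total
        if p == 0:
            return 0.0
        log_precisions.append(math.log(p))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

def align_sentences(listz, listf, translate, bleu_p=0.4):
    """Step 4.3: keep every pair (a, b) whose translated-b-vs-a BLEU
    exceeds bleu_p. `translate` stands in for the translation model."""
    pairs = []
    for a in listz:
        for b in listf:
            if sentence_bleu(translate(b), a) > bleu_p:
                pairs.append((a, b))
    return pairs
```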
Step 5: update and train the domain translation model using the corpora in the sentence-level parallel corpus list to generate a new translation model.
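The self-circulation of step 5 (and of the training method as a whole) can be sketched as a loop in which alignment and model updating feed each other; `align_corpora` and `update_model` are hypothetical stand-ins for the alignment pipeline of steps 2-4 and the Bert-Transformer update:

```python
def self_training_loop(model, material_library, align_corpora, update_model, rounds=3):
    """Each round aligns new sentence-level parallel corpora with the
    current translation model, then retrains the model on the pairs it
    just produced, realizing the corpus-generation/translation cycle."""
    for _ in range(rounds):
        sentence_pairs = align_corpora(model, material_library)
        model = update_model(model, sentence_pairs)
    return model
```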
Example 1
A method for generating domain parallel corpora comprises the following steps:
and aligning the chapter-level linguistic data and the sentence-level linguistic data in the parallel linguistic data material library by using the machine translation model, and generating the chapter-level parallel linguistic data and the sentence-level parallel linguistic data after aligning to form the field parallel linguistic data.
Example 2
On the basis of embodiment 1, the method comprises the following sub-steps:
initially training a supervised machine translation model by using open parallel corpora;
collecting bilingual website content, parsing the material titles, contents and reporting times to generate corpus materials, and storing them in a parallel corpus material library;
a chapter-level parallel corpus alignment sub-step: calculating the reporting time difference between an original text material and a translated text material in the parallel corpus material library and matching the domain terms in the original material title; if the reporting time difference is greater than a preset time difference threshold, the original material and the translated material are not chapter-level parallel corpora; if the reporting time difference is less than the preset time difference threshold, comparing the similarity of the two materials' title contents by using the initially trained supervised machine translation model; if the similarity is greater than a preset title content similarity threshold, judging the materials to be chapter-level parallel corpora, otherwise judging them to be non-chapter-level parallel corpora and stopping processing;
a sentence-level parallel corpus alignment sub-step, entered only for materials judged to be chapter-level parallel corpora: splitting the original text and the translated text of the chapter-level parallel corpora into sentences, matching the domain terms in each original sentence, and comparing the content similarity of every pair of an original sentence and a translated sentence by using the initially trained supervised machine translation model; if the similarity is higher than a preset sentence translation performance threshold, judging the pair to be a sentence-level parallel corpus, otherwise stopping processing.
Example 3
On the basis of embodiment 2, a method for training a translation model comprises the following steps: updating the machine translation model by using the sentence-level parallel corpora generated by the method of embodiment 1, and generating a domain parallel corpus by using the updated machine translation model; and cyclically repeating the domain parallel corpus generation process and the machine translation model updating process.
Example 4
On the basis of the embodiment 3, the open parallel corpus comprises an open domain public parallel corpus and an open domain translation interface, and the supervised machine translation model comprises a Bert-Transformer translation model.
Example 5
On the basis of embodiment 3, collecting bilingual website content, parsing the material titles, contents and reporting times to generate corpus materials, and storing them in the parallel corpus material library comprises the following sub-steps: calling the corpus material table, judging whether a material with the same reporting time and title already exists, and discarding the collected material if it does; otherwise, adding the collected material to the corpus material table.
Example 6
On the basis of embodiment 4, for the Bert-Transformer translation model, the ROUGE value between the model translation and the Chinese material title text is used as the parameter for content similarity comparison.
Example 7
On the basis of example 4, the open-domain public parallel corpus is obtained by a crawler.
Example 8
On the basis of embodiment 3, the method comprises the following sub-step: providing a sentence-level parallel corpus list for storing the sentence-level parallel corpora generated by the method of embodiment 1.
The units described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiment; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
The parts not involved in the present invention are the same as or can be implemented using the prior art.
The above-described embodiment is only one embodiment of the present invention. It will be apparent to those skilled in the art that various modifications and variations can easily be made based on the application and principles of the invention disclosed herein, and the invention is not limited to the method described in the above embodiment; the embodiment is therefore only preferred and not restrictive.
Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting and using knowledge or techniques of the relevant art, and features of various embodiments may be interchanged or substituted and such modifications and variations that may be made by those skilled in the art without departing from the spirit and scope of the present invention are intended to be within the scope of the following claims.

Claims (8)

1. A method for generating domain parallel corpora is characterized by comprising the following steps:
aligning the chapter-level corpora and sentence-level corpora in a parallel corpus material library by using a machine translation model, and forming the domain parallel corpus from the chapter-level parallel corpora and sentence-level parallel corpora generated by the alignment.
2. The method for generating domain parallel corpus according to claim 1, comprising the sub-steps of:
initially training a supervised machine translation model by using open parallel corpora;
collecting bilingual website content, parsing the material titles, contents and reporting times to generate corpus materials, and storing them in a parallel corpus material library;
a chapter-level parallel corpus alignment sub-step: calculating the reporting time difference between an original text material and a translated text material in the parallel corpus material library and matching the domain terms in the original material title; if the reporting time difference is greater than a preset time difference threshold, the original material and the translated material are not chapter-level parallel corpora; if the reporting time difference is less than the preset time difference threshold, comparing the similarity of the two materials' title contents by using the initially trained supervised machine translation model; if the similarity is greater than a preset title content similarity threshold, judging the materials to be chapter-level parallel corpora, otherwise judging them to be non-chapter-level parallel corpora and stopping processing;
a sentence-level parallel corpus alignment sub-step, entered only for materials judged to be chapter-level parallel corpora: splitting the original text and the translated text of the chapter-level parallel corpora into sentences, matching the domain terms in each original sentence, and comparing the content similarity of every pair of an original sentence and a translated sentence by using the initially trained supervised machine translation model; if the similarity is higher than a preset sentence translation performance threshold, judging the pair to be a sentence-level parallel corpus, otherwise stopping processing.
3. A method for training a translation model, comprising the steps of: updating a machine translation model by using the sentence-level parallel corpora generated by the method of claim 1, and generating domain parallel corpora by using the updated machine translation model; and alternately iterating the domain-parallel-corpus generation process and the machine-translation-model updating process.
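The alternating loop of claim 3 (harvest corpora with the current model, retrain, repeat) reduces to a simple iteration. A minimal sketch, assuming hypothetical `harvest_parallel_sentences` (the alignment pipeline run with the current model) and `update_model` (fine-tuning on the accumulated corpus) callables:

```python
def iterative_training(model, harvest_parallel_sentences, update_model, rounds=3):
    # Alternate the two processes of claim 3: each round, the current model
    # mines new sentence-level pairs, which are then used to update the model.
    corpus = []
    for _ in range(rounds):
        new_pairs = harvest_parallel_sentences(model)  # corpus generation
        corpus.extend(new_pairs)
        model = update_model(model, corpus)            # model update
    return model, corpus
```

The bootstrap relies on the initial model (claim 1's openly initialized one) being good enough to mine a first batch of in-domain pairs; each subsequent round should mine with a strictly better model.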
4. The method for training a translation model according to claim 3, wherein the open parallel corpora comprise open-domain public parallel corpora and an open-domain translation interface, and the supervised machine translation model comprises a BERT-Transformer translation model.
5. The method for training a translation model according to claim 3, wherein collecting bilingual website content, parsing the material title, content and reporting time to generate corpus materials, and storing them in the parallel corpus material library comprises the sub-steps of:
querying a corpus material table and judging whether a material with the same reporting time and the same title already exists in the table; if so, discarding the collected material; if not, adding the collected material to the corpus material table.
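Claim 5's duplicate check keys a material on its (reporting time, title) pair. A minimal sketch, using an in-memory dict in place of the claimed corpus material table (the table backing and field names are assumptions):

```python
def add_material(table, title, report_time, content):
    # Claim 5's rule: a collected material is discarded when the table
    # already holds an entry with the same reporting time AND title.
    key = (report_time, title)
    if key in table:
        return False      # duplicate -- discard
    table[key] = content  # new material -- keep
    return True
```

A material that shares only its title (republished later) or only its timestamp (a different story the same day) is still kept; only the exact (time, title) collision is treated as a duplicate.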
6. The method for training a translation model according to claim 4, wherein, for the BERT-Transformer translation model, the ROUGE value between the generated translation and the Chinese material title text is used as the parameter for comparing content similarity.
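Claim 6 does not specify which ROUGE variant or granularity is used; for Chinese titles, character-level ROUGE-1 recall is one plausible reading (that choice, and the function below, are editorial assumptions, not the patented metric):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    # Character-level ROUGE-1 recall: fraction of reference characters
    # (with multiplicity) that also appear in the candidate translation.
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    return overlap / max(sum(ref.values()), 1)
```

For Chinese, character-level n-grams sidestep word segmentation; a word-level variant would first need a tokenizer.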
7. The method for training a translation model according to claim 4, wherein the open-domain public parallel corpora are obtained by a web crawler.
8. The method for training a translation model according to claim 3, further comprising the sub-step of: providing a sentence-level parallel corpus table for storing the sentence-level parallel corpora generated by the method of claim 1.
CN202210394030.XA 2022-04-15 2022-04-15 Generation method of domain parallel corpus and training method of translation model Pending CN114742077A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394030.XA CN114742077A (en) 2022-04-15 2022-04-15 Generation method of domain parallel corpus and training method of translation model

Publications (1)

Publication Number Publication Date
CN114742077A true CN114742077A (en) 2022-07-12

Family

ID=82282232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394030.XA Pending CN114742077A (en) 2022-04-15 2022-04-15 Generation method of domain parallel corpus and training method of translation model

Country Status (1)

Country Link
CN (1) CN114742077A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657351A (en) * 2015-02-12 2015-05-27 中国科学院软件研究所 Method and device for processing bilingual alignment corpora
CN108519971A (en) * 2018-03-23 2018-09-11 中国传媒大学 A kind of across languages theme of news similarity comparison methods based on Parallel Corpus
CN110807337A (en) * 2019-11-01 2020-02-18 北京中献电子技术开发有限公司 Patent double sentence pair processing method and system
CN112199511A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Cross-language multi-source vertical domain knowledge graph construction method
CN112347796A (en) * 2020-11-10 2021-02-09 内蒙古工业大学 Mongolian Chinese neural machine translation method based on combination of distillation BERT and improved Transformer
CN114065778A (en) * 2020-07-31 2022-02-18 北京搜狗科技发展有限公司 Chapter-level translation method, translation model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination