CN114528861A

CN114528861A - Foreign language translation training method and device based on corpus

Info

Publication number: CN114528861A
Application number: CN202210204937.5A
Authority: CN
Inventors: 申丽霞
Original assignee: Zhengzhou University of Science and Technology
Current assignee: Zhengzhou University of Science and Technology
Priority date: 2022-03-02
Filing date: 2022-03-02
Publication date: 2022-05-24

Abstract

The invention discloses a corpus-based foreign language translation training method and device, relating to the technical field of natural language processing. The method comprises the following steps: randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus; constructing and training an initial translation model according to the first parallel corpus; obtaining a translation corpus by using the initial translation model; calculating a translation confidence score of any sentence in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold; updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus; and obtaining a whole corpus from the parallel corpus and the second parallel corpus, and training the initial translation model again. By enlarging the parallel corpus while ensuring the accuracy of the translated sentences merged into the original parallel corpus, the invention makes the trained translation model more accurate.

Description

Foreign language translation training method and device based on corpus

Technical Field

The invention relates to the technical field of natural language processing, in particular to a foreign language translation training method and device based on a corpus.

Background

Natural language processing is an important research direction in computer science and artificial intelligence. It studies how people and computers can communicate effectively in natural language, and it integrates linguistics, computer science and mathematics.

Within this field, neural machine translation is an important task that has attracted a great deal of attention in academia and industry in recent years. A neural machine translation model performs well when it is trained on large-scale, high-quality bilingual parallel corpora. At present, however, high-quality parallel corpora exist for only a small number of language pairs and are often limited to specific fields, such as government documents and news. Therefore, how to ensure the accuracy of the translation model when the parallel training corpus is limited is a problem that urgently needs to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a method and an apparatus for training foreign language translation based on a corpus, which overcome the above-mentioned drawbacks.

In order to achieve the above purpose, the invention provides the following technical scheme:

a foreign language translation training method based on a corpus specifically comprises the following steps:

randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus;

constructing and training an initial translation model according to the first parallel corpus;

translating a source language sentence in any monolingual corpus into a target language sentence by using an initial translation model to obtain a translation corpus;

calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold;

updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus;

and acquiring the whole corpus according to any one parallel corpus and the second parallel corpus, and training the initial translation model again.

Optionally, the construction step of the initial translation model is:

preprocessing sentences in the first parallel corpus to obtain a preprocessed text;

performing word segmentation processing on the preprocessed text according to the automatic word segmentation model to obtain word segmentation text information;

and establishing and training an initial translation model with a recurrent neural network based on the word segmentation text information.

Optionally, the step of obtaining the automatic word segmentation model includes:

acquiring a preprocessed text, and performing word segmentation processing on the preprocessed text to obtain word segmentation text information at a character level;

acquiring part-of-speech tags and word segmentation tags of word segmentation text information;

combining part-of-speech labels and word segmentation labels of word segmentation text information to obtain binary label information;

and training by using a recurrent neural network based on the word segmentation text information and the binary label information to construct an automatic word segmentation model.
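The combination of word segmentation labels and part-of-speech labels described above can be sketched as follows. This is only an illustration: the BMES segmentation scheme, the part-of-speech tag names and the function name `combine_labels` are assumptions, since the text does not specify a label format.

```python
def combine_labels(chars, seg_tags, pos_tags):
    """Pair each character with a combined 'segmentation-POS' label."""
    if not (len(chars) == len(seg_tags) == len(pos_tags)):
        raise ValueError("chars, seg_tags and pos_tags must have equal length")
    return [(ch, f"{seg}-{pos}") for ch, seg, pos in zip(chars, seg_tags, pos_tags)]


# Character-level text with hypothetical BMES segmentation tags and POS tags.
chars = list("我爱北京")
seg_tags = ["S", "S", "B", "E"]    # S = single-character word, B/E = begin/end
pos_tags = ["r", "v", "ns", "ns"]  # pronoun, verb, place name (assumed tag set)
labeled = combine_labels(chars, seg_tags, pos_tags)
```

Each character then carries one joint label, so a single sequence labeling network can be trained to predict word boundaries and parts of speech together.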

Optionally, the obtaining step of the translation confidence score of any statement is:

obtaining a translation confidence evaluation index according to historical data;

acquiring the weight of each translation confidence evaluation index;

and obtaining the translation confidence score of any statement according to each translation confidence evaluation index and the corresponding weight.

Optionally, the calculation formula of the translation confidence score is as follows:

in the formula, $i$ is the serial number of the translation confidence evaluation index; $\lambda_i$ is the weight of the $i$-th translation confidence evaluation index; $h_i$ is the $i$-th translation confidence evaluation index.

Optionally, the step of updating the translation corpus specifically includes:

calculating a translation confidence score of any sentence in the translation corpus, and comparing the translation confidence score with the preset evaluation thresholds; if the score is greater than or equal to the second evaluation threshold, the translation corpus is not updated; if the score is smaller than the first evaluation threshold, performing text recognition on the sentence in the translation corpus according to a preset length;

matching the recognized text with the text in the source language sentence;

acquiring a text to be replaced according to a monolingual corpus of a target language;

replacing the text to be replaced with the corresponding content in the identified text to obtain a second translation sentence;

calculating the translation confidence score of the second translation sentence; if the score is smaller than the first evaluation threshold, replacing the texts to be replaced one by one and calculating the translation confidence score of each result to obtain the best translation sentence, and updating the translation corpus; and if the score is greater than or equal to the second evaluation threshold, storing the second translation sentence in the translation corpus and updating the translation corpus.

A foreign language translation training device based on a corpus comprises an initial training module, an evaluation module, a first corpus construction module, a second corpus construction module and a retraining module;

the initial training module is used for constructing and training an initial translation model according to the first parallel corpus;

the evaluation module is used for calculating the translation confidence score of any statement in the translation corpus and comparing the translation confidence score with a preset evaluation threshold; updating the translation corpus according to the comparison result;

the first corpus construction module is used for splicing the updated translation corpus and any monolingual corpus to obtain a second parallel corpus;

the second corpus construction module is used for acquiring an integral corpus according to any one parallel corpus and a second parallel corpus;

and the retraining module is used for retraining the initial translation model according to the whole corpus.

Optionally, the initial training module includes a corpus extraction module, a preprocessing module, an automatic word segmentation module, and a model training module;

the corpus extraction module is used for extracting training corpuses with preset quantity and constructing a first parallel corpus;

the preprocessing module is used for preprocessing the sentences in the first parallel corpus to obtain a preprocessed text;

the automatic word segmentation module is used for carrying out word segmentation processing on the preprocessed text to obtain word segmentation text information;

and the model training module is used for establishing and training an initial translation model.

Compared with the prior art, the corpus-based foreign language translation training method and device enlarge the scale of the parallel corpus while ensuring the accuracy of the translated sentences incorporated into the original parallel corpus, so that the trained translation model is more accurate.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a schematic structural diagram of the apparatus of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a foreign language translation training method and a foreign language translation training device based on a corpus, wherein the method comprises the following steps as shown in figure 1:

step 1, randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus;

The parallel corpus, also called a translation corpus, is a corpus formed by pairing original texts with their translations and is used for training, testing and similar tasks of a machine translation model; it can pair, for example, Chinese and Mandarin, Chinese and English, Chinese and Japanese, or Japanese and Chinese.

In this embodiment, 1500 Chinese-English sentence pairs are randomly extracted from a Chinese-English translation corpus as training corpora and stored in a separate corpus, defined as the first parallel corpus; Chinese is defined as the source language and English as the target language.
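The random extraction in step 1 can be sketched as follows. The function name `build_first_parallel_corpus`, the `seed` parameter and the placeholder sentence pairs are assumptions made for illustration; only the sample size of 1500 comes from the embodiment.

```python
import random


def build_first_parallel_corpus(parallel_corpus, sample_size, seed=None):
    """Randomly draw `sample_size` sentence pairs without replacement."""
    if sample_size > len(parallel_corpus):
        raise ValueError("sample_size exceeds the size of the parallel corpus")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(parallel_corpus, sample_size)


# Placeholder Chinese-English pairs standing in for a real parallel corpus.
corpus = [(f"源句{i}", f"source sentence {i}") for i in range(5000)]
first_parallel_corpus = build_first_parallel_corpus(corpus, 1500, seed=0)
```

Sampling without replacement keeps each extracted pair unique, so the first parallel corpus is a strict subset of the original parallel corpus.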

Step 2, constructing and training an initial translation model according to the first parallel corpus, which specifically comprises the following steps:

establishing and training an initial translation model with a bidirectional recurrent neural network based on the word segmentation text information;

Furthermore, training with the bidirectional recurrent neural network proceeds as follows: a bidirectional RNN encoder encodes the word segmentation text information in the forward and backward directions and determines its hidden state at each time step; a unidirectional RNN decoder then decodes the hidden state and semantic vector of the bidirectional RNN encoder at each time step; the initial translation model is thereby established and trained.

In this embodiment, the bidirectional recurrent neural network encodes in the forward and backward directions and determines the hidden state and semantic vector at each time step, which avoids compressing the hidden states and semantic vectors of all time steps into a single fixed-length vector and improves the sentence translation accuracy of the initial translation model.
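The bidirectional encoding described above can be sketched in plain Python as follows. This is a minimal illustration, not the model of the embodiment: the simple tanh (Elman) cell, the random untrained weights, the tiny dimensions and all function names are assumptions, and the decoder is omitted.

```python
import math
import random


def rnn_step(x, h, w_x, w_h):
    """One simple RNN step: h' = tanh(W_x x + W_h h)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(w_x[j], x))
            + sum(wh * hi for wh, hi in zip(w_h[j], h))
        )
        for j in range(len(h))
    ]


def run_rnn(inputs, w_x, w_h, hidden_size):
    """Run the cell over a sequence and return the hidden state at every step."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h)
        states.append(h)
    return states


def bidirectional_encode(inputs, fwd, bwd, hidden_size):
    """Concatenate forward and backward hidden states for each time step."""
    forward = run_rnn(inputs, *fwd, hidden_size)
    # Encode the reversed sequence, then flip back so positions line up.
    backward = run_rnn(inputs[::-1], *bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
embed_size, hidden_size = 3, 4
fwd = (random_matrix(hidden_size, embed_size, rng),
       random_matrix(hidden_size, hidden_size, rng))
bwd = (random_matrix(hidden_size, embed_size, rng),
       random_matrix(hidden_size, hidden_size, rng))
# Five time steps of made-up word embeddings.
inputs = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(5)]
states = bidirectional_encode(inputs, fwd, bwd, hidden_size)
```

Because one state per time step is kept, a decoder can consult all of them rather than a single fixed-length summary of the sentence.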

The automatic word segmentation model is constructed by the following steps:

training a long short-term memory (LSTM) network based on the word segmentation text information and the binary label information to obtain the automatic word segmentation model;

wherein the preprocessing comprises normalization, error correction, number normalization and the like on the training corpus;

in this embodiment, garbled-character filtering, conversion of half-width characters to full-width characters in the Chinese text, Chinese word segmentation, and lowercasing of the English corpus are performed in sequence on the data in the first parallel corpus, and a corresponding vocabulary is established.
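The preprocessing sequence of this embodiment can be sketched as follows. The function names are hypothetical, the definition of a garbled character (Unicode control, format, private-use or unassigned characters, plus U+FFFD) is an assumption, and Chinese word segmentation is left out because it is handled by the automatic word segmentation model.

```python
import unicodedata


def strip_garbled(text):
    """Drop U+FFFD and characters in the Unicode 'C' (other) categories."""
    return "".join(
        ch for ch in text
        if ch != "\ufffd" and not unicodedata.category(ch).startswith("C")
    )


def to_full_width(text):
    """Map half-width ASCII characters to their full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")            # ideographic space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # e.g. ',' becomes U+FF0C
        else:
            out.append(ch)
    return "".join(out)


def preprocess_pair(zh, en):
    """Clean one Chinese-English pair: filter, widen Chinese, lowercase English."""
    return to_full_width(strip_garbled(zh)), strip_garbled(en).lower()


def build_vocab(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


zh, en = preprocess_pair("你好,ABC\x00", "Hello World")
vocab = build_vocab([en.split()])
```

The English side is split on whitespace here only to demonstrate vocabulary construction; a real pipeline would build the Chinese vocabulary from the segmented text.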

Furthermore, the long short-term memory network is trained with the obtained word segmentation text information until the current iteration number is greater than or equal to the preset maximum iteration number, or the accuracy of the binary label information output by the long short-term memory network is greater than a preset accuracy threshold; the automatic word segmentation model is then obtained.
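The stopping rule for training the word segmentation network can be sketched as follows. The callable `train_one_iteration`, which is assumed to run one training iteration and return the current label accuracy, is hypothetical, and the accuracy values in the example are made up.

```python
def train_until_converged(train_one_iteration, max_iterations, accuracy_threshold):
    """Train until the iteration cap is reached or accuracy exceeds the threshold."""
    iteration = 0
    accuracy = 0.0
    while iteration < max_iterations:
        accuracy = train_one_iteration()
        iteration += 1
        if accuracy > accuracy_threshold:
            break
    return iteration, accuracy


# Stand-in for real training: each call yields the next made-up accuracy.
fake_accuracies = iter([0.50, 0.70, 0.92, 0.95])
result = train_until_converged(
    lambda: next(fake_accuracies), max_iterations=10, accuracy_threshold=0.9
)
```

In this example training stops after the third iteration, because 0.92 is the first accuracy above the threshold of 0.9.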

Step 3, translating the source language sentences in any monolingual corpus into target language sentences by using the initial translation model to obtain a translation corpus, which specifically comprises the following steps:

selecting any corpus from the existing Chinese corpora, translating all sentences in that Chinese corpus into English sentences with the initial translation model, storing all English sentences in a corpus in translation order, and defining that corpus as the translation corpus;

step 4, calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold; the method specifically comprises the following steps:

acquiring the weight of each translation confidence evaluation index;

obtaining a translation confidence score of any statement according to each translation confidence evaluation index and the corresponding weight;

the translation confidence score is compared to a preset evaluation threshold.

The translation confidence evaluation indexes may include: the fluency of the translated sentence, the translation probability between words in the source language sentence and the translated sentence, and the translation probability between phrases in the source language sentence and the translated sentence;

the translation probability is related to the language habits, fixed collocations and domain of the source language sentence and of the translated sentence (here, English).

The calculation formula of the translation confidence score is as follows:

in the formula, $i$ is the serial number of the translation confidence evaluation index; $\lambda_i$ is the weight of the $i$-th translation confidence evaluation index; $h_i$ is the $i$-th translation confidence evaluation index.
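One plausible reading of the score, given the symbol definitions above, is a weighted sum of the evaluation indexes, $\sum_i \lambda_i h_i$. The sketch below assumes that form; the function name `confidence_score`, the three index values and the weights are made up for illustration.

```python
def confidence_score(indexes, weights):
    """Weighted sum of evaluation indexes (assumed form of the score)."""
    if len(indexes) != len(weights):
        raise ValueError("each evaluation index needs exactly one weight")
    return sum(w * h for w, h in zip(weights, indexes))


# Hypothetical values: fluency, word-level and phrase-level translation probability.
indexes = [0.8, 0.6, 0.7]
weights = [0.5, 0.25, 0.25]
score = confidence_score(indexes, weights)
```

With these values the score comes to about 0.725, which would then be compared with the preset evaluation threshold.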

Step 5, updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus;

the step of updating the translation corpus is as follows:

calculating a translation confidence score of any sentence in the translation corpus, and comparing the translation confidence score with the preset evaluation thresholds; if the score is greater than or equal to the optimal evaluation threshold, the translation corpus is not updated; if the score is smaller than the lowest evaluation threshold, performing text recognition on the sentence in the translation corpus according to a preset length;

matching the recognized text with the text in the source language sentence;

replacing the text to be replaced with the corresponding content of the recognized text to obtain a new translation sentence;

calculating the translation confidence score of the new translation sentence; if the score is smaller than the lowest evaluation threshold, replacing the texts to be replaced one by one and calculating the translation confidence score of each result to obtain the best translation sentence, and updating the translation corpus; if the score is greater than or equal to the second evaluation threshold, storing the fully replaced sentence in the translation corpus and updating the translation corpus.
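The threshold-driven update can be sketched as follows. This is a simplified reading under stated assumptions: `score_fn` and `candidates_fn` are hypothetical callables standing in for the confidence scoring and for the text matching and replacement steps, `low` and `high` stand for the lower and upper evaluation thresholds, and sentences scoring between the two thresholds are kept unchanged because the text does not say how to treat them.

```python
def update_sentence(sentence, score_fn, candidates_fn, low, high):
    """Return the sentence to store in the translation corpus."""
    score = score_fn(sentence)
    if score >= low:
        # At or above `high` the sentence is kept as is. The band between
        # `low` and `high` is unspecified; this sketch keeps it too (assumption).
        return sentence
    best, best_score = sentence, score
    for candidate in candidates_fn(sentence):  # try replacements one by one
        candidate_score = score_fn(candidate)
        if candidate_score >= high:
            return candidate                   # good enough: accept at once
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best                                # otherwise the best one seen


# Made-up scores and replacement candidates for a single sentence.
scores = {
    "he go school": 0.2,
    "he goes school": 0.5,
    "he goes to school": 0.9,
}
updated = update_sentence(
    "he go school",
    score_fn=scores.get,
    candidates_fn=lambda s: ["he goes school", "he goes to school"],
    low=0.4,
    high=0.8,
)
```

Accepting the first candidate that reaches the upper threshold avoids scoring every remaining replacement; scoring all candidates and taking the maximum is an equally valid reading.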

Step 6, obtaining a whole corpus from the parallel corpus and the second parallel corpus, and training the initial translation model again.
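The splicing in step 5 and the merging in step 6 can be sketched as follows. The function names are hypothetical, and dropping duplicate pairs during the merge is an assumption that the text does not require; the positional pairing relies on the translated sentences being stored in translation order, as in step 3.

```python
def splice(source_sentences, translated_sentences):
    """Pair each source sentence with its translation by position."""
    if len(source_sentences) != len(translated_sentences):
        raise ValueError("source and translation corpora must align one to one")
    return list(zip(source_sentences, translated_sentences))


def merge_corpora(original_parallel, second_parallel):
    """Concatenate two parallel corpora, keeping the first copy of each pair."""
    seen = set()
    merged = []
    for pair in list(original_parallel) + list(second_parallel):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged


original = [("你好", "hello"), ("谢谢", "thank you")]
monolingual = ["再见", "你好"]
translated = ["goodbye", "hello"]  # stand-in for the updated translation corpus
second_parallel = splice(monolingual, translated)
whole_corpus = merge_corpora(original, second_parallel)
```

The resulting whole corpus is what the initial translation model is trained on again.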

The embodiment further includes a foreign language translation training device based on a corpus, as shown in fig. 2, the structure of which includes an initial training module, an evaluation module, a first corpus construction module, a second corpus construction module, and a retraining module;

the initial training module is used for constructing and training an initial translation model according to the first parallel corpus;

the first corpus building module is used for splicing the updated translation corpus and any monolingual corpus to obtain a second parallel corpus;

the second corpus construction module is used for acquiring an integral corpus according to any one parallel corpus and the second parallel corpus;

The initial training module comprises a corpus extraction module, a preprocessing module, an automatic word segmentation module and a model training module;
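The way the modules of the device cooperate can be sketched as follows. The class `TranslationTrainingDevice`, its field names and the stub callables are all hypothetical; they only mirror the module names of the device and show the order in which data flows between them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Model = Callable[[str], str]


@dataclass
class TranslationTrainingDevice:
    initial_training: Callable[[List[Pair]], Model]
    evaluation: Callable[[List[str]], List[str]]
    first_corpus_construction: Callable[[List[str], List[str]], List[Pair]]
    second_corpus_construction: Callable[[List[Pair], List[Pair]], List[Pair]]
    retraining: Callable[[List[Pair]], Model]

    def run(self, first_parallel, parallel, monolingual):
        model = self.initial_training(first_parallel)
        translations = [model(s) for s in monolingual]  # translation corpus
        updated = self.evaluation(translations)         # score and update
        second_parallel = self.first_corpus_construction(monolingual, updated)
        whole = self.second_corpus_construction(parallel, second_parallel)
        return self.retraining(whole)


# Stub modules: "translation" is upper-casing, "retraining" builds a lookup table.
device = TranslationTrainingDevice(
    initial_training=lambda pairs: str.upper,
    evaluation=lambda sentences: sentences,
    first_corpus_construction=lambda src, tgt: list(zip(src, tgt)),
    second_corpus_construction=lambda a, b: a + b,
    retraining=lambda pairs: (lambda s: dict(pairs).get(s, s)),
)
model = device.run([("a", "A")], [("a", "A")], ["b", "c"])
```

Passing each module in as a callable keeps the sketch independent of any particular model or scoring implementation.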

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A foreign language translation training method based on a corpus is characterized by comprising the following specific steps:

2. The corpus-based foreign language translation training method according to claim 1, wherein the initial translation model is constructed by the steps of:

3. The corpus-based foreign language translation training method according to claim 2, wherein the automatic word segmentation model is obtained by the steps of:

4. The corpus-based foreign language translation training method according to claim 1, wherein the step of obtaining the translation confidence score of any sentence is:

acquiring the weight of each translation confidence evaluation index;

5. A corpus-based foreign language translation training method according to any one of claims 1-4, wherein the translation confidence score is calculated by the formula:

in the formula, $i$ is the serial number of the translation confidence evaluation index; $\lambda_i$ is the weight of the $i$-th translation confidence evaluation index; $h_i$ is the $i$-th translation confidence evaluation index.

6. The corpus-based foreign language translation training method according to claim 1, wherein the step of updating the translation corpus specifically comprises:

matching the recognized text with the text in the source language sentence;

7. A foreign language translation training device based on a corpus is characterized by comprising an initial training module, an evaluation module, a first corpus construction module, a second corpus construction module and a retraining module;

the first corpus building module is used for splicing the updated translation corpus with any monolingual corpus to obtain a second parallel corpus;

8. The corpus-based foreign language translation training device according to claim 7, wherein the initial training module comprises a corpus extraction module, a preprocessing module, an automatic word segmentation module, and a model training module;

the corpus extraction module is used for extracting a preset number of training corpuses and constructing a first parallel corpus;