
CN110705320A - Machine translation method and system for the national defense and military industry field oriented to subdivision fields

Info

Publication number: CN110705320A
Application number: CN201910948363.0A
Authority: CN
Inventors: 雷贺功; 李斌; 姚晗; 晏裕生; 程洁丹; 孙孟阳; 董文轩; 江洋
Original assignee: INTRODUCTION OF TECHNOLOGY RESEARCH & ECONOMY DEVELOPMENT INSTITUTE
Current assignee: INTRODUCTION OF TECHNOLOGY RESEARCH & ECONOMY DEVELOPMENT INSTITUTE
Priority date: 2019-10-08
Filing date: 2019-10-08
Publication date: 2020-01-17

Abstract

The invention discloses a machine translation method and system for the national defense and military industry field, oriented to its subdivision fields. Starting from a general machine translation model, the parallel corpus sentence pairs in a corpus are divided by subdivision field, and a subdivision domain machine translation model is trained for each field. When a user translates, a trained SVM text classification model automatically determines the subdivision field of the text to be translated, and the corresponding subdivision domain machine translation model is called to translate the text and generate a subdivision domain translation result. Because each subdivision domain machine translation model is trained on parallel corpus sentence pairs from its own subdivision field, translation quality for texts in the subdivision fields of the national defense and military industry field can be significantly improved.

Description

Machine translation method and system for the national defense and military industry field oriented to subdivision fields

Technical Field

The invention relates to the technical field of machine translation, in particular to a method and a system for machine translation in the national defense and military industry field oriented to the subdivision field.

Background

Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Neural machine translation (NMT) is a commonly used machine translation method based on deep learning. It uses an attention-based encoder-decoder model: the encoder encodes the sentence to be translated (the source sentence) into a vector, and the decoder then decodes that vector into the corresponding translation (the target sentence). The national defense and military industry field has a large number of translation requirements in its subdivision fields, but current general machine translation models are difficult to optimize for these subdivision fields and rarely achieve the expected translation quality. Machine translation models oriented to the different subdivision fields therefore urgently need to be trained in order to improve translation quality in each field.

Disclosure of Invention

The invention aims to provide a machine translation method and a machine translation system for the national defense military field in the subdivision field, and aims to solve the problem that the existing general machine translation model is poor in translation quality of documents in the subdivision field in the national defense military field.

In order to achieve the purpose, the invention provides the following scheme:

a defense and military industry field machine translation method for a subdivision field comprises the following steps:

acquiring parallel corpus sentence pairs in a corpus; the parallel corpus sentence pairs comprise original texts and corresponding translated texts;

acquiring a trained SVM text classification model;

classifying the parallel corpus sentence pairs into each subdivision field of a knowledge system in the national defense and military industry field by adopting the trained SVM text classification model;

respectively training a general machine translation model by adopting the parallel corpus sentence pairs of each subdivision domain to generate a corresponding subdivision domain machine translation model;

acquiring a text to be translated;

determining a subdivision field of the text to be translated by adopting the SVM text classification model;

and calling a subdivision domain machine translation model corresponding to the subdivision domain of the text to be translated to translate the text to be translated, and generating a subdivision domain translation result.

Optionally, before acquiring the parallel corpus sentence pairs in the corpus, the method further includes:

acquiring the existing translation result in the national defense science and technology field; the translation result is the original text and the translated text of the successfully translated text;

and adopting a sentence alignment tool to divide the chapter-level translation results into sentence-level translation results, and performing a sentence alignment operation on the sentence-level translation results according to the original text and the translated text to generate a plurality of parallel corpus sentence pairs to be stored in the corpus.

Optionally, before the obtaining of the trained SVM text classification model, the method further includes:

selecting a plurality of parallel corpus sentence pairs marked with subdivision fields in the corpus as a training set;

and training each parallel corpus sentence pair in the training set and the corresponding subdivision field by adopting a Support Vector Machine (SVM) method to generate a trained SVM text classification model.

Optionally, after generating the subdivision domain translation result, the method further includes:

acquiring a manual proofreading result of the translation result of the subdivided field;

and dividing the manual proofreading result into a plurality of parallel corpus sentence pairs by adopting a sentence alignment tool and storing the sentence pairs in the corpus.

Optionally, after the SVM text classification model is adopted to determine the subdivided field of the text to be translated, the method further includes:

judging whether a user manually adjusts the subdivision field of the text to be translated or not to obtain a first judgment result;

if the first judgment result is that the user does not manually adjust the subdivision fields of the text to be translated, storing the text to be translated and the subdivision fields corresponding to the text to be translated into the corpus;

if the first judgment result is that the user manually adjusts the subdivision field of the text to be translated, judging whether subdivision field annotators approve the subdivision field of the text to be translated determined by the SVM text classification model, and obtaining a second judgment result;

if the second judgment result is that the subdivision field annotators approve the subdivision field of the text to be translated determined by the SVM text classification model, storing the text to be translated and its corresponding subdivision field into the corpus;

and if the second judgment result is that the subdivision field annotators do not approve the subdivision field of the text to be translated determined by the SVM text classification model, not storing the text to be translated and its corresponding subdivision field.

A machine translation system for the national defense and military industry field oriented to subdivision fields, the system comprising:

the parallel corpus sentence pair acquisition module is used for acquiring parallel corpus sentence pairs in a corpus; the parallel corpus sentence pairs comprise original texts and corresponding translated texts;

the SVM text classification model acquisition module is used for acquiring a trained SVM text classification model;

the parallel corpus sentence pair subdivision domain dividing module is used for classifying the parallel corpus sentence pairs into each subdivision field of a national defense and military industry field knowledge system by adopting the trained SVM text classification model;

the subdivision domain machine translation model training module is used for respectively training a general machine translation model by adopting the parallel corpus sentence pairs of each subdivision domain to generate a corresponding subdivision domain machine translation model;

the to-be-translated text acquisition module is used for acquiring a text to be translated;

the subdivision domain automatic division module is used for determining the subdivision field of the text to be translated by adopting the SVM text classification model;

and the subdivision domain machine translation module is used for calling a subdivision domain machine translation model corresponding to the subdivision domain of the text to be translated to translate the text to be translated and generating a subdivision domain translation result.

Optionally, the system further includes:

the existing translation achievement acquisition module is used for acquiring existing translation achievements in the national defense science and technology field before acquiring the parallel corpus sentence pairs in the corpus; the translation result is the original text and the translated text of the successfully translated text;

and the parallel corpus sentence pair dividing module is used for dividing the chapter-level translation results into sentence-level translation results by adopting a sentence alignment tool, performing sentence alignment operation on the sentence-level translation results according to the original text and the translated text, and generating a plurality of parallel corpus sentence pairs to be stored in the corpus.

Optionally, the system further includes:

the text classification training set selection module is used for selecting a plurality of parallel corpus sentence pairs marked with subdivision fields in the corpus as a training set before acquiring a trained SVM text classification model;

and the text classification module training module is used for training each parallel corpus sentence pair in the training set and the corresponding subdivision field by adopting a Support Vector Machine (SVM) method to generate a trained SVM text classification model.

Optionally, the system further includes:

the manual proofreading result acquisition module is used for acquiring the manual proofreading result of the translation result of the subdivided field after the translation result of the subdivided field is generated;

and the manual proofreading result dividing module is used for dividing the manual proofreading result into a plurality of parallel corpus sentence pairs by adopting a sentence alignment tool and storing the sentence pairs in the corpus.

Optionally, the system includes:

the manual adjustment judging module is used for judging whether a user manually adjusts the subdivision field of the text to be translated or not after the subdivision field of the text to be translated is determined by the SVM text classification model, so that a first judging result is obtained;

the first storage module is used for storing the text to be translated and the corresponding subdivided fields thereof into the corpus if the first judgment result indicates that the user does not manually adjust the subdivided fields of the text to be translated;

the subdivision field automatic division result judgment module is used for judging whether subdivision field annotators approve the subdivision field of the text to be translated determined by the SVM text classification model if the first judgment result is that the user manually adjusts the subdivision field of the text to be translated, and obtaining a second judgment result;

and the second storage module is used for storing the text to be translated and its corresponding subdivision field into the corpus if the second judgment result is that the subdivision field annotators approve the subdivision field of the text to be translated determined by the SVM text classification model.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention provides a machine translation method and system for the national defense and military industry field oriented to subdivision fields. Starting from a general machine translation model, the method divides the parallel corpus sentence pairs in a corpus by subdivision field and trains a subdivision domain machine translation model for each field. When a user translates, a trained SVM text classification model automatically determines the subdivision field of the text to be translated, and the corresponding subdivision domain machine translation model is called to translate the text and generate a subdivision domain translation result. Because each subdivision domain machine translation model is trained on parallel corpus sentence pairs from its own subdivision field, translation quality for texts in the subdivision fields of the national defense and military industry field can be significantly improved.

In addition, the user can manually adjust the subdivision field classification, and can manually correct the translated text after translation is finished. The results of manual adjustment and correction are stored in the system and used for periodic retraining, so that the SVM text classification model and the subdivision domain machine translation models are continuously optimized and machine translation quality in the subdivision fields is further improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a machine translation method in the defense and military industry field for the subdivision field provided by the invention;

FIG. 2 is a schematic diagram of the principle of the defense military industry field machine translation method for the subdivision field;

FIG. 3 is a structural diagram of a machine translation system in the defense and military industry field for the subdivision field provided by the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a machine translation method and a machine translation system in the national defense military field for the subdivision field, which improve the translation quality of neural machine translation in the national defense military field by carrying out targeted model training on different subdivision fields and solve the problem of poor translation quality of the existing general machine translation model to documents in the subdivision field in the national defense military field.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

FIG. 1 is a flowchart of the machine translation method for the national defense and military industry field oriented to subdivision fields provided by the invention, and FIG. 2 is a schematic diagram of the principle of the method. Referring to FIG. 1 and FIG. 2, the method specifically includes:

Step 101: acquire parallel corpus sentence pairs in the corpus.

Before acquiring the parallel corpus sentence pairs in the corpus in step 101, a corpus required by the present invention needs to be established, which specifically includes:

acquiring the translation achievements of the existing national defense science and technology field accumulated in the early stage; the translation result is the original text and the translated text of the successfully translated text;

and adopting a sentence alignment tool to divide the chapter-level translation results into sentence-level translation results, and performing a sentence alignment operation on the sentence-level translation results according to the original text and the translated text to generate a plurality of parallel corpus sentence pairs to be stored in the corpus. Each parallel corpus sentence pair comprises an original text and a corresponding translated text.

A parallel corpus sentence pair is a sentence-level manual translation result. For example, the original text "Fire control system developed by Loral" and its corresponding translated text form a parallel corpus sentence pair.

The corpus stores not only the more than two thousand parallel corpus sentence pairs divided from the existing translation results in the national defense science and technology field, but also, in real time, the parallel corpus sentence pairs divided from the translation results of each current text to be translated, together with their subdivision fields.
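The corpus-building step can be sketched in Python under a strong simplifying assumption: the original text and the translated text contain the same number of sentences in the same order. The patent does not name its sentence alignment tool, so `split_sentences` and `align_sentences` are hypothetical helpers, and the punctuation-based splitting rule is also an assumption. A real aligner would additionally handle one-to-many and many-to-one sentence mappings.

```python
import re
from typing import List, Tuple

# Sentence-ending punctuation for English and Chinese (an assumption; the
# patent does not specify the language pair or the splitting rules).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text: str) -> List[str]:
    """Split chapter-level text into sentence-level pieces."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def align_sentences(original: str, translated: str) -> List[Tuple[str, str]]:
    """Pair sentences one-to-one, in order, to form parallel sentence pairs."""
    src = split_sentences(original)
    tgt = split_sentences(translated)
    if len(src) != len(tgt):
        # A one-to-one pairing is not possible; a real alignment tool is needed.
        raise ValueError("sentence counts differ between original and translation")
    return list(zip(src, tgt))


# Placeholder texts, not taken from the patent.
pairs = align_sentences("Source sentence one. Source sentence two.",
                        "目标句一。目标句二。")
```

Each resulting tuple is one parallel corpus sentence pair that could be stored in the corpus.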

Step 102: acquire a trained SVM text classification model.

Before the trained SVM text classification model is obtained in step 102, the SVM text classification model needs to be trained, which specifically includes:

1) Construct a training set. Sort through the parallel corpus sentence pairs in the current corpus and select a plurality of parallel corpus sentence pairs marked with subdivision fields as the training set. The training set data is not less than 10% of the total data volume.

2) Train on each parallel corpus sentence pair in the training set and its corresponding subdivision field by adopting the SVM method to form a trained SVM text classification model.
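Step 1) can be illustrated with a short sketch, assuming each corpus entry is a record whose subdivision field label may be missing. The `SentencePair` record and the `build_training_set` helper are hypothetical names introduced here; only the 10% threshold comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentencePair:
    original: str
    translated: str
    field: Optional[str] = None  # subdivision field label; None if unlabeled


def build_training_set(corpus: List[SentencePair],
                       min_fraction: float = 0.10) -> List[SentencePair]:
    """Select the labeled pairs and check they cover at least 10% of the corpus."""
    labeled = [pair for pair in corpus if pair.field is not None]
    if len(labeled) < min_fraction * len(corpus):
        raise ValueError("labeled pairs are less than the required share of the corpus")
    return labeled


# Placeholder data: 2 labeled pairs out of 10 (20% of the corpus).
corpus = ([SentencePair("src", "tgt", "mechanical dynamics")] * 2
          + [SentencePair("src", "tgt")] * 8)
training_set = build_training_set(corpus)
```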

An SVM (support vector machine) is a generalized linear classifier that classifies data by supervised learning. It can also perform nonlinear classification by using a kernel method, and is widely applied to pattern recognition problems such as portrait recognition and text classification. The input of the SVM text classification model trained by the invention is a parallel corpus sentence pair, and the output is the corresponding subdivision field.
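As a rough illustration of SVM-based text classification, the sketch below trains a one-vs-rest linear SVM by subgradient descent on the L2-regularized hinge loss, using only the Python standard library. Everything specific is an assumption: the bag-of-words features, the use of the original-language side only, the hyperparameters, and the toy training texts. The patent does not describe its feature extraction or solver, and a practical system would more likely use an established SVM library.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def featurize(text: str) -> Counter:
    """Bag-of-words counts over lowercase whitespace tokens (an assumption)."""
    return Counter(text.lower().split())


class LinearSVMClassifier:
    """One-vs-rest linear SVM; illustrative, not the patent's implementation."""

    def __init__(self, lr: float = 0.1, reg: float = 0.01, epochs: int = 20):
        self.lr, self.reg, self.epochs = lr, reg, epochs
        self.weights: Dict[str, Dict[str, float]] = {}
        self.bias: Dict[str, float] = {}

    def _score(self, label: str, x: Counter) -> float:
        w = self.weights[label]
        return sum(w.get(tok, 0.0) * n for tok, n in x.items()) + self.bias[label]

    def fit(self, samples: List[Tuple[str, str]]) -> "LinearSVMClassifier":
        data = [(featurize(text), label) for text, label in samples]
        for label in sorted({label for _, label in data}):
            self.weights[label] = defaultdict(float)
            self.bias[label] = 0.0
        for _ in range(self.epochs):
            for x, y in data:
                for label, w in self.weights.items():
                    t = 1.0 if y == label else -1.0
                    margin = t * self._score(label, x)
                    # Gradient step on the L2 regularizer: shrink every weight.
                    for tok in w:
                        w[tok] *= 1.0 - self.lr * self.reg
                    # Subgradient step on the hinge loss if the margin is violated.
                    if margin < 1.0:
                        for tok, n in x.items():
                            w[tok] += self.lr * t * n
                        self.bias[label] += self.lr * t
        return self

    def predict(self, text: str) -> str:
        x = featurize(text)
        return max(self.weights, key=lambda label: self._score(label, x))


# Toy training texts invented for illustration; the labels reuse field names
# mentioned in the description.
clf = LinearSVMClassifier().fit([
    ("gear transmission shaft", "transmission mechanics"),
    ("vibration load dynamics", "mechanical dynamics"),
    ("shaft gear bearing", "transmission mechanics"),
    ("dynamics vibration damping", "mechanical dynamics"),
])
```

Calling `clf.predict(text)` returns the label whose one-vs-rest classifier gives the highest score for the text.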

A subdivision field is the smallest field obtained by sorting through the knowledge system of the Chinese national defense and military industry field and finely dividing its technical fields. The knowledge system comprises a plurality of levels with a subordination relationship between them. For example, the engineering science field can be further divided into knowledge nodes such as mechanical engineering, engineering thermophysics and electrical engineering, and mechanical engineering can be further divided into subdivision fields such as mechanics and robotics, transmission mechanics and mechanical dynamics.
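The hierarchy described above can be represented as a nested mapping whose leaves are the subdivision fields. The sketch below encodes only the nodes named in the example; the nested-dict representation and the `subdivision_fields` helper are assumptions made for illustration.

```python
from typing import Dict, Iterator, Tuple

# Only the nodes named in the description are listed. The children of
# "engineering thermophysics" and "electrical engineering" are not given in
# the text, so those nodes appear as leaves here.
KNOWLEDGE_SYSTEM: Dict[str, dict] = {
    "engineering science": {
        "mechanical engineering": {
            "mechanics and robotics": {},
            "transmission mechanics": {},
            "mechanical dynamics": {},
        },
        "engineering thermophysics": {},
        "electrical engineering": {},
    },
}


def subdivision_fields(tree: Dict[str, dict],
                       path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the full path to every leaf node of the knowledge system."""
    for name, children in tree.items():
        if children:
            yield from subdivision_fields(children, path + (name,))
        else:
            yield path + (name,)


leaves = list(subdivision_fields(KNOWLEDGE_SYSTEM))
```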

Step 103: classify the parallel corpus sentence pairs into each subdivision field of the knowledge system of the national defense and military industry field by adopting the trained SVM text classification model.

Using SVM-based text classification, the trained SVM text classification model classifies the existing parallel corpus sentence pairs into the subdivision fields under the corresponding knowledge nodes of the knowledge system.

Step 104: train a general machine translation model with the parallel corpus sentence pairs of each subdivision field respectively to generate a corresponding subdivision domain machine translation model.

For each subdivision field, a subdivision domain machine translation model is trained on the classified parallel corpus sentence pairs. The general machine translation model is an existing machine translation model comprising an encoder and a decoder: the encoder encodes the sentence to be translated (the source sentence) into a vector, and the decoder then decodes that vector to form the corresponding translation. The subdivision domain machine translation model trained by the invention differs from the general machine translation model in its training set: it is trained on the training set of the corresponding subdivision field (namely the parallel corpus sentence pairs in the corpus) collected during actual use, which improves translation quality in that subdivision field.
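Steps 103 and 104 together amount to partitioning the corpus by field and then training one model per partition. The sketch below shows that control flow only. The `classify` and `fine_tune` callables are hypothetical stand-ins for the SVM classifier and for whatever neural machine translation toolkit performs the training; the patent names neither.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (original text, translated text)


def partition_by_field(pairs: List[Pair],
                       classify: Callable[[Pair], str]) -> Dict[str, List[Pair]]:
    """Group parallel sentence pairs by the subdivision field the classifier assigns."""
    by_field: Dict[str, List[Pair]] = defaultdict(list)
    for pair in pairs:
        by_field[classify(pair)].append(pair)
    return dict(by_field)


def train_field_models(general_model: Any,
                       by_field: Dict[str, List[Pair]],
                       fine_tune: Callable[[Any, List[Pair]], Any]) -> Dict[str, Any]:
    """Train one model per field, each starting from the general model."""
    return {field: fine_tune(general_model, pairs)
            for field, pairs in by_field.items()}


# Stub classifier and trainer, for illustration only.
def stub_classify(pair: Pair) -> str:
    return "transmission mechanics" if "gear" in pair[0] else "mechanical dynamics"


def stub_fine_tune(base: Any, pairs: List[Pair]) -> Any:
    return (base, len(pairs))  # a real trainer would return a trained model


pairs = [("gear shaft", "t1"), ("vibration load", "t2"), ("gear bearing", "t3")]
by_field = partition_by_field(pairs, stub_classify)
models = train_field_models("general-model", by_field, stub_fine_tune)
```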

Step 105: acquire a text to be translated.

The text to be translated can be a whole article, or a paragraph or a sentence in the article.

Step 106: determine the subdivision field of the text to be translated by adopting the SVM text classification model.

Before the user translates, the system automatically classifies the text to be translated according to its content and determines its subdivision field. After the automatically determined subdivision field is displayed to the user, the user can modify it. Machine translation is then performed with the machine translation model corresponding to either the automatically determined subdivision field or the subdivision field manually adjusted by the user, forming the corresponding translation.

That is, after the step 106 determines the subdivided domain of the text to be translated by using the SVM text classification model, the method further includes:

judging whether the user manually adjusts the subdivision field of the text to be translated to obtain a first judgment result. If the first judgment result is that the user does not manually adjust the subdivision field, the user approves the subdivision field automatically determined by the SVM text classification model, and the text to be translated and its corresponding subdivision field can be stored directly in the corpus. If the first judgment result is that the user manually adjusts the subdivision field, the user does not approve the subdivision field automatically determined by the SVM text classification model, and the accuracy of the classification result needs to be judged further, specifically as follows:

when the subdivision field automatically determined by the SVM text classification model is inconsistent with the subdivision field submitted by the user, the system automatically and randomly submits the case to 5 subdivision field annotators for manual judgment of whether they approve the subdivision field of the text to be translated determined by the SVM text classification model, and a second judgment result is obtained;

if the second judgment result is that more than 3 subdivision field annotators approve the subdivision field of the text to be translated determined by the SVM text classification model, the text to be translated and its corresponding subdivision field are stored in the corpus as candidate training set data for the SVM text classification model. Training the SVM text classification model on the updated text classification training set continuously improves the accuracy of text classification.

If the second judgment result is that more than 3 subdivision field annotators do not approve the subdivision field of the text to be translated determined by the SVM text classification model, the data are discarded and the text to be translated and its corresponding subdivision field are not stored.
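The two-stage judgment above can be written as a small decision function. This is a sketch under one explicit assumption: the text says "more than 3" of the 5 annotators, which is read here as a majority of at least 3 so that every vote has a defined outcome. The threshold is exposed as the `min_approvals` parameter for that reason, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def keep_for_classifier_training(predicted_field: str,
                                 user_field: Optional[str],
                                 annotator_approvals: Sequence[bool] = (),
                                 min_approvals: int = 3) -> bool:
    """Decide whether (text, predicted_field) is stored in the corpus.

    user_field is None when the user left the automatic result unchanged.
    annotator_approvals holds one vote per annotator on the SVM result.
    """
    # First judgment: no manual adjustment means the user accepts the SVM result.
    if user_field is None or user_field == predicted_field:
        return True
    # Second judgment: annotators vote on the SVM result.
    # Assumption: at least min_approvals approvals are required to keep it.
    return sum(annotator_approvals) >= min_approvals


accepted = keep_for_classifier_training("mechanical dynamics", None)
upheld = keep_for_classifier_training(
    "mechanical dynamics", "transmission mechanics",
    [True, True, True, False, False])
rejected = keep_for_classifier_training(
    "mechanical dynamics", "transmission mechanics",
    [True, False, False, False, True])
```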

Step 107: call the subdivision domain machine translation model corresponding to the subdivision field of the text to be translated to translate the text to be translated, and generate a subdivision domain translation result.

The translation process of the method is as follows. The user first opens the text to be translated and clicks 'start translation' on the foreground interface. A translation configuration box pops up, in which the user selects the languages of the original text and the translated text. The background automatically determines the subdivision field of the text to be translated and displays it to the user, who chooses whether to change it. If the user does not change it, the machine translation model corresponding to the automatically determined subdivision field is called to start translation; if the user changes it, the machine translation model corresponding to the manually adjusted subdivision field is called instead. After the translation is completed, the background displays the translation result of the current subdivision field on the foreground interface so that the user can conveniently proofread it further.

If the user does not modify the translated text, the current text to be translated and its subdivision domain translation result are stored directly in the corpus. If the user corrects improper passages in the translated text, the corrections are submitted to the background once complete, and the background stores the subdivision domain translation result as modified by the user (the manual proofreading result). The background can therefore divide either the current subdivision domain translation result or the user's manual proofreading result into parallel corpus sentence pairs and store them in the corpus as training set data for the subdivision domain machine translation model. Training the machine translation model on the updated training set continuously improves machine translation quality in the subdivision field.
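The model selection in this workflow is a simple dispatch: use the user's field if one was chosen, otherwise the classifier's, then call the model registered for that field. The sketch below assumes each translation model is a plain callable from source text to translated text; `translate_with_field_model` and the stub models are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Translator = Callable[[str], str]


def translate_with_field_model(text: str,
                               classify: Callable[[str], str],
                               field_models: Dict[str, Translator],
                               user_field: Optional[str] = None) -> Tuple[str, str]:
    """Return (field used, translation) for the text to be translated."""
    # A manual adjustment by the user overrides the automatic classification.
    field = user_field if user_field is not None else classify(text)
    if field not in field_models:
        raise KeyError(f"no translation model trained for field {field!r}")
    return field, field_models[field](text)


# Stub classifier and models, for illustration only.
field_models: Dict[str, Translator] = {
    "transmission mechanics": lambda s: f"[TM] {s}",
    "mechanical dynamics": lambda s: f"[MD] {s}",
}
auto = translate_with_field_model(
    "gear shaft", lambda s: "transmission mechanics", field_models)
manual = translate_with_field_model(
    "gear shaft", lambda s: "transmission mechanics", field_models,
    user_field="mechanical dynamics")
```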

That is, after the subdivision domain translation result is generated in step 107, the method further includes:

acquiring a manual proofreading result of the subdivision domain translation result; dividing the manual proofreading result into a plurality of parallel corpus sentence pairs by adopting a sentence alignment tool and storing them in the corpus; and training the subdivision domain machine translation model on the parallel corpus sentence pairs in the updated corpus.
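The feedback step can be sketched as appending sentence pairs to a per-field corpus, preferring the proofread sentences when the user supplied them. The sketch assumes the sentences have already been split and aligned one-to-one; the dictionary layout and the `store_translation` helper are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (original sentence, translated sentence)


def store_translation(corpus: Dict[str, List[Pair]],
                      field: str,
                      original: List[str],
                      machine: List[str],
                      proofread: Optional[List[str]] = None) -> int:
    """Append sentence pairs for one translated text; return how many were added."""
    # Prefer the manual proofreading result when the user modified the translation.
    target = proofread if proofread is not None else machine
    if len(original) != len(target):
        raise ValueError("sentence lists must already be aligned one-to-one")
    corpus.setdefault(field, []).extend(zip(original, target))
    return len(original)


# Placeholder sentences, not taken from the patent.
corpus: Dict[str, List[Pair]] = {}
added = store_translation(corpus, "mechanical dynamics",
                          ["s1", "s2"], ["m1", "m2"], proofread=["m1", "p2"])
```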

In practical application, the subdivision domain machine translation model can be trained once every half year, and the updating time of the subdivision domain machine translation model can be adjusted according to user requirements and actual data scale.
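A retraining check along these lines might look as follows. The half-year interval is approximated as 182 days, which is an assumption, and it is a parameter because the text allows the update time to be adjusted. `retraining_due` is a hypothetical helper.

```python
from datetime import date, timedelta


def retraining_due(last_trained: date, today: date,
                   interval_days: int = 182) -> bool:
    """Return True once the retraining interval has elapsed."""
    # 182 days approximates half a year (an assumption); adjust as required.
    return today - last_trained >= timedelta(days=interval_days)


due = retraining_due(date(2020, 1, 1), date(2020, 7, 1))
not_due = retraining_due(date(2020, 1, 1), date(2020, 6, 1))
```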

The invention divides the corpus by subdivision field on the basis of a general machine translation model and trains a translation model for each subdivision field. When a user translates, the SVM text classification model automatically determines the subdivision field from the original text; the user can also manually adjust the subdivision field classification, and the system continuously optimizes the SVM text classification model according to the updated classification results. After the translation is finished, the user can manually correct the translated text, and the result of manual correction is stored in the system. The subdivision domain machine translation model is retrained periodically on the results of manual correction, so that machine translation quality in the subdivision field is continuously improved.

Based on the subdivision-domain-oriented machine translation method in the national defense military field, the invention also provides a subdivision-domain-oriented machine translation system in the national defense military field, and referring to fig. 3, the system comprises:

a parallel corpus sentence pair obtaining module 301, configured to obtain parallel corpus sentence pairs in a corpus; the parallel corpus sentence pairs comprise original texts and corresponding translated texts;

an SVM text classification model obtaining module 302, configured to obtain a trained SVM text classification model;

a parallel corpus sentence pair subdivision domain dividing module 303, configured to classify the parallel corpus sentence pairs into each subdivision domain of a national defense and military industry domain knowledge system by using the trained SVM text classification model;

a subdivision domain machine translation model training module 304, configured to respectively train a general machine translation model by using the parallel corpus sentence pairs of each subdivision field, and generate a corresponding subdivision domain machine translation model;

a to-be-translated text acquisition module 305, configured to acquire a to-be-translated text;

a subdivision domain automatic division module 306, configured to determine the subdivision field of the text to be translated by using the SVM text classification model;

and a subdivision domain machine translation module 307, configured to call the subdivision domain machine translation model corresponding to the subdivision field of the text to be translated to translate the text to be translated, so as to generate a subdivision domain translation result.

Further, the system further comprises:

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A national defense and military industry field machine translation method for a subdivision field is characterized by comprising the following steps:

acquiring a trained SVM text classification model;

acquiring a text to be translated;

2. The method for machine translation in the national defense and military field according to claim 1, wherein before the obtaining of the parallel corpus sentence pairs in the corpus, the method further comprises:

3. The national defense and military field machine translation method according to claim 2, further comprising, before the obtaining of the trained SVM text classification model:

4. The national defense and military field machine translation method according to claim 3, further comprising, after generating the subdivision domain translation result:

5. The defense and military field machine translation method of claim 4, wherein after the SVM text classification model is adopted to determine the subdivided field of the text to be translated, the method further comprises the following steps:

6. A defense military field machine translation system oriented to a subdivision field, the system comprising:

7. The defense military domain machine translation system of claim 6, wherein the system further comprises:

8. The defense military domain machine translation system of claim 7, further comprising:

9. The defense military domain machine translation system of claim 8, wherein the system further comprises:

10. The defense military domain machine translation system of claim 9, wherein the system comprises: