CN111738022B - Machine translation optimization method and system in national defense and military industry field

Info

Publication number: CN111738022B
Application number: CN202010578821.9A
Authority: CN
Inventors: 姚晗; 晏裕生; 熊晓丹; 孙孟阳; 董文轩; 江洋; 李兴亚; 苏慧超
Original assignee: China Institute Of Marine Technology & Economy
Current assignee: China Institute Of Marine Technology & Economy
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2023-04-18
Anticipated expiration: 2040-06-23
Also published as: CN111738022A

Abstract

The invention relates to a machine translation optimization method and system in the field of national defense and military industry. The method comprises the following steps: extracting keywords of an article to be translated to obtain a keyword list; calculating word vectors of each keyword and corresponding contexts in the keyword list; calculating cosine similarity of the keyword pairs by the word vectors; performing hierarchical clustering on all keywords in the keyword list according to the cosine similarity to obtain a plurality of word categories; and translating all the keywords in each word category by adopting a machine translation model to obtain a user translation method. The invention aims to provide a machine translation optimization method and a machine translation optimization system in the field of national defense and military industry, so that translation consistency of keywords in the whole article is ensured, and translation quality is improved.

Description

Machine translation optimization method and system in national defense and military industry field

Technical Field

The invention relates to the field of machine translation, in particular to a machine translation optimization method and system in the field of national defense and military industry.

Background

Machine translation is the process of using a computer to convert text in one natural language (the source language) into another natural language (the target language). Neural machine translation (NMT), which is based on deep learning, is a commonly used machine translation method at present. It uses an attention-based encoder-decoder model: the encoder encodes the sentence to be translated (the source sentence) into a vector, and the decoder, after training in a deep neural network, decodes that vector into the corresponding translation (the target sentence).

The field of national defense and military industry contains a large number of technical terms. Existing machine translation considers only the context within each sentence and ignores chapter-level context, so the same term is translated inconsistently at different places in the same article, which seriously degrades translation quality. Post-translation correction, in turn, simply replaces every occurrence of a word throughout the text with the same rendering, which cannot handle a word that has several meanings. As a result, the overall accuracy of machine translation results remains low.

Disclosure of Invention

The invention aims to provide a machine translation optimization method and a machine translation optimization system in the field of national defense and military industry, so that translation consistency and accuracy of keywords in the whole article are ensured, and translation quality is improved.

In order to achieve the purpose, the invention provides the following scheme:

a machine translation optimization method in the field of national defense and military industry comprises the following steps:

extracting keywords of an article to be translated to obtain a keyword list;

calculating word vectors of each keyword and corresponding context in the keyword list, wherein the corresponding context is a word adjacent to the keyword in the article to be translated;

calculating cosine similarity of the keyword pairs according to the word vectors; the keyword pair is two identical keywords which appear at different positions of the article to be translated in the keyword list;

performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories; each word category at least comprises one keyword;

translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translation methods corresponding to each word category;

determining the translation with the highest occurrence probability in the multiple translations corresponding to the word categories as the user translation corresponding to the word categories;

translating the article to be translated by using the machine translation model to obtain a machine translation method of each keyword;

judging whether the machine translation method is the same as a user translation method of the word category to which the corresponding keyword belongs to obtain a first judgment result;

if the first judgment result is negative, translating the key words by adopting the user translation method;

if the first judgment result is yes, translating the keyword by adopting the machine translation method.
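
The word-vector and cosine-similarity steps above can be illustrated with a minimal sketch. It is not the claimed implementation: a simple bag-of-words count vector over the context window stands in for a learned word vector, and the function names `context_vector`, `cosine_similarity`, and `keyword_pair_similarities`, as well as the default window of two words on each side, are assumptions made for illustration.

```python
import math
from collections import Counter


def context_vector(tokens, index, window=2):
    """Count vector of the keyword at `index` plus up to `window` words on each side.

    Assumption: bag-of-words counts stand in for a learned word vector.
    """
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    return Counter(tokens[start:end])


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_pair_similarities(tokens, keyword, window=2):
    """Similarity for every pair of occurrences of the same keyword.

    Keys are (position_1, position_2) tuples with position_1 < position_2.
    """
    positions = [i for i, t in enumerate(tokens) if t == keyword]
    vectors = {i: context_vector(tokens, i, window) for i in positions}
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for k, i in enumerate(positions)
        for j in positions[k + 1:]
    }
```

Two occurrences of the same keyword that share few surrounding words receive a low similarity, which is the signal the clustering step uses to separate different senses of the keyword.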

Optionally, the extracting keywords of the article to be translated to obtain a keyword list specifically includes:

obtaining a corpus;

segmenting the article to be translated to obtain a plurality of words;

calculating a word frequency-inverse text frequency value of each word according to the corpus;

and sorting all the words according to the word frequency-inverse text frequency value to obtain a keyword list.

Optionally, the calculating a word frequency-inverse text frequency value of each word according to the corpus specifically includes:

according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, calculating a word frequency-inverse text frequency value for each of said words, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|},$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance degree of the word $i$ in the article $j$ to be translated, TF represents the word frequency, IDF represents the inverse text frequency, $\mathrm{TF}_{i,j}$ represents the word frequency value of the word $i$ in the article $j$ to be translated, $\mathrm{IDF}_i$ represents the inverse text frequency value of the word $i$, $n_{i,j}$ represents the number of times the word $i$ appears in the article $j$ to be translated, $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated, $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated, $|D|$ represents the total number of articles in the corpus, $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus containing the word $i$, $t_i$ represents the word $i$, and $d_j$ represents the article $j$ to be translated.
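
As an illustrative worked example, assume the usual definitions $\mathrm{TF}_{i,j} = n_{i,j} / \sum_k n_{k,j}$ and $\mathrm{IDF}_i = \log(|D| / |\{j : t_i \in d_j\}|)$. The numbers below are hypothetical and are not taken from the embodiments, and a base-10 logarithm is assumed only to keep the arithmetic simple. Suppose the word $i$ appears 3 times in an article of 100 words, and 10 of the 1000 articles in the corpus contain it:

```latex
\begin{align*}
\mathrm{TF}_{i,j} &= \frac{n_{i,j}}{\sum_k n_{k,j}} = \frac{3}{100} = 0.03,\\
\mathrm{IDF}_i &= \log_{10}\frac{|D|}{|\{j : t_i \in d_j\}|} = \log_{10}\frac{1000}{10} = 2,\\
\mathrm{TF\text{-}IDF}_{i,j} &= \mathrm{TF}_{i,j} \times \mathrm{IDF}_i = 0.03 \times 2 = 0.06.
\end{align*}
```

A word that occurs in every article of the corpus has $\mathrm{IDF}_i = \log 1 = 0$, so its value is zero however often it appears in the article to be translated.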

Optionally, the determining process of the machine translation model is:

acquiring bilingual parallel sentence pairs, wherein the bilingual parallel sentence pairs consist of original texts and translations corresponding to the original texts;

and inputting the bilingual parallel sentence pairs into a deep neural network to obtain a machine translation model.

Optionally, the performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories specifically includes:

calculating the average value of cosine similarity of all keywords in the current word category and target keywords under the current clustering frequency to obtain the average cosine similarity before division, wherein the current word category is the word category obtained under the last clustering frequency;

judging whether the average cosine similarity before division is greater than a specific threshold value or not to obtain a second judgment result;

if the second judgment result is yes, the target keyword is divided into the current word category to obtain a divided word category;

calculating the average value of the cosine similarity of all the keywords in the divided word categories to obtain the divided average cosine similarity;

judging whether the divided average cosine similarity is smaller than a set multiple of the average cosine similarity before division to obtain a third judgment result;

and if the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, then updating the current clustering frequency, and returning to the step of calculating the average value of the cosine similarity of all the keywords in the current word category and the target keyword under the current clustering frequency to obtain the average cosine similarity before division.

A machine translation optimization system in the national defense and military field comprises:

the extraction module is used for extracting keywords of the article to be translated to obtain a keyword list;

the first calculation module is used for calculating word vectors of each keyword and corresponding context in the keyword list, wherein the corresponding context is a word adjacent to the keyword in the article to be translated;

the second calculation module is used for calculating the cosine similarity of the keyword pair according to the word vector; the keyword pair is two identical keywords appearing at different positions of the article to be translated in the keyword list;

the clustering module is used for respectively carrying out hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories; each word category at least comprises one keyword;

the translation module is used for translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translations corresponding to each word category;

the user translation determining module is used for determining the translation with the highest occurrence probability in the plurality of translations corresponding to the word categories as the user translation corresponding to the word categories;

the machine translation method determining module is used for translating the article to be translated by utilizing the machine translation model to obtain a machine translation method of each keyword;

the first judgment module is used for judging whether the machine translation method is the same as the user translation method of the word category to which the corresponding keyword belongs to obtain a first judgment result;

the user translation module is used for translating the key words by adopting the user translation if the first judgment result is negative;

and the machine translation method translation module is used for translating the keyword by adopting the machine translation method if the first judgment result is positive.

Optionally, the extracting module specifically includes:

a corpus acquisition unit for acquiring a corpus;

the word segmentation unit is used for segmenting words of the article to be translated to obtain a plurality of words;

the word frequency-inverse text frequency value calculating unit is used for calculating the word frequency-inverse text frequency value of each word according to the corpus;

and the keyword list unit is used for sorting all the words according to the word frequency-inverse text frequency value to obtain a keyword list.

Optionally, the word frequency-inverse text frequency value calculating unit includes:

a word frequency-inverse text frequency value calculating subunit for calculating, according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, a word frequency-inverse text frequency value for each of the words, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|},$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance degree of the word $i$ in the article $j$ to be translated, TF represents the word frequency, IDF represents the inverse text frequency, $\mathrm{TF}_{i,j}$ represents the word frequency value of the word $i$ in the article $j$ to be translated, $\mathrm{IDF}_i$ represents the inverse text frequency value of the word $i$, $n_{i,j}$ represents the number of times the word $i$ appears in the article $j$ to be translated, $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated, $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated, $|D|$ represents the total number of articles in the corpus, $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus containing the word $i$, $t_i$ represents the word $i$, and $d_j$ represents the article $j$ to be translated.

Optionally, the determining of the machine translation model specifically includes:

the bilingual parallel sentence pair acquisition unit is used for acquiring a bilingual parallel sentence pair, and the bilingual parallel sentence pair consists of an original text and a translation corresponding to the original text;

and the machine translation model determining unit is used for inputting the bilingual parallel sentence pairs into the deep neural network to obtain a machine translation model.

Optionally, the clustering module specifically includes:

the pre-division average cosine similarity calculation unit is used for calculating the average value of the cosine similarity of all keywords in the current word category and the target keywords under the current clustering frequency to obtain the pre-division average cosine similarity, and the current word category is the word category obtained under the last clustering frequency;

the second judging unit is used for judging whether the average cosine similarity before division is greater than a specific threshold value or not to obtain a second judging result;

the divided word category unit is used for dividing the target keyword into the current word category to obtain a divided word category if the second judgment result is yes;

the divided average cosine similarity calculation unit is used for calculating the average value of the cosine similarities of all the keywords in the divided word categories to obtain the divided average cosine similarity;

a third judging unit, configured to judge whether the divided average cosine similarity is smaller than a set multiple of the average cosine similarity before the division, to obtain a third judgment result;

and the clustering unit is used for, if the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, updating the current clustering frequency, and returning to the step of calculating the average value of the cosine similarity of all the keywords in the current word category and the target keyword under the current clustering frequency to obtain the average cosine similarity before division.

According to the specific embodiments provided by the invention, the invention discloses the following technical effects: before machine translation, the keywords of the article are identified and clustered to obtain a plurality of word categories, and the translation with the highest probability of occurrence in each word category is taken as the standard translation for that category, so that translation consistency of the keywords in the whole article is ensured and translation quality is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a detailed flowchart of a machine translation optimization method in the national defense and military industry field according to embodiment 1 of the present invention;

FIG. 2 is an overall flowchart of a machine translation optimization method in the national defense and military field in embodiment 1 of the present invention;

FIG. 3 is a schematic composition diagram of a machine translation optimization system in the field of defense and military industry in embodiment 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a machine translation optimization method and system in the field of national defense and military industry. Before machine translation, the keywords of the article are identified and clustered to obtain a plurality of word categories, and the translation with the highest probability of occurrence in each word category is taken as the standard translation for that category, so that translation consistency of the keywords in the whole article is ensured and translation quality is improved.

In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.

Example 1

This embodiment provides a machine translation optimization method in the field of national defense and military industry. Referring to FIG. 1 and FIG. 2, the overall flow of the method is as follows: first, original texts translated in earlier work are collected to form a corpus; next, the subject terms of the original text are calculated; the subject terms are then clustered; a translation is then established for each category; and finally, during translation, the article is translated according to the preset translations.

The specific process of the machine translation optimization method comprises the following steps:

Step 101: extracting keywords of the article to be translated to obtain a keyword list.

Step 102: calculating a word vector of each keyword and its corresponding context in the keyword list. The corresponding context is the words adjacent to the keyword in the article to be translated; for example, the adjacent words may be the two words immediately before the keyword and the two words immediately after it. The word vector of each keyword may be obtained by using word2vec to compute the feature vector formed by the keyword and its corresponding context.

Step 103: calculating the cosine similarity of each keyword pair from the word vectors. A keyword pair is two identical keywords in the keyword list that appear at different positions of the article to be translated.

Step 104: performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories, each of which includes at least one keyword. In the hierarchical clustering, a vector space model is used to compute the feature vectors formed by the words surrounding each entity, the feature vectors are compared by cosine similarity, and similar entities are clustered into one category, which addresses the problem of one word having multiple meanings.

Step 105: translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translation methods corresponding to each word category.

Step 106: determining the translation with the highest occurrence probability among the multiple translations corresponding to the word category as the user translation corresponding to the word category.

Step 107: translating the article to be translated by using the machine translation model to obtain a machine translation method of each keyword.

Step 108: judging whether the machine translation method is the same as the user translation method of the word category to which the corresponding keyword belongs to obtain a first judgment result.

Step 109: if the first judgment result is negative, translating the keyword by adopting the user translation method.

Step 110: if the first judgment result is yes, translating the keyword by adopting the machine translation method.
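
Steps 106 to 110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: "highest occurrence probability" is interpreted as the most frequent candidate, and the function names, the occurrence identifiers, and the dictionary layout are all hypothetical.

```python
from collections import Counter


def choose_user_translation(candidate_translations):
    """Step 106 (assumed reading): pick the most frequent translation in a word category."""
    return Counter(candidate_translations).most_common(1)[0][0]


def enforce_consistency(machine_translations, category_of, user_translation_of):
    """Steps 108-110: keep the machine translation only when it matches the user translation.

    machine_translations: occurrence id -> machine translation of that keyword occurrence
    category_of:          occurrence id -> word category id
    user_translation_of:  word category id -> user translation chosen in step 106
    """
    final = {}
    for occurrence, machine in machine_translations.items():
        user = user_translation_of[category_of[occurrence]]
        # First judgment: same -> keep machine translation; different -> use user translation.
        final[occurrence] = machine if machine == user else user
    return final
```

In this sketch every occurrence in a word category ends up with the same rendering, which is the consistency property the method aims for. Occurrences placed in different categories can still receive different renderings.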

Step 101 specifically includes:

(1) Obtaining a corpus. A large amount of foreign-language material in the field of national defense and military industry is collected to form the corpus. The corpus is a database of foreign-language material accumulated over long-term work in the field and is stored in chapter form. It provides the basis for calculating the inverse text frequency (IDF) when the TF-IDF algorithm computes the keywords, which prevents words that occur frequently in all articles, such as articles and other common function words, from being selected as keywords.

(2) Segmenting the article to be translated to obtain a plurality of words. The jieba word segmentation toolkit, which is commonly used in the art, may be used here.

(3) Calculating the word frequency-inverse text frequency value of each word according to the corpus specifically comprises the following steps:

according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, calculating a word frequency-inverse text frequency value for each of the words, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|},$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance degree of the word $i$ in the article $j$ to be translated, TF represents the word frequency, IDF represents the inverse text frequency, $\mathrm{TF}_{i,j}$ represents the word frequency value of the word $i$ in the article $j$ to be translated, $\mathrm{IDF}_i$ represents the inverse text frequency value of the word $i$, $n_{i,j}$ represents the number of times the word $i$ appears in the article $j$ to be translated, $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated, $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated, $|D|$ represents the total number of articles in the corpus, $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus containing the word $i$, $t_i$ represents the word $i$, and $d_j$ represents the article $j$ to be translated.

(4) Sorting all the words according to the word frequency-inverse text frequency value to obtain a keyword list. The keywords are the subject words that occur most often in the article to be translated.
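
Sub-steps (1) to (4) of step 101 can be sketched in a few lines. The sketch assumes the article and the corpus have already been segmented into token lists, and it assumes the standard definitions TF = n / (total words) and IDF = log(|D| / document frequency) with a natural logarithm. The function name `tf_idf_keywords`, the `top_k` parameter, and the handling of words absent from the corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf_keywords(article_tokens, corpus, top_k=10):
    """Rank the words of one article by TF-IDF against a corpus.

    article_tokens: list of words from the segmented article to be translated
    corpus:         list of articles, each a list of words
    Returns up to `top_k` (word, score) pairs, highest score first.
    """
    counts = Counter(article_tokens)
    total_words = len(article_tokens)
    num_articles = len(corpus)
    article_sets = [set(doc) for doc in corpus]

    scores = {}
    for word, n in counts.items():
        doc_freq = sum(1 for words in article_sets if word in words)
        # Assumption: a word unseen in the corpus is treated as appearing in one
        # article, to avoid division by zero; the source does not address this case.
        doc_freq = max(doc_freq, 1)
        tf = n / total_words
        idf = math.log(num_articles / doc_freq)
        scores[word] = tf * idf

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

A word that appears in every corpus article scores zero, which is how the corpus keeps very common words out of the keyword list.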

Step 104 specifically includes:

(1) Calculating the average value of the cosine similarity of all the keywords in the current word category and the target keyword under the current clustering frequency to obtain the average cosine similarity before division, wherein the current word category is the word category obtained under the last clustering frequency.

(2) Judging whether the average cosine similarity before division is greater than a specific threshold value to obtain a second judgment result.

(3) If the second judgment result is yes, dividing the target keyword into the current word category to obtain the divided word category.

(4) Calculating the average value of the cosine similarity of all the keywords in the divided word category to obtain the divided average cosine similarity.

(5) Judging whether the divided average cosine similarity is smaller than a set multiple of the average cosine similarity before division to obtain a third judgment result.

(6) If the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, then updating the current clustering frequency, and returning to the step of calculating the average value of the cosine similarity of all the keywords in the current word category and the target keyword under the current clustering frequency to obtain the average cosine similarity before division.

When the current clustering frequency (that is, the number of clustering iterations performed so far) is 1, the keyword pairs are sorted in descending order of cosine similarity, and the two keywords with the largest cosine similarity are merged into a new category. The average cosine similarity after merging is then calculated and compared with the cosine similarity before merging. The cosine similarity threshold is set to greater than 0.6, and clustering stops when the average cosine similarity after merging is less than 80% of the average cosine similarity before merging.
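
The threshold test and the 80% test described above can be sketched as a simplified greedy procedure. This is one possible reading of the text, not the exact procedure of the embodiment: it processes the keyword occurrences in the order given instead of first sorting pairs by similarity, and the function name `cluster_occurrences`, the `sim` callback, and the parameter names `threshold` and `ratio` are assumptions.

```python
from itertools import combinations


def _mean(values):
    return sum(values) / len(values)


def cluster_occurrences(items, sim, threshold=0.6, ratio=0.8):
    """Greedy clustering of keyword occurrences by cosine similarity.

    items: hashable ids of keyword occurrences
    sim:   function (a, b) -> cosine similarity of the two occurrences
    """
    clusters = []
    for target in items:
        placed = False
        for cluster in clusters:
            # Average similarity between the target and the category before division.
            before = _mean([sim(target, member) for member in cluster])
            if before <= threshold:
                continue  # second judgment is "no": try another category
            # Average pairwise similarity of the category after adding the target.
            candidate = cluster + [target]
            after = _mean([sim(a, b) for a, b in combinations(candidate, 2)])
            if after < ratio * before:
                continue  # third judgment is "yes": reject the division
            cluster.append(target)
            placed = True
            break
        if not placed:
            clusters.append([target])  # target starts a different category
    return clusters
```

With the defaults, an occurrence joins a category only if it is on average more than 0.6 similar to the members and adding it does not pull the category's average pairwise similarity below 80% of that value.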

The machine translation model is determined as follows:

Acquiring bilingual parallel sentence pairs, wherein each bilingual parallel sentence pair consists of an original text and the translation corresponding to the original text. For example, the original text "Fire control system developed by Lorale" corresponds to the translation "Fire control system developed by Lorale", which is a bilingual parallel sentence pair.

Inputting the bilingual parallel sentence pairs into a deep neural network to obtain the machine translation model; NiuTrans may be used for this purpose.

Example 2

The embodiment provides a machine translation optimization system in the field of national defense and military industry, and referring to fig. 3, the system specifically includes:

the extracting module 201 is configured to extract a keyword of an article to be translated to obtain a keyword list.

A first calculating module 202, configured to calculate a word vector of each keyword and its corresponding context in the keyword list. The corresponding context is the words adjacent to the keyword in the article to be translated; for example, the adjacent words may be the two words immediately before the keyword and the two words immediately after it. The word vector of each keyword may be obtained by using word2vec to compute the feature vector formed by the keyword and its corresponding context.

A second calculating module 203, configured to calculate cosine similarity of the keyword pair from the word vector; the keyword pair is two identical keywords appearing at different positions of the article to be translated in the keyword list.

A clustering module 204, configured to perform hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories; each of the word categories includes at least one of the keywords.

The translation module 205 is configured to translate all the keywords in each of the word categories by using a machine translation model to obtain multiple translations corresponding to each of the word categories.

A user translation determining module 206, configured to determine the translation with the highest occurrence probability among the multiple translations corresponding to the word category as the user translation corresponding to the word category.

A machine translation method determining module 207, configured to translate the article to be translated by using the machine translation model to obtain a machine translation method of each keyword.

A first determining module 208, configured to determine whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs, to obtain a first determination result.

A user translation module 209, configured to translate the keyword by using the user translation if the first determination result is negative.

A machine translation module 210, configured to translate the keyword by using the machine translation method if the first determination result is yes.

As an optional implementation manner, the extraction module specifically includes:

A corpus acquiring unit, configured to acquire a corpus.

A word segmentation unit, configured to segment the article to be translated to obtain a plurality of words.

A word frequency-inverse text frequency value calculating unit, configured to calculate the word frequency-inverse text frequency value of each word according to the corpus.

As an optional implementation, the word frequency-inverse text frequency value calculating unit includes:

a word frequency-inverse text frequency value calculating subunit, configured to calculate, according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, a word frequency-inverse text frequency value for each of the words, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|},$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance degree of the word $i$ in the article $j$ to be translated, TF represents the word frequency, IDF represents the inverse text frequency, $\mathrm{TF}_{i,j}$ represents the word frequency value of the word $i$ in the article $j$ to be translated, $\mathrm{IDF}_i$ represents the inverse text frequency value of the word $i$, $n_{i,j}$ represents the number of times the word $i$ appears in the article $j$ to be translated, $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated, $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated, $|D|$ represents the total number of articles in the corpus, $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus containing the word $i$, $t_i$ represents the word $i$, and $d_j$ represents the article $j$ to be translated.

As an optional implementation, the determining of the machine translation model specifically includes:

and the bilingual parallel sentence pair acquisition unit is used for acquiring a bilingual parallel sentence pair, wherein the bilingual parallel sentence pair consists of an original text and a translation corresponding to the original text.

As an optional implementation manner, the clustering module specifically includes:

A pre-division average cosine similarity calculation unit, configured to calculate the average value of the cosine similarity of all the keywords in the current word category and the target keyword under the current clustering frequency to obtain the average cosine similarity before division, wherein the current word category is the word category obtained under the last clustering frequency.

A second judging unit, configured to judge whether the average cosine similarity before division is greater than a specific threshold value to obtain a second judgment result.

A divided word category unit, configured to divide the target keyword into the current word category to obtain the divided word category if the second judgment result is yes.

A divided average cosine similarity calculation unit, configured to calculate the average value of the cosine similarity of all the keywords in the divided word category to obtain the divided average cosine similarity.

A third judging unit, configured to judge whether the divided average cosine similarity is smaller than a set multiple of the average cosine similarity before division to obtain a third judgment result.

A clustering unit, configured to, if the third judgment result is yes, delete the target keyword from the divided word category, divide the target keyword into a word category different from the current word category, update the current clustering frequency, and return to the step of calculating the average value of the cosine similarity of all the keywords in the current word category and the target keyword under the current clustering frequency to obtain the average cosine similarity before division.

The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A machine translation optimization method in the field of national defense and military industry is characterized by comprising the following steps:

extracting keywords of an article to be translated to obtain a keyword list;

calculating cosine similarity of the keyword pairs according to the word vectors; the keyword pair is two identical keywords appearing at different positions of the article to be translated in the keyword list;

2. The method for optimizing the machine translation in the national defense and military industry field according to claim 1, wherein the extracting keywords of the article to be translated to obtain the keyword list specifically comprises:

acquiring a corpus;

segmenting the article to be translated to obtain a plurality of words;

3. The method for optimizing the machine translation in the national defense and military industry field according to claim 2, wherein the calculating the word frequency-inverse text frequency value of each word according to the corpus specifically comprises:

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance degree of the word $i$ in the article $j$ to be translated; TF represents the word frequency; IDF represents the inverse text frequency; $\mathrm{TF}_{i,j}$ represents the word frequency value of the word $i$ appearing in the article $j$ to be translated; $\mathrm{IDF}_i$ represents the inverse text frequency value of the word $i$; $n_{i,j}$ represents the number of times the word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated; $n_{k,j}$ represents the number of occurrences of the $k$-th word in the article $j$ to be translated; $|D|$ represents the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus that contain the word $i$; $t_i$ represents the word $i$; and $d_j$ represents the article $j$ to be translated.

4. The method for optimizing the machine translation in the national defense and military industry field according to claim 1, wherein the determination process of the machine translation model is as follows:

5. The method for optimizing machine translation in the national defense and military industry field according to claim 1, wherein the step of performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories specifically comprises the steps of:

6. A machine translation optimization system in the field of national defense and military industry, characterized by comprising:

the user translation determining module is used for determining the translation with the highest occurrence probability in the multiple translations corresponding to the word categories as the user translation corresponding to the word categories;

the machine translation method determining module is used for translating the article to be translated by using the machine translation model to obtain a machine translation method of each keyword;

7. The national defense and military industry field machine translation optimization system of claim 6, wherein the extraction module specifically comprises:

a corpus acquisition unit for acquiring a corpus;

8. The national defense and military industry field machine translation optimization system of claim 7, wherein the word frequency-inverse text frequency value calculation unit comprises:

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance degree of the word $i$ in the article $j$ to be translated; TF represents the word frequency; IDF represents the inverse text frequency; $\mathrm{TF}_{i,j}$ represents the word frequency value of the word $i$ appearing in the article $j$ to be translated; $\mathrm{IDF}_i$ represents the inverse text frequency value of the word $i$; $n_{i,j}$ represents the number of times the word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated; $n_{k,j}$ represents the number of occurrences of the $k$-th word in the article $j$ to be translated; $|D|$ represents the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus that contain the word $i$; $t_i$ represents the word $i$; and $d_j$ represents the article $j$ to be translated.

9. The system of claim 6, wherein the determination of the machine translation model specifically comprises:

10. The national defense and military industry field machine translation optimization system of claim 6, wherein the clustering module specifically comprises:

a second judging unit, configured to judge whether the average cosine similarity before division is greater than a specific threshold, to obtain a second judgment result;