CN111738022B - Machine translation optimization method and system in national defense and military industry field - Google Patents

Publication number
CN111738022B
Authority
CN
China
Prior art keywords
word
keyword
article
translated
machine translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010578821.9A
Other languages
Chinese (zh)
Other versions
CN111738022A (en
Inventor
姚晗
晏裕生
熊晓丹
孙孟阳
董文轩
江洋
李兴亚
苏慧超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Institute Of Marine Technology & Economy
Original Assignee
China Institute Of Marine Technology & Economy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Institute Of Marine Technology & Economy filed Critical China Institute Of Marine Technology & Economy
Priority to CN202010578821.9A priority Critical patent/CN111738022B/en
Publication of CN111738022A publication Critical patent/CN111738022A/en
Application granted granted Critical
Publication of CN111738022B publication Critical patent/CN111738022B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a machine translation optimization method and system for the field of national defense and military industry. The method comprises the following steps: extracting keywords from an article to be translated to obtain a keyword list; calculating a word vector for each keyword in the keyword list together with its corresponding context; calculating the cosine similarity of keyword pairs from the word vectors; performing hierarchical clustering on all keywords in the keyword list according to the cosine similarity to obtain a plurality of word categories; and translating all the keywords in each word category with a machine translation model to obtain a user translation for that category. The aim is to ensure that keywords are translated consistently throughout the article and thereby to improve translation quality.

Description

Machine translation optimization method and system in national defense and military industry field
Technical Field
The invention relates to the field of machine translation, in particular to a machine translation optimization method and system in the field of national defense and military industry.
Background
Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). Neural machine translation (NMT), which is based on deep learning, is the machine translation method in common use today. An attention-based encoder-decoder model encodes the sentence to be translated (the source sentence) into a vector with the encoder; after learning in a deep neural network, the decoder decodes this vector to form the corresponding translation (the target sentence).
The field of national defense and military industry contains a large number of technical terms. Existing machine translation considers only the context within each sentence and ignores context at the level of the whole document, so the same term may be translated differently in different parts of one article, which seriously degrades translation quality. Post-translation correction, in turn, simply replaces every occurrence of a word throughout the text with a single rendering, which cannot handle words that have more than one meaning. As a result, the overall accuracy of machine translation output remains low.
Disclosure of Invention
The invention aims to provide a machine translation optimization method and a machine translation optimization system in the field of national defense and military industry, so that translation consistency and accuracy of keywords in the whole article are ensured, and translation quality is improved.
In order to achieve the purpose, the invention provides the following scheme:
A machine translation optimization method in the field of national defense and military industry comprises the following steps:
extracting keywords of an article to be translated to obtain a keyword list;
calculating a word vector of each keyword in the keyword list together with its corresponding context, wherein the corresponding context consists of the words adjacent to the keyword in the article to be translated;
calculating the cosine similarity of each keyword pair according to the word vectors, wherein a keyword pair consists of two occurrences of the same keyword in the keyword list at different positions of the article to be translated;
performing hierarchical clustering on the keywords in the keyword list according to the cosine similarity to obtain a plurality of word categories, each word category comprising at least one keyword;
translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translations corresponding to each word category;
determining the translation with the highest occurrence probability among the plurality of translations corresponding to a word category as the user translation corresponding to that word category;
translating the article to be translated by using the machine translation model to obtain a machine translation of each keyword;
judging whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs, to obtain a first judgment result;
if the first judgment result is negative, translating the keyword by adopting the user translation;
if the first judgment result is yes, translating the keyword by adopting the machine translation.
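The last five steps above (choosing a user translation per word category and then checking each machine translation against it) can be sketched in Python as follows. This is only an illustrative sketch: the function names, the data layout (occurrence ids mapped to category ids) and the sample strings are hypothetical, and "highest occurrence probability" is read here as "most frequent candidate".

```python
from collections import Counter

def choose_user_translation(candidates):
    """Return the most frequent translation among a category's candidates."""
    return Counter(candidates).most_common(1)[0][0]

def enforce_consistency(occurrences, category_of, user_translation):
    """Apply the first judgment to every keyword occurrence.

    occurrences: list of (occurrence_id, machine_translation) pairs.
    category_of: maps occurrence_id to its word-category id.
    user_translation: maps category id to the chosen user translation.
    """
    result = []
    for occ_id, machine in occurrences:
        preferred = user_translation[category_of[occ_id]]
        if machine == preferred:
            # First judgment result is "yes": keep the machine translation.
            result.append((occ_id, machine))
        else:
            # First judgment result is "no": substitute the user translation.
            result.append((occ_id, preferred))
    return result

# Hypothetical sample data: two word categories, three keyword occurrences.
candidates = {
    0: ["fire control system", "fire control system", "firing control system"],
    1: ["launch platform"],
}
user = {cat: choose_user_translation(c) for cat, c in candidates.items()}
occurrences = [(0, "fire control system"), (1, "firing control system"), (2, "launch platform")]
category_of = {0: 0, 1: 0, 2: 1}
final = enforce_consistency(occurrences, category_of, user)
```

Because occurrences are grouped by category rather than by surface form, two occurrences of the same word that fall into different categories can keep different translations, which is how the scheme avoids a blanket full-text replacement.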
Optionally, the extracting keywords of the article to be translated to obtain a keyword list specifically includes:
obtaining a corpus;
segmenting the article to be translated to obtain a plurality of words;
calculating a word frequency-inverse text frequency value of each word according to the corpus;
and sorting all the words by their word frequency-inverse text frequency values to obtain a keyword list.
Optionally, the calculating a word frequency-inverse text frequency value of each word according to the corpus specifically includes:
according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, calculating a word frequency-inverse text frequency value for each of said words, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance of word $i$ in the article $j$ to be translated; TF denotes word frequency and IDF denotes inverse text frequency; $\mathrm{TF}_{i,j}$ represents the word frequency value of word $i$ in the article $j$ to be translated; $\mathrm{IDF}_i$ represents the inverse text frequency value of word $i$; $n_{i,j}$ represents the number of times word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated; $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated; $|D|$ represents the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus that contain word $i$; $t_i$ denotes word $i$; and $d_j$ denotes article $j$.
Optionally, the determining process of the machine translation model is:
acquiring bilingual parallel sentence pairs, wherein the bilingual parallel sentence pairs consist of original texts and translations corresponding to the original texts;
and inputting the bilingual parallel sentence pairs into a deep neural network to obtain a machine translation model.
Optionally, the performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories specifically includes:
calculating, at the current clustering iteration, the average of the cosine similarities between the target keyword and all keywords in the current word category to obtain the average cosine similarity before division, wherein the current word category is the word category obtained at the previous clustering iteration;
judging whether the average cosine similarity before division is greater than a specific threshold to obtain a second judgment result;
if the second judgment result is yes, dividing the target keyword into the current word category to obtain a divided word category;
calculating the average of the cosine similarities of all the keywords in the divided word category to obtain the average cosine similarity after division;
judging whether the average cosine similarity after division is smaller than a set multiple of the average cosine similarity before division to obtain a third judgment result;
and if the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, then updating the current clustering iteration, and returning to the step of calculating, at the current clustering iteration, the average of the cosine similarities between the target keyword and all keywords in the current word category to obtain the average cosine similarity before division.
A machine translation optimization system in the national defense and military field comprises:
the extraction module is used for extracting keywords of the article to be translated to obtain a keyword list;
the first calculation module is used for calculating word vectors of each keyword and corresponding context in the keyword list, wherein the corresponding context is a word adjacent to the keyword in the article to be translated;
the second calculation module is used for calculating the cosine similarity of the keyword pair according to the word vector; the keyword pair is two identical keywords appearing at different positions of the article to be translated in the keyword list;
the clustering module is used for respectively carrying out hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories; each word category at least comprises one keyword;
the translation module is used for translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translations corresponding to each word category;
the user translation determining module is used for determining the translation with the highest occurrence probability in the plurality of translations corresponding to the word categories as the user translation corresponding to the word categories;
the machine translation determining module is used for translating the article to be translated by utilizing the machine translation model to obtain a machine translation of each keyword;
the first judgment module is used for judging whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs, to obtain a first judgment result;
the user translation module is used for translating the keyword by adopting the user translation if the first judgment result is negative;
and the machine translation module is used for translating the keyword by adopting the machine translation if the first judgment result is yes.
Optionally, the extracting module specifically includes:
a corpus acquisition unit for acquiring a corpus;
the word segmentation unit is used for segmenting words of the article to be translated to obtain a plurality of words;
the word frequency-inverse text frequency value calculating unit is used for calculating the word frequency-inverse text frequency value of each word according to the corpus;
and the keyword list unit is used for sorting all the words by their word frequency-inverse text frequency values to obtain a keyword list.
Optionally, the word frequency-inverse text frequency value calculating unit includes:
a word frequency-inverse text frequency value calculation subunit, used for calculating a word frequency-inverse text frequency value for each of the words according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance of word $i$ in the article $j$ to be translated; TF denotes word frequency and IDF denotes inverse text frequency; $\mathrm{TF}_{i,j}$ represents the word frequency value of word $i$ in the article $j$ to be translated; $\mathrm{IDF}_i$ represents the inverse text frequency value of word $i$; $n_{i,j}$ represents the number of times word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated; $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated; $|D|$ represents the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus that contain word $i$; $t_i$ denotes word $i$; and $d_j$ denotes article $j$.
Optionally, the determining of the machine translation model specifically includes:
the bilingual parallel sentence pair acquisition unit is used for acquiring a bilingual parallel sentence pair, and the bilingual parallel sentence pair consists of an original text and a translation corresponding to the original text;
and the machine translation model determining unit is used for inputting the bilingual parallel sentence pairs into the deep neural network to obtain a machine translation model.
Optionally, the clustering module specifically includes:
the pre-division average cosine similarity calculation unit is used for calculating, at the current clustering iteration, the average of the cosine similarities between the target keyword and all keywords in the current word category to obtain the average cosine similarity before division, the current word category being the word category obtained at the previous clustering iteration;
the second judging unit is used for judging whether the average cosine similarity before division is greater than a specific threshold to obtain a second judgment result;
the divided word category unit is used for dividing the target keyword into the current word category to obtain a divided word category if the second judgment result is yes;
the post-division average cosine similarity calculation unit is used for calculating the average of the cosine similarities of all the keywords in the divided word category to obtain the average cosine similarity after division;
the third judging unit is used for judging whether the average cosine similarity after division is smaller than a set multiple of the average cosine similarity before division, to obtain a third judgment result;
and the clustering unit is used for, if the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, updating the current clustering iteration, and returning to the step of calculating, at the current clustering iteration, the average of the cosine similarities between the target keyword and all keywords in the current word category to obtain the average cosine similarity before division.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects: before machine translation, the keywords of the article are identified and clustered to obtain a plurality of word categories, and the translation with the highest occurrence probability in each word category is taken as the standard translation. This ensures that keywords are translated consistently throughout the article and improves translation quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a detailed flowchart of a machine translation optimization method in the national defense and military industry field according to embodiment 1 of the present invention;
FIG. 2 is an overall flowchart of a machine translation optimization method in the national defense and military field in embodiment 1 of the present invention;
FIG. 3 is a schematic composition diagram of a machine translation optimization system in the field of defense and military industry in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a machine translation optimization method and system in the field of national defense and military industry. Before machine translation, the keywords of the article are identified and clustered to obtain a plurality of word categories, and the translation with the highest occurrence probability in each word category is taken as the standard translation. This ensures that keywords are translated consistently throughout the article and improves translation quality.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example 1
This embodiment provides a machine translation optimization method in the field of national defense and military industry. Referring to FIG. 1 and FIG. 2, the overall flow of the method is as follows: first, original texts translated in earlier work are collected to form a corpus; next, the subject terms of the original text are calculated; the subject terms are then clustered; a translation is then established for each category; finally, during translation, the article is translated according to the preset translations.
The specific process of the machine translation optimization method comprises the following steps:
Step 101: Extract keywords of the article to be translated to obtain a keyword list.
Step 102: Calculate a word vector of each keyword in the keyword list together with its corresponding context. The corresponding context consists of the words adjacent to the keyword in the article to be translated; the adjacent words may be the two words immediately before and the two words immediately after the keyword. The word vector of each keyword may be obtained by using word2vec to compute the feature vector formed by the keyword and its corresponding context.
Step 103: Calculate the cosine similarity of each keyword pair from the word vectors. A keyword pair consists of two occurrences of the same keyword in the keyword list at different positions of the article to be translated.
Step 104: Perform hierarchical clustering on the keywords in the keyword list according to the cosine similarity to obtain a plurality of word categories. Each word category includes at least one keyword. In the hierarchical clustering, a vector space model is used to compute the feature vectors formed by the words surrounding each entity, the vectors are compared by cosine similarity, and similar entities are clustered into one class, which addresses the problem of polysemy.
Step 105: Translate all the keywords in each word category by adopting a machine translation model to obtain a plurality of translations corresponding to each word category.
Step 106: Determine the translation with the highest occurrence probability among the plurality of translations corresponding to a word category as the user translation corresponding to that word category.
Step 107: Translate the article to be translated by using the machine translation model to obtain a machine translation of each keyword.
Step 108: Judge whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs, to obtain a first judgment result.
Step 109: If the first judgment result is negative, translate the keyword by adopting the user translation.
Step 110: If the first judgment result is yes, translate the keyword by adopting the machine translation.
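Steps 102 and 103 can be illustrated with a small Python sketch. It rests on several assumptions that are not in the text: the embedding table is a hypothetical toy stand-in for word2vec vectors, the keyword and context vectors are combined by simple averaging, and all function names are illustrative. The window size is a parameter; the demo uses one word on each side only to keep the sample sentence short, whereas the text suggests two.

```python
import math

def context_window(tokens, index, size=2):
    """Return the keyword at `index` with up to `size` neighbours on each side."""
    start = max(0, index - size)
    return tokens[start:index + size + 1]

def occurrence_vector(tokens, index, embeddings, size=2):
    """Average the embeddings of the keyword and its context words.

    Averaging is an assumption of this sketch; the text does not say how
    the keyword and context vectors are combined.
    """
    window = [w for w in context_window(tokens, index, size) if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not window:
        return [0.0] * dim
    return [sum(embeddings[w][d] for w in window) / len(window) for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical two-dimensional embeddings; a real system would load word2vec vectors.
EMBEDDINGS = {
    "battery": [1.0, 1.0],
    "missile": [1.0, 0.0],
    "launch": [1.0, 0.0],
    "power": [0.0, 1.0],
    "charge": [0.0, 1.0],
}

tokens = ["missile", "battery", "launch", "power", "battery", "charge"]
first = occurrence_vector(tokens, 1, EMBEDDINGS, size=1)   # "battery" near weapon terms
second = occurrence_vector(tokens, 4, EMBEDDINGS, size=1)  # "battery" near electrical terms
similarity = cosine_similarity(first, second)
```

The two occurrences of the same keyword receive different vectors because their neighbours differ, and it is this pairwise similarity that the clustering in Step 104 consumes.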
Step 101 specifically includes:
(1) Obtain a corpus. A large amount of foreign-language material from the field of national defense and military industry is collected to form the corpus; the corpus is a database of foreign-language material accumulated over long-term work in the field and is stored article by article. The corpus provides the basis for computing the inverse text frequency (IDF) when the TF-IDF algorithm calculates the keywords, so that words occurring frequently in all articles, such as articles and other function words, are not selected as keywords.
(2) Segment the article to be translated to obtain a plurality of words. The jieba word segmentation toolkit, which is commonly used in the art, may be used.
(3) Calculate the word frequency-inverse text frequency value of each word according to the corpus, specifically:
according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, calculating a word frequency-inverse text frequency value for each of the words, wherein

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance of word $i$ in the article $j$ to be translated; TF denotes word frequency and IDF denotes inverse text frequency; $\mathrm{TF}_{i,j}$ represents the word frequency value of word $i$ in the article $j$ to be translated; $\mathrm{IDF}_i$ represents the inverse text frequency value of word $i$; $n_{i,j}$ represents the number of times word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ represents the total number of words in the article $j$ to be translated; $n_{k,j}$ represents the number of times the $k$-th word appears in the article $j$ to be translated; $|D|$ represents the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ represents the number of articles in the corpus that contain word $i$; $t_i$ denotes word $i$; and $d_j$ denotes article $j$.
(4) Sort all the words by their word frequency-inverse text frequency values to obtain a keyword list. The keywords are the subject terms that occur frequently in the article to be translated.
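Sub-steps (1) to (4) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation. It assumes that the article and the corpus documents are already segmented into token lists (for example by jieba), and that TF and IDF take their standard forms, TF as a count divided by the article length and IDF as the logarithm of the corpus size over the document frequency. The function name `tf_idf_keywords` and the `top_n` parameter are hypothetical. Words absent from the corpus are given a document frequency of 1, an assumption made only to avoid division by zero.

```python
import math
from collections import Counter

def tf_idf_keywords(article_tokens, corpus_docs, top_n=10):
    """Rank the words of one article by TF-IDF against a reference corpus.

    article_tokens: list of words of the article to be translated.
    corpus_docs: list of token lists, one per corpus article.
    Returns up to top_n (word, score) pairs, highest score first.
    """
    counts = Counter(article_tokens)
    total_words = sum(counts.values())
    doc_sets = [set(doc) for doc in corpus_docs]
    n_docs = len(doc_sets)

    scores = {}
    for word, n in counts.items():
        tf = n / total_words
        doc_freq = sum(1 for doc in doc_sets if word in doc)
        if doc_freq == 0:
            doc_freq = 1  # assumption: treat unseen words as appearing in one article
        idf = math.log(n_docs / doc_freq)
        scores[word] = tf * idf

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Hypothetical toy corpus and article, already segmented.
corpus = [
    ["the", "radar", "system"],
    ["the", "ship"],
    ["the", "missile", "system"],
]
article = ["the", "radar", "radar", "system"]
keywords = tf_idf_keywords(article, corpus)
```

In this toy data the word that appears in every corpus article receives an IDF of zero and therefore ranks last, which is the filtering effect the corpus is meant to provide.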
Step 104 specifically includes:
(1) At the current clustering iteration, calculate the average of the cosine similarities between the target keyword and all keywords in the current word category to obtain the average cosine similarity before division, where the current word category is the word category obtained at the previous clustering iteration.
(2) Judge whether the average cosine similarity before division is greater than a specific threshold to obtain a second judgment result.
(3) If the second judgment result is yes, divide the target keyword into the current word category to obtain the divided word category.
(4) Calculate the average of the cosine similarities of all the keywords in the divided word category to obtain the average cosine similarity after division.
(5) Judge whether the average cosine similarity after division is smaller than a set multiple of the average cosine similarity before division to obtain a third judgment result.
(6) If the third judgment result is yes, delete the target keyword from the divided word category, divide the target keyword into a word category different from the current word category, then update the current clustering iteration and return to sub-step (1).
When the current clustering iteration is 1, the keyword pairs are sorted by cosine similarity from largest to smallest and the two keywords with the largest cosine similarity are merged into a new class; the average cosine similarity after merging is then calculated and compared with the cosine similarity before merging. The cosine similarity threshold is set to 0.6, and clustering stops once the average cosine similarity after merging falls below 80% of the average cosine similarity before merging.
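The following Python sketch shows one possible reading of sub-steps (1) to (6). It is a simplified, single-pass interpretation and makes assumptions that the text does not state: occurrences are processed in index order rather than in order of decreasing similarity, a rejected target is tried against the remaining categories and otherwise starts a new category, and the function names are hypothetical. The defaults 0.6 and 0.8 correspond to the threshold and the 80% multiple mentioned above.

```python
from itertools import combinations

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)

def mean_pairwise(sim, members):
    """Average similarity over all unordered pairs within a category."""
    return mean(sim[a][b] for a, b in combinations(members, 2))

def cluster_occurrences(sim, threshold=0.6, ratio=0.8):
    """Greedy clustering of keyword occurrences.

    sim: symmetric matrix (list of lists) of cosine similarities.
    Returns a list of categories, each a list of occurrence indices.
    """
    categories = []
    for target in range(len(sim)):
        placed = False
        for members in categories:
            # Sub-steps (1)-(2): average similarity before division.
            before = mean(sim[target][m] for m in members)
            if before <= threshold:
                continue
            # Sub-steps (3)-(5): tentatively add, then check cohesion.
            after = mean_pairwise(sim, members + [target])
            if after < ratio * before:
                continue  # sub-step (6): reject and try another category
            members.append(target)
            placed = True
            break
        if not placed:
            categories.append([target])
    return categories

# Hypothetical similarities: occurrences 0 and 1 share one sense, 2 and 3 another.
SIM = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.85],
    [0.1, 0.1, 0.85, 1.0],
]
categories = cluster_occurrences(SIM)
```

A full agglomerative implementation would instead start from the most similar pair and merge upward; the sketch trades that ordering for brevity while keeping the two tests, the 0.6 threshold and the 80% cohesion check.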
The machine translation model is determined as follows:
Acquire bilingual parallel sentence pairs, each consisting of an original text and the translation corresponding to that original text. For example, the original text "Fire control system developed by Lorale" corresponds to the translation "Fire control system developed by Lorale", which is a bilingual parallel sentence pair.
Input the bilingual parallel sentence pairs into a deep neural network to obtain the machine translation model; NiuTrans may be used for this purpose.
Example 2
The embodiment provides a machine translation optimization system in the field of national defense and military industry, and referring to fig. 3, the system specifically includes:
The extraction module 201 is configured to extract keywords of an article to be translated to obtain a keyword list.
The first calculation module 202 is configured to calculate a word vector of each keyword in the keyword list together with its corresponding context. The corresponding context consists of the words adjacent to the keyword in the article to be translated; the adjacent words may be the two words immediately before and the two words immediately after the keyword, and the word vector of each keyword may be obtained by using word2vec to compute the feature vector formed by the keyword and its corresponding context.
The second calculation module 203 is configured to calculate the cosine similarity of each keyword pair from the word vectors; a keyword pair consists of two occurrences of the same keyword in the keyword list at different positions of the article to be translated.
The clustering module 204 is configured to perform hierarchical clustering on the keywords in the keyword list according to the cosine similarity to obtain a plurality of word categories; each word category includes at least one keyword.
The translation module 205 is configured to translate all the keywords in each word category by using a machine translation model to obtain a plurality of translations corresponding to each word category.
The user translation determining module 206 is configured to determine the translation with the highest occurrence probability among the plurality of translations corresponding to a word category as the user translation corresponding to that word category.
The machine translation determining module 207 is configured to translate the article to be translated by using the machine translation model to obtain a machine translation of each keyword.
The first judgment module 208 is configured to judge whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs, to obtain a first judgment result.
The user translation module 209 is configured to translate the keyword by using the user translation if the first judgment result is negative.
The machine translation module 210 is configured to translate the keyword by using the machine translation if the first judgment result is yes.
As an optional implementation manner, the extraction module specifically includes:
A corpus acquisition unit, configured to acquire a corpus.
A word segmentation unit, configured to segment the article to be translated to obtain a plurality of words.
A word frequency-inverse text frequency value calculating unit, configured to calculate the word frequency-inverse text frequency value of each word according to the corpus.
A keyword list unit, configured to sort all the words by their word frequency-inverse text frequency values to obtain a keyword list.
As an optional implementation, the word frequency-inverse text frequency value calculating unit includes:
a word frequency-inverse text frequency value calculating subunit, configured to calculate the word frequency-inverse text frequency value of each of the words according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, wherein
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$
$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance of word $i$ in the article $j$ to be translated; TF denotes the word frequency and IDF the inverse text frequency; $\mathrm{TF}_{i,j}$ is the word frequency value of word $i$ in the article $j$ to be translated; $\mathrm{IDF}_i$ is the inverse text frequency value of word $i$; $n_{i,j}$ is the number of times word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ is the total number of words in the article $j$ to be translated; $n_{k,j}$ is the number of occurrences of the $k$-th word in the article $j$ to be translated; $|D|$ is the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ is the number of articles in the corpus that contain word $i$; $t_i$ denotes word $i$; and $d_j$ denotes the article $j$ to be translated.
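The keyword extraction described above can be sketched in a few lines. The sketch assumes the common definitions TF = n_{i,j} / Σ_k n_{k,j} and IDF = log(|D| / number of articles containing the word), a natural logarithm, and a corpus that includes the article to be translated; the function names and the `top_k` cutoff are hypothetical and not taken from the patent.

```python
import math
from collections import Counter


def tf_idf(article_words, corpus_articles):
    """Score every distinct word of the segmented article.

    article_words: list of words from the article to be translated.
    corpus_articles: list of word lists; assumed to contain the article itself,
    so each scored word occurs in at least one corpus article.
    """
    counts = Counter(article_words)
    total_words = sum(counts.values())
    corpus_sets = [set(doc) for doc in corpus_articles]
    scores = {}
    for word, n in counts.items():
        doc_freq = sum(1 for doc in corpus_sets if word in doc)
        if doc_freq == 0:
            continue  # assumption: words absent from the corpus are skipped, not smoothed
        tf = n / total_words
        idf = math.log(len(corpus_sets) / doc_freq)
        scores[word] = tf * idf
    return scores


def keyword_list(scores, top_k):
    """Sort words by descending TF-IDF value and keep the first top_k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:top_k]]


# Toy data for illustration only
article = ["radar", "radar", "ship", "the"]
corpus = [article, ["the", "ship"], ["the", "port"]]
scores = tf_idf(article, corpus)
keywords = keyword_list(scores, top_k=2)
```

A word that appears in every corpus article, such as "the" in this toy data, receives an IDF of zero and therefore falls to the bottom of the ranking.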
As an optional implementation, the determining of the machine translation model specifically includes:
a bilingual parallel sentence pair acquisition unit, configured to acquire bilingual parallel sentence pairs, each consisting of an original text and the translation corresponding to that original text.
A machine translation model determining unit, configured to input the bilingual parallel sentence pairs into a deep neural network for training to obtain the machine translation model.
As an optional implementation manner, the clustering module specifically includes:
a pre-division average cosine similarity calculation unit, configured to calculate, at the current clustering iteration, the average of the cosine similarities between all the keywords in the current word category and the target keyword to obtain a pre-division average cosine similarity, the current word category being the word category obtained at the previous clustering iteration.
A second judging unit, configured to judge whether the pre-division average cosine similarity is greater than a specific threshold to obtain a second judgment result.
A divided word category unit, configured to divide the target keyword into the current word category to obtain a divided word category if the second judgment result is yes.
A post-division average cosine similarity calculation unit, configured to calculate the average of the cosine similarities of all the keywords in the divided word category to obtain a post-division average cosine similarity.
A third judging unit, configured to judge whether the post-division average cosine similarity is smaller than a set multiple of the pre-division average cosine similarity to obtain a third judgment result.
A clustering unit, configured to, if the third judgment result is yes, delete the target keyword from the divided word category, divide the target keyword into a word category different from the current word category, update the current clustering iteration, and return to the step of calculating the average of the cosine similarities between all the keywords in the current word category and the target keyword at the current clustering iteration to obtain the pre-division average cosine similarity.
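The two checks performed by the clustering module can be illustrated with a greedy single-pass sketch. It is one possible reading of the procedure, not the patented algorithm: it assumes each keyword occurrence is represented by a plain numeric vector, that the average over a word category means the mean pairwise cosine similarity, and that the `threshold` and `multiple` values are chosen by the user. All names and default values are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise(vectors):
    """Mean cosine similarity over all pairs; 1.0 for a single-member category."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def cluster(vectors, threshold=0.8, multiple=0.9):
    """Assign each vector index to a word category using the two checks."""
    categories = []  # each category is a list of indices into `vectors`
    for idx, vec in enumerate(vectors):
        placed = False
        for members in categories:
            # Second judgment: average similarity of the target to the category
            before = sum(cosine(vec, vectors[m]) for m in members) / len(members)
            if before <= threshold:
                continue
            # Third judgment: reject the tentative assignment if cohesion drops too far
            after = mean_pairwise([vectors[m] for m in members] + [vec])
            if after < multiple * before:
                continue
            members.append(idx)
            placed = True
            break
        if not placed:
            categories.append([idx])  # the target goes to a different (new) category
    return categories


# Toy context vectors for four occurrences of the same keyword
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
categories = cluster(vectors)
```

With these toy vectors, the first two occurrences and the last two occurrences fall into separate word categories, corresponding to two distinct senses of the keyword.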
The embodiments in this description are presented in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A machine translation optimization method in the field of national defense and military industry is characterized by comprising the following steps:
extracting keywords of an article to be translated to obtain a keyword list;
calculating word vectors of each keyword and corresponding context in the keyword list, wherein the corresponding context is a word adjacent to the keyword in the article to be translated;
calculating cosine similarity of the keyword pairs according to the word vectors; the keyword pair is two identical keywords appearing at different positions of the article to be translated in the keyword list;
performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories; each word category at least comprises one keyword;
translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translations corresponding to each word category;
determining the translation with the highest occurrence probability among the multiple translations corresponding to each word category as the user translation corresponding to that word category;
translating the article to be translated by using the machine translation model to obtain a machine translation of each keyword;
judging whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs to obtain a first judgment result;
if the first judgment result is negative, translating the keyword by adopting the user translation;
if the first judgment result is yes, translating the keyword by adopting the machine translation.
2. The method for optimizing the machine translation in the national defense and military industry field according to claim 1, wherein the extracting keywords of the article to be translated to obtain the keyword list specifically comprises:
acquiring a corpus;
segmenting the article to be translated to obtain a plurality of words;
calculating a word frequency-inverse text frequency value of each word according to the corpus;
and sorting all the words according to the word frequency-inverse text frequency value to obtain a keyword list.
3. The method for optimizing the machine translation in the national defense and military industry field according to claim 2, wherein the calculating the word frequency-inverse text frequency value of each word according to the corpus specifically comprises:
calculating a word frequency-inverse text frequency value for each of the words according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, wherein
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$
$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance of word $i$ in the article $j$ to be translated; TF denotes the word frequency and IDF the inverse text frequency; $\mathrm{TF}_{i,j}$ is the word frequency value of word $i$ in the article $j$ to be translated; $\mathrm{IDF}_i$ is the inverse text frequency value of word $i$; $n_{i,j}$ is the number of times word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ is the total number of words in the article $j$ to be translated; $n_{k,j}$ is the number of occurrences of the $k$-th word in the article $j$ to be translated; $|D|$ is the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ is the number of articles in the corpus that contain word $i$; $t_i$ denotes word $i$; and $d_j$ denotes the article $j$ to be translated.
4. The method for optimizing the machine translation in the national defense and military industry field according to claim 1, wherein the determination process of the machine translation model is as follows:
acquiring bilingual parallel sentence pairs, wherein the bilingual parallel sentence pairs consist of original texts and translations corresponding to the original texts;
and inputting the bilingual parallel sentence pairs into a deep neural network to obtain a machine translation model.
5. The method for optimizing machine translation in the national defense and military industry field according to claim 1, wherein the step of performing hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories specifically comprises the steps of:
calculating, at the current clustering iteration, the average of the cosine similarities between all the keywords in the current word category and the target keyword to obtain a pre-division average cosine similarity, wherein the current word category is the word category obtained at the previous clustering iteration;
judging whether the pre-division average cosine similarity is greater than a specific threshold to obtain a second judgment result;
if the second judgment result is yes, dividing the target keyword into the current word category to obtain a divided word category;
calculating the average of the cosine similarities of all the keywords in the divided word category to obtain a post-division average cosine similarity;
judging whether the post-division average cosine similarity is smaller than a set multiple of the pre-division average cosine similarity to obtain a third judgment result;
and if the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, then updating the current clustering iteration, and returning to the step of calculating the average of the cosine similarities between all the keywords in the current word category and the target keyword at the current clustering iteration to obtain the pre-division average cosine similarity.
6. A machine translation optimization system in the field of national defense and military industry, characterized by comprising:
the extraction module is used for extracting keywords of the article to be translated to obtain a keyword list;
the first calculation module is used for calculating word vectors of each keyword and corresponding context in the keyword list, wherein the corresponding context is a word adjacent to the keyword in the article to be translated;
the second calculation module is used for calculating the cosine similarity of the keyword pair according to the word vector; the keyword pair is two identical keywords appearing at different positions of the article to be translated in the keyword list;
the clustering module is used for respectively carrying out hierarchical clustering on each keyword in the keyword list according to the cosine similarity to obtain a plurality of word categories; each word category at least comprises one keyword;
the translation module is used for translating all the keywords in each word category by adopting a machine translation model to obtain a plurality of translations corresponding to each word category;
the user translation determining module is used for determining the translation with the highest occurrence probability in the multiple translations corresponding to the word categories as the user translation corresponding to the word categories;
the machine translation determining module is used for translating the article to be translated by using the machine translation model to obtain a machine translation of each keyword;
the first judgment module is used for judging whether the machine translation is the same as the user translation of the word category to which the corresponding keyword belongs to obtain a first judgment result;
the user translation module is used for translating the keyword by adopting the user translation if the first judgment result is negative;
and the machine translation module is used for translating the keyword by adopting the machine translation if the first judgment result is yes.
7. The national defense and military industry field machine translation optimization system of claim 6, wherein the extraction module specifically comprises:
a corpus acquisition unit for acquiring a corpus;
the word segmentation unit is used for segmenting words of the article to be translated to obtain a plurality of words;
the word frequency-inverse text frequency value calculating unit is used for calculating the word frequency-inverse text frequency value of each word according to the corpus;
and the keyword list unit is used for sorting all the words according to the word frequency-inverse text frequency value to obtain a keyword list.
8. The national defense and military industry field machine translation optimization system of claim 7, wherein the word frequency-inverse text frequency value calculation unit comprises:
a word frequency-inverse text frequency value calculating subunit, configured to calculate the word frequency-inverse text frequency value of each of the words according to the formula $\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$, wherein
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$
$\mathrm{TF\text{-}IDF}_{i,j}$ represents the importance of word $i$ in the article $j$ to be translated; TF denotes the word frequency and IDF the inverse text frequency; $\mathrm{TF}_{i,j}$ is the word frequency value of word $i$ in the article $j$ to be translated; $\mathrm{IDF}_i$ is the inverse text frequency value of word $i$; $n_{i,j}$ is the number of times word $i$ appears in the article $j$ to be translated; $\sum_k n_{k,j}$ is the total number of words in the article $j$ to be translated; $n_{k,j}$ is the number of occurrences of the $k$-th word in the article $j$ to be translated; $|D|$ is the total number of articles in the corpus; $|\{j : t_i \in d_j\}|$ is the number of articles in the corpus that contain word $i$; $t_i$ denotes word $i$; and $d_j$ denotes the article $j$ to be translated.
9. The system of claim 6, wherein the determination of the machine translation model specifically comprises:
the bilingual parallel sentence pair acquisition unit is used for acquiring a bilingual parallel sentence pair, and the bilingual parallel sentence pair consists of an original text and a translation corresponding to the original text;
and the machine translation model determining unit is used for inputting the bilingual parallel sentence pairs into the deep neural network to obtain a machine translation model.
10. The national defense and military industry field machine translation optimization system of claim 6, wherein the clustering module specifically comprises:
the pre-division average cosine similarity calculation unit is used for calculating, at the current clustering iteration, the average of the cosine similarities between all the keywords in the current word category and the target keyword to obtain a pre-division average cosine similarity, wherein the current word category is the word category obtained at the previous clustering iteration;
the second judging unit is used for judging whether the pre-division average cosine similarity is greater than a specific threshold to obtain a second judgment result;
the divided word category unit is used for dividing the target keyword into the current word category to obtain a divided word category if the second judgment result is yes;
the post-division average cosine similarity calculation unit is used for calculating the average of the cosine similarities of all the keywords in the divided word category to obtain a post-division average cosine similarity;
the third judging unit is used for judging whether the post-division average cosine similarity is smaller than a set multiple of the pre-division average cosine similarity to obtain a third judgment result;
and the clustering unit is used for, if the third judgment result is yes, deleting the target keyword from the divided word category, dividing the target keyword into a word category different from the current word category, updating the current clustering iteration, and returning to the step of calculating the average of the cosine similarities between all the keywords in the current word category and the target keyword at the current clustering iteration to obtain the pre-division average cosine similarity.
CN202010578821.9A 2020-06-23 2020-06-23 Machine translation optimization method and system in national defense and military industry field Active CN111738022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010578821.9A CN111738022B (en) 2020-06-23 2020-06-23 Machine translation optimization method and system in national defense and military industry field

Publications (2)

Publication Number Publication Date
CN111738022A CN111738022A (en) 2020-10-02
CN111738022B true CN111738022B (en) 2023-04-18

Family

ID=72650552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010578821.9A Active CN111738022B (en) 2020-06-23 2020-06-23 Machine translation optimization method and system in national defense and military industry field

Country Status (1)

Country Link
CN (1) CN111738022B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN109299480A (en) * 2018-09-04 2019-02-01 上海传神翻译服务有限公司 Terminology Translation method and device based on context of co-text
CN109858045A (en) * 2019-02-01 2019-06-07 北京字节跳动网络技术有限公司 Machine translation method and device
CN110991196A (en) * 2019-12-18 2020-04-10 北京百度网讯科技有限公司 Translation method and device for polysemous words, electronic equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639698B1 (en) * 2012-07-16 2014-01-28 Google Inc. Multi-language document clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yumi WAKITA et al. Fine keyword clustering using a thesaurus and example sentences for speech translation. 6th International Conference on Spoken Language Processing. 2000, 1-4. *
Sun Changlong. Research on Web-based translation of out-of-vocabulary words. China Master's Theses Full-text Database, Information Science and Technology. 2012, (No. 06), I138-2258. *

Also Published As

Publication number Publication date
CN111738022A (en) 2020-10-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant