CN114792101A - Method for generating and translating input information of machine translation and obtaining machine model - Google Patents


Info

Publication number
CN114792101A
CN114792101A (application CN202210723325.7A; granted publication CN114792101B)
Authority
CN
China
Prior art keywords
source sentence
translation
retrieval
input information
sentence
Prior art date
Legal status
Granted
Application number
CN202210723325.7A
Other languages
Chinese (zh)
Other versions
CN114792101B (en)
Inventor
刘明童 (Liu Mingtong)
付宇 (Fu Yu)
周明 (Zhou Ming)
Current Assignee
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority claimed from CN202210723325.7A
Publication of CN114792101A
Application granted
Publication of CN114792101B
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis

Abstract

The invention relates to the technical field of machine translation, and in particular to a method for generating machine-translation input information, a translation method, a method for obtaining a machine translation model, and a readable storage medium. The method for generating machine-translation input information comprises the following steps: obtaining a machine translation source sentence, and performing a preliminary retrieval from a preset memory base using at least two different retrieval modes to obtain initial retrieval source sentences; filtering the initial retrieval source sentences according to predetermined conditions to obtain filtered retrieval source sentences; dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentence, obtaining a covered translation source sentence; and using the covered translation source sentence as the input information. These steps expand the input information as a whole; filtering the retrieval source sentences with predetermined conditions and fusing them dynamically expands the input information while reducing fusion noise, which facilitates subsequent translation and addresses the low translation quality of existing machine translation.

Description

Method for generating and translating input information of machine translation and obtaining machine model
Technical Field
The present invention relates to the field of machine translation technologies, and in particular to a method for generating input information for machine translation, a machine translation method, a method for obtaining a machine translation model, and a computer-readable storage medium.
Background
Machine translation is an important component of natural language processing and has attracted wide attention in recent years. With the continuous development of deep neural networks, neural machine translation models trained end-to-end have gradually surpassed statistics-based machine translation models, and many practical systems have been derived from them, such as Google Translate, Baidu Translate, and NiuTrans. A translation memory is a means of assisting translation: it integrates corpora collected in advance with previously translated corpora and consults them during the current translation, preventing mistranslation and inconsistent re-translation. In particular, it can enforce consistent translation of certain proper nouns, improving the quality and readability of the final translation.
In the prior art, retrieved sentences are usually fused by relatively simple methods, for example: directly concatenating the retrieved target-side sentences after the sentence to be translated and removing some redundant information using alignment information. Such methods can supplement information to a certain extent, but they often introduce noise that degrades the whole translation, and they do not make full use of the retrieved results, so they cannot adequately cover the information of the source-side sentence to be translated. In addition, simply concatenating a retrieved target-side sentence to the source-side sentence to be translated allows no interaction between the two different languages at the input end when the whole sequence is encoded, so the final translation cannot be effectively associated with the additionally concatenated target-side sentence. Existing machine translation therefore suffers from low translation quality.
Disclosure of Invention
In order to solve the problem of low quality of existing machine translation, the invention provides a method for generating machine-translation input information, a machine translation method, a method for obtaining a machine translation model, and a computer-readable storage medium.
The invention provides a method for generating input information of machine translation, which comprises the following steps:
obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
filtering the initial retrieval source sentence according to a preset condition to obtain a filtered retrieval source sentence;
dynamically fusing the filtered retrieval source sentence to perform word coverage on the translation source sentence to obtain a covered translation source sentence;
and the translation source sentence after the covering is used as input information.
Preferably, the obtaining of the machine translation source sentence and the preliminary retrieval from the preset memory base based on at least two different retrieval modes to obtain the initial retrieval source sentence specifically includes the following steps:
obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain at least two corpus pairs;
and mixing the at least two corpus pairs to obtain an initial retrieval source sentence.
Preferably, the predetermined conditions include alignment and part-of-speech tagging.
Preferably, the step of filtering the initial search source sentence according to a predetermined condition to obtain a filtered search source sentence specifically includes the following steps:
performing preliminary filtering on the initial retrieval source sentence through alignment;
and performing part-of-speech tagging on the translation source sentence and the preliminarily filtered retrieval source sentence, and retaining the part of the retrieval source sentence whose part of speech is the same as in the translation source sentence.
Preferably, dynamically fusing the filtered retrieval source sentences includes setting an information threshold for covering the translation source sentence and setting a threshold on the information added per corpus pair.
To solve the above technical problem, the invention further provides a machine translation method: the input information generated by any of the above methods for generating machine-translation input information is obtained and input into a preset machine translation model to obtain the corresponding translation result.
To solve the above technical problem, the invention also provides a method for obtaining a machine translation model: the input information generated by any of the above methods for generating machine-translation input information is obtained, and a mask task is executed on the obtained input information to obtain extended input information;
and the extended input information is input into a preset machine translation model to execute a translation-enhancement training task, obtaining an enhanced machine translation model.
Preferably, the input information of the preset machine translation model further includes a translation source sentence.
Preferably, the step of inputting the extended input information into a preset machine translation model to execute a translation enhancement training task to obtain an enhanced machine translation model specifically includes the following steps:
and concatenating the extended input information to the translation source sentence in a preset manner, and inputting both together into a preset machine translation model to execute a translation-enhancement training task, obtaining an enhanced machine translation model.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed, implements any of the above methods for generating machine-translation input information.
Compared with the prior art, the input information generation method for machine translation, the machine translation method, the acquisition method for the machine translation model and the computer readable storage medium have the following advantages:
1. In the method for generating machine-translation input information of the invention, a machine translation source sentence is first obtained; a preliminary retrieval is performed from a preset memory base using at least two different retrieval modes to obtain initial retrieval source sentences; the retrieval source sentences are filtered according to predetermined conditions to obtain filtered retrieval source sentences; the filtered retrieval source sentences are dynamically fused to perform word coverage on the translation source sentence, yielding a covered translation source sentence; and finally the covered translation source sentence is used as the input information. These steps expand the input information as a whole, and by filtering the retrieval source sentences with predetermined conditions and fusing them dynamically, the input information is expanded while fusion noise is reduced, which facilitates subsequent translation and addresses the low translation quality of existing machine translation.
2. A machine translation source sentence is first obtained, and a preliminary retrieval is performed from a preset memory base using at least two different retrieval modes to obtain at least two corpus pairs; the corpus pairs are then mixed to obtain the initial retrieval source sentence. These steps yield an initial retrieval source sentence based on at least two different corpus pairs and are highly practical.
3. The predetermined conditions comprise alignment and part-of-speech tagging. Setting these conditions allows the retrieval source sentences to be preliminarily screened to obtain the required retrieval source sentences, which speeds up the whole process and is highly practical and feasible.
4. In the steps of the invention, the initial retrieval source sentences are preliminarily filtered by alignment; the translation source sentence and the preliminarily filtered retrieval source sentences are then part-of-speech tagged, and the parts of the retrieval source sentences whose part of speech matches the translation source sentence are retained. The alignment step places the words of the retrieval source sentence and the retrieval target sentence in correspondence to obtain word-correspondence information, and words without a correspondence are screened out; this further filters the large set of original retrieval source sentences down to those meeting the conditions, which is highly practical. Next, the words of the preliminarily filtered retrieval source sentences and of the translation source sentence are tagged simultaneously; words in the retrieval source sentences whose part of speech disagrees with the translation source sentence are filtered out, and words whose part of speech agrees are retained. This second filtering pass removes the part-of-speech mismatches from the preliminarily screened retrieval source sentences and yields retrieval source sentences meeting the conditions, which is highly practical and feasible.
5. Dynamically fusing the filtered retrieval source sentences includes setting an information threshold for covering the translation source sentence and setting a threshold on the information added per corpus pair.
6. The invention also provides a machine translation method, a machine translation model acquisition method and a computer readable storage medium, which have the same beneficial effects as the machine translation input information generation method and are not repeated herein.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the embodiments or the prior art description will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings may be obtained according to these drawings without inventive labor.
Fig. 1 is a flowchart illustrating steps of a method for generating input information for machine translation according to a first embodiment of the present invention.
Fig. 2 is a flowchart of the step S1 of the method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 3 is a flowchart of the step S2 of the method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 4 is a first flowchart of a method for generating input information for machine translation according to a first embodiment of the present invention.
Fig. 5 is a flowchart illustrating step S21 of the method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 6 is a flowchart of a second example of a method for generating input information for machine translation according to the first embodiment of the present invention.
FIG. 7 is a flowchart illustrating steps of a machine translation method according to a second embodiment of the present invention.
FIG. 8 is a flowchart illustrating steps of a method for obtaining a machine translation model according to a third embodiment of the present invention.
FIG. 9 is a flowchart illustrating step S302 of the method for obtaining a machine translation model according to the third embodiment of the present invention.
Fig. 10 is a third flowchart of a method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 11 is a block diagram of a machine translation input information generation system according to a fourth embodiment of the present invention.
Reference numerals in the drawings:
4. machine translation input information generation system;
10. acquisition module; 20. filtering module; 30. processing module; 40. generation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and implementation examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The terms "vertical," "horizontal," "left," "right," "up," "down," "left-up," "right-up," "left-down," "right-down," and the like as used herein are for purposes of description only.
Referring to fig. 1, a first embodiment of the present invention provides a method for generating input information for machine translation, including the following steps:
s1, acquiring a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
s2, filtering the initial search source sentence according to preset conditions to obtain a filtered search source sentence;
s3, dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentences to obtain covered translation source sentences;
and S4, taking the translation source sentence after covering as input information.
It can be understood that, in the method for generating machine-translation input information of the invention, a machine translation source sentence is first obtained; a preliminary retrieval is performed from the preset memory base using at least two different retrieval modes to obtain initial retrieval source sentences; the retrieval source sentences are filtered according to predetermined conditions to obtain filtered retrieval source sentences; the filtered retrieval source sentences are dynamically fused to perform word coverage on the translation source sentence, yielding a covered translation source sentence; and finally the covered translation source sentence is used as the input information. These steps expand the input information as a whole, and by filtering the retrieval source sentences with predetermined conditions and fusing them dynamically, the input information is expanded while fusion noise is reduced, which facilitates subsequent translation and addresses the low translation quality of existing machine translation.
As an alternative implementation, the two different retrieval modes include keyword retrieval and vector retrieval. Keyword retrieval obtains sentences from the preset memory base that match the translation source sentence at the vocabulary level, while vector retrieval obtains sentences that match it at the semantic level. Together, the two modes obtain sentence information matching the translation source sentence along different dimensions, providing abundant matching information for the subsequent steps, and are highly practical.
It should be noted that, referring to fig. 4, the keyword retrieval mode uses ElasticSearch (ES) as its implementation tool; ElasticSearch is an open-source distributed search engine that stores and retrieves both structured and unstructured data. At retrieval time, the sentence is used directly as the query, and retrieval from the memory base is performed with the BM25 algorithm: for each document d in the memory base, a similarity score with the query sentence Q is obtained according to BM25,

score(Q, d) = Σᵢ wᵢ · R(qᵢ, d),

where qᵢ represents a word in the query sentence, wᵢ represents the weight of that word (typically its inverse document frequency), and R(qᵢ, d) measures its relevance to document d; summing over all words of the query sentence gives the relevance of the document to the current query sentence. The result of the keyword-based retrieval obtained with BM25 forms part of the preliminary retrieval result.
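The BM25 scoring just described can be sketched in pure Python without an ElasticSearch server; whitespace tokenization, the toy memory base, and the parameter values k1 = 1.5, b = 0.75 are illustrative assumptions (conventional BM25 defaults), not the patent's exact configuration:

```python
import math

def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score every document in the memory base against the query with BM25."""
    N = len(documents)
    avgdl = sum(len(d) for d in documents) / N
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in documents if t in d) for t in set(query_tokens)}
    scores = []
    for doc in documents:
        s = 0.0
        for t in set(query_tokens):
            f = doc.count(t)  # term frequency in this document
            if f == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)  # term weight w_i
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# Toy memory base: token lists standing in for indexed corpus sentences.
memory = [
    "the leader may limit the time".split(),
    "totally unrelated sentence here".split(),
]
scores = bm25_scores("limit the time".split(), memory)
best = scores.index(max(scores))  # index of the most relevant memory sentence
```

Here `best` is 0: only the first memory sentence shares terms with the query, so it receives the only non-zero score.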
Furthermore, the vector retrieval mode uses the open-source faiss library as its implementation tool. Before faiss is used, the sentence-transformers library is used to obtain a vector representation of each sentence, and the sentences in the memory base are then represented and indexed by these vectors in faiss. When a query is needed, the query sentence is encoded as a vector

v_q = Encoder(query sentence).

With faiss, the Euclidean distance between the vector representation of each sentence in the memory base and the vector representation of the query sentence can be computed, and this distance represents the degree of correlation between the two sentences. Assuming the vector representation of a memory-base sentence is v_m, the distance is

d(v_q, v_m) = ‖v_q − v_m‖₂ = √(Σⱼ (v_{q,j} − v_{m,j})²),

and the smaller this distance, the higher the similarity score of the two sentences.
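The distance computation that faiss performs here can be illustrated in pure Python; the 3-dimensional toy vectors below stand in for sentence-transformers embeddings and are purely illustrative:

```python
import math

def euclidean(u, v):
    """Euclidean distance d(u, v) = sqrt(sum_j (u_j - v_j)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(query_vec, memory_vecs):
    """Return (index of the closest memory sentence, all distances)."""
    dists = [euclidean(query_vec, m) for m in memory_vecs]
    return dists.index(min(dists)), dists

# Toy embeddings for two memory-base sentences.
memory_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.95]]
query_vec = [1.0, 0.0, 0.1]
idx, dists = nearest(query_vec, memory_vecs)  # smaller distance = more similar
```

In practice faiss performs this same nearest-neighbor search over millions of vectors with an index such as a flat L2 index.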
referring to fig. 2, step S1 specifically includes the following steps:
s11, obtaining a machine translation source sentence, and carrying out preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain at least two corpus pairs;
and S12, mixing the at least two corpus pairs to obtain an initial retrieval source sentence.
It can be understood that, in these steps of the invention, the machine translation source sentence is first obtained, and a preliminary retrieval is performed from the preset memory base using at least two different retrieval modes to obtain at least two corpus pairs; mixing them yields an initial retrieval source sentence based on at least two different corpus pairs, which is highly practical.
It should be noted that, after the at least two different corpus pairs are fused, retrieval source sentences closely associated with the translation source sentence along different dimensions are obtained.
Specifically, in one embodiment of the present invention, when retrieving for the translation source sentence from the preset memory base, vector retrieval finds sentences matching the translation source sentence semantically, while keyword retrieval finds sentences matching it at the vocabulary level. For example, for the translation source sentence "In the past five years, IAEA has assisted in improving the safety in research and technology", a sentence obtained by vector-based retrieval, such as "It all supported activities of IAEA and its safety records", discusses IAEA just as the source sentence does, but differs considerably in sentence pattern; a sentence obtained by keyword retrieval, by contrast, shares more surface words with the source sentence but discusses different content. That is, in this embodiment the results obtained by keyword-based retrieval and by vector-based retrieval are mixed to obtain the initial retrieval source sentences.
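The mixing of the keyword-based and vector-based results into the initial retrieval source sentences can be sketched as an order-preserving merge with de-duplication; this particular merge policy is an assumption for illustration, not the patent's prescribed procedure:

```python
def mix_candidates(keyword_hits, vector_hits):
    """Merge two ranked candidate lists, keeping first occurrences only."""
    seen, merged = set(), []
    for sent in keyword_hits + vector_hits:
        if sent not in seen:
            seen.add(sent)
            merged.append(sent)
    return merged

mixed = mix_candidates(
    ["It all supported activities of IAEA", "another keyword hit"],
    ["a semantically similar sentence", "It all supported activities of IAEA"],
)
```

Duplicates retrieved by both modes appear once, so each initial retrieval source sentence enters the filtering stage a single time.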
As an optional implementation manner, the predetermined conditions include an alignment manner and part-of-speech tagging, and the setting of the predetermined conditions enables the retrieval source sentence to be preliminarily screened to obtain a required retrieval source sentence, so that the speed of the whole process can be increased, and the method has strong practicability and feasibility.
Referring to fig. 3 and 6, step S2 specifically includes the following steps:
s21, performing preliminary filtering on the initial retrieval source sentence through alignment;
and S22, performing part-of-speech tagging on the translation source sentence and the search source sentence after the preliminary filtering, and reserving the part of the search source sentence with the same part of speech as the translation source sentence.
Understandably, in this step the initial retrieval source sentences are preliminarily filtered by alignment; the translation source sentence and the preliminarily filtered retrieval source sentences are then part-of-speech tagged, and the parts of the retrieval source sentences whose part of speech matches the translation source sentence are retained. Alignment places the words of the retrieval source sentence and the retrieval target sentence in correspondence to obtain word-correspondence information, and words without a correspondence are screened out; this further filters the large set of original retrieval source sentences down to those meeting the conditions, which is highly practical. Next, the words of the preliminarily filtered retrieval source sentences and of the translation source sentence are tagged simultaneously; words in the retrieval source sentences whose part of speech disagrees with the translation source sentence are filtered out, and words whose part of speech agrees are retained.
Referring to fig. 5 and 6, step S21 specifically includes the following steps:
s211, acquiring a retrieval target sentence based on a preset memory base;
s212, performing word coverage calculation on the translation source sentence and the initial retrieval source sentence, and reserving the part of the initial retrieval source sentence overlapped with the translation source sentence;
s213: and aligning the overlapped part in the initial search source sentence with the search target sentence.
It should be noted that the retrieval target sentence is the sentence stored in the preset memory base as the translation of the initial retrieval source sentence into the corresponding target language.
It can be understood that, through these steps, the part of the initial retrieval source sentence overlapping the translation source sentence is obtained and aligned, thereby preliminarily filtering the initial retrieval source sentence, which is highly practical.
Referring to FIG. 6, in one embodiment of the present invention, if the translation is from English into Chinese, then the translation source sentence is English, the retrieval source sentence is also English, and the retrieval target sentence is Chinese. During alignment of the retrieval source sentence, word-coverage calculation is performed between the translation source sentence and the initial retrieval source sentence, and the words of the initial retrieval source sentence covered by the translation source sentence are retained. The words of the overlapping initial retrieval source sentence (English) are then aligned with the retrieval target sentence (Chinese); this alignment matches the Chinese words of the retrieval target sentence to the covered English words of the initial retrieval source sentence, i.e. it places the words in correspondence. After the initial retrieval source sentence is aligned, part-of-speech tagging is performed on the covered words of the translation source sentence (English) and of the initial retrieval source sentence (English) respectively, and words whose part of speech differs between the two covered parts are filtered out.
It should be noted that, on the basis of each initial search source sentence, there is a search target sentence in the target language corresponding to the initial search source sentence in the preset memory base.
It should be noted that the tools used for alignment include fast-align and/or awesome-align; the retrieval source sentence and the retrieval target sentence are aligned with these tools to obtain word-correspondence information. The part of the retrieval target sentence to be retained, i.e. the target-sentence words corresponding to the covered words of the retrieval source sentence, is then found from the alignment information.
For example, let the translation source sentence be: "The leader may limit the time to be allowed for such extensions", the initial retrieval source sentence be: "The acting leader, may I remind representatives that the time limit for speaking at the session is five minutes", and the retrieval target sentence be its Chinese translation, glossed as: "Acting chairman (speaking in English): please allow me to remind representatives that the time limit for speaking at this session is 5 minutes." The alignment process is as follows: first, word-coverage calculation is performed between the translation source sentence and the initial retrieval source sentence, and the parts of the initial retrieval source sentence overlapping the translation source sentence are obtained as "the leader" and "the time limit", completing the word correspondence. After alignment, it can be seen that the word-covered part of the translation source sentence is "the leader" and "limit the time", and the covered part of the initial retrieval source sentence is "leader" and "the time limit".
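The word-coverage calculation of step S212 — keeping only the retrieval-source words that also occur in the translation source sentence — can be sketched as follows; lower-casing and whitespace tokenization are simplifying assumptions:

```python
def covered_words(translation_src, retrieval_src):
    """Return the retrieval-source words that overlap the translation source."""
    src_vocab = {w.lower() for w in translation_src.split()}
    return [w for w in retrieval_src.split() if w.lower() in src_vocab]

src = "The leader may limit the time to be allowed for such extensions"
ret = "The acting leader may I remind representatives that the time limit is five minutes"
kept = covered_words(src, ret)  # overlapping words such as 'leader', 'time', 'limit'
```

The non-overlapping words ("remind", "minutes", ...) are discarded at this stage, before alignment and part-of-speech filtering.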
After alignment, the preliminary filtering is complete; part-of-speech tagging is then performed for further filtering. Continuing with the above example, the result of the preliminary screening after alignment is: the word-covered part of the translation source sentence is the leader and limit the time, and the word-covered part of the initial retrieval source sentence is the leader and the time limit. Part-of-speech tagging labels both covered parts with tags for verbs, nouns, prepositions, and so on. Tagging shows that the parts of speech of the covered words in the translation source sentence are [leader: NNP, limit: VB, the: DT, time: NN], while in the initial retrieval source sentence limit is tagged as a noun (part of the time limit). Since the parts of speech of limit differ between the two source sentences, that word is removed, leaving only leader and the time.
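The part-of-speech filtering step can be sketched as follows. A real system would obtain the tags from a tagger; here they are supplied by hand using the Penn Treebank tags from the example, and the function name is an assumption.

```python
def filter_by_pos(src_tags, ret_tags):
    """Keep retrieval-side covered words whose part-of-speech tag matches the
    tag of the same word in the translation source sentence."""
    return [w for w, t in ret_tags.items() if src_tags.get(w) == t]

# Tags as in the example: 'limit' is a verb (VB) in the translation source
# sentence but a noun (NN) inside 'the time limit', so it is filtered out.
src_tags = {"leader": "NNP", "limit": "VB", "the": "DT", "time": "NN"}
ret_tags = {"leader": "NNP", "the": "DT", "time": "NN", "limit": "NN"}
print(filter_by_pos(src_tags, ret_tags))  # ['leader', 'the', 'time']
```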
After part-of-speech tagging, the filtering of the initial retrieval source sentence is complete; the splicing process is again described with the above example. At this point the word-covered part of the retrieval target sentence corresponds to the words for leader, time, and limit. Words whose part of speech differs between the covered part of the retrieval source sentence and the covered part of the translation source sentence are filtered out, so the covered part of the retrieval source sentence retains only leader and the time; the word limit is removed because its part of speech differs between the translation source sentence and the retrieval source sentence, and the word meaning limit is simultaneously deleted from the covered part of the retrieval target sentence, which thus retains the words for leader and time. During splicing, these two words of the retrieval target sentence are spliced after the translation source sentence, and the result is: The leader may limit the time to be allowed for such extensions, <sep> <noise> leader <noise> time <noise>, where <sep> serves as a separator and the <noise> characters represent filtered information. The above process completes one dynamic fusion.
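The splicing described above can be sketched as a small string-building helper. The `<sep>` and `<noise>` markers follow the example; the function name is an assumption.

```python
def splice(translation_src, retained_words, sep="<sep>", noise="<noise>"):
    """Append the retained retrieval-target words after the translation source
    sentence; <sep> separates the two parts and <noise> marks filtered spans."""
    tail = f"{noise} " + f" {noise} ".join(retained_words) + f" {noise}"
    return f"{translation_src} {sep} {tail}"

print(splice("The leader may limit the time to be allowed for such extensions,",
             ["leader", "time"]))
# The leader may limit the time to be allowed for such extensions, <sep> <noise> leader <noise> time <noise>
```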
It can be understood that the initial retrieval source sentence is fused multiple times. Each dynamic fusion yields a retrieval source sentence satisfying the word-coverage calculation, alignment, part-of-speech tagging, and splicing conditions, together with the qualifying retrieval target sentence: retrieval target sentence 1 after the first dynamic fusion, retrieval target sentence 2 after the second, and retrieval target sentence i after the i-th. Dynamic fusion stops once the conditions are no longer met, and the sentence after dynamic-fusion splicing is: translation source sentence + retrieval target sentence 1 + retrieval target sentence 2 + … + retrieval target sentence i; this expression is the input information.
It will be appreciated that each dynamic fusion is a separate process of word coverage computation, alignment, part-of-speech tagging, and sentence concatenation.
In the embodiment of the invention, based on one translation source sentence, retrieval from the preset memory base is performed in both a keyword-retrieval manner and a vector-retrieval manner, yielding K parallel first corpus pairs from keyword retrieval and K parallel second corpus pairs from vector retrieval; the first and second corpus pairs are mixed, and the mixture of 2K parallel corpus pairs constitutes the initial retrieval source sentence.
Optionally, K ranges from 5 to 20. Preferably, in the embodiment of the present invention, K has a value of 10.
That is, in the embodiment of the present invention, 10 first corpus pairs and 10 second corpus pairs are obtained respectively from a translation source sentence through keyword retrieval and vector retrieval, and the 10 first corpus pairs and the 10 second corpus pairs are randomly mixed to obtain 20 mixed parallel corpus pairs, and the 20 mixed corpus pairs are collectively referred to as an initial retrieval source sentence.
When the retrieval source sentence undergoes the first dynamic fusion, the 20 parallel corpus pairs it contains are scored by an edit-distance-based fuzzy matching algorithm: the edit distance between each corpus pair and the translation source sentence is computed, a smaller edit distance indicating a closer match, and the top-ranked corpus pair in the retrieval source sentence is selected for the first fusion.
In the second fusion, the words of the translation source sentence already covered by the previously selected retrieval source sentence are removed; the remaining words are used to compute edit distances against the corpus pairs of the remaining 19 initial retrieval source sentences, which are re-ranked, and the closest corpus pair in the current retrieval source sentence is again selected for fusion. This is repeated until the two set information-threshold conditions are no longer met, at which point the process exits.
As an optional implementation manner, the manner of dynamically fusing the filtered retrieval source sentence includes setting an information threshold covering the translation source sentence and setting an added information threshold of the corpus pair, that is, the two information thresholds are the information threshold covering the translation source sentence and the added information threshold of the corpus pair, respectively.
Understandably, by setting an information threshold covering the translation source sentence and an added-information threshold for each corpus pair, the splicing of corpus-pair information is performed conditionally, which reduces noise from repeatedly fusing corpus pairs.
It should be noted that the information threshold covering the translation source sentence works as follows: in the embodiment of the present invention, when coverage reaches 70% (that is, the ratio of covered words to the translation source sentence reaches 70%), fusion of subsequent unselected retrieval corpora stops automatically; by setting this threshold, the preset model can dynamically select the number of retrieval example sentences to fuse. Furthermore, the added-information threshold of each corpus pair is set so that, via the fuzzy matching algorithm, the proportion of added words must reach 10% of the words of the translation source sentence that remain uncovered; when this proportion reaches 10%, the information of the current corpus pair can be spliced and subsequently fused.
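A hedged sketch of the two stopping conditions (70% coverage of the translation source sentence, and at least 10% of the still-uncovered words added by a candidate) might look like this; the function name and the token-set representation are assumptions.

```python
def should_continue(covered, src_tokens, candidate_added,
                    coverage_limit=0.70, add_min=0.10):
    """Stopping rule sketched from the description: stop once covered word
    types reach 70% of the translation source sentence; accept a candidate
    corpus pair only if it adds at least 10% of the still-uncovered words."""
    if len(covered) / len(src_tokens) >= coverage_limit:
        return False                      # coverage threshold reached: stop fusing
    remaining = [w for w in src_tokens if w not in covered]
    new = {w for w in candidate_added if w in remaining}
    return len(new) >= add_min * len(remaining)  # candidate adds enough information

src = "the leader may limit the time to be allowed for such extensions".split()
print(should_continue({"the", "leader"}, src, {"time", "limit"}))  # candidate accepted
```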
In the embodiment of the invention, the retrieval source sentence is dynamically fused multiple times until the conditions are no longer met, at which point the process exits. The specific principle of the dynamic fusion process is as follows:
Let S = [w1, w2, …, wn] denote all words in the source sentence to be translated, and let Rj denote the j-th corpus pair obtained in the initial retrieval source sentence. During the i-th fusion, the words of the translation source sentence that were already covered in earlier rounds are removed from S. The unselected (2K − i) corpus pairs are then re-ranked (K represents the number of first corpus pairs or second corpus pairs retrieved).
It can be understood that, in the embodiment of the present invention, when the translation source sentence is retrieved to obtain the first corpus pair or the second corpus pair, the number of the first corpus pair is equal to that of the second corpus pair.
It should be noted that the ranking is based on a fuzzy matching algorithm using edit distance, which involves three basic operations: insertion, deletion, and substitution. During retrieval, K first corpus pairs and K second corpus pairs are obtained by keyword retrieval and vector retrieval respectively and randomly mixed into 2K corpus pairs in total; that is, the initial retrieval source sentence comprises 2K corpus pairs. In the first step of dynamic fusion, the edit distance between each of the 2K parallel corpora and the translation source sentence is computed; the smaller the edit distance, the closer the match, and the currently closest retrieval source sentence is selected for the operation.
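The edit-distance-based fuzzy matching can be sketched with a standard word-level Levenshtein implementation covering the three basic operations; ranking then simply sorts candidates by distance. The example sentences are illustrative.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance with the three basic operations
    (insertion, deletion, substitution) used by the fuzzy-matching step."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# Rank candidate retrieval sentences by closeness to the translation source.
src = "no one has ever been charged".split()
candidates = ["nobody has ever been charged".split(),
              "the time limit is five minutes".split()]
print(sorted(candidates, key=lambda c: edit_distance(src, c))[0])
```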
In an embodiment of the present invention, when the translation source sentence is S = [No one has ever before charged with the masters], if the closest retrieval source sentence obtained by the edit-distance calculation is [Nobody has ever], the covered words are [No has ever], and at the next ranking the remaining words of the translation source sentence are S1 = [one before charged with the masters]; that is, the remaining words S1 and the remaining parallel initial retrieval source sentences undergo edit-distance calculation, and the above operations are repeated until the conditions are no longer satisfied.
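Removing the already-covered words before the next ranking round reduces to a simple filter. This sketch uses exact word overlap and illustrative tokens; the function name is an assumption.

```python
def remove_covered(src_tokens, retrieved_tokens):
    """Remove from the translation source sentence the words covered by the
    closest retrieval sentence, yielding the remainder (S1) used in the next
    edit-distance ranking round."""
    retrieved = set(retrieved_tokens)
    return [w for w in src_tokens if w not in retrieved]

src = "no one has ever before charged".split()
print(remove_covered(src, "no has ever".split()))  # ['one', 'before', 'charged']
```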
To sum up, in the first embodiment of the present invention, a machine-translation source sentence is obtained and retrieved from a preset memory base by a keyword-retrieval manner and a vector-retrieval manner: multiple (e.g., 10) first corpus pairs are obtained by keyword retrieval, multiple (e.g., 10) second corpus pairs are obtained by vector retrieval, and the two sets are randomly mixed to obtain the initial retrieval source sentence of multiple (e.g., 20) parallel mixed corpus pairs. The initial retrieval source sentence is filtered through alignment and part-of-speech tagging, and the filtered initial retrieval source sentence is dynamically fused to obtain the input information; the conditions for dynamically fusing the initial retrieval source sentence are the set information threshold covering the translation source sentence and the set added-information threshold of the corpus pair, and the input information satisfying these conditions is obtained after dynamic fusion.
Referring to fig. 7, a second embodiment of the present invention provides a machine translation method, which includes S200 obtaining input information generated by the input information generation method for machine translation provided by the first embodiment of the present invention, and S201 inputting the input information into a preset machine translation model to obtain a corresponding translation result.
It can be understood that the original input information is expanded by the above method and input into the preset machine translation model to obtain the corresponding translation result, making the translation result more accurate.
Referring to fig. 8, a third embodiment of the invention provides a method for obtaining an enhanced machine translation model: S300, acquiring input information generated by the method for generating input information for machine translation provided by the first embodiment of the invention; S301, executing a mask task on the acquired input information to acquire extended input information;
and S302, inputting the extended input information into a preset machine translation model to execute a translation enhancement training task so as to obtain an enhanced machine translation model.
As an optional implementation manner, the input information input into the preset machine translation model further includes a translation source sentence.
It can be understood that, in the embodiment of the present invention, the translation source sentence and the extended input information are input into the machine translation model together to execute the joint translation enhancement training task to obtain the enhanced machine translation model.
Further, referring to fig. 9, step S302 specifically includes the following steps:
and S3021, splicing the extended input information with the translation source sentence in a preset manner, and inputting the extended input information and the preset machine translation model together to execute a translation enhancement training task so as to obtain an enhanced machine translation model.
By the steps, the expanded input information is spliced behind the translation source sentence and is input into the preset machine translation model together to execute the translation enhancement training task, so that the method has the advantages of enhanced learning on the preset machine translation model and strong practicability.
As an optional implementation, the preset mode includes a direct splicing mode and a preprocessing splicing mode.
It can be understood that the manner of inputting the qualifying part into the preset model of the invention includes a direct-concatenation manner and a preprocessing-concatenation manner. The direct-concatenation manner directly concatenates the retrieved corpus after the translation source sentence. The preprocessing-concatenation manner incorporates the preceding part-of-speech tagging and dynamic fusion into the input, adds extra weight to the information in the retrieval information that corresponds to the translation source sentence, and increases the probability that this part of the information is masked. The two manners have complementary advantages: the preprocessing manner lets the model adapt quickly to the training target, while the direct-concatenation manner enhances the robustness of the preset model and reduces errors possibly introduced in preprocessing. Both manners are used to train the machine translation model, improving the quality and readability of machine translation, and have strong practicability.
It should be noted that, in the process of inputting, the direct concatenation mode performs a masking operation according to the BERT mode, and preferably, only performs a masking operation on the translation source sentence, so that the enhanced machine translation model utilizes the concatenated retrieval information; and the preprocessing splicing mode enables the covered part in the translation source sentence to improve the probability of being masked, and the corresponding part is reserved in the retrieval information, so that the preset model can quickly notice the part needing important attention, and the model training speed is improved.
As an alternative embodiment, the loss function of the first joint translation enhancement training is:

L1(θ) = − log P(Y | M(X, R); θ)

where θ represents the trained machine translation model, X and Y respectively represent the translation source sentence and the corresponding translation, R represents the retrieved corpus pair, and M(·) represents the masking operation.
It should be noted that, when concatenation is performed in the direct-concatenation manner, the retrieved information may contain no part corresponding to the masked words; however, direct concatenation avoids the errors that can arise when consistent words are selected by the word-covering method, which is weak in semantic reasoning, and under direct concatenation the machine translation model automatically learns the alignment information. The preprocessing-concatenation manner is more direct: the word-covered part of the translation source sentence has a higher probability of being masked while the corresponding part is retained in the retrieval information, so the machine translation model quickly attends to the parts requiring attention, improving training speed. The advantages of the two manners are complementary: preprocessing lets the machine translation model adapt quickly to the training target, and direct concatenation then enhances the robustness of the model and reduces errors possibly introduced in preprocessing.
As an alternative embodiment, the loss function of the second joint translation enhancement training is:

L2(θ) = − log P(Y | M(X, R); θ) − log P(Y | X; θ)

where θ represents the trained machine translation model, X and Y respectively represent the translation source sentence and the corresponding translation, R represents the retrieved corpus pair, and M(·) represents the masking operation.
It should be noted that, in the actual use of machine translation, it is sometimes difficult for a translation source provided by a user to find a very similar retrieval source in a preset memory base, which causes a difference in the actual translation and training processes of a machine translation model.
It can be understood that, in the actual process of machine translation, the source sentence provided by the user sometimes cannot find a very similar translation memory in the pre-constructed memory base, so the model behaves differently in actual translation and in training. In the loss function of the joint training, training pairs without spliced translation memory are therefore added to the training data, and joint training on the enhanced training pairs further improves the robustness of the model, further enhances the quality of machine translation, improves readability, and has strong practicability.
It should be noted that, both the direct splicing method and the pre-processing splicing method are used to enhance the representation of the encoder in the machine translation model, so that a better representation can be obtained to be input to the decoder side, and the enhanced machine translation model understands and utilizes the input.
Further, mask training enhances the encoder part of the machine translation model in understanding the input information, but the decoder part receives no additional targeted training. To enhance the robustness of the machine translation model and improve the decoder's understanding and use of the encoder's information, the decoder is reinforced as well. The reinforcement adopts a contrastive-learning training method, whose key is how to construct samples, including positive and negative examples. In the embodiment of the invention, the corresponding target sentence in the original corpus pair is taken as the positive example, and negative examples are constructed by two methods: (1) the negative-example construction method of traditional contrastive learning; (2) using the alignment information to guide the removal of words.
It should be noted that method (1), the traditional contrastive-learning construction, randomly removes a word from the corresponding target sentence in the corpus pair to obtain a random substitution; that is, in the embodiment of the invention, any word in the translation target sentence is removed, and the translation target sentence with words removed is combined with the translation source sentence into a negative example. Method (2) uses the alignment information to guide word removal, obtaining an aligned substitution; that is, a word in the translation target sentence that is aligned with the retrieval target sentence is removed to construct a negative example. In the original corpus pair, for the translation source sentence, the words of the final retrieval target sentence are mapped through word coverage and spliced into the input. From the covered words and the alignment information, the corresponding word positions in the translation target sentence can be obtained simultaneously; the corresponding words are removed and then combined with the translation source sentence to form a negative example.
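The two negative-example constructions can be sketched as dropping a word either at a random index or at an alignment-supplied index. The helper names and the example indices are assumptions.

```python
import random

def drop_word(tokens, i):
    """Remove the word at index i, the common core of both constructions."""
    return tokens[:i] + tokens[i + 1:]

def random_negative(target_tokens, rng):
    """Method (1): traditional contrastive negative, drop one random word."""
    return drop_word(target_tokens, rng.randrange(len(target_tokens)))

tgt = ["the", "leader", "may", "limit", "the", "explanation", "time"]
aligned = [1, 6]  # method (2): indices aligned with the retrieval target sentence
print(drop_word(tgt, aligned[0]))  # aligned negative: drops 'leader'
```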
As an alternative embodiment, the loss function of the contrastive training is:

Lc(θ) = max(0, η − log P(Y | X; θ) + log P(Y_rand | X; θ)) + max(0, η − log P(Y | X; θ) + log P(Y_align | X; θ))

where θ represents the machine translation model obtained by training, (X, Y) represent the translation source sentence and the translation target sentence, Y_rand represents a negative example constructed in the random manner, and Y_align represents a negative example constructed under the guidance of the alignment information. The distance between the negative examples and the positive example is kept above η (eta), so that the trained model can distinguish the negative examples, enhancing the robustness of the model.
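A margin-based (hinge) formulation of the contrastive objective above can be sketched in a few lines; the scores stand in for model log-probabilities, and both the function name and the numbers are illustrative assumptions.

```python
def contrastive_loss(logp_pos, logp_negs, margin):
    """Margin (hinge) loss sketch: push the model score of each negative
    example at least `margin` below the score of the true target sentence."""
    return sum(max(0.0, margin - (logp_pos - lp)) for lp in logp_negs)

# Positive log-probability clearly above both negatives: zero loss.
print(contrastive_loss(-1.0, [-5.0, -6.0], margin=2.0))  # 0.0
# A negative within the margin contributes a positive penalty.
print(contrastive_loss(-1.0, [-2.0], margin=2.0))        # 1.0
```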
In an embodiment of the present invention, referring to fig. 6 and fig. 10, the translation source sentence is: The leader may limit the time to be allowed for such extensions; the translation target sentence is the corresponding Chinese sentence. The spliced input information, The leader may limit the time to be allowed for such extensions, <sep> <noise> leader <noise> time limit <noise>, is taken as an example and input into the encoder end, where <sep> serves as a separator and the <noise> characters replace the filtered information.
It should be noted that the original input is the translation target sentence, obtained through decoding by the decoder. Random replacement randomly removes a word in the translation target sentence; in this embodiment, the word for explanation is deleted to obtain the random substitution. Aligned replacement removes a word of the translation target sentence that was aligned with the retrieval target sentence; after the alignment operation on the translation target sentence, leader and time limit remain, and in this embodiment the word for leader is deleted to obtain the aligned substitution.
To sum up, in the third embodiment of the present invention, input information is generated by the method for generating input information for machine translation of the first embodiment. The input information can be understood as comprising multiple retrieval target sentences obtained by dynamically fusing, multiple times, the initial retrieval source sentence (multiple corpus pairs obtained by randomly mixing multiple first corpus pairs and multiple second corpus pairs), with the retrieval target sentences spliced in the order obtained. A mask task is performed on the obtained input information to obtain extended input information, and the extended input information is spliced onto the translation source sentence in the direct-concatenation or preprocessing manner and input into the preset machine translation model to perform the translation enhancement training task, thereby obtaining the enhanced machine translation model.
Referring to fig. 10, a fourth embodiment of the present invention provides a machine translation input information generating system 4, including the following modules:
the acquisition module 10: obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
the filtering module 20: filtering the initial retrieval source sentence according to a preset condition to obtain a filtered retrieval source sentence;
the processing module 30: dynamically fusing the filtered retrieval source sentence to perform word coverage on the translation source sentence to obtain a covered translation source sentence;
the generation module 40: using the covered translation source sentence as the input information.
It can be understood that, when the modules of the machine-translation input-information generating system 4 operate, the machine-translation input-information generating method provided in the first embodiment is used; therefore, integrating or configuring the obtaining module 10, the filtering module 20, the processing module 30, and the generating module 40 into different hardware to produce functions similar to the effects achieved by the present invention also falls within the scope of the present invention.
A fifth embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program realizes the method for generating input information for machine translation described in any one of the above.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, from which B can be determined. It should also be understood, however, that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above detailed descriptions of the method for generating input information for machine translation, the method for obtaining a machine translation model, and the computer-readable storage medium disclosed in the embodiments of the present invention have been provided, and specific examples are applied herein to explain the principles and embodiments of the present invention, and the descriptions of the above embodiments are only used to help understanding the method and the core ideas of the present invention; meanwhile, for the persons skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present description should not be construed as a limitation to the present invention, and any modification, equivalent replacement, and improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A machine translation input information generation method is characterized in that: the method comprises the following steps:
obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
filtering the initial retrieval source sentence according to a preset condition to obtain a filtered retrieval source sentence;
dynamically fusing the filtered retrieval source sentence to perform word coverage on the translation source sentence to obtain a covered translation source sentence;
and the translation source sentence after the covering is used as input information.
2. The method of generating machine-translated input information of claim 1, wherein: the method comprises the following steps of obtaining a machine translation source sentence, and carrying out initial retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence:
obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain at least two corpus pairs;
and mixing the at least two corpus pairs to obtain an initial retrieval source sentence.
3. The machine-translated input information generating method of claim 1, wherein:
the predetermined conditions include alignment and part-of-speech tagging.
4. A machine-translated input information generation method as recited in claim 3, wherein: the step of filtering the initial retrieval source sentence according to a predetermined condition to obtain a filtered retrieval source sentence specifically comprises the following steps:
performing preliminary filtering on the initial retrieval source sentence through alignment;
and performing part-of-speech tagging on the translation source sentence and the search source sentence after the preliminary filtering, and reserving a part of the search source sentence with the same part of speech as the translation source sentence.
5. The method for generating machine translation input information according to claim 2, wherein dynamically fusing the filtered retrieved source sentences comprises setting an information threshold for covering the translation source sentence and setting an information threshold for the corpus pairs to be added.
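One hypothetical reading of the two thresholds: a cap on the fraction of source words that may be covered, and a cap on how many words each corpus pair may contribute. The threshold values, the `[COV]` marker, and the greedy strategy below are all invented for illustration.

```python
def fuse_with_thresholds(source, pairs, cover_ratio=0.5, per_pair_limit=2):
    """Greedily cover source words with retrieved pairs, under two thresholds."""
    words = source.split()
    max_cover = int(len(words) * cover_ratio)  # coverage threshold
    covered_idx = set()
    for retrieved_src, _target in pairs:
        retrieved_words = set(retrieved_src.split())
        added = 0  # per-corpus-pair information threshold counter
        for i, w in enumerate(words):
            if len(covered_idx) >= max_cover:
                break
            if i not in covered_idx and w in retrieved_words and added < per_pair_limit:
                covered_idx.add(i)
                added += 1
    return " ".join("[COV]" if i in covered_idx else w
                    for i, w in enumerate(words))

fused = fuse_with_thresholds(
    "the cat ran home fast",
    [("the cat sat", "x"), ("ran home", "y")],
)
print(fused)  # [COV] [COV] ran home fast -- coverage capped at 2 of 5 words
```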
6. A machine translation method, characterized in that: input information generated by the method for generating machine translation input information according to any one of claims 1 to 5 is acquired and input into a preset machine translation model to obtain a corresponding translation result.
7. A method for acquiring a machine translation model, characterized by comprising the following steps: acquiring input information generated by the method for generating machine translation input information according to any one of claims 1 to 5, and performing a masking task on the acquired input information to obtain extended input information;
and inputting the extended input information into a preset machine translation model to execute a translation enhancement training task, thereby obtaining an enhanced machine translation model.
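The masking task above might be sketched as random token masking. The 15% rate and the `[MASK]` symbol are conventional assumptions borrowed from masked-language-model training, not values taken from the patent.

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with a mask symbol."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [mask if rng.random() < rate else t for t in tokens]

tokens = "the cat ran home and the dog sat down too".split()
extended = mask_tokens(tokens)
```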
8. The method for acquiring a machine translation model according to claim 7, wherein the input information fed to the preset machine translation model further comprises the translation source sentence.
9. The method for acquiring a machine translation model according to claim 8, wherein the step of inputting the extended input information into a preset machine translation model to execute a translation enhancement training task to obtain an enhanced machine translation model specifically comprises:
splicing the extended input information onto the translation source sentence in a preset manner, and inputting them together into the preset machine translation model to execute the translation enhancement training task, thereby obtaining the enhanced machine translation model.
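A minimal sketch of the splicing step, assuming the "preset manner" is separator-delimited concatenation; the `[SEP]` token is an assumed convention, not specified by the patent.

```python
def splice(source, extended_inputs, sep=" [SEP] "):
    """Concatenate extended input information onto the translation source sentence."""
    return source + sep + sep.join(extended_inputs)

model_input = splice("the cat ran home", ["the [MASK] sat", "a dog ran"])
print(model_input)
# the cat ran home [SEP] the [MASK] sat [SEP] a dog ran
```

The combined string would then be tokenized and fed to the translation model as a single training input.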
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed, implements the method for generating machine translation input information according to any one of claims 1 to 5.
CN202210723325.7A 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model Active CN114792101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210723325.7A CN114792101B (en) 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210723325.7A CN114792101B (en) 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model

Publications (2)

Publication Number Publication Date
CN114792101A true CN114792101A (en) 2022-07-26
CN114792101B CN114792101B (en) 2022-09-23

Family

ID=82462985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210723325.7A Active CN114792101B (en) 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model

Country Status (1)

Country Link
CN (1) CN114792101B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125928A1 (en) * 2001-12-28 2003-07-03 Ki-Young Lee Method for retrieving similar sentence in translation aid system
JP2006004366A (en) * 2004-06-21 2006-01-05 Advanced Telecommunication Research Institute International Machine translation system and computer program for it
CN107329961A (en) * 2017-07-03 2017-11-07 西安市邦尼翻译有限公司 A kind of method of cloud translation memory library Fast incremental formula fuzzy matching


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAO Liang et al., "Bilingual Sentence Pair Selection Based on the Fusion of Translation Model and Language Model", Journal of Chinese Information Processing *
HUANG Jin et al., "Training Data Selection and Optimization for Statistical Translation Systems Based on Information Retrieval", Journal of Chinese Information Processing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392189A (en) * 2022-10-28 2022-11-25 北京砍石高科技有限公司 Method and device for generating multi-language mixed corpus and training method and device
CN116992894A (en) * 2023-09-26 2023-11-03 北京澜舟科技有限公司 Training method of machine translation model and computer readable storage medium
CN116992894B (en) * 2023-09-26 2024-01-16 北京澜舟科技有限公司 Training method of machine translation model and computer readable storage medium

Also Published As

Publication number Publication date
CN114792101B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
Liu et al. Event extraction as machine reading comprehension
CN114792101B (en) Method for generating and translating input information of machine translation and acquiring machine model
EP0560587B1 (en) Sign language translation system and method
Rayner et al. Putting linguistics into speech recognition: The regulus grammar compiler
CN110276071B (en) Text matching method and device, computer equipment and storage medium
CN115048944B (en) Open domain dialogue reply method and system based on theme enhancement
CN114580382A (en) Text error correction method and device
CN113268586A (en) Text abstract generation method, device, equipment and storage medium
CN114428850B (en) Text retrieval matching method and system
CN108846138A (en) A kind of the problem of fusion answer information disaggregated model construction method, device and medium
CN114118417A (en) Multi-mode pre-training method, device, equipment and medium
Kim et al. Evolved Speech-Transformer: Applying Neural Architecture Search to End-to-End Automatic Speech Recognition.
CN114357127A (en) Intelligent question-answering method based on machine reading understanding and common question-answering model
CN114997181A (en) Intelligent question-answering method and system based on user feedback correction
CN114217766A (en) Semi-automatic demand extraction method based on pre-training language fine-tuning and dependency characteristics
CN113742446A (en) Knowledge graph question-answering method and system based on path sorting
CN113779190B (en) Event causal relationship identification method, device, electronic equipment and storage medium
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
Kasai et al. End-to-end graph-based TAG parsing with neural networks
Aghzal et al. Distributional word representations for code-mixed text in Moroccan darija
CN113064985A (en) Man-machine conversation method, electronic device and storage medium
CN117076608A (en) Script event prediction method and device for integrating external event knowledge based on text dynamic span
Arwidarasti et al. Converting an Indonesian constituency treebank to the Penn treebank format
Xie et al. Focusing attention network for answer ranking
CN112183114A (en) Model training and semantic integrity recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant