CN114792101B - Method for generating and translating input information of machine translation and acquiring machine model - Google Patents


Info

Publication number
CN114792101B
CN114792101B (application CN202210723325.7A)
Authority
CN
China
Prior art keywords
translation
source sentence
retrieval
input information
sentence
Prior art date
Legal status
Active
Application number
CN202210723325.7A
Other languages
Chinese (zh)
Other versions
CN114792101A (en)
Inventor
刘明童
付宇
周明
Current Assignee
Beijing Lanzhou Technology Co., Ltd.
Original Assignee
Beijing Lanzhou Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co., Ltd.
Priority to CN202210723325.7A
Publication of CN114792101A
Application granted
Publication of CN114792101B
Legal status: Active

Classifications

    • G — PHYSICS · G06 — COMPUTING; CALCULATING OR COUNTING · G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 16/3344 — Query execution using natural language analysis
    • G06F 16/335 — Filtering based on additional data, e.g. user or group profiles
    • G06F 18/25 — Fusion techniques
    • G06F 40/268 — Morphological analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of machine translation, and in particular to a method for generating input information for machine translation, a machine translation method, a method for obtaining a machine translation model, and a readable storage medium. The method for generating input information comprises the following steps: obtaining a machine translation source sentence and performing a preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain initial retrieval source sentences; filtering the initial retrieval source sentences according to predetermined conditions to obtain filtered retrieval source sentences; dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentence, obtaining a covered translation source sentence; and taking the covered translation source sentence as the input information. These steps expand the input information as a whole, while filtering under the predetermined conditions and dynamically fusing the retrieval source sentences reduce fusion noise. The expanded input facilitates the subsequent translation and addresses the low translation quality of existing machine translation.

Description

Method for generating input information for machine translation, machine translation method, and method for obtaining a machine translation model
Technical Field
The present invention relates to the field of machine translation technology, and in particular to a method for generating input information for machine translation, a machine translation method, a method for obtaining a machine translation model, and a computer-readable storage medium.
Background
Machine translation is an important component of natural language processing and has received wide attention in recent years. With the continuous development of deep neural networks, neural machine translation models trained end to end have gradually surpassed statistics-based machine translation models, and many practical systems, such as Google Translate, Baidu Translate, and NiuTrans, have been derived from them. A translation memory is a means of assisting translation: it integrates corpora collected in advance with previously translated corpora and consults them during the current translation, which prevents mistranslation and repeated translation, enforces consistency for certain proper nouns in particular, and improves the quality and readability of the final translation.
In the prior art, relatively simple methods are usually adopted to fuse the retrieved sentences, for example, directly concatenating the retrieved target-side sentences after the sentence to be translated and removing some redundant information using alignment information. Such methods supplement information to a certain extent, but they often introduce noise that degrades the overall translation, and they do not fully exploit the retrieved results, so they cannot adequately cover the information of the source-side sentence to be translated. In addition, simply splicing the retrieved target-side sentence onto the source-side sentence ignores the interaction between the two languages at the input when the whole sentence is encoded, so the final translation cannot be effectively associated with the additionally spliced target-side sentence. Existing machine translation therefore suffers from low translation quality.
Disclosure of Invention
To solve the problem of low quality in existing machine translation, the invention provides a method for generating input information for machine translation, a machine translation method, a method for obtaining a machine translation model, and a computer-readable storage medium.
The invention provides a method for generating input information of machine translation, which comprises the following steps:
obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
filtering the initial retrieval source sentence according to a preset condition to obtain a filtered retrieval source sentence, wherein the preset condition comprises alignment and part-of-speech tagging;
dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentences to obtain covered translation source sentences;
the translation source sentence after covering is used as input information;
the method comprises the following steps of obtaining a machine translation source sentence, and carrying out preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence: obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain at least two corpus pairs; mixing the two corpus pairs to obtain an initial retrieval source sentence, wherein the retrieval mode comprises keyword retrieval and vector retrieval;
the method for dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentences to obtain the covered translation source sentences specifically comprises the following steps: when the retrieval source sentences are dynamically fused for the first time, the size of the editing distance between each corpus pair and the translation source sentences is calculated through a fuzzy matching algorithm of the editing distance, and the corpus pair in the retrieval source sentences with the most front sequence is selected for fusion during the first fusion; during the second fusion, removing the words in the translation source sentence which are already covered by the previous retrieval source sentence, using the remaining words and the corpus pairs in the remaining initial retrieval source sentence to calculate the editing distance, sequencing, continuously selecting the closest corpus pair in the current retrieval source sentence to perform dynamic fusion, repeating the steps until the conditions of the set information threshold value covering the translation source sentence and the set information threshold value of the corpus pair are not met, and exiting; and after the dynamic fusion is completed, the translation source sentence is covered.
Preferably, each round of dynamic fusion is an individual process of word-coverage calculation, alignment, part-of-speech tagging, and sentence concatenation.
Preferably, the step of filtering the initial search source sentence according to a predetermined condition to obtain a filtered search source sentence specifically includes the following steps:
performing preliminary filtering on the initial retrieval source sentence through alignment;
and performing part-of-speech tagging on the translation source sentence and the preliminarily filtered retrieval source sentence, and retaining the part of the retrieval source sentence whose part of speech matches the translation source sentence.
To solve the above technical problem, the invention further provides a machine translation method, which obtains the input information generated by any of the above methods for generating input information for machine translation and inputs it into a preset machine translation model to obtain the corresponding translation result.
To solve the above technical problem, the invention also provides a method for obtaining a machine translation model, which obtains the input information generated by any of the above methods for generating input information for machine translation and performs a mask task on the obtained input information to obtain extended input information;
the extended input information is input into a preset machine translation model to perform a translation-enhancement training task, obtaining an enhanced machine translation model.
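The mask task on the obtained input information might be sketched as follows. This is an illustrative assumption: here "masking" means randomly replacing a fraction of the covered words with a [MASK] placeholder; the patent does not fix the masking scheme, ratio, or token.

```python
import random

def apply_mask_task(tokens, covered, mask_token="[MASK]", ratio=0.3, seed=0):
    """Randomly replace a fraction of the covered words with a mask token,
    producing the extended input used for translation-enhancement training."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    masked = []
    for tok in tokens:
        if tok in covered and rng.random() < ratio:
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked
```

During enhancement training, the model would then be asked to translate correctly despite the masked positions, forcing it to use the spliced retrieval information.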
Preferably, the input information of the preset machine translation model further includes a translation source sentence.
Preferably, the step of inputting the extended input information into a preset machine translation model to execute a translation enhancement training task to obtain an enhanced machine translation model specifically includes the following steps:
and splicing the extended input information on the translation source sentence in a preset mode, and inputting the extended input information into a preset machine translation model together to execute a translation enhancement training task so as to obtain an enhanced machine translation model.
The present invention further provides a computer-readable storage medium storing a computer program, which when executed implements any one of the above-mentioned methods for generating input information for machine translation.
Compared with the prior art, the method for generating input information for machine translation, the machine translation method, the method for obtaining a machine translation model, and the computer-readable storage medium of the invention have the following advantages:
1. In the method for generating input information for machine translation, a machine translation source sentence is first obtained, and a preliminary retrieval from a preset memory base based on at least two different retrieval modes yields initial retrieval source sentences; the initial retrieval source sentences are filtered according to predetermined conditions to obtain filtered retrieval source sentences; the filtered retrieval source sentences are then dynamically fused to perform word coverage on the translation source sentence, and the covered translation source sentence is finally taken as the input information. These steps expand the input information as a whole, while filtering under the predetermined conditions and dynamically fusing the retrieval source sentences reduce fusion noise. The expanded input facilitates subsequent translation and solves the problem of low translation quality in existing machine translation.
2. A machine translation source sentence is first obtained, and a preliminary retrieval from a preset memory base in at least two different retrieval modes yields at least two sets of corpus pairs; mixing these corpus pairs yields the initial retrieval source sentences. Initial retrieval source sentences derived from at least two different sets of corpus pairs are thereby obtained, which is highly practical.
3. The predetermined conditions comprise alignment and part-of-speech tagging. Setting these conditions allows the retrieval source sentences to be preliminarily screened to obtain the required retrieval source sentences, which speeds up the whole process and is both practical and feasible.
4. In these steps, the initial retrieval source sentences are preliminarily filtered by alignment; part-of-speech tagging is then applied to the translation source sentence and the preliminarily filtered retrieval source sentences, and the parts of the retrieval source sentences whose parts of speech match the translation source sentence are retained. Alignment establishes word correspondences between a retrieval source sentence and its retrieval target sentence, and words without a valid correspondence are screened out; this further filters the large set of original retrieval source sentences down to those that satisfy the conditions, which is highly practical. Part-of-speech tagging is then applied simultaneously to the words of the preliminarily filtered retrieval source sentences and of the translation source sentence; words in the retrieval source sentences whose parts of speech disagree with the translation source sentence are filtered out, and words whose parts of speech agree are retained.
5. The method of dynamically fusing the filtered retrieval source sentences sets an information threshold for covering the translation source sentence and an added-information threshold for the corpus pairs.
6. The invention also provides a machine translation method, a method for obtaining a machine translation model, and a computer-readable storage medium, which have the same beneficial effects as the method for generating input information for machine translation and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a method for generating input information for machine translation according to a first embodiment of the present invention.
Fig. 2 is a flowchart of the step S1 of the method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 3 is a flowchart of the step S2 of the method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 4 is a first flowchart of a method for generating input information for machine translation according to a first embodiment of the present invention.
Fig. 5 is a flowchart illustrating the step S21 of the method for generating machine-translated input information according to the first embodiment of the present invention.
Fig. 6 is a flowchart of a method for generating input information for machine translation according to the first embodiment of the present invention.
FIG. 7 is a flowchart illustrating steps of a machine translation method according to a second embodiment of the present invention.
FIG. 8 is a flowchart illustrating the steps of a method for acquiring a machine translation model according to a third embodiment of the present invention.
Fig. 9 is a flowchart of step S302 of the method for acquiring a machine translation model according to the third embodiment of the present invention.
Fig. 10 is a flowchart of a method for generating input information for machine translation according to the first embodiment of the present invention.
Fig. 11 is a block diagram of a machine translation input information generation system according to a fourth embodiment of the present invention.
The attached drawings indicate the following:
4. a machine translation input information generation system;
10. an acquisition module; 20. a filtering module; 30. a processing module; 40. a generation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The terms "vertical," "horizontal," "left," "right," "up," "down," "left up," "right up," "left down," "right down," and the like as used herein are for illustrative purposes only.
Referring to fig. 1, a first embodiment of the present invention provides a method for generating machine-translated input information, including the following steps:
s1, obtaining a machine translation source sentence, and carrying out preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
s2, filtering the initial search source sentence according to the preset conditions to obtain a filtered search source sentence;
s3, dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentences to obtain covered translation source sentences;
and S4, taking the translation source sentence after covering as input information.
It can be understood that in the method for generating input information for machine translation according to the present invention, a machine translation source sentence is obtained and, based on at least two different retrieval modes, initial retrieval source sentences are obtained by preliminary retrieval from a preset memory base; the initial retrieval source sentences are filtered according to predetermined conditions to obtain filtered retrieval source sentences; the filtered retrieval source sentences are dynamically fused to perform word coverage on the translation source sentence; and the covered translation source sentence is finally taken as the input information. These steps expand the input information as a whole, while filtering under the predetermined conditions and dynamically fusing the retrieval source sentences reduce fusion noise, facilitate the subsequent translation, and solve the problem of low translation quality in existing machine translation.
As an alternative implementation, the two different retrieval modes include keyword retrieval and vector retrieval. Keyword retrieval obtains vocabulary-based sentences matching the translation source sentence from the preset memory base, while vector retrieval obtains semantics-based matches. The two modes retrieve sentence information that matches the translation source sentence along different dimensions and provide a large amount of matching information for the subsequent process, which is highly practical.
It should be noted that, referring to fig. 4, the keyword retrieval mode uses Elasticsearch (ES) as its implementation tool. Elasticsearch is an open-source distributed search engine that stores and searches structured and unstructured data. At search time, the sentence itself is used directly as the query, and retrieval from the memory base is based on the BM25 algorithm: for each document d in the memory base, a similarity score against the query sentence q is obtained as

score(d, q) = Σ_i W_i · R(q_i, d),

where q_i denotes a word in the query sentence and W_i the weight of the current word; the sum combines the similarities of all query words with the document and represents the relevance of the document to the current query sentence. The BM25 algorithm thus yields the keyword-based search results as one part of the preliminary retrieval results.
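For concreteness, a self-contained BM25 scorer of the kind the ES retrieval relies on can be sketched as follows; the k1 and b defaults are conventional values, not taken from the patent, and documents are plain word lists rather than an ES index.

```python
import math

def bm25_score(query_words, doc_words, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the classic BM25 formula."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)  # average doc length
    n = len(corpus)
    score = 0.0
    for w in set(query_words):
        df = sum(1 for d in corpus if w in d)             # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # word weight W_i
        tf = doc_words.count(w)                           # term frequency
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc_words) / avgdl))
    return score
```

The idf term plays the role of the per-word weight W_i above, and summing over the query words gives the document's relevance to the query sentence.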
Furthermore, the vector retrieval mode uses the faiss open-source library as its implementation tool. Before using faiss, the sentence-transformers library is used to obtain a vector representation of each sentence, and the sentences in the memory base are then indexed by faiss through these vectors. When a query is needed, the query sentence is represented as a vector

q = (q_1, q_2, …, q_n).

With faiss, the Euclidean distance between the vector representation of each sentence in the memory base and that of the query sentence can be computed, and this distance represents the degree of correlation between the two sentences. Assuming the vector representation of a corpus sentence in the memory base is

p = (p_1, p_2, …, p_n),

the similarity score of the two sentences is the distance

d(p, q) = √( Σ_i (p_i − q_i)² ),

where a smaller distance indicates a closer match.
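The distance computation that faiss performs at scale can be illustrated in miniature; this is a sketch with brute-force search, whereas real faiss indexes precompute and approximate this nearest-neighbour lookup.

```python
import math

def euclidean_distance(p, q):
    """Distance between two sentence vectors; smaller means more similar."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def nearest_sentence(query_vec, memory_vecs):
    """Return the index of the memory-base sentence closest to the query."""
    return min(range(len(memory_vecs)),
               key=lambda i: euclidean_distance(query_vec, memory_vecs[i]))
```

Ranking the memory base by this distance yields the semantics-based half of the preliminary retrieval results.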
referring to fig. 2, step S1 specifically includes the following steps:
s11, obtaining a machine translation source sentence, and carrying out preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain at least two corpus pairs;
and S12, mixing the two sets of corpus pairs to obtain the initial retrieval source sentences.
It can be understood that in these steps a machine translation source sentence is first obtained, and preliminary retrieval from a preset memory base in at least two different retrieval modes yields at least two sets of corpus pairs; mixing them yields initial retrieval source sentences based on at least two different sets of corpus pairs, which is highly practical.
It should be noted that fusing at least two different sets of corpus pairs yields retrieval source sentences that are closely associated with the translation source sentence along different dimensions.
Specifically, in one embodiment of the present invention, when retrieving for a translation source sentence from the preset memory base, vector retrieval finds sentences matching the translation source sentence on semantics, while keyword retrieval finds matches on vocabulary. A specific example: for the translation source sentence "In the past five years, IAEA has induced In improving the vision of the safety In research and technology", the sentence obtained by vector-based retrieval is "It all supported activities of IAEA and its safety records region", and the sentence obtained by keyword retrieval is "Over the past 10 years, the World Bank Group induced $33 Billion in height and deletion in leveling centers". From these example sentences it can be seen that the vector-retrieval result and the translation source sentence both discuss the IAEA but differ considerably in wording, while the keyword-retrieval result shares more words with the translation source sentence but discusses different content. That is, in the embodiment of the present invention, the results of keyword-based retrieval and vector-based retrieval are mixed to obtain the initial retrieval source sentences.
As an optional implementation, the predetermined conditions include alignment and part-of-speech tagging. Setting these conditions allows the retrieval source sentences to be preliminarily screened to obtain the required retrieval source sentences, which speeds up the whole process and is both practical and feasible.
Referring to fig. 3 and 6, step S2 specifically includes the following steps:
s21, performing preliminary filtering on the initial retrieval source sentence through alignment;
and S22, performing part-of-speech tagging on the translation source sentence and the preliminarily filtered retrieval source sentence, and retaining the part of the retrieval source sentence whose part of speech matches the translation source sentence.
Understandably, in this step the initial retrieval source sentences are preliminarily filtered by alignment; the translation source sentence and the preliminarily filtered retrieval source sentences are then part-of-speech tagged, and the parts of the retrieval source sentences whose parts of speech match the translation source sentence are retained. Alignment establishes word correspondences between a retrieval source sentence and its retrieval target sentence, and words without a valid correspondence are screened out; this further filters the large set of original retrieval source sentences down to those that satisfy the conditions, which is highly practical. Then, part-of-speech tagging is applied simultaneously to the words of the preliminarily filtered retrieval source sentences and of the translation source sentence; words in a retrieval source sentence whose part of speech disagrees with the translation source sentence are filtered out, and words whose part of speech agrees are retained. This second round of filtering screens out part-of-speech-inconsistent words from the preliminarily screened retrieval source sentences and yields retrieval source sentences that satisfy the conditions, which is both practical and feasible.
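Step S22 can be sketched as a comparison of (word, tag) pairs. The tags reuse the Penn-Treebank-style labels from the patent's own example, and the function name and list-of-pairs representation are illustrative assumptions; in practice the pairs would come from a real tagger.

```python
def pos_filter(source_tagged, retrieved_tagged):
    """Keep only retrieved words whose part-of-speech tag matches the tag the
    same word carries in the translation source sentence.

    Both inputs are lists of (word, pos_tag) pairs, e.g. ("limit", "VB").
    """
    source_tags = dict(source_tagged)
    return [(w, t) for w, t in retrieved_tagged
            if w in source_tags and source_tags[w] == t]
```

A word such as "limit" tagged as a verb (VB) in the translation source sentence but as a noun (NN) in the retrieval source sentence is thereby dropped.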
Referring to fig. 5 and 6, step S21 specifically includes the following steps:
s211, acquiring a retrieval target sentence based on a preset memory base;
s212, performing word coverage calculation on the translation source sentence and the initial retrieval source sentence, and reserving the part of the initial retrieval source sentence overlapped with the translation source sentence;
s213: and aligning the overlapped part in the initial search source sentence with the search target sentence.
It should be noted that the retrieval target sentence refers to the sentence stored in the preset memory base as the translation of the initial retrieval source sentence into the corresponding target language.
It can be understood that through the above steps the part of each initial retrieval source sentence that overlaps with the translation source sentence is obtained and aligned, so that the initial retrieval source sentence is preliminarily filtered, which is highly practical.
Referring to FIG. 6, in one embodiment of the present invention, if translation is from English into Chinese, then the translation source sentence is English, the retrieval source sentence is also English, and the retrieval target sentence is Chinese. In the alignment process for the retrieval source sentence, word-coverage calculation is first performed between the translation source sentence and the initial retrieval source sentence, and the parts of the initial retrieval source sentence covered by words of the translation source sentence are retained. The words of the overlapping initial retrieval source sentence (English) are then aligned with the words of the retrieval target sentence (Chinese): the Chinese words of the retrieval target sentence are matched to the English words of the covered parts of the initial retrieval source sentence, i.e., the words are put into correspondence. After the initial retrieval source sentence is aligned, the word-covered parts of the translation source sentence (English) and of the initial retrieval source sentence (English) are each part-of-speech tagged, and words in the covered parts of the initial retrieval source sentence whose parts of speech differ from the translation source sentence are filtered out.
It should be noted that, for each initial retrieval source sentence, there is a corresponding retrieval target sentence in the target language in the preset memory base.
It should be noted that the tools used for alignment include fast-align and/or awesome-align; the word-correspondence information is obtained by running the alignment tool on the retrieval source sentence and the retrieval target sentence. The part of the retrieval target sentence to be retained, namely the target-sentence words corresponding to the covered words in the retrieval source sentence, is then determined from the alignment information.
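As a minimal sketch of this step, the following Python snippet extracts the target-sentence words corresponding to covered source words from a Pharaoh-format alignment line (the `i-j` index-pair format emitted by fast-align and awesome-align). The function name and the example tokens are illustrative, not from the patent:

```python
# Given a Pharaoh-format alignment line (as emitted by fast-align or
# awesome-align), map each covered source-word index to its target words.
def aligned_target_words(alignment_line, target_tokens, covered_src_indices):
    pairs = [tuple(map(int, p.split("-"))) for p in alignment_line.split()]
    kept = []
    for src_i, tgt_j in pairs:
        if src_i in covered_src_indices and tgt_j < len(target_tokens):
            kept.append(target_tokens[tgt_j])
    return kept

# Hypothetical example: source indices {0, 3} are covered by the translation source sentence.
words = aligned_target_words("0-0 1-1 2-3 3-2", ["甲", "乙", "丙", "丁"], {0, 3})
```

Only the covered source positions contribute target words; everything else is later replaced by noise placeholders.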
For example, when the translation source sentence is "The leader may limit the time to be allowed for explanations", the initial retrieval source sentence is "The acting leader, may I remind the delegates that the time limit for statements at the session is five minutes", and the retrieval target sentence is the Chinese sentence "Agent leader (speaking in English): please allow me to remind representatives that the time limit for statements at this conference is 5 minutes", the alignment process is as follows. First, word coverage calculation is performed on the translation source sentence and the initial retrieval source sentence, and the overlapping parts of the initial retrieval source sentence obtained are "the leader" and "the time limit"; the Chinese meanings of "leader" and "time limit" are then found in the retrieval target sentence, completing the word correspondence. After the alignment process, the word-covered part of the translation source sentence is "The leader" and "limit the time", and the word-covered part of the initial retrieval source sentence is "leader" and "the time limit".
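The word coverage calculation of step S212 can be sketched as a simple overlap of lower-cased tokens. The example sentences below are a paraphrase of the patent's example, and the function name is illustrative:

```python
def word_coverage(translation_src, retrieval_src):
    """Return the tokens of the retrieval source sentence that also occur
    in the translation source sentence (the 'covered' part)."""
    src_tokens = set(translation_src.lower().split())
    return [t for t in retrieval_src.split() if t.lower() in src_tokens]

overlap = word_coverage(
    "The leader may limit the time to be allowed for explanations",
    "Mr leader , the time limit for statements is five minutes")
```

A production system would tokenize more carefully, but the principle is the same: only overlapping words survive the preliminary filtering.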
After alignment, the preliminary filtering is completed, at which point part-of-speech tagging is performed for further filtering. Specifically, the above example is still used for explanation. The result of the preliminary screening after alignment is: the word-covered part of the translation source sentence is "leader" and "limit the time", and the word-covered part of the initial retrieval source sentence is "leader" and "the time limit". Part-of-speech tagging is performed on both word-covered parts, that is, tagging parts of speech such as verbs, nouns and prepositions. The parts of speech of the words in the covered part of the translation source sentence are [leader: NNP, limit: VB, the: DT, time: NN], and those in the covered part of the initial retrieval source sentence are [leader: NNP, the: DT, time: NN, limit: NN]. Since the parts of speech of "limit" in the two source sentences are not the same, the word is removed, leaving only "leader" and "the time".
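The part-of-speech filter above can be sketched as follows. A real system would obtain the tags from a tagger (e.g. NLTK's `pos_tag` with Penn Treebank tags); here the tag dictionaries are supplied directly for clarity, and all names are illustrative:

```python
# A minimal sketch of the part-of-speech filter used for the second
# round of filtering: keep only covered words whose tag matches in the
# translation source sentence and the retrieval source sentence.
def pos_filter(src_cover, src_tags, ret_cover, ret_tags):
    kept = []
    for w in ret_cover:
        if w in src_cover and src_tags.get(w) == ret_tags.get(w):
            kept.append(w)
    return kept

src_tags = {"leader": "NNP", "limit": "VB", "the": "DT", "time": "NN"}
ret_tags = {"leader": "NNP", "the": "DT", "time": "NN", "limit": "NN"}  # 'time limit': noun
kept = pos_filter(["leader", "limit", "the", "time"], src_tags,
                  ["leader", "the", "time", "limit"], ret_tags)
# 'limit' is dropped: it is a verb in one sentence and a noun in the other
```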
After the part-of-speech tagging is completed, the filtering of the initial retrieval source sentence is finished; the splicing process is still described with the above example. After part-of-speech tagging, the word-covered part of the retrieval target sentence consists of the Chinese words for "leader", "time" and "limit". Words whose part of speech in the covered part of the retrieval source sentence differs from that in the covered part of the translation source sentence are filtered out: the word-covered part of the retrieval source sentence retains only "leader" and "the time", and "limit" is removed because its part of speech differs between the translation source sentence and the retrieval source sentence, so the word corresponding to "limit" in the retrieval target sentence is deleted at the same time; that is, the covered part of the retrieval target sentence is left with "leader" and "time". When splicing, these two words of the retrieval target sentence are spliced behind the translation source sentence, and the result after splicing is: "The leader may limit the time to be allowed for explanations <sep> <noise> leader <noise> time <noise>", where <sep> serves as a separator and the <noise> characters represent the filtered information. The above process completes one dynamic fusion.
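The splicing step that produces the fused input string can be sketched as below. The separator tokens follow the patent's `<sep>` / `<noise>` convention; the function name is illustrative:

```python
SEP, NOISE = "<sep>", "<noise>"

def splice(translation_src, kept_target_words):
    """Concatenate the retained retrieval-target words after the source
    sentence, marking filtered-out spans with <noise> placeholders."""
    tail = NOISE + " " + (" " + NOISE + " ").join(kept_target_words) + " " + NOISE
    return translation_src + " " + SEP + " " + tail

out = splice("The leader may limit the time to be allowed for explanations",
             ["leader", "time"])
```

Each dynamic fusion appends one more such tail, so the final input is the translation source sentence followed by every qualifying retrieval target fragment.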
It can be understood that the initial retrieval source sentence is fused multiple times: each dynamic fusion yields a retrieval source sentence satisfying the conditions of word coverage calculation, alignment, part-of-speech tagging and splicing, together with the corresponding qualifying retrieval target sentence, namely retrieval target sentence 1 after the first dynamic fusion, retrieval target sentence 2 after the second, and retrieval target sentence i after the ith. Dynamic fusion exits once the conditions are no longer met, and the sentence after the dynamic fusion and splicing are completed is: translation source sentence + retrieval target sentence 1 + retrieval target sentence 2 + ... + retrieval target sentence i; this expression is the input information.
It will be appreciated that each dynamic fusion is a separate process of word coverage computation, alignment, part-of-speech tagging, and sentence concatenation.
In the embodiment of the invention, based on one translation source sentence, retrieval is performed in the preset memory base by both a keyword retrieval mode and a vector retrieval mode, yielding K parallel first corpus pairs from keyword retrieval and K parallel second corpus pairs from vector retrieval; the first and second corpus pairs are mixed, and the 2K parallel corpus pairs obtained by mixing constitute the initial retrieval source sentence.
Optionally, K has a value in the range of 5 to 20. Preferably, in the embodiment of the present invention, the value of K is 10.
That is, in the embodiment of the present invention, 10 first corpus pairs and 10 second corpus pairs are obtained from the translation source sentence through keyword retrieval and vector retrieval, the 10 first corpus pairs and 10 second corpus pairs are randomly mixed to obtain 20 mixed parallel corpus pairs, and the 20 mixed corpus pairs are collectively referred to as the initial retrieval source sentence.
When the retrieval source sentence is dynamically fused for the first time, the edit distance between each of the 20 parallel corpus pairs in the retrieval source sentence and the translation source sentence is calculated through a fuzzy matching algorithm based on the edit distance; the smaller the edit distance, the closer the corpus pair, and the top-ranked corpus pair in the retrieval source sentence is selected for the first fusion.
Then, in the second fusion, the words in the translation source sentence already covered by the previously selected retrieval source sentence are removed; the edit distance between the remaining words and each corpus pair in the remaining 19 initial retrieval source sentences is calculated and sorted, and the closest corpus pair in the current retrieval source sentence is again selected for fusion. The above steps are repeated until the two set information-threshold conditions are no longer met, at which point the process exits.
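The greedy selection loop described above can be sketched as follows. For brevity this sketch omits the two information-threshold exit conditions (the 70% coverage threshold and the 10% added-information threshold described below) and takes the edit-distance function as a parameter; all names are illustrative:

```python
def dynamic_fusion(src_tokens, corpus_pairs, edit_distance):
    """Greedy fusion sketch: repeatedly pick the corpus pair whose source
    side is closest (by edit distance) to the not-yet-covered source words,
    then remove the words it covers and re-rank the remaining pairs."""
    remaining = list(src_tokens)
    selected = []
    pool = list(corpus_pairs)  # [(retrieval_src_tokens, retrieval_tgt), ...]
    while pool and remaining:
        pool.sort(key=lambda p: edit_distance(remaining, p[0]))
        best = pool.pop(0)
        selected.append(best)
        covered = set(best[0])
        remaining = [w for w in remaining if w not in covered]
    return selected, remaining

pairs = [(["a", "b"], "t1"), (["c"], "t2")]
sel, rem = dynamic_fusion(["a", "b", "c"], pairs,
                          lambda a, b: len(set(a) ^ set(b)))
```

In the patent's setting the pool initially holds the 2K mixed corpus pairs and the loop additionally exits when either threshold condition fails.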
As an optional implementation manner, the manner of dynamically fusing the filtered retrieval source sentence includes setting an information threshold for covering the translation source sentence and setting an added-information threshold for the corpus pairs; that is, the two information thresholds are the coverage threshold of the translation source sentence and the added-information threshold of the corpus pairs, respectively.
Understandably, by setting the coverage threshold of the translation source sentence and the added-information threshold of each corpus pair, the information of the corpus pairs can be spliced conditionally, reducing the noise caused by repeatedly fusing corpus pairs.
It should be noted that the information threshold covering the translation source sentence works as follows: in the embodiment of the present invention, when the threshold reaches 70% (that is, the ratio of covered words in the translation source sentence reaches 70%), fusion of the subsequent unselected retrieval corpora automatically stops; by setting this threshold, the preset model can dynamically select how many retrieval example sentences to fuse. Furthermore, the added-information threshold of each corpus pair is set, via the fuzzy matching algorithm, to require that the words added by a corpus pair amount to at least 10% of the remaining uncovered words in the translation source sentence; only when this proportion is reached is the information of the current corpus pair spliced in and subsequently fused.
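The two threshold checks can be expressed directly as ratios over token sets. A minimal sketch, with illustrative names and the patent's 70% and 10% defaults:

```python
def coverage_ratio(src_tokens, covered_tokens):
    """Fraction of distinct translation-source words already covered."""
    return len(set(covered_tokens) & set(src_tokens)) / len(set(src_tokens))

def keep_fusing(src_tokens, covered_tokens, stop_ratio=0.70):
    """Continue fusing only while coverage is below the 70% threshold."""
    return coverage_ratio(src_tokens, covered_tokens) < stop_ratio

def adds_enough(remaining_tokens, pair_tokens, min_ratio=0.10):
    """Does this corpus pair add at least 10% of the still-uncovered words?"""
    added = set(pair_tokens) & set(remaining_tokens)
    return len(added) / max(len(set(remaining_tokens)), 1) >= min_ratio
```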
In the embodiment of the invention, the retrieval source sentence is dynamically fused multiple times until the conditions are no longer met, at which point fusion exits. The specific principle of the dynamic fusion process is as follows:
S is used to represent all words in the source sentence to be translated, and Rj represents the jth corpus pair obtained in the initial retrieval source sentence. During the ith fusion, the words in the translation source sentence that have already been covered are removed from S (i.e., the translation source sentence). Then, the unselected (2K - i) corpus pairs are reordered, where K represents the number of first corpus pairs (or second corpus pairs) retrieved.
It can be understood that, in the embodiment of the present invention, when the translation source sentence is retrieved to obtain the first corpus pair or the second corpus pair, the number of the first corpus pair is equal to that of the second corpus pair.
It should be noted that the basis of the sorting is a fuzzy matching algorithm based on the edit distance; the edit-distance calculation includes the three basic operations of insertion, deletion and substitution. During retrieval, K first corpus pairs and K second corpus pairs are retrieved by keyword retrieval and vector retrieval respectively, and the K first corpus pairs and K second corpus pairs are randomly mixed to obtain 2K corpus pairs in total; that is, the initial retrieval source sentence comprises 2K corpus pairs. In the first step of dynamic fusion, the edit distance between each of the 2K parallel corpus pairs and the translation source sentence is calculated; the smaller the edit distance, the closer the pair, and the currently closest retrieval source sentence is selected for the operation.
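The word-level edit distance with the three operations named above can be implemented with the standard dynamic-programming recurrence. A self-contained sketch (the example tokens are illustrative):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance with insertion, deletion and
    substitution, as used by the fuzzy-matching ranking."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

d = edit_distance("no one has ever".split(), "nobody has ever".split())
```

Smaller distances rank a corpus pair closer to the translation source sentence, matching the sorting described above.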
In an embodiment of the present invention, when the translation source sentence is S = [No one has even before long with the counters], if the closest retrieval source sentence obtained by the edit-distance calculation is [No body has even before here], the covered words are [No has even before]. For the next sorting, the words remaining in the translation source sentence are S1 = [one charged with the counters]; that is, the edit distance is calculated between the remaining words S1 and the parallel initial retrieval source sentences, and the above operations are repeated until the conditions are no longer satisfied.
To sum up, in the first embodiment of the present invention, a machine translation source sentence is obtained and retrieved from a preset memory base by a keyword retrieval mode and a vector retrieval mode: multiple (e.g., 10) first corpus pairs are obtained by keyword retrieval and multiple (e.g., 10) second corpus pairs by vector retrieval, and the two sets are randomly mixed to obtain the initial retrieval source sentence of multiple (e.g., 20) parallel mixed corpus pairs. The initial retrieval source sentence is filtered through alignment and part-of-speech tagging, and the filtered initial retrieval source sentence is dynamically fused to obtain the input information; the conditions for dynamically fusing the initial retrieval source sentence are the set information threshold covering the translation source sentence and the set added-information threshold of the corpus pairs, and the input information satisfying these conditions is obtained after the dynamic fusion.
Referring to fig. 7, a second embodiment of the present invention provides a machine translation method: S200, obtaining input information generated by the input information generation method for machine translation provided by the first embodiment of the present invention; S201, inputting the input information into a preset machine translation model to obtain a corresponding translation result.
The method has the advantage that the original input information is expanded and input into the preset machine translation model to obtain the corresponding translation result, making the translation result more accurate.
Referring to fig. 8, a third embodiment of the present invention provides a method for acquiring an enhanced machine translation model: S300, acquiring input information generated by the machine translation input information generation method provided by the first embodiment of the invention; S301, executing a mask task on the acquired input information to obtain extended input information;
and S302, inputting the extended input information into a preset machine translation model to execute a translation enhancement training task so as to obtain an enhanced machine translation model.
As an optional implementation manner, the input information input into the preset machine translation model further includes a translation source sentence.
It can be understood that, in the embodiment of the present invention, the translation source sentence and the extension input information are input into the machine translation model together to execute the joint translation enhancement training task to obtain the enhanced machine translation model.
Further, referring to fig. 9, step S302 specifically includes the following steps:
and S3021, splicing the extended input information after the translation source sentence in a preset manner, and inputting them together into the preset machine translation model to execute a translation enhancement training task so as to obtain an enhanced machine translation model.
By the steps, the expanded input information is spliced behind the translation source sentence and is input into the preset machine translation model together to execute the translation enhancement training task, so that the method has the advantages of enhanced learning on the preset machine translation model and strong practicability.
As an optional implementation, the preset mode includes a direct splicing mode and a preprocessing splicing mode.
It can be understood that the manner of inputting the qualifying part into the preset model of the invention includes a direct splicing manner or a preprocessing splicing manner. The direct splicing manner directly splices the retrieved corpus behind the translation source sentence. The preprocessing splicing manner combines the preceding part-of-speech tagging and dynamic fusion into the input, adds extra weight to the information in the retrieval information that corresponds to the translation source sentence, and increases the probability that this part of the information is masked. The two manners have complementary advantages: the preprocessing manner lets the model quickly adapt to the training target, while the direct splicing manner enhances the robustness of the preset model and reduces errors that preprocessing may introduce. Both manners are used to train the machine translation model, improving the quality and readability of machine translation, and the method has strong practicability.
It should be noted that, during input, the direct splicing manner performs the masking operation in the BERT manner, preferably masking only the translation source sentence, so that the enhanced machine translation model utilizes the spliced retrieval information; the preprocessing splicing manner raises the probability that the covered part of the translation source sentence is masked while the corresponding part is retained in the retrieval information, so that the preset model quickly attends to the parts needing attention, improving the model training speed.
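The coverage-boosted masking described above can be sketched as follows. The probabilities, names and mask token are illustrative assumptions (the patent does not specify them); the point is only that covered words receive a higher masking probability:

```python
import random

def mask_tokens(tokens, covered, base_p=0.15, boost_p=0.5,
                mask_token="[MASK]", seed=0):
    """BERT-style masking sketch: words covered by the retrieval
    information are masked with a higher probability, so the model must
    recover them from the spliced retrieval text."""
    rng = random.Random(seed)
    out = []
    for t in tokens:
        p = boost_p if t in covered else base_p
        out.append(mask_token if rng.random() < p else t)
    return out

# With extreme probabilities the behavior is deterministic:
masked = mask_tokens(["a", "b", "c"], {"b"}, base_p=0.0, boost_p=1.0)
```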
As an alternative embodiment, the loss function of the first joint translation enhancement training (the original formula image is unavailable; the form below is reconstructed from the surrounding definitions) is:

L1(θ) = -log P(Y | M(X, R); θ)

where θ represents the trained machine translation model, X and Y represent the translation source sentence and the corresponding translation respectively, R represents a retrieval corpus pair, and M(·) represents the masking operation.
It should be noted that when masked words are spliced in the direct splicing manner, the corresponding part may not exist in the retrieved information, but direct splicing avoids the errors introduced by selecting matching words through word coverage, which is deficient in semantics; under direct splicing, the machine translation model learns the alignment information automatically. The preprocessing splicing manner is more direct: the word-covered part of the translation source sentence has a higher probability of being masked while the corresponding part is retained in the retrieval information, so the machine translation model quickly attends to the important parts, improving the training speed of the machine translation model. The advantages of the two are complementary: the preprocessing manner lets the machine translation model quickly adapt to the training target, and the direct splicing manner then enhances the robustness of the model and reduces errors that preprocessing may introduce.
As an alternative embodiment, the loss function of the second joint translation enhancement training (the original formula image is unavailable; the form below is reconstructed from the surrounding definitions) is:

L2(θ) = -log P(Y | X, R; θ) - log P(Y | X; θ)

where θ represents the trained machine translation model, X and Y represent the translation source sentence and the corresponding translation respectively, and R represents a retrieval corpus pair.
It should be noted that, in the actual use of machine translation, the translation source sentence provided by a user sometimes cannot find a very similar retrieval source sentence in the preset memory base, which causes a discrepancy between the actual translation process and the training process of the machine translation model. Therefore, in the loss function of the joint training, training pairs without spliced translation memory are added to the training data and jointly trained with the enhanced, spliced training pairs; this further improves the robustness of the model, further enhances the quality of machine translation, improves its readability, and has strong practicability.
It should be noted that both the direct splicing manner and the preprocessing splicing manner are used to enhance the representation of the encoder in the machine translation model, so that a better representation can be input to the decoder side, and the enhanced machine translation model understands and utilizes the input.
Further, the mask training enhances the encoder part's understanding of the input information in the machine translation model, but the decoder part does not receive sufficient additional training. To enhance the robustness of the machine translation model and improve the decoder's understanding and utilization of the encoder's information, the decoder is strengthened again. In strengthening the decoder, a training method of contrastive learning is adopted; the key is how to construct examples, including the construction of positive and negative examples. In the embodiment of the invention, the corresponding target sentence in the original corpus pair is taken as the positive example, and negative examples are constructed by the following two methods: (1) using the traditional negative-example construction method of contrastive learning; (2) using the alignment information to guide the removal of words.
It should be noted that method (1), the traditional contrastive-learning negative-example construction, randomly removes words from the corresponding target sentence in the corpus pair to obtain a random replacement; that is, in the embodiment of the invention, any word in the translation target sentence is removed, and the translation target sentence with words removed is combined with the translation source sentence into a negative example. Method (2) uses the alignment information to guide the removal of words, obtaining an aligned replacement; that is, in the embodiment of the invention, a word in the translation target sentence that is aligned with the retrieval target sentence is removed, and a negative example is then constructed. In the original corpus pair, for the translation source sentence, the words of the final retrieval target sentence are mapped through word coverage and spliced into the input. From the covered words and the alignment information, the corresponding word positions in the translation target sentence can be obtained at the same time; the corresponding words are randomly removed and combined with the translation source sentence to form a negative example.
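The two negative-example constructions can be sketched as simple word-removal functions. All names are illustrative, and the example target tokens are hypothetical:

```python
import random

def random_negative(target_tokens, seed=0):
    """Method (1): traditional contrastive negative, drop one random word."""
    rng = random.Random(seed)
    i = rng.randrange(len(target_tokens))
    return target_tokens[:i] + target_tokens[i + 1:]

def aligned_negative(target_tokens, aligned_words, seed=0):
    """Method (2): alignment-guided negative, drop one word that is
    aligned with the retrieval target sentence (a word the model should
    rely on the spliced retrieval information to produce)."""
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(target_tokens) if t in aligned_words]
    if not candidates:
        return random_negative(target_tokens, seed)
    i = rng.choice(candidates)
    return target_tokens[:i] + target_tokens[i + 1:]

full = ["领导", "可以", "限制", "时间"]
neg = aligned_negative(full, {"领导", "时间"})
```

Each negative target sentence is then paired with the translation source sentence to form a negative example for the contrastive loss.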
As an alternative embodiment, the loss function of the contrastive training (the original formula image is unavailable; the form below is reconstructed from the surrounding definitions) is:

L3(θ) = Σ over Y' in {Y_rand, Y_align} of max(0, η - (log P(Y | X; θ) - log P(Y' | X; θ)))

where θ represents the trained machine translation model and (X, Y) represents the translation source sentence and the translation target sentence; Y_rand represents a negative example constructed in the random manner, and Y_align represents a negative example constructed under the guidance of the alignment information. The distance between the negative examples and the positive example is kept above η, so that the whole training model can distinguish the negative examples, enhancing the robustness of the model.
In an embodiment of the present invention, referring to fig. 6 and fig. 10, the translation source sentence is "The leader may limit the time to be allowed for explanations", and the translation target sentence is the Chinese for "The leader may limit the time for explanations of the vote". The spliced input information, "The leader may limit the time to be allowed for explanations <sep> <noise> leader <noise> time limit <noise>", is taken as an example input to the encoder end, where <sep> serves as a separator and the <noise> characters replace the filtered information.
It should be noted that the original input is the translation target sentence, and the decoder decodes to obtain the original "the leader can limit interpretation vote ...". For random replacement, a word in the translation target sentence is removed at random: in the present embodiment "interpretation" is deleted, giving the random replacement "the leader can limit vote time ...". For aligned replacement, a word aligned with the retrieval target sentence is removed from the translation target sentence: after the alignment operation on the translation target sentence, "leader" and "time limit" remain, and in the present embodiment "leader" is deleted, giving the aligned replacement "can limit interpretation vote time ...".
To sum up, in the third embodiment of the present invention, the input information is generated by the input information generation method for machine translation of the first embodiment; the input information can be understood as comprising multiple retrieval target sentences obtained by dynamically fusing, multiple times, the initial retrieval source sentence (the corpus pairs obtained by randomly mixing multiple first corpus pairs and multiple second corpus pairs), the retrieval target sentences being spliced in the order obtained. The obtained input information then undergoes a mask task to obtain the extended input information, which is spliced to the translation source sentence in the direct splicing manner or the preprocessing manner and input into the preset machine translation model to perform the translation enhancement training task, obtaining the enhanced machine translation model.
Referring to fig. 10, a fourth embodiment of the invention provides a machine translation input information generating system 4, which includes the following modules:
the acquisition module 10: obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
the filtering module 20: filtering the initial retrieval source sentence according to a preset condition to obtain a filtered retrieval source sentence;
the processing module 30: dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentences to obtain covered translation source sentences;
the generation module 40: and the translation source sentence after the covering is used as input information.
It can be understood that the modules of the machine translation input information generating system 4, when operated, use the machine translation input information generating method provided in the first embodiment; therefore, integrating or configuring the obtaining module 10, the filtering module 20, the processing module 30 and the generating module 40 into different hardware to produce functions similar to the effects achieved by the present invention also falls within the scope of the present invention.
A fifth embodiment of the present invention provides a computer-readable storage medium storing a computer program, which when executed implements the method for generating machine-translated input information according to any one of the above.
In the embodiments provided herein, it should be understood that "B corresponding to a" means that B is associated with a from which B can be determined. It should also be understood, however, that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above detailed descriptions of the method for generating input information for machine translation, the method for obtaining a machine translation model, and the computer-readable storage medium disclosed in the embodiments of the present invention have been provided, and specific examples are applied herein to explain the principles and embodiments of the present invention, and the descriptions of the above embodiments are only used to help understanding the method and the core ideas of the present invention; meanwhile, for the persons skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present description should not be construed as a limitation to the present invention, and any modification, equivalent replacement, and improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method for generating machine translation input information, characterized in that the method comprises the following steps:
obtaining a machine translation source sentence, and performing preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence;
filtering the initial retrieval source sentence according to preset conditions to obtain a filtered retrieval source sentence, wherein the preset conditions comprise alignment and part of speech tagging;
dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentence to obtain a covered translation source sentence; and
using the covered translation source sentence as the input information;
wherein obtaining a machine translation source sentence and performing a preliminary retrieval from a preset memory base based on at least two different retrieval modes to obtain an initial retrieval source sentence specifically comprises: obtaining a machine translation source sentence, and performing a preliminary retrieval from the preset memory base based on at least two different retrieval modes to obtain at least two corpus pairs; and mixing the at least two corpus pairs to obtain the initial retrieval source sentence, wherein the retrieval modes comprise keyword retrieval and vector retrieval;
the method for dynamically fusing the filtered retrieval source sentences to perform word coverage on the translation source sentences to obtain the covered translation source sentences specifically comprises the following steps: when the retrieval source sentences are dynamically fused for the first time, the size of the editing distance between each corpus pair and the translation source sentences is calculated through a fuzzy matching algorithm of the editing distance, and the corpus pair in the retrieval source sentences with the most front sequence is selected for fusion during the first fusion; during the second fusion, removing the words in the translation source sentence which are already covered by the previous retrieval source sentence, using the remaining words and the corpus pairs in the remaining initial retrieval source sentence to calculate the editing distance, sequencing, continuously selecting the closest corpus pair in the current retrieval source sentence to perform dynamic fusion, repeating the steps until the conditions of the set information threshold value covering the translation source sentence and the set information threshold value of the corpus pair are not met, and exiting; and after the dynamic fusion is completed, performing word coverage calculation on the translation source sentence to obtain a covered translation source sentence.
2. The method for generating machine-translated input information of claim 1, wherein each dynamic fusion is a single process of word coverage calculation, alignment, part-of-speech tagging, and sentence splicing.
3. The method for generating machine-translated input information of claim 1, wherein filtering the initial retrieval source sentence according to the predetermined conditions to obtain the filtered retrieval source sentence specifically comprises:
performing preliminary filtering on the initial retrieval source sentence through alignment;
and performing part-of-speech tagging on the translation source sentence and the preliminarily filtered retrieval source sentence, and retaining the part of the retrieval source sentence whose part of speech is the same as that of the translation source sentence.
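One plausible reading of the part-of-speech filter in claim 3 is that a retrieved word survives only if the same word occurs in the translation source sentence with the same tag. A minimal sketch under that assumption, where `pos_tag` is a hypothetical tagger callable (the patent does not specify one):

```python
def filter_by_pos(source_tokens, retrieved_tokens, pos_tag):
    # pos_tag: callable mapping a token list to a list of POS tags (assumed).
    # Keep only retrieved tokens whose (token, tag) pair also appears in the
    # translation source sentence.
    src_pairs = set(zip(source_tokens, pos_tag(source_tokens)))
    return [tok for tok, tag in zip(retrieved_tokens, pos_tag(retrieved_tokens))
            if (tok, tag) in src_pairs]
```

In practice the tagger could be any off-the-shelf POS model; the sketch only fixes the filtering rule, not the tagger.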
4. A machine translation method, characterized in that it comprises: obtaining the input information generated by the method for generating machine-translated input information according to any one of claims 1 to 3, and inputting the input information into a preset machine translation model to obtain a corresponding translation result.
5. A method for obtaining a machine translation model, characterized in that it comprises the following steps: obtaining the input information generated by the method for generating machine-translated input information according to any one of claims 1 to 3, and performing a masking task on the obtained input information to obtain extended input information;
and inputting the extended input information into a preset machine translation model to execute a translation enhancement training task so as to obtain an enhanced machine translation model.
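The masking task of claim 5 is not specified beyond its name; a common realization, shown here purely as an assumed sketch (the mask ratio and `[MASK]` symbol are illustrative choices, not drawn from the patent), replaces a random fraction of tokens before the enhancement training:

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]"):
    # Randomly replace a fraction of the tokens with a mask symbol,
    # producing the extended input for the translation-enhancement task.
    out = list(tokens)
    n = max(1, int(len(out) * mask_ratio))
    for i in random.sample(range(len(out)), n):
        out[i] = mask_token
    return out
```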
6. The method for obtaining a machine translation model according to claim 5, wherein the input information input into the preset machine translation model further comprises the translation source sentence.
7. The method for obtaining a machine translation model according to claim 5, wherein inputting the extended input information into a preset machine translation model to perform a translation enhancement training task to obtain an enhanced machine translation model specifically comprises:
splicing the extended input information onto the translation source sentence in a preset pattern, and inputting them together into the preset machine translation model to perform the translation enhancement training task so as to obtain the enhanced machine translation model.
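The "preset pattern" of claim 7 is left open by the patent; a minimal assumed sketch is plain concatenation with a separator token (the `[SEP]` symbol is an illustrative choice):

```python
def build_model_input(source_sentence, extended_info, sep="[SEP]"):
    # Splice the extended (retrieved and masked) information onto the
    # translation source sentence before feeding the translation model.
    return f"{source_sentence} {sep} {extended_info}"
```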
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed, implements the method for generating machine-translated input information according to any one of claims 1 to 3.
CN202210723325.7A 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model Active CN114792101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210723325.7A CN114792101B (en) 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model


Publications (2)

Publication Number Publication Date
CN114792101A CN114792101A (en) 2022-07-26
CN114792101B true CN114792101B (en) 2022-09-23

Family

ID=82462985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210723325.7A Active CN114792101B (en) 2022-06-24 2022-06-24 Method for generating and translating input information of machine translation and acquiring machine model

Country Status (1)

Country Link
CN (1) CN114792101B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392189B (en) * 2022-10-28 2023-04-07 北京砍石高科技有限公司 Method and device for generating multi-language mixed corpus and training method and device
CN116992894B (en) * 2023-09-26 2024-01-16 北京澜舟科技有限公司 Training method of machine translation model and computer readable storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
KR100453227B1 (en) * 2001-12-28 2004-10-15 한국전자통신연구원 Similar sentence retrieval method for translation aid
JP2006004366A (en) * 2004-06-21 2006-01-05 Advanced Telecommunication Research Institute International Machine translation system and computer program for it
CN107329961A (en) * 2017-07-03 2017-11-07 西安市邦尼翻译有限公司 A kind of method of cloud translation memory library Fast incremental formula fuzzy matching

Non-Patent Citations (2)

Title
Training Data Selection and Optimization for Statistical Machine Translation Based on Information Retrieval Methods; Huang Jin et al.; Journal of Chinese Information Processing; 2008-03-15 (No. 02); pp. 42-48 *
A Bilingual Sentence Pair Selection Method Combining Translation Model and Language Model; Yao Liang et al.; Journal of Chinese Information Processing; 2016-09-15 (No. 05); pp. 149-156 *

Also Published As

Publication number Publication date
CN114792101A (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN114792101B (en) Method for generating and translating input information of machine translation and acquiring machine model
CN108763510B (en) Intention recognition method, device, equipment and storage medium
JP6309644B2 (en) Method, system, and storage medium for realizing smart question answer
EP0560587B1 (en) Sign language translation system and method
CN108681574B (en) Text abstract-based non-fact question-answer selection method and system
Rayner et al. Putting linguistics into speech recognition: The regulus grammar compiler
CN110276071B (en) Text matching method and device, computer equipment and storage medium
EP1508861A1 (en) Method for synthesising a self-learning system for knowledge acquisition for text-retrieval systems
CN111783455B (en) Training method and device of text generation model, and text generation method and device
WO2021139266A1 (en) Fine-tuning method and apparatus for external knowledge-fusing bert model, and computer device
CN108846138B (en) Question classification model construction method, device and medium fusing answer information
CN111984766A (en) Missing semantic completion method and device
CN115048944B (en) Open domain dialogue reply method and system based on theme enhancement
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN108536735B (en) Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN114428850B (en) Text retrieval matching method and system
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN108491399B (en) Chinese-English machine translation method based on context iterative analysis
CN114997181A (en) Intelligent question-answering method and system based on user feedback correction
CN113742446A (en) Knowledge graph question-answering method and system based on path sorting
CN113779190B (en) Event causal relationship identification method, device, electronic equipment and storage medium
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
CN111914560B (en) Text inclusion relation recognition method, device, equipment and storage medium
Xie et al. Focusing attention network for answer ranking
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant