CN113392657A - Training sample enhancement method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113392657A
CN113392657A (application CN202110678838.6A)
Authority
CN
China
Prior art keywords
sentence
language sentence
source language
source
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110678838.6A
Other languages
Chinese (zh)
Inventor
张轩玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202110678838.6A priority Critical patent/CN113392657A/en
Publication of CN113392657A publication Critical patent/CN113392657A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a training sample enhancement method, a training sample enhancement device, computer equipment and a storage medium. The method comprises the following steps: acquiring a first source language sentence and an associated sentence of the first source language sentence; splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence; determining a second target language sentence according to the second source language sentence; and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample. In the application, the second source language sentence is real data rather than pseudo data, so that the negative influence of noise generated by the pseudo data on the machine translation model can be reduced or avoided.

Description

Training sample enhancement method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of machine translation technologies, and in particular, to a method and an apparatus for enhancing a training sample, a method and an apparatus for training a machine translation model, a computer device, and a storage medium.
Background
Currently, machine translation uses a translation model, obtained by training on samples, to translate a source language into a target language. Training such a translation model, however, requires a large number of training samples, and these samples, namely sentence pairs of the source language and the target language, are costly to acquire; for low-resource languages in particular, professional translators are needed for annotation. Technicians therefore prefer to extend the training samples through data enhancement to reduce the cost.
At present, data enhancement methods include word replacement, back-translation, and the like. In word replacement, a word is swapped for a synonym, for example replacing "very" in "this action is very cool". This approach requires maintaining a completely correct synonym table, because once the table contains an error, the replaced sentence will have a different meaning from the original sentence. In back-translation, a sentence such as "He looks good" is first translated into English and then translated back into Chinese; the meaning of the back-translated sentence is not necessarily the same as that of the original sentence (in the original Chinese, the round trip yields a different wording). It can be seen that while this approach increases the number of training samples, it also introduces erroneous noise. In fact, the data added by word replacement and back-translation are pseudo data, so a great deal of noise is introduced as the number of training samples grows, which increases the difficulty of machine learning; moreover, the replacement rate and noise rate need to be finely tuned, so the workload is large and the flexibility is low.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present application provides a training sample enhancement method, apparatus, computer device and storage medium.
In a first aspect, the present application provides a training sample enhancement method, including:
acquiring a first source language sentence and an associated sentence of the first source language sentence;
splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
determining a second target language sentence according to the second source language sentence;
and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample.
In a second aspect, the present application provides a machine translation model training method, including:
obtaining a training sample, wherein the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
expanding the training samples by using the training sample enhancement method provided in the first aspect;
and performing model training according to the original training samples and the training samples added by the expansion.
In a third aspect, the present application provides a training sample enhancing apparatus, comprising:
the sentence acquisition module is used for acquiring a first source language sentence and an associated sentence of the first source language sentence;
the first determining module is used for splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
a second determining module, configured to determine a second target language sentence according to the second source language sentence;
and the sentence pair forming module is used for forming a sentence pair by the second source language sentence and the second target language sentence and taking the sentence pair as a training sample.
In a fourth aspect, the present application provides a machine translation model training apparatus, including:
the system comprises a sample acquisition device, a processing device and a processing device, wherein the sample acquisition device is used for acquiring a training sample, the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
the training sample enhancement device provided by the third aspect is used for expanding the training sample;
and the model training device is used for carrying out model training according to the training samples and the extended training samples.
In a fifth aspect, the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a sixth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.
In the application, at least one sentence in the associated sentences is spliced with the first source language sentence to obtain the second source language sentence, and then a new data pair is formed based on the second source language sentence, wherein the second source language sentence is real data and is not pseudo data, so that negative effects brought to a machine translation model by noise generated by the pseudo data can be reduced or avoided.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a training sample enhancement method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a training sample enhancing apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a first aspect, an embodiment of the present application provides a training sample enhancement method, as shown in fig. 1, the method includes the following steps:
s110, acquiring a first source language sentence and an associated sentence of the first source language sentence;
the method and the device can be applied to a plurality of scenes needing to translate the text, such as a long text scene of the lines, a short text scene of the lines, a conversation scene and the like.
In particular implementations, the first source language sentence can be obtained from a set of training samples. In practice, a plurality of sentence pairs may be included in the training sample set, each sentence pair including a source language sentence and a corresponding target language sentence. After a new sentence pair is obtained by executing the scheme, the new sentence pair is also added to the training sample set, so that the training sample set is expanded.
It will be appreciated that the training sample set is used for model training to obtain a machine translation model that can translate one language into another, for example Chinese into English, where Chinese is the source language and English is the target language. A sentence pair in the training sample set then consists, for example, of a Chinese sentence meaning "She is a beautiful girl" together with the English sentence "She is a beautiful girl".
It is understood that the specific meaning of the above associated sentences may differ across application scenarios, but in any scenario, an associated sentence generally refers to a sentence that helps interpret the sentence to be translated. For example, in a scenario of translating dialogue-line text, the associated sentence of the first source language sentence may be its preceding sentence, its following sentence, or both the preceding and the following sentence. Of course, besides context sentences, the associated sentences may also be related annotation text of the sentence to be translated, and the like.
In a specific implementation, the number of associated sentences may be chosen as needed. For example, if the associated sentences are context sentences and the number n is set to 1, then the preceding sentence a and the following sentence c of the first source language sentence b in the source text are selected as context sentences. If n is 2, the two preceding sentences a and b and the two following sentences d and e of the first source language sentence c in the source text are selected as context sentences.
The source text may be of various types, such as a movie, a television series, an article, or a book. Suppose a machine translation model is mainly used to translate the lines in television series; then its training sample set needs to contain the lines of multiple series. To distinguish different series, a source identifier is set for each one, with different series corresponding to different source identifiers, so that all lines in the training sample set that come from the same series correspond to one source identifier. Naturally, the lines in the training sample set comprise both the source language sentences and the target language sentences. To distinguish the sentence pairs that come from the same source text, and to make it convenient to extract context sentences as associated sentences, a position identifier may be set for each source language sentence from the same source text, indicating the position of that sentence in the source text.
That is, the method provided by the present application may further include:
s100, setting a position index for a source language sentence with the same source identifier, wherein the position index is a position number of the source language sentence in a source text corresponding to the source identifier;
based on step S100, the process of obtaining the associated sentence of the first source language sentence in step S110 may include: selecting the associated text of the first source language sentence from the corresponding source text according to the position index of the first source language sentence. For example, if the position index of the first source language sentence points to sentence b, its context sentences may include the preceding sentence a and the following sentence c. By setting the position index, the context sentences of the first source language sentence can thus be found as associated text more simply and conveniently.
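The position-index lookup described above can be expressed in a few lines of Python. This is a minimal illustrative sketch, not the patent's implementation; the names SOURCE_TEXTS and get_context are assumptions introduced here.

```python
# Minimal sketch of steps S100/S110: the sentences of each source text are
# stored in order under a source identifier, so the position index is simply
# a sentence's offset in that list. get_context returns up to n preceding
# and n following sentences as the associated (context) sentences.
# SOURCE_TEXTS and get_context are illustrative names, not from the patent.

SOURCE_TEXTS = {
    "drama_01": ["a", "b", "c", "d", "e"],  # lines in original order
}

def get_context(source_id, position, n=1):
    """Return (preceding, following) context sentences for the sentence at `position`."""
    lines = SOURCE_TEXTS[source_id]
    before = lines[max(0, position - n):position]
    after = lines[position + 1:position + 1 + n]
    return before, after

before, after = get_context("drama_01", 1, n=1)  # sentence "b" -> (["a"], ["c"])
```

Slicing clamps at the text boundaries, so sentences near the beginning or end of a source text simply get fewer context sentences.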
S120, splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
for example, assume that context sentences are taken as associated sentences, the number n is set to 1, the first source language sentence is b, and the corresponding first target language sentence is B; the preceding sentence of sentence b is a, whose corresponding target language sentence is A; and the following sentence of b is c, whose corresponding target language sentence is C. At least one of the context sentences a and c is spliced with the first source language sentence b to obtain a second source language sentence b′, so b′ has three possibilities: a+b, b+c, a+b+c.
For another example, the number n is 2, the first source language sentence is c, and the corresponding first target language sentence is C; the two preceding sentences of c are a and b, whose corresponding target language sentences are A and B; and the two following sentences of c are d and e, whose corresponding target language sentences are D and E. At least one of the context sentences a, b, d, e is spliced with the first source language sentence c to obtain a second source language sentence c′, so c′ has 15 possibilities: (1) any one of the context sentences a, b, d, e is spliced with c, giving four second source language sentences: a+c, b+c, c+d, c+e; (2) any two of the context sentences are spliced with c, giving six: a+b+c, b+c+d, c+d+e, a+c+d, a+c+e, b+c+e; (3) any three of the context sentences are spliced with c, giving four: a+b+c+d, a+b+c+e, b+c+d+e, a+c+d+e; (4) all four context sentences a, b, d, e are spliced with c, giving one: a+b+c+d+e. It can be seen that when n is 2, there are 15 second source language sentences c′.
Here the first source language sentence and the first target language sentence correspond to each other, coming from one sentence pair. By analysis, the larger n is, the larger the number of second source language sentences; the specific number is 2^(2n) - 1.
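The enumeration of second source language sentences, and the count of 2^(2n) - 1 of them, can be checked with a short Python sketch. This is illustrative only; `splice_variants` is a name introduced here, and the sketch assumes a splice keeps the sentences in their original order in the source text.

```python
from itertools import combinations

def splice_variants(before, sentence, after):
    """All second source language sentences: splice any non-empty subset of
    the context sentences with the first source sentence, preserving the
    original sentence order."""
    ctx = before + after
    order = before + [sentence] + after
    variants = []
    for k in range(1, len(ctx) + 1):
        for subset in combinations(ctx, k):
            chosen = set(subset) | {sentence}
            variants.append("+".join(s for s in order if s in chosen))
    return variants

v1 = splice_variants(["a"], "b", ["c"])            # n = 1 case
v2 = splice_variants(["a", "b"], "c", ["d", "e"])  # n = 2 case
```

For n = 1 this yields the 3 variants a+b, b+c, a+b+c; for n = 2 it yields the 15 variants listed above, matching 2^(2n) - 1 non-empty subsets of the 2n context sentences.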
It can be seen that the second source language sentence obtained by this step S120 is real data, not pseudo data.
S130, determining a second target language sentence according to the second source language sentence;
in specific implementation, there are various ways to determine the second target language sentence in step S130, and two of them are described below:
(1) Taking the first target language sentence corresponding to the first source language sentence as the second target language sentence.
For example, if the number n is 1, the first source language sentence is b, and the corresponding first target language sentence is B, then the second target language sentence B′ is the first target language sentence B. In this case, the correspondence between the second source language sentence and the second target language sentence is: a+b → B, b+c → B, a+b+c → B, and each second source language sentence forms a sentence pair with its corresponding second target language sentence.
For another example, if the number n is 2, the first source language sentence is c, and the corresponding first target language sentence is C, then the second target language sentence C′ is the first target language sentence C; that is, the 15 second source language sentences c′ listed above all correspond to C.
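Way (1) amounts to pairing every spliced variant with the unchanged first target language sentence. A one-line Python sketch (the function name is an assumption introduced here):

```python
def pairs_way1(variants, first_target):
    """Way (1): every spliced source variant keeps the original target sentence."""
    return [(src, first_target) for src in variants]

pairs = pairs_way1(["a+b", "b+c", "a+b+c"], "B")  # the n = 1 case above
```

Every spliced source therefore maps to the same target B, mirroring the a+b → B, b+c → B, a+b+c → B correspondences in the text.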
(2) Determining the second target language sentence according to the first source language sentence and the sentences other than the first source language sentence in the second source language sentence.
That is, the second target language sentence is selected with reference not only to the first target language sentence but also to at least one of the associated sentences according to which the second source language sentence was determined in step S120.
For example, if the number n is 1, the first source language sentence is b, and the associated sentence selected when determining the second source language sentence is the preceding sentence a, then the second target language sentence is determined according to sentences a and b. If the associated sentences selected when determining the second source language sentence are the context sentences a and c, then the second target language sentence is determined according to sentences a, b, and c.
In a specific implementation, way (2) may include: selecting a sentence or sentence combination adjacent to the first source language sentence from the sentences other than the first source language sentence in the second source language sentence, and splicing the target language sentence corresponding to the selected sentence or sentence combination with the first target language sentence corresponding to the first source language sentence, to obtain the second target language sentence.
It is to be understood that if, among the sentences other than the first source language sentence in the second source language sentence, several mutually adjacent sentences include one sentence adjacent to the first source language sentence, those sentences constitute a sentence combination.
For example, if n is 1, the first source language sentence is b, and the second source language sentence is b′ = a+b, where sentence a is adjacent to b, then the second target language sentence B′ is obtained by splicing A and B, that is, B′ = A+B; the correspondence between the second source language sentence and the second target language sentence is: b′ = a+b → B′ = A+B. Likewise, for b′ = b+c, B′ is obtained by splicing B and C, that is, B′ = B+C, and the correspondence is: b′ = b+c → B′ = B+C.
It is understood that one or more sentences adjacent to the first source language sentence may be selected from the sentences other than the first source language sentence in the second source language sentence. When one sentence is selected, for example: b′ = a+b+c → B′ = B+C, or b′ = a+b+c → B′ = A+B; when several sentences are selected, for example: b′ = a+b+c → B′ = A+B+C.
For another example, when n is 2, the first source language sentence is c and the associated sentences are the context sentences a, b, d, e; the process of determining the second target language sentence is similar to the case where n is 1 and the first source language sentence is b. For instance, assume a, b, d, e are all selected when forming the second source language sentence by splicing, that is, c′ includes a, b, c, d, and e. Among sentences a, b, d, and e, sentences b and d are adjacent to sentence c, so the corresponding second target language sentence may be B+C or C+D; one correspondence between the second source language sentence and the second target language sentence is then: c′ = a+b+c+d+e → C′ = C+D.
Further, a and b are adjacent sentences and b is adjacent to c, so sentences a and b form a sentence combination, and the correspondence between the second source language sentence and the second target language sentence is: c′ = a+b+c+d+e → C′ = A+B+C. Likewise, d and e are adjacent sentences and d is adjacent to c, so sentences d and e form a sentence combination, and the correspondence is: c′ = a+b+c+d+e → C′ = C+D+E.
For the full sentence combination, there is also the correspondence: c′ = a+b+c+d+e → C′ = A+B+C+D+E. This special case is, in fact: splicing the target language sentences corresponding to all sentences other than the first source language sentence in the second source language sentence with the first target language sentence, to obtain the second target language sentence. Under this special approach, the 15 second source language sentences c′ above correspond to second target language sentences as follows: a+c → A+C, b+c → B+C, c+d → C+D, c+e → C+E; a+b+c → A+B+C, b+c+d → B+C+D, c+d+e → C+D+E, a+c+d → A+C+D, a+c+e → A+C+E, b+c+e → B+C+E; a+b+c+d → A+B+C+D, a+b+c+e → A+B+C+E, b+c+d+e → B+C+D+E, a+c+d+e → A+C+D+E; a+b+c+d+e → A+B+C+D+E.
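The special case just described, in which every source sentence in the spliced variant contributes its target sentence, can be sketched as follows (illustrative only; `target_for_variant` and `src2tgt` are names introduced here, and the sketch assumes each source sentence has a known target sentence):

```python
def target_for_variant(variant, src2tgt):
    """Special case of way (2): splice the target sentence of every source
    sentence appearing in the spliced variant, in order."""
    return "+".join(src2tgt[s] for s in variant.split("+"))

# lowercase letters stand for source sentences, uppercase for their targets
src2tgt = {"a": "A", "b": "B", "c": "C", "d": "D", "e": "E"}
t = target_for_variant("a+c+d", src2tgt)  # "A+C+D"
```

Applying this to all 15 variants reproduces the correspondence table above.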
It can be seen that the second target language statements obtained by either the (1) th way or the (2) th way are real data, not pseudo data.
S140, forming a sentence pair by the second source language sentence and the second target language sentence, and taking the sentence pair as a training sample.
For example, when n is 1 and the first source language sentence is b, sample enhancement yields 8 new sentence pairs: a+b → B, b+c → B, a+b+c → B, a+b → A+B, b+c → B+C, a+b+c → B+C, a+b+c → A+B, a+b+c → A+B+C. These 8 new sentence pairs are added to the training sample set, expanding it. Naturally, when n is 2, the number of new sentence pairs is even larger.
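The 8 sentence pairs for n = 1 can be reproduced with a short sketch that combines way (1) with one reading of way (2), in which the second target sentence is spliced from any contiguous block of the variant's sentences containing the first source sentence. The function name and this particular reading are assumptions for illustration, not the patent's implementation.

```python
def n1_enhanced_pairs(a, b, c, src2tgt):
    """Sketch of step S140 for n = 1: the three spliced variants of sentence b,
    each paired with targets from way (1) and way (2)."""
    variants = ["+".join(v) for v in ([a, b], [b, c], [a, b, c])]
    pairs = []
    for v in variants:
        parts = v.split("+")
        pairs.append((v, src2tgt[b]))  # way (1): keep the original target
        i = parts.index(b)
        # way (2): splice targets over contiguous blocks of parts containing b
        for lo in range(i + 1):
            for hi in range(i, len(parts)):
                if lo == i and hi == i:
                    continue  # the block {b} alone would duplicate way (1)
                pairs.append((v, "+".join(src2tgt[p] for p in parts[lo:hi + 1])))
    return pairs

pairs = n1_enhanced_pairs("a", "b", "c", {"a": "A", "b": "B", "c": "C"})
```

Running this yields exactly the 8 correspondences listed above, which can then be appended to the training sample set.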
It can be understood that the method provided by the application is suitable for many kinds of sentences, and is especially suitable for dialogue-line sentences, which have the following characteristics: lines vary in length, a single line is often not complete on its own, and the context of a line bears a definite relationship to it, so using context sentences as associated sentences further expands the training samples.
For example, suppose the source language is Chinese and the target language is Malay. First, model training is performed with a training sample set that originally includes 1000 training samples, yielding a first model; then, after the original training sample set is expanded by the training sample enhancement method provided by the application, model training with the expanded set yields a second model. Both models are then evaluated with the BLEU metric. BLEU stands for Bilingual Evaluation Understudy, an automatic substitute for human evaluation of translation results; the higher the value, the higher the translation quality of the model. By calculation, the BLEU of the first model is 19.11 and that of the second model is 19.99. The BLEU of the second model is thus improved, showing that its translation quality is higher than that of the first model.
According to the training sample enhancement method, at least one sentence in the associated sentences is spliced with the first source language sentence to obtain the second source language sentence, a new data pair is formed based on the second source language sentence, the second source language sentence is real data and is not pseudo data, and therefore negative effects of noise generated by the pseudo data on a machine translation model can be reduced or avoided.
In a second aspect, the present application provides a machine translation model training method, specifically including:
s210, obtaining a training sample, wherein the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
s220, expanding the training sample by adopting the training sample enhancement method provided by the first aspect;
and S230, performing model training according to the training samples and the extended training samples.
Of course, during the training process in S230, it is generally necessary to verify text correctness, determine whether a convergence condition is met, and so on; when the convergence condition is met, the model training is complete.
In a third aspect, the present application provides a training sample enhancing apparatus, as shown in fig. 2, the apparatus 100 includes the following modules:
a sentence obtaining module 110, configured to obtain a first source language sentence and an associated sentence of the first source language sentence;
a first determining module 120, configured to splice at least one of the associated sentences with the first source language sentence to obtain a second source language sentence;
a second determining module 130, configured to determine a second target language sentence according to the second source language sentence;
a sentence pair forming module 140, configured to form a sentence pair from the second source language sentence and the second target language sentence, and use the sentence pair as a training sample.
It can be understood that, for the explanation, examples, and beneficial effects of the training sample enhancement device provided in the embodiment of the present application, reference may be made to corresponding parts in the first aspect, and details are not described here.
In a fourth aspect, the present application provides a machine translation model training apparatus, specifically including:
the system comprises a sample acquisition device, a processing device and a processing device, wherein the sample acquisition device is used for acquiring a training sample, the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
the training sample enhancing apparatus 100 provided in the third aspect is configured to augment the training sample;
and the model training device is used for carrying out model training according to the training samples and the extended training samples.
In a fifth aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the training sample enhancement method provided in the first aspect or the machine translation model training method provided in the second aspect. For example, the following steps in the method provided by the first aspect may be implemented: acquiring a first source language sentence and an associated sentence of the first source language sentence; splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence; determining a second target language sentence according to the second source language sentence; and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample.
FIG. 3 is a diagram illustrating an internal structure of a computer device in one embodiment. As shown in fig. 3, the computer apparatus includes a processor, a memory, a network interface, an input device, a display screen, and the like, which are connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the training sample enhancement method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a training sample enhancement method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in FIG. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, the training sample enhancement apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device such as that shown in FIG. 3. The memory of the computer device may store the various program modules constituting the training sample enhancement apparatus, such as the sentence acquisition module 110, the first determination module 120, the second determination module 130, and the sentence pair forming module 140 shown in FIG. 2. As another example, the stored program modules may include the sample acquisition device, the training sample enhancement device, and the model training device; these devices are likewise, in effect, program modules. The program modules constitute a computer program that causes the processor to perform the steps of the training sample enhancement method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 3 may perform, by the sentence acquisition module 110 in the training sample enhancement apparatus shown in FIG. 2, the acquiring of a first source language sentence and an associated sentence of the first source language sentence. The computer device may perform, by the first determination module 120, the splicing of at least one of the associated sentences with the first source language sentence to obtain a second source language sentence. The computer device may perform, by the second determination module 130, the determining of a second target language sentence from the second source language sentence. The computer device may perform, by the sentence pair forming module 140, the forming of the second source language sentence and the second target language sentence into a sentence pair, and use the sentence pair as a training sample.
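The division of labor among the four modules can be sketched as follows; the class and method names are hypothetical illustrations, not taken from FIG. 2:

```python
# Hypothetical sketch of the four program modules (110-140) described above;
# the class, method names, and corpus layout are illustrative assumptions.

class TrainingSampleEnhancer:
    def __init__(self, corpus):
        # corpus maps each first source sentence to its associated sentences
        # and its first target sentence.
        self.corpus = corpus

    def acquire_sentences(self, first_source):  # sentence acquisition module 110
        associated, first_target = self.corpus[first_source]
        return associated, first_target

    def splice_source(self, first_source, associated):  # first determination module 120
        return " ".join(associated[:1] + [first_source])

    def determine_target(self, first_target):   # second determination module 130
        # Simplest described variant: reuse the first target sentence.
        return first_target

    def form_pair(self, first_source):          # sentence pair forming module 140
        associated, first_target = self.acquire_sentences(first_source)
        second_source = self.splice_source(first_source, associated)
        second_target = self.determine_target(first_target)
        return (second_source, second_target)

enhancer = TrainingSampleEnhancer({"Je suis là.": (["Bonjour."], "I am here.")})
sample = enhancer.form_pair("Je suis là.")
# sample == ("Bonjour. Je suis là.", "I am here.")
```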
It is understood that, for explanation, examples, and beneficial effects of the computer device provided in the embodiments of the present application, reference may be made to the corresponding parts of the first aspect or the second aspect, and details are not repeated here.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the training sample enhancement method provided in the first aspect or the machine translation model training method provided in the second aspect.
It is understood that, for explanation, examples, and beneficial effects of the computer-readable storage medium provided in the embodiments of the present application, reference may be made to the corresponding parts of the first aspect or the second aspect, and details are not repeated here.
It is to be appreciated that any reference to memory, storage, database, or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for enhancing a training sample, comprising:
acquiring a first source language sentence and an associated sentence of the first source language sentence;
splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
determining a second target language sentence according to the second source language sentence;
and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample.
2. The method of claim 1, wherein said determining a second target language sentence from said second source language sentence comprises:
and taking the first target language sentence corresponding to the first source language sentence as the second target language sentence.
3. The method of claim 1, wherein said determining a second target language sentence from said second source language sentence comprises:
and determining the second target language sentence according to the first source language sentence and sentences except the first source language sentence in the second source language sentence.
4. The method of claim 3, wherein determining the second target language sentence from the first source language sentence and sentences other than the first source language sentence in the second source language sentence comprises:
and selecting sentences or sentence combinations adjacent to the first source language sentence from sentences except the first source language sentence in the second source language sentence, and splicing the target language sentence corresponding to the selected sentences or sentence combinations with the first target language sentence corresponding to the first source language sentence to obtain the second target language sentence.
5. The method of claim 1, further comprising: setting a position index for each source language sentence having a same source identifier, wherein the position index is a position number of the source language sentence in a source text corresponding to the source identifier;
obtaining an associated sentence of the first source language sentence, including: and selecting the associated text of the first source language sentence from the corresponding source text according to the position index of the first source language sentence.
6. A machine translation model training method is characterized by comprising the following steps:
obtaining a training sample, wherein the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
augmenting the training sample with the training sample enhancement method of any one of claims 1-5;
and performing model training according to the training samples and the training samples added by the augmentation.
7. A training sample enhancement device, comprising:
the sentence acquisition module is used for acquiring a first source language sentence and an associated sentence of the first source language sentence;
the first determining module is used for splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
a second determining module, configured to determine a second target language sentence according to the second source language sentence;
and the sentence pair forming module is used for forming a sentence pair by the second source language sentence and the second target language sentence and taking the sentence pair as a training sample.
8. A machine translation model training apparatus, comprising:
the system comprises a sample acquisition device, a processing device and a processing device, wherein the sample acquisition device is used for acquiring a training sample, the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
the training sample enhancement apparatus of claim 7, configured to augment the training sample;
and the model training device is used for carrying out model training according to the training samples and the extended training samples.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
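The position-index-based selection of associated text recited in claim 5 can be sketched as follows; this is a minimal illustration, and the function name, `window` parameter, and data layout are assumptions rather than features of the claim:

```python
# Hypothetical sketch of claim-5-style selection: sentences sharing a source
# identifier are ordered by position index, and the associated text of the
# first source language sentence is selected from the adjacent positions.

def select_associated(source_text, position_index, window=1):
    """source_text: ordered source sentences sharing one source identifier;
    position_index: position number of the first source language sentence;
    window: how many adjacent positions to take on each side (assumed)."""
    lo = max(0, position_index - window)
    hi = min(len(source_text), position_index + window + 1)
    return [s for i, s in enumerate(source_text[lo:hi], start=lo)
            if i != position_index]

text = ["Bonjour.", "Je suis là.", "Il pleut."]
associated = select_associated(text, 1)
# associated == ["Bonjour.", "Il pleut."]
```

Clamping the window to the list bounds handles sentences at the start or end of the source text, where fewer adjacent sentences exist.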
CN202110678838.6A 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium Pending CN113392657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110678838.6A CN113392657A (en) 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110678838.6A CN113392657A (en) 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113392657A true CN113392657A (en) 2021-09-14

Family

ID=77622985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110678838.6A Pending CN113392657A (en) 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113392657A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing
CN109446534A (en) * 2018-09-21 2019-03-08 清华大学 Machine translation method and device
CN111027333A (en) * 2019-12-20 2020-04-17 北京百度网讯科技有限公司 Chapter translation method and device
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN112765998A (en) * 2019-11-01 2021-05-07 华为技术有限公司 Machine translation method, machine translation model training method, device and storage medium
CN112836523A (en) * 2019-11-22 2021-05-25 上海流利说信息技术有限公司 Word translation method, device and equipment and readable storage medium


Similar Documents

Publication Publication Date Title
US9152542B2 (en) Automatic generation of test scripts
US20140379322A1 (en) Converting an input script
CN110929094A (en) Video title processing method and device
O'Brien et al. Towards intelligent post-editing interfaces
KR102100951B1 (en) System for generating question-answer data for maching learning based on maching reading comprehension
CN110532575A (en) Text interpretation method and device
CN112560510A (en) Translation model training method, device, equipment and storage medium
CN115994536B (en) Text information processing method, system, equipment and computer storage medium
CN110753269B (en) Video abstract generation method, intelligent terminal and storage medium
CN110019305B (en) Knowledge base expansion method, storage medium and terminal
CN112417899A (en) Character translation method, device, computer equipment and storage medium
US10936827B1 (en) Machine evaluation of translation accuracy
CN111178098A (en) Text translation method, device and equipment and computer readable storage medium
CN110991193A (en) Translation matrix model selection system based on OpenKiwi
CN117725895A (en) Document generation method, device, equipment and medium
CN112836525A (en) Human-computer interaction based machine translation system and automatic optimization method thereof
CN113392657A (en) Training sample enhancement method and device, computer equipment and storage medium
CN116861898A (en) Sample data processing method, device, equipment and medium
CN110928790A (en) Test case construction method and device and test equipment
CN111597824B (en) Training method and device for language translation model
CN115936021A (en) Data enhancement method, device, equipment and readable storage medium
CN113705198B (en) Scene graph generation method and device, electronic equipment and storage medium
CN114298060A (en) Subtitle translation quality detection method, device, equipment and medium
Stodden Reproduction of German Text Simplification Systems
CN113392658A (en) Statement translation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination