CN113392657A - Training sample enhancement method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113392657A
CN113392657A (application CN202110678838.6A)
Authority
CN
China
Prior art keywords
sentence
language sentence
source language
source
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110678838.6A
Other languages
Chinese (zh)
Inventor
张轩玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202110678838.6A priority Critical patent/CN113392657A/en
Publication of CN113392657A publication Critical patent/CN113392657A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a training sample enhancement method, a training sample enhancement device, computer equipment and a storage medium. The method comprises the following steps: acquiring a first source language sentence and an associated sentence of the first source language sentence; splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence; determining a second target language sentence according to the second source language sentence; and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample. In the application, the second source language sentence is real data rather than pseudo data, so that the negative influence of noise generated by the pseudo data on the machine translation model can be reduced or avoided.

Description

Training sample enhancement method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of machine translation technologies, and in particular, to a method and an apparatus for enhancing a training sample, a method and an apparatus for training a machine translation model, a computer device, and a storage medium.
Background
Currently, machine translation uses a translation model, obtained by training on samples, to translate a source language into a target language. Training such a translation model, however, requires a large number of training samples, and these samples, namely sentence pairs of the source language and the target language, are costly to acquire; for low-resource languages in particular, professional translators are needed for annotation. Technicians therefore prefer to extend the training samples through data enhancement to reduce the cost.
At present, data enhancement methods include word replacement, back-translation, and the like. In word replacement, a word is swapped for a synonym, for example replacing "very" in "this action is very cool". This approach requires maintaining a completely correct synonym table, because once the table contains an error, the replaced sentence will have a different meaning from the original sentence. In back-translation, a sentence such as "He looks good" is first translated into English and then translated back into Chinese; the meaning of the back-translated sentence is not necessarily the same as that of the original sentence (in the original Chinese, the round trip yields a different wording). It can be seen that while this approach increases the number of training samples, it also introduces erroneous noise. In fact, the data added by word replacement and back-translation are pseudo data, so a great deal of noise is introduced as the number of training samples grows, which increases the difficulty of machine learning; moreover, the replacement rate and noise rate need to be finely tuned, so the workload is large and the flexibility is low.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present application provides a training sample enhancement method, apparatus, computer device and storage medium.
In a first aspect, the present application provides a training sample enhancement method, including:
acquiring a first source language sentence and an associated sentence of the first source language sentence;
splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
determining a second target language sentence according to the second source language sentence;
and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample.
In a second aspect, the present application provides a machine translation model training method, including:
obtaining a training sample, wherein the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
expanding the training samples by using the training sample enhancement method provided in the first aspect;
and performing model training according to the original training samples and the training samples added by the expansion.
In a third aspect, the present application provides a training sample enhancing apparatus, comprising:
the sentence acquisition module is used for acquiring a first source language sentence and an associated sentence of the first source language sentence;
the first determining module is used for splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
a second determining module, configured to determine a second target language sentence according to the second source language sentence;
and the sentence pair forming module is used for forming a sentence pair by the second source language sentence and the second target language sentence and taking the sentence pair as a training sample.
In a fourth aspect, the present application provides a machine translation model training apparatus, including:
the system comprises a sample acquisition device, a processing device and a processing device, wherein the sample acquisition device is used for acquiring a training sample, the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
the training sample enhancement device provided by the third aspect is used for expanding the training sample;
and the model training device is used for carrying out model training according to the training samples and the extended training samples.
In a fifth aspect, the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a sixth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.
In the application, at least one sentence in the associated sentences is spliced with the first source language sentence to obtain the second source language sentence, and then a new data pair is formed based on the second source language sentence, wherein the second source language sentence is real data and is not pseudo data, so that negative effects brought to a machine translation model by noise generated by the pseudo data can be reduced or avoided.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a training sample enhancement method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a training sample enhancing apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a first aspect, an embodiment of the present application provides a training sample enhancement method, as shown in fig. 1, the method includes the following steps:
s110, acquiring a first source language sentence and an associated sentence of the first source language sentence;
the method and the device can be applied to a plurality of scenes needing to translate the text, such as a long text scene of the lines, a short text scene of the lines, a conversation scene and the like.
In particular implementations, the first source language sentence can be obtained from a set of training samples. In practice, a plurality of sentence pairs may be included in the training sample set, each sentence pair including a source language sentence and a corresponding target language sentence. After a new sentence pair is obtained by executing the scheme, the new sentence pair is also added to the training sample set, so that the training sample set is expanded.
It will be appreciated that the training sample set is used for model training to obtain a machine translation model that can translate one language into another, for example Chinese into English, where Chinese is the source language and English is the target language. A sentence pair in the training sample set then consists, for example, of a Chinese sentence meaning "She is a beautiful girl" together with the English sentence "She is a beautiful girl".
It is understood that the specific meaning of the above associated sentences may differ across application scenarios, but in any scenario, an associated sentence generally refers to a sentence that helps interpret the sentence to be translated. For example, in a scenario of translating dialogue-line text, the associated sentence of the first source language sentence may be its preceding sentence, its following sentence, or both the preceding and the following sentence. Of course, besides context sentences, the associated sentences may also be related annotation text of the sentence to be translated, and the like.
In a specific implementation, the number of associated sentences may be chosen as needed. For example, if the associated sentences are context sentences and the number n is set to 1, then the preceding sentence a and the following sentence c of the first source language sentence b in the source text are selected as context sentences. If n is 2, the two preceding sentences a and b and the two following sentences d and e of the first source language sentence c in the source text are selected as context sentences.
The source text may be of various types, such as a movie, a television series, an article, or a book. Suppose a machine translation model is mainly used to translate the lines in television series; then its training sample set needs to contain the lines of multiple series. To distinguish different series, a source identifier is set for each one, with different series corresponding to different source identifiers, so that all lines in the training sample set that come from the same series correspond to one source identifier. Naturally, the lines in the training sample set comprise both the source language sentences and the target language sentences. To distinguish the sentence pairs that come from the same source text, and to make it convenient to extract context sentences as associated sentences, a position identifier may be set for each source language sentence from the same source text, indicating the position of that sentence in the source text.
That is, the method provided by the present application may further include:
s100, setting a position index for a source language sentence with the same source identifier, wherein the position index is a position number of the source language sentence in a source text corresponding to the source identifier;
based on step S100, the process of obtaining the associated sentence of the first source language sentence in step S110 may include: selecting the associated text of the first source language sentence from the corresponding source text according to the position index of the first source language sentence. For example, if the position index of the first source language sentence points to sentence b, its context sentences may include the preceding sentence a and the following sentence c. By setting the position index, the context sentences of the first source language sentence can thus be found as associated text more simply and conveniently.
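The position-index lookup described above can be expressed in a few lines of Python. This is a minimal illustrative sketch, not the patent's implementation; the names SOURCE_TEXTS and get_context are assumptions introduced here.

```python
# Minimal sketch of steps S100/S110: the sentences of each source text are
# stored in order under a source identifier, so the position index is simply
# a sentence's offset in that list. get_context returns up to n preceding
# and n following sentences as the associated (context) sentences.
# SOURCE_TEXTS and get_context are illustrative names, not from the patent.

SOURCE_TEXTS = {
    "drama_01": ["a", "b", "c", "d", "e"],  # lines in original order
}

def get_context(source_id, position, n=1):
    """Return (preceding, following) context sentences for the sentence at `position`."""
    lines = SOURCE_TEXTS[source_id]
    before = lines[max(0, position - n):position]
    after = lines[position + 1:position + 1 + n]
    return before, after

before, after = get_context("drama_01", 1, n=1)  # sentence "b" -> (["a"], ["c"])
```

Slicing clamps at the text boundaries, so sentences near the beginning or end of a source text simply get fewer context sentences.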
S120, splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
for example, assume that context sentences are taken as associated sentences, the number n is set to 1, the first source language sentence is b, and the corresponding first target language sentence is B; the preceding sentence of sentence b is a, whose corresponding target language sentence is A; and the following sentence of b is c, whose corresponding target language sentence is C. At least one of the context sentences a and c is spliced with the first source language sentence b to obtain a second source language sentence b′, so b′ has three possibilities: a+b, b+c, a+b+c.
For another example, the number n is 2, the first source language sentence is c, and the corresponding first target language sentence is C; the two preceding sentences of c are a and b, whose corresponding target language sentences are A and B; and the two following sentences of c are d and e, whose corresponding target language sentences are D and E. At least one of the context sentences a, b, d, e is spliced with the first source language sentence c to obtain a second source language sentence c′, so c′ has 15 possibilities: (1) any one of the context sentences a, b, d, e is spliced with c, giving four second source language sentences: a+c, b+c, c+d, c+e; (2) any two of the context sentences are spliced with c, giving six: a+b+c, b+c+d, c+d+e, a+c+d, a+c+e, b+c+e; (3) any three of the context sentences are spliced with c, giving four: a+b+c+d, a+b+c+e, b+c+d+e, a+c+d+e; (4) all four context sentences a, b, d, e are spliced with c, giving one: a+b+c+d+e. It can be seen that when n is 2, there are 15 second source language sentences c′.
Here the first source language sentence and the first target language sentence correspond to each other, coming from one sentence pair. By analysis, the larger n is, the larger the number of second source language sentences; the specific number is 2^(2n) - 1.
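The enumeration of second source language sentences, and the count of 2^(2n) - 1 of them, can be checked with a short Python sketch. This is illustrative only; `splice_variants` is a name introduced here, and the sketch assumes a splice keeps the sentences in their original order in the source text.

```python
from itertools import combinations

def splice_variants(before, sentence, after):
    """All second source language sentences: splice any non-empty subset of
    the context sentences with the first source sentence, preserving the
    original sentence order."""
    ctx = before + after
    order = before + [sentence] + after
    variants = []
    for k in range(1, len(ctx) + 1):
        for subset in combinations(ctx, k):
            chosen = set(subset) | {sentence}
            variants.append("+".join(s for s in order if s in chosen))
    return variants

v1 = splice_variants(["a"], "b", ["c"])            # n = 1 case
v2 = splice_variants(["a", "b"], "c", ["d", "e"])  # n = 2 case
```

For n = 1 this yields the 3 variants a+b, b+c, a+b+c; for n = 2 it yields the 15 variants listed above, matching 2^(2n) - 1 non-empty subsets of the 2n context sentences.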
It can be seen that the second source language sentence obtained by this step S120 is real data, not pseudo data.
S130, determining a second target language sentence according to the second source language sentence;
in specific implementation, there are various ways to determine the second target language sentence in step S130, and two of them are described below:
(1) Taking the first target language sentence corresponding to the first source language sentence as the second target language sentence.
For example, if the number n is 1, the first source language sentence is b, and the corresponding first target language sentence is B, then the second target language sentence B′ is the first target language sentence B. In this case, the correspondence between the second source language sentence and the second target language sentence is: a+b → B, b+c → B, a+b+c → B, and each second source language sentence forms a sentence pair with its corresponding second target language sentence.
For another example, if the number n is 2, the first source language sentence is c, and the corresponding first target language sentence is C, then the second target language sentence C′ is the first target language sentence C; that is, the 15 second source language sentences c′ listed above all correspond to C.
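Way (1) amounts to pairing every spliced variant with the unchanged first target language sentence. A one-line Python sketch (the function name is an assumption introduced here):

```python
def pairs_way1(variants, first_target):
    """Way (1): every spliced source variant keeps the original target sentence."""
    return [(src, first_target) for src in variants]

pairs = pairs_way1(["a+b", "b+c", "a+b+c"], "B")  # the n = 1 case above
```

Every spliced source therefore maps to the same target B, mirroring the a+b → B, b+c → B, a+b+c → B correspondences in the text.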
(2) Determining the second target language sentence according to the first source language sentence and the sentences other than the first source language sentence in the second source language sentence.
That is, the second target language sentence is selected with reference not only to the first target language sentence but also to at least one of the associated sentences according to which the second source language sentence was determined in step S120.
For example, if the number n is 1, the first source language sentence is b, and the associated sentence selected when determining the second source language sentence is the preceding sentence a, then the second target language sentence is determined according to sentences a and b. If the associated sentences selected when determining the second source language sentence are the context sentences a and c, then the second target language sentence is determined according to sentences a, b, and c.
In a specific implementation, way (2) may include: selecting a sentence or sentence combination adjacent to the first source language sentence from the sentences other than the first source language sentence in the second source language sentence, and splicing the target language sentence corresponding to the selected sentence or sentence combination with the first target language sentence corresponding to the first source language sentence, to obtain the second target language sentence.
It is to be understood that if, among the sentences other than the first source language sentence in the second source language sentence, several mutually adjacent sentences include one sentence adjacent to the first source language sentence, those sentences constitute a sentence combination.
For example, if n is 1, the first source language sentence is b, and the second source language sentence is b′ = a+b, where sentence a is adjacent to b, then the second target language sentence B′ is obtained by splicing A and B, that is, B′ = A+B; the correspondence between the second source language sentence and the second target language sentence is: b′ = a+b → B′ = A+B. Likewise, for b′ = b+c, B′ is obtained by splicing B and C, that is, B′ = B+C, and the correspondence is: b′ = b+c → B′ = B+C.
It is understood that one or more sentences adjacent to the first source language sentence may be selected from the sentences other than the first source language sentence in the second source language sentence. When one sentence is selected, for example: b′ = a+b+c → B′ = B+C, or b′ = a+b+c → B′ = A+B; when several sentences are selected, for example: b′ = a+b+c → B′ = A+B+C.
For another example, when n is 2, the first source language sentence is c and the associated sentences are the context sentences a, b, d, e; the process of determining the second target language sentence is similar to the case where n is 1 and the first source language sentence is b. For instance, assume a, b, d, e are all selected when forming the second source language sentence by splicing, that is, c′ includes a, b, c, d, and e. Among sentences a, b, d, and e, sentences b and d are adjacent to sentence c, so the corresponding second target language sentence may be B+C or C+D; one correspondence between the second source language sentence and the second target language sentence is then: c′ = a+b+c+d+e → C′ = C+D.
Further, a and b are adjacent sentences and b is adjacent to c, so sentences a and b form a sentence combination, and the correspondence between the second source language sentence and the second target language sentence is: c′ = a+b+c+d+e → C′ = A+B+C. Likewise, d and e are adjacent sentences and d is adjacent to c, so sentences d and e form a sentence combination, and the correspondence is: c′ = a+b+c+d+e → C′ = C+D+E.
For the full sentence combination, there is also the correspondence: c′ = a+b+c+d+e → C′ = A+B+C+D+E. This special case is, in fact: splicing the target language sentences corresponding to all sentences other than the first source language sentence in the second source language sentence with the first target language sentence, to obtain the second target language sentence. Under this special approach, the 15 second source language sentences c′ above correspond to second target language sentences as follows: a+c → A+C, b+c → B+C, c+d → C+D, c+e → C+E; a+b+c → A+B+C, b+c+d → B+C+D, c+d+e → C+D+E, a+c+d → A+C+D, a+c+e → A+C+E, b+c+e → B+C+E; a+b+c+d → A+B+C+D, a+b+c+e → A+B+C+E, b+c+d+e → B+C+D+E, a+c+d+e → A+C+D+E; a+b+c+d+e → A+B+C+D+E.
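The special case just described, in which every source sentence in the spliced variant contributes its target sentence, can be sketched as follows (illustrative only; `target_for_variant` and `src2tgt` are names introduced here, and the sketch assumes each source sentence has a known target sentence):

```python
def target_for_variant(variant, src2tgt):
    """Special case of way (2): splice the target sentence of every source
    sentence appearing in the spliced variant, in order."""
    return "+".join(src2tgt[s] for s in variant.split("+"))

# lowercase letters stand for source sentences, uppercase for their targets
src2tgt = {"a": "A", "b": "B", "c": "C", "d": "D", "e": "E"}
t = target_for_variant("a+c+d", src2tgt)  # "A+C+D"
```

Applying this to all 15 variants reproduces the correspondence table above.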
It can be seen that the second target language statements obtained by either the (1) th way or the (2) th way are real data, not pseudo data.
S140, forming a sentence pair by the second source language sentence and the second target language sentence, and taking the sentence pair as a training sample.
For example, when n is 1 and the first source language sentence is b, sample enhancement yields 8 new sentence pairs: a+b → B, b+c → B, a+b+c → B, a+b → A+B, b+c → B+C, a+b+c → B+C, a+b+c → A+B, a+b+c → A+B+C. These 8 new sentence pairs are added to the training sample set, expanding it. Naturally, when n is 2, the number of new sentence pairs is even larger.
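The 8 sentence pairs for n = 1 can be reproduced with a short sketch that combines way (1) with one reading of way (2), in which the second target sentence is spliced from any contiguous block of the variant's sentences containing the first source sentence. The function name and this particular reading are assumptions for illustration, not the patent's implementation.

```python
def n1_enhanced_pairs(a, b, c, src2tgt):
    """Sketch of step S140 for n = 1: the three spliced variants of sentence b,
    each paired with targets from way (1) and way (2)."""
    variants = ["+".join(v) for v in ([a, b], [b, c], [a, b, c])]
    pairs = []
    for v in variants:
        parts = v.split("+")
        pairs.append((v, src2tgt[b]))  # way (1): keep the original target
        i = parts.index(b)
        # way (2): splice targets over contiguous blocks of parts containing b
        for lo in range(i + 1):
            for hi in range(i, len(parts)):
                if lo == i and hi == i:
                    continue  # the block {b} alone would duplicate way (1)
                pairs.append((v, "+".join(src2tgt[p] for p in parts[lo:hi + 1])))
    return pairs

pairs = n1_enhanced_pairs("a", "b", "c", {"a": "A", "b": "B", "c": "C"})
```

Running this yields exactly the 8 correspondences listed above, which can then be appended to the training sample set.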
It can be understood that the method provided by the application is suitable for many kinds of sentences, and is especially suitable for dialogue-line sentences, which have the following characteristics: lines vary in length, a single line is often not complete on its own, and the context of a line bears a definite relationship to it, so using context sentences as associated sentences further expands the training samples.
For example, suppose the source language is Chinese and the target language is Malay. First, model training is performed with a training sample set that originally includes 1000 training samples, yielding a first model; then, after the original training sample set is expanded by the training sample enhancement method provided by the application, model training with the expanded set yields a second model. Both models are then evaluated with the BLEU metric. BLEU stands for Bilingual Evaluation Understudy, an automatic substitute for human evaluation of translation results; the higher the value, the higher the translation quality of the model. By calculation, the BLEU of the first model is 19.11 and that of the second model is 19.99. The BLEU of the second model is thus improved, showing that its translation quality is higher than that of the first model.
According to the training sample enhancement method, at least one sentence in the associated sentences is spliced with the first source language sentence to obtain the second source language sentence, a new data pair is formed based on the second source language sentence, the second source language sentence is real data and is not pseudo data, and therefore negative effects of noise generated by the pseudo data on a machine translation model can be reduced or avoided.
In a second aspect, the present application provides a machine translation model training method, specifically including:
s210, obtaining a training sample, wherein the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
s220, expanding the training sample by adopting the training sample enhancement method provided by the first aspect;
and S230, performing model training according to the training samples and the extended training samples.
Of course, during the training process in S230, it is generally necessary to verify text correctness, determine whether a convergence condition is met, and so on; when the convergence condition is met, the model training is complete.
In a third aspect, the present application provides a training sample enhancing apparatus, as shown in fig. 2, the apparatus 100 includes the following modules:
a sentence obtaining module 110, configured to obtain a first source language sentence and an associated sentence of the first source language sentence;
a first determining module 120, configured to splice at least one of the associated sentences with the first source language sentence to obtain a second source language sentence;
a second determining module 130, configured to determine a second target language sentence according to the second source language sentence;
a sentence pair forming module 140, configured to form a sentence pair from the second source language sentence and the second target language sentence, and use the sentence pair as a training sample.
It can be understood that, for the explanation, examples, and beneficial effects of the training sample enhancement device provided in the embodiment of the present application, reference may be made to corresponding parts in the first aspect, and details are not described here.
In a fourth aspect, the present application provides a machine translation model training apparatus, specifically including:
the system comprises a sample acquisition device, a processing device and a processing device, wherein the sample acquisition device is used for acquiring a training sample, the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
the training sample enhancing apparatus 100 provided in the third aspect is configured to augment the training sample;
and the model training device is used for carrying out model training according to the training samples and the extended training samples.
In a fifth aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the training sample enhancement method provided in the first aspect or the machine translation model training method provided in the second aspect. For example, the following steps in the method provided by the first aspect may be implemented: acquiring a first source language sentence and an associated sentence of the first source language sentence; splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence; determining a second target language sentence according to the second source language sentence; and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample.
FIG. 3 is a diagram illustrating an internal structure of a computer device in one embodiment. As shown in fig. 3, the computer apparatus includes a processor, a memory, a network interface, an input device, a display screen, and the like, which are connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the training sample enhancement method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a training sample enhancement method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in FIG. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, the training sample enhancement apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device such as that shown in FIG. 3. The memory of the computer device may store the various program modules constituting the training sample enhancement apparatus, such as the sentence acquisition module 110, the first determination module 120, the second determination module 130, and the sentence pair forming module 140 shown in FIG. 2. As another example, the stored program modules may include the sample acquisition device, the training sample enhancement device, and the model training device; these devices are likewise, in effect, program modules. The program modules constitute a computer program that causes the processor to perform the steps of the training sample enhancement method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 3 may perform, by the sentence acquisition module 110 in the training sample enhancement apparatus shown in FIG. 2, the acquiring of a first source language sentence and an associated sentence of the first source language sentence. The computer device may perform, by the first determination module 120, the splicing of at least one of the associated sentences with the first source language sentence to obtain a second source language sentence. The computer device may perform, by the second determination module 130, the determining of a second target language sentence from the second source language sentence. The computer device may perform, by the sentence pair forming module 140, the forming of the second source language sentence and the second target language sentence into a sentence pair, and use the sentence pair as a training sample.
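The division of labor among the four modules can be sketched as follows; the class and method names are hypothetical illustrations, not taken from FIG. 2:

```python
# Hypothetical sketch of the four program modules (110-140) described above;
# the class, method names, and corpus layout are illustrative assumptions.

class TrainingSampleEnhancer:
    def __init__(self, corpus):
        # corpus maps each first source sentence to its associated sentences
        # and its first target sentence.
        self.corpus = corpus

    def acquire_sentences(self, first_source):  # sentence acquisition module 110
        associated, first_target = self.corpus[first_source]
        return associated, first_target

    def splice_source(self, first_source, associated):  # first determination module 120
        return " ".join(associated[:1] + [first_source])

    def determine_target(self, first_target):   # second determination module 130
        # Simplest described variant: reuse the first target sentence.
        return first_target

    def form_pair(self, first_source):          # sentence pair forming module 140
        associated, first_target = self.acquire_sentences(first_source)
        second_source = self.splice_source(first_source, associated)
        second_target = self.determine_target(first_target)
        return (second_source, second_target)

enhancer = TrainingSampleEnhancer({"Je suis là.": (["Bonjour."], "I am here.")})
sample = enhancer.form_pair("Je suis là.")
# sample == ("Bonjour. Je suis là.", "I am here.")
```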
It is understood that, for explanation, examples, and beneficial effects of the computer device provided in the embodiments of the present application, reference may be made to the corresponding parts of the first aspect or the second aspect, and details are not repeated here.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the training sample enhancement method provided in the first aspect or the machine translation model training method provided in the second aspect.
It is understood that, for explanation, examples, and beneficial effects of the computer-readable storage medium provided in the embodiments of the present application, reference may be made to the corresponding parts of the first aspect or the second aspect, and details are not repeated here.
It is to be appreciated that any reference to memory, storage, database, or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for enhancing a training sample, comprising:
acquiring a first source language sentence and an associated sentence of the first source language sentence;
splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
determining a second target language sentence according to the second source language sentence;
and forming a sentence pair by using the second source language sentence and the second target language sentence, and using the sentence pair as a training sample.
2. The method of claim 1, wherein said determining a second target language sentence from said second source language sentence comprises:
and taking the first target language sentence corresponding to the first source language sentence as the second target language sentence.
3. The method of claim 1, wherein said determining a second target language sentence from said second source language sentence comprises:
and determining the second target language sentence according to the first source language sentence and sentences except the first source language sentence in the second source language sentence.
4. The method of claim 3, wherein determining the second target language sentence from the first source language sentence and sentences other than the first source language sentence in the second source language sentence comprises:
and selecting sentences or sentence combinations adjacent to the first source language sentence from sentences except the first source language sentence in the second source language sentence, and splicing the target language sentence corresponding to the selected sentences or sentence combinations with the first target language sentence corresponding to the first source language sentence to obtain the second target language sentence.
5. The method of claim 1, further comprising: setting a position index for each source language sentence having a same source identifier, wherein the position index is a position number of the source language sentence in a source text corresponding to the source identifier;
obtaining an associated sentence of the first source language sentence, including: and selecting the associated text of the first source language sentence from the corresponding source text according to the position index of the first source language sentence.
6. A machine translation model training method is characterized by comprising the following steps:
obtaining a training sample, wherein the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
augmenting the training sample with the training sample enhancement method of any one of claims 1-5;
and performing model training according to the training samples and the training samples added by the augmentation.
7. A training sample enhancement device, comprising:
the sentence acquisition module is used for acquiring a first source language sentence and an associated sentence of the first source language sentence;
the first determining module is used for splicing at least one sentence in the associated sentences with the first source language sentence to obtain a second source language sentence;
a second determining module, configured to determine a second target language sentence according to the second source language sentence;
and the sentence pair forming module is used for forming a sentence pair by the second source language sentence and the second target language sentence and taking the sentence pair as a training sample.
8. A machine translation model training apparatus, comprising:
the system comprises a sample acquisition device, a processing device and a processing device, wherein the sample acquisition device is used for acquiring a training sample, the training sample comprises a plurality of sentence pairs, and each sentence pair comprises a source language sentence and a corresponding target language sentence;
the training sample enhancement apparatus of claim 7, configured to augment the training sample;
and the model training device is used for carrying out model training according to the training samples and the extended training samples.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
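The position-index-based selection of associated text recited in claim 5 can be sketched as follows; this is a minimal illustration, and the function name, `window` parameter, and data layout are assumptions rather than features of the claim:

```python
# Hypothetical sketch of claim-5-style selection: sentences sharing a source
# identifier are ordered by position index, and the associated text of the
# first source language sentence is selected from the adjacent positions.

def select_associated(source_text, position_index, window=1):
    """source_text: ordered source sentences sharing one source identifier;
    position_index: position number of the first source language sentence;
    window: how many adjacent positions to take on each side (assumed)."""
    lo = max(0, position_index - window)
    hi = min(len(source_text), position_index + window + 1)
    return [s for i, s in enumerate(source_text[lo:hi], start=lo)
            if i != position_index]

text = ["Bonjour.", "Je suis là.", "Il pleut."]
associated = select_associated(text, 1)
# associated == ["Bonjour.", "Il pleut."]
```

Clamping the window to the list bounds handles sentences at the start or end of the source text, where fewer adjacent sentences exist.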
CN202110678838.6A 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium Pending CN113392657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110678838.6A CN113392657A (en) 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110678838.6A CN113392657A (en) 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113392657A true CN113392657A (en) 2021-09-14

Family

ID=77622985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110678838.6A Pending CN113392657A (en) 2021-06-18 2021-06-18 Training sample enhancement method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113392657A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408833A (en) * 2018-10-30 2019-03-01 科大讯飞股份有限公司 A kind of interpretation method, device, equipment and readable storage medium storing program for executing
CN109446534A (en) * 2018-09-21 2019-03-08 清华大学 Machine translation method and device
CN111027333A (en) * 2019-12-20 2020-04-17 北京百度网讯科技有限公司 Chapter translation method and device
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN112765998A (en) * 2019-11-01 2021-05-07 华为技术有限公司 Machine translation method, machine translation model training method, device and storage medium
CN112836523A (en) * 2019-11-22 2021-05-25 上海流利说信息技术有限公司 Word translation method, device and equipment and readable storage medium


Similar Documents

Publication Publication Date Title
US9152542B2 (en) Automatic generation of test scripts
US20140379322A1 (en) Converting an input script
CN110929094A (en) Video title processing method and device
O'Brien et al. Towards intelligent post-editing interfaces
KR102100951B1 (en) System for generating question-answer data for maching learning based on maching reading comprehension
CN110532575A (en) Text interpretation method and device
CN112560510A (en) Translation model training method, device, equipment and storage medium
CN115994536B (en) Text information processing method, system, equipment and computer storage medium
CN110753269B (en) Video abstract generation method, intelligent terminal and storage medium
CN110019305B (en) Knowledge base expansion method, storage medium and terminal
CN112417899A (en) Character translation method, device, computer equipment and storage medium
US10936827B1 (en) Machine evaluation of translation accuracy
CN111178098A (en) Text translation method, device and equipment and computer readable storage medium
CN110991193A (en) Translation matrix model selection system based on OpenKiwi
CN117725895A (en) Document generation method, device, equipment and medium
CN112836525A (en) Human-computer interaction based machine translation system and automatic optimization method thereof
CN113392657A (en) Training sample enhancement method and device, computer equipment and storage medium
CN116861898A (en) Sample data processing method, device, equipment and medium
CN110928790A (en) Test case construction method and device and test equipment
CN111597824B (en) Training method and device for language translation model
CN115936021A (en) Data enhancement method, device, equipment and readable storage medium
CN113705198B (en) Scene graph generation method and device, electronic equipment and storage medium
CN114298060A (en) Subtitle translation quality detection method, device, equipment and medium
Stodden Reproduction of German Text Simplification Systems
CN113392658A (en) Statement translation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination