CN111382577B - Document translation method, device, electronic equipment and storage medium

Info

Publication number: CN111382577B
Application number: CN202010166968.7A
Authority: CN
Inventors: 王明轩; 孙泽维; 李磊
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2023-05-02
Anticipated expiration: 2040-03-11
Also published as: CN111382577A

Abstract

The embodiments of the present disclosure disclose a document translation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a source document in a source language to be translated, the source document comprising a plurality of sentences; and inputting the source document into a pre-trained document translation model to obtain a target document in a target language output by the document translation model, wherein the document translation model translates the source document directly into the target document instead of translating its sentences one by one. With this technical solution, full-text translation is performed with the document as the unit, so that the machine learning model can take into account the meaning of each word within the full text, which makes the translation more accurate.

Description

Document translation method, device, electronic equipment and storage medium

Technical Field

The embodiment of the disclosure relates to the technical field of natural language processing, in particular to a document translation method, a device, electronic equipment and a storage medium.

Background

How to automatically realize the mutual conversion between different languages by using a computer is an important research field of natural language processing and artificial intelligence. Currently, a widely adopted approach is sentence-to-sentence level translation.

Disclosure of Invention

In view of this, the embodiments of the present disclosure provide a document translation method, apparatus, electronic device, and storage medium, so as to implement full text translation in units of documents.

Other features and advantages of the embodiments of the present disclosure will in part be apparent from the following detailed description, or may in part be learned by practice of the embodiments of the present disclosure.

In a first aspect of the present disclosure, an embodiment of the present disclosure provides a document translation method, including: acquiring a source document of a source language to be translated, wherein the source document comprises a plurality of sentences; and inputting the source document into a pre-trained document translation model to obtain a target document of a target language output by the document translation model, wherein the document translation model is capable of directly translating the source document into the target document instead of translating the plurality of sentences of the source document sentence by sentence.

In a second aspect of the present disclosure, an embodiment of the present disclosure further provides a document translation apparatus, including: a source document acquisition unit configured to acquire a source document of a source language to be translated, the source document including a plurality of sentences; a target document obtaining unit, configured to input the source document into a pre-trained document translation model to obtain a target document in a target language output by the document translation model, where the document translation model is capable of directly translating the source document into the target document, instead of translating the plurality of sentences of the source document sentence by sentence.

In a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a processor; and a memory storing executable instructions that, when executed by the processor, cause the electronic device to perform the method of the first aspect.

In a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect.

According to the embodiments of the present disclosure, a source document of a source language to be translated is acquired and input into a pre-trained document translation model to obtain a target document of a target language output by the document translation model. Because the document translation model translates the source document directly into the target document instead of translating its sentences one by one, the machine learning model can take into account the meaning of each word within the whole text, and the translation is more accurate.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the following description will briefly explain the drawings required to be used in the description of the embodiments of the present disclosure, and it is apparent that the drawings in the following description are only some of the embodiments of the present disclosure, and other drawings may be obtained according to the contents of the embodiments of the present disclosure and these drawings without inventive effort for those skilled in the art.

FIG. 1 is a flow diagram of a document translation method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an exemplary sentence-level translation effect;

FIG. 3 is a flow diagram of an exemplary document translation model training method provided by embodiments of the present disclosure;

FIG. 4 is a flow chart of an exemplary method for training a derived document translation model provided by embodiments of the present disclosure;

FIG. 5 is a schematic diagram of a document translation apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a training module architecture of an exemplary document translation model provided by embodiments of the present disclosure;

FIG. 7 is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

In order to make the technical problems solved, the technical solutions adopted and the technical effects achieved by the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments, but not all embodiments of the present disclosure. All other embodiments, which are derived by a person skilled in the art from the embodiments of the present disclosure without creative efforts, fall within the protection scope of the embodiments of the present disclosure.

It should be noted that the terms "system" and "network" in the embodiments of the present disclosure are often used interchangeably herein. References to "and/or" in the embodiments of the present disclosure are intended to encompass any and all combinations of one or more of the associated listed items. The terms first, second and the like in the description and in the claims and drawings are used for distinguishing between different objects and not for limiting a particular order.

It should be further noted that, in the embodiments of the present disclosure, the following embodiments may be implemented separately, or may be implemented in combination with each other, which is not specifically limited by the embodiments of the present disclosure.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

The technical solutions of the embodiments of the present disclosure are further described below with reference to the accompanying drawings and through specific implementations.

FIG. 1 shows a flowchart of a document translation method according to an embodiment of the present disclosure. The method is applicable to full-text translation performed with the document as the unit, and may be performed by a document translation device configured in an electronic device. As shown in FIG. 1, the document translation method according to this embodiment includes:

In step S110, a source document of a source language to be translated is acquired, the source document containing a plurality of sentences.

In step S120, the source document is input to a pre-trained document translation model to obtain a target document in a target language output by the document translation model, the document translation model being capable of translating the source document directly into the target document instead of translating the plurality of sentences of the source document sentence by sentence. It should be noted that the source language and the target language belong to different languages.
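
Steps S110 and S120 can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the implementation of this embodiment: the names `DocumentModel`, `acquire_source_document`, `translate_document` and `stub_model` are hypothetical, the sentence splitting on "." is a naive assumption, and the stub stands in for a trained model that the text does not specify further.

```python
from typing import Callable, List

# Hypothetical interface: a document-level model maps the whole list of source
# sentences to the whole list of target sentences in a single call.
DocumentModel = Callable[[List[str]], List[str]]


def acquire_source_document(text: str) -> List[str]:
    """S110 (sketch): split raw text into sentences on '.', a naive assumption."""
    sentences = [part.strip() + "." for part in text.split(".") if part.strip()]
    if len(sentences) < 2:
        raise ValueError("expected a source document containing a plurality of sentences")
    return sentences


def translate_document(source: List[str], model: DocumentModel) -> List[str]:
    """S120 (sketch): one model call for the whole document, not one per sentence."""
    return model(source)


def stub_model(document: List[str]) -> List[str]:
    # Stand-in for a trained model: it only tags each sentence with its position,
    # which shows that the full document is visible during the call.
    total = len(document)
    return [f"[{i + 1}/{total}] {s}" for i, s in enumerate(document)]
```

The point of the sketch is the call shape: the model receives every sentence of the source document at once, so any per-word decision it makes can depend on the full text.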

To better explain the technical solution of this embodiment, current document translation methods are reviewed first. At present, the approach commonly adopted in the industry is to split a document into a plurality of sentences and then translate them at the sentence level, rather than translating the entire document directly. The reason is that, for lack of large-scale parallel documents for training, the industry has generally neglected document-level translation and has concentrated its research on sentence-level translation. Within a document, however, the meaning of the sentences is continuous and carries over from one sentence to the next rather than being isolated and closed. The meaning of a word in a sentence should therefore be determined from a connected, evolving and comprehensive perspective rather than an isolated, static and one-sided one; otherwise, inconsistent or incorrect translations can occur.

For example, FIG. 2 is a schematic diagram of an exemplary sentence-level translation effect. The first part is a Chinese source document to be translated, and the second part is the English target document obtained with a sentence-level translation method. As can be seen from FIG. 2, several sentences in the source document contain the word "suke", which is a person's name and should be translated into one uniform target word throughout the same document. Because the sentence-level translation tool translates each sentence independently, the translations of "suke" are not uniform: both "Shuke" and "Shuk" appear in the target document, which is obviously unreasonable.

According to the method of this embodiment, the source document is input into a pre-trained document translation model and translated with the document as the unit to obtain the target document. The source document is translated directly into the target document instead of being translated sentence by sentence, which achieves unexpected technical effects. The reason is that more context information, such as the context of a person's name or the tense of the following text, can be taken into account at the document level.

By adopting the document translation method of this embodiment, the meaning of a word in a sentence is determined from a connected, evolving and comprehensive perspective rather than an isolated, static and one-sided one. This overcomes at least the problems of inconsistent word translation, word translation errors caused by context, and errors in translating the tense of actions, and can markedly improve translation quality.
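
The inconsistency described for FIG. 2 can be reproduced with a deliberately simple toy. The sketch below is only an illustration: the placeholder `<NAME>`, the candidate list and the length-based choice rule in `pick` are all assumptions made for the example and say nothing about how a real translation model chooses a rendering. What it shows is that a choice made from one sentence's context may differ between sentences, while a choice made once from the whole document is reused everywhere.

```python
from typing import Dict, List

# Hypothetical candidate renderings for one source name; FIG. 2 is described as
# mixing "Shuke" and "Shuk" in the sentence-level output.
CANDIDATES: Dict[str, List[str]] = {"<NAME>": ["Shuke", "Shuk"]}


def pick(name: str, context: str) -> str:
    """Toy, deterministic choice that depends only on the visible context."""
    options = CANDIDATES[name]
    return options[len(context) % len(options)]


def sentence_level(sentences: List[str]) -> List[str]:
    # Each sentence is handled in isolation, so the choice may differ per sentence.
    return [s.replace("<NAME>", pick("<NAME>", s)) for s in sentences]


def document_level(sentences: List[str]) -> List[str]:
    # The whole document is the context, so one choice is made and then reused.
    choice = pick("<NAME>", " ".join(sentences))
    return [s.replace("<NAME>", choice) for s in sentences]
```

With the two-sentence input `["<NAME> flies a plane.", "Everyone likes <NAME>."]`, the sentence-level function yields two different renderings of the name, while the document-level function yields a single one.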

The document translation model in this embodiment is a document-level translation model. It can be trained in a variety of ways, which this embodiment does not limit. For example, FIG. 3 shows an exemplary training method. As shown in FIG. 3, the document translation model may be trained by the following steps:

In step S410, a set of training document pairs is obtained, wherein each training document pair comprises a first document in the source language and a second document in the target language.

In step S420, an initialized document translation model is determined, wherein the initialized document translation model includes a target layer for outputting a translation result document.

In step S430, a first document of the training document pair in the training document pair set is used as an input of the initialized document translation model, and a second document of the training document pair is used as an expected output of the initialized document translation model, so as to train to obtain the document translation model.
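
The data flow of steps S410 to S430 can be sketched as follows. This is a minimal sketch under loud assumptions: `TrainingDocumentPair`, `MemorizingModel` and `train` are hypothetical names, and the "model" merely memorizes document pairs. It is not a seq2seq model and does not learn anything general; it only shows that whole documents, not individual sentences, serve as the input and the expected output.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Doc = Tuple[str, ...]  # a document as an ordered tuple of sentences


@dataclass(frozen=True)
class TrainingDocumentPair:
    first_document: Doc   # source-language document (model input)
    second_document: Doc  # target-language document (expected output)


class MemorizingModel:
    """Hypothetical stand-in for the initialized model; a real one would be seq2seq."""

    def __init__(self) -> None:
        self.memory: Dict[Doc, Doc] = {}

    def fit_step(self, source: Doc, expected: Doc) -> None:
        # A real model would adjust parameters from the output/expected difference.
        self.memory[source] = expected

    def translate(self, source: Doc) -> Doc:
        # Unknown documents are returned unchanged in this toy.
        return self.memory.get(source, source)


def train(model: MemorizingModel, pairs: Iterable[TrainingDocumentPair]) -> MemorizingModel:
    """S430 (sketch): whole first document in, whole second document as the target."""
    for pair in pairs:
        model.fit_step(pair.first_document, pair.second_document)
    return model
```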

The initialized document translation model may be a seq2seq (sequence-to-sequence) model. Taking as an example an initialized document translation model that includes at least two encoder layers and at least one decoder layer, the training of step S430, in which a first document of a training document pair in the training document pair set is used as the input of the initialized document translation model and a second document of the training document pair is used as its expected output, may employ the method shown in FIG. 4. That is, step S430 in FIG. 3 may further include:

In step S510, the input document vectors are respectively input to at least one encoder layer for processing to form hidden layer sentence vectors.

In step S520, the document vector and/or the hidden layer sentence vector is input to at least one encoder layer for processing to form a hidden layer vocabulary vector.

In step S530, the hidden layer vocabulary vector and the hidden layer sentence vector are input to at least one decoder layer for processing to generate an output document vector.

In step S540, the initialized document translation model is subjected to parameter adjustment according to difference information between the output document vector and the expected output of the input document vector, so as to train to obtain the document translation model.
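
The data flow of steps S510 to S540 can be traced with a small numeric toy. The sketch below is an assumption-heavy illustration, not the architecture of this embodiment: the averaging sentence encoder, the additive word encoder, the single decoder parameter `scale`, the squared-error objective and the learning rate are all choices made here for the example, since the text does not specify the layers or the form of the difference information.

```python
from typing import List

Vector = List[float]
Sentence = List[Vector]    # one vector per word
Document = List[Sentence]  # one entry per sentence


def mean(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode_sentences(doc: Document) -> List[Vector]:
    """S510 (toy): one hidden sentence vector per sentence, here a plain average."""
    return [mean(sentence) for sentence in doc]


def encode_words(doc: Document, sent_vecs: List[Vector]) -> Document:
    """S520 (toy): each hidden word vector mixes the word with its sentence vector."""
    return [[[w + s for w, s in zip(word, sv)] for word in sentence]
            for sentence, sv in zip(doc, sent_vecs)]


def decode(word_vecs: Document, sent_vecs: List[Vector], scale: float) -> Document:
    """S530 (toy): the decoder sees both inputs; `scale` is its only parameter."""
    return [[[scale * (h + s) for h, s in zip(word, sv)] for word in sentence]
            for sentence, sv in zip(word_vecs, sent_vecs)]


def train_step(doc: Document, expected: Document, scale: float, lr: float = 0.01) -> float:
    """S540 (toy): adjust `scale` by gradient descent on the squared difference."""
    sent_vecs = encode_sentences(doc)
    word_vecs = encode_words(doc, sent_vecs)
    grad = 0.0
    for s_idx, sentence in enumerate(word_vecs):
        for w_idx, word in enumerate(sentence):
            for d, h in enumerate(word):
                feature = h + sent_vecs[s_idx][d]
                error = scale * feature - expected[s_idx][w_idx][d]
                grad += 2.0 * error * feature
    return scale - lr * grad
```

Calling `train_step` repeatedly on one document and its expected output moves `scale` toward the value that reproduces the expected output. In a real model the single parameter would be replaced by the weights of the encoder and decoder layers, and an optimizer would update all of them.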

In this embodiment, a source document of a source language to be translated is acquired and input into a pre-trained document translation model to obtain a target document of a target language output by the document translation model. Because the document translation model translates the source document directly into the target document instead of translating its sentences one by one, the machine learning model can take into account the semantics of each word within the whole text, and the translation is more accurate.

As an implementation of the methods shown in the above figures, the present application provides an embodiment of a document translation apparatus. FIG. 5 shows a schematic structural diagram of the document translation apparatus provided in this embodiment. The apparatus embodiment corresponds to the method embodiments shown in FIG. 1 to FIG. 4, and the apparatus may be applied to various electronic devices. As shown in FIG. 5, the document translation apparatus according to this embodiment includes a source document acquisition unit 610 and a target document obtaining unit 620.

The source document acquisition unit 610 is configured to acquire a source document of a source language to be translated, the source document containing a plurality of sentences.

The target document obtaining unit 620 is configured to input the source document into a pre-trained document translation model to obtain a target document of a target language output by the document translation model, the document translation model being capable of directly translating the source document into the target document instead of translating the plurality of sentences of the source document sentence by sentence.

Further, fig. 6 provides a schematic diagram of an exemplary training module structure of the document translation model, and as shown in fig. 6, the modules for training the document translation model include a sample acquisition module 710, a model determination module 720, and a model training module 730.

The sample acquisition module 710 is configured to acquire a set of training document pairs, wherein a training document pair comprises a first document in the source language and a second document in the target language.

The model determination module 720 is configured to determine an initialized document translation model, wherein the initialized document translation model includes a target layer for outputting a translation result document.

The model training module 730 is configured to use a machine learning method to train and obtain the document translation model, taking a first document of a training document pair in the training document pair set as the input of the initialized document translation model and a second document of the training document pair as the expected output of the initialized document translation model.

In one embodiment, the initialized document translation model is a seq2seq model.

In an embodiment, the initialized document translation model includes at least two encoder layers and at least one decoder layer, and in this configuration, the model training module 730 may further include a first encoding sub-module 731, a second encoding sub-module 732, a decoding sub-module 733, and a model adjustment sub-module 734.

The first encoding submodule 731 is configured to input the input document vectors into at least one encoder layer respectively for processing to form hidden layer sentence vectors.

The second encoding submodule 732 is configured to input the document vector and/or the hidden layer sentence vector into at least one encoder layer for processing to form a hidden layer vocabulary vector.

The decoding submodule 733 is configured to input the hidden layer vocabulary vector and the hidden layer sentence vector into at least one decoder layer for processing to generate an output document vector.

The model adjustment sub-module 734 is configured to perform parameter adjustment on the initialized document translation model based on difference information between the output document vector and the expected output of the input document vector to train to obtain the document translation model.

The document translation device provided by the embodiment of the invention can execute the document translation method provided by the embodiment of the method of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Referring now to FIG. 7, a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in FIG. 7 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 7, the electronic device 800 may include a processing means (e.g., a central processor, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 shows an electronic device 800 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.

It should be noted that, the computer readable medium described above in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the disclosed embodiments, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:

acquire a source document of a source language to be translated, wherein the source document comprises a plurality of sentences; and

input the source document into a pre-trained document translation model to obtain a target document of a target language output by the document translation model, wherein the document translation model is capable of directly translating the source document into the target document instead of translating the plurality of sentences of the source document sentence by sentence.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".

According to one or more embodiments of the present disclosure, in the document translation method, the document translation model is trained by: acquiring a training document pair set, wherein the training document pair comprises a first document in the source language and a second document in the target language; determining an initialized document translation model, wherein the initialized document translation model comprises a target layer for outputting a translation result document; and taking a first document of the training document pair in the training document pair set as input of the initialized document translation model, taking a second document of the training document pair as expected output of the initialized document translation model, and training to obtain the document translation model.

In accordance with one or more embodiments of the present disclosure, in the document translation method, the initialized document translation model is a seq2seq model.

According to one or more embodiments of the present disclosure, in the document translation method: the initialized document translation model comprises at least two encoder layers and at least one decoder layer; taking a first document in a training document pair in the training document pair set as an input of the initialized document translation model, taking a second document in the training document pair as an expected output of the initialized document translation model, and training to obtain the document translation model comprises the following steps: respectively inputting the input document vectors into at least one encoder layer for processing to form hidden layer sentence vectors; inputting the document vector and/or the hidden layer sentence vector into at least one encoder layer for processing to form a hidden layer vocabulary vector; inputting the hidden layer vocabulary vector and the hidden layer sentence vector into at least one decoder layer for processing to generate an output document vector; and carrying out parameter adjustment on the initialized document translation model according to difference information between the output document vector and expected output of the input document vector so as to train and obtain the document translation model.

According to one or more embodiments of the present disclosure, in the document translation apparatus, the document translation model is trained by the following modules: a sample acquisition module, configured to acquire a training document pair set, wherein each training document pair comprises a first document in the source language and a second document in the target language; a model determination module, configured to determine an initialized document translation model, wherein the initialized document translation model comprises a target layer for outputting a translation result document; and a model training module, configured to use a machine learning method to train and obtain the document translation model, taking a first document of a training document pair in the training document pair set as the input of the initialized document translation model and a second document of the training document pair as the expected output of the initialized document translation model.

According to one or more embodiments of the present disclosure, in the document translation apparatus, the initialized document translation model is a seq2seq model.

According to one or more embodiments of the present disclosure, in the document translation apparatus, the initialized document translation model includes at least two encoder layers and at least one decoder layer; the model training module further comprises: a first encoding submodule, configured to input the input document vectors into at least one encoder layer respectively for processing to form hidden layer sentence vectors; a second encoding submodule, configured to input the document vector and/or the hidden layer sentence vector into at least one encoder layer for processing to form a hidden layer vocabulary vector; a decoding submodule, configured to input the hidden layer vocabulary vector and the hidden layer sentence vector into at least one decoder layer for processing to generate an output document vector; and a model adjustment submodule, configured to perform parameter adjustment on the initialized document translation model according to difference information between the output document vector and the expected output of the input document vector, so as to train and obtain the document translation model.

The foregoing description covers only the preferred embodiments of the present disclosure and illustrates the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure. One example is a technical solution formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the embodiments of the present disclosure.

Claims

1. A document translation method, comprising:

acquiring a source document of a source language to be translated, wherein the source document comprises a plurality of sentences;

inputting the source document into a pre-trained document translation model to obtain a target document of a target language output by the document translation model, wherein the document translation model can directly translate the source document into the target document instead of translating the sentences of the source document sentence by sentence;

the document translation model is obtained through training the following steps:

acquiring a training document pair set, wherein the training document pair comprises a first document in the source language and a second document in the target language;

determining an initialized document translation model, wherein the initialized document translation model comprises a target layer for outputting a translation result document;

taking a first document of a training document pair in the training document pair set as input of the initialized document translation model, taking a second document of the training document pair as expected output of the initialized document translation model, and training to obtain the document translation model;

the initialized document translation model comprises at least two encoder layers and at least one decoder layer;

taking a first document in a training document pair in the training document pair set as an input of the initialized document translation model, taking a second document in the training document pair as an expected output of the initialized document translation model, and training to obtain the document translation model comprises the following steps:

respectively inputting the input document vectors into at least one encoder layer for processing to form hidden layer sentence vectors;

inputting the document vector and/or the hidden layer sentence vector into at least one encoder layer for processing to form a hidden layer vocabulary vector;

inputting the hidden layer vocabulary vector and the hidden layer sentence vector into at least one decoder layer for processing to generate an output document vector;

and carrying out parameter adjustment on the initialized document translation model according to difference information between the output document vector and expected output of the input document vector so as to train and obtain the document translation model.

2. The method of claim 1, wherein the initialized document translation model is a seq2seq model.

3. A document translation apparatus, comprising:

a source document acquisition unit configured to acquire a source document of a source language to be translated, the source document including a plurality of sentences;

a target document obtaining unit configured to input the source document to a pre-trained document translation model to obtain a target document in a target language output by the document translation model, the document translation model being capable of directly translating the source document into the target document instead of performing sentence-by-sentence translation on the plurality of sentences of the source document;

the document translation model is obtained through training of the following modules:

the sample acquisition module is used for acquiring a training document pair set, wherein the training document pair comprises a first document in the source language and a second document in the target language;

a model determination module for determining an initialized document translation model, wherein the initialized document translation model comprises a target layer for outputting a translation result document;

the model training module is used for taking, by a machine learning method, a first document in a training document pair in the training document pair set as input of the initialized document translation model, and taking a second document in the training document pair as expected output of the initialized document translation model, so as to train and obtain the document translation model;

the model training module further comprises:

the first encoding submodule is used for respectively inputting the input document vectors into at least one encoder layer for processing so as to form hidden layer sentence vectors;

the second encoding submodule is used for inputting the document vector and/or the hidden layer sentence vector into at least one encoder layer for processing so as to form a hidden layer vocabulary vector;

the decoding submodule is used for inputting the hidden layer vocabulary vector and the hidden layer sentence vector into at least one decoder layer for processing so as to generate an output document vector;

and the model adjustment sub-module is used for carrying out parameter adjustment on the initialized document translation model according to the difference information between the output document vector and the expected output of the input document vector so as to train and obtain the document translation model.

4. The apparatus of claim 3, wherein the initialized document translation model is a seq2seq model.

5. An electronic device, comprising:

a processor; and

a memory for storing executable instructions that, when executed by one or more processors, cause the electronic device to perform the method of any of claims 1-2.

6. A computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any of claims 1-2.