CN117910485A - Document translation method, device, electronic equipment and storage medium

Info

Publication number: CN117910485A
Application number: CN202311734240.XA
Authority: CN
Inventors: 邓乔波; 张云君
Original assignee: Iol Wuhan Information Technology Co ltd
Current assignee: Iol Wuhan Information Technology Co ltd
Priority date: 2023-12-14
Filing date: 2023-12-14
Publication date: 2024-04-19

Abstract

The invention provides a document translation method, a device, an electronic device and a storage medium, wherein the method comprises the following steps: dividing a document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; based on the target translation engines matched with the document fragments, the translation document corresponding to the document to be translated is obtained.

Description

Document translation method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a document translation method, a device, an electronic device, and a storage medium.

Background

With the acceleration of globalization and the wide application of Internet technology, the demand for cross-language communication is increasing. Machine translation, an important technique for supporting globalization, aims to convert text from a source language to a target language.

Existing machine translation tools generally provide only one fixed machine translation engine when processing the same project or the same document. In some scenarios, however, the content of the document to be translated provided by the user spans many fields and varies in style. For example, a project document may contain financial, legal, and other related content. Content in different fields often has its own specific terms and expressions, so processing all of it with a single general-purpose engine has clear limitations.

Disclosure of Invention

The invention provides a document translation method, a device, electronic equipment and a storage medium, which address a defect of the prior art: when the content of a document to be translated spans many fields and varies in style, translating it with a general-purpose engine produces inaccurate results.

The invention provides a document translation method, which comprises the following steps:

Dividing a document to be translated into a plurality of document fragments;

inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model;

Matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines;

And obtaining a translation document corresponding to the document to be translated based on the target translation engine matched with each document fragment.

According to the document translation method provided by the invention, the training step of the text corpus recognition model comprises the following steps:

acquiring unsupervised corpus data, wherein the unsupervised corpus data comprises text data corresponding to each preset industry field;

Labeling the text data with corresponding industry field labels to obtain standard corpus data;

training the first text corpus recognition model by adopting a mask language modeling mode based on the unsupervised corpus data to obtain a second text corpus recognition model;

Training the second text corpus recognition model based on the standard corpus data to obtain the trained text corpus recognition model.

According to the document translation method provided by the invention, the target translation engine matched with each document fragment is used for obtaining the translation document corresponding to the document to be translated, and the method comprises the following steps:

acquiring a translation result corresponding to each document fragment based on a target translation engine matched with each document fragment;

And according to the segment splicing relation of the plurality of document segments, splicing the translation results corresponding to the document segments in sequence to obtain the translation document corresponding to the document to be translated.

According to the document translation method provided by the invention, the target translation engine matched with each document fragment is used for obtaining the translation result corresponding to each document fragment, and the method comprises the following steps:

Creating a thread pool, wherein the thread pool is used for transmitting each document fragment and a target translation engine matched with each document fragment as parameters to a translation function;

And sending a request carrying the document fragments corresponding to the target translation engines, and acquiring translation results corresponding to the document fragments returned by the target translation engines.

According to the document translation method provided by the invention, before the document to be translated is divided into a plurality of document fragments, the method further comprises the following steps:

Preprocessing the document to be translated;

Wherein the preprocessing includes at least one of document parsing and document format conversion.

According to the document translation method provided by the invention, after the translated document corresponding to the document to be translated is obtained, the method further comprises the following steps:

And correspondingly inserting the translation document into the document to be translated, and maintaining the same format and directory structure as the document to be translated.

The invention also provides a document translation device, which comprises:

A document segmentation unit for segmenting a document to be translated into a plurality of document fragments;

the corpus recognition unit is used for inputting a plurality of document fragments into the text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model;

The engine matching unit is used for matching the industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, and the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines;

And the document translation unit is used for obtaining a translation document corresponding to the document to be translated based on the target translation engine matched with each document fragment.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the document translation method as described in any of the above when executing the program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document translation method as described in any of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements a document translation method as described in any of the above.

The document translation method, the device, the electronic equipment and the storage medium provided by the invention divide the document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; based on the target translation engines matched with the document fragments, the translation document corresponding to the document to be translated is obtained.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow diagram of a document translation method provided by the present invention;

FIG. 2 is a schematic flow chart of a training method of a text corpus recognition model provided by the invention;

FIG. 3 is a schematic diagram of a document translation apparatus according to the present invention;

Fig. 4 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a schematic flow chart of a document translation method provided by the invention, and as shown in FIG. 1, the invention provides a document translation method comprising the following steps:

Step 110, dividing a document to be translated into a plurality of document fragments;

Specifically, a document to be translated refers to a document that needs to be converted from one language to another, such as an English document to be translated into Chinese.

In one example, in segmenting a document to be translated into a plurality of document segments, the document to be translated may be partitioned into a plurality of sentence segments based on end punctuation (e.g., period, question mark, exclamation mark, etc.) of the sentence. Each sentence fragment represents a complete sentence of the original document. The document to be translated can be divided into a plurality of paragraph fragments according to the logical structure of the paragraphs and separators such as line breaks or empty lines among the paragraphs. Each paragraph fragment represents a paragraph of the original document.
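The two segmentation strategies described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names and the exact punctuation set are assumptions, and the sentence splitter is deliberately naive (it would also split at a decimal point such as "3.14").

```python
import re
from typing import List

# Assumed set of end-of-sentence marks (Western and Chinese punctuation).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text: str) -> List[str]:
    """Split text into sentence fragments, keeping the end punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraph fragments on blank (or whitespace-only) lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


if __name__ == "__main__":
    print(split_sentences("Hello world. How are you? Fine!"))
    print(split_paragraphs("Para one.\n\nPara two."))
```

Keeping each fragment in list order matters later, because the translated fragments are spliced back together in the same order.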

Step 120, inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model;

In this embodiment, all industry field labels that the text corpus recognition model can output are predefined. These labels generally represent the specific industry field to which the text relates, such as the following 10 industry field labels: international engineering, legal contracts, petrochemical industry, machinery manufacturing, finance and economics, biomedicine, online literature, aviation and defense, IT electronics, and general field.

In practice, a training data set may be prepared for the text corpus recognition model. The data set should include text samples from various industry fields: a sample that involves financial terms such as funding requirements or financing carries the finance label, a sample that involves technical parameters or process flows carries the machinery manufacturing label, and a sample that involves online novels or literary works carries the online literature label. Training the model on text samples from these different industry fields enables it to learn the differences between texts of different fields.

Step 130, matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines;

According to the translation requirements, mapping relations between the preset industry fields and the preset translation engines are configured in the system, as shown in Table 1:

TABLE 1

Preset industry field	Preset translation engine
Petrochemical industry	Engine A
IT electronics	Engine B
Biomedicine	Engine C

If a document fragment such as "The production process of this chemical plant is environmentally friendly and energy-saving" is classified as text in the petrochemical industry field, engine A can be selected for translation; that is, engine A is the target translation engine matched with document fragments labeled "petrochemical industry".
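The lookup against the preset mapping relation table can be illustrated with a plain dictionary. This is a sketch under assumptions: the keys are normalized spellings of the three rows of Table 1, and the fallback engine name `DEFAULT_ENGINE` is hypothetical, since the source does not say what happens when a label has no entry in the table.

```python
from typing import Dict

# Preset mapping relation table, following the three rows of Table 1.
ENGINE_BY_FIELD: Dict[str, str] = {
    "Petrochemical industry": "Engine A",
    "IT electronics": "Engine B",
    "Biomedicine": "Engine C",
}

# Hypothetical fallback for labels without a table entry (not in the source).
DEFAULT_ENGINE = "General engine"


def match_engine(field_label: str) -> str:
    """Return the target translation engine for an industry field label."""
    return ENGINE_BY_FIELD.get(field_label, DEFAULT_ENGINE)


if __name__ == "__main__":
    print(match_engine("Petrochemical industry"))
```

A dictionary keeps the mapping configurable: adding an engine for a new industry field is a one-line change that does not touch the matching logic.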

Step 140, obtaining a translation document corresponding to the document to be translated based on the target translation engine matched with each document fragment.

The target translation engine matched with each document fragment is typically obtained by comparing the industry field label of the fragment with the preset mapping relation table. For example, assuming that a preset industry field label is "medical" and that a translation engine "engine A" has been preset as applicable to this field, all document fragments labeled "medical" are automatically matched to "engine A".

Once the target translation engines are determined, each document fragment is input to its corresponding target translation engine for translation.

The method provided by the embodiment of the invention divides the document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; based on the target translation engines matched with the document fragments, the translation document corresponding to the document to be translated is obtained.

Based on the above embodiments, fig. 2 is a flow chart of a training method of a text corpus recognition model according to the present invention, and as shown in fig. 2, the training steps of the text corpus recognition model include:

step 210, acquiring unsupervised corpus data, wherein the unsupervised corpus data comprises text data corresponding to each preset industry field;

Step 220, labeling the text data with corresponding industry field labels to obtain standard corpus data;

step 230, training the first text corpus recognition model by using a mask language modeling mode based on the unsupervised corpus data to obtain a second text corpus recognition model;

Step 240, training the second text corpus recognition model based on the standard corpus data to obtain the trained text corpus recognition model.

Wherein, the unsupervised corpus data refers to text data which is not manually marked or classified. Such data is typically collected from a variety of sources, possibly including large amounts of text, news articles, social media content, etc. on a network.

In this embodiment, the collected unsupervised corpus data covers the industry fields corresponding to all preset translation engines. For example, as shown in Table 1 above, the system includes three preset translation engines: engine A, engine B and engine C, where engine A corresponds to the petrochemical industry field, engine B corresponds to the IT electronics field, and engine C corresponds to the biomedical field. Text data that has not been manually labeled or classified therefore needs to be collected separately for the petrochemical industry field, the IT electronics field and the biomedical field for model training.

In this embodiment, model training mainly comprises three stages:

In the first stage, a portion of samples is selected from the text data of each preset industry field, and the standard corpus data is constructed by manually labeling each sample with the industry field label to which it belongs.

In the second stage, a first text corpus recognition model is obtained; it is an original natural language processing model for text classification. The first model is pre-trained on the original unsupervised corpus data by masked language modeling (MLM): some words in the input text are randomly masked, and the model then predicts the masked words. This makes the text corpus recognition model more familiar with field-specific language and context.

In the third stage, the second text corpus recognition model obtained by pre-training in the second stage is fine-tuned on the constructed standard corpus data, yielding the text corpus recognition model finally used for text field classification.
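The random masking used in the second stage can be sketched without any model code. The snippet below only prepares one masked training example; the `[MASK]` token, the 15% default masking rate and the whitespace tokenization are assumptions borrowed from common BERT-style practice, not values stated in the source, and the usual refinement of sometimes keeping or randomly replacing a selected token is omitted.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # assumed mask token, as used by BERT-style models


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Randomly replace tokens with MASK.

    Returns the masked token list and the masked positions; the model is
    then trained to predict the original token at each masked position.
    """
    rng = random.Random(seed)  # seeded for reproducible examples
    masked = list(tokens)
    positions: List[int] = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = MASK
            positions.append(i)
    return masked, positions


if __name__ == "__main__":
    sample = "the reactor operates at high pressure".split()
    print(mask_tokens(sample, mask_prob=0.5))
```

Because this objective needs no labels, it can be run over the whole unsupervised corpus before the smaller labeled set is used for fine-tuning in the third stage.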

It will be appreciated that the text corpus recognition model herein may be a natural language processing model that may be used for text classification, such as a BERT model, which is not limited in this embodiment.

According to the method provided by the embodiment of the invention, the initial text corpus recognition model is first pre-trained on unlabeled field corpus and then fine-tuned on labeled corpus, which improves the subsequent classification accuracy of the model.

Based on the above embodiment, obtaining the translated document corresponding to the document to be translated based on the target translation engine matched with each document fragment includes:

In this embodiment, after the translation results corresponding to the document fragments are obtained, the translation results are spliced according to the correct order, so as to obtain the complete translation of the document to be translated.

In addition, because machine translation may contain errors, splicing involves more than restoring the original order of the document fragments: some post-checking and correction can be performed while splicing the translation results to ensure the accuracy and readability of the translated document, for example by means of natural language processing (NLP) techniques such as grammar checking, spell checking, and semantic analysis.

Based on the above embodiment, obtaining the translation result corresponding to each document fragment based on the target translation engine matched with each document fragment includes:
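The ordered splicing described above can be sketched as follows, assuming that each translation result is stored under the index of its source fragment. The function name, the index-keyed dictionary and the separator parameter are illustrative choices, not details from the source; the post-checking step is not shown.

```python
from typing import Dict


def splice(translations: Dict[int, str], separator: str = " ") -> str:
    """Join translated fragments in ascending order of fragment index."""
    return separator.join(translations[i] for i in sorted(translations))


if __name__ == "__main__":
    # Results may arrive out of order when engines respond at different speeds.
    results = {2: "third.", 0: "First.", 1: "Second,"}
    print(splice(results))
```

For paragraph-level fragments a separator such as a blank line would be a more natural choice than a single space.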

In this embodiment, in order to improve the translation efficiency, a thread pool may be created for concurrently processing the translation tasks of multiple document fragments, and the document fragments and the corresponding target translation engines are passed to the translation function as parameters through the created thread pool. Thus, the translation task can be executed in parallel in a plurality of threads, and the translation efficiency is improved.

The requests library of Python may then be used to send an HTTP request containing the document fragment to be translated to the assigned target translation engine. In this way, the translation tasks can be distributed to each target translation engine in parallel for processing. And finally, collecting the translation results, and storing the translation results according to the sequence of the original document fragments.
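A minimal sketch of this parallel dispatch, using the standard-library `concurrent.futures.ThreadPoolExecutor`, is shown below. To keep the example self-contained, each engine is modeled as a plain Python callable; the two stub engines are placeholders. In a real deployment each callable would presumably wrap an HTTP call (for example with `requests.post`) to that engine's endpoint, whose URL and payload format the source does not specify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Engine = Callable[[str], str]  # takes a fragment, returns its translation


def translate_fragment(job: Tuple[str, Engine]) -> str:
    """Translation function: receives a (fragment, engine) pair as its parameter."""
    fragment, engine = job
    return engine(fragment)


def translate_all(
    fragments: List[str], engines: List[Engine], max_workers: int = 4
) -> List[str]:
    """Translate fragments concurrently; results keep the fragment order."""
    jobs = list(zip(fragments, engines))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, not completion order.
        return list(pool.map(translate_fragment, jobs))


# Stub engines standing in for real translation services (placeholders).
def engine_a(text: str) -> str:
    return "[A] " + text


def engine_b(text: str) -> str:
    return "[B] " + text


if __name__ == "__main__":
    print(translate_all(["pump seal", "cache line"], [engine_a, engine_b]))
```

Since `Executor.map` preserves input order, the collected results already follow the sequence of the original document fragments and can be spliced directly.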

According to the method provided by the embodiment of the invention, the high-efficiency translation using a plurality of translation engines is realized through the parallel translation mode.

Based on the above embodiment, before the dividing the document to be translated into the plurality of document fragments, the method further includes:

Preprocessing the document to be translated;

For example, the lxml library or the BeautifulSoup library of Python may be used to parse the input document and remove special characters and tags. Natural language processing tasks often require preprocessing the document to be translated to eliminate noise and irrelevant content, making it easier to process and analyze. Formats such as HTML, XML, or Word documents may contain a large number of tags and special characters that are meaningless to natural language processing tasks and may even interfere with text processing. Therefore, lxml or BeautifulSoup is used to remove these tags and special characters and extract only the text content.

In addition, documents to be translated of different types may use different encodings. Therefore, in this embodiment, the input document is converted into a unified encoding (such as UTF-8) to eliminate these differences and facilitate subsequent processing.

With the method provided by the embodiment of the invention, the above preprocessing ensures the readability and portability of the document to be translated across different environments and avoids encoding errors during processing.
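The two preprocessing operations, tag removal and encoding unification, can be sketched as follows. To stay dependency-free, this example swaps in the standard-library `html.parser` module in place of lxml or BeautifulSoup; the function names are illustrative, and the source encoding is assumed to be known in advance rather than detected.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards all tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def extract_text(markup: str) -> str:
    """Return the text content of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    print(extract_text("<p>Hello</p><p>world</p>"))
    print(to_utf8("中文".encode("gbk"), "gbk").decode("utf-8"))
```

Note that this simple extractor also keeps the contents of `<script>` and `<style>` elements; a production parser would typically filter those out.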

Based on the above embodiment, after obtaining the translated document corresponding to the document to be translated, the method further includes:

For example, if the translation result is to be incorporated into the original Word document, we can read the original Word document using the python-docx library, then insert the translation document into the corresponding location, and finally save the modified document as a new Word document. Similarly, if we want to incorporate the translation result into the original PDF document, we can read the original PDF document using fpdf library, then insert the translation document into the corresponding location, and finally save the modified document as a new PDF document. In this process, the format and directory structure of the original document need to be kept unchanged to ensure that the reader can easily view and understand the contents of the document.

Based on any of the above embodiments, fig. 3 is a schematic structural diagram of a document translation apparatus provided by the present invention, as shown in fig. 3, the apparatus includes:

a document segmentation unit 310 for segmenting a document to be translated into a plurality of document fragments;

a corpus identifying unit 320, configured to input a plurality of document segments into a text corpus identifying model, so as to obtain industry field labels to which each document segment output by the text corpus identifying model belongs;

The engine matching unit 330 is configured to match the industry domain label corresponding to each document fragment with a preset mapping relation table, so as to obtain a target translation engine matched with each document fragment, where the preset mapping relation table includes a mapping relation between a preset industry domain and a preset translation engine;

the document translation unit 340 is configured to obtain a translated document corresponding to the document to be translated based on the target translation engine matched by each document fragment.

The device provided by the embodiment of the invention divides the document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; based on the target translation engines matched with the document fragments, the translation document corresponding to the document to be translated is obtained.

Based on any one of the above embodiments, the apparatus further includes a model training unit, configured to obtain unsupervised corpus data, where the unsupervised corpus data includes text data corresponding to each preset industry field; labeling the text data with corresponding industry field labels to obtain standard corpus data; training the first text corpus recognition model by adopting a mask language modeling mode based on the unsupervised corpus data to obtain a second text corpus recognition model; training the second text corpus recognition model based on the standard corpus data to obtain the trained text corpus recognition model.

Based on any one of the above embodiments, the document translation unit is further configured to obtain a translation result corresponding to each document fragment based on a target translation engine matched with each document fragment; and according to the segment splicing relation of the plurality of document segments, splicing the translation results corresponding to the document segments in sequence to obtain the translation document corresponding to the document to be translated.

Based on any one of the above embodiments, the document translation unit is further configured to create a thread pool, where the thread pool is configured to pass, as parameters, each document fragment and a target translation engine that matches each document fragment to a translation function; and sending a request carrying the document fragments corresponding to the target translation engines, and acquiring translation results corresponding to the document fragments returned by the target translation engines.

Based on any one of the above embodiments, the apparatus further includes a preprocessing unit, configured to preprocess the document to be translated; wherein the preprocessing includes at least one of document parsing and document format conversion.

Based on any one of the above embodiments, the apparatus further includes a document display unit, configured to insert the translated document into the document to be translated correspondingly, and maintain the same format and directory structure as the document to be translated.

Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 410, communication interface (Communications Interface) 420, memory 430, and communication bus 440, wherein processor 410, communication interface 420, and memory 430 communicate with each other via communication bus 440. Processor 410 may invoke logic instructions in memory 430 to perform a document translation method comprising: dividing a document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; and obtaining a translation document corresponding to the document to be translated based on the target translation engine matched with each document fragment.

Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a document translation method provided by the methods described above, the method comprising: dividing a document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; and obtaining a translation document corresponding to the document to be translated based on the target translation engine matched with each document fragment.

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a document translation method provided by the above methods, the method comprising: dividing a document to be translated into a plurality of document fragments; inputting a plurality of document fragments into a text corpus recognition model to obtain industry field labels of the document fragments output by the text corpus recognition model; matching industry field labels corresponding to the document fragments with a preset mapping relation table to obtain target translation engines matched with the document fragments, wherein the preset mapping relation table comprises mapping relations between preset industry fields and preset translation engines; and obtaining a translation document corresponding to the document to be translated based on the target translation engine matched with each document fragment.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software together with a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the respective embodiments or in some parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A document translation method, comprising:

dividing a document to be translated into a plurality of document fragments;

2. The document translation method according to claim 1, wherein the training step of the text corpus recognition model includes:

3. The document translation method according to claim 1, wherein the obtaining a translated document corresponding to the document to be translated based on the target translation engine matched with each document fragment includes:

4. The document translation method according to claim 3, wherein the obtaining, based on the target translation engine matched with each document fragment, a translation result corresponding to each document fragment includes:

5. The document translation method according to any one of claims 1 to 4, wherein, before dividing the document to be translated into a plurality of document fragments, the method further comprises:

preprocessing the document to be translated;

6. The document translation method according to any one of claims 1 to 4, wherein, after obtaining a translated document corresponding to the document to be translated, the method further comprises:

7. A document translation apparatus, comprising:

8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the document translation method of any one of claims 1 to 6 when executing the program.

9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the document translation method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program which, when executed by a processor, implements the document translation method of any one of claims 1 to 6.