CN112784613A - Document batch translation method and device, electronic equipment and storage medium - Google Patents

Document batch translation method and device, electronic equipment and storage medium

Info

Publication number
CN112784613A
Authority
CN
China
Prior art keywords
translation
document
translation task
documents
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110126066.5A
Other languages
Chinese (zh)
Inventor
张芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202110126066.5A priority Critical patent/CN112784613A/en
Publication of CN112784613A publication Critical patent/CN112784613A/en
Priority to PCT/CN2021/126664 priority patent/WO2022160819A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of computers, and provides a document batch translation method, a device, electronic equipment and a storage medium, wherein the method comprises the following steps: determining a plurality of documents to be translated; decomposing any document based on the document structure of the document, and determining a translation task block corresponding to the document; aggregating translation task blocks corresponding to all the documents, and determining translation task packages corresponding to the plurality of documents; and determining translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents. The method, the device, the electronic equipment and the storage medium provided by the invention realize batch translation of a plurality of documents and improve the document translation efficiency.

Description

Document batch translation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for batch translation of documents, electronic equipment and a storage medium.
Background
In a large document translation project, a plurality of documents to be translated are generally distributed to a plurality of translators for parallel translation, so that translation results can be obtained quickly and accurately. In the prior art, the documents to be translated are distributed mainly by hand, which leads to unreasonable document distribution, long translation time, low translation efficiency, and poor accuracy of the translation results.
Disclosure of Invention
The invention provides a document batch translation method, a document batch translation device, electronic equipment and a storage medium, which are used for solving the technical problems of unreasonable document distribution, long translation time and low translation efficiency in the prior art.
The invention provides a document batch translation method, which comprises the following steps:
determining a plurality of documents to be translated;
decomposing any document based on the document structure of the document, and determining a translation task block corresponding to the document;
aggregating translation task blocks corresponding to all the documents, and determining translation task packages corresponding to the plurality of documents;
and determining translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
According to the document batch translation method provided by the invention, decomposing any document based on the document structure of any document and determining the translation task block corresponding to any document comprise the following steps:
carrying out fragment division on any document, and determining all fragments of any document;
determining a plurality of continuous segments corresponding to each level in any document based on the document structure of any document and all segments of any document;
and determining the translation task block corresponding to any document based on the word number range of the translation task block and a plurality of continuous segments corresponding to each level in any document.
According to the document batch translation method provided by the invention, the aggregation of the translation task blocks corresponding to the documents and the determination of the translation task packages corresponding to the plurality of documents comprises the following steps:
clustering translation task blocks corresponding to all documents based on semantic similarity among the translation task blocks to obtain a plurality of semantic similar classes;
aggregating all translation task blocks in any semantic similarity class based on the semantic similarity among all translation task blocks in any semantic similarity class and the word number of each translation task block to obtain a translation task package corresponding to any semantic similarity class;
and determining translation task packages corresponding to the plurality of documents based on the translation task package corresponding to each semantic similarity class.
According to the document batch translation method provided by the invention, the clustering is carried out on the translation task blocks corresponding to each document based on the semantic similarity among the translation task blocks to obtain a plurality of semantic similar classes, and the method comprises the following steps:
and merging all semantic similarity classes only containing one translation task block.
According to the document batch translation method provided by the invention, based on the semantic similarity between each translation task block in any semantic similarity class and the word number of each translation task block, each translation task block in any semantic similarity class is aggregated to obtain a translation task package corresponding to any semantic similarity class, and the method comprises the following steps:
establishing an undirected graph by taking each translation task block in any semantic similarity class as a vertex; edges in the undirected graph are semantic similarity among all translation task blocks, and vertex weight in the undirected graph is word number of each translation task block;
traversing the undirected graph edge-first, and aggregating the translation task blocks corresponding to a plurality of vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to any semantic similarity class are obtained; the preset condition is that the vertex weight sum falls within the word number range of the translation task package.
According to the document batch translation method provided by the invention, the step of carrying out fragment division on any document and determining all fragments of any document comprises the following steps:
and segmenting any document based on paragraph identifiers and/or punctuation marks in any document, and determining all segments of any document.
According to the document batch translation method provided by the invention, the determining of the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents comprises the following steps:
respectively carrying out text similarity matching on any translation task packet and historical translation task packets of a plurality of translators, and determining a translator corresponding to any translation task packet;
and determining the translation results of the plurality of documents based on the translation result determined by the translator corresponding to each translation task package.
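The translator-matching step can be illustrated with a minimal sketch. It assumes that "text similarity" is approximated by the character-level ratio of Python's `difflib.SequenceMatcher`, which is only a stand-in for whatever similarity measure an actual implementation would use. The names `pick_translator` and `history` and the sample texts are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real text similarity measure."""
    return SequenceMatcher(None, a, b).ratio()


def pick_translator(package_text: str, history: Dict[str, List[str]]) -> str:
    """Return the translator whose historical packages best match the new package."""

    def best_score(name: str) -> float:
        # Best similarity between the new package and any past package of this translator.
        return max((text_similarity(package_text, past) for past in history[name]), default=0.0)

    # sorted() makes the choice deterministic when two translators score equally.
    return max(sorted(history), key=best_score)


history = {
    "translator_a": ["annual loan interest rate report"],
    "translator_b": ["tank engine maintenance manual"],
}
chosen = pick_translator("loan interest rate report", history)
```

In this toy data the new package is almost a substring of a package previously handled by `translator_a`, so that translator is selected.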
The invention provides a document batch translation device, which comprises:
a determination unit configured to determine a plurality of documents to be translated;
a decomposition unit configured to decompose any document based on the document structure of the document and determine the translation task block corresponding to the document;
the aggregation unit is used for aggregating the translation task blocks corresponding to the documents and determining the translation task packages corresponding to the documents;
and the translation unit is used for determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the document batch translation method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for batch translation of documents as described in any one of the above.
According to the document batch translation method, the device, the electronic equipment and the storage medium, each document is decomposed according to the document structure, the translation task block corresponding to each document is determined, the translation task blocks corresponding to the documents are aggregated, the translation task packages corresponding to a plurality of documents are determined, and then the translation results of the documents are determined, so that batch translation of the documents is realized.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a document batch translation method according to the present invention;
FIG. 2 is a schematic structural diagram of a document batch translation apparatus provided in the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a schematic flow chart of a document batch translation method provided by the present invention. As shown in FIG. 1, the method includes:
Step 110, determining a plurality of documents to be translated.
Specifically, a document is a text to be translated, and the language of the document may be Chinese, English, Japanese, French, German, Arabic, and the like. The embodiment of the present invention does not specifically limit the language of the document. For example, the plurality of documents to be translated may all be in the same language and need to be translated into another language.
Step 120, decomposing the document based on the document structure of any document, and determining the translation task block corresponding to the document.
Specifically, the translation task block is a collection of several continuous segments in the same document. The segment is a basic unit constituting a document, and may be a natural paragraph or a sentence. A document to be translated can be divided into a plurality of segments to be translated. A word count range may be set for the translation task block such that the number of words of the translation task block is within a certain range. For example, the word count range of a translation task block may be set to [500, 2000]. The size of the word number range can be set according to actual conditions.
A document to be translated may be divided into translation task blocks. The basic principle for dividing the translation task block is as follows: the segments to be translated which are as consistent as possible in context semantics are divided into the same translation task block. Therefore, sequential extraction of the segments in the document to be translated can ensure the continuity of document translation.
The document can be decomposed according to the document structure of the document to be translated, and the translation task block corresponding to the document is determined. The document structure is a document hierarchical structure, and the corresponding document structure information includes document titles, titles of each hierarchy and sub-hierarchies thereof, the number of fragments and words under each hierarchy and sub-hierarchies thereof, and the like. After decomposition, the document number where the translation task block is located and the block number under the document number can also be determined. According to the document number and the block number of the translation task block, the specific position of the translation task block in a plurality of documents can be quickly determined.
Step 130, aggregating the translation task blocks corresponding to the documents, and determining the translation task packages corresponding to the plurality of documents.
Specifically, after a plurality of documents to be translated are decomposed, translation task blocks corresponding to the documents can be obtained. In a large translation project, there is some inherent relationship between documents. For example, document a and document B belong to different technical documents of the same product, and the partial content appearing in document a may be the same as or similar to the partial content appearing in document B, or there is a mutual reference relationship on the contents, and the like.
Aggregation refers to grouping translation task blocks that are scattered across multiple documents according to their intrinsic relationships. The result of the aggregation is a translation task package. A translation task package comprises a plurality of translation task blocks that are intrinsically related to each other, that is, the degree of intrinsic association between the translation task blocks in a translation task package is high. The intrinsic association here may include semantic similarity. A word count range may be set for the translation task package such that the number of words of the translation task package is within a certain range. For example, the word count range of the translation task package may be set to [5000, 10000]. The size of the word number range can be set according to actual conditions.
Step 140, determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
Specifically, after the translation task packages corresponding to the plurality of documents are obtained, translation may be performed with the translation task package as the basic unit. For example, tasks may be assigned to multiple translators based on the translation task packages. The translation results of the translation task packages are then combined according to the order of the translation task blocks, thereby obtaining the translation results of the plurality of documents. A translator here refers to a person who translates documents.
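Combining package-level results back into per-document translations can be sketched as follows. This is a minimal illustration, assuming each translated block is keyed by a (document number, block number) pair as described for step 120. The function name `reassemble` and the sample strings are hypothetical.

```python
from typing import Dict, List, Tuple

BlockId = Tuple[str, int]  # (document number, block number), e.g. ("S", 1) for block S-1


def reassemble(package_results: List[Dict[BlockId, str]]) -> Dict[str, str]:
    """Merge translated blocks from all packages and restore per-document block order."""
    merged: Dict[BlockId, str] = {}
    for result in package_results:
        merged.update(result)

    documents: Dict[str, List[str]] = {}
    # Sorting the keys orders blocks by document number, then by block number.
    for doc, block_no in sorted(merged):
        documents.setdefault(doc, []).append(merged[(doc, block_no)])
    return {doc: "\n".join(parts) for doc, parts in documents.items()}


# Two packages, each mixing blocks from documents S and T.
packages = [
    {("S", 1): "S block 1", ("T", 2): "T block 2"},
    {("S", 2): "S block 2", ("T", 1): "T block 1"},
]
translated = reassemble(packages)
```

Because a package mixes blocks from different documents, the block identifiers are what make it possible to put each translated block back in its original position.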
According to the document batch translation method provided by the embodiment of the invention, each document is decomposed according to the document structure, the translation task block corresponding to each document is determined, the translation task blocks corresponding to each document are aggregated, the translation task packages corresponding to a plurality of documents are determined, and the translation results of the plurality of documents are further determined, so that batch translation of the plurality of documents is realized.
Based on the above embodiment, step 120 includes:
carrying out fragment division on any document, and determining all fragments of the document;
determining a plurality of continuous segments corresponding to each level in the document based on the document structure of the document and all segments of the document;
and determining the translation task block corresponding to the document based on the word number range of the translation task block and a plurality of continuous segments corresponding to each level in the document.
Specifically, the number of translation task blocks corresponding to each document is determined by the word number range of the translation task block, and may be one or more. For example, if the total word count of a document does not exceed the upper limit of the word number range of a translation task block, the document may be treated as a single translation task block; if the total word count exceeds that upper limit, the document may be decomposed into a plurality of translation task blocks.
The translation task block is determined by taking the segment as a basic unit. For example, for one of the documents to be translated, the document may be segmented to obtain a plurality of segments to be translated, which may be represented by a set as:
S = {S_1, S_2, …, S_n}
wherein S is a document to be translated, S_i is the i-th segment to be translated, n is the number of segments to be translated, and 1 ≤ i ≤ n.
According to the document structure of any document and all the segments of the document, a plurality of continuous segments corresponding to each level in the document are determined. For example, the document S to be translated includes 5 segments, and the document structure thereof is divided into two levels, i.e., chapter 1 and chapter 2, each level is further divided into two sub-levels, i.e., chapter 1 includes sections 1.1 and 1.2, and chapter 2 includes sections 2.1 and 2.2. Section 1.1 includes segment S_1, section 1.2 includes segment S_2, section 2.1 includes segment S_3, and section 2.2 includes segments S_4 and S_5.
The translation task block corresponding to the document is determined according to the word number range of the translation task block and the plurality of continuous segments corresponding to each level in the document. For example, the word count range for a translation task block may be set to [500, 2000]. The word counts of the segments S_1, S_2, S_3, S_4 and S_5 in the document S to be translated are 200, 300, 1600, 300 and 800, respectively. The document S can then be broken down into 3 translation task blocks, labeled S-1, S-2 and S-3 by document number and block number, where translation task block S-1 includes segments S_1 and S_2, translation task block S-2 includes segments S_3 and S_4, and translation task block S-3 includes segment S_5.
When any document is subjected to translation task block decomposition, the following principles can be adopted:
within the word number range of the translation task block, dividing continuous document segments of the same level into the same translation task block, if the word number of the document segments of the same level cannot reach the lower limit of the word number range, continuing to extract subsequent segments until the word number of the translation task block reaches the lower limit of the word number range, and if the word number of the document segments of the same level reaches the upper limit of the word number range, dividing the next document segment exceeding the upper limit of the word number range into the next translation task block;
within the word number range of the translation task block, dividing continuous document segments of different levels into the same translation task block, if the word number of the document segments of different levels cannot reach the lower limit of the word number range, continuing to extract subsequent segments until the word number of the translation task block reaches the lower limit of the word number range, and if the word number of the document segments of different levels reaches the upper limit of the word number range, dividing the next document segment exceeding the upper limit of the word number range into the next translation task block.
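The decomposition principles can be sketched as a greedy pass over the segment word counts. This is a simplified reading, not the patented procedure itself: it fills each block until the next segment would exceed the upper limit and does not model level boundaries. The function name `decompose` is hypothetical. A short final block, or a single segment longer than the upper limit, is left as its own block.

```python
from typing import List


def decompose(word_counts: List[int], upper: int = 2000) -> List[List[int]]:
    """Group consecutive segments (given by their word counts) into translation task blocks."""
    blocks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for index, words in enumerate(word_counts):
        # Close the current block when the next segment would push it past the upper limit.
        if current and total + words > upper:
            blocks.append(current)
            current, total = [], 0
        current.append(index)
        total += words
    if current:
        blocks.append(current)
    return blocks


# Word counts of segments S_1..S_5 from the example in the description.
counts = [200, 300, 1600, 300, 800]
blocks = decompose(counts)
# Label each block by document number and block number, as in the description.
labels = {f"S-{n}": [f"S{i + 1}" for i in block] for n, block in enumerate(blocks, start=1)}
```

With the word counts from the example, this reproduces the split into S-1 (S1, S2), S-2 (S3, S4) and S-3 (S5), with block sizes of 500, 1900 and 800 words.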
The document batch translation method provided by the embodiment of the invention decomposes the document according to the document structure of the document, determines the translation task block corresponding to the document, provides a simple and easy document decomposition method, and reduces the complexity of a document batch translation algorithm.
Based on any of the above embodiments, step 130 includes:
clustering translation task blocks corresponding to all documents based on semantic similarity among the translation task blocks to obtain a plurality of semantic similar classes;
aggregating all translation task blocks in any semantic similarity class based on the semantic similarity among all translation task blocks and the word number of each translation task block to obtain a translation task package corresponding to the semantic similarity class;
and determining translation task packages corresponding to the plurality of documents based on the translation task package corresponding to each semantic similarity class.
Specifically, the translation task blocks obtained for the documents can be combined across documents according to the semantic similarity between the translation task blocks, so as to obtain the translation task packages corresponding to the plurality of documents to be translated. The method for determining the translation task packages can be divided into two parts: the first part performs cross-document clustering on the translation task blocks corresponding to all documents to obtain a plurality of semantic similarity classes, and the second part aggregates the translation task blocks in each semantic similarity class to obtain the translation task package corresponding to that semantic similarity class.
Before clustering the translation task blocks corresponding to the documents, the translation task blocks may be pre-classified using an existing classification model to obtain pre-classified semantic similarity classes. For example, the existing classification model may be a document content classification model that assigns each translation task block to a finance class, a military class, an engineering class, or the like.
The translation task blocks of all documents to be translated may be regarded as a set B. The set of translation task blocks that can be pre-classified is B_1, and the set of translation task blocks that cannot be pre-classified is B_2, where B_1 + B_2 = B.
The translation task blocks in set B_2 can be clustered, for example with the K-means algorithm, so that the translation task blocks in set B_2 are divided into several classes.
The embodiment of the invention provides a semantic similarity-based translation task block clustering method, which is used for classifying translation task blocks which cannot be classified through the existing classification model. The method comprises the following steps:
step one, determining the set B_2 = {B_{21}, B_{22}, …, B_{2m}} and a given semantic similarity threshold, where m is the number of translation task blocks in set B_2;
step two, taking translation task block B_{21} as the reference, calculating the semantic similarity between B_{21} and each of the other translation task blocks in set B_2, selecting all translation task blocks whose semantic similarity is greater than the given threshold, and combining them with B_{21} to form the first semantic similarity class E_1;
step three, among all translation task blocks in set B_2 other than those in E_1, obtaining the second semantic similarity class E_2 according to the method of step two;
step four, repeating the method of step two and step three until all translation task blocks in set B_2 have been assigned to a semantic similarity class, finally obtaining a plurality of semantic similarity classes.
For example, for the set B_2 = {B_{21}, B_{22}, B_{23}, B_{24}}, clustering yields the semantic similarity classes E_1 = {B_{21}, B_{22}} and E_2 = {B_{23}, B_{24}}.
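The four clustering steps can be sketched in a few lines. The sketch assumes a word-overlap (Jaccard) score as a stand-in for semantic similarity, since the description does not specify how similarity is computed. The names `jaccard` and `threshold_cluster`, the sample texts, and the threshold 0.5 are all hypothetical.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for a real semantic similarity model."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def threshold_cluster(blocks: List[str],
                      similarity: Callable[[str, str], float],
                      threshold: float) -> List[List[int]]:
    """Repeatedly take the first unassigned block as the reference and collect its neighbours."""
    remaining = list(range(len(blocks)))
    classes: List[List[int]] = []
    while remaining:
        seed, rest = remaining[0], remaining[1:]
        # Steps two and three: keep every unassigned block whose similarity to the seed exceeds the threshold.
        members = [seed] + [i for i in rest if similarity(blocks[seed], blocks[i]) > threshold]
        classes.append(members)
        # Step four: continue with the blocks that are still unassigned.
        remaining = [i for i in rest if i not in members]
    return classes


blocks = [
    "bank loan interest rate",
    "loan interest payment bank",
    "tank armor engine",
    "engine armor plating tank",
]
classes = threshold_cluster(blocks, jaccard, threshold=0.5)
```

With these toy blocks the result mirrors the example, with the first two blocks forming one class and the last two forming another. Note that the outcome depends on the order of the blocks, because each class is built around whichever block happens to come first.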
According to the document batch translation method provided by the embodiment of the invention, clustering and aggregation operations are carried out on the translation task blocks obtained for the documents according to the semantic similarity between the translation task blocks, yielding translation task packages with high internal semantic similarity, so that the rationality and accuracy of translation task division are improved, and the document translation efficiency is improved.
Based on any of the embodiments, clustering the translation task blocks corresponding to each document based on semantic similarity between the translation task blocks to obtain a plurality of semantic similarity classes, including:
and merging all semantic similarity classes only containing one translation task block.
Specifically, after the cross-document clustering is performed on the translation task blocks corresponding to the obtained documents, some semantic similarity classes only including one translation task block may be obtained. All semantic similarity classes containing only one translation task block can be merged, namely merged into one class, which can be called a tail class.
In large translation projects, semantic similarities may still exist between multiple translation task blocks in the tail class. And further aggregating a plurality of translation task blocks in the tail class to obtain a plurality of translation task packages.
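Merging the single-block classes into a tail class is a small bookkeeping step, sketched below. The representation of each class as a list of block indices and the name `merge_singletons` are assumptions made for illustration.

```python
from typing import List


def merge_singletons(classes: List[List[int]]) -> List[List[int]]:
    """Merge all classes that contain exactly one block into a single tail class."""
    tail = [cls[0] for cls in classes if len(cls) == 1]
    merged = [cls for cls in classes if len(cls) > 1]
    if tail:
        merged.append(tail)  # the tail class is placed last
    return merged


merged = merge_singletons([[0, 1], [2], [3], [4, 5]])
```

The tail class can then be aggregated into translation task packages in the same way as any other semantic similarity class.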
Based on any embodiment, aggregating each translation task block in any semantic similarity class based on the semantic similarity between each translation task block in any semantic similarity class and the word number of each translation task block to obtain a translation task package corresponding to the semantic similarity class includes:
establishing an undirected graph by taking each translation task block in the semantic similarity class as a vertex; edges in the undirected graph are semantic similarity among all the translation task blocks, and vertex weight in the undirected graph is word number of each translation task block;
traversing the undirected graph edge-first, and aggregating the translation task blocks corresponding to a plurality of vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to the semantic similarity class are obtained; the preset condition is that the vertex weight sum falls within the word number range of the translation task package.
Specifically, the semantic similarity classes may include the pre-classified semantic similarity classes, the clustered semantic similarity classes, the tail class, and the like.
Any semantic similarity class comprises k translation task blocks, where k is a positive integer, denoted as the set A = {a_1, a_2, …, a_k}. The word counts of the translation task blocks can be recorded as the set C = {c_1, c_2, …, c_k}, and the semantic similarities between the blocks as the set Z = {a_1a_2, a_1a_3, …, a_1a_k, a_2a_3, a_2a_4, …, a_2a_k, …, a_{k-1}a_k}. An undirected graph G is established by taking each translation task block in the semantic similarity class as a vertex. The edges of the undirected graph G are the semantic similarities between the translation task blocks, and the vertex weights are the word counts of the translation task blocks, so the undirected graph G can be represented as G = (A, Z, C).
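One possible in-memory representation of G = (A, Z, C) is sketched below. It assumes that word counts are obtained by whitespace splitting, which would not suit Chinese text, and that the similarity function is supplied by the caller. The name `build_graph` and the sample blocks are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple


def build_graph(blocks: Dict[str, str],
                similarity: Callable[[str, str], float]
                ) -> Tuple[List[str], Dict[FrozenSet[str], float], Dict[str, int]]:
    """Return (A, Z, C): vertices, edge similarities, and vertex weights (word counts)."""
    vertices = list(blocks)                                                # A
    weights = {name: len(text.split()) for name, text in blocks.items()}  # C
    # One undirected edge per unordered pair of blocks; frozenset ignores direction.
    edges = {frozenset((u, v)): similarity(blocks[u], blocks[v])          # Z
             for u, v in combinations(vertices, 2)}
    return vertices, edges, weights


def word_overlap(a: str, b: str) -> float:
    """Toy similarity: shared words divided by all distinct words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


A, Z, C = build_graph({"a1": "loan rate report", "a2": "loan rate", "a3": "engine manual"}, word_overlap)
```

For k blocks the edge set holds k(k-1)/2 similarities, matching the enumeration of Z in the description.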
The undirected graph G is traversed edge-first, and the weights of the traversed vertices are accumulated to obtain the vertex weight sum. If the vertex weight sum meets the preset condition, the translation task blocks corresponding to the traversed vertices are aggregated into one translation task package. The preset condition may be that the vertex weight sum falls within the word number range of the translation task package. These steps are repeated until the translation task packages corresponding to the semantic similarity class are obtained.
The above method can be represented by the following algorithm steps:
step one, initializing a set Z_new = Z;
step two, setting the set Z_new_del of edges to be removed and the overflow set Z_new_overflow to empty;
step three, selecting the edge X with the maximum semantic similarity from the elements obtained by subtracting the set Z_new_overflow from the set Z_new;
step four, calculating the sum of the weights of the vertices corresponding to the edge X and to the elements in the set Z_new_del;
step five, if the weight sum is smaller than the lower limit of the word count range of the translation task package, adding the edge X to the set Z_new_del, removing the elements in the set Z_new_del from the set Z_new to obtain an updated set Z_new, and going to step three;
step six, if the weight sum is larger than the upper limit of the word count range of the translation task package, adding the edge X to the set Z_new_overflow, and going to step three;
step seven, aggregating the translation task blocks corresponding to the vertices of the edges in the set Z_new_del into the same translation task package;
step eight, if Z_new is not empty and the weight sum of the vertices corresponding to all edges in Z_new is larger than the lower limit of the word count range of the translation task package, going to step two;
step nine, if Z_new is not empty and the weight sum of the vertices corresponding to all edges in Z_new is smaller than the lower limit of the word count range of the translation task package, aggregating the translation task blocks corresponding to the vertices of all edges in Z_new into the same translation task package;
step ten, all translation task packages formed by aggregating the translation task blocks in the set A are obtained, and the block aggregation process ends.
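The ten steps can be read as a greedy packing loop over edges sorted by similarity. The Python below is a simplified interpretation under stated assumptions, not the exact procedure: it closes a package as soon as its word count reaches the range, assigns each block to exactly one package, and groups leftover blocks that cannot form a package on their own. The function name `aggregate_blocks` and the sample numbers are hypothetical.

```python
def aggregate_blocks(word_counts, similarity, lower, upper):
    """Greedy, edge-priority packing of blocks into packages.

    word_counts: dict block id -> word count (vertex weights C)
    similarity:  dict frozenset({u, v}) -> similarity (edge set Z)
    lower/upper: word count range of a translation task package
    """
    remaining = set(word_counts)
    packages = []
    while remaining:
        # Edges whose endpoints are both unassigned, most similar first.
        edges = sorted(
            (e for e in similarity if e <= remaining),
            key=lambda e: similarity[e],
            reverse=True,
        )
        package, total = set(), 0
        for edge in edges:
            new_vertices = edge - package
            added = sum(word_counts[v] for v in new_vertices)
            if total + added > upper:
                continue  # overflow: skip this edge and try the next one
            package |= new_vertices
            total += added
            if total >= lower:
                break  # word count is now within [lower, upper]
        if not package:
            # Assumption: if no edge fits, the leftovers form one package.
            package = set(remaining)
        packages.append(sorted(package))
        remaining -= package
    return packages


counts = {"a1": 300, "a2": 400, "a3": 350, "a4": 200}
sims = {
    frozenset(("a1", "a2")): 0.9, frozenset(("a1", "a3")): 0.4,
    frozenset(("a1", "a4")): 0.2, frozenset(("a2", "a3")): 0.5,
    frozenset(("a2", "a4")): 0.3, frozenset(("a3", "a4")): 0.8,
}
print(aggregate_blocks(counts, sims, lower=500, upper=800))
# [['a1', 'a2'], ['a3', 'a4']]
```

The steps do not say what happens when every remaining edge overflows, or whether the edge X that brings the sum into range is itself added to Z_new_del; the sketch resolves both in the simplest way and other readings are possible.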
According to the document batch translation method provided by the embodiment of the invention, the translation task blocks are aggregated in an undirected graph traversal mode, so that a translation task packet with higher semantic similarity is obtained, the rationality and the accuracy of translation task division are improved, and the document translation efficiency is improved.
Based on any of the above embodiments, segment division is performed on any document, and all segments of the document are determined, including:
and segmenting any document based on paragraph identifiers and/or punctuation marks in the document, and determining all segments of the document.
Specifically, when a document is divided into segments, it may be divided by natural paragraph, by sentence, or by both.
If the document is divided by natural paragraph, paragraph identifiers may be used as the division basis. If it is divided by sentence, punctuation marks may be used as the division basis. Punctuation marks here are those that mark the end of a complete sentence, such as periods, question marks, exclamation marks, and carriage returns.
By segmenting each document according to the paragraph identifiers and/or punctuation marks in the document to determine all of its segments, the document batch translation method provided by the embodiment of the invention is simple to implement, reduces the workload of translators, and improves document translation efficiency.
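The segment division described above can be sketched with two small Python helpers, one per division basis. The function names are hypothetical, line breaks are assumed to be the paragraph identifier, and the set of sentence-ending marks is an illustrative choice covering both Western and Chinese punctuation.

```python
import re

# Sentence-ending marks; the exact set is an assumption for illustration.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_paragraphs(text):
    """Split on line breaks, used here as the paragraph identifier."""
    return [p.strip() for p in re.split(r"[\r\n]+", text) if p.strip()]


def split_sentences(text):
    """Split on sentence-ending punctuation; line breaks also end a sentence."""
    sentences = []
    for paragraph in split_paragraphs(text):
        sentences += [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]
    return sentences


doc = "Install the pump. Is it sealed?\nCheck the valve!"
print(split_paragraphs(doc))  # ['Install the pump. Is it sealed?', 'Check the valve!']
print(split_sentences(doc))   # ['Install the pump.', 'Is it sealed?', 'Check the valve!']
```

A plain punctuation split like this would also break on abbreviations and decimal numbers such as "3.14", so a production segmenter would likely need extra rules.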
Based on any of the above embodiments, step 140 includes:
respectively carrying out text similarity matching on any translation task packet and historical translation task packets of a plurality of translators, and determining the translator corresponding to the translation task packet;
and determining the translation results of the plurality of documents based on the translation results determined by the translators corresponding to each translation task package.
Specifically, historical translation task packages for multiple translators may be collected in advance. And respectively carrying out text similarity matching on all translation task packages of a plurality of documents to be translated and the historical translation task package of each translator, thereby determining the translator corresponding to each translation task package and carrying out the distribution of the translation task packages.
And the corresponding translator translates the distributed translation task package, and arranges the obtained translation results according to the document numbers and the block numbers of the translation task blocks in the translation task package, so that the translation results of a plurality of documents to be translated are obtained.
According to the document batch translation method provided by the embodiment of the invention, any translation task packet is respectively subjected to text similarity matching with the historical translation task packets of a plurality of translators, so that the translator corresponding to the translation task packet is determined, the historical translation data of the translator is considered, the rationality of translation task allocation is improved, the working experience of the translator can be fully utilized, the translation time is saved, and the translation efficiency and accuracy are improved.
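A minimal sketch of the allocation and reassembly just described might look like the following Python. It assumes that each package goes to the translator whose historical packages contain the single most similar text, and it uses word overlap as a stand-in for the unspecified text similarity measure; `assign_translators`, `reassemble`, and the sample data are all hypothetical.

```python
def jaccard(text_a, text_b):
    """Placeholder text similarity: word overlap (assumption)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def assign_translators(packages, history):
    """Map each package id to the translator with the best-matching history.

    packages: dict package id -> package text
    history:  dict translator -> list of historical package texts
    """
    assignment = {}
    for package_id, text in packages.items():
        def best_score(translator):
            return max((jaccard(text, past) for past in history[translator]),
                       default=0.0)
        assignment[package_id] = max(history, key=best_score)
    return assignment


def reassemble(translated_blocks):
    """Order (doc_no, block_no, text) triples back into per-document text."""
    documents = {}
    for doc_no, block_no, text in sorted(translated_blocks):
        documents.setdefault(doc_no, []).append(text)
    return {doc_no: "\n".join(parts) for doc_no, parts in documents.items()}


history = {
    "alice": ["pump valve motor maintenance"],
    "bob": ["contract tax law clause"],
}
packages = {"p1": "replace the pump valve", "p2": "tax clause of the contract"}
print(assign_translators(packages, history))  # {'p1': 'alice', 'p2': 'bob'}
print(reassemble([(2, 1, "B1"), (1, 2, "A2"), (1, 1, "A1")]))
# {1: 'A1\nA2', 2: 'B1'}
```

Sorting by document number and then block number is what lets blocks translated by different people be restored to their original order.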
Based on any of the above embodiments, fig. 2 is a schematic structural diagram of a document batch translation apparatus provided by the present invention, and as shown in fig. 2, the apparatus includes:
a determining unit 210 configured to determine a plurality of documents to be translated;
the decomposition unit 220 is configured to decompose any document based on a document structure of any document, and determine a translation task block corresponding to any document;
the aggregation unit 230 is configured to aggregate translation task blocks corresponding to the documents, and determine translation task packages corresponding to the documents;
and the translation unit 240 is configured to determine translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
Specifically, the determination unit 210 is configured to determine a plurality of documents to be translated; the decomposition unit 220 is configured to determine a translation task block corresponding to any document; the aggregation unit 230 is configured to determine translation task packages corresponding to a plurality of documents; the translation unit 240 is configured to determine translation results for a plurality of documents.
The document batch translation device provided by the embodiment of the invention decomposes each document according to the document structure, determines the translation task block corresponding to each document, aggregates the translation task blocks corresponding to each document, determines the translation task packages corresponding to a plurality of documents, and further determines the translation results of the plurality of documents, so that batch translation of the plurality of documents is realized.
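The way the four units hand data to one another can be sketched as a small pipeline object. This Python is only an illustration of the data flow: the class name `BatchTranslationDevice` and the three stand-in callables are hypothetical, and the trivial lambdas do not implement the real decomposition, aggregation, or translation logic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BatchTranslationDevice:
    decompose: Callable[[str], List[str]]              # decomposition unit 220
    aggregate: Callable[[List[str]], List[List[str]]]  # aggregation unit 230
    translate: Callable[[List[str]], List[str]]        # translation unit 240

    def run(self, documents: List[str]) -> List[str]:
        # Determining unit 210: the documents passed in are the ones to translate.
        blocks = [b for doc in documents for b in self.decompose(doc)]
        packages = self.aggregate(blocks)
        return [t for package in packages for t in self.translate(package)]


# Stand-in units: split on blank lines, pack two blocks per package,
# and "translate" by upper-casing.
device = BatchTranslationDevice(
    decompose=lambda doc: doc.split("\n\n"),
    aggregate=lambda blocks: [blocks[i:i + 2] for i in range(0, len(blocks), 2)],
    translate=lambda package: [block.upper() for block in package],
)
print(device.run(["a\n\nb", "c"]))  # ['A', 'B', 'C']
```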
Based on any of the above embodiments, the decomposition unit 220 includes:
the dividing subunit is used for carrying out fragment division on any document and determining all fragments of any document;
the decomposition subunit is used for determining a plurality of continuous fragments corresponding to each hierarchy in any document based on the document structure of any document and all fragments of any document;
and the block determining subunit is used for determining the translation task block corresponding to any document based on the word number range of the translation task block and a plurality of continuous segments corresponding to each level in any document.
Based on any of the above embodiments, the aggregation unit 230 includes:
the clustering subunit is used for clustering the translation task blocks corresponding to the documents based on the semantic similarity among the translation task blocks to obtain a plurality of semantic similar classes;
the aggregation subunit is used for aggregating each translation task block in any semantic similarity class based on the semantic similarity between each translation task block in any semantic similarity class and the word number of each translation task block to obtain a translation task package corresponding to any semantic similarity class;
and the packet determining subunit is used for determining translation task packets corresponding to the plurality of documents based on the translation task packet corresponding to each semantic similarity class.
Based on any of the above embodiments, the clustering subunit is configured to:
and merging all semantic similarity classes only containing one translation task block.
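Merging the single-block classes can be sketched in a few lines of Python. The function name `merge_singletons` is hypothetical, and appending the merged class at the end of the list is an arbitrary choice made for illustration.

```python
def merge_singletons(classes):
    """Merge every class that holds exactly one block into a single class."""
    kept = [c for c in classes if len(c) > 1]
    merged = [block for c in classes if len(c) == 1 for block in c]
    if merged:
        kept.append(merged)
    return kept


classes = [["a", "b"], ["c"], ["d"], ["e", "f", "g"]]
print(merge_singletons(classes))  # [['a', 'b'], ['e', 'f', 'g'], ['c', 'd']]
```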
Based on any of the above embodiments, the aggregation subunit comprises:
the mapping module is used for establishing an undirected graph by taking each translation task block in any semantic similarity class as a vertex; edges in the undirected graph are semantic similarity among all the translation task blocks, and vertex weight in the undirected graph is word number of each translation task block;
the aggregation module is used for traversing the undirected graph by edge priority and aggregating the translation task blocks corresponding to a plurality of vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to any semantic similarity class are obtained; the preset condition is that the vertex weight sum falls within the word count range of the translation task package.
Based on any of the above embodiments, the dividing subunit is specifically configured to:
and segmenting any document based on paragraph identifiers and/or punctuation marks in any document, and determining all segments of any document.
Based on any of the above embodiments, the translation unit 240 is specifically configured to:
respectively carrying out text similarity matching on any translation task packet and historical translation task packets of a plurality of translators, and determining a translator corresponding to any translation task packet;
and determining the translation results of the plurality of documents based on the translation results determined by the translators corresponding to each translation task package.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 3, the electronic device may include a processor 310, a communication interface 320, a memory 330, and a communication bus 340, wherein the processor 310, the communication interface 320, and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call the logic instructions in the memory 330 to execute the method provided by the above embodiments, the method including:
determining a plurality of documents to be translated; decomposing any document based on the document structure of any document, and determining a translation task block corresponding to any document; aggregating translation task blocks corresponding to all the documents, and determining translation task packages corresponding to a plurality of documents; and determining translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a plurality of instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The processor in the electronic device provided by the embodiment of the present invention may call a logic instruction in the memory to implement the document batch translation method, and the specific implementation manner is consistent with the method implementation manner and may achieve the same beneficial effects, which is not described herein again.
The present invention further provides a non-transitory computer-readable storage medium, which is described below, and the non-transitory computer-readable storage medium described below and the document batch translation method described above may be referred to in correspondence.
An embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the method provided by the above embodiments, the method including:
determining a plurality of documents to be translated; decomposing any document based on the document structure of any document, and determining a translation task block corresponding to any document; aggregating translation task blocks corresponding to all the documents, and determining translation task packages corresponding to a plurality of documents; and determining translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
When the computer program stored on the non-transitory computer-readable storage medium provided by the embodiment of the present invention is executed, the method for batch translation of documents is implemented, and the specific implementation manner is consistent with the method implementation manner and can achieve the same beneficial effects, which is not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A document batch translation method is characterized by comprising the following steps:
determining a plurality of documents to be translated;
decomposing any document based on the document structure of the document, and determining a translation task block corresponding to the document;
aggregating translation task blocks corresponding to all the documents, and determining translation task packages corresponding to the plurality of documents;
and determining translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
2. The method for batch translation of documents according to claim 1, wherein decomposing any document based on its document structure and determining a translation task block corresponding to the any document comprises:
carrying out fragment division on any document, and determining all fragments of any document;
determining a plurality of continuous segments corresponding to each level in any document based on the document structure of any document and all segments of any document;
and determining the translation task block corresponding to any document based on the word number range of the translation task block and a plurality of continuous segments corresponding to each level in any document.
3. The method for batch translation of documents according to claim 1, wherein the aggregating the translation task blocks corresponding to the documents and determining the translation task packages corresponding to the plurality of documents comprises:
clustering translation task blocks corresponding to all documents based on semantic similarity among the translation task blocks to obtain a plurality of semantic similar classes;
aggregating all translation task blocks in any semantic similarity class based on the semantic similarity among all translation task blocks in any semantic similarity class and the word number of each translation task block to obtain a translation task package corresponding to any semantic similarity class;
and determining translation task packages corresponding to the plurality of documents based on the translation task package corresponding to each semantic similarity class.
4. The document batch translation method according to claim 3, wherein the clustering the translation task blocks corresponding to each document based on semantic similarity among the translation task blocks to obtain a plurality of semantic similarity classes comprises:
and merging all semantic similarity classes only containing one translation task block.
5. The document batch translation method according to claim 3, wherein the aggregating the translation task blocks in any semantic similarity class based on the semantic similarity between the translation task blocks in any semantic similarity class and the word count of each translation task block to obtain the translation task package corresponding to any semantic similarity class comprises:
establishing an undirected graph by taking each translation task block in any semantic similarity class as a vertex; edges in the undirected graph are semantic similarity among all translation task blocks, and vertex weight in the undirected graph is word number of each translation task block;
traversing the undirected graph by edge priority, and aggregating the translation task blocks corresponding to a plurality of vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to any semantic similarity class are obtained; the preset condition is that the vertex weight sum falls within the word count range of the translation task package.
6. The method for batch translation of documents according to claim 2, wherein the step of segment dividing any document and determining all segments of any document comprises:
and segmenting any document based on paragraph identifiers and/or punctuation marks in any document, and determining all segments of any document.
7. The method for batch translation of documents according to any one of claims 1 to 6, wherein the determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents comprises:
respectively carrying out text similarity matching on any translation task packet and historical translation task packets of a plurality of translators, and determining a translator corresponding to any translation task packet;
and determining the translation results of the plurality of documents based on the translation result determined by the translator corresponding to each translation task package.
8. A document batch translation apparatus, comprising:
a determination unit configured to determine a plurality of documents to be translated;
a decomposition unit configured to decompose any document based on the document structure of the document, and determine the translation task block corresponding to the document;
an aggregation unit configured to aggregate the translation task blocks corresponding to the documents, and determine the translation task packages corresponding to the plurality of documents;
and a translation unit configured to determine the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the document batch translation method according to any one of claims 1 to 7 when executing the computer program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the document batch translation method according to any one of claims 1 to 7.
CN202110126066.5A 2021-01-29 2021-01-29 Document batch translation method and device, electronic equipment and storage medium Pending CN112784613A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110126066.5A CN112784613A (en) 2021-01-29 2021-01-29 Document batch translation method and device, electronic equipment and storage medium
PCT/CN2021/126664 WO2022160819A1 (en) 2021-01-29 2021-10-27 Document batch translation method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110126066.5A CN112784613A (en) 2021-01-29 2021-01-29 Document batch translation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112784613A true CN112784613A (en) 2021-05-11

Family

ID=75759737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110126066.5A Pending CN112784613A (en) 2021-01-29 2021-01-29 Document batch translation method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112784613A (en)
WO (1) WO2022160819A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064970B (en) * 2012-12-31 2016-04-20 武汉传神信息技术有限公司 Optimize the search method of interpreter
CN112784613A (en) * 2021-01-29 2021-05-11 语联网(武汉)信息技术有限公司 Document batch translation method and device, electronic equipment and storage medium

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN103744834A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Method for accurately distributing translation tasks
CN103885942A (en) * 2014-03-18 2014-06-25 成都优译信息技术有限公司 Rapid translation device and method
CN104484323A (en) * 2014-12-26 2015-04-01 武汉传神信息技术有限公司 Translation processing method based on document segment
CN105808528A (en) * 2016-03-04 2016-07-27 张广睿 Document character processing method
GB201717959D0 (en) * 2016-11-18 2017-12-13 Lionbridge Tech Inc Collection strategies that facilitate arranging portions of documents into content collections
US20180143975A1 (en) * 2016-11-18 2018-05-24 Lionbridge Technologies, Inc. Collection strategies that facilitate arranging portions of documents into content collections
CN107391565A (en) * 2017-06-13 2017-11-24 东南大学 A kind of across language hierarchy taxonomic hierarchies matching process based on topic model
KR20190011421A (en) * 2017-07-25 2019-02-07 주식회사 한글과컴퓨터 Documents package translation system
CN111191470A (en) * 2019-12-25 2020-05-22 语联网(武汉)信息技术有限公司 Document translation method and device
CN111611813A (en) * 2020-04-29 2020-09-01 南京南瑞继保电气有限公司 Document translation method and device, electronic equipment and storage medium
CN111611811A (en) * 2020-05-25 2020-09-01 腾讯科技(深圳)有限公司 Translation method, translation device, electronic equipment and computer readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022160819A1 (en) * 2021-01-29 2022-08-04 语联网(武汉)信息技术有限公司 Document batch translation method and apparatus, electronic device, and storage medium
CN114154092A (en) * 2021-11-18 2022-03-08 网易有道信息技术(江苏)有限公司 Method for translating web pages and related product
CN114154092B (en) * 2021-11-18 2023-04-18 网易有道信息技术(江苏)有限公司 Method for translating web pages and related product

Also Published As

Publication number Publication date
WO2022160819A1 (en) 2022-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination