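WO2022160819A1 - Method, apparatus, electronic device and storage medium for batch translation of documents - Google Patents
Method, apparatus, electronic device and storage medium for batch translation of documents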
- Publication number
- WO2022160819A1 (PCT/CN2021/126664)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- translation
- document
- translation task
- documents
- task
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Definitions
- the present application relates to the field of computer technology, and in particular, to a method, apparatus, electronic device and storage medium for batch translation of documents.
- the present application provides a method, device, electronic device and storage medium for batch translation of documents, which are used to solve the technical problems of unreasonable document distribution, long translation time and low translation efficiency in the prior art.
- This application provides a method for batch translation of documents, including:
- the translation results of the multiple documents are determined.
- the document structure based on any document is decomposed, and the translation task block corresponding to the any document is determined, including:
- the translation task block corresponding to any document is determined based on the word count range of the translation task block and several consecutive segments corresponding to each level in the any document.
- the aggregating translation task blocks corresponding to each document to determine translation task packages corresponding to the multiple documents includes:
- each translation task block in any semantic similarity class is aggregated to obtain the translation task package corresponding to that semantic similarity class;
- the translation task package corresponding to the plurality of documents is determined.
- the translation task blocks corresponding to each document are clustered based on the semantic similarity between the translation task blocks to obtain a plurality of semantic similarity classes, including:
- the translation task blocks are aggregated to obtain a translation task package corresponding to any one of the semantically similar classes, including:
- an undirected graph is established with each translation task block in any of the semantic similarity classes as a vertex; each edge in the undirected graph carries the semantic similarity between the two translation task blocks it connects, and each vertex weight is the number of words in the corresponding translation task block;
- the undirected graph is traversed with edge priority, and the translation task blocks corresponding to multiple vertices whose vertex weight sum satisfies the preset condition are aggregated into a translation task package, until the translation task packages corresponding to the semantic similarity class are obtained;
- the preset condition is that the vertex weight sum lies within the word count range of the translation task package.
- the segment division of any document is performed, and all segments of the any document are determined, including:
- Segmentation is performed on any document based on paragraph identifiers and/or punctuation marks in the any document, and all segments of the any document are determined.
- determining the translation results of the multiple documents based on the translation task packages corresponding to the multiple documents includes:
- the translation results of the plurality of documents are determined.
- the present application provides a document batch translation device, including:
- a determination unit for determining a plurality of documents to be translated
- a decomposition unit configured to decompose any document based on the document structure of any document, and determine a translation task block corresponding to the any document;
- an aggregation unit configured to aggregate translation task blocks corresponding to each document, and determine translation task packages corresponding to the multiple documents
- a translation unit configured to determine translation results of the multiple documents based on translation task packages corresponding to the multiple documents.
- the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the document batch translation method according to any one of the above are implemented.
- the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the steps of any one of the above-mentioned methods for batch translation of documents.
- the document batch translation method, device, electronic device and storage medium provided by the present application decompose each document according to its document structure, determine the translation task blocks corresponding to each document, aggregate the translation task blocks corresponding to each document, determine the translation task packages corresponding to the multiple documents, and then determine the translation results of the multiple documents, thereby realizing batch translation of multiple documents. Because the content in each translation task package is continuous, semantically similar and of suitable length, multiple translators can complete the translation work in parallel, which improves the efficiency of document translation. At the same time, content with similar semantics is placed in the same translation task package and translated by the same translator, which avoids inconsistencies between the results of different translators and ensures the consistency of the translation results.
- FIG. 1 is a schematic flowchart of a method for batch translation of documents provided by the present application
- FIG. 2 is a schematic structural diagram of a document batch translation device provided by the present application.
- FIG. 3 is a schematic structural diagram of an electronic device provided by the present application.
- FIG. 1 is a schematic flowchart of a method for batch translation of documents provided by this application. As shown in FIG. 1 , the method includes:
- Step 110 Determine multiple documents to be translated.
- the document is the text to be translated
- the language type of the document may be Chinese, or may be English, Japanese, French, German, Arabic, and the like.
- This embodiment of the present application does not specifically limit the language type of the document.
- the language types of the multiple documents to be translated are the same language type and need to be translated into another language type.
- Step 120 Decompose any document based on the document structure of the document, and determine the translation task block corresponding to the document.
- the translation task block is a collection of several consecutive segments in the same document.
- a fragment is a basic unit of a document, which can be a natural paragraph or a sentence.
- a document to be translated can be divided into multiple segments to be translated.
- the word count range can be set for the translation task block, so that the number of words in the translation task block is within a certain range.
- the word count range for the translation task block can be set to [500, 2000].
- the size of the word count range can be set according to the actual situation.
- a document to be translated can be divided into multiple translation task blocks.
- the basic principle of dividing translation task blocks is to place segments with coherent context and semantics in the same translation task block as far as possible. Therefore, extracting segments from the document to be translated in their original order helps ensure the continuity of document translation.
- the document to be translated can be decomposed according to the document structure of the document to determine the translation task block corresponding to the document.
- the document structure is the hierarchical structure of the document, and the corresponding document structure information includes the document title, the title of each level and its sub-levels, the number of segments and the number of words under each level and its sub-levels, and the like. After decomposition, it is also possible to determine the document number where the translation task block is located, and the block number under this document number. Based on the document number and block number of the translation task block, the specific location of the translation task block in multiple documents can be quickly determined.
- Step 130 Aggregate translation task blocks corresponding to each document to determine translation task packages corresponding to multiple documents.
- translation task blocks corresponding to each document can be obtained.
- document A and document B belong to different technical documents of the same product, and part of the content appearing in document A may be the same or similar to part of the content appearing in document B, or there is a mutual reference relationship in content, etc.
- the aggregated result is the translation task package.
- the translation task package contains a plurality of translation task blocks that are intrinsically related to each other, that is, the translation task blocks in the translation task package have a relatively high degree of intrinsic relationship. Intrinsic connections here can include semantic similarity.
- the word count range can be set for the translation task package, so that the number of words in the translation task package is within a certain range. For example, the word count range for a translation task package can be set to [5000, 10000]. The size of the word count range can be set according to the actual situation.
- Step 140 Determine translation results of the multiple documents based on translation task packages corresponding to the multiple documents.
- translation may be performed using the translation task package as a basic unit.
- tasks can be assigned to multiple translators based on translation task packages.
- the translation results of each translation task package are combined in the order of translation task blocks, thereby obtaining translation results of multiple documents.
- Translator refers to the translator of the document.
- each document is decomposed according to the document structure, the translation task blocks corresponding to each document are determined, the translation task blocks corresponding to each document are aggregated, the translation task packages corresponding to the multiple documents are determined, and then the translation results of the multiple documents are determined, realizing batch translation of multiple documents.
- because the content in each translation task package is continuous, semantically similar and of suitable length, multiple translators can complete the translation work in parallel, which improves document translation efficiency.
- the content of documents with similar semantics is divided into the same translation task package and translated by the same translator, which avoids inconsistencies in the results translated by different translators and ensures the consistency of translation results.
- step 120 includes:
- the translation task block corresponding to the document is determined.
- the number of translation task blocks corresponding to each document is determined by the word count range of the translation task blocks, and may be one or more. For example, if the overall word count of a document is less than the word count range of the translation task block, the document can be treated as a single translation task block; if the overall word count of a document is greater than the word count range of the translation task block, the document can be decomposed into multiple translation task blocks.
- the translation task block is determined with the fragment as the basic unit. For example, one of the documents to be translated can be divided into segments to obtain multiple segments to be translated, which can be expressed as the set:
- $S = \{S_1, S_2, \dots, S_n\}$
- where $S$ is the document to be translated, $S_i$ is the $i$-th segment to be translated, and $n$ is the number of segments to be translated, $1 \le i \le n$.
- the document S to be translated includes 5 segments, and its document structure is divided into two levels, namely Chapter 1 and Chapter 2, each of which is further divided into two sub-levels: Chapter 1 includes Sections 1.1 and 1.2, and Chapter 2 includes Sections 2.1 and 2.2.
- Section 1.1 includes fragment S 1
- section 1.2 includes fragment S 2
- section 2.1 includes fragment S 3
- section 2.2 includes fragment S 4 and fragment S 5 .
- the translation task block corresponding to the document is determined according to the word count range of the translation task block and several consecutive segments corresponding to each level in the document.
- the word count range of the translation task block can be determined as [500, 2000].
- the word counts of the segments are 200, 300, 1600, 300 and 800, respectively.
- the document S can be decomposed into three translation task blocks, which are marked as S-1, S-2 and S-3 according to the document number and block number respectively.
- the translation task block S-1 includes a segment S 1 and a segment S 2
- the translation task block S-2 includes a segment S 3 and a segment S 4
- the translation task block S-3 includes a segment S 5 .
- within the word count range of the translation task block, consecutive document segments of the same level are divided into the same translation task block. If the word count of the segments at the same level cannot reach the lower limit of the word count range, subsequent segments continue to be extracted until the number of words in the translation task block reaches the lower limit. If the word count of the segments at the same level has reached the upper limit of the word count range, the next segment, which would exceed the upper limit, is divided into the next translation task block;
- within the word count range of the translation task block, consecutive document segments at different levels can also be divided into the same translation task block. If the word count of the segments at different levels cannot reach the lower limit of the word count range, subsequent segments continue to be extracted until the number of words in the translation task block reaches the lower limit. If the word count of the segments at different levels has reached the upper limit of the word count range, the next segment, which would exceed the upper limit, is divided into the next translation task block.
- the document batch translation method decomposes the document according to the document structure of the document, determines the translation task block corresponding to the document, provides a simple and feasible document decomposition method, and reduces the complexity of the document batch translation algorithm .
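The decomposition of document S described above can be reproduced with a short sketch. It is a simplification, not the procedure of the application: it ignores the chapter and section levels and simply keeps adding consecutive segments until the next one would exceed the upper limit of the word count range. The function name `split_into_blocks` and its parameters are hypothetical.

```python
from typing import List


def split_into_blocks(word_counts: List[int], upper: int = 2000) -> List[List[int]]:
    """Group consecutive segment indices so each block stays at or below `upper` words.

    Assumption: a purely greedy rule; a single segment longer than `upper`
    becomes a block of its own.
    """
    blocks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for index, count in enumerate(word_counts):
        # Close the current block when the next segment would push it past the upper limit.
        if current and total + count > upper:
            blocks.append(current)
            current, total = [], 0
        current.append(index)
        total += count
    if current:
        blocks.append(current)
    return blocks


# Word counts of segments S1..S5 from the example in the text.
segment_words = [200, 300, 1600, 300, 800]
blocks = split_into_blocks(segment_words, upper=2000)
block_totals = [sum(segment_words[i] for i in block) for block in blocks]

for number, (block, words) in enumerate(zip(blocks, block_totals), start=1):
    names = ", ".join(f"S{i + 1}" for i in block)
    print(f"S-{number}: {names} ({words} words)")
```

With the word counts from the example, the sketch yields the same three blocks as the text (S1 and S2, S3 and S4, S5), each within the range [500, 2000]. A fuller version would also check the lower limit and respect level boundaries.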
- step 130 includes:
- each translation task block in the semantic similarity class is aggregated to obtain the translation task package corresponding to the semantic similarity class;
- the translation task package corresponding to the multiple documents is determined.
- the translation task blocks corresponding to the obtained documents may be combined across documents according to the semantic similarity between the translation task blocks, so as to obtain translation task packages corresponding to the multiple documents to be translated.
- the method of determining the translation task packages can be divided into two parts. The first part performs cross-document clustering on the translation task blocks corresponding to each document to obtain multiple semantically similar classes; the second part aggregates the translation task blocks within each semantically similar class to obtain the translation task packages corresponding to that class.
- before clustering the translation task blocks corresponding to each document, an existing classification model may be used to pre-classify the translation task blocks corresponding to each document to obtain classified semantically similar classes.
- the existing classification model is a document content classification model, which can classify each translation task block into financial, military, or engineering.
- the translation task blocks of each document to be translated can be regarded as a set B.
- the set of translation task blocks that can be pre-classified is B 1 , and the remaining translation task blocks, which cannot be pre-classified, form the set B 2 .
- the translation task blocks in the set B 2 can be clustered, and the clustering method can adopt the K-means algorithm. After the clustering, the translation task blocks in the set B 2 are divided into several categories.
- An embodiment of the present application provides a method for clustering translation task blocks based on semantic similarity, which is used to classify translation task blocks that cannot be classified by existing classification models.
- the steps of this method are:
- Step 2 Taking the translation task block B 21 as a benchmark, calculate the semantic similarity between B 21 and the rest of the translation task blocks in the set B 2 , screen out all the translation task blocks whose semantic similarity is greater than a given threshold, and form the first semantically similar class E 1 together with B 21 ;
- Step 3 Among the translation task blocks in the set B 2 other than those in E 1 , obtain the second semantically similar class E 2 according to the method in step 2;
- Step 4 Repeat the methods in steps 2 and 3 until all the translation task blocks in the set B 2 are classified into corresponding semantically similar classes, finally obtaining multiple semantically similar classes.
- the method for batch translation of documents provided by the embodiment of the present application performs clustering and aggregation operations on the translation task blocks corresponding to the obtained documents according to the semantic similarity between the translation task blocks, so as to obtain translation task packages with higher semantic similarity. This improves the rationality and accuracy of the division of translation tasks and the efficiency of document translation.
- the translation task blocks corresponding to each document are clustered to obtain a plurality of semantic similarity classes, including:
- semantically similar classes containing only one translation task block may be obtained. All semantically similar classes that contain only one translation task block can be merged, that is, merged into one class, which can be called a tail class.
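The threshold-based clustering steps and the tail class can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how semantic similarity is computed, so a word-overlap (Jaccard) score stands in for it, and the first remaining block is taken as the benchmark in each round. The names `jaccard`, `cluster_blocks` and `merge_tail_class` and the sample texts are hypothetical.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (assumption): overlap of the two word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_blocks(
    blocks: Dict[str, str],
    threshold: float,
    similarity: Callable[[str, str], float] = jaccard,
) -> List[List[str]]:
    """Repeatedly pick a benchmark block and collect every block similar to it."""
    remaining = list(blocks)
    classes: List[List[str]] = []
    while remaining:
        benchmark = remaining[0]  # assumption: first remaining block is the benchmark
        members = [benchmark] + [
            name
            for name in remaining[1:]
            if similarity(blocks[benchmark], blocks[name]) > threshold
        ]
        classes.append(members)
        remaining = [name for name in remaining if name not in members]
    return classes


def merge_tail_class(classes: List[List[str]]) -> List[List[str]]:
    """Merge all single-block classes into one tail class, placed last."""
    multi = [c for c in classes if len(c) > 1]
    tail = [c[0] for c in classes if len(c) == 1]
    return multi + ([tail] if tail else [])


# Hypothetical translation task blocks from four documents A, B, C and D.
sample_blocks = {
    "A-1": "install the pump and connect the power cable",
    "B-1": "connect the power cable and install the pump housing",
    "A-2": "warranty terms and customer support contacts",
    "B-2": "customer support contacts and warranty terms",
    "C-1": "quarterly revenue forecast",
    "D-1": "helicopter rotor maintenance",
}
classes = cluster_blocks(sample_blocks, threshold=0.5)
merged = merge_tail_class(classes)
print(merged)
```

In practice the similarity function would be replaced by whatever semantic measure is available; the loop structure is the part that mirrors the steps in the text.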
- each translation task block in the semantic similarity class is aggregated to obtain the translation task package corresponding to the semantic similarity class, including:
- An undirected graph is established with each translation task block in the semantic similarity class as a vertex; the edge in the undirected graph is the semantic similarity between each translation task block, and the vertex weight in the undirected graph is the word count of each translation task block;
- the undirected graph is traversed with edge priority, and the translation task blocks corresponding to multiple vertices whose vertex weight sum satisfies the preset condition are aggregated into a translation task package, until the translation task packages corresponding to the semantically similar class are obtained; the preset condition is that the vertex weight sum lies within the word count range of the translation task package.
- the semantic similarity class here may include a pre-classified semantic similarity class, a semantic similarity class obtained after clustering, a tail class, and the like.
- An undirected graph G is established with each translation task block in the semantic similarity class as vertices.
- the undirected graph G is traversed with edge priority, and the weights of the traversed vertices are accumulated to obtain the sum of the vertex weights. If the vertex weight sum meets the preset condition, the translation task blocks corresponding to the traversed vertices are aggregated into a translation task package. The preset condition can be that the vertex weight sum lies within the word count range of the translation task package. This cycle is repeated until the translation task packages corresponding to the semantically similar class are obtained.
- Step 2 Set the set Z_new_del of the edges to be removed and the overflow set Z_new_overflow to be empty;
- Step 3 Select the edge X with the largest semantic similarity from the elements of the set Z_new minus the set Z_new_overflow;
- Step 4 Calculate the weight sum of the vertices corresponding to the elements in the edge X plus the set Z_new_del;
- Step 5 If the weight sum is less than the lower limit of the word count range of the translation task package, add the edge X to the set Z_new_del, remove the elements in the set Z_new_del from the set Z_new to obtain the updated set Z_new, and go to step 3;
- Step 6 If the weight sum is greater than the upper limit of the word count range of the translation task package, add edge X to the set Z_new_overflow, and go to step 3;
- Step 7 Aggregate the translation task blocks corresponding to the vertices corresponding to the edges in the set Z_new_del into the same translation task package;
- Step 8 If Z_new is not empty and the weight sum of the vertices corresponding to all edges in Z_new is greater than the lower limit of the word count range of the translation task package, go to Step 2;
- Step 9 If Z_new is not empty and the weight sum of the vertices corresponding to all edges in Z_new is less than the lower limit of the word count range of the translation task package, aggregate the translation task blocks corresponding to the vertices corresponding to all edges in Z_new into the same translation task package;
- Step 10 Obtain all translation task packages aggregated by translation task blocks in set A, and the block aggregation process ends.
- translation task blocks are aggregated by means of undirected graph traversal to obtain translation task packages with higher semantic similarity, which improves the rationality and accuracy of translation task division and improves document translation efficiency.
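The idea of the aggregation can be illustrated with a simplified greedy sketch. It is not the stepwise Z_new procedure listed above, whose first step is not reproduced in this text; instead it visits edges in descending order of similarity, accumulates the word counts of their endpoints, and closes a package once the sum lies within the word count range. The function `aggregate_class`, the tie-breaking rules, the handling of leftovers, and the sample data are all assumptions.

```python
from typing import Dict, List, Tuple


def aggregate_class(
    weights: Dict[str, int],
    edges: Dict[Tuple[str, str], float],
    lower: int = 5000,
    upper: int = 10000,
) -> List[List[str]]:
    """Greedily group blocks (vertices) into packages, most similar edges first."""
    unassigned = set(weights)
    ordered = sorted(edges, key=edges.get, reverse=True)  # highest similarity first
    packages: List[List[str]] = []
    while unassigned:
        # Too little left for a full package: emit the leftovers together.
        if sum(weights[v] for v in unassigned) < lower:
            packages.append(sorted(unassigned))
            break
        package: List[str] = []
        total = 0
        for u, v in ordered:
            if u not in unassigned or v not in unassigned:
                continue  # edge touches a block that is already packaged
            new = [x for x in (u, v) if x not in package]
            added = sum(weights[x] for x in new)
            if not new or total + added > upper:
                continue  # nothing new, or the edge would overflow the package
            package.extend(new)
            total += added
            if total >= lower:
                break  # weight sum is now within [lower, upper]
        if not package:
            # No usable edge: fall back to the heaviest remaining block on its own.
            package = [max(sorted(unassigned), key=weights.get)]
        packages.append(package)
        unassigned -= set(package)
    return packages


# Hypothetical semantic similarity class: word counts and pairwise similarities.
block_words = {"A-1": 1800, "B-1": 1900, "A-2": 1500, "B-2": 1200, "C-1": 900}
similarities = {
    ("A-1", "B-1"): 0.9,
    ("B-1", "A-2"): 0.8,
    ("A-2", "B-2"): 0.7,
    ("B-2", "C-1"): 0.6,
}
packages = aggregate_class(block_words, similarities, lower=5000, upper=10000)
print(packages)
```

Here the first package collects A-1, B-1 and A-2 (5200 words), and the two remaining blocks, whose combined word count is below the lower limit, are emitted together as a final package.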
- any document is segmented, and all segments of the document are determined, including:
- the document is segmented to determine all segments of the document.
- when the document is segmented, it can be divided by natural paragraphs, by sentences, or by both natural paragraphs and sentences.
- if the document is divided by natural paragraphs, the division basis can be a paragraph identifier; if it is divided by sentences, the division basis can be punctuation marks.
- the punctuation marks here are punctuation marks that can indicate the end of a complete sentence. Examples include periods, question marks, exclamation marks, and carriage returns.
- the method for batch translation of documents divides the document into segments according to paragraph identifiers and/or punctuation marks in the document, and determines all segments of the document, which is simple and easy to implement, reduces the workload of translators, and improves document translation efficiency.
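A minimal sketch of this kind of segmentation, assuming that a newline serves as the paragraph identifier and that periods, question marks and exclamation marks (Western or Chinese) end a sentence. The function names and the `mode` parameter are hypothetical.

```python
import re
from typing import List

# Sentence boundary: whitespace after . ! ? or the position right after 。！？
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_paragraphs(text: str) -> List[str]:
    """Split on line breaks, treating them as paragraph identifiers (assumption)."""
    return [part.strip() for part in re.split(r"[\r\n]+", text) if part.strip()]


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph at sentence-ending punctuation, keeping the punctuation."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(paragraph) if part.strip()]


def segment_document(text: str, mode: str = "paragraph") -> List[str]:
    """Return the segments of a document by paragraph or by sentence."""
    paragraphs = split_paragraphs(text)
    if mode == "paragraph":
        return paragraphs
    if mode == "sentence":
        return [sentence for p in paragraphs for sentence in split_sentences(p)]
    raise ValueError(f"unknown mode: {mode}")


sample = "Install the pump. Check the cable.\nIs it running?"
print(segment_document(sample, "paragraph"))
print(segment_document(sample, "sentence"))
```

Requiring whitespace after Western punctuation keeps decimals such as 3.5 intact; abbreviations would still need special handling in a real system.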
- step 140 includes:
- the translation results of the plurality of documents are determined.
- historical translation task packages of multiple translators may be collected in advance. All the translation task packages of the multiple documents to be translated are respectively matched with the historical translation task packages of each translator for text similarity, thereby determining the translator corresponding to each translation task package and assigning the translation task packages.
- the corresponding translator translates the assigned translation task package, and arranges the obtained translation results according to the document numbers and block numbers of the translation task blocks in the translation task package, thereby obtaining translation results of multiple documents to be translated.
- any translation task package is respectively matched with the historical translation task packages of multiple translators for text similarity, so as to determine the translator corresponding to the translation task package, taking into account the historical translation data of the translators , which improves the rationality of translation task assignment, makes full use of the translator's work experience, saves translation time, and improves translation efficiency and accuracy.
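The assignment and reassembly described above might look like the following sketch. The text does not specify the text similarity measure, so word overlap is used as a placeholder; the translator names, package texts and the functions `assign_translators` and `reassemble` are hypothetical.

```python
from typing import Dict, List, Tuple


def text_similarity(a: str, b: str) -> float:
    """Placeholder text similarity (assumption): overlap of the two word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def assign_translators(
    packages: Dict[str, str], history: Dict[str, List[str]]
) -> Dict[str, str]:
    """Give each package to the translator whose past packages match it best."""
    assignment: Dict[str, str] = {}
    for package_id, text in packages.items():
        assignment[package_id] = max(
            history,
            key=lambda name: max(
                (text_similarity(text, past) for past in history[name]), default=0.0
            ),
        )
    return assignment


def reassemble(translated: List[Tuple[str, int, str]]) -> Dict[str, List[str]]:
    """Order translated blocks by (document number, block number) per document."""
    documents: Dict[str, List[str]] = {}
    for doc_no, _block_no, text in sorted(translated, key=lambda item: (item[0], item[1])):
        documents.setdefault(doc_no, []).append(text)
    return documents


new_packages = {
    "P1": "pump installation and power cable wiring",
    "P2": "warranty terms and customer support",
}
translator_history = {
    "alice": ["power cable wiring for pump installation"],
    "bob": ["customer support and warranty policy"],
}
assignment = assign_translators(new_packages, translator_history)
results = reassemble([("B", 1, "b1"), ("A", 2, "a2"), ("A", 1, "a1")])
print(assignment)
print(results)
```

The sketch assigns every package independently; balancing workload across translators is outside what the text describes and is not attempted here.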
- FIG. 2 is a schematic structural diagram of a document batch translation device provided by the present application. As shown in FIG. 2 , the device includes:
- a determining unit 210 configured to determine a plurality of documents to be translated
- a decomposition unit 220 configured to decompose any document based on the document structure of any document, and determine the translation task block corresponding to any document;
- an aggregation unit 230 configured to aggregate translation task blocks corresponding to each document, and determine translation task packages corresponding to multiple documents;
- the translation unit 240 is configured to determine translation results of the multiple documents based on translation task packages corresponding to the multiple documents.
- the determining unit 210 is used to determine multiple documents to be translated; the decomposition unit 220 is used to determine the translation task block corresponding to any document; the aggregation unit 230 is used to determine the translation task packages corresponding to the multiple documents; and the translation unit 240 is used to determine the translation results of the multiple documents.
- the document batch translation device decomposes each document according to the document structure, determines the translation task blocks corresponding to each document, aggregates the translation task blocks corresponding to each document, determines the translation task packages corresponding to the multiple documents, and then determines the translation results of the multiple documents, realizing batch translation of multiple documents. Because the content in each translation task package is continuous, semantically similar and of suitable length, multiple translators can complete the translation work in parallel, which improves document translation efficiency. At the same time, content with similar semantics is placed in the same translation task package and translated by the same translator, which avoids inconsistencies between the results of different translators and ensures the consistency of the translation results.
- the decomposition unit 220 includes:
- a dividing subunit, used to divide any document into fragments and determine all the fragments of the document;
- a decomposition subunit, used to determine, based on the document structure of any document and all the fragments of the document, several consecutive fragments corresponding to each level in the document;
- the block determination subunit is used to determine the translation task block corresponding to any document based on the word count range of the translation task block and several consecutive segments corresponding to each level in any document.
- the aggregation unit 230 includes:
- the clustering subunit is used to cluster the translation task blocks corresponding to each document based on the semantic similarity between the translation task blocks to obtain multiple semantic similarity classes;
- the aggregation subunit is used to aggregate each translation task block in any semantic similarity class, based on the semantic similarity between the translation task blocks in that class and the number of words in each translation task block, to obtain the translation task packages corresponding to that semantic similarity class;
- the package determination subunit is used for determining translation task packages corresponding to multiple documents based on the translation task package corresponding to each semantically similar class.
- the clustering subunit is used to:
- the aggregation subunit includes:
- the graph building module is used to build an undirected graph with each translation task block in any semantic similarity class as a vertex; each edge in the undirected graph carries the semantic similarity between the two translation task blocks it connects, and each vertex weight is the number of words in the corresponding translation task block;
- the aggregation module is used to traverse the undirected graph with edge priority, and aggregate the translation task blocks corresponding to multiple vertices whose vertex weight sum meets the preset condition into a translation task package, until the translation task packages corresponding to any semantically similar class are obtained; the preset condition is that the vertex weight sum lies within the word count range of the translation task package.
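- the dividing subunit is specifically used to:
- divide any document into fragments based on paragraph identifiers and/or punctuation marks in that document, and determine all fragments of that document.
- the translation unit 240 is specifically configured to:
- perform text similarity matching between any translation task package and the historical translation task packages of multiple translators, and determine the translator corresponding to that translation task package;
- determine the translation results of the plurality of documents based on the translation results produced by the translator corresponding to each translation task package.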
- FIG. 3 is a schematic structural diagram of an electronic device provided by the present application.
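- the electronic device may include: a processor (Processor) 310, a communication interface (Communications Interface) 320, a memory (Memory) 330 and a communication bus (Communications Bus) 340, wherein the processor 310, the communication interface 320, and the memory 330 communicate with each other through the communication bus 340.
- the processor 310 may invoke the logic instructions in the memory 330 to execute the methods provided by the above-mentioned embodiments, the method including: determining a plurality of documents to be translated; decomposing any document based on its document structure to determine the translation task blocks corresponding to that document; aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents; and determining the translation results of the plurality of documents based on those translation task packages.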
- the above-mentioned logic commands in the memory 330 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as an independent product.
- the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: a U disk, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, and other media that can store program code.
- the processor in the electronic device provided by the embodiment of the present application can call the logic instructions in the memory to realize the above-mentioned document batch translation method; the specific implementation is consistent with the method implementation and achieves the same beneficial effects, so it is not repeated here.
- the present application also provides a non-transitory computer-readable storage medium, which is described below; the non-transitory computer-readable storage medium described below and the document batch translation method described above may be referred to in correspondence with each other.
- Embodiments of the present application provide a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it carries out the methods provided by the foregoing embodiments, the method including: determining a plurality of documents to be translated; decomposing any document based on its document structure to determine the translation task blocks corresponding to that document; aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents; and determining the translation results of the plurality of documents based on those translation task packages.
- the device embodiments described above are only illustrative: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
- each embodiment can be implemented by means of software plus a necessary general hardware platform, and can of course also be implemented by hardware.
- the above-mentioned technical solutions, in essence, or the parts that contribute to the prior art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in some parts of the embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
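A document batch translation method, apparatus, electronic device, and storage medium, relating to the field of computer technology. The method comprises: determining a plurality of documents to be translated (110); decomposing any document based on its document structure to determine the translation task blocks corresponding to that document (120); aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents (130); and determining the translation results of the plurality of documents based on those translation task packages (140). The provided method, apparatus, electronic device, and storage medium realize batch translation of multiple documents and improve document translation efficiency.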
Description
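Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202110126066.5, filed on January 29, 2021 and entitled "文档批量翻译方法、装置、电子设备及存储介质" (Document batch translation method, apparatus, electronic device and storage medium), which is incorporated herein by reference in its entirety.
The present application relates to the field of computer technology, and in particular to a document batch translation method, apparatus, electronic device, and storage medium.
In large document translation projects, multiple documents to be translated are usually distributed among multiple translators for parallel translation, so that translation results can be obtained quickly and accurately. In the prior art, documents are distributed mainly by hand, which leads to unreasonable document allocation, long translation times, low translation efficiency, and poor accuracy of the translation results.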
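Summary of the Invention
The present application provides a document batch translation method, apparatus, electronic device, and storage medium to solve the technical problems of unreasonable document allocation, long translation times, and low translation efficiency in the prior art.
The present application provides a document batch translation method, comprising:
determining a plurality of documents to be translated;
decomposing any document based on its document structure to determine the translation task blocks corresponding to that document;
aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents;
determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.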
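According to the document batch translation method provided by the present application, decomposing any document based on its document structure to determine the translation task blocks corresponding to that document comprises:
dividing the document into fragments to determine all fragments of the document;
determining, based on the document structure of the document and all of its fragments, the consecutive fragments corresponding to each level in the document;
determining the translation task blocks corresponding to the document based on the word count range of a translation task block and the consecutive fragments corresponding to each level in the document.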
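According to the document batch translation method provided by the present application, aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents comprises:
clustering the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes;
aggregating the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class;
determining the translation task packages corresponding to the plurality of documents based on the translation task packages corresponding to each semantic similarity class.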
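According to the document batch translation method provided by the present application, clustering the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes comprises:
merging all semantic similarity classes that contain only one translation task block.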
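According to the document batch translation method provided by the present application, aggregating the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class comprises:
building an undirected graph with each translation task block in the semantic similarity class as a vertex, where an edge in the undirected graph is the semantic similarity between two translation task blocks and a vertex weight is the word count of the corresponding translation task block;
traversing the undirected graph edge-first and aggregating the translation task blocks corresponding to multiple vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to the semantic similarity class are obtained, where the preset condition is that the vertex weight sum falls within the word count range of a translation task package.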
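According to the document batch translation method provided by the present application, dividing any document into fragments to determine all fragments of the document comprises:
dividing the document into fragments based on the paragraph identifiers and/or punctuation marks in the document to determine all fragments of the document.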
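According to the document batch translation method provided by the present application, determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents comprises:
performing text similarity matching between any translation task package and the historical translation task packages of multiple translators to determine the translator corresponding to that translation task package;
determining the translation results of the plurality of documents based on the translation results produced by the translator corresponding to each translation task package.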
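The present application provides a document batch translation apparatus, comprising:
a determination unit, used to determine a plurality of documents to be translated;
a decomposition unit, used to decompose any document based on its document structure and determine the translation task blocks corresponding to that document;
an aggregation unit, used to aggregate the translation task blocks corresponding to the documents and determine the translation task packages corresponding to the plurality of documents;
a translation unit, used to determine the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.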
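The present application also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the document batch translation methods described above.
The present application also provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the document batch translation methods described above.
The document batch translation method, apparatus, electronic device, and storage medium provided by the present application decompose each document according to its document structure to determine its translation task blocks, aggregate the translation task blocks of all documents to determine the translation task packages for the plurality of documents, and then determine the translation results of the plurality of documents, thereby realizing batch translation of multiple documents. Because the document content in a translation task package is continuous, semantically similar, and of suitable length, multiple translators can work in parallel, which improves translation efficiency. At the same time, semantically similar content is placed in the same translation task package and translated by the same translator, which avoids inconsistent renderings by different translators and ensures the consistency of the translation results.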
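To describe the technical solutions of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the document batch translation method provided by the present application;
FIG. 2 is a schematic structural diagram of the document batch translation apparatus provided by the present application;
FIG. 3 is a schematic structural diagram of the electronic device provided by the present application.
To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.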
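FIG. 1 is a schematic flowchart of the document batch translation method provided by the present application. As shown in FIG. 1, the method includes:
Step 110: determine a plurality of documents to be translated.
Specifically, a document is a text that needs to be translated. Its language may be Chinese, or English, Japanese, French, German, Arabic, and so on; the embodiments of the present application do not specifically limit the language of the documents. For example, the documents to be translated may all be in one language and need to be translated into another language.
Step 120: decompose any document based on its document structure and determine the translation task blocks corresponding to that document.
Specifically, a translation task block is a set of consecutive fragments from the same document. A fragment is a basic unit of a document and may be a natural paragraph or a sentence. A document to be translated can be divided into multiple fragments to be translated. A word count range can be set for translation task blocks so that the number of words in a block stays within a certain range. For example, the word count range of a translation task block may be set to [500, 2000]. The size of the range can be set according to actual conditions.
A document to be translated can be divided into multiple translation task blocks. The basic principle of the division is to place fragments whose context is semantically coherent in the same translation task block wherever possible. Extracting the fragments of the document in order therefore preserves the continuity of the document translation.
A document can be decomposed according to its document structure to determine its translation task blocks. The document structure is the hierarchical structure of the document; the corresponding structure information includes the document title, the titles of each level and its sub-levels, and the number of fragments and the word count under each level and its sub-levels. After decomposition, the document number of each translation task block and its block number within that document can also be determined. With the document number and block number, the exact position of a translation task block among the multiple documents can be located quickly.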
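Step 130: aggregate the translation task blocks corresponding to the documents and determine the translation task packages corresponding to the plurality of documents.
Specifically, after the documents to be translated are decomposed, the translation task blocks corresponding to each document are obtained. In a large translation project, the documents are related to one another to some degree. For example, document A and document B may be different technical documents for the same product; part of the content of document A may be identical or similar to part of the content of document B, or the two may cross-reference each other.
Aggregation means gathering and combining translation task blocks scattered across multiple documents according to their intrinsic relationships. The result of aggregation is a translation task package. A translation task package contains multiple translation task blocks that are related to one another; in other words, the blocks within a package are closely related. The intrinsic relationship here may include semantic similarity. A word count range can be set for translation task packages so that the number of words in a package stays within a certain range. For example, the word count range of a translation task package may be set to [5000, 10000]. The size of the range can be set according to actual conditions.
Step 140: determine the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
Specifically, once the translation task packages are obtained, translation can proceed with the translation task package as the basic unit. For example, tasks can be assigned to multiple translators by translation task package. The translation results of each package are then combined in the order of the translation task blocks to obtain the translation results of the plurality of documents. A translator here means a person who translates documents.
The document batch translation method provided by the embodiments of the present application decomposes each document according to its document structure to determine its translation task blocks, aggregates the translation task blocks of all documents to determine the translation task packages for the plurality of documents, and then determines the translation results of the plurality of documents, thereby realizing batch translation of multiple documents. Because the document content in a translation task package is continuous, semantically similar, and of suitable length, multiple translators can work in parallel, which improves translation efficiency. At the same time, semantically similar content is placed in the same translation task package and translated by the same translator, which avoids inconsistent renderings by different translators and ensures the consistency of the translation results.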
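Based on the above embodiment, step 120 includes:
dividing any document into fragments to determine all fragments of the document;
determining, based on the document structure of the document and all of its fragments, the consecutive fragments corresponding to each level in the document;
determining the translation task blocks corresponding to the document based on the word count range of a translation task block and the consecutive fragments corresponding to each level in the document.
Specifically, the number of translation task blocks corresponding to each document is determined by the word count range of a translation task block, and may be one or more. For example, if the overall word count of a document falls below the word count range of a translation task block, the document can be treated as a single translation task block; if the overall word count exceeds the word count range, the document can be decomposed into multiple translation task blocks.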
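A translation task block is determined with the fragment as the basic unit. For example, one of the documents to be translated can be divided into fragments, yielding multiple fragments to be translated, which can be written as the set:
$S = \{S_1, S_2, \ldots, S_n\}$
where $S$ is the document to be translated, $S_i$ is the $i$-th fragment to be translated, $n$ is the number of fragments to be translated, and $1 \le i \le n$.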
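The consecutive fragments corresponding to each level in a document are determined from the document structure of the document and all of its fragments. For example, a document $S$ to be translated contains 5 fragments, and its document structure has two levels, Chapter 1 and Chapter 2, each of which has two sub-levels: Chapter 1 contains Section 1.1 and Section 1.2, and Chapter 2 contains Section 2.1 and Section 2.2. Section 1.1 contains fragment $S_1$, Section 1.2 contains fragment $S_2$, Section 2.1 contains fragment $S_3$, and Section 2.2 contains fragments $S_4$ and $S_5$.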
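As a rough illustration of the level-to-fragment mapping in the example above, the following Python sketch walks a nested outline and records the consecutive fragments under each level. The nested-dict outline format and the name `collect_fragments` are assumptions made for illustration; the application does not prescribe a data structure.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical outline format: a dict maps a heading to either a nested
# dict (sub-levels) or a list of fragment ids (leaf level).
Outline = Dict[str, Union["Outline", List[str]]]
Path = Tuple[str, ...]


def collect_fragments(
    outline: Outline, path: Path = ()
) -> Tuple[Dict[Path, List[str]], List[str]]:
    """Return (level path -> ordered fragments, all fragments in document order)."""
    per_level: Dict[Path, List[str]] = {}
    ordered: List[str] = []
    for heading, child in outline.items():
        child_path = path + (heading,)
        if isinstance(child, dict):
            # Recurse into sub-levels; a parent level owns all fragments below it.
            sub_levels, fragments = collect_fragments(child, child_path)
            per_level.update(sub_levels)
        else:
            fragments = list(child)
        per_level[child_path] = fragments
        ordered.extend(fragments)
    return per_level, ordered


# Document S from the example: two chapters, each with two sections.
document_s: Outline = {
    "1": {"1.1": ["S1"], "1.2": ["S2"]},
    "2": {"2.1": ["S3"], "2.2": ["S4", "S5"]},
}
levels, all_fragments = collect_fragments(document_s)
```

Here `levels[("2", "2.2")]` holds the fragments of Section 2.2 and `levels[("1",)]` holds every fragment under Chapter 1, which is the per-level information the block-splitting step consumes.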
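The translation task blocks corresponding to the document are determined from the word count range of a translation task block and the consecutive fragments corresponding to each level in the document. For example, the word count range of a translation task block may be set to [500, 2000]. Suppose the fragments $S_1$, $S_2$, $S_3$, $S_4$, and $S_5$ of document $S$ contain 200, 300, 1600, 300, and 800 words respectively. Document $S$ can then be decomposed into 3 translation task blocks, labeled S-1, S-2, and S-3 by document number and block number. Translation task block S-1 contains fragments $S_1$ and $S_2$ (500 words), block S-2 contains fragments $S_3$ and $S_4$ (1900 words), and block S-3 contains fragment $S_5$ (800 words).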
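When decomposing a document into translation task blocks, the following principles may be applied:
Within the word count range of a translation task block, consecutive fragments of the same level are placed in the same translation task block. If the fragments of that level cannot reach the lower limit of the word count range, subsequent fragments continue to be extracted until the block reaches the lower limit. If the fragments of that level have already reached the upper limit of the word count range, the next fragment, which would exceed the upper limit, is placed in the next translation task block;
Within the word count range of a translation task block, consecutive fragments of different levels are placed in the same translation task block. If the fragments of those levels cannot reach the lower limit of the word count range, subsequent fragments continue to be extracted until the block reaches the lower limit. If the fragments of those levels have already reached the upper limit of the word count range, the next fragment, which would exceed the upper limit, is placed in the next translation task block.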
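The two principles above can be sketched as a simple greedy pass over the fragments in document order. This is one possible reading, not the application's definitive procedure: the function name `split_into_blocks` is hypothetical, level boundaries are ignored (the text applies the same rule within and across levels), and a block that is still below the lower limit always takes the next fragment.

```python
from typing import List


def split_into_blocks(
    word_counts: List[int], lower: int = 500, upper: int = 2000
) -> List[List[int]]:
    """Group consecutive fragments (1-based indices) into translation task blocks."""
    blocks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for index, count in enumerate(word_counts, start=1):
        # Once the lower limit is met, a fragment that would push the block
        # past the upper limit starts the next block instead.
        if current and total >= lower and total + count > upper:
            blocks.append(current)
            current, total = [], 0
        current.append(index)
        total += count
    if current:
        blocks.append(current)  # the trailing block may fall below the lower limit
    return blocks


# Word counts of fragments S1..S5 from the worked example.
example_blocks = split_into_blocks([200, 300, 1600, 300, 800])
```

With the word counts from the worked example, this grouping reproduces blocks S-1, S-2, and S-3. How to treat a short trailing block is left open by the text; this sketch simply keeps it.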
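The document batch translation method provided by the embodiments of the present application decomposes a document according to its document structure to determine its translation task blocks, which offers a simple and practical way to decompose documents and reduces the complexity of the batch translation algorithm.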
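Based on any of the above embodiments, step 130 includes:
clustering the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes;
aggregating the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class;
determining the translation task packages corresponding to the plurality of documents based on the translation task packages corresponding to each semantic similarity class.
Specifically, the translation task blocks obtained for the documents can be combined across documents according to the semantic similarity between translation task blocks, yielding the translation task packages for the documents to be translated. The method for determining translation task packages has two parts: the first part clusters the translation task blocks of all documents across documents to obtain a plurality of semantic similarity classes; the second part aggregates the translation task blocks within each semantic similarity class to obtain the translation task packages corresponding to that class.
Before clustering, an existing classification model can be used to pre-classify the translation task blocks of the documents, producing semantic similarity classes that are already classified. For example, the existing model may be a document content classification model that sorts translation task blocks into categories such as finance, military, or engineering.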
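The translation task blocks of all documents to be translated can be regarded as a set $B$. The blocks that can be pre-classified form the set $B_1$, and the blocks that cannot be pre-classified form the set $B_2$, with $B_1 + B_2 = B$.
The translation task blocks in the set $B_2$ can be clustered, for example with the K-means algorithm; after clustering, the blocks in $B_2$ are divided into several classes.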
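The embodiments of the present application provide a translation task block clustering method based on semantic similarity, used to classify the translation task blocks that cannot be classified by the existing classification model. The steps of the method are:
Step 1: determine the set $B_2 = \{B_{21}, B_{22}, \ldots, B_{2m}\}$ and a given threshold for semantic similarity, where $m$ is the number of translation task blocks in $B_2$;
Step 2: taking translation task block $B_{21}$ as the reference, compute the semantic similarity between $B_{21}$ and every other translation task block in $B_2$, select all blocks whose semantic similarity exceeds the given threshold, and form the first semantic similarity class $E_1$ from them together with $B_{21}$;
Step 3: among all translation task blocks in $B_2$ other than those in $E_1$, apply the method of Step 2 to obtain the second semantic similarity class $E_2$;
Step 4: repeat the methods of Step 2 and Step 3 until every translation task block in $B_2$ has been assigned to a semantic similarity class, finally obtaining a plurality of semantic similarity classes.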
For example, for the set $B_2 = \{B_{21}, B_{22}, B_{23}, B_{24}\}$, clustering yields the semantic similarity classes $E_1 = \{B_{21}, B_{22}\}$ and $E_2 = \{B_{23}, B_{24}\}$.
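Steps 1 to 4 of the threshold-based clustering can be sketched as follows. The similarity measure is not specified in the text, so the `similarity` callable is a parameter; the token-overlap function `jaccard` and the sample block texts are stand-ins chosen only to make the example runnable.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def threshold_cluster(
    blocks: Sequence[T], similarity: Callable[[T, T], float], threshold: float
) -> List[List[T]]:
    """Repeatedly take the first remaining block as the reference and pull in
    every remaining block whose similarity to it exceeds the threshold."""
    remaining = list(blocks)
    classes: List[List[T]] = []
    while remaining:
        reference = remaining[0]
        current_class, leftover = [reference], []
        for block in remaining[1:]:
            if similarity(reference, block) > threshold:
                current_class.append(block)
            else:
                leftover.append(block)
        classes.append(current_class)
        remaining = leftover
    return classes


def jaccard(a: str, b: str) -> float:
    """Illustrative stand-in similarity: overlap of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


b2 = ["pump valve pressure", "pump valve flow", "invoice tax payment", "invoice tax refund"]
similar_classes = threshold_cluster(b2, jaccard, threshold=0.3)
```

With these sample texts the four blocks fall into two classes of two, mirroring the $E_1$ and $E_2$ example. Note that the result depends on the order of the blocks, because each class is built around whichever block happens to come first.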
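The document batch translation method provided by the embodiments of the present application clusters and aggregates the translation task blocks of the documents according to the semantic similarity between translation task blocks, yielding translation task packages with higher semantic similarity, which makes the division of translation tasks more reasonable and accurate and improves document translation efficiency.
Based on any of the above embodiments, clustering the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes includes:
merging all semantic similarity classes that contain only one translation task block.
Specifically, after the cross-document clustering described above, some semantic similarity classes may contain only one translation task block. All such classes can be merged into a single class, which may be called the tail class.
In a large translation project, the translation task blocks in the tail class may still be semantically similar to one another. The blocks in the tail class can therefore be aggregated further to obtain multiple translation task packages.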
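Merging single-block classes into a tail class is a small bookkeeping step; a minimal sketch follows. The function name `merge_singletons` and the choice to append the tail class last are assumptions for illustration.

```python
from typing import List, TypeVar

T = TypeVar("T")


def merge_singletons(classes: List[List[T]]) -> List[List[T]]:
    """Keep multi-block classes as they are and merge all single-block
    classes into one tail class, appended at the end."""
    multi = [cls for cls in classes if len(cls) > 1]
    tail = [cls[0] for cls in classes if len(cls) == 1]
    return multi + ([tail] if tail else [])


merged = merge_singletons([["B21", "B22"], ["B23"], ["B24"], ["B25", "B26"]])
```

The tail class can then go through the same aggregation step as any other semantic similarity class.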
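Based on any of the above embodiments, aggregating the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class includes:
building an undirected graph with each translation task block in the semantic similarity class as a vertex, where an edge in the undirected graph is the semantic similarity between two translation task blocks and a vertex weight is the word count of the corresponding translation task block;
traversing the undirected graph edge-first and aggregating the translation task blocks corresponding to multiple vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to the semantic similarity class are obtained, where the preset condition is that the vertex weight sum falls within the word count range of a translation task package.
Specifically, the semantic similarity classes here may include pre-classified semantic similarity classes, semantic similarity classes obtained by clustering, and the tail class.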
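Suppose a semantic similarity class contains $k$ translation task blocks, where $k$ is a positive integer, denoted as the set $A = \{a_1, a_2, \ldots, a_k\}$. The word counts of the blocks can be denoted as the set $C = \{c_1, c_2, \ldots, c_k\}$, and the pairwise semantic similarities between blocks as the set $Z = \{a_1a_2, a_1a_3, \ldots, a_1a_k, a_2a_3, a_2a_4, \ldots, a_2a_k, \ldots, a_{k-1}a_k\}$. An undirected graph $G$ is built with the translation task blocks in the class as vertices. The edges of $G$ are the semantic similarities between translation task blocks and the vertex weights are the word counts of the blocks, so $G$ can be written as $G = (A, Z, C)$.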
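The undirected graph $G$ is traversed edge-first, and the weights of the visited vertices are accumulated to obtain the vertex weight sum. If the vertex weight sum meets the preset condition, the translation task blocks corresponding to the visited vertices are aggregated into one translation task package. The preset condition can be set so that the vertex weight sum falls within the word count range of a translation task package. This is repeated until the translation task packages corresponding to the semantic similarity class are obtained.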
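The above method can be expressed as the following algorithm steps:
Step 1: initialize the set Z_new = Z;
Step 2: set the set of edges to be removed, Z_new_del, and the overflow set, Z_new_overflow, to empty;
Step 3: among the elements of Z_new minus Z_new_overflow, select the edge X with the largest semantic similarity;
Step 4: compute the weight sum of the vertices covered by edge X together with the edges in Z_new_del;
Step 5: if the weight sum is less than the lower limit of the word count range of a translation task package, add edge X to Z_new_del, remove the elements of Z_new_del from Z_new to obtain the updated Z_new, and go to Step 3;
Step 6: if the weight sum is greater than the upper limit of the word count range of a translation task package, add edge X to Z_new_overflow and go to Step 3;
Step 7: aggregate the translation task blocks corresponding to the vertices of the edges in Z_new_del into one translation task package;
Step 8: if Z_new is not empty and the weight sum of the vertices of all edges in Z_new is greater than the lower limit of the word count range of a translation task package, go to Step 2;
Step 9: if Z_new is not empty and the weight sum of the vertices of all edges in Z_new is less than the lower limit of the word count range of a translation task package, aggregate the translation task blocks corresponding to the vertices of all edges in Z_new into one translation task package;
Step 10: obtain all translation task packages aggregated from the translation task blocks in set A; the block aggregation process ends.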
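The ten steps leave a few cases open, so the following Python sketch is only one interpretation, with its assumptions stated plainly: an edge whose weight sum lands inside the range is accepted and closes the package; after a package is closed, every edge touching one of its vertices is dropped; if no candidate edge remains, the vertices gathered so far are packaged as they are; and any vertices never covered by a package are placed in a final catch-all package, which may fall outside the range. The function name `aggregate_blocks` and the sample weights and similarities are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def aggregate_blocks(
    weights: Dict[str, int], edges: Dict[Edge, float], lower: int, upper: int
) -> List[List[str]]:
    """Greedy edge-first aggregation of blocks (vertices) into packages."""
    z_new = dict(edges)                  # Step 1: remaining edges
    packages: List[List[str]] = []
    while z_new:
        chosen: Set[str] = set()         # vertices covered by Z_new_del
        overflow: Set[Edge] = set()      # Step 2: Z_new_overflow
        while True:
            candidates = [e for e in z_new if e not in overflow]
            if not candidates:
                break
            x = max(candidates, key=lambda e: z_new[e])      # Step 3
            total = sum(weights[v] for v in chosen | set(x))  # Step 4
            if total > upper:            # Step 6: edge would overflow
                overflow.add(x)
                continue
            chosen |= set(x)             # Step 5: accept the edge
            del z_new[x]
            if total >= lower:           # assumption: in range closes the package
                break
        if not chosen:
            break                        # every remaining edge overflows on its own
        packages.append(sorted(chosen))  # Steps 7 and 9
        # Assumption: drop edges that touch already packaged vertices.
        z_new = {e: s for e, s in z_new.items() if not set(e) & chosen}
    packed = {v for package in packages for v in package}
    leftover = sorted(v for v in weights if v not in packed)
    if leftover:
        packages.append(leftover)        # assumption: catch-all package
    return packages


block_words = {"a1": 3000, "a2": 2500, "a3": 4000, "a4": 1500}
block_sims = {
    ("a1", "a2"): 0.9, ("a1", "a3"): 0.2, ("a1", "a4"): 0.3,
    ("a2", "a3"): 0.4, ("a2", "a4"): 0.1, ("a3", "a4"): 0.8,
}
task_packages = aggregate_blocks(block_words, block_sims, lower=5000, upper=10000)
```

In this toy class the most similar pair `a1`/`a2` totals 5500 words and closes the first package, after which `a3`/`a4` form the second. A production version would need an explicit policy for blocks that exceed the upper limit on their own.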
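The document batch translation method provided by the embodiments of the present application aggregates translation task blocks by traversing an undirected graph, yielding translation task packages with higher semantic similarity, which makes the division of translation tasks more reasonable and accurate and improves document translation efficiency.
Based on any of the above embodiments, dividing any document into fragments to determine all fragments of the document includes:
dividing the document into fragments based on the paragraph identifiers and/or punctuation marks in the document to determine all fragments of the document.
Specifically, a document can be divided into fragments by natural paragraph, by sentence, or by both.
If the document is divided by natural paragraph, paragraph identifiers can be used as the basis for division. If it is divided by sentence, punctuation marks can be used as the basis. The punctuation marks here are those that mark the end of a complete sentence, for example periods, question marks, exclamation marks, and carriage returns.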
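A minimal sketch of both division modes is shown below, using Python's standard `re` module. The terminator set is an assumption: it covers the Chinese full stop plus full-width and ASCII question and exclamation marks, and deliberately leaves out the ASCII period so that decimals such as "3.5" are not split. The function names are hypothetical.

```python
import re
from typing import List

# A run of non-terminator characters followed by any closing terminators.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_paragraphs(text: str) -> List[str]:
    """Divide by paragraph identifier (here: the newline character)."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def split_sentences(text: str) -> List[str]:
    """Divide by sentence-ending punctuation and by line breaks."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


sample = "第一句。第二句？\n第三句！"
paragraph_fragments = split_paragraphs(sample)
sentence_fragments = split_sentences(sample)
```

Paragraph-level fragments keep more context per fragment, while sentence-level fragments give the block-splitting step finer granularity for hitting the word count range.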
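The document batch translation method provided by the embodiments of the present application divides a document into fragments according to its paragraph identifiers and/or punctuation marks to determine all of its fragments, which is simple and practical, reduces the workload of translators, and improves document translation efficiency.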
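Based on any of the above embodiments, step 140 includes:
performing text similarity matching between any translation task package and the historical translation task packages of multiple translators to determine the translator corresponding to that translation task package;
determining the translation results of the plurality of documents based on the translation results produced by the translator corresponding to each translation task package.
Specifically, the historical translation task packages of multiple translators can be collected in advance. Every translation task package of the documents to be translated is matched for text similarity against each translator's historical translation task packages, which determines the translator for each package, and the packages are assigned accordingly.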
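The matching step can be sketched as picking, for each package, the translator whose history contains the most similar past package. The text does not name a text similarity measure, so the token-overlap score below is a stand-in, and `assign_translators`, the translator ids, and the sample texts are all hypothetical.

```python
from typing import Dict, List


def token_overlap(a: str, b: str) -> float:
    """Illustrative text similarity: Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def assign_translators(
    packages: Dict[str, str], histories: Dict[str, List[str]]
) -> Dict[str, str]:
    """Map each package id to the translator with the best-matching history."""
    assignment: Dict[str, str] = {}
    for package_id, text in packages.items():
        def best_score(translator: str) -> float:
            # A translator's score is the best match among their past packages.
            return max(
                (token_overlap(text, past) for past in histories[translator]),
                default=0.0,
            )
        assignment[package_id] = max(histories, key=best_score)
    return assignment


new_packages = {
    "P1": "hydraulic pump maintenance manual",
    "P2": "quarterly tax invoice report",
}
translator_histories = {
    "translator_a": ["pump installation manual", "valve maintenance guide"],
    "translator_b": ["annual tax report", "invoice processing policy"],
}
assignments = assign_translators(new_packages, translator_histories)
```

This sketch ignores workload balancing: if one translator's history matches most packages, they receive most of the work, so a real allocator would likely cap packages per translator.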
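Each translator translates the translation task packages assigned to them, and the resulting translations are arranged according to the document number and block number of the translation task blocks in each package, yielding the translation results of the plurality of documents.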
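Arranging the translated blocks by document number and block number amounts to a sort followed by a per-document join. The sketch below assumes each translated block is carried as a hypothetical `(document_number, block_number, text)` tuple and that blocks are joined with a newline.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reassemble(translated_blocks: List[Tuple[int, int, str]]) -> Dict[int, str]:
    """Rebuild each document's translation from blocks returned in any order."""
    by_document: Dict[int, List[str]] = defaultdict(list)
    ordered = sorted(translated_blocks, key=lambda block: (block[0], block[1]))
    for doc_no, _block_no, text in ordered:
        by_document[doc_no].append(text)
    return {doc_no: "\n".join(parts) for doc_no, parts in by_document.items()}


# Blocks arrive grouped by package, not by document.
returned = [(2, 1, "D2-B1"), (1, 2, "D1-B2"), (1, 1, "D1-B1")]
translated_documents = reassemble(returned)
```

Because packages mix blocks from several documents, carrying both numbers with every block is what makes this final step independent of how the packages were formed or who translated them.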
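The document batch translation method provided by the embodiments of the present application matches any translation task package against the historical translation task packages of multiple translators by text similarity to determine the translator for that package. It takes the translators' historical translation data into account, makes task allocation more reasonable, makes full use of the translators' working experience, saves translation time, and improves translation efficiency and accuracy.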
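Based on any of the above embodiments, FIG. 2 is a schematic structural diagram of the document batch translation apparatus provided by the present application. As shown in FIG. 2, the apparatus includes:
a determination unit 210, used to determine a plurality of documents to be translated;
a decomposition unit 220, used to decompose any document based on its document structure and determine the translation task blocks corresponding to that document;
an aggregation unit 230, used to aggregate the translation task blocks corresponding to the documents and determine the translation task packages corresponding to the plurality of documents;
a translation unit 240, used to determine the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
Specifically, the determination unit 210 determines the plurality of documents to be translated; the decomposition unit 220 determines the translation task blocks corresponding to any document; the aggregation unit 230 determines the translation task packages corresponding to the plurality of documents; and the translation unit 240 determines the translation results of the plurality of documents.
The document batch translation apparatus provided by the embodiments of the present application decomposes each document according to its document structure to determine its translation task blocks, aggregates the translation task blocks of all documents to determine the translation task packages for the plurality of documents, and then determines the translation results of the plurality of documents, thereby realizing batch translation of multiple documents. Because the document content in a translation task package is continuous, semantically similar, and of suitable length, multiple translators can work in parallel, which improves translation efficiency. At the same time, semantically similar content is placed in the same translation task package and translated by the same translator, which avoids inconsistent renderings by different translators and ensures the consistency of the translation results.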
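Based on any of the above embodiments, the decomposition unit 220 includes:
a dividing subunit, used to divide any document into fragments and determine all fragments of that document;
a decomposition subunit, used to determine, based on the document structure of the document and all of its fragments, the consecutive fragments corresponding to each level in the document;
a block determination subunit, used to determine the translation task blocks corresponding to the document based on the word count range of a translation task block and the consecutive fragments corresponding to each level in the document.
Based on any of the above embodiments, the aggregation unit 230 includes:
a clustering subunit, used to cluster the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes;
an aggregation subunit, used to aggregate the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class;
a package determination subunit, used to determine the translation task packages corresponding to the plurality of documents based on the translation task packages corresponding to each semantic similarity class.
Based on any of the above embodiments, the clustering subunit is used to:
merge all semantic similarity classes that contain only one translation task block.
Based on any of the above embodiments, the aggregation subunit includes:
a graph building module, used to build an undirected graph with each translation task block in any semantic similarity class as a vertex, where an edge in the undirected graph is the semantic similarity between two translation task blocks and a vertex weight is the word count of the corresponding translation task block;
an aggregation module, used to traverse the undirected graph edge-first and aggregate the translation task blocks corresponding to multiple vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to that semantic similarity class are obtained, where the preset condition is that the vertex weight sum falls within the word count range of a translation task package.
Based on any of the above embodiments, the dividing subunit is specifically used to:
divide any document into fragments based on the paragraph identifiers and/or punctuation marks in the document and determine all fragments of the document.
Based on any of the above embodiments, the translation unit 240 is specifically used to:
perform text similarity matching between any translation task package and the historical translation task packages of multiple translators to determine the translator corresponding to that translation task package;
determine the translation results of the plurality of documents based on the translation results produced by the translator corresponding to each translation task package.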
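Based on any of the above embodiments, FIG. 3 is a schematic structural diagram of the electronic device provided by the present application. As shown in FIG. 3, the electronic device may include a processor (Processor) 310, a communication interface (Communications Interface) 320, a memory (Memory) 330, and a communication bus (Communications Bus) 340, where the processor 310, the communication interface 320, and the memory 330 communicate with each other through the communication bus 340. The processor 310 can invoke the logic instructions in the memory 330 to execute the methods provided by the above embodiments, the method including:
determining a plurality of documents to be translated; decomposing any document based on its document structure to determine the translation task blocks corresponding to that document; aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents; and determining the translation results of the plurality of documents based on those translation task packages.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a U disk, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, and other media that can store program code.
The processor in the electronic device provided by the embodiments of the present application can invoke the logic instructions in the memory to implement the document batch translation method described above. Its specific implementation is consistent with the method implementation and achieves the same beneficial effects, so it is not repeated here.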
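The present application also provides a non-transitory computer-readable storage medium, which is described below. The non-transitory computer-readable storage medium described below and the document batch translation method described above may be referred to in correspondence with each other.
The embodiments of the present application provide a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program carries out the methods provided by the above embodiments, the method including:
determining a plurality of documents to be translated; decomposing any document based on its document structure to determine the translation task blocks corresponding to that document; aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents; and determining the translation results of the plurality of documents based on those translation task packages.
When the computer program stored on the non-transitory computer-readable storage medium provided by the embodiments of the present application is executed, it implements the document batch translation method described above. Its specific implementation is consistent with the method implementation and achieves the same beneficial effects, so it is not repeated here.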
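The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above implementations, a person skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, and of course also by hardware. On this understanding, the above technical solutions, in essence, or the part of them that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.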
Claims (10)
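- A document batch translation method, characterized by comprising: determining a plurality of documents to be translated; decomposing any document based on its document structure to determine the translation task blocks corresponding to that document; aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents; and determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
- The document batch translation method according to claim 1, characterized in that decomposing any document based on its document structure to determine the translation task blocks corresponding to that document comprises: dividing the document into fragments to determine all fragments of the document; determining, based on the document structure of the document and all of its fragments, the consecutive fragments corresponding to each level in the document; and determining the translation task blocks corresponding to the document based on the word count range of a translation task block and the consecutive fragments corresponding to each level in the document.
- The document batch translation method according to claim 1, characterized in that aggregating the translation task blocks corresponding to the documents to determine the translation task packages corresponding to the plurality of documents comprises: clustering the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes; aggregating the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class; and determining the translation task packages corresponding to the plurality of documents based on the translation task packages corresponding to each semantic similarity class.
- The document batch translation method according to claim 3, characterized in that clustering the translation task blocks corresponding to the documents based on the semantic similarity between translation task blocks to obtain a plurality of semantic similarity classes comprises: merging all semantic similarity classes that contain only one translation task block.
- The document batch translation method according to claim 3, characterized in that aggregating the translation task blocks in any semantic similarity class, based on the semantic similarity between those blocks and the word count of each block, to obtain the translation task packages corresponding to that semantic similarity class comprises: building an undirected graph with each translation task block in the semantic similarity class as a vertex, where an edge in the undirected graph is the semantic similarity between two translation task blocks and a vertex weight is the word count of the corresponding translation task block; and traversing the undirected graph edge-first and aggregating the translation task blocks corresponding to multiple vertices whose vertex weight sum meets a preset condition into one translation task package, until the translation task packages corresponding to the semantic similarity class are obtained, where the preset condition is that the vertex weight sum falls within the word count range of a translation task package.
- The document batch translation method according to claim 2, characterized in that dividing the document into fragments to determine all fragments of the document comprises: dividing the document into fragments based on the paragraph identifiers and/or punctuation marks in the document to determine all fragments of the document.
- The document batch translation method according to any one of claims 1 to 6, characterized in that determining the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents comprises: performing text similarity matching between any translation task package and the historical translation task packages of multiple translators to determine the translator corresponding to that translation task package; and determining the translation results of the plurality of documents based on the translation results produced by the translator corresponding to each translation task package.
- A document batch translation apparatus, characterized by comprising: a determination unit, used to determine a plurality of documents to be translated; a decomposition unit, used to decompose any document based on its document structure and determine the translation task blocks corresponding to that document; an aggregation unit, used to aggregate the translation task blocks corresponding to the documents and determine the translation task packages corresponding to the plurality of documents; and a translation unit, used to determine the translation results of the plurality of documents based on the translation task packages corresponding to the plurality of documents.
- An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the document batch translation method according to any one of claims 1 to 7.
- A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the document batch translation method according to any one of claims 1 to 7.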
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110126066.5 | 2021-01-29 | ||
CN202110126066.5A CN112784613A (zh) | 2021-01-29 | 2021-01-29 | 文档批量翻译方法、装置、电子设备及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022160819A1 true WO2022160819A1 (zh) | 2022-08-04 |
Family
ID=75759737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/126664 WO2022160819A1 (zh) | 2021-01-29 | 2021-10-27 | 文档批量翻译方法、装置、电子设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112784613A (zh) |
WO (1) | WO2022160819A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112784613A (zh) * | 2021-01-29 | 2021-05-11 | 语联网(武汉)信息技术有限公司 | 文档批量翻译方法、装置、电子设备及存储介质 |
CN114154092B (zh) * | 2021-11-18 | 2023-04-18 | 网易有道信息技术(江苏)有限公司 | 用于对网页进行翻译的方法及其相关产品 |
CN114358030A (zh) * | 2021-12-29 | 2022-04-15 | 苏州远卓科技信息有限公司 | 一种专利文献翻译后的机器校对方法及其系统 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103064970A (zh) * | 2012-12-31 | 2013-04-24 | 武汉传神信息技术有限公司 | 优化译员的检索方法 |
CN103678287A (zh) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | 一种关键词翻译统一的方法 |
CN103744834A (zh) * | 2013-12-23 | 2014-04-23 | 武汉传神信息技术有限公司 | 一种翻译任务准确分配的方法 |
CN104484323A (zh) * | 2014-12-26 | 2015-04-01 | 武汉传神信息技术有限公司 | 一种基于文档片段的翻译处理方法 |
US20180143975A1 (en) * | 2016-11-18 | 2018-05-24 | Lionbridge Technologies, Inc. | Collection strategies that facilitate arranging portions of documents into content collections |
CN112784613A (zh) * | 2021-01-29 | 2021-05-11 | 语联网(武汉)信息技术有限公司 | 文档批量翻译方法、装置、电子设备及存储介质 |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103885942B (zh) * | 2014-03-18 | 2017-09-05 | 成都优译信息技术股份有限公司 | 一种快速翻译装置及方法 |
CN105808528B (zh) * | 2016-03-04 | 2019-01-25 | 张广睿 | 一种文档文字的处理方法 |
CN107391565B (zh) * | 2017-06-13 | 2020-11-03 | 东南大学 | 一种基于主题模型的跨语言层次分类体系匹配方法 |
KR101977207B1 (ko) * | 2017-07-25 | 2019-06-18 | 주식회사 한글과컴퓨터 | 문서 일괄 번역 시스템 |
CN111191470A (zh) * | 2019-12-25 | 2020-05-22 | 语联网(武汉)信息技术有限公司 | 文档翻译方法及装置 |
CN111611813B (zh) * | 2020-04-29 | 2023-09-08 | 南京南瑞继保电气有限公司 | 文档翻译方法、装置、电子设备及存储介质 |
CN111611811B (zh) * | 2020-05-25 | 2023-01-13 | 腾讯科技(深圳)有限公司 | 翻译方法、装置、电子设备及计算机可读存储介质 |
-
2021
- 2021-01-29 CN CN202110126066.5A patent/CN112784613A/zh active Pending
- 2021-10-27 WO PCT/CN2021/126664 patent/WO2022160819A1/zh active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103064970A (zh) * | 2012-12-31 | 2013-04-24 | 武汉传神信息技术有限公司 | 优化译员的检索方法 |
CN103678287A (zh) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | 一种关键词翻译统一的方法 |
CN103744834A (zh) * | 2013-12-23 | 2014-04-23 | 武汉传神信息技术有限公司 | 一种翻译任务准确分配的方法 |
CN104484323A (zh) * | 2014-12-26 | 2015-04-01 | 武汉传神信息技术有限公司 | 一种基于文档片段的翻译处理方法 |
US20180143975A1 (en) * | 2016-11-18 | 2018-05-24 | Lionbridge Technologies, Inc. | Collection strategies that facilitate arranging portions of documents into content collections |
CN112784613A (zh) * | 2021-01-29 | 2021-05-11 | 语联网(武汉)信息技术有限公司 | 文档批量翻译方法、装置、电子设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN112784613A (zh) | 2021-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022160819A1 (zh) | 文档批量翻译方法、装置、电子设备及存储介质 | |
US9740688B2 (en) | System and method for training a machine translation system | |
CN102591988B (zh) | 基于语义图的短文本分类方法 | |
RU2586577C2 (ru) | Фильтрация дуг в синтаксическом графе | |
WO2020259280A1 (zh) | 日志管理方法、装置、网络设备和可读存储介质 | |
CN104778158A (zh) | 一种文本表示方法及装置 | |
RU2618374C1 (ru) | Выявление словосочетаний в текстах на естественном языке | |
WO2011004529A1 (ja) | 分類階層再作成システム、分類階層再作成方法及び分類階層再作成プログラム | |
WO2022095637A1 (zh) | 一种故障日志分类方法、系统、设备以及介质 | |
CN113407679A (zh) | 文本主题挖掘方法、装置、电子设备及存储介质 | |
CN106445915A (zh) | 一种新词发现方法及装置 | |
US20210342379A1 (en) | Method and device for processing sentence, and storage medium | |
CN105956158B (zh) | 基于海量微博文本和用户信息的网络新词自动提取的方法 | |
JP2021111342A (ja) | テキストワードセグメンテーションの方法、装置、デバイスおよび媒体 | |
CN118503350A (zh) | 一种提升大模型rag准确性的流程优化设计方法和系统 | |
Nodarakis et al. | Using hadoop for large scale analysis on twitter: A technical report | |
CN106599305B (zh) | 一种基于众包的异构媒体语义融合方法 | |
US10372816B2 (en) | Preprocessing of string inputs in natural language processing | |
CN112926297A (zh) | 处理信息的方法、装置、设备和存储介质 | |
CN115470929A (zh) | 样本数据的生成方法、模型训练方法、装置、设备及介质 | |
CN115577082A (zh) | 文档关键词的提取方法、装置、电子设备及存储介质 | |
CN114328885A (zh) | 一种信息处理方法、装置及计算机可读存储介质 | |
WO2018136371A1 (en) | Compressed encoding for bit sequence | |
JP7524128B2 (ja) | テキスト予測方法、装置、機器及び記憶媒体 | |
CN111209371B (zh) | 评论数据处理方法、装置、计算机设备和存储介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21922410 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21922410 Country of ref document: EP Kind code of ref document: A1 |