CN109299480B - Context-based term translation method and device - Google Patents

Context-based term translation method and device

Info

Publication number
CN109299480B
Authority
CN
China
Prior art keywords
term
corpus
definitions
word
paraphrasing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811025328.3A
Other languages
Chinese (zh)
Other versions
CN109299480A (en
Inventor
宋安琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Shanghai Transn Translation Services Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Shanghai Transn Translation Services Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd, Shanghai Transn Translation Services Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN201811025328.3A priority Critical patent/CN109299480B/en
Publication of CN109299480A publication Critical patent/CN109299480A/en
Application granted granted Critical
Publication of CN109299480B publication Critical patent/CN109299480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

An embodiment of the invention provides a context-based term translation method and device. The method comprises: splitting a document to be translated into sentences, extracting the terms in the document and the sentences in which they are located by using a maximum forward matching algorithm in combination with a term library selected in advance by a translator, and obtaining the term definitions corresponding to the terms from the term library; extracting, from a corpus, the corpus entries whose matching degree with the sentence in which a term is located is greater than a preset threshold, sorting the entries by matching degree from high to low, and filtering out the entries that do not contain the term; acquiring the corpus paraphrasing corresponding to the term in the corpus entries by a word alignment method; and screening the term definitions by using the corpus paraphrasing and a string edit distance algorithm to obtain the final term definitions. By extracting the paraphrasing of terms from the corpus through word alignment, the embodiment improves the term prompt function of computer-aided translation and can effectively improve the translation efficiency of a translator.

Description

Context-based term translation method and device
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a method and a device for translating terms based on a context.
Background
Computer-aided translation (CAT) means that, while a translator works, the system continuously and automatically stores the translations entered by the translator in the background so as to build a database. When the same or similar phrases appear again later in the translation process, the system automatically searches the database for the same or similar content and provides reference translations, so that the translator avoids repeated work and can concentrate on translating new content, which effectively improves translation efficiency.
In computer-aided translation, the term prompt is an important function. A translator typically connects several term libraries during translation, and a term typically corresponds to several definitions. The existing term prompt function generally shows the translator all definitions of a term, and the translator must choose among them according to the context; the translator therefore cannot quickly pick the correct definition, which is inefficient. Accordingly, there is a need for a method that improves the term prompt function and provides accurate term definitions to the translator.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the invention provides a method and a device for translating terms based on a context.
According to a first aspect of an embodiment of the present invention, there is provided a context-based term translation method, comprising:
sentence dividing processing is carried out on a document to be translated, a maximum forward matching algorithm is utilized to combine a term library preselected by a translator to extract terms in the document to be translated and sentences in which the terms are located, and one or more term definitions corresponding to the terms are obtained from the term library;
extracting, from a corpus, the corpus entries whose matching degree with the sentence in which the term is located is greater than a preset threshold, sorting the corpus entries by matching degree from high to low, and filtering out the corpus entries that do not contain the term;
acquiring corpus paraphrasing corresponding to the terms in the corpus by adopting a word alignment method;
and screening the term definitions by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions.
According to a second aspect of embodiments of the present invention, there is provided a context-based term translation apparatus comprising:
the term definition acquisition module is used for carrying out sentence processing on a document to be translated, extracting terms in the document to be translated and sentences in which the terms are located by combining a term library preselected by a translator through a maximum forward matching algorithm, and acquiring one or more term definitions corresponding to the terms from the term library;
the corpus extraction module is used for extracting, from a corpus, the corpus entries whose matching degree with the sentence in which the term is located is greater than a preset threshold, sorting the corpus entries by matching degree from high to low, and filtering out the corpus entries that do not contain the term;
the word alignment module is used for acquiring corpus paraphrasing corresponding to the term in the corpus by adopting a word alignment method;
and the paraphrasing screening module is used for screening the term paraphrasing by utilizing the corpus paraphrasing and combining a character string editing distance algorithm to obtain a final term paraphrasing.
According to a third aspect of an embodiment of the present invention, there is provided an electronic apparatus including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor to invoke the method of context-based term translation provided by any of the various possible implementations of the first aspect described above.
According to a fourth aspect of embodiments of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions that enable the computer to perform the context-based term translation method provided by any one of the various possible implementations of the first aspect described above.
According to the method and the device for translating the terms based on the context, which are provided by the embodiment of the invention, the definitions of the terms in the corpus are extracted through the word alignment method, so that the optimal definitions are screened out, the term prompt function in computer-aided translation can be improved, the optimal definitions conforming to the context are provided for the translator, and the translation efficiency of the translator can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a context-based term translation method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a context-based term translation device according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The term prompt in existing computer-aided translation is an important function, but it typically shows the translator all definitions of a term, and the translator must select a definition based on the context, so the correct definition cannot be selected in a short time. To overcome this problem in the prior art, the present invention provides a context-based term translation method that uses the term library and the corpus of a computer-aided translation system to obtain the best definition for the context of a term through term definition extraction, corpus matching, word alignment and definition screening. The method is described below with reference to various embodiments.
As shown in fig. 1, a flow chart of a context-based term translation method according to an embodiment of the present invention is shown, where an execution subject of the method is a server, and the method includes:
Step 10: sentence segmentation is performed on the document to be translated, the terms in the document and the sentences in which they are located are extracted by using a maximum forward matching algorithm in combination with a term library selected in advance by the translator, and one or more term definitions corresponding to each term are obtained from the term library.
Specifically, an existing general-purpose sentence segmentation algorithm can be used to split the document to be translated into a plurality of sentences, which facilitates the subsequent extraction of the terms and the sentences in which they are located.
In order to determine the best definition of a term from its context, the embodiment of the invention first extracts the terms in the document to be translated and the sentences in which they are located. The terms in each sentence can be extracted with a maximum forward matching algorithm (for convenience of description, the extracted terms are referred to as target terms), and the sentence in which each term is located is determined accordingly. The maximum forward matching algorithm is a basic, mature word segmentation algorithm and is not described further here.
The term definitions corresponding to the target term are then obtained from the term library. A term definition is the definition of the term in the term library, and a term may have one or more definitions in the term library; for example, the term definitions of "Super Computer" include "supercomputer", "high-performance computer", etc. The term definitions are the basis for the best definition provided to the translator, and all subsequent steps operate on them.
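Step 10 can be illustrated with a minimal sketch. It assumes a toy term library keyed by lower-cased terms, a naive punctuation-based sentence splitter, and matching over word tokens rather than characters; the names `TERM_LIBRARY`, `split_sentences`, `forward_max_match` and `extract_terms` are hypothetical and do not come from the embodiment.

```python
import re

# Hypothetical term library: term -> list of term definitions.
TERM_LIBRARY = {
    "super computer": ["supercomputer", "high-performance computer"],
    "computer": ["computer"],
}

def split_sentences(text):
    """Naive splitter: break after ., !, ? and their CJK counterparts."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]

def forward_max_match(tokens, term_library, max_len=4):
    """Scan left to right, taking the longest token span found in the library."""
    found = []
    i = 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size]).lower()
            if candidate in term_library:
                found.append(candidate)
                i += size  # skip past the matched term
                break
        else:
            i += 1  # no term starts at this token
    return found

def extract_terms(document, term_library):
    """Return (term, sentence, term definitions) for every term occurrence."""
    results = []
    for sentence in split_sentences(document):
        tokens = re.findall(r"\w+", sentence)
        for term in forward_max_match(tokens, term_library):
            results.append((term, sentence, term_library[term]))
    return results
```

Because the longest span is tried first, "Super Computer" is extracted as one term rather than as the shorter term "computer".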
Step 11: corpus entries whose matching degree with the sentence in which the term is located is greater than a preset threshold are extracted from the corpus and sorted by matching degree from high to low, and the entries that do not contain the term are filtered out.
In order to provide accurate term definitions to the translator, the embodiment of the invention needs to use the context of the term, which can be done through corpus matching. Specifically, corpus entries whose matching degree with the sentence containing the target term is greater than a preset threshold are extracted. The corpus stores a large number of sentences and their corresponding translations, and entries similar to the sentence containing the target term can be matched against the preset threshold through a search system such as Elasticsearch. The threshold can be set as needed, for example 50%.
Several corpus entries may satisfy the extraction condition, and the extracted entries are sorted by matching degree. Some extracted entries may not contain the term from the document to be translated, so those entries also need to be filtered out.
It should be noted that, the embodiment of the invention does not limit the method for calculating the matching degree between sentences.
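Since the matching-degree measure is left open, the following sketch simply substitutes the character-level ratio from Python's `difflib.SequenceMatcher` as a stand-in; an actual deployment would more likely query a search system. The function name `match_corpus` and the representation of the corpus as `(source, translation)` pairs are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def match_corpus(sentence, term, corpus, threshold=0.5):
    """Keep entries above the threshold that contain the term, best match first.

    corpus is a list of (source sentence, translation) pairs.
    Returns a list of (matching degree, source sentence, translation).
    """
    kept = []
    for source, translation in corpus:
        # Stand-in matching degree in [0, 1]; not the measure used by any real CAT system.
        degree = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if degree > threshold and term.lower() in source.lower():
            kept.append((degree, source, translation))
    kept.sort(key=lambda entry: entry[0], reverse=True)
    return kept
```

Filtering on the term happens together with the threshold test here; doing it as a separate pass after sorting, as the step describes, yields the same result.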
Step 12: the corpus paraphrasing corresponding to the term in the corpus is acquired by a word alignment method.
Specifically, the translation of the target term in an extracted corpus entry is called the corpus paraphrasing. Each extracted entry consists of a sentence whose matching degree with the sentence containing the target term is greater than the preset threshold, together with its translation; it can therefore be regarded as a sentence highly similar to the sentence containing the target term. The translation of the target term in such an entry already exists as a translation resource and can be regarded as a relatively accurate definition. In order to provide accurate term definitions to the translator, the translations corresponding to the target term in all extracted entries need to be found and screened. They can be obtained by a word alignment method: once the context of the target term in a corpus entry has been aligned, the translation of the target term can be obtained directly.
Step 13: the term definitions are screened by using the corpus paraphrasing and a string edit distance algorithm to obtain the final term definitions.
Specifically, after the corpus paraphrasing is obtained, it is used to screen the term definitions described above. The string edit distance is the minimum number of edits required to convert one string into another, and can be used to compare the similarity between a corpus paraphrasing and a term definition. For example, two renderings of "supercomputer" that differ by a single character have an edit distance of 1; the smaller the edit distance, the higher the similarity. If the similarity is zero, the edit distance is large.
Screening the term definitions with the corpus paraphrasing and the string edit distance algorithm means comparing the similarity of each term definition with the corpus paraphrasing through the edit distance, deleting the term definitions whose similarity is zero, and sorting the remaining term definitions by edit distance from small to large to obtain the final term definitions, which are the term definitions most similar to the corpus paraphrasing.
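The screening in step 13 can be sketched as follows. The text does not say how similarity is derived from the edit distance, so this sketch assumes similarity = 1 - distance / (length of the longer string), which is zero exactly when the two strings share nothing usable; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed normalisation: 1 for identical strings, 0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def screen_definitions(term_definitions, corpus_paraphrases):
    """Drop definitions with zero similarity, then sort by smallest edit distance."""
    if not corpus_paraphrases:
        return list(term_definitions)
    ranked = []
    for definition in term_definitions:
        best_sim = max(similarity(definition, p) for p in corpus_paraphrases)
        best_dist = min(edit_distance(definition, p) for p in corpus_paraphrases)
        if best_sim > 0:
            ranked.append((best_dist, definition))
    ranked.sort(key=lambda item: item[0])
    return [definition for _, definition in ranked]
```

With term definitions `["high-performance computer", "supercomputer"]` and the corpus paraphrasing `"supercomputers"`, the sketch ranks "supercomputer" first because its edit distance is 1.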
According to the term translation method based on the context, the term definitions in the corpus are extracted through the word alignment method, so that the optimal definition is screened based on the context, a term prompt function in computer-aided translation can be improved, the optimal definition conforming to the context is provided for a translator, the time for selecting the term translation by the translator is saved, the repeated work phenomenon in the translation process is avoided, and the translation efficiency of the translator can be effectively improved.
On the basis of the foregoing embodiment, before the step of extracting the corpus, in which the sentence similarity with the term is greater than a preset threshold, the method further includes:
classifying the files to be translated by using a pre-established classifier, and classifying the files to be translated into one or more industry categories according to the probability;
querying dictionary definitions corresponding to the terms in the industry category dictionary, and pre-screening the term definitions by utilizing the dictionary definitions;
correspondingly, the step of screening the term definitions by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions comprises the following specific steps:
and screening the term definitions subjected to pre-screening by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions.
Specifically, the embodiment of the invention also provides a term translation method based on the context, which is characterized in that before the term paraphrasing is screened by utilizing the corpus paraphrasing, the term paraphrasing is pre-screened by utilizing the dictionary paraphrasing. The detailed flow of the term translation method based on the context provided by the embodiment of the invention can be as follows:
First, the document to be translated is classified by industry: a naive Bayes text classifier is built for the configured industry categories from a large number of translation manuscripts with industry labels. The document to be translated is classified with this classifier and assigned to one or more categories according to probability. In particular, a document is considered to belong to the general category when its probability of belonging to every category is low.
Acquiring term paraphrasing: and carrying out sentence segmentation on the document to be translated, extracting the terms in the document to be translated and sentences in which the terms are located by combining a term library preselected by a translator by using a maximum forward matching method, and acquiring one or more term definitions corresponding to the terms from the term library.
Pre-screening with dictionary definitions: inquiring the corresponding dictionary definitions of the terms in the industry category dictionary, comparing the similarity of the term definitions and the dictionary definitions through a character string editing distance algorithm, sorting the term definitions according to the editing distance from small to large, and filtering the term definitions with the similarity of zero to obtain the pre-screened term definitions.
Matching corpus: extracting from the corpus the entries whose similarity with the sentence in which the term is located is greater than a preset threshold, sorting them by similarity from high to low, and filtering out the entries that do not contain the term.
Extracting term paraphrasing from the corpus by word alignment: acquiring the corpus paraphrasing of the term in the corpus by a word alignment method.
Term paraphrasing screening: and comparing the similarity of the pre-screened term definitions with the corpus definitions through a character string editing distance algorithm, sorting the pre-screened term definitions according to the editing distance from small to large, and deleting the term definitions with the similarity of zero to obtain the final term definitions.
According to the method for translating the terms based on the context, provided by the embodiment of the invention, the dictionary definitions are utilized to conduct pre-screening of the term definitions, and then the word alignment method is utilized to extract the term definitions in the corpus, so that the translation accuracy of the term definitions can be improved, the optimal definition can be screened based on the context, the term prompt function in computer-aided translation can be improved, the optimal definition conforming to the context is provided for the translator, the time for selecting the term translations by the translator is saved, the repeated work phenomenon in the translation process is avoided, and the translation efficiency of the translator can be effectively improved.
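The industry classification used for pre-screening can be sketched with a small multinomial naive Bayes classifier. This is only an illustration: the whitespace tokenisation, the Laplace smoothing, the `min_prob` cut-off that decides when a document falls back to the general category, and all names below are assumptions, not details given in the embodiment.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesIndustryClassifier:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.vocab = set()

    def fit(self, labelled_docs):
        for text, label in labelled_docs:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def posteriors(self, text):
        """Normalised posterior probability of each industry label."""
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[label] = score
        peak = max(log_scores.values())  # subtract the peak for numerical stability
        norm = sum(math.exp(s - peak) for s in log_scores.values())
        return {label: math.exp(s - peak) / norm for label, s in log_scores.items()}

    def classify(self, text, min_prob=0.4):
        """Return every label whose posterior reaches min_prob, else ['general']."""
        probs = self.posteriors(text)
        chosen = [label for label, p in probs.items() if p >= min_prob]
        return chosen or ["general"]

# Hypothetical labelled manuscripts, one per industry.
TRAINING = [
    ("patient dose clinical trial drug", "medical"),
    ("bank loan interest market stock", "finance"),
    ("court contract clause liability law", "legal"),
]
```

A document with no known vocabulary receives equal posteriors for all three industries, none of which reaches the cut-off, so it is assigned to the general category.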
Based on the content of each embodiment, the step of obtaining the corpus paraphrasing corresponding to the term in the corpus by using a word alignment method specifically includes:
performing word alignment scoring on the terms in the corpus by using a preset scoring model, and taking the translated word with the highest word alignment scoring as the corpus paraphrasing of the terms;
wherein, the preset scoring model is:
In the above formula, src represents the source-text word and dst represents the translated word; similarity represents the definition similarity between the source word src and the translated word dst; w_i represents the weight of the i-th factor and score_i represents the score of the i-th factor; q_j represents the weight of the j-th word among the four context words of the source word src; distance_j represents, if the j-th word is aligned, the distance between its aligned word and the translated word dst; and len represents the number of verbs and nouns contained in the corpus.
Specifically, word alignment scoring is performed on terms in all extracted corpus by using a preset scoring model, wherein the scoring model comprises three aspects of contents:
The first aspect of the scoring model is similarity, which represents the definition similarity between the source word src and the translated word dst: it is 1 if the definitions are identical, 0.8 if they are 80% similar, 0.5 if they are half similar, and 0 if they are completely different.
The second aspect of the scoring model is the scoring factors: w_i represents the weight of the i-th factor and score_i represents the score of the i-th factor (1 means fully satisfied, 0.5 means half satisfied). Here w_i is obtained by training on a number of bilingual sentences with word alignments, and score_i covers the following types:
whether the parts of speech of src and dst are the same: if they are the same, score_i is 1; if not, score_i is 0;
whether the two context words before and after src and the two context words before and after dst are aligned with each other: if the context words of src are all aligned with the context words of dst, score_i is 1; if half are aligned, score_i is 0.5; if none are aligned, score_i is 0. For example, if the context of src is ABsrcCD and the context of dst is EFdstGH, with A aligned to E, B aligned to F, C aligned to G and D aligned to H, then score_i is 1. If the context of src is ABsrcCD and the context of dst is EFdstGH, with A aligned to G and B aligned to H, then score_i is 0.5.
The third aspect of the scoring model is the penalty value, where q_j represents the weight of the j-th word among the four context words of src; for example, if the context of src is ABsrcCD, the weights of B and C are 1 and the weights of A and D are 0.5. distance_j represents, if the j-th word is aligned, the distance between its aligned word and dst, and len represents the number of nouns and verbs contained in the corpus. For example, if the corpus contains 10 nouns and verbs in total, the context of the term src is ABsrcCD, the context of the word dst in the translation is EFdstGH, A, B and C are not aligned and D is aligned to the 5th word after H, then distance_j = 5 and the third part has a value of 0.25.
When the three-part total score of the word alignment scoring model is greater than 1, the total score is set to 1. And taking the translated word with the highest word alignment score as the corpus paraphrasing of the term.
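The three parts can be combined in a short sketch. The formula itself is not reproduced in this text, so the sketch rests on two labelled assumptions: the third part is computed as the sum of q_j × distance_j / len over the aligned context words (which reproduces the value 0.25 in the example above), and, because it is called a penalty, it is subtracted from the other two parts. The factor weights used in the usage note are invented for illustration; in the embodiment they are trained.

```python
def factor_part(weights, scores):
    """Second part: weighted sum of the factor scores, sum_i w_i * score_i."""
    return sum(w * s for w, s in zip(weights, scores))

def penalty_part(context_weights, distances, corpus_len):
    """Third part (assumed form): sum_j q_j * distance_j / len.

    distances[j] is None when the j-th context word is not aligned,
    in which case that word contributes nothing.
    """
    return sum(q * d / corpus_len
               for q, d in zip(context_weights, distances)
               if d is not None)

def alignment_score(similarity, weights, scores,
                    context_weights, distances, corpus_len):
    """Total word alignment score, capped at 1.

    ASSUMPTION: the penalty is subtracted; the source text does not show the sign.
    """
    total = (similarity
             + factor_part(weights, scores)
             - penalty_part(context_weights, distances, corpus_len))
    return min(total, 1.0)

def pick_corpus_paraphrase(scored_candidates):
    """Return the translated word with the highest word alignment score."""
    return max(scored_candidates, key=lambda item: item[1])[0]
```

For the example above, `penalty_part([0.5, 1, 1, 0.5], [None, None, None, 5], 10)` gives 0.25. With a hypothetical similarity of 0.8 and factor weights 0.3 and 0.2 scoring 1 and 0.5, the total under these assumptions is 0.8 + 0.4 - 0.25 = 0.95.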
As shown in fig. 2, a schematic structural diagram of a context-based term translation device according to an embodiment of the present invention includes: a term paraphrasing acquisition module 20, a corpus extraction module 21, a word alignment module 22 and a paraphrasing screening module 23, wherein,
the term definition obtaining module 20 is configured to perform sentence processing on a document to be translated, extract terms in the document to be translated and sentences in which the terms are located by using a maximum forward matching algorithm in combination with a term library pre-selected by a translator, and obtain one or more term definitions corresponding to the terms from the term library.
Specifically, the term definition obtaining module 20 may perform sentence segmentation on the document to be translated by using an existing general sentence segmentation algorithm, so as to segment the document to be translated into a plurality of sentences, thereby facilitating subsequent extraction of terms and sentences in which the terms are located.
In order to determine the best interpretation corresponding to a term by using the context of the term, the term interpretation obtaining module 20 needs to extract the term and the sentence in which the term is located in the document to be translated. The terms in each sentence of the document to be translated can be extracted by using a maximum forward matching algorithm (for convenience of description, the extracted terms are referred to as target terms). Correspondingly, the sentence in which the term is located can also be determined. The maximum forward matching algorithm is a basic word segmentation algorithm, and is mature in application and not described herein.
The term definition acquisition module 20 then acquires the term definition corresponding to the target term from the term library, the term definition referring to the definition of the term in the term library, and the definition of a term in the term library may be one or more.
The corpus extraction module 21 is configured to extract a corpus in the corpus, where the degree of matching with the sentence where the term is located is greater than a preset threshold, rank the corpus according to the degree of matching from high to low, and filter the corpus that does not include the term.
In order to provide accurate term definitions to the translator, the embodiment of the invention needs to use the context of the term, which can be done through corpus matching. Specifically, the corpus extraction module 21 extracts the corpus entries whose matching degree with the sentence containing the target term is greater than a preset threshold. The corpus stores a large number of sentences and their corresponding translations, and entries similar to the sentence containing the target term can be matched against the preset threshold through a search system such as Elasticsearch. The threshold can be set as needed, for example 50%.
The corpus satisfying the extraction conditions may be a plurality of sentences, and the corpus extraction module 21 sorts the extracted corpus according to the matching degree. The corpus extracted according to the above method may have a case of not including terms in the document to be translated, and therefore, the corpus extraction module 21 also needs to filter the corpus not including terms in the document to be translated.
The word alignment module 22 is configured to obtain a corpus paraphrasing corresponding to the term in the corpus by using a word alignment method.
Each extracted corpus entry consists of a sentence whose matching degree with the sentence containing the target term is greater than the preset threshold, together with its translation; it can therefore be regarded as a sentence highly similar to the sentence containing the target term. The translation of the target term in such an entry already exists as a translation resource and can be regarded as a relatively accurate definition. In order to provide accurate term definitions to the translator, the word alignment module 22 needs to find and screen the translations corresponding to the target term in all extracted entries. They can be obtained by a word alignment method: once the context of the target term in a corpus entry has been aligned, the translation of the target term can be obtained directly.
And the paraphrasing screening module 23 is used for screening the term paraphrasing by utilizing the corpus paraphrasing and the character string editing distance algorithm to obtain a final term paraphrasing.
Specifically, after the corpus paraphrasing is obtained, the paraphrasing screening module 23 uses it to screen the term definitions. The string edit distance is the minimum number of edits required to convert one string into another, and can be used to compare the similarity between a corpus paraphrasing and a term definition: the smaller the edit distance, the higher the similarity. If the similarity is zero, the edit distance is large.
The step of screening the term definitions by utilizing the corpus definitions and the character string editing distance algorithm is to compare the similarity of the term definitions and the corpus definitions through the character string editing distance algorithm, delete the term definitions with zero similarity, and sort the term definitions according to the editing distance from small to large to obtain final term definitions, wherein the final term definitions are the term definitions most similar to the corpus definitions.
According to the term translation device based on the context, provided by the embodiment of the invention, the term definitions in the corpus are extracted through the word alignment method, so that the optimal definition is screened based on the context, the term prompt function in computer-aided translation can be improved, the optimal definition conforming to the context is provided for the translator, the time for selecting the term translation by the translator is saved, the repeated work phenomenon in the translation process is avoided, and the translation efficiency of the translator can be effectively improved.
Based on the content of the above embodiment, the apparatus further includes:
the classification module is used for classifying the files to be translated by using a pre-established classifier, and classifying the files to be translated into one or more industry categories according to the probability;
a pre-screening module, configured to query a dictionary definition corresponding to the term in the industry category dictionary, and pre-screen the term definition using the dictionary definition;
correspondingly, the paraphrasing screening module is specifically used for:
and screening the term definitions subjected to pre-screening by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions.
Specifically, the classification module builds a naive Bayes text classifier over the industry categories from a large number of translated manuscripts carrying industry labels. The file to be translated is classified with this classifier and assigned to one or more categories according to the probabilities. In particular, when the probability that the file belongs to each category is low, the file is considered to belong to a general category.
The pre-screening module queries the dictionary definitions corresponding to the term in the industry category dictionary, compares the similarity of the term definitions and the dictionary definitions through the string edit distance algorithm, sorts the term definitions by edit distance from small to large, and filters out the term definitions whose similarity is zero to obtain the pre-screened term definitions.
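The classification step can be sketched as a small multinomial naive Bayes classifier in Python. This is an illustration under assumptions rather than the classifier of the embodiment: Laplace smoothing, pre-tokenized input, the 0.4 probability threshold, and the fallback label "general" are all choices made for this example, and the names `NaiveBayesClassifier` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes over tokenized, industry-labelled documents."""

    def __init__(self):
        self.doc_counts = Counter()               # documents per category
        self.word_counts = defaultdict(Counter)   # token counts per category
        self.vocab = set()

    def train(self, labelled_docs):
        for tokens, label in labelled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def probabilities(self, tokens):
        """Return the posterior probability of each category."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            log_p = math.log(n_docs / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]
                # Laplace (add-one) smoothing, an assumption of this sketch
                log_p += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[label] = log_p
        peak = max(log_scores.values())
        unnorm = {l: math.exp(s - peak) for l, s in log_scores.items()}
        norm = sum(unnorm.values())
        return {l: v / norm for l, v in unnorm.items()}


def classify(clf, tokens, threshold=0.4):
    """Return every category whose probability reaches the threshold,
    or the general category when none does."""
    probs = clf.probabilities(tokens)
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    chosen = [label for label, p in ranked if p >= threshold]
    return chosen or ["general"]
```

A document whose tokens give no category a clear advantage falls below the threshold everywhere and is assigned to the general category, mirroring the special case described above.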
According to the context-based term translation device provided by the embodiment of the invention, the dictionary definitions are utilized to pre-screen the term definitions, and the word alignment method is utilized to extract the term definitions in the corpus, so that the translation accuracy of the term definitions can be improved.
Based on the content of the above embodiment, the word alignment module 22 is specifically configured to:
performing word alignment scoring on the terms in the corpus by using a preset scoring model, and taking the translated word with the highest word alignment scoring as the corpus paraphrasing of the terms;
wherein, the preset scoring model is:
In the above, src represents the original word, dst represents the translated word, similarity represents the similarity of the definitions of the original word src and the translated word dst, w_i represents the weight of the i-th factor, score_i represents the score of the i-th factor, q_j represents the weight of the j-th of the four context words of the original word src, distance_j represents the distance between the aligned word and the translated word dst if the j-th word is aligned, and len represents the number of verbs and nouns contained in the corpus.
Specifically, the word alignment module 22 performs word alignment scoring on the terms in all extracted corpus entries by using the preset scoring model, which has three aspects:
The first aspect is similarity, which represents the definition similarity of the original word src and the translated word dst: it is 1 if the definitions are identical, 0.8 if they are 80% similar, 0.5 if they are half similar, and 0 if they are completely different.
The second aspect of the scoring model is the scoring factor measure, where w_i represents the weight of the i-th factor and score_i represents the score of the i-th factor (1 represents full satisfaction, 0.5 represents half satisfaction). w_i is obtained by training on a number of word-aligned bilingual sentences, and score_i covers the following types:
whether the parts of speech of src and dst are the same: if they are the same, score_i is 1; if they are not the same, score_i is 0;
whether the two context words before and after src are aligned with those of dst: if the context words of src are already aligned with the context words of dst, score_i is 1; if half are aligned, score_i is 0.5; if none are aligned, score_i is 0.
The third aspect of the scoring model is the penalty, where q_j represents the weight of the j-th of the four context words of src, distance_j represents the distance between the aligned word and dst if the j-th word is aligned, and len represents the number of nouns and verbs contained in the corpus.
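The three aspects can be combined as in the following Python sketch. The formula of the scoring model is not reproduced in this text, so the additive form used here (similarity, plus the weighted factor scores, minus the weighted context distances divided by len) is an assumption and not necessarily the formula of the embodiment. The names `Candidate`, `alignment_score`, and `best_translation` are hypothetical, and the weights in any call are illustrative rather than trained values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    dst: str                                    # candidate translated word
    similarity: float                           # definition similarity in [0, 1]
    factor_scores: Sequence[float]              # score_i for each factor
    context_distances: Sequence[Optional[int]]  # distance_j, None if word j is unaligned


def alignment_score(cand, factor_weights, context_weights, length):
    """Assumed combination: similarity + sum(w_i * score_i)
    - sum(q_j * distance_j) / len."""
    reward = sum(w * s for w, s in zip(factor_weights, cand.factor_scores))
    penalty = sum(q * d
                  for q, d in zip(context_weights, cand.context_distances)
                  if d is not None)  # only aligned context words contribute
    return cand.similarity + reward - penalty / length


def best_translation(candidates, factor_weights, context_weights, length):
    """Pick the translated word with the highest word alignment score."""
    best = max(candidates,
               key=lambda c: alignment_score(c, factor_weights,
                                             context_weights, length))
    return best.dst
```

Under this assumed form, a candidate scores high when its definition is similar, its part of speech and context alignment agree, and its aligned context words lie close to it.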
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 3, the electronic device includes a processor 301, a memory 302, and a bus 303;
wherein the processor 301 and the memory 302 communicate with each other through the bus 303; the processor 301 is configured to invoke program instructions in the memory 302 to perform the context-based term translation method provided by the above embodiments, including, for example: performing sentence segmentation on a document to be translated, extracting the terms in the document to be translated and the sentences in which the terms are located by using a maximum forward matching algorithm together with a term library preselected by the translator, and obtaining one or more term definitions corresponding to the terms from the term library; extracting, from a corpus base, the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries by similarity from high to low, and filtering out the corpus entries that do not contain the term; obtaining the corpus definitions corresponding to the terms in the corpus by a word alignment method; and screening the term definitions by using the corpus definitions and a string edit distance algorithm to obtain the final term definitions.
Embodiments of the present invention provide a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the context-based term translation method provided by the above embodiments, including, for example: performing sentence segmentation on a document to be translated, extracting the terms in the document to be translated and the sentences in which the terms are located by using a maximum forward matching algorithm together with a term library preselected by the translator, and obtaining one or more term definitions corresponding to the terms from the term library; extracting, from a corpus base, the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries by similarity from high to low, and filtering out the corpus entries that do not contain the term; obtaining the corpus definitions corresponding to the terms in the corpus by a word alignment method; and screening the term definitions by using the corpus definitions and a string edit distance algorithm to obtain the final term definitions.
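The term extraction step named in the method above relies on maximum forward matching. A minimal Python sketch of that technique follows, assuming the sentence has already been tokenized and the term library is a dictionary keyed by token tuples; the function name `forward_maximum_match` and the data layout are hypothetical and chosen only for illustration.

```python
def forward_maximum_match(tokens, term_library):
    """Scan the sentence left to right and, at each position, take the
    longest token span that is an entry of the term library.

    tokens: list of tokens of one sentence.
    term_library: dict mapping a tuple of tokens to its list of definitions.
    Returns a list of (start_index, term_tuple, definitions).
    """
    max_len = max((len(term) for term in term_library), default=0)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink it.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + size])
            if span in term_library:
                found.append((i, span, term_library[span]))
                i += size  # skip past the matched term
                break
        else:
            i += 1  # no term starts at this position
    return found
```

Because the longest span wins, a multi-word term is extracted as a whole rather than as its shorter sub-terms, and the matched tokens are not reused for later matches.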
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods of the embodiments or of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method of context-based term translation, comprising:
sentence dividing processing is carried out on a document to be translated, a maximum forward matching algorithm is utilized to combine a term library preselected by a translator to extract terms in the document to be translated and sentences in which the terms are located, and one or more term definitions corresponding to the terms are obtained from the term library;
extracting, from a corpus base, corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, wherein the corpus base comprises a plurality of sentences, sorting the corpus entries by matching degree from high to low, and filtering out the corpus entries that do not contain the term;
acquiring corpus paraphrasing corresponding to the terms in the corpus by adopting a word alignment method;
screening the term definitions by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions;
before the step of extracting the corpus whose matching degree with the sentence containing the term is greater than a preset threshold, the method further comprises the following steps:
classifying the files to be translated by using a pre-established classifier, and classifying the files to be translated into one or more industry categories according to the probability;
querying dictionary definitions corresponding to the terms in the industry category dictionary, and pre-screening the term definitions by utilizing the dictionary definitions;
correspondingly, the step of screening the term definitions by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions comprises the following specific steps:
screening the term definitions subjected to pre-screening by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions;
the step of acquiring the corpus paraphrasing corresponding to the term in the corpus by adopting a word alignment method comprises the following specific steps:
performing word alignment scoring on the terms in the corpus by using a preset scoring model, and taking the translated word with the highest word alignment scoring as the corpus paraphrasing of the terms;
wherein, the preset scoring model is:
in the above, src represents the original word, dst represents the translated word, similarity represents the similarity of the definitions of the original word src and the translated word dst, w_i represents the weight of the i-th factor, score_i represents the score of the i-th factor, q_j represents the weight of the j-th of the four context words of the original word src, distance_j represents the distance between the aligned word and the translated word dst if the j-th word is aligned, and len represents the number of verbs and nouns contained in the corpus.
2. The method according to claim 1, wherein the step of screening the term definitions by using the corpus definitions in combination with a string edit distance algorithm to obtain final term definitions is specifically:
comparing the similarity of the term paraphrasing and the corpus paraphrasing through a character string editing distance algorithm;
deleting the term definitions whose similarity is zero, and sorting the term definitions whose similarity is not zero by edit distance from small to large to obtain the final term definitions.
3. The method according to claim 1, wherein the step of pre-screening the term definitions using the dictionary definitions is in particular:
comparing the similarity of the term paraphrasing and the dictionary paraphrasing through a character string editing distance algorithm;
and sorting the term definitions according to the editing distance from small to large, and deleting the term definitions with zero similarity to obtain the term definitions subjected to pre-screening.
4. A context-based term translation apparatus comprising:
the term definition acquisition module is used for carrying out sentence processing on a document to be translated, extracting terms in the document to be translated and sentences in which the terms are located by combining a term library preselected by a translator through a maximum forward matching algorithm, and acquiring one or more term definitions corresponding to the terms from the term library;
the corpus extraction module is used for extracting, from a corpus base, corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries by matching degree from high to low, and filtering out the corpus entries that do not contain the term;
the word alignment module is used for acquiring corpus paraphrasing corresponding to the term in the corpus by adopting a word alignment method;
the paraphrasing screening module is used for screening the term paraphrasing by utilizing the corpus paraphrasing and combining a character string editing distance algorithm to obtain a final term paraphrasing;
further comprises:
the classification module is used for classifying the files to be translated by using a pre-established classifier, and classifying the files to be translated into one or more industry categories according to the probability;
a pre-screening module, configured to query a dictionary definition corresponding to the term in the industry category dictionary, and pre-screen the term definition using the dictionary definition;
correspondingly, the paraphrasing screening module is specifically used for:
screening the term definitions subjected to pre-screening by utilizing the corpus definitions and a character string editing distance algorithm to obtain final term definitions;
the word alignment module is specifically configured to:
performing word alignment scoring on the terms in the corpus by using a preset scoring model, and taking the translated word with the highest word alignment scoring as the corpus paraphrasing of the terms;
wherein, the preset scoring model is:
in the above, src represents the original word, dst represents the translated word, similarity represents the similarity of the definitions of the original word src and the translated word dst, w_i represents the weight of the i-th factor, score_i represents the score of the i-th factor, q_j represents the weight of the j-th of the four context words of the original word src, distance_j represents the distance between the aligned word and the translated word dst if the j-th word is aligned, and len represents the number of verbs and nouns contained in the corpus.
5. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-3.
6. A non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the method of any one of claims 1 to 3.
CN201811025328.3A 2018-09-04 2018-09-04 Context-based term translation method and device Active CN109299480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811025328.3A CN109299480B (en) 2018-09-04 2018-09-04 Context-based term translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811025328.3A CN109299480B (en) 2018-09-04 2018-09-04 Context-based term translation method and device

Publications (2)

Publication Number Publication Date
CN109299480A CN109299480A (en) 2019-02-01
CN109299480B true CN109299480B (en) 2023-11-07

Family

ID=65166187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811025328.3A Active CN109299480B (en) 2018-09-04 2018-09-04 Context-based term translation method and device

Country Status (1)

Country Link
CN (1) CN109299480B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413757B (en) * 2019-07-30 2022-02-25 中国工商银行股份有限公司 Word paraphrase determining method, device and system
CN110543644B (en) * 2019-09-04 2023-08-29 语联网(武汉)信息技术有限公司 Machine translation method and device containing term translation and electronic equipment
CN110717340B (en) * 2019-09-29 2023-11-21 百度在线网络技术(北京)有限公司 Recommendation method, recommendation device, electronic equipment and storage medium
CN112836523B (en) * 2019-11-22 2022-12-30 上海流利说信息技术有限公司 Word translation method, device and equipment and readable storage medium
CN111191469B (en) * 2019-12-17 2023-09-19 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111222346A (en) * 2019-12-20 2020-06-02 北京海兰信数据科技股份有限公司 Corpus file processing method and apparatus
CN111597826B (en) * 2020-05-15 2021-10-01 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN111797621A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Method and system for replacing terms
CN111652006B (en) * 2020-06-09 2021-02-09 北京中科凡语科技有限公司 Computer-aided translation method and device
CN111738022B (en) * 2020-06-23 2023-04-18 中国船舶工业综合技术经济研究院 Machine translation optimization method and system in national defense and military industry field
CN112052334B (en) * 2020-09-02 2024-04-05 广州极天信息技术股份有限公司 Text interpretation method, device and storage medium
CN112364669B (en) * 2020-10-14 2021-09-03 北京中科凡语科技有限公司 Method, device, equipment and storage medium for translating translated terms by machine translation
CN113627200B (en) * 2021-06-15 2023-12-08 天津师范大学 International organization science and technology term topic sentence extraction method driven by multi-machine translation engine
CN114091483B (en) * 2021-10-27 2023-02-28 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
CN114238619B (en) * 2022-02-23 2022-04-29 成都数联云算科技有限公司 Method, system, device and medium for screening Chinese nouns based on edit distance
CN114781409B (en) * 2022-05-12 2023-12-01 北京百度网讯科技有限公司 Text translation method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5850561A (en) * 1994-09-23 1998-12-15 Lucent Technologies Inc. Glossary construction tool
CA2793268A1 (en) * 2011-10-21 2013-04-21 National Research Council Of Canada Method and apparatus for paraphrase acquisition
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN107908712A (en) * 2017-11-10 2018-04-13 哈尔滨工程大学 Cross-language information matching process based on term extraction

Also Published As

Publication number Publication date
CN109299480A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299480B (en) Context-based term translation method and device
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN105095204B (en) The acquisition methods and device of synonym
CN107562717B (en) Text keyword extraction method based on combination of Word2Vec and Word co-occurrence
CN107577671B (en) Subject term extraction method based on multi-feature fusion
CN106598959B (en) Method and system for determining mutual translation relationship of bilingual sentence pairs
JP2014041481A (en) Document classification device, and document classification processing program
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN110895559A (en) Model training method, text processing method, device and equipment
CN111159389A (en) Keyword extraction method based on patent elements, terminal and readable storage medium
CN110889292B (en) Text data viewpoint abstract generating method and system based on sentence meaning structure model
CN111160014A (en) Intelligent word segmentation method
CN112380848B (en) Text generation method, device, equipment and storage medium
CN110929022A (en) Text abstract generation method and system
US11361565B2 (en) Natural language processing (NLP) pipeline for automated attribute extraction
CN112527977A (en) Concept extraction method and device, electronic equipment and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN108475265B (en) Method and device for acquiring unknown words
CN111460147A (en) Title short text classification method based on semantic enhancement
CN112131341A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN109241521B (en) Scientific literature high-attention sentence extraction method based on citation relation
Camps et al. Collating medieval vernacular texts. aligning witnesses, classifying variants
CN115905510A (en) Text abstract generation method and system
CN112528653B (en) Short text entity recognition method and system
Sidhu et al. Role of machine translation and word sense disambiguation in natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant