CN109299480A - Terminology Translation method and device based on context of co-text - Google Patents

Publication number
CN109299480A
CN109299480A CN201811025328.3A CN201811025328A CN109299480A CN 109299480 A CN109299480 A CN 109299480A CN 201811025328 A CN201811025328 A CN 201811025328A CN 109299480 A CN109299480 A CN 109299480A
Authority
CN
China
Prior art keywords
corpus
term
definitions
paraphrase
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811025328.3A
Other languages
Chinese (zh)
Other versions
CN109299480B (en)
Inventor
宋安琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Language Network (wuhan) Information Technology Co Ltd
Shanghai Vivid Translation Service Co Ltd
Original Assignee
Language Network (wuhan) Information Technology Co Ltd
Shanghai Vivid Translation Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Language Network (wuhan) Information Technology Co Ltd and Shanghai Vivid Translation Service Co Ltd
Priority to CN201811025328.3A
Publication of CN109299480A
Application granted
Publication of CN109299480B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

The embodiments of the present invention provide a context-based term translation method and device. The method includes: performing sentence segmentation on a document to be translated, extracting the terms in the document and the sentences in which they occur by applying the forward maximum matching algorithm with the termbase pre-selected by the translator, and obtaining the term definitions corresponding to each term from the termbase; extracting from a corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the entries from high to low by matching degree, and filtering out the entries that do not contain the term; obtaining the corpus definition corresponding to the term in each entry by a word alignment method; and screening the term definitions using the corpus definition in combination with the string edit distance algorithm to obtain the final term definition. By extracting term definitions from the corpus through word alignment, the embodiments improve the term prompt function of computer-aided translation and can effectively increase the translator's efficiency.

Description

Term translation method and device based on context
Technical field
The embodiments of the present invention relate to the technical field of natural language processing, and more particularly to a context-based term translation method and device.
Background
Computer-aided translation (CAT, Computer-Aided Translation) means that, while a translator is translating, the system continuously and automatically stores the translations the translator enters in the background in order to build up a database. When an identical or similar phrase appears again later in the translation process, the system automatically searches the database for the stored identical or similar content and offers the translator a reference translation. The translator thus avoids repeated translation work and only needs to concentrate on new content, which effectively improves translation efficiency.
In computer-aided translation, term prompting is an important function. A translator usually connects to several termbases while translating, and a single term usually has several definitions. Existing term prompt functions usually present all definitions of a term to the translator, who must then choose one according to the context. As a result, the translator cannot quickly select the correct term definition for the translation, and working efficiency is low. A method that improves the term prompt function and provides translators with accurate term definitions is therefore urgently needed.
Summary of the invention
In view of the problems in the prior art, the embodiments of the present invention provide a context-based term translation method and device.
According to a first aspect of the embodiments of the present invention, a context-based term translation method is provided, including:
performing sentence segmentation on a document to be translated, extracting the terms in the document and the sentences in which the terms occur by applying the forward maximum matching algorithm with the termbase pre-selected by the translator, and obtaining one or more term definitions corresponding to each term from the termbase;
extracting from a corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries from high to low by similarity, and filtering out the corpus entries that do not contain the term;
obtaining the corpus definition corresponding to the term in each corpus entry by a word alignment method;
screening the term definitions using the corpus definition in combination with the string edit distance algorithm to obtain the final term definition.
According to a second aspect of the embodiments of the present invention, a context-based term translation device is provided, including:
a term definition acquisition module, configured to perform sentence segmentation on a document to be translated, extract the terms in the document and the sentences in which the terms occur by applying the forward maximum matching algorithm with the termbase pre-selected by the translator, and obtain one or more term definitions corresponding to each term from the termbase;
a corpus extraction module, configured to extract from a corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sort the corpus entries from high to low by similarity, and filter out the corpus entries that do not contain the term;
a word alignment module, configured to obtain the corpus definition corresponding to the term in each corpus entry by a word alignment method;
a definition screening module, configured to screen the term definitions using the corpus definition in combination with the string edit distance algorithm to obtain the final term definition.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, including:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, is able to perform the context-based term translation method provided by any of the possible implementations of the first aspect.
According to a fourth aspect of the embodiments of the present invention, a non-transitory computer-readable storage medium is provided. The storage medium stores computer instructions that enable a computer to perform the context-based term translation method provided by any of the possible implementations of the first aspect.
The context-based term translation method and device proposed by the embodiments of the present invention extract the definitions of terms in the corpus by a word alignment method in order to screen out the best definition. This improves the term prompt function in computer-aided translation, provides the translator with the best definition that fits the context, and can effectively increase the translator's efficiency.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below obviously show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the context-based term translation method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of the context-based term translation device provided by another embodiment of the present invention;
Fig. 3 is a structural diagram of the electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are obviously some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Term prompting is an important function in existing computer-aided translation. It usually presents all definitions of a term to the translator, who may need to choose among them according to the context and therefore cannot select the correct term definition for the translation in a short time. To overcome this problem of the prior art, the embodiments of the present invention provide a context-based term translation method. The inventive concept is to use the termbase and corpus of computer-aided translation and, through term definition extraction, corpus matching, word alignment and definition screening, to match the best definition that fits the context of the term and offer it to the translator. The advantage is that existing translations are fully used as a reference for screening out the best definition, so that the term prompt function offers the translator only the single definition that fits the context of the term, which effectively increases the translator's efficiency. Several embodiments are described and explained below.
Fig. 1 is a flow diagram of the context-based term translation method provided by an embodiment of the present invention. The method is executed by a server and includes:
Step 10: perform sentence segmentation on the document to be translated, extract the terms in the document and the sentences in which the terms occur by applying the forward maximum matching algorithm with the termbase pre-selected by the translator, and obtain one or more term definitions corresponding to each term from the termbase.
Specifically, an existing general-purpose sentence segmentation algorithm can be used to split the document to be translated into multiple sentences, which makes it easier to extract the terms and the sentences containing them afterwards.
When translating a document, a translator usually pre-selects and connects to several termbases. A termbase, also known as an automated dictionary, can supply several definitions for a term. To determine the best definition of a term from its context, the embodiment first extracts the terms in the document to be translated and the sentences in which they occur. The forward maximum matching algorithm can extract the terms in each sentence of the document (for ease of description, an extracted term is called a target term below). The sentence containing each term is determined at the same time. Forward maximum matching is a basic, mature word segmentation algorithm and is not described in detail here.
The term definitions corresponding to the target term are then obtained from the termbase. A term definition is the definition of the term in the termbase, and a term may have one or more of them. For example, the term definitions of "Super Computer" include "supercomputer", "supercomputer" and "high-performance computer" (the first two being distinct target-language renderings). The term definitions are the basis for offering the translator the best definition of the term; all subsequent steps operate on them.
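Step 10 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names, the punctuation-based sentence splitter, the word-level tokenisation, the `max_len` window and the sample termbase are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter (assumed): a sentence ends at ., !, ? or a CJK equivalent."""
    return [s.strip() for s in re.findall(r"[^.!?。！？]+[.!?。！？]?", text) if s.strip()]

def forward_max_match(tokens, termbase, max_len=4):
    """Scan left to right, always taking the longest termbase entry
    that starts at the current token."""
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in termbase:
                found.append(candidate)
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no entry starts here; move on by one token
    return found

def extract_terms(document, termbase):
    """Return (term, sentence, term definitions) triples for the whole document."""
    results = []
    for sentence in split_sentences(document):
        tokens = re.findall(r"\w+", sentence)
        for term in forward_max_match(tokens, termbase):
            results.append((term, sentence, termbase[term]))
    return results

# Hypothetical termbase: lower-cased source term -> list of term definitions.
TERMBASE = {
    "super computer": ["supercomputer", "high-performance computer"],
    "computer": ["computer"],
}
```

Because the longest candidate is tried first, "Super Computer" is matched as one term rather than as the shorter entry "computer".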
Step 11: extract from the corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sort the corpus entries from high to low by matching degree, and filter out the corpus entries that do not contain the term.
To provide the translator with accurate term definitions, the embodiment needs to make use of the context of the term, which can be done through corpus matching. Specifically, the corpus entries whose matching degree with the sentence containing the target term is greater than the preset threshold are extracted from the corpus. The corpus of the embodiment stores a large number of source sentences and their corresponding translations. A search system such as ElasticSearch can be used to match, according to the preset threshold, corpus entries similar to the sentence containing the target term. The preset threshold can be set as needed, for example to 50%.
Several sentences may satisfy the extraction condition; the extracted corpus entries are sorted by matching degree. Entries extracted in this way may not contain the term from the document to be translated, so the entries that do not contain it must also be filtered out.
It is worth noting that the embodiment places no restriction on how the matching degree between sentences is calculated.
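The corpus matching step can be sketched as follows. Since the text leaves the matching-degree calculation open, this sketch substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher` for a search engine score; the function name `match_corpus` and the (source, translation) pair layout of the corpus are assumptions.

```python
from difflib import SequenceMatcher

def match_corpus(sentence, term, corpus, threshold=0.5):
    """Return (score, source, translation) triples, best match first.

    An entry is kept only if its matching degree with `sentence` is greater
    than `threshold` and its source text contains `term`.
    """
    kept = []
    for source, translation in corpus:
        # Stand-in matching degree (assumed): difflib similarity ratio in [0, 1].
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score > threshold and term.lower() in source.lower():
            kept.append((score, source, translation))
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept
```

In a production setting the ratio would be replaced by whatever score the chosen search system returns; only the threshold, sort and filter logic follows the step described above.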
Step 12: obtain the corpus definition corresponding to the term in each corpus entry by a word alignment method.
Specifically, the translation of the target term in an extracted corpus entry is called the corpus definition. The extracted corpus entries are sentences, together with their translations, whose matching degree with the sentence containing the target term is greater than the preset threshold; they can be regarded as sentences highly similar to the sentence containing the target term. The translation of the target term in an extracted entry is an existing, already completed translation resource and can be regarded as a relatively accurate definition. To provide the translator with accurate term definitions, the translations corresponding to the target term in all extracted entries need to be found and screened. They can be obtained by a word alignment method: if the context of the target term in the entry is fully aligned by the word alignment method, the translation of the target term can be obtained directly.
Step 13: screen the term definitions using the corpus definition in combination with the string edit distance algorithm to obtain the final term definition.
Specifically, once the corpus definition has been obtained, it is used to screen the term definitions. The string edit distance is the minimum number of edit operations needed to convert one string into another, and it can be used to compare the similarity between the corpus definition and a term definition. For example, two renderings of "supercomputer" that differ by a single character have an edit distance of 1. The smaller the edit distance, the higher the similarity; a similarity of zero indicates that the edit distance is very large.
Screening the term definitions using the corpus definition in combination with the string edit distance algorithm means comparing, by the string edit distance algorithm, the similarity between each term definition and the corpus definition, deleting the term definitions whose similarity is zero, and sorting the remaining term definitions by edit distance from small to large. The result is the final term definition, that is, the term definition most similar to the corpus definition.
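The screening in Step 13 can be sketched with a standard Levenshtein distance. The text does not say how similarity is derived from the edit distance, so the normalisation used here (1 minus distance divided by the longer string's length) is an assumption, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def screen_definitions(term_definitions, corpus_definitions):
    """Drop term definitions with zero similarity to every corpus definition,
    then sort the rest by their smallest edit distance (ascending)."""
    if not corpus_definitions:
        return list(term_definitions)
    ranked = []
    for definition in term_definitions:
        best_distance = min(edit_distance(definition, c) for c in corpus_definitions)
        best_similarity = max(similarity(definition, c) for c in corpus_definitions)
        if best_similarity > 0:
            ranked.append((best_distance, definition))
    ranked.sort()
    return [definition for _, definition in ranked]
```

The first element of the returned list plays the role of the final term definition; the remaining elements are kept only to show the ordering.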
The context-based term translation method provided by the embodiment extracts the definitions of terms in the corpus by a word alignment method and thereby screens out the best definition based on context. It improves the term prompt function in computer-aided translation, provides the translator with the best definition that fits the context, saves the time the translator spends choosing a term translation, avoids repeated work during translation, and can effectively increase the translator's efficiency.
On the basis of the above embodiment, before the step of extracting from the corpus the corpus entries whose similarity with the sentence containing the term is greater than the preset threshold, the method further includes:
classifying the document to be translated using a pre-established classifier, and assigning the document to one or more industry categories according to probability;
querying the dictionary definition corresponding to the term in the dictionary of the industry category, and pre-screening the term definitions using the dictionary definition.
Correspondingly, the step of screening the term definitions using the corpus definition in combination with the string edit distance algorithm to obtain the final term definition is specifically:
screening the pre-screened term definitions using the corpus definition in combination with the string edit distance algorithm to obtain the final term definition.
Specifically, an embodiment of the present invention also provides a context-based term translation method in which the term definitions are pre-screened using dictionary definitions before they are screened using the corpus definition. The detailed flow of this method may be as follows:
First, the document to be translated is classified by industry: a Naive Bayes classifier is built from a large number of source texts already labelled with industry tags, with the classes set according to industry. The established classifier is used to classify the document to be translated, assigning it to one or more categories according to probability. In particular, when the probability of the document belonging to every category is very low, the document is considered not to belong to any industry category and is assigned to the general category.
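The industry classification can be sketched with a small multinomial Naive Bayes classifier. The class name, whitespace tokenisation, add-one smoothing, the `min_probability` cut-off and the "general" fallback label are assumptions chosen for illustration; the text only states that a Naive Bayes classifier is trained on industry-labelled source texts.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesIndustryClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocabulary = set()

    def train(self, labelled_texts):
        for text, label in labelled_texts:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def probabilities(self, text):
        """Posterior probability of each label, normalised to sum to 1."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for word in words:
                # Add-one (Laplace) smoothing so unseen words do not zero the product.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def classify(self, text, min_probability=0.6):
        """Labels whose probability reaches the cut-off, most probable first;
        falls back to the general category when none does."""
        probs = self.probabilities(text)
        chosen = [label for label, p in probs.items() if p >= min_probability]
        chosen.sort(key=lambda label: probs[label], reverse=True)
        return chosen or ["general"]
```

With a lower cut-off, several industry categories can be returned for one document, which corresponds to assigning the document to more than one category.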
Obtaining term definitions: sentence segmentation is performed on the document to be translated, the terms in the document and the sentences in which they occur are extracted by applying the forward maximum matching method with the termbase pre-selected by the translator, and one or more term definitions corresponding to each term are obtained from the termbase.
Pre-screening with dictionary definitions: the dictionary definition corresponding to the term is queried in the dictionary of the industry category, the similarity between each term definition and the dictionary definition is compared by the string edit distance algorithm, the term definitions are sorted by edit distance from small to large, and the term definitions whose similarity is zero are filtered out, yielding the pre-screened term definitions.
Matching the corpus: the corpus entries whose similarity with the sentence containing the term is greater than the preset threshold are extracted from the corpus, sorted from high to low by similarity, and the entries that do not contain the term are filtered out.
Extracting the term definition in the corpus by word alignment: the corpus definition of the term in each corpus entry is obtained by a word alignment method.
Term definition screening: the similarity between the pre-screened term definitions and the corpus definition is compared by the string edit distance algorithm, the pre-screened term definitions are sorted by edit distance from small to large, and the term definitions whose similarity is zero are deleted, yielding the final term definition.
The context-based term translation method provided by the embodiment pre-screens the term definitions using dictionary definitions and then extracts the definitions of terms in the corpus by a word alignment method. This raises the translation accuracy of the term definitions and screens out the best definition based on context. It improves the term prompt function in computer-aided translation, provides the translator with the best definition that fits the context, saves the time the translator spends choosing a term translation, avoids repeated work during translation, and can effectively increase the translator's efficiency.
Based on the above embodiments, the step of obtaining the corpus definition corresponding to the term in the corpus entry by a word alignment method is specifically:
scoring the word alignment of the term in the corpus entry using a preset scoring model, and taking the target word with the highest word alignment score as the corpus definition of the term;
wherein the preset scoring model is as follows:
[formula not reproduced in this text]
In the above formula, src denotes the source word, dst denotes the target word, similarity denotes the definition similarity between the source word src and the target word dst, w_i denotes the weight of the i-th factor, score_i denotes the score of the i-th factor, q_j denotes the weight of the j-th of the four context words of the source word src, distance_j denotes, if the j-th word is aligned, the distance between the word it is aligned to and the target word dst, and len denotes the number of verbs and nouns contained in the source sentence of the corpus entry.
Specifically, the word alignment of the term in every extracted corpus entry is scored using the preset scoring model, which consists of three parts:
The first part is the similarity measure: similarity denotes the definition similarity between the source word src and the target word dst. It is 1 if they are identical, 0.8 if they are 80% similar, 0.5 if they are half similar, and 0 if they are entirely different.
The second part of the scoring model measures the scoring factors: w_i denotes the weight of the i-th factor and score_i denotes its score (1 means fully satisfied, 0.5 means half satisfied). The weights w_i are obtained by training on a large number of word-aligned bilingual sentences, and score_i covers the following types:
whether src and dst have the same part of speech: score_i is 1 if they do and 0 if they do not;
the alignment correlation of the two context words on each side of src and dst: score_i is 1 if the context words of src are aligned with the context words of dst, 0.5 if half are aligned, and 0 if none are aligned. For example, if the context of src is A B src C D and the context of dst is E F dst G H, and A aligns to E, B to F, C to G and D to H, then score_i is 1. With the same contexts, if A aligns to G and B aligns to H, then score_i is 0.5.
The third part of the scoring model is a penalty value, where q_j denotes the weight of the j-th of the four context words of src; for example, if the context of src is A B src C D, B and C have weight 1 and A and D have weight 0.5. distance_j denotes, if the j-th word is aligned, the distance between the word it aligns to and dst, and len denotes the number of nouns and verbs in the source sentence of the corpus entry. For example, suppose the source sentence contains 10 nouns and verbs in total, the context of the term src is A B src C D, and the context of the word dst in the translation is E F dst G H. If A, B and C are unaligned and D aligns to the 5th word after H, then distance_j = 5 and the third part evaluates to 0.5 × 5 / 10 = 0.25.
When the total of the three parts of the word alignment scoring model is greater than 1, the total score is taken as 1. The target word with the highest word alignment score is taken as the corpus definition of the term.
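The three parts can be combined in a short Python sketch. The exact formula is not available in this text, so the combination used here (similarity plus the weighted factor scores, minus the penalty, capped at 1) is an assumption inferred from the description of the penalty as a deduction and from the worked example; the function names and the sample weights are hypothetical.

```python
def penalty(context, noun_verb_count):
    """Third part (assumed form): sum of q_j * distance_j / len over the
    context words that are aligned. `context` holds (q_j, distance_j) pairs,
    with distance_j set to None for an unaligned context word."""
    return sum(q * d / noun_verb_count for q, d in context if d is not None)

def alignment_score(similarity, factors, context, noun_verb_count):
    """Assumed combination:
        score = similarity + sum(w_i * score_i) - penalty, capped at 1.

    similarity       -- definition similarity of src and dst, in [0, 1]
    factors          -- (w_i, score_i) pairs for the scoring factors
    context          -- (q_j, distance_j) pairs for the four context words of src
    noun_verb_count  -- len: nouns and verbs in the corpus source sentence
    """
    factor_part = sum(w * s for w, s in factors)
    total = similarity + factor_part - penalty(context, noun_verb_count)
    return min(1.0, total)

# Worked example from the text: A, B and C are unaligned, D (weight 0.5)
# aligns to a word at distance 5, and the source sentence has 10 nouns and verbs.
EXAMPLE_CONTEXT = [(0.5, None), (1.0, None), (1.0, None), (0.5, 5)]
```

Scoring every candidate target word in a corpus entry with `alignment_score` and keeping the maximum corresponds to taking the word with the highest word alignment score as the corpus definition.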
As shown in Fig. 2, being the structural representation of the Terminology Translation device provided in an embodiment of the present invention based on context of co-text Figure, comprising: definitions obtain module 20, corpus extraction module 21, word alignment module 22 and paraphrase screening module 23, wherein
Definitions obtain module 20, for carrying out subordinate sentence processing to waiting for translating shelves, utilize maximum forward matching algorithm knot The pre-selected terminology bank of interpreter is closed to extract the sentence where the term and the term in the waiting for translating shelves, and The corresponding one or more definitions of the term are obtained from the terminology bank.
Specifically, definitions, which obtain module 20, can use existing general subordinate sentence algorithm to waiting for translating shelves progress subordinate sentence Waiting for translating shelves cutting is multiple sentences, convenient for the sentence where subsequent extracted term and term by processing.
Interpreter would generally pre-select the multiple terminology banks of connection when carrying out document translation, and terminology bank also known as automates word Allusion quotation, according to the available multiple paraphrase to term of terminology bank, the embodiment of the present invention is in order to true using the context of co-text of term The corresponding best paraphrase of term is made, definitions obtain module 20 and need the sentence where the term and term in waiting for translating shelves Son extracts.Using maximum forward matching algorithm can by the term extraction in each sentence of waiting for translating shelves come out (for the ease of It describes hereinafter, the term extracted is known as target terms).Correspondingly, the sentence where term can also determine.Its In, maximum forward matching algorithm is basic segmentation methods, and application is more mature, and details are not described herein.
Then definitions obtain module 20 and obtain the corresponding definitions of target terms from terminology bank, and definitions are Refer to that paraphrase of the term in terminology bank, paraphrase of the term in terminology bank may be one or more.
Corpus extraction module 21, for extracting, match degree is greater than the preset threshold with the sentence where the term in corpus Corpus, the corpus is ranked up from high to low according to matching degree, and filter out the corpus not comprising the term.
In order to provide accurate definitions to interpreter, the embodiment of the present invention needs the context of co-text using term, and Context of co-text using can be realized by corpus matching.Specifically, corpus extraction module 21 extract corpus in mesh The sentence corpus that match degree is greater than the preset threshold where term is marked, is stored with a large amount of sentence in the corpus of the embodiment of the present invention Sub- original text and its corresponding translation, can be matched by search systems such as ElasticSearch according to preset threshold and target The similar corpus of sentence where term, the value of preset threshold can be set as needed, such as 50%.
The corpus for meeting said extracted condition can be multiple sentences, and corpus extraction module 21 is right according to the height of matching degree The corpus extracted is ranked up.There may be the arts not included in waiting for translating shelves for the corpus extracted according to the method described above The case where language, therefore, corpus extraction module 21, also need to filter out the corpus not comprising waiting for translating shelves term.
Word alignment module 22, for obtaining the corresponding corpus paraphrase of term described in the corpus using word alignment method.
The corpus extracted is sentence the sentence and its translation that match degree is greater than the preset threshold where with target terms, It may be considered and the higher sentence of sentence similarity where target terms.The translation of target terms is in the corpus extracted Translated good, existing translated resources, it is believed that be more accurately paraphrase.It is released to provide accurate term to interpreter Justice, word alignment module 22 need to find the corresponding translation of target terms in all corpus extracted and are screened.It can adopt Word alignment schemes obtain, if the context of target terms all passes through word alignment method and realizes alignment in corpus, then just The translation of target terms can be directly obtained.
A paraphrase screening module 23, configured to screen the term definitions using the corpus paraphrase in combination with a string edit distance algorithm, to obtain the final term definitions.
Specifically, after the corpus paraphrase is obtained, the paraphrase screening module 23 uses it to screen the above term definitions. The string edit distance is the minimum number of edit operations needed to convert one character string into another. It can be used to compare the similarity between the corpus paraphrase and a term definition: the smaller the edit distance, the higher the similarity. A similarity of zero indicates that the edit distance is very large.
Screening the term definitions using the corpus paraphrase in combination with the string edit distance algorithm means comparing the similarity between each term definition and the corpus paraphrase by the string edit distance algorithm, deleting the term definitions whose similarity is zero, and sorting the remaining term definitions by edit distance in ascending order to obtain the final term definitions. The final term definitions are the term definitions most similar to the corpus paraphrase.
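The screening step can be illustrated with a short sketch. The edit distance below is the standard Levenshtein distance; the text does not state how similarity is derived from the distance, so the normalization used here (one minus the distance divided by the length of the longer string) is an assumption, as are the function names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / longer length (1.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def screen_definitions(definitions, corpus_paraphrase):
    """Drop zero-similarity definitions, then sort by edit distance, smallest first."""
    kept = [d for d in definitions if similarity(d, corpus_paraphrase) > 0.0]
    return sorted(kept, key=lambda d: edit_distance(d, corpus_paraphrase))
```

Under this normalization a similarity of zero means that the edit distance equals the length of the longer string, that is, the two strings share nothing that the alignment can reuse.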
The context-based term translation device provided by the embodiment of the present invention extracts the paraphrase of the term from the corpus by the word alignment method and thereby screens out the best definition based on context. This improves the term prompting function in computer-aided translation, provides the translator with the best definition for the context, saves the time the translator spends choosing a term translation, avoids repeated work during translation, and can effectively improve the translator's efficiency.
Based on the content of the above embodiment, the device further includes:
A classification module, configured to classify the document to be translated using a pre-established classifier, and to assign the document to be translated to one or more industry categories according to the probability level;
A pre-screening module, configured to query the dictionary definition corresponding to the term in the dictionary of the industry category, and to pre-screen the term definitions using the dictionary definition;
Correspondingly, the paraphrase screening module is specifically configured to:
Screen the pre-screened term definitions using the corpus paraphrase in combination with the string edit distance algorithm, to obtain the final term definitions.
Specifically, the classification module builds a naive Bayes text classifier, whose classes are the industry categories, from a large number of source texts that already carry industry labels. The established classifier is used to classify the document to be translated, and the document is assigned to one or more categories according to the probability level. In particular, when the probability that the document belongs to each category is very low, the document is considered not to belong to any industry category and is assigned to the general category.
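A minimal sketch of such an industry classifier is given below. It assumes a multinomial naive Bayes model with add-one smoothing over whitespace-separated tokens (Chinese text would first need word segmentation); the class name, the label "general" and the probability threshold of 0.35 are illustrative assumptions, not values taken from the text.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents per industry label
        self.word_counts = defaultdict(Counter)  # token counts per industry label
        self.vocabulary = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            tokens = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def probabilities(self, text):
        """Return the posterior probability of each label for the given text."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, count in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(count / total_docs)
            for token in tokens:
                numerator = self.word_counts[label][token] + 1
                score += math.log(numerator / (total_words + len(self.vocabulary)))
            log_scores[label] = score
        # Normalize in log space to avoid numerical underflow.
        peak = max(log_scores.values())
        exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

    def classify(self, text, threshold=0.35):
        """Labels whose probability reaches the threshold, else ['general']."""
        ranked = sorted(self.probabilities(text).items(),
                        key=lambda kv: kv[1], reverse=True)
        selected = [label for label, p in ranked if p >= threshold]
        return selected or ["general"]
```

With three equally likely industries, a document containing no known vocabulary receives a probability of one third per label, falls below the assumed threshold, and is therefore assigned to the general category.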
The pre-screening module queries the dictionary definition corresponding to the term in the dictionary of the industry category, compares the similarity between each term definition and the dictionary definition by the string edit distance algorithm, sorts the term definitions by edit distance in ascending order, and filters out the term definitions whose similarity is zero, to obtain the pre-screened term definitions.
The context-based term translation device provided by the embodiment of the present invention pre-screens the term definitions using dictionary definitions and then extracts the paraphrase of the term from the corpus by the word alignment method, which can improve the translation accuracy of the term definitions.
Based on the content of the above embodiment, the word alignment module 22 is specifically configured to:
Score the word alignment of the term in the corpus using a preset scoring model, and take the translation word with the highest word alignment score as the corpus paraphrase of the term;
Wherein the preset scoring model is as follows:
In the above formula, src denotes the source word, dst denotes the translation word, similarity denotes the definition similarity between source word src and translation word dst, w_i denotes the weight of the i-th factor, score_i denotes the score of the i-th factor, q_j denotes the weight of the j-th word among the four context words of source word src, distance_j denotes, if the j-th word has been aligned, the distance between its aligned word and translation word dst, and len denotes the number of verbs and nouns contained in the source text of the corpus entry.
Specifically, the word alignment module 22 scores the word alignment of the term in all of the extracted corpus entries using the preset scoring model. The scoring model covers three aspects:
The first aspect is the similarity measure: similarity denotes the definition similarity between source word src and translation word dst. It is 1 if they are identical, 0.8 if they are 80% similar, 0.5 if they are half similar, and 0 if they are entirely different.
The second aspect of the scoring model is the measurement of the scoring factors: w_i denotes the weight of the i-th factor and score_i denotes the score of the i-th factor (1 means the factor is fully satisfied, 0.5 means it is half satisfied). The weights w_i are obtained by training on a large number of bilingual sentences that include word alignments, and score_i includes the following types:
Whether src and dst have the same part of speech: if they are the same, score_i is 1; otherwise, score_i is 0;
The alignment correlation of the two context words before and the two context words after src and dst: if the context words of src are aligned with the context words of dst, score_i is 1; if half of them are aligned, score_i is 0.5; if none of them are aligned, score_i is 0.
The third aspect of the scoring model is the penalty value, where q_j denotes the weight of the j-th word among the four context words of src, distance_j denotes, if that word has been aligned, the distance between its aligned word and dst, and len denotes the number of nouns and verbs contained in the source text of the corpus entry.
Fig. 3 shows a schematic structural diagram of the electronic device provided by an embodiment of the present invention. As shown in Fig. 3, the electronic device includes a processor 301, a memory 302 and a bus 303;
Wherein the processor 301 and the memory 302 communicate with each other via the bus 303; the processor 301 is configured to call the program instructions in the memory 302 to execute the context-based term translation method provided by the above embodiments, for example including: performing sentence segmentation on the document to be translated, extracting the terms in the document to be translated and the sentences containing them using the forward maximum matching algorithm in combination with the term base pre-selected by the translator, and obtaining from the term base the one or more term definitions corresponding to each term; extracting from the corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries by similarity from high to low, and filtering out the corpus entries that do not contain the term; obtaining, by a word alignment method, the corpus paraphrase corresponding to the term in the corpus; and screening the term definitions using the corpus paraphrase in combination with the string edit distance algorithm, to obtain the final term definitions.
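The longest-match term extraction named in the first step of the method can be sketched as follows. This is a character-level sketch of forward maximum matching under stated assumptions: the term base is modelled as a plain set of term strings, and the function name and return format are hypothetical.

```python
def forward_maximum_match(sentence, term_base, max_len=None):
    """Scan left to right, always taking the longest term-base entry at each position.

    Returns a list of (start_index, term) pairs found in the sentence.
    """
    if not term_base:
        return []
    if max_len is None:
        max_len = max(len(term) for term in term_base)
    matches = []
    position = 0
    while position < len(sentence):
        # Try the longest window first and shrink until a term-base entry is found.
        for size in range(min(max_len, len(sentence) - position), 0, -1):
            candidate = sentence[position:position + size]
            if candidate in term_base:
                matches.append((position, candidate))
                position += size
                break
        else:
            position += 1  # no term starts here; advance one character
    return matches
```

Because the longest candidate is tried first, a term base containing both "机器" and "机器翻译" yields the longer term for the sentence "机器翻译系统". If the term base is a dictionary mapping each term to its definitions, the same membership test applies and the definitions can be looked up for each match.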
An embodiment of the present invention provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the context-based term translation method provided by the above embodiments, for example including: performing sentence segmentation on the document to be translated, extracting the terms in the document to be translated and the sentences containing them using the forward maximum matching algorithm in combination with the term base pre-selected by the translator, and obtaining from the term base the one or more term definitions corresponding to each term; extracting from the corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries by similarity from high to low, and filtering out the corpus entries that do not contain the term; obtaining, by a word alignment method, the corpus paraphrase corresponding to the term in the corpus; and screening the term definitions using the corpus paraphrase in combination with the string edit distance algorithm, to obtain the final term definitions.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods of the embodiments or of certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A context-based term translation method, characterized by comprising:
Performing sentence segmentation on a document to be translated, extracting the terms in the document to be translated and the sentences containing the terms using a forward maximum matching algorithm in combination with a term base pre-selected by the translator, and obtaining from the term base one or more term definitions corresponding to each term;
Extracting from a corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sorting the corpus entries by matching degree from high to low, and filtering out the corpus entries that do not contain the term;
Obtaining, by a word alignment method, the corpus paraphrase corresponding to the term in the corpus;
Screening the term definitions using the corpus paraphrase in combination with a string edit distance algorithm, to obtain final term definitions.
2. The method according to claim 1, characterized in that, before the step of extracting from the corpus the corpus entries whose similarity to the sentence containing the term is greater than the preset threshold, the method further comprises:
Classifying the document to be translated using a pre-established classifier, and assigning the document to be translated to one or more industry categories according to the probability level;
Querying the dictionary definition corresponding to the term in the dictionary of the industry category, and pre-screening the term definitions using the dictionary definition;
Correspondingly, the step of screening the term definitions using the corpus paraphrase in combination with the string edit distance algorithm to obtain the final term definitions is specifically:
Screening the pre-screened term definitions using the corpus paraphrase in combination with the string edit distance algorithm, to obtain the final term definitions.
3. The method according to claim 1, characterized in that the step of obtaining, by a word alignment method, the corpus paraphrase corresponding to the term in the corpus is specifically:
Scoring the word alignment of the term in the corpus using a preset scoring model, and taking the translation word with the highest word alignment score as the corpus paraphrase of the term;
Wherein the preset scoring model is as follows:
In the above formula, src denotes the source word, dst denotes the translation word, similarity denotes the definition similarity between source word src and translation word dst, w_i denotes the weight of the i-th factor, score_i denotes the score of the i-th factor, q_j denotes the weight of the j-th word among the four context words of source word src, distance_j denotes, if the j-th word has been aligned, the distance between its aligned word and translation word dst, and len denotes the number of verbs and nouns contained in the source text of the corpus entry.
4. The method according to claim 1, characterized in that the step of screening the term definitions using the corpus paraphrase in combination with the string edit distance algorithm to obtain the final term definitions is specifically:
Comparing the similarity between the term definitions and the corpus paraphrase by the string edit distance algorithm;
If the similarities are not all zero, deleting the term definitions whose similarity is zero, and sorting the term definitions by edit distance in ascending order, to obtain the final term definitions.
5. The method according to claim 2, characterized in that the step of pre-screening the term definitions using the dictionary definition is specifically:
Comparing the similarity between the term definitions and the dictionary definition by the string edit distance algorithm;
Sorting the term definitions by edit distance in ascending order, and deleting the term definitions whose similarity is zero, to obtain the pre-screened term definitions.
6. A context-based term translation device, characterized by comprising:
A term definition obtaining module, configured to perform sentence segmentation on a document to be translated, extract the terms in the document to be translated and the sentences containing the terms using a forward maximum matching algorithm in combination with a term base pre-selected by the translator, and obtain from the term base one or more term definitions corresponding to each term;
A corpus extraction module, configured to extract from a corpus the corpus entries whose matching degree with the sentence containing the term is greater than a preset threshold, sort the corpus entries by matching degree from high to low, and filter out the corpus entries that do not contain the term;
A word alignment module, configured to obtain, by a word alignment method, the corpus paraphrase corresponding to the term in the corpus;
A paraphrase screening module, configured to screen the term definitions using the corpus paraphrase in combination with a string edit distance algorithm, to obtain final term definitions.
7. The device according to claim 6, characterized by further comprising:
A classification module, configured to classify the document to be translated using a pre-established classifier, and to assign the document to be translated to one or more industry categories according to the probability level;
A pre-screening module, configured to query the dictionary definition corresponding to the term in the dictionary of the industry category, and to pre-screen the term definitions using the dictionary definition;
Correspondingly, the paraphrase screening module is specifically configured to:
Screen the pre-screened term definitions using the corpus paraphrase in combination with the string edit distance algorithm, to obtain the final term definitions.
8. The device according to claim 6, characterized in that the word alignment module is specifically configured to:
Score the word alignment of the term in the corpus using a preset scoring model, and take the translation word with the highest word alignment score as the corpus paraphrase of the term;
Wherein the preset scoring model is as follows:
In the above formula, src denotes the source word, dst denotes the translation word, similarity denotes the definition similarity between source word src and translation word dst, w_i denotes the weight of the i-th factor, score_i denotes the score of the i-th factor, q_j denotes the weight of the j-th word among the four context words of source word src, distance_j denotes, if the j-th word has been aligned, the distance between its aligned word and translation word dst, and len denotes the number of verbs and nouns contained in the source text of the corpus entry.
9. An electronic device, characterized by comprising:
At least one processor; and
At least one memory communicatively connected to the processor, wherein:
The memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method according to any one of claims 1 to 5.
CN201811025328.3A 2018-09-04 2018-09-04 Context-based term translation method and device Active CN109299480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811025328.3A CN109299480B (en) 2018-09-04 2018-09-04 Context-based term translation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811025328.3A CN109299480B (en) 2018-09-04 2018-09-04 Context-based term translation method and device

Publications (2)

Publication Number Publication Date
CN109299480A true CN109299480A (en) 2019-02-01
CN109299480B CN109299480B (en) 2023-11-07

Family

ID=65166187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811025328.3A Active CN109299480B (en) 2018-09-04 2018-09-04 Context-based term translation method and device

Country Status (1)

Country Link
CN (1) CN109299480B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5850561A (en) * 1994-09-23 1998-12-15 Lucent Technologies Inc. Glossary construction tool
CA2793268A1 (en) * 2011-10-21 2013-04-21 National Research Council Of Canada Method and apparatus for paraphrase acquisition
CN106156013A (en) * 2016-06-30 2016-11-23 电子科技大学 The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN107908712A (en) * 2017-11-10 2018-04-13 哈尔滨工程大学 Cross-language information matching process based on term extraction

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413757B (en) * 2019-07-30 2022-02-25 中国工商银行股份有限公司 Word paraphrase determining method, device and system
CN110413757A (en) * 2019-07-30 2019-11-05 中国工商银行股份有限公司 A kind of word paraphrase determines method, apparatus and system
CN110543644B (en) * 2019-09-04 2023-08-29 语联网(武汉)信息技术有限公司 Machine translation method and device containing term translation and electronic equipment
CN110543644A (en) * 2019-09-04 2019-12-06 语联网(武汉)信息技术有限公司 Machine translation method and device containing term translation and electronic equipment
CN110717340A (en) * 2019-09-29 2020-01-21 百度在线网络技术(北京)有限公司 Recommendation method and device, electronic equipment and storage medium
CN110717340B (en) * 2019-09-29 2023-11-21 百度在线网络技术(北京)有限公司 Recommendation method, recommendation device, electronic equipment and storage medium
CN112836523A (en) * 2019-11-22 2021-05-25 上海流利说信息技术有限公司 Word translation method, device and equipment and readable storage medium
CN112836523B (en) * 2019-11-22 2022-12-30 上海流利说信息技术有限公司 Word translation method, device and equipment and readable storage medium
CN111191469A (en) * 2019-12-17 2020-05-22 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111191469B (en) * 2019-12-17 2023-09-19 语联网(武汉)信息技术有限公司 Large-scale corpus cleaning and aligning method and device
CN111222346A (en) * 2019-12-20 2020-06-02 北京海兰信数据科技股份有限公司 Corpus file processing method and apparatus
CN111597826A (en) * 2020-05-15 2020-08-28 苏州七星天专利运营管理有限责任公司 Method for processing terms in auxiliary translation
CN111797621A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Method and system for replacing terms
CN111652006A (en) * 2020-06-09 2020-09-11 北京中科凡语科技有限公司 Computer-aided translation method and device
CN111652006B (en) * 2020-06-09 2021-02-09 北京中科凡语科技有限公司 Computer-aided translation method and device
CN111738022A (en) * 2020-06-23 2020-10-02 中国船舶工业综合技术经济研究院 Machine translation optimization method and system in national defense and military industry field
CN111738022B (en) * 2020-06-23 2023-04-18 中国船舶工业综合技术经济研究院 Machine translation optimization method and system in national defense and military industry field
CN112052334A (en) * 2020-09-02 2020-12-08 广州极天信息技术股份有限公司 Text paraphrasing method, text paraphrasing device and storage medium
CN112052334B (en) * 2020-09-02 2024-04-05 广州极天信息技术股份有限公司 Text interpretation method, device and storage medium
CN112364669A (en) * 2020-10-14 2021-02-12 北京中科凡语科技有限公司 Method, device, equipment and storage medium for translating translated terms by machine translation
CN113627200A (en) * 2021-06-15 2021-11-09 天津师范大学 International organization science and technology term subject sentence extraction method driven by multi-machine translation engine
CN113627200B (en) * 2021-06-15 2023-12-08 天津师范大学 International organization science and technology term topic sentence extraction method driven by multi-machine translation engine
CN114091483A (en) * 2021-10-27 2022-02-25 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
CN114238619A (en) * 2022-02-23 2022-03-25 成都数联云算科技有限公司 Method, system, device and medium for screening Chinese nouns based on edit distance
CN114238619B (en) * 2022-02-23 2022-04-29 成都数联云算科技有限公司 Method, system, device and medium for screening Chinese nouns based on edit distance
CN114781409A (en) * 2022-05-12 2022-07-22 北京百度网讯科技有限公司 Text translation method and device, electronic equipment and storage medium
CN114781409B (en) * 2022-05-12 2023-12-01 北京百度网讯科技有限公司 Text translation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109299480B (en) 2023-11-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant