Term translation mining system and method with a domain customization function
Technical field
The present invention relates to a term translation mining system and method with a domain customization function, and belongs to the field of natural language processing, in particular to Web information mining and extraction.
Background technology
In recent years, Web-based acquisition of translation resources has begun to attract the attention of researchers. Domestic work in this area has been reported: Shanghai Jiao Tong University has studied Web-based methods for acquiring multilingual translation dictionaries; Fujitsu's China research institute has studied Web-based methods for acquiring terminology translations; and the new-generation bidirectional English-Chinese translation system released by CCID Group also uses Web dictionary techniques.
A review of prior work shows that most terminology extraction research uses only monolingual corpora of a designated domain and applies various methods to improve the precision and recall of terminology extraction. Such work reflects neither the characteristics of real domain text nor the problems that application-oriented terminology extraction must address and solve. In bilingual terminology acquisition, domain-customized mining of bilingual terms has not yet been achieved. Domain customization means automatically analyzing the domain relevance of web page text according to the research domain proposed by the user, so that bilingual terminology translation resources can be extracted for different domains. Further research on these aspects is undoubtedly of great value to the practical application of automatic Web-based bilingual resource acquisition.
Invention content
The present invention aims to provide a term translation mining system and method with a domain customization function that overcomes the defect of the prior art, which performs terminology extraction only on monolingual corpora of a designated domain. Starting from input domain seeds, the invention obtains a bilingual corpus of the relevant domain through a search engine and automatically mines term translations online. The system and method of the present invention can automatically analyze and obtain English-Chinese bilingual domain text and mine term translations online: if the input is English, the translation of the English term is mined; if the input is Chinese, the translation of the Chinese term is mined.
The technical solution of the present invention is as follows:
A term translation mining system with a domain customization function, comprising a learning and training unit and a term translation mining unit.
Learning and training unit: comprises an input subunit and a training subunit.
Input subunit: receives the domain seeds, that is, a small-scale Chinese-English keyword list for the user-specified domain, which serves as the seeds for learning and training.
Training subunit: obtains bilingual web page resources for the user-specified domain from the Internet and obtains bilingual translation pairs for that domain.
Term translation mining unit: uses a self-feedback, incremental method for automatically acquiring domain term translations, and obtains and outputs the corresponding translation for each received domain term keyword.
Further, the training subunit comprises a bilingual website and web page identification module, a bilingual resource noise filtering module, a bilingual web page domain similarity calculation module, and a translation sentence pair extraction module.
Bilingual website and web page identification module: uses a keyword-based domain customization model. A Chinese-English keyword list for a specific domain is first input, and a general-purpose search engine is used to establish a set of related websites. A crawler then performs an expanded search over each website and its related links to form the original bilingual web page set for the specific domain, and a vector space model is used to perform preliminary filtering of the original bilingual web page set to form the candidate bilingual web pages.
Bilingual resource noise filtering module: further filters out noisy pairs from the acquired bilingual pairs based on machine translation features.
Bilingual web page domain similarity calculation module: calculates the domain similarity of the candidate bilingual web pages, divides the candidate bilingual web pages into parallel web pages and comparable web pages, and filters out irrelevant candidate web pages.
Translation sentence pair extraction module: further extracts bilingual translation pairs from the acquired parallel web pages and comparable web pages to form aligned bilingual text.
Further, the term translation mining unit comprises a module for automatically acquiring translation candidate units and a final translation selection module.
Module for automatically acquiring translation candidate units: selects valid translation candidate units from the extracted bilingual translation pairs.
Final translation selection module: obtains the final translation through screening and ranking.
Further, the module for automatically acquiring translation candidate units operates as follows. First, domain terms, keywords, or named entities, together with their translations, are used as seeds to automatically acquire a bilingual corpus for the given domain. Then, based on the acquired bilingual corpus, conventional term translation acquisition methods are used to automatically acquire domain term translations. The acquired term translations are in turn used to obtain a larger domain bilingual corpus, on which a new round of term translation acquisition is performed. Through repeated feedback in this way, the domain bilingual corpus and the domain term translations are acquired incrementally.
A term translation mining method with a domain customization function, implemented with the system described in any of the above, comprising the following steps:
S1. Selection of candidate bilingual mixed web pages.
S2. Extraction of bilingual resources from the bilingual mixed web pages.
S3. Calculation of bilingual web page domain similarity: the domain similarity of the candidate bilingual web pages is calculated, the candidate bilingual web pages are divided into parallel web pages and comparable web pages, and irrelevant candidate web pages are filtered out.
S4. Extraction of translation sentence pairs from the bilingual web pages: bilingual translation pairs are further extracted from the acquired parallel web pages and comparable web pages to form aligned bilingual text.
S5. Automatic acquisition of term translations based on the domain bilingual corpus: a self-feedback, incremental method for automatically acquiring domain term translations is used to obtain and output the corresponding translation for each received domain term keyword.
Further, step S1 is specifically as follows: a keyword-based domain customization model is used. A Chinese-English keyword list for a specific domain is first input, and a general-purpose search engine is used to establish a set of related websites. A crawler then performs an expanded search over each website and its related links to form the original bilingual web page set for the specific domain, and a vector space model is used to perform preliminary filtering of the original bilingual web page set to form the candidate bilingual web pages.
Further, in step S2, the extraction of bilingual resources from the bilingual mixed web pages consists of further filtering, based on machine translation features, the noisy bilingual pairs obtained in step S1.
Further, in step S5, the self-feedback, incremental method for automatically acquiring domain term translations is specifically as follows. First, domain terms, keywords, or named entities, together with their translations, are used as seeds to automatically acquire a bilingual corpus for the given domain. Then, based on the acquired bilingual corpus, conventional term translation acquisition methods are used to automatically acquire domain term translations. The acquired term translations are in turn used to obtain a larger domain bilingual corpus, on which a new round of term translation acquisition is performed. Through repeated feedback in this way, the domain bilingual corpus and the domain term translations are acquired incrementally.
The beneficial effects of the present invention are as follows. The term translation mining system and method with a domain customization function can meet people's need to quickly obtain the technical terms of a given domain from the Internet, provide translation information for researchers reading professional material, and provide a resource guarantee for compiling and updating terminology dictionaries. Based on parallel web page text on the Internet, the system and method study domain-customized Web term translation extraction models and algorithms, make full use of cross-language resources on the Internet, and realize domain-customized term translation mining, thereby providing important technical and resource support for cross-language natural language processing tasks. The system and method can be applied to related fields such as terminology dictionary compilation, machine translation, information retrieval, question answering systems, and topic content analysis.
Description of the drawings
Fig. 1 is a block diagram of the term translation mining system with a domain customization function according to an embodiment of the present invention.
Fig. 2 is a schematic flow diagram of the term translation mining method with a domain customization function according to the embodiment.
Fig. 3 is a schematic flow diagram of Web-based acquisition of a domain-specific bilingual corpus in the embodiment.
Fig. 4 is a schematic diagram of the self-feedback, incremental acquisition of domain term translations in the embodiment.
Specific implementation mode
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment
The term translation mining system and method with a domain customization function of the embodiment can mine term translations online: if the input is English, the translation of the English term is mined; if the input is Chinese, the translation of the Chinese term is mined.
The term translation mining system with a domain customization function comprises two major parts, a learning and training unit and a term translation mining unit, as shown in Fig. 1.
The learning and training unit comprises an input subunit and a training subunit. The input subunit receives the domain seeds, that is, a small-scale Chinese-English keyword list for the user-specified domain, which serves as the seeds for learning and training. The training subunit uses a predetermined method to obtain bilingual web page resources for the user-specified domain from the Internet and to obtain bilingual translation pairs for that domain.
The training subunit comprises a bilingual website and web page identification module, a bilingual resource noise filtering module, a bilingual web page domain similarity calculation module, and a translation sentence pair extraction module. Fig. 3 shows the Web-based flow for acquiring the domain bilingual corpus.
Bilingual website and web page identification module: uses a keyword-based domain customization model. A Chinese-English keyword list for a specific domain is first input; the list can be very small. A general-purpose search engine is used to establish a set of related websites. A crawler then performs an expanded search over each website and its related links to form the original bilingual web page set for the specific domain, and a vector space model is used to perform preliminary filtering of the original bilingual web page set to form the candidate bilingual web pages.
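The preliminary filtering step can be illustrated with a minimal sketch. It assumes that each crawled page is already available as plain text, represents a page as a vector of keyword counts over the seed list, and keeps the pages whose cosine similarity to an all-ones seed vector reaches a threshold. The function names, the substring-count weighting, and the threshold of 0.3 are illustrative assumptions; the source does not specify how the vector space model is configured.

```python
import math


def keyword_vector(text, keywords):
    """Count how often each seed keyword occurs in the page text (case-insensitive)."""
    lowered = text.lower()
    return [lowered.count(keyword.lower()) for keyword in keywords]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(pages, keywords, threshold=0.3):
    """Keep pages whose keyword vector is close enough to the seed vector.

    pages: dict mapping a URL to the page's plain text.
    threshold: hypothetical cut-off, not taken from the source.
    Returns (url, score) tuples sorted by descending score.
    """
    seed_vector = [1] * len(keywords)
    kept = []
    for url, text in pages.items():
        score = cosine(keyword_vector(text, keywords), seed_vector)
        if score >= threshold:
            kept.append((url, score))
    return sorted(kept, key=lambda item: -item[1])
```

A page that mentions none of the seed keywords receives a score of 0.0 and is dropped, while a page that covers the Chinese and English keywords evenly scores close to 1.0.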
A bilingual mixed web page is a single web page that contains content in more than one language. The heuristic used here is that if a web page contains a translation pair, it may also contain other mutually translated content; accordingly, a translation pair is submitted to a search engine, and the search engine returns the web pages that contain that pair.
Candidate bilingual websites and web pages are websites and web pages that may contain bilingual text. The purpose of identifying them is to restrict the acquisition of bilingual text to likely websites and web pages, which greatly improves the speed of bilingual resource acquisition.
Bilingual resource noise filtering module: further filters out noisy pairs from the acquired bilingual pairs based on machine translation features.
Bilingual web page domain similarity calculation module: calculates the domain similarity of the candidate bilingual web pages, divides the candidate bilingual web pages into parallel web pages and comparable web pages, and filters out irrelevant candidate web pages.
Domain similarity measures for bilingual web pages are studied here from two angles, quantitative analysis and qualitative analysis. Quantitative methods describe the domain similarity of bilingual web pages numerically, for example through the various similarity calculations of a vector space model whose features are based on keyword frequency or position. The qualitative method uses a support vector machine classification model to classify bilingual web pages by domain.
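The quantitative route can be sketched as follows. The example assumes that the English and Chinese text of a candidate have already been separated, builds one keyword-frequency vector per language over a list of seed translation pairs, and maps the cosine similarity of the two vectors to one of three labels. The function names and the two thresholds (0.8 and 0.4) are hypothetical; the source does not give concrete values, and the support vector machine route is not shown here.

```python
import math


def side_vector(text, terms):
    """Frequency of each seed term in one language side of a candidate."""
    lowered = text.lower()
    return [lowered.count(term.lower()) for term in terms]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_candidate(en_text, zh_text, seed_pairs,
                       parallel_threshold=0.8, comparable_threshold=0.4):
    """Label a candidate as 'parallel', 'comparable', or 'irrelevant'.

    seed_pairs: list of (English term, Chinese term) tuples.
    Both thresholds are illustrative assumptions.
    """
    en_vector = side_vector(en_text, [en for en, _ in seed_pairs])
    zh_vector = side_vector(zh_text, [zh for _, zh in seed_pairs])
    score = cosine(en_vector, zh_vector)
    if score >= parallel_threshold:
        return "parallel"
    if score >= comparable_threshold:
        return "comparable"
    return "irrelevant"
```

Because both vectors are indexed by the same seed pairs, a high score means that the two language sides mention the same domain terms in similar proportions.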
Translation sentence pair extraction module: further extracts bilingual translation pairs from the acquired parallel web pages and comparable web pages to form aligned bilingual text. Within a web page, the main content is generally concentrated in a particular region, so the page can first be segmented into regions, and information such as the "bilingual rate", the number of hyperlinks, and the anchor text is then used to estimate the likelihood that each region contains bilingual resources. Based on such structural information, many candidate mutual translation pairs can be selected more accurately.
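As an illustration of region scoring, the sketch below ranks page regions by a simple "bilingual rate". The source does not define this quantity, so the example assumes one possible definition: the ratio of the smaller to the larger of the Chinese character count and the Latin letter count in a region. The function names are hypothetical, and hyperlink counts and anchor text are not modeled.

```python
import re

# Basic CJK Unified Ideographs block and ASCII letters.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_CHAR = re.compile(r"[A-Za-z]")


def bilingual_rate(region_text):
    """Assumed definition: min(count) / max(count) over the two scripts.

    Returns 0.0 for a monolingual or empty region and 1.0 when both
    scripts contribute the same number of characters.
    """
    cjk = len(CJK_CHAR.findall(region_text))
    latin = len(LATIN_CHAR.findall(region_text))
    if cjk == 0 or latin == 0:
        return 0.0
    return min(cjk, latin) / max(cjk, latin)


def rank_regions(regions):
    """Order page regions from most to least likely to hold bilingual content."""
    return sorted(regions, key=bilingual_rate, reverse=True)
```

Counting characters is a crude measure, since an English word usually spans several letters while a Chinese word spans one or two characters; a token-based ratio would be a reasonable alternative under the same assumption.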
Term translation mining unit: uses a self-feedback, incremental method for automatically acquiring domain term translations, and obtains and outputs the corresponding translation for each received domain term keyword.
The term translation mining unit comprises a module for automatically acquiring translation candidate units and a final translation selection module.
Module for automatically acquiring translation candidate units: selects valid translation candidate units from the extracted bilingual translation pairs. First, domain terms, keywords, or named entities, together with their translations, are used as seeds to automatically acquire a bilingual corpus for the given domain. Then, based on the acquired bilingual corpus, conventional term translation acquisition methods are used to automatically acquire domain term translations. The acquired term translations are in turn used to obtain a larger domain bilingual corpus, on which a new round of term translation acquisition is performed. Through repeated feedback in this way, the domain bilingual corpus and the domain term translations are acquired incrementally. Fig. 4 is a schematic diagram of the self-feedback, incremental acquisition of domain term translations.
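The feedback loop described above can be sketched as a small driver function. Corpus acquisition and term translation extraction are passed in as callables, because the source does not fix how either is implemented; the toy stand-ins below (`TOY_WEB`, `TOY_DICTIONARY`, `toy_acquire`, `toy_extract`) and the `max_rounds` limit are hypothetical and exist only to make the example runnable.

```python
def incremental_mining(seeds, acquire_corpus, extract_translations, max_rounds=5):
    """Alternate corpus acquisition and term extraction until nothing new appears.

    seeds: iterable of (source term, translation) pairs.
    acquire_corpus(lexicon): returns bilingual sentence pairs found with the lexicon.
    extract_translations(corpus): returns term translation pairs found in the corpus.
    """
    lexicon = set(seeds)   # known term translation pairs
    corpus = set()         # bilingual sentence pairs acquired so far
    for _ in range(max_rounds):
        new_docs = set(acquire_corpus(lexicon)) - corpus
        corpus |= new_docs
        new_pairs = set(extract_translations(corpus)) - lexicon
        if not new_docs and not new_pairs:
            break          # no feedback left to exploit
        lexicon |= new_pairs
    return lexicon, corpus


# Toy stand-ins: a tiny in-memory "Web" and a fixed dictionary that plays
# the role of a conventional term translation acquisition method.
TOY_WEB = [
    ("term extraction uses a corpus", "术语抽取使用语料库"),
    ("a parallel corpus aligns sentences", "平行语料库对齐句子"),
    ("cooking recipes", "烹饪食谱"),
]
TOY_DICTIONARY = [
    ("term", "术语"), ("corpus", "语料库"), ("parallel", "平行"),
    ("sentences", "句子"), ("recipes", "食谱"),
]


def toy_acquire(lexicon):
    """Return sentence pairs that contain at least one known translation pair."""
    return [doc for doc in TOY_WEB
            if any(en in doc[0] and zh in doc[1] for en, zh in lexicon)]


def toy_extract(corpus):
    """Return dictionary pairs that co-occur in some acquired sentence pair."""
    return [(en, zh) for en, zh in TOY_DICTIONARY
            if any(en in doc[0] and zh in doc[1] for doc in corpus)]
```

Starting from a single seed pair, each round uses the enlarged lexicon to reach more text and the enlarged corpus to find more term translations, and the loop stops once a round contributes neither.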
Final translation selection module: obtains the final translation through screening and ranking, which improves the performance of the overall query translation.
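One way to picture screening and ranking is the sketch below. It assumes aligned sentence pairs are available, screens out candidates that never co-occur with the query term, and ranks the remainder by the Dice coefficient of their co-occurrence. The Dice score, the `min_cooccur` parameter, and the function name are illustrative choices; the source does not state which screening or ranking criteria are used.

```python
def rank_translations(term, candidates, sentence_pairs, min_cooccur=1):
    """Screen and rank translation candidates for one source term.

    sentence_pairs: list of (source sentence, target sentence) tuples.
    Returns (candidate, score) tuples sorted by descending Dice coefficient.
    """
    term_freq = sum(1 for src, _ in sentence_pairs if term in src)
    scored = []
    for candidate in candidates:
        cand_freq = sum(1 for _, tgt in sentence_pairs if candidate in tgt)
        both = sum(1 for src, tgt in sentence_pairs
                   if term in src and candidate in tgt)
        # Screening: drop candidates that never (or too rarely) co-occur.
        if both == 0 or both < min_cooccur:
            continue
        dice = 2 * both / (term_freq + cand_freq)
        scored.append((candidate, dice))
    # Ranking: highest score first, ties broken by candidate text.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

The first element of the returned list would then be taken as the final translation, with the remaining elements available as alternatives.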
A term translation mining method with a domain customization function, as shown in Fig. 2, comprises the following steps:
S1. Selection of candidate bilingual mixed web pages: a keyword-based domain customization model is used. A Chinese-English keyword list for a specific domain is first input, and a general-purpose search engine is used to establish a set of related websites. A crawler then performs an expanded search over each website and its related links to form the original bilingual web page set for the specific domain, and a vector space model is used to perform preliminary filtering of the original bilingual web page set to form the candidate bilingual web pages.
A bilingual mixed web page is a single web page that contains content in more than one language. The heuristic used here is that if a web page contains a translation pair, it may also contain other mutually translated content; accordingly, a translation pair is submitted to a search engine, and the search engine returns the web pages that contain that pair.
Candidate bilingual websites and web pages are websites and web pages that may contain bilingual text. The purpose of identifying them is to restrict the acquisition of bilingual text to likely websites and web pages, which greatly improves the speed of bilingual resource acquisition.
S2. Extraction of bilingual resources from the bilingual mixed web pages: the noisy bilingual pairs obtained in step S1 are further filtered based on machine translation features.
Within a web page, the main content is generally concentrated in a particular region, so the page can first be segmented into regions, and information such as the "bilingual rate", the number of hyperlinks, and the anchor text is then used to estimate the likelihood that each region contains bilingual resources. Based on such structural information, many candidate mutual translation pairs can be selected more accurately.
S3. Calculation of bilingual web page domain similarity: the domain similarity of the candidate bilingual web pages is calculated, the candidate bilingual web pages are divided into parallel web pages and comparable web pages, and irrelevant candidate web pages are filtered out.
Domain similarity measures for bilingual web pages are studied here from two angles, quantitative analysis and qualitative analysis. Quantitative methods describe the domain similarity of bilingual web pages numerically, for example through the various similarity calculations of a vector space model whose features are based on keyword frequency or position. The qualitative method uses a support vector machine classification model to classify bilingual web pages by domain.
S4. Extraction of translation sentence pairs from the bilingual web pages: bilingual translation pairs are further extracted from the acquired parallel web pages and comparable web pages to form aligned bilingual text. Fig. 3 shows the Web-based flow for acquiring the domain bilingual corpus.
S5. Automatic acquisition of term translations based on the domain bilingual corpus: a self-feedback, incremental method for automatically acquiring domain term translations is used to obtain and output the corresponding translation for each received domain term keyword. Bilingual term translations for a specific domain can thus be extracted according to the user's needs.
The self-feedback, incremental method for automatically acquiring domain term translations is specifically as follows. First, domain terms, keywords, or named entities, together with their translations, are used as seeds to automatically acquire a bilingual corpus for the given domain; the bilingual corpus comprises a parallel corpus and a comparable corpus. Then, based on the acquired bilingual corpus, conventional term translation acquisition methods are used to automatically acquire domain term translations. The acquired term translations are in turn used to obtain a larger domain bilingual corpus, on which a new round of term translation acquisition is performed. Through repeated feedback in this way, the domain bilingual corpus and the domain term translations are acquired incrementally. Fig. 4 is a schematic diagram of the self-feedback, incremental acquisition of domain term translations.