CN106980664A

CN106980664A - Method and device for mining bilingual comparable corpora

Info

Publication number: CN106980664A
Application number: CN201710169141.XA
Authority: CN
Inventors: 洪宇; 姚亮
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2017-07-25
Anticipated expiration: 2037-03-21
Also published as: CN106980664B

Abstract

The invention discloses a method and device for mining bilingual comparable corpora. Multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multi-modal knowledge base containing the pictures and the text information is built. A picture in the source-language knowledge base is used as the query picture, picture retrieval is performed in the target-language knowledge base, and target pictures similar to the query picture are found. A bilingual comparable corpus is then built from the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval, with pictures serving as the medium that links the source language and the target language, to obtain target-side text that is equivalent or comparable to the source-language text. It provides a new way of mining bilingual comparable resources on the Internet and addresses the scarcity of bilingual resources for specific languages and domains.

Description

A method and device for mining bilingual comparable corpora

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for mining bilingual comparable corpora.

Background art

A bilingual comparable corpus is a collection of texts in different languages that express similar meaning. Large-scale bilingual comparable corpora generally contain rich and varied bilingual translation units, such as phrase-level and sentence-level translation pairs, and bilingual dictionaries. For low-resource languages or certain restricted domains, parallel resources are usually scarce, whereas bilingual comparable corpora are relatively easy to obtain. Bilingual comparable corpora have therefore become an important resource in machine translation and cross-language information retrieval, and automatically acquiring them at large scale has become a basic task in machine translation.

Current approaches to acquiring bilingual comparable corpora fall into three main classes. The first class builds the corpus by cross-language information retrieval: keywords are extracted from a source-language document and translated into the target language with a bilingual dictionary; the translated keywords are then used as the retrieval query to search a set of target-language candidate documents, yielding comparable bilingual documents. The second class builds the corpus from content and structure similarity: a translation engine (such as Google Translate or Bing Translator) translates the source document into the target language to obtain a pseudo-translation; the similarity between the pseudo-translated document and each target-language document is then evaluated in terms of vocabulary, topic and structure, and similar documents are selected by ranking. The third class extracts bilingual comparable corpora from structured knowledge bases such as Wikipedia, using their existing category entries and link information to obtain document-level comparable resources. All of these methods work only from text similarity and build comparable corpus resources from bilingual dictionaries or existing knowledge bases. Their performance depends on large manually annotated dictionaries, knowledge bases or existing translation systems, so for low-resource languages or specific domains the scarcity of language resources limits their generality.

Summary of the invention

The object of the present invention is to provide a method and device for mining bilingual comparable corpora, so as to solve the problem that existing methods work only from text similarity and therefore generalize poorly to specific low-resource languages.

To solve the above technical problem, the present invention provides a method for mining bilingual comparable corpora, including:

Crawling multiple pictures and their corresponding text information from databases in different languages in advance, and building a multi-modal knowledge base containing the pictures and the text information;

Using a picture in the source-language knowledge base as the query picture, performing picture retrieval in the target-language knowledge base, and finding target pictures similar to the query picture;

Building a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.

Optionally, crawling multiple pictures and their corresponding text information from databases in different languages in advance includes:

Using a web crawler to crawl pictures from news websites, where the text information is the topic and/or title information corresponding to each picture, and storing each picture together with its text information as a 2-tuple in the multi-modal knowledge base.

Optionally, performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture includes:

Extracting the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm, and representing the query picture as feature vectors based on the key points;

Extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture against those of each candidate picture;

Computing the average Euclidean distance over all matched key points, and using it as the picture similarity between the pictures;

Ranking the candidate pictures by picture similarity, and selecting the target pictures similar to the query picture.

Optionally, performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture includes:

Determining the topic category and publication time of the query picture;

Filtering out of the target-language knowledge base the pictures that do not match the topic category and publication time;

Performing picture retrieval in the filtered target-language knowledge base, and finding target pictures similar to the query picture.

Optionally, building a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture includes:

Computing the text similarity between the text information corresponding to the query picture and the text information corresponding to each target picture;

Re-ranking the target pictures by text similarity;

Building the bilingual comparable corpus from the re-ranked result.

Optionally, computing the text similarity between the text information corresponding to the query picture and the text information corresponding to each target picture includes:

Computing the content similarity, entity similarity and structural similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;

Taking a weighted average of the content similarity, the entity similarity and the structural similarity to obtain the text similarity of the corresponding text information.

The present invention also provides a device for mining bilingual comparable corpora, including:

A knowledge base building module, for crawling multiple pictures and their corresponding text information from databases in different languages in advance, and building a multi-modal knowledge base containing the pictures and the text information;

A search module, for using a picture in the source-language knowledge base as the query picture, performing picture retrieval in the target-language knowledge base, and finding target pictures similar to the query picture;

A building module, for building a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.

Optionally, the search module includes:

An extraction unit, for extracting the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm and representing the query picture as feature vectors based on the key points;

A matching unit, for extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture against those of each candidate picture;

A similarity calculation unit, for computing the average Euclidean distance over all matched key points as the picture similarity between the pictures;

A selection unit, for ranking the candidate pictures by picture similarity and selecting the target pictures similar to the query picture.

Optionally, the search module also includes:

A determining unit, for determining the topic category and publication time of the query picture;

A filtering unit, for filtering out of the target-language knowledge base the pictures that do not match the topic category and publication time;

A searching unit, for performing picture retrieval in the filtered target-language knowledge base and finding target pictures similar to the query picture.

Optionally, the building module includes:

A text similarity calculation unit, for computing the text similarity between the text information corresponding to the query picture and the text information corresponding to each target picture;

A ranking unit, for re-ranking the target pictures by text similarity;

A construction unit, for building the bilingual comparable corpus from the re-ranked result.

In the method and device for mining bilingual comparable corpora provided by the present invention, multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multi-modal knowledge base containing the pictures and the text information is built. A picture in the source-language knowledge base is used as the query picture, picture retrieval is performed in the target-language knowledge base, and target pictures similar to the query picture are found. A bilingual comparable corpus is built from the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval, with pictures serving as the medium that links the source language and the target language, to obtain target-side text that is equivalent or comparable to the source-language text. It provides a new way of mining bilingual comparable resources on the Internet and addresses the scarcity of bilingual resources for specific languages and domains.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of one embodiment of the bilingual comparable corpus mining method provided by the present invention;

Fig. 2 is a schematic diagram of similar pictures and comparable titles mined from news text;

Fig. 3 is a schematic diagram of the process of finding target pictures similar to the query picture in an embodiment of the present invention;

Fig. 4 is a flowchart of another embodiment of the bilingual comparable corpus mining method provided by the present invention;

Fig. 5 is a schematic diagram of key-point matching results between pictures;

Fig. 6 is a structural block diagram of the bilingual comparable corpus mining device provided by an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the solution of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

A flowchart of one embodiment of the bilingual comparable corpus mining method provided by the present invention is shown in Fig. 1. The method includes:

Step S101: Crawl multiple pictures and their corresponding text information from databases in different languages in advance, and build a multi-modal knowledge base containing the pictures and the text information;

Specifically, an embodiment of the present invention may use a web crawler to crawl pictures and their corresponding text information at scale from news websites in different languages. The text information may be the topic and/or title information corresponding to each picture. Each picture and its text information are stored as a 2-tuple in the multi-modal knowledge base.
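The 2-tuple storage described in step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names `Record` and `MultimodalKB`, the language codes and the file paths are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One knowledge-base entry: a picture and its text, i.e. a 2-tuple."""
    image_path: str  # hypothetical: path or URL of the crawled picture
    caption: str     # topic and/or title text attached to the picture


class MultimodalKB:
    """Keeps one list of (picture, caption) records per language."""

    def __init__(self):
        self._records = defaultdict(list)

    def add(self, language, image_path, caption):
        self._records[language].append(Record(image_path, caption))

    def records(self, language):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._records[language])


kb = MultimodalKB()
kb.add("en", "img/en_0001.jpg", "Rescue teams reach flooded village")
kb.add("zh", "img/zh_0001.jpg", "救援队抵达受灾村庄")
```

Keeping the records separated by language mirrors the source-language and target-language knowledge bases that the later retrieval steps operate on.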

It should be noted that the present invention is based on cross-media information retrieval (Cross-Media Information Retrieval, CMIR), which closely resembles cross-language information retrieval (Cross-Language Information Retrieval, CLIR); the only difference is the medium that links the source language and the target language. Cross-media information retrieval uses pictures rather than text as the bridge between the source language and the target language, and thereby obtains target-side text that is equivalent or comparable to the source-language text.

As a specific implementation, the CMIR knowledge base may contain two parts, pictures and picture titles. A picture title describes the content of its picture and corresponds to it one-to-one; see the example in Fig. 2. CMIR uses a similar-picture search engine to link texts in different languages: the titles corresponding to similar pictures are regarded as bilingual comparable text. A CLIR system, by contrast, translates the source-language text into the target language with a weak translation system or a bilingual dictionary, and uses the translation as the retrieval query to retrieve similar text on the target side. To some extent CMIR is easier to use than CLIR: the only issue CMIR has to consider is how to improve retrieval quality, whereas CLIR must additionally consider the quality of the bilingual dictionary and the translation performance of the weak translator. The embodiment of the present invention therefore proposes a bilingual comparable corpus mining method based on cross-media information retrieval, which is a first attempt to mine bilingual comparable corpora using picture similarity.

Step S102: Use a picture in the source-language knowledge base as the query picture, perform picture retrieval in the target-language knowledge base, and find target pictures similar to the query picture;

A similar-picture retrieval system is set up between the multi-modal knowledge bases of the different languages, to retrieve similar pictures in the target-language knowledge base.

Referring to Fig. 3, the process of performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture may specifically include:

Step S1021: Extract the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm, and represent the query picture as feature vectors based on the key points;

Step S1022: Extract the feature vectors of all candidate pictures in the target-language knowledge base, and match the key points of the query picture against those of each candidate picture;

Step S1023: Compute the average Euclidean distance over all matched key points, and use it as the picture similarity between the pictures;

Step S1024: Rank the candidate pictures by picture similarity, and select the target pictures similar to the query picture.

Step S103: Build a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.

In the bilingual comparable corpus mining method provided by the present invention, multiple pictures and their corresponding text information are crawled in advance from databases in different languages, and a multi-modal knowledge base containing the pictures and the text information is built. A picture in the source-language knowledge base is used as the query picture, picture retrieval is performed in the target-language knowledge base, and target pictures similar to the query picture are found. A bilingual comparable corpus is built from the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval, with pictures serving as the medium that links the source language and the target language, to obtain target-side text that is equivalent or comparable to the source-language text. It provides a new way of mining bilingual comparable resources on the Internet and addresses the scarcity of bilingual resources for specific languages and domains.

On the basis of any of the above embodiments, to improve the precision and running efficiency of the similar-picture retrieval system, an embodiment of the present invention further proposes optimizing the similar-picture retrieval model with news topic information and time label information. Specifically, in the method provided by the present invention, the process of performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture may be as follows.

Referring to Fig. 4, a flowchart of another embodiment of the bilingual comparable corpus mining method provided by the present invention, the implementation of the method is described in further detail below.

The process includes:

Step S201: Multi-modal knowledge base construction. This module uses a web crawler to crawl pictures and their corresponding picture titles from news websites in multiple languages, and builds the multi-modal knowledge base with each picture-title pair as a 2-tuple.

This may be divided into the following steps: crawl the news pages under each column or topic according to the column division of the news website; parse the news pages, extract pictures and picture titles by analyzing the page structure, and extract the news publication time; form each picture and its title into a 2-tuple and store it in the multi-modal knowledge base.

News pages under each topic or column are crawled according to the news website's topic columns and stored under the directory of the corresponding topic. Pictures and picture titles in a page are matched by picture tags (such as img and pic), and web page tags unrelated to the content (such as <a> and </br>) are removed from the title text. Each picture and its title are represented as a 2-tuple, which forms one record of the multi-modal knowledge base.
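A rough sketch of the page-parsing step, using only Python's standard `html.parser`. The page layout is an assumption made for illustration: each picture is taken to be an `<img>` followed by a `<figcaption>` holding its title. Real news sites differ, and the class name `FigureExtractor` and helper `extract_pairs` are hypothetical.

```python
from html.parser import HTMLParser


class FigureExtractor(HTMLParser):
    """Collect (image_src, caption) 2-tuples from <img> + <figcaption> pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._src = None            # src of the most recent <img>
        self._in_caption = False
        self._caption_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self._src = dict(attrs).get("src")
        elif tag == "figcaption":
            self._in_caption = True
            self._caption_parts = []

    def handle_data(self, data):
        # Only text reaches this hook, so inline tags such as <a> are dropped.
        if self._in_caption:
            self._caption_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
            caption = " ".join(" ".join(self._caption_parts).split())
            if self._src and caption:
                self.pairs.append((self._src, caption))
            self._src = None


def extract_pairs(html):
    parser = FigureExtractor()
    parser.feed(html)
    return parser.pairs


page = ('<figure><img src="a.jpg"><figcaption>Team <a href="#">wins</a>'
        ' final</figcaption></figure>')
pairs = extract_pairs(page)
```

Because the parser reports markup and text through separate callbacks, tags inside the title are discarded without a dedicated tag-stripping pass.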

Taking Chinese-English comparable resource mining as an example, the present invention crawls news from mainstream Chinese news websites (such as Xinhuanet, Phoenix and Sina) and English news websites (FOX, CNN, BBC). News items are classified and divided by topic according to the columns provided by each website (such as military, finance, science and technology, sports and entertainment), and the publication time of each news item is recorded. For the crawled news, the embodiment of the present invention performs structured analysis of each news page, extracts the pictures and picture titles in the page, and stores them as 2-tuples (image, caption), each being one record of the multi-modal database. In this way, a multi-modal database is built for each language.

Step S202: Determine the topic category and publication time of the query picture; filter out of the target-language knowledge base the pictures that do not match the topic category and publication time;

The existing topic classification label systems of the news websites are combined manually. For news websites in different languages, topic labels are mapped across languages with a translation dictionary or by machine translation. Different languages and different news websites each have their own news classification system, so a unified classification standard must be established. For example, the Chinese label "体育" (sports) is mapped to "Sports".

Picture retrieval optimization fusing news topics: each picture is tagged with a label according to the topic category of the news item it comes from. During similar-picture retrieval, the search range is restricted to the same topic. Comparable news items often share the same or a similar topic, so retrieval restricts the topic of the picture source, and candidate pictures are retrieved and ranked within the same topic, which filters out invalid candidates.

Picture retrieval optimization fusing time labels: news items are automatically partitioned by publication time. During similar-picture retrieval, the search range is restricted to a specified time window, and varying the window threshold yields different retrieval results. News events are time-sensitive, so similar news pictures tend to appear within a short period of each other. Accordingly, the publication time t of a news picture is used as a label, different time windows d are set, and similar pictures are retrieved within the corresponding interval [t-d, t+d].
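The topic and time filtering can be sketched as below. The label table `TOPIC_MAP`, the `Candidate` fields and the window measured in whole days are illustrative assumptions; the text only specifies that labels are unified across languages and that retrieval is limited to the interval [t-d, t+d].

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical unified label table; a real one would come from a translation
# dictionary or machine translation, as described above.
TOPIC_MAP = {"体育": "sports", "Sports": "sports",
             "财经": "finance", "Business": "finance"}


@dataclass
class Candidate:
    image_path: str
    caption: str
    topic: str       # site-specific topic label
    published: date  # publication date of the news item


def filter_candidates(query, candidates, window_days):
    """Keep candidates with the same unified topic inside [t-d, t+d]."""
    d = timedelta(days=window_days)
    q_topic = TOPIC_MAP.get(query.topic)
    if q_topic is None:
        return []  # unknown topic: nothing can be matched safely
    return [
        c for c in candidates
        if TOPIC_MAP.get(c.topic) == q_topic
        and query.published - d <= c.published <= query.published + d
    ]


query = Candidate("q.jpg", "", "体育", date(2017, 3, 1))
pool = [
    Candidate("a.jpg", "", "Sports", date(2017, 3, 2)),    # same topic, in window
    Candidate("b.jpg", "", "Sports", date(2017, 3, 20)),   # same topic, too late
    Candidate("c.jpg", "", "Business", date(2017, 3, 1)),  # different topic
]
kept = filter_candidates(query, pool, window_days=3)
```

Filtering before key-point matching keeps the expensive picture comparison confined to a small candidate set, which is the efficiency gain this step is aiming for.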

This step uses labels of the news item containing the picture, such as its topic category and publication time, to restrict the search space of the similar-picture retrieval module and filter out dissimilar picture candidates, so that the efficiency and performance of the similar-picture retrieval system improve steadily on large-scale data sets.

Step S203: Perform picture retrieval in the filtered target-language knowledge base, and find target pictures similar to the query picture;

Key-point features of the pictures are extracted and the similarity between pictures is computed from the extracted features, so that similar pictures can be retrieved across the multi-modal knowledge bases of different languages.

This may be divided into the following steps: take a picture in the source-language knowledge base as the query picture, extract its key points with the scale-invariant feature transform (SIFT) algorithm, and represent the picture as feature vectors based on the key points; extract the feature vectors of all candidate pictures in the target picture library, and match the key points of the query picture against those of each target candidate picture; compute the average Euclidean distance over all matched key points as the similarity measure between the pictures; for the query picture, rank all pictures in the target picture library by picture similarity and select the top-N pictures as the similar picture set.

Specifically, for a key point x in the query picture, the two most similar key points y and z are found in the target candidate picture. The similarity between two key points is given by the Euclidean distance between their feature vectors.

Whether key point x matches key point y of the target picture is decided by whether the ratio of the Euclidean distance to the nearest key point and the distance to the second-nearest key point is below a specified threshold. That is, if d(x, y) < d(x, z) * θ, key point x and key point y are considered a match, where the threshold θ ∈ [0, 1] is typically set to 0.8. Fig. 5 shows the key-point matching result for two pictures.
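The ratio test above can be sketched in plain Python. The sketch assumes key-point descriptors have already been extracted (for example by a SIFT implementation) and are given as lists of numbers; the function names are hypothetical, and a brute-force search stands in for whatever nearest-neighbour index a real system would use.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_keypoints(query_desc, cand_desc, theta=0.8):
    """Return (query_index, candidate_index, distance) for each accepted match.

    A query key point x matches its nearest candidate y only when
    d(x, y) < d(x, z) * theta, with z the second-nearest candidate.
    """
    matches = []
    if len(cand_desc) < 2:
        return matches  # the test needs a nearest and a second-nearest point
    for i, x in enumerate(query_desc):
        dists = sorted((euclidean(x, y), j) for j, y in enumerate(cand_desc))
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < d2 * theta:
            matches.append((i, j1, d1))
    return matches


query_desc = [[0.0, 0.0], [10.0, 10.0]]
cand_desc = [[0.0, 1.0], [5.0, 5.0], [10.0, 10.5]]
matches = match_keypoints(query_desc, cand_desc)
```

A key point whose two nearest candidates are almost equally close is rejected, which is how the test discards ambiguous matches.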

The similarity between pictures is then computed from all matched key points in the two pictures. Specifically, the average Euclidean distance over all matched points is computed, where m is the number of key points matched between the two pictures and n is the dimension of the key-point feature vector.
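Under the definitions above, the average Euclidean distance can be written out as follows. This is a reconstruction from the stated definitions of m and n, not a formula quoted from the source; the symbols x_i and y_i (the feature vectors of the i-th matched key-point pair in the query picture and the candidate picture) and the use of v(s_img, t_img) as the name of the result are assumptions.

```latex
v(s_{img}, t_{img}) = \frac{1}{m} \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} \left( x_{ij} - y_{ij} \right)^{2}}
```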

The set of target candidate pictures is ranked by picture similarity, and the top-N similar pictures are selected.
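A small sketch of the scoring and top-N selection. Two points are assumptions not stated in the text: a smaller average distance is treated as more similar, and candidates without any matched key point are dropped rather than scored. The function names are hypothetical.

```python
def picture_distance(match_distances):
    """Average Euclidean distance over matched key points; None if no match."""
    if not match_distances:
        return None
    return sum(match_distances) / len(match_distances)


def top_n(candidates, n):
    """Rank candidates and keep the n best.

    candidates: list of (picture_id, list of matched key-point distances).
    Assumption: smaller average distance means a more similar picture.
    """
    scored = []
    for picture_id, distances in candidates:
        score = picture_distance(distances)
        if score is not None:  # assumption: unmatched candidates are discarded
            scored.append((score, picture_id))
    scored.sort()
    return [picture_id for _, picture_id in scored[:n]]


candidates = [
    ("a", [0.5, 0.7]),
    ("b", [0.2]),
    ("c", []),
    ("d", [0.9, 0.9]),
]
best = top_n(candidates, 2)
```

The average alone ignores how many key points matched, so a practical system might also require a minimum match count before a candidate is ranked.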

Step S204: Build a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.

With the similar-picture retrieval system, similar pictures are retrieved under news topics related to the target domain, and bilingual comparable text is then extracted. The present invention proposes a title comparability measure modeled on image similarity, and a title comparability measure that fuses image similarity with text similarity.

The title comparability measure modeled on image similarity may be divided into the following steps:

Using the similar-picture retrieval system, with a source-language picture as input, a set of similar pictures is retrieved on the target side, and the similarity between the query picture and each retrieved picture is computed with the SIFT algorithm.

The comparability of picture titles is modeled by picture similarity: the ranking of the pictures is used as the ranking of title comparability. The comparability between titles is measured by a formula over the following quantities:

S(s_cap, t_cap) denotes the degree of comparability between title s_cap and title t_cap; v(s_img, t_img) denotes the similarity between picture s_img and picture t_img, computed with the SIFT algorithm; and b is a constant.

The title comparability measure that fuses image similarity and text similarity may be divided into the following steps:

Compute the text similarity between the query picture title and each target candidate picture title. The present invention evaluates the text similarity of the query title and a candidate title from three angles: content similarity (fc), entity similarity (fe) and structural similarity (fl). The formula is as follows:

S = α·fc + β·fe + γ·fl

The image similarity and the text similarity score between picture titles are then fused, and the similar-picture retrieval results (which are based on image similarity) are re-ranked. Comparable text pairs are built from the re-ranked result. The present invention computes the combined similarity of the query and a candidate with a formula over the following quantities:

S_txt(s_cap, t_cap) denotes the text similarity between the query title and the retrieved title, and b is a constant that controls the contribution of picture similarity to the final similarity score.

The text similarity between the query picture title and a candidate picture title is computed as follows. The query title is translated into the target language with the Google online translation system, so as to preserve important information such as the word order, sentence structure and named entities of the original text.

The comparability of the translated query title and the target picture title is evaluated on the following three features:

Content similarity: the translated query title is word-segmented, stop words are removed, words are stemmed, and so on, to obtain a bag-of-words representation. Using the vector space model, the cosine similarity between the query title and the target title is computed (denoted fc).
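A minimal sketch of the content similarity fc as cosine similarity over bag-of-words vectors. The stop-word list is a tiny illustrative one, tokenization is plain whitespace splitting, and stemming is omitted; all three are simplifications, and the function names are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and"}


def bag_of_words(text):
    """Lower-case, split on whitespace, drop stop words, count terms."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


fc = cosine(bag_of_words("The team wins the final"),
            bag_of_words("Team wins final in London"))
```

The same `cosine` function could serve for the entity similarity fe if it is fed bags of named entities instead of bags of words.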

Entity similarity: the Stanford named entity recognition tool is used to identify the named entities in the picture titles, giving a bag-of-words set of named entities; the named-entity similarity between the bilingual texts is then computed with the vector space model (denoted fe).

Structural similarity: the ratio of the numbers of content words (nouns, verbs, adverbs, adjectives, proper nouns and so on) in the query title and the target title is used as a comparability criterion (denoted fl).
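The structural similarity fl can be sketched as below. The text only says that the ratio of content-word counts is used; taking the smaller count over the larger, so that the value lies in [0, 1], is an assumption. The input is assumed to be already part-of-speech tagged with Universal POS tags, and the function names are hypothetical.

```python
# Universal POS tags treated as content words (nouns, verbs, adverbs,
# adjectives, proper nouns), following the categories listed above.
CONTENT_POS = {"NOUN", "VERB", "ADV", "ADJ", "PROPN"}


def content_word_count(tagged):
    """tagged: list of (word, pos) pairs from any POS tagger."""
    return sum(1 for _, pos in tagged if pos in CONTENT_POS)


def structural_similarity(query_tagged, target_tagged):
    """Ratio of content-word counts, smaller over larger (an assumption)."""
    q = content_word_count(query_tagged)
    t = content_word_count(target_tagged)
    if max(q, t) == 0:
        return 0.0
    return min(q, t) / max(q, t)


query_title = [("Team", "NOUN"), ("wins", "VERB"), ("the", "DET"), ("final", "NOUN")]
target_title = [("Club", "NOUN"), ("wins", "VERB")]
fl = structural_similarity(query_title, target_title)
```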

The three features above are fused by weighted averaging to obtain the comparability score between the query title and the target title:

S = α·fc + β·fe + γ·fl

where, according to the contribution of each feature to comparability, the weights are set empirically to α = 0.8, β = 0.15 and γ = 0.05.
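The weighted fusion, and the re-ranking of retrieved titles by the resulting score, can be sketched as follows. Only the text-similarity part is shown; the candidate tuple layout and the function names are hypothetical.

```python
def text_similarity(fc, fe, fl, alpha=0.8, beta=0.15, gamma=0.05):
    """Weighted average S = alpha*fc + beta*fe + gamma*fl."""
    return alpha * fc + beta * fe + gamma * fl


def rerank_by_text(candidates):
    """candidates: list of (target_title, fc, fe, fl); highest S first."""
    return sorted(candidates,
                  key=lambda c: text_similarity(c[1], c[2], c[3]),
                  reverse=True)


ranked = rerank_by_text([
    ("title a", 0.2, 0.9, 1.0),  # S = 0.16 + 0.135 + 0.05  = 0.345
    ("title b", 0.6, 0.0, 0.5),  # S = 0.48 + 0.0   + 0.025 = 0.505
])
```

With these weights the content similarity dominates: title b ranks first even though title a has the higher entity and structural scores.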

The bilingual comparable corpus mining method based on cross-media information retrieval proposed by the present invention provides a new way of mining bilingual comparable resources on the Internet. It proposes not only a method for building a large-scale multi-modal database, but also comparable resource mining assisted by picture similarity, and on this basis two comparability measures for obtaining comparable text at large scale.

The bilingual comparable corpus mining device provided by an embodiment of the present invention is described below. The device described below and the method described above may be read with reference to each other.

Fig. 6 is a structural block diagram of the bilingual comparable corpus mining device provided by an embodiment of the present invention. Referring to Fig. 6, the device may include:

A knowledge base building module 100, for crawling multiple pictures and their corresponding text information from databases in different languages in advance, and building a multi-modal knowledge base containing the pictures and the text information;

A search module 200, for using a picture in the source-language knowledge base as the query picture, performing picture retrieval in the target-language knowledge base, and finding target pictures similar to the query picture;

A building module 300, for building a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.

As a kind of embodiment, in bilingual comparable corpora mining device provided by the present invention, searching modul It can specifically include：

As a kind of embodiment, above-mentioned searching modul can also include：

Specifically, the above building module may include：

The bilingual comparable corpus mining device of this embodiment is used to implement the foregoing bilingual comparable corpus mining method, so its specific implementation can be found in the method embodiments described above. For example, the knowledge base establishing module 100, the searching module 200 and the building module 300 implement steps S101, S102 and S103 of the method, respectively. Their specific implementation may therefore refer to the descriptions of the corresponding embodiments and is not repeated here.

The bilingual comparable corpus mining device provided by the present invention captures a plurality of pictures and corresponding text information in advance from databases of different languages and establishes multi-modal knowledge bases comprising the pictures and the text information; takes a picture in the source-language knowledge base as a query picture, performs picture retrieval in the target-language knowledge base, and finds a target picture similar to the query picture; and builds a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture. The application uses cross-media information retrieval, with pictures serving as the medium that links the source language and the target language, and thereby obtains target-side text that is equivalent or comparable to the source-language text. This provides a new method for mining bilingual comparable resources on the internet and addresses the scarcity of bilingual resources for specific languages and domains.
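The retrieval step summarized above ranks candidate pictures by picture similarity, which the claims define as the average Euclidean distance between all matched key points. The following is a minimal sketch of that measure. It assumes key-point descriptors have already been extracted (for example by SIFT) and are available as equal-length numeric vectors, and it uses nearest-neighbour matching as a stand-in, since the text does not specify the matching rule. The names `match_key_points`, `picture_distance` and `rank_candidates` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Descriptor = Sequence[float]


def match_key_points(query: List[Descriptor],
                     candidate: List[Descriptor]) -> List[Tuple[Descriptor, Descriptor]]:
    """Pair each query descriptor with its nearest candidate descriptor.

    Nearest-neighbour matching is an assumption made for this sketch.
    """
    return [(q, min(candidate, key=lambda c: math.dist(q, c))) for q in query]


def picture_distance(query: List[Descriptor], candidate: List[Descriptor]) -> float:
    """Average Euclidean distance over matched key points (lower = more similar)."""
    if not query or not candidate:
        return math.inf  # nothing to match, so treat as maximally dissimilar
    matches = match_key_points(query, candidate)
    return sum(math.dist(q, c) for q, c in matches) / len(matches)


def rank_candidates(query: List[Descriptor],
                    candidates: Dict[str, List[Descriptor]]) -> List[str]:
    """Return candidate identifiers ordered from most to least similar."""
    return sorted(candidates, key=lambda name: picture_distance(query, candidates[name]))
```

Because the measure is a distance, a smaller value indicates a more similar picture, so candidates are sorted in ascending order before the target pictures are chosen.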

The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The bilingual comparable corpus mining method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A bilingual comparable corpus mining method, characterized by comprising：

capturing a plurality of pictures and corresponding text information in advance from databases of different languages, and establishing multi-modal knowledge bases comprising the pictures and the text information；

taking a picture in the source-language knowledge base as a query picture, performing picture retrieval in the target-language knowledge base, and finding a target picture similar to the query picture；

building a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.

2. The bilingual comparable corpus mining method according to claim 1, characterized in that capturing a plurality of pictures and corresponding text information in advance from databases of different languages comprises：

capturing pictures from news websites using a web crawler, the text information being the topic and/or title information corresponding to each picture, and storing each picture and its corresponding text information as a 2-tuple in the multi-modal knowledge base.

3. The bilingual comparable corpus mining method according to claim 2, characterized in that performing picture retrieval in the target-language knowledge base and finding a target picture similar to the query picture comprises：

extracting the key points of the query picture using the scale-invariant feature transform (SIFT) algorithm, and representing the query picture as a feature vector based on the key points；

extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture and the candidate pictures；

ranking the candidate pictures according to the picture similarity, and selecting the target picture similar to the query picture.

4. The bilingual comparable corpus mining method according to any one of claims 1 to 3, characterized in that performing picture retrieval in the target-language knowledge base and finding a target picture similar to the query picture comprises：

performing picture retrieval in the filtered target-language knowledge base, and finding the target picture similar to the query picture.

5. The bilingual comparable corpus mining method according to any one of claims 1 to 3, characterized in that building a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture comprises：

calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture；

re-ranking the target pictures according to the text similarity；

6. The bilingual comparable corpus mining method according to claim 5, characterized in that calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture comprises：

calculating the content similarity, the entity similarity and the structural similarity between the text information corresponding to the query picture and the text information corresponding to the target picture；

taking a weighted average of the content similarity, the entity similarity and the structural similarity to obtain the text similarity of the corresponding text information.

7. A bilingual comparable corpus mining device, characterized by comprising：

a knowledge base establishing module, configured to capture a plurality of pictures and corresponding text information in advance from databases of different languages, and to establish multi-modal knowledge bases comprising the pictures and the text information；

a searching module, configured to take a picture in the source-language knowledge base as a query picture, perform picture retrieval in the target-language knowledge base, and find a target picture similar to the query picture；

a building module, configured to build a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.

8. The bilingual comparable corpus mining device according to claim 7, characterized in that the searching module comprises：

an extraction unit, configured to extract the key points of the query picture using the scale-invariant feature transform (SIFT) algorithm, and to represent the query picture as a feature vector based on the key points；

a matching unit, configured to extract the feature vectors of all candidate pictures in the target-language knowledge base, and to match the key points of the query picture and the candidate pictures；

a similarity calculation unit, configured to calculate the average Euclidean distance between all matched key points as the picture similarity between pictures；

a selection unit, configured to rank the candidate pictures according to the picture similarity, and to select the target picture similar to the query picture.

9. The bilingual comparable corpus mining device according to claim 7 or 8, characterized in that the searching module further comprises：

a filtering unit, configured to filter out of the target-language knowledge base the pictures that do not match the subject classification and release time information；

a searching unit, configured to perform picture retrieval in the filtered target-language knowledge base, and to find the target picture similar to the query picture.

10. The bilingual comparable corpus mining device according to claim 7 or 8, characterized in that the building module comprises：

a text similarity calculation unit, configured to calculate the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture；