CN106980664A - Bilingual comparable corpus mining method and device

Bilingual comparable corpus mining method and device

Info

Publication number
CN106980664A
Authority
CN
China
Prior art keywords
picture
inquiry
text information
similarity
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710169141.XA
Other languages
Chinese (zh)
Other versions
CN106980664B (en)
Inventor
洪宇
姚亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201710169141.XA
Publication of CN106980664A
Application granted
Publication of CN106980664B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a bilingual comparable corpus mining method and device. Pictures and their corresponding text information are crawled in advance from databases in different languages to build multimodal knowledge bases containing pictures and text information. A picture in the source-language knowledge base is used as a query picture, picture retrieval is performed in the target-language knowledge base, and target pictures similar to the query picture are found. A bilingual comparable corpus is then built from the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval: pictures serve as the medium linking the source language and the target language, yielding target-side text that is equivalent or comparable to the source-language text. This provides a new way to mine bilingual comparable resources from the Internet and addresses the scarcity of bilingual resources for specific languages and domains.

Description

Bilingual comparable corpus mining method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a bilingual comparable corpus mining method and device.
Background
A bilingual comparable corpus is a collection of texts in different languages that express similar semantics. A large-scale bilingual comparable corpus usually contains a rich variety of bilingual translation units, such as phrase-level and sentence-level translation pairs, as well as bilingual dictionary entries. For low-resource languages or certain restricted domains, parallel resources are usually scarce, whereas bilingual comparable corpora are relatively easy to obtain. Bilingual comparable corpora have therefore become a valuable resource for machine translation and cross-language information retrieval, and automatically acquiring them at large scale has become a basic task in machine translation.
Current research on acquiring bilingual comparable corpora falls into three main classes. The first class builds the corpus through cross-language information retrieval: keywords are extracted from a source-language document, translated into the target language with a bilingual dictionary, and used as a retrieval query over a set of candidate target-language documents, which finally yields comparable bilingual documents. The second class builds the corpus from content and structure similarity: a translation engine (such as Google Translate or Bing Translator) translates the source document into the target language to obtain a pseudo-translation, and the similarity between the pseudo-translated document and target-language documents is then evaluated in terms of vocabulary, topic, and structure, with similar documents selected by ranking. The third class extracts bilingual comparable corpora from structured knowledge bases such as Wikipedia, using their existing category entries and link information to obtain document-level comparable resources. Existing methods thus approach the problem only from the angle of text similarity, building comparable resources from bilingual dictionaries or existing knowledge bases. They depend on large manually annotated dictionaries, knowledge bases, or the performance of existing translation systems; for low-resource languages or specific domains, scarce language resources limit their generality.
Summary of the invention
The object of the present invention is to provide a bilingual comparable corpus mining method and device, to solve the problem that existing methods work only from text similarity and generalize poorly when language resources are scarce.
To solve the above technical problem, the present invention provides a bilingual comparable corpus mining method, comprising:
crawling, in advance, multiple pictures and their corresponding text information from databases in different languages, and building multimodal knowledge bases containing the pictures and the text information;
using a picture in the source-language knowledge base as a query picture, performing picture retrieval in the target-language knowledge base, and finding target pictures similar to the query picture;
building a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.
Optionally, crawling multiple pictures and their corresponding text information from databases in different languages in advance comprises:
crawling pictures from news websites with a web crawler, where the text information is the topic and/or title information corresponding to a picture, and storing each picture with its text information as a two-tuple in the multimodal knowledge base.
Optionally, performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture comprises:
extracting the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm, and representing the query picture as feature vectors based on the key points;
extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture against those of each candidate picture;
calculating the average Euclidean distance between all matched key points as the picture similarity between the two pictures;
ranking the candidate pictures by picture similarity, and selecting the target pictures similar to the query picture.
Optionally, performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture comprises:
determining the topic category and publication time of the query picture;
filtering out of the target-language knowledge base the pictures that do not match the topic category and publication time;
performing picture retrieval in the filtered target-language knowledge base, and finding target pictures similar to the query picture.
Optionally, building a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture comprises:
calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to each target picture;
re-ranking the target pictures by text similarity;
building the bilingual comparable corpus from the re-ranked result.
Optionally, calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to each target picture comprises:
calculating the content similarity, entity similarity, and structural similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;
taking a weighted average of the content similarity, the entity similarity, and the structural similarity to obtain the text similarity of the corresponding text information.
The present invention also provides a bilingual comparable corpus mining device, comprising:
a knowledge base building module, configured to crawl, in advance, multiple pictures and their corresponding text information from databases in different languages, and to build multimodal knowledge bases containing the pictures and the text information;
a search module, configured to use a picture in the source-language knowledge base as a query picture, perform picture retrieval in the target-language knowledge base, and find target pictures similar to the query picture;
a construction module, configured to build a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.
Optionally, the search module comprises:
an extraction unit, configured to extract the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm and represent the query picture as feature vectors based on the key points;
a matching unit, configured to extract the feature vectors of all candidate pictures in the target-language knowledge base and match the key points of the query picture against those of each candidate picture;
a similarity calculation unit, configured to calculate the average Euclidean distance between all matched key points as the picture similarity between the two pictures;
a selection unit, configured to rank the candidate pictures by picture similarity and select the target pictures similar to the query picture.
Optionally, the search module further comprises:
a determination unit, configured to determine the topic category and publication time of the query picture;
a filtering unit, configured to filter out of the target-language knowledge base the pictures that do not match the topic category and publication time;
a retrieval unit, configured to perform picture retrieval in the filtered target-language knowledge base and find target pictures similar to the query picture.
Optionally, the construction module comprises:
a text similarity calculation unit, configured to calculate the text similarity between the text information corresponding to the query picture and the text information corresponding to each target picture;
a ranking unit, configured to re-rank the target pictures by text similarity;
a construction unit, configured to build the bilingual comparable corpus from the re-ranked result.
In the bilingual comparable corpus mining method and device provided by the present invention, pictures and their corresponding text information are crawled in advance from databases in different languages to build multimodal knowledge bases containing pictures and text information; a picture in the source-language knowledge base is used as a query picture, picture retrieval is performed in the target-language knowledge base, and target pictures similar to the query picture are found; a bilingual comparable corpus is built from the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval: pictures serve as the medium linking the source language and the target language, yielding target-side text that is equivalent or comparable to the source-language text. This provides a new way to mine bilingual comparable resources from the Internet and addresses the scarcity of bilingual resources for specific languages and domains.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of one embodiment of the bilingual comparable corpus mining method provided by the present invention;
Fig. 2 is a schematic diagram of similar pictures and comparable titles mined from news text;
Fig. 3 is a schematic diagram of the process of finding target pictures similar to the query picture in an embodiment of the present invention;
Fig. 4 is a flow chart of another embodiment of the bilingual comparable corpus mining method provided by the present invention;
Fig. 5 is a schematic diagram of key-point matching results between pictures;
Fig. 6 is a structural block diagram of the bilingual comparable corpus mining device provided by an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
A flow chart of one embodiment of the bilingual comparable corpus mining method provided by the present invention is shown in Fig. 1. The method comprises:
Step S101: Crawl, in advance, multiple pictures and their corresponding text information from databases in different languages, and build multimodal knowledge bases containing the pictures and the text information;
Specifically, the embodiment of the present invention may use a web crawler to crawl pictures and their corresponding text information at large scale from news websites in different languages. The text information may be the topic and/or title information corresponding to a picture. Each picture and its text information are stored as a two-tuple in the multimodal knowledge base.
It should be noted that the present invention is based on cross-media information retrieval (Cross-Media Information Retrieval, CMIR), which closely resembles cross-language information retrieval (Cross-Language Information Retrieval, CLIR); the only difference is the medium that links the source language and the target language. Cross-media information retrieval uses pictures rather than text as the bridge between the source language and the target language, and thereby obtains target-side text that is equivalent or comparable to the source-language text.
As one embodiment, the CMIR knowledge base may consist of two parts, pictures and picture titles; each picture title describes the content of its picture and corresponds one-to-one with it, as in the example of Fig. 2. CMIR uses a similar-picture search engine to link texts in different languages: the titles corresponding to similar pictures are regarded as bilingual comparable text. A CLIR system instead translates the source-language text into the target language with a weak translation system or a bilingual dictionary and uses the translation as a retrieval query to find similar text on the target-language side. To some extent CMIR is easier to use than CLIR: the only problem CMIR needs to consider is how to improve retrieval quality, whereas CLIR must additionally consider the quality of the bilingual dictionary and the performance of the weak translator. The embodiment of the present invention therefore proposes a bilingual comparable corpus mining method based on cross-media information retrieval, attempting for the first time to mine bilingual comparable corpora using picture similarity.
Step S102: Use a picture in the source-language knowledge base as a query picture, perform picture retrieval in the target-language knowledge base, and find target pictures similar to the query picture;
A similar-picture retrieval system is set up between the multimodal knowledge bases of different languages, to retrieve similar pictures in the target-language knowledge base.
Referring to Fig. 3, the process of performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture in the embodiment of the present invention may specifically include:
Step S1021: Extract the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm, and represent the query picture as feature vectors based on the key points;
Step S1022: Extract the feature vectors of all candidate pictures in the target-language knowledge base, and match the key points of the query picture against those of each candidate picture;
Step S1023: Calculate the average Euclidean distance between all matched key points as the picture similarity between the two pictures;
Step S1024: Rank the candidate pictures by picture similarity, and select the target pictures similar to the query picture.
Step S103: Build a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.
In the bilingual comparable corpus mining method provided by the present invention, pictures and their corresponding text information are crawled in advance from databases in different languages to build multimodal knowledge bases containing pictures and text information; a picture in the source-language knowledge base is used as a query picture, picture retrieval is performed in the target-language knowledge base, and target pictures similar to the query picture are found; a bilingual comparable corpus is built from the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval: pictures serve as the medium linking the source language and the target language, yielding target-side text that is equivalent or comparable to the source-language text. This provides a new way to mine bilingual comparable resources from the Internet and addresses the scarcity of bilingual resources for specific languages and domains.
On the basis of any of the above embodiments, to improve the precision and running efficiency of the similar-picture retrieval system, the embodiment of the present invention further proposes optimizing the similar-picture retrieval model with news topic information and time label information. Specifically, in the bilingual comparable corpus mining method provided by the present invention, the process of performing picture retrieval in the target-language knowledge base and finding target pictures similar to the query picture may be:
determining the topic category and publication time of the query picture;
filtering out of the target-language knowledge base the pictures that do not match the topic category and publication time;
performing picture retrieval in the filtered target-language knowledge base, and finding target pictures similar to the query picture.
Referring to Fig. 4, a flow chart of another embodiment of the bilingual comparable corpus mining method provided by the present invention, the implementation of the method is described in further detail below.
The process includes:
Step S201: Multimodal knowledge base construction. This module uses a web crawler to crawl pictures and their titles from news websites in multiple languages, and builds the multimodal knowledge base from (picture, title) pairs stored as two-tuples.
This can be divided into the following steps: according to the column divisions of the news websites, crawl the news pages under each column or topic; parse the news pages, extract pictures and picture titles through web page structure analysis, and extract the news publication time; form two-tuples from pictures and their titles, and store them in the multimodal knowledge base.
News pages are crawled by topic column of the news website and stored in the directory of the corresponding topic. Pictures and picture titles in a page are matched by picture tags (such as img, pic), and web page tags unrelated to the content (such as <a>, </br>) are removed from the title text. Each picture and its title are represented as a two-tuple, forming one record of the multimodal knowledge base.
Taking Chinese-English comparable resource mining as an example, the present invention crawls news from mainstream Chinese news websites (such as Xinhuanet (www.xinhuanet.com), Phoenix, and Sina) and English news websites (FOX, CNN, BBC), classifies the news by category and topic according to the columns the websites provide (such as military, finance, technology, sports, and entertainment), and records the publication time of each news item. For the crawled news, the embodiment of the present invention performs structured analysis of the news pages, extracts the pictures and picture titles, and stores each pair as a two-tuple (image, caption), forming one record of the multimodal database. Multimodal databases for the different languages are built in this way.
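The extraction of (image, caption) two-tuples described in step S201 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes, hypothetically, that each picture appears as an `<img>` tag followed by a `<figcaption>` holding its title, and the names `FigureParser` and `extract_records` are invented for the example. Real news pages would need site-specific rules.

```python
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Collect (image, caption) two-tuples from <img> + <figcaption> pairs (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.records = []         # finished (image_url, caption) two-tuples
        self._image = None        # src of the most recent <img>
        self._in_caption = False  # True while inside <figcaption>
        self._caption = []        # text fragments of the current caption

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self._image = dict(attrs).get("src")
        elif tag == "figcaption":
            self._in_caption = True
            self._caption = []

    def handle_data(self, data):
        # Only text nodes are kept, so inline tags such as <a> drop out of the title.
        if self._in_caption:
            self._caption.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
            caption = " ".join("".join(self._caption).split())  # normalize whitespace
            if self._image and caption:
                self.records.append((self._image, caption))
            self._image = None


def extract_records(html):
    """Return the list of (image, caption) two-tuples found in an HTML string."""
    parser = FigureParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Each returned tuple corresponds to one record of the multimodal knowledge base; topic and publication time would be attached from the page's column and metadata.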
Step S202: Determine the topic category and publication time of the query picture; filter out of the target-language knowledge base the pictures that do not match the topic category and publication time;
The existing topic category label systems of the news websites are combined manually; for news websites in different languages, topic labels are mapped across languages with a translation dictionary or through machine translation. Different languages and different news websites each have their own news classification system, so a unified classification standard is needed. For example, the Chinese label for sports (体育) is mapped to "Sports".
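A unified label scheme of this kind can be sketched as a lookup table. Only the mapping of 体育 to "Sports" comes from the text; every other entry, and the names `UNIFIED_TOPICS` and `unify_topic`, are illustrative assumptions.

```python
# Hypothetical mapping table: site-specific column names -> unified topic labels.
# Only the 体育 -> "Sports" pair is taken from the description.
UNIFIED_TOPICS = {
    "体育": "Sports",
    "sports": "Sports",
    "sport": "Sports",
    "财经": "Finance",
    "business": "Finance",
    "军事": "Military",
}


def unify_topic(column_name):
    """Map a site-specific column name to a unified label, or None if unknown."""
    key = column_name.strip().lower()  # lower() leaves Chinese text unchanged
    return UNIFIED_TOPICS.get(key)
```

In practice the table would be filled from a translation dictionary or machine translation output, then checked by hand.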
Picture retrieval optimization with news topics: each picture is tagged with a label according to the topic category of its news item. During similar-picture retrieval, the search is restricted to the same topic. Comparable news items often share the same or a similar topic, so restricting the topic of the picture source, and retrieving and ranking candidate pictures only under the same topic, filters out invalid candidates.
Picture retrieval optimization with time labels: news items are divided automatically by publication time. During similar-picture retrieval, the search is restricted to a specified time window, and varying the window threshold yields different retrieval results. News events are time-sensitive, so similar news pictures tend to appear within a short period of each other. On this basis, the publication time of a news picture is used as a label: for a query published at time t and a chosen window d, similar pictures are retrieved within the interval [t-d, t+d].
This step uses labels such as the topic category and publication time of the news item containing each picture to restrict the search space of the similar-picture retrieval module and filter out dissimilar candidates, so that the efficiency and performance of the retrieval system improve steadily on large-scale datasets.
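The topic and time-window filter of step S202 can be sketched as below. The `Record` fields and the function name `filter_candidates` are assumptions made for the example, and the window d is expressed in days purely for illustration; the text does not fix a unit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    image: str           # picture path or URL
    caption: str         # picture title
    topic: str           # unified topic label
    published: datetime  # publication time of the news item


def filter_candidates(query, candidates, window_days):
    """Keep candidates that share the query's topic and fall within [t-d, t+d]."""
    d = timedelta(days=window_days)
    lo, hi = query.published - d, query.published + d
    return [
        c for c in candidates
        if c.topic == query.topic and lo <= c.published <= hi
    ]
```

Only the surviving candidates are passed to the more expensive key-point matching, which is where the efficiency gain described above would come from.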
Step S203: Perform picture retrieval in the filtered target-language knowledge base, and find target pictures similar to the query picture;
Key-point features are extracted from the pictures and the similarity between pictures is calculated from the extracted features, realizing similar-picture retrieval across the multimodal knowledge bases of different languages.
This can be divided into the following steps: take a picture in the source-language knowledge base as the query picture, extract its key points with the scale-invariant feature transform (SIFT) algorithm, and represent the picture as feature vectors based on the key points; extract the feature vectors of all candidate pictures on the target side, and match the key points of the query picture against those of each target candidate picture; calculate the average Euclidean distance between all matched key points as the similarity measure between pictures; for the query picture, rank all pictures in the target picture library by picture similarity and select the top N pictures as the similar picture set.
Specifically, for a key point x in the query picture, the two most similar key points y and z are found in the target candidate picture. The similarity between two key points is given by the Euclidean distance between their feature vectors.
Whether key point x matches key point y of the target picture is decided by whether the ratio of the Euclidean distance to the nearest key point to the distance to the second-nearest key point is below a specified threshold. That is, if d(x, y) < d(x, z) * θ, key point x and key point y are considered a match, where the threshold θ ∈ [0, 1] is typically set to 0.8. Fig. 5 shows the key-point matching result for two pictures.
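The ratio test above can be sketched in plain Python. The sketch assumes the key-point feature vectors have already been extracted (for example by a SIFT implementation) and are passed in as lists of numbers; the function names are invented for the example, and the brute-force search is for clarity, not speed.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match_keypoints(query_desc, cand_desc, theta=0.8):
    """Return (query_index, candidate_index, distance) for each accepted match."""
    matches = []
    if len(cand_desc) < 2:
        return matches  # the ratio test needs a nearest and a second-nearest point
    for i, x in enumerate(query_desc):
        ranked = sorted((euclidean(x, y), j) for j, y in enumerate(cand_desc))
        (d_y, j_y), (d_z, _) = ranked[0], ranked[1]
        if d_y < d_z * theta:  # d(x, y) < d(x, z) * theta
            matches.append((i, j_y, d_y))
    return matches
```

A key point whose two nearest neighbours are almost equally close is ambiguous and is dropped, which is the purpose of the threshold θ.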
The similarity between two whole pictures is calculated from all of their matched key points. Specifically, it is the average Euclidean distance over all matched key-point pairs.
Here M denotes the number of key points matched between the two pictures, and n denotes the dimension of the key-point feature vectors.
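Written out from these definitions, the average Euclidean distance would take the form below. This is a reconstruction under assumed notation, not the patent's own formula: x_i and y_i are taken to be the feature vectors of the i-th matched key-point pair, x_{i,j} and y_{i,j} their j-th components, and v(s_img, t_img) is borrowed from the later text as the name of the picture similarity.

```latex
\[
v(s_{img}, t_{img}) \;=\; \frac{1}{M} \sum_{i=1}^{M} \sqrt{\sum_{j=1}^{n} \left( x_{i,j} - y_{i,j} \right)^{2}}
\]
```

Under this form a smaller value means the matched key points lie closer together, so candidates would be ranked in ascending order of v.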
The target candidate pictures are ranked by picture similarity, and the top N similar pictures are selected.
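The ranking step can be sketched as follows, assuming each candidate picture already has the list of distances of its matched key points. The names `mean_match_distance` and `top_n` are invented for the example, and treating a candidate with no matches as infinitely distant is an assumption the text does not state.

```python
def mean_match_distance(distances):
    """Average distance over matched key points; infinity when nothing matched."""
    if not distances:
        return float("inf")
    return sum(distances) / len(distances)


def top_n(match_distances, n):
    """Rank candidates by ascending mean distance and keep the first n names."""
    scored = [(mean_match_distance(d), name) for name, d in match_distances.items()]
    scored.sort()
    return [name for _, name in scored[:n]]
```

For example, `top_n({"a": [2.0, 4.0], "b": [1.0], "c": []}, 2)` keeps `b` (mean 1.0) and `a` (mean 3.0) and drops `c`, which has no matches.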
Step S204: Build a bilingual comparable corpus from the text information corresponding to the target pictures and the text information corresponding to the query picture.
Using the similar-picture retrieval system, similar pictures are retrieved under the news topics related to the target domain, and bilingual comparable text is then extracted. The present invention proposes a title comparability measure modeled on image similarity, and a title comparability measure that fuses image similarity with text similarity.
The title comparability measure modeled on image similarity can be divided into the following steps:
Using the similar-picture retrieval system with a source-language picture as input, retrieve the set of similar pictures on the target side, and calculate the similarity between the query picture and each retrieved picture with the SIFT algorithm.
Model the comparability of picture titles with picture similarity: the ranking of the pictures is taken as the ranking of the comparability of their titles. The comparability between titles is measured with the following formula:
where S(s_cap, t_cap) denotes the comparability between title s_cap and title t_cap, v(s_img, t_img) denotes the similarity between picture s_img and picture t_img, calculated with the SIFT algorithm, and b is a constant.
The title comparability measure that fuses image similarity with text similarity can be divided into the following steps:
Using the similar-picture retrieval system with a source-language picture as input, retrieve the set of similar pictures on the target side, and calculate the similarity between the query picture and each retrieved picture with the SIFT algorithm.
Calculate the text similarity between the query picture title and each target candidate picture title. The present invention evaluates the text similarity between the query title and a candidate title from three angles: content similarity (fc), entity similarity (fe), and structural similarity (fl). The formula is:
S = α·fc + β·fe + γ·fl
The image similarity and the text similarity score between picture titles are then fused to re-rank the similar-picture retrieval results (which were ranked by image similarity). Comparable text pairs are built from the re-ranked results. The present invention calculates the combined similarity between the query and a candidate with the following formula:
where S_txt(s_cap, t_cap) denotes the text similarity between the query title and the retrieved title, and b is a constant that controls the contribution of the picture similarity to the final similarity score.
The text similarity between the query picture title and a candidate picture title is computed as follows. The query title is translated into the target language with the Google online translation system, so as to preserve important information in the original text such as word order, sentence structure, and named entities.
The comparability between the translated query title and the target picture title is evaluated on the following three features:
Content similarity: the translated query title is tokenized, stop words are removed, and words are reduced to their stems, yielding a bag-of-words representation. Based on the vector space model, the cosine similarity between the query title and the target title is computed (denoted fc).
Entity similarity: the Stanford named entity recognition tool is used to identify the named entities in the picture titles, giving a bag-of-words set of named entities; the named entity similarity between the bilingual texts (denoted fe) is computed based on the vector space model.
Structural similarity: the ratio of the numbers of content words (including nouns, verbs, adverbs, adjectives, proper nouns, etc.) in the query title and the target title is used as a comparability criterion (denoted fl).
The three features are fused by weighted averaging to obtain the comparability score between the query title and the target title:
S = α·fc + β·fe + γ·fl
where, according to each feature's contribution to comparability, the weights are set empirically to α = 0.8, β = 0.15, γ = 0.05.
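The weighted fusion described above can be illustrated with a minimal sketch. It assumes the titles have already been translated, tokenized, and tagged, so the inputs are plain token lists and content-word counts. The function names are hypothetical, and taking the structural ratio as smaller count over larger count is an assumption, since the text does not fix its orientation.

```python
import math
from collections import Counter

# Weights stated in the text: alpha = 0.8, beta = 0.15, gamma = 0.05.
ALPHA, BETA, GAMMA = 0.8, 0.15, 0.05


def cosine(a, b):
    """Cosine similarity of two token lists under a bag-of-words model."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def count_ratio(n_query, n_target):
    """Ratio of content-word counts, smaller over larger (assumed orientation)."""
    if n_query == 0 or n_target == 0:
        return 0.0
    return min(n_query, n_target) / max(n_query, n_target)


def text_similarity(q_tokens, t_tokens, q_entities, t_entities,
                    q_content_words, t_content_words):
    """S = alpha*fc + beta*fe + gamma*fl for one query/target title pair."""
    fc = cosine(q_tokens, t_tokens)          # content similarity
    fe = cosine(q_entities, t_entities)      # named entity similarity
    fl = count_ratio(q_content_words, t_content_words)  # structural similarity
    return ALPHA * fc + BETA * fe + GAMMA * fl
```

Because the weights sum to 1, two identical titles score approximately 1 and two titles with no shared tokens, entities, or content words score 0.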
The bilingual comparable corpus mining method proposed by the present invention, based on cross-media information retrieval, provides a new way to mine bilingual comparable resources on the Internet. It proposes not only a method for building a large-scale multimodal database, but also the mining of comparable resources assisted by picture similarity and, on this basis, two comparability measures, in order to obtain comparable text on a large scale.
The bilingual comparable corpus mining device provided by an embodiment of the present invention is introduced below. The device described below and the bilingual comparable corpus mining method described above may be read with reference to each other.
Fig. 6 is a structural block diagram of the bilingual comparable corpus mining device provided by an embodiment of the present invention. Referring to Fig. 6, the device may include:
A knowledge base building module 100, for crawling in advance a plurality of pictures and corresponding text information from databases in different languages, and establishing a multimodal knowledge base containing the pictures and the text information;
A search module 200, for taking a picture in the source-language knowledge base as a query picture, performing picture retrieval in the target-language knowledge base, and finding a target picture similar to the query picture;
A building module 300, for building a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
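The way the three modules fit together can be sketched as follows. This is an illustrative outline only: pictures are stood in for by string identifiers, the picture similarity function is passed in by the caller, and all function names and the `top_k` parameter are hypothetical rather than taken from the patent.

```python
from typing import Callable, List, Tuple

Entry = Tuple[str, str]  # (picture identifier, text information)


def build_knowledge_base(crawled: List[Entry]) -> List[Entry]:
    """Module 100: keep each picture with its text as a two-tuple."""
    return [(pic, text) for pic, text in crawled if pic and text]


def search(query_pic: str, target_kb: List[Entry],
           similarity: Callable[[str, str], float], top_k: int = 1) -> List[Entry]:
    """Module 200: rank target entries by picture similarity to the query."""
    ranked = sorted(target_kb, key=lambda e: similarity(query_pic, e[0]),
                    reverse=True)
    return ranked[:top_k]


def build_corpus(source_kb: List[Entry], target_kb: List[Entry],
                 similarity: Callable[[str, str], float],
                 top_k: int = 1) -> List[Tuple[str, str]]:
    """Module 300: pair the query text with the text of each retrieved picture."""
    pairs = []
    for pic, text in source_kb:
        for _, target_text in search(pic, target_kb, similarity, top_k):
            pairs.append((text, target_text))
    return pairs
```

Passing the similarity function as a parameter keeps the outline independent of any particular image feature; a SIFT-based score would be one possible choice.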
As a specific embodiment, in the bilingual comparable corpus mining device provided by the present invention, the search module may specifically include:
An extraction unit, for extracting the key points of the query picture with the scale-invariant feature transform (SIFT) algorithm, and representing the query picture as a feature vector based on the key points;
A matching unit, for extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture against those of each candidate picture;
A similarity calculation unit, for calculating the average Euclidean distance between all matched key points, as the picture similarity between the pictures;
A selection unit, for sorting the candidate pictures according to the picture similarity, and selecting the target picture similar to the query picture.
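The matching, similarity calculation, and selection units can be sketched in a few lines, assuming the key point descriptors have already been extracted (for example by a SIFT implementation) and are given as lists of numbers. Nearest-neighbour matching and sorting by ascending distance are assumptions made for this sketch; the text specifies only that the average Euclidean distance over matched key points serves as the picture similarity.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match_distances(query_desc, cand_desc):
    """Match each query descriptor to its nearest candidate descriptor
    (assumed matching rule) and return the matched distances."""
    if not cand_desc:
        return []
    return [min(euclidean(q, c) for c in cand_desc) for q in query_desc]


def mean_match_distance(query_desc, cand_desc):
    """Average Euclidean distance over all matched key points."""
    distances = match_distances(query_desc, cand_desc)
    return sum(distances) / len(distances) if distances else math.inf


def rank_candidates(query_desc, candidates):
    """Sort candidate names so that the smallest mean distance comes first."""
    return sorted(candidates,
                  key=lambda name: mean_match_distance(query_desc, candidates[name]))
```

A smaller mean distance is treated here as a closer match, so the first name returned is the most similar candidate.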
As a specific embodiment, the search module may further include:
A determination unit, for determining the topic category and publication time information of the query picture;
A filtering unit, for filtering out the pictures in the target-language knowledge base that do not match the topic category and publication time information;
A search unit, for performing picture retrieval in the filtered target-language knowledge base, and finding the target picture similar to the query picture.
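The filtering step can be illustrated with a small sketch. The `Picture` record, its field names, and the `max_days` window are hypothetical; the text says only that pictures not matching the topic category and publication time information are filtered out, without stating how close the dates must be.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Picture:
    name: str
    topic: str
    published: date


def filter_candidates(query: Picture, candidates: List[Picture],
                      max_days: int = 3) -> List[Picture]:
    """Keep candidates with the same topic published within max_days of the
    query picture (the window size is an assumed parameter)."""
    window = timedelta(days=max_days)
    return [c for c in candidates
            if c.topic == query.topic
            and abs(c.published - query.published) <= window]
```

Filtering before retrieval shrinks the candidate set, so the more expensive key point matching runs on fewer pictures.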
Specifically, the building module may include:
A text similarity calculation unit, for calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;
A sorting unit, for re-ranking the target pictures according to the text similarity;
A building unit, for building the bilingual comparable corpus according to the re-ranked result.
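The re-ranking and pairing performed by these units can be sketched as below. The `token_overlap` function is only a stand-in for the text similarity and is not the measure defined in the description; the `keep` parameter is likewise hypothetical.

```python
def token_overlap(a, b):
    """Stand-in text similarity: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def rerank_and_pair(query_text, target_texts, text_similarity=token_overlap, keep=1):
    """Re-rank retrieved target texts by text similarity to the query text,
    then emit (query, target) pairs for the top `keep` results."""
    reranked = sorted(target_texts,
                      key=lambda t: text_similarity(query_text, t),
                      reverse=True)
    return [(query_text, t) for t in reranked[:keep]]
```

In the described method the query title is first translated into the target language, so both texts passed to the similarity function would be in the same language.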
The bilingual comparable corpus mining device of this embodiment is used to implement the foregoing bilingual comparable corpus mining method, so for specific embodiments of the device, see the method embodiments above. For example, the knowledge base building module 100, the search module 200, and the building module 300 implement steps S101, S102, and S103 of the method, respectively; their specific embodiments may refer to the descriptions of the corresponding embodiments and are not repeated here.
The bilingual comparable corpus mining device provided by the present invention crawls in advance a plurality of pictures and corresponding text information from databases in different languages and establishes a multimodal knowledge base containing the pictures and the text information; takes a picture in the source-language knowledge base as a query picture, performs picture retrieval in the target-language knowledge base, and finds a target picture similar to the query picture; and builds a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture. The present application uses cross-media information retrieval, with pictures serving as the medium that links the source language and the target language, and thereby obtains target-side text that is equivalent or comparable to the source-language text. This provides a new method for mining bilingual comparable resources on the Internet and addresses the scarcity of bilingual resources for specific language pairs.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for the relevant details, see the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The bilingual comparable corpus mining method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims (10)

1. A bilingual comparable corpus mining method, characterized by comprising:
Crawling in advance a plurality of pictures and corresponding text information from databases in different languages, and establishing a multimodal knowledge base containing the pictures and the text information;
Taking a picture in the source-language knowledge base as a query picture, performing picture retrieval in the target-language knowledge base, and finding a target picture similar to the query picture;
Building a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
2. The bilingual comparable corpus mining method according to claim 1, characterized in that crawling in advance a plurality of pictures and corresponding text information from databases in different languages comprises:
Crawling pictures from news websites with a web crawler, the text information being the topic and/or title information corresponding to each picture, and storing each picture and its corresponding text information as a two-tuple in the multimodal knowledge base.
3. The bilingual comparable corpus mining method according to claim 2, characterized in that performing picture retrieval in the target-language knowledge base and finding a target picture similar to the query picture comprises:
Extracting the key points of the query picture with the scale-invariant feature transform algorithm, and representing the query picture as a feature vector based on the key points;
Extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture against those of each candidate picture;
Calculating the average Euclidean distance between all matched key points, as the picture similarity between the pictures;
Sorting the candidate pictures according to the picture similarity, and selecting the target picture similar to the query picture.
4. The bilingual comparable corpus mining method according to any one of claims 1 to 3, characterized in that performing picture retrieval in the target-language knowledge base and finding a target picture similar to the query picture comprises:
Determining the topic category and publication time information of the query picture;
Filtering out the pictures in the target-language knowledge base that do not match the topic category and publication time information;
Performing picture retrieval in the filtered target-language knowledge base, and finding the target picture similar to the query picture.
5. The bilingual comparable corpus mining method according to any one of claims 1 to 3, characterized in that building a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture comprises:
Calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;
Re-ranking the target pictures according to the text similarity;
Building the bilingual comparable corpus according to the re-ranked result.
6. The bilingual comparable corpus mining method according to claim 5, characterized in that calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture comprises:
Calculating the content similarity, entity similarity, and structural similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;
Taking the weighted average of the content similarity, the entity similarity, and the structural similarity to obtain the text similarity of the corresponding text information.
7. A bilingual comparable corpus mining device, characterized by comprising:
A knowledge base building module, for crawling in advance a plurality of pictures and corresponding text information from databases in different languages, and establishing a multimodal knowledge base containing the pictures and the text information;
A search module, for taking a picture in the source-language knowledge base as a query picture, performing picture retrieval in the target-language knowledge base, and finding a target picture similar to the query picture;
A building module, for building a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
8. The bilingual comparable corpus mining device according to claim 7, characterized in that the search module comprises:
An extraction unit, for extracting the key points of the query picture with the scale-invariant feature transform algorithm, and representing the query picture as a feature vector based on the key points;
A matching unit, for extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture against those of each candidate picture;
A similarity calculation unit, for calculating the average Euclidean distance between all matched key points, as the picture similarity between the pictures;
A selection unit, for sorting the candidate pictures according to the picture similarity, and selecting the target picture similar to the query picture.
9. The bilingual comparable corpus mining device according to claim 7 or 8, characterized in that the search module further comprises:
A determination unit, for determining the topic category and publication time information of the query picture;
A filtering unit, for filtering out the pictures in the target-language knowledge base that do not match the topic category and publication time information;
A search unit, for performing picture retrieval in the filtered target-language knowledge base, and finding the target picture similar to the query picture.
10. The bilingual comparable corpus mining device according to claim 7 or 8, characterized in that the building module comprises:
A text similarity calculation unit, for calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture;
A sorting unit, for re-ranking the target pictures according to the text similarity;
A building unit, for building the bilingual comparable corpus according to the re-ranked result.
CN201710169141.XA 2017-03-21 2017-03-21 Bilingual comparable corpus mining method and device Active CN106980664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710169141.XA CN106980664B (en) 2017-03-21 2017-03-21 Bilingual comparable corpus mining method and device

Publications (2)

Publication Number Publication Date
CN106980664A true CN106980664A (en) 2017-07-25
CN106980664B CN106980664B (en) 2020-11-10

Family

ID=59338807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710169141.XA Active CN106980664B (en) 2017-03-21 2017-03-21 Bilingual comparable corpus mining method and device

Country Status (1)

Country Link
CN (1) CN106980664B (en)

Also Published As

Publication number Publication date
CN106980664B (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
CN106980664A (en) A kind of bilingual comparable corpora mining method and device
US9489401B1 (en) Methods and systems for object recognition
CN102053991B (en) Method and system for multi-language document retrieval
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN102890713B (en) A kind of music recommend method based on user&#39;s current geographic position and physical environment
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
US8606780B2 (en) Image re-rank based on image annotations
CN105045852A (en) Full-text search engine system for teaching resources
CN108509521B (en) Image retrieval method for automatically generating text index
CN107193892B (en) A kind of document subject matter determines method and device
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
TW201826145A (en) Method and system for knowledge extraction from Chinese corpus useful for extracting knowledge from source corpuses mainly written in Chinese
Huang et al. AKMiner: Domain-specific knowledge graph mining from academic literatures
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
Zhou et al. Automatic image–text alignment for large-scale web image indexing and retrieval
Moncla et al. Mapping urban fingerprints of odonyms automatically extracted from French novels
JP2021501387A (en) Methods, computer programs and computer systems for extracting expressions for natural language processing
Song et al. Cross-language record linkage based on semantic matching of metadata
JP6303669B2 (en) Document retrieval device, document retrieval system, document retrieval method, and program
George et al. A novel sequence graph representation for searching and retrieving sequences of long text in the domain of information retrieval
Rabiu et al. TEXTUAL AND STRUCTURAL APPROACHES TO DETECTING FIGURE PLAGIARISM IN SCIENTIFIC PUBLICATIONS.
Adefowoke Ojokoh et al. Automated document metadata extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant