CN106980664A - Bilingual comparable corpus mining method and device - Google Patents
Bilingual comparable corpus mining method and device
- Publication number
- CN106980664A (application number CN201710169141.XA)
- Authority
- CN
- China
- Prior art keywords
- picture
- query
- text information
- similarity
- knowledge base
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a bilingual comparable corpus mining method and device. Multiple pictures and their corresponding text are crawled in advance from databases in different languages to build a multimodal knowledge base containing the pictures and text. Each picture in the source-language knowledge base is used as a query picture for image retrieval in the target-language knowledge base, which finds target pictures similar to the query picture. A bilingual comparable corpus is then built from the text corresponding to the target pictures and the text corresponding to the query picture. The application uses cross-media information retrieval, with pictures serving as the medium that bridges the source and target languages, to obtain equivalent or comparable target-language text for source-language text. This provides a new method for mining bilingual comparable resources from the Internet and addresses the scarcity of specific bilingual resources.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a bilingual comparable corpus mining method and device.
Background technology
A bilingual comparable corpus is a collection of texts in different languages that express similar semantics. Large-scale bilingual comparable corpora typically contain rich and varied translation units, such as phrase-level and sentence-level translation pairs, as well as bilingual dictionaries. For low-resource languages and some restricted domains, parallel resources are generally scarce, but bilingual comparable corpora are relatively easy to obtain. Bilingual comparable corpora have therefore become a valuable resource for machine translation and cross-language information retrieval, and automatically acquiring large-scale bilingual comparable corpora has become a fundamental task in machine translation.
At present, research methods for acquiring bilingual comparable corpora fall into three broad classes. The first class builds comparable corpora via cross-language information retrieval: keywords are extracted from source-language documents, mapped to the target language with a bilingual dictionary, and used as retrieval queries over a candidate set of target-language documents to obtain comparable bilingual documents. The second class builds comparable corpora from content and structural similarity: a translation engine (e.g., Google Translate or Bing Translate) translates the source document into the target language to produce a pseudo-translation, whose similarity to target-language documents is then evaluated in terms of vocabulary, topic, and structure, and the most similar documents are selected by ranking. The third class extracts bilingual comparable corpora from structured knowledge bases (such as Wikipedia), using existing category entries and link information to obtain document-level comparable resources. Existing methods start only from textual similarity and build comparable corpus resources from bilingual dictionaries or existing knowledge bases; their performance depends on large manually annotated dictionaries, knowledge bases, or existing translation systems. When facing low-resource languages or specialized domains, the scarcity of language resources restricts the generality of such methods.
Summary of the invention
It is an object of the present invention to provide a bilingual comparable corpus mining method and device, so as to solve the problem that existing methods, which start only from textual similarity, generalize poorly to specific low-resource languages.
To solve the above technical problem, the present invention provides a bilingual comparable corpus mining method, comprising:
crawling multiple pictures and corresponding text from databases in different languages in advance, and building a multimodal knowledge base containing the pictures and the text;
using a picture in the source-language knowledge base as a query picture, and performing image retrieval in the target-language knowledge base to find target pictures similar to the query picture;
building a bilingual comparable corpus from the text corresponding to the target pictures and the text corresponding to the query picture.
Optionally, crawling multiple pictures and corresponding text from databases in different languages in advance comprises:
crawling pictures from news websites with a web crawler, where the text is the topic and/or headline corresponding to the picture, and storing each picture and its corresponding text as a two-tuple in the multimodal knowledge base.
Optionally, performing image retrieval in the target-language knowledge base to find target pictures similar to the query picture comprises:
extracting keypoints of the query picture with the scale-invariant feature transform (SIFT) algorithm, and representing the query picture as feature vectors based on the keypoints;
extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the keypoints of the query picture and the candidate pictures;
computing the average Euclidean distance between all matched keypoints as the picture similarity between pictures;
ranking the candidate pictures by picture similarity, and selecting the target pictures similar to the query picture.
Optionally, performing image retrieval in the target-language knowledge base to find target pictures similar to the query picture comprises:
determining the topic category and publication-time information of the query picture;
filtering out pictures in the target-language knowledge base that do not match the topic category and publication-time information;
performing image retrieval in the filtered target-language knowledge base to find target pictures similar to the query picture.
Optionally, building a bilingual comparable corpus from the text corresponding to the target pictures and the text corresponding to the query picture comprises:
computing the text similarity between the text corresponding to the query picture and the text corresponding to each target picture;
re-ranking the target pictures by text similarity;
building the bilingual comparable corpus from the re-ranked results.
Optionally, computing the text similarity between the text corresponding to the query picture and the text corresponding to the target picture comprises:
computing the content similarity, entity similarity, and structural similarity of the text corresponding to the query picture and the text corresponding to the target picture;
taking a weighted average of the content similarity, the entity similarity, and the structural similarity to obtain the text similarity of the corresponding texts.
The present invention also provides a bilingual comparable corpus mining device, comprising:
a knowledge-base building module, for crawling multiple pictures and corresponding text from databases in different languages in advance, and building a multimodal knowledge base containing the pictures and the text;
a retrieval module, for using a picture in the source-language knowledge base as a query picture and performing image retrieval in the target-language knowledge base to find target pictures similar to the query picture;
a building module, for building a bilingual comparable corpus from the text corresponding to the target pictures and the text corresponding to the query picture.
Optionally, the retrieval module comprises:
an extraction unit, for extracting keypoints of the query picture with the scale-invariant feature transform (SIFT) algorithm, and representing the query picture as feature vectors based on the keypoints;
a matching unit, for extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the keypoints of the query picture and the candidate pictures;
a similarity computation unit, for computing the average Euclidean distance between all matched keypoints as the picture similarity between pictures;
a selection unit, for ranking the candidate pictures by picture similarity, and selecting the target pictures similar to the query picture.
Optionally, the retrieval module further comprises:
a determination unit, for determining the topic category and publication-time information of the query picture;
a filtering unit, for filtering out pictures in the target-language knowledge base that do not match the topic category and publication-time information;
a retrieval unit, for performing image retrieval in the filtered target-language knowledge base to find target pictures similar to the query picture.
Optionally, the building module comprises:
a text-similarity computation unit, for computing the text similarity between the text corresponding to the query picture and the text corresponding to each target picture;
a ranking unit, for re-ranking the target pictures by text similarity;
a construction unit, for building the bilingual comparable corpus from the re-ranked results.
In the bilingual comparable corpus mining method and device provided by the present invention, multiple pictures and corresponding text are crawled in advance from databases in different languages to build a multimodal knowledge base containing pictures and text; each picture in the source-language knowledge base is used as a query picture for image retrieval in the target-language knowledge base to find target pictures similar to the query picture; and a bilingual comparable corpus is built from the text corresponding to the target pictures and the text corresponding to the query picture. The application uses cross-media information retrieval, with pictures as the medium bridging the source and target languages, to obtain equivalent or comparable target-language text for source-language text. This provides a new method for mining bilingual comparable resources from the Internet and addresses the scarcity of specific bilingual resources.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of one embodiment of the bilingual comparable corpus mining method provided by the present invention;
Fig. 2 is a schematic diagram of similar pictures and comparable headlines mined from news text;
Fig. 3 is a schematic diagram of the process of finding target pictures similar to the query picture in an embodiment of the present invention;
Fig. 4 is a flowchart of another embodiment of the bilingual comparable corpus mining method provided by the present invention;
Fig. 5 is a schematic diagram of keypoint matching results between pictures;
Fig. 6 is a structural block diagram of the bilingual comparable corpus mining device provided by an embodiment of the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Clearly, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
A flowchart of one embodiment of the bilingual comparable corpus mining method provided by the present invention is shown in Fig. 1. The method includes:
Step S101: crawl multiple pictures and corresponding text from databases in different languages in advance, and build a multimodal knowledge base containing the pictures and the text.
Specifically, embodiments of the present invention can use a web crawler to crawl pictures and corresponding text at scale from news websites in different languages. The text can be the topic and/or headline corresponding to the picture; each picture and its corresponding text are stored as a two-tuple in the multimodal knowledge base.
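As a minimal illustration of the two-tuple storage described above (class and field names are assumptions for illustration, not taken from the patent), a multimodal knowledge base can be kept as per-language lists of (picture, caption) records:

```python
from collections import defaultdict

class MultimodalKB:
    """Per-language store of (picture_id, caption) two-tuples."""

    def __init__(self):
        self.records = defaultdict(list)

    def add(self, lang, picture_id, caption):
        # One record of the knowledge base: a picture and its caption.
        self.records[lang].append((picture_id, caption))

    def pictures(self, lang):
        """All picture ids for one language, later used as queries or candidates."""
        return [pid for pid, _ in self.records[lang]]

kb = MultimodalKB()
kb.add("zh", "zh_001.jpg", "洪水袭击沿海城市")
kb.add("en", "en_042.jpg", "Flood hits coastal city")
print(kb.pictures("zh"))  # ['zh_001.jpg']
```

Each source-language picture id can then be iterated over as a query against the target-language side.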
It should be pointed out that the present invention is based on cross-media information retrieval (CMIR) technology, which is quite similar to cross-language information retrieval (CLIR); the only difference is the medium that bridges the source and target languages. Cross-media information retrieval uses pictures rather than text as the bridge between the source and target languages, and thereby obtains equivalent or comparable target-language text for source-language text.
In one embodiment, the CMIR knowledge base can contain two parts, pictures and picture captions, where each caption describes the content of its picture and corresponds to it one-to-one; see the example in Fig. 2. CMIR uses a similar-picture search engine to establish links between texts in different languages; that is, the captions corresponding to similar pictures are considered bilingual comparable text. CLIR systems instead translate source-language text to the target-language side with a weak translation system or a bilingual dictionary, and use the translation result as a retrieval query to retrieve similar text on the target-language side. To some extent, CMIR is easier to use than CLIR: the only problem CMIR must consider is how to improve retrieval quality, whereas CLIR must additionally consider the quality of the bilingual dictionary and the performance of the weak translator. Therefore, embodiments of the present invention propose a bilingual comparable corpus mining method based on cross-media information retrieval, which first attempts to mine bilingual comparable corpora using picture similarity.
Step S102: use a picture in the source-language knowledge base as the query picture, and perform image retrieval in the target-language knowledge base to find target pictures similar to the query picture.
A similar-picture retrieval system is built between the multimodal knowledge bases of the different languages, and similar pictures are retrieved in the target-language knowledge base.
Referring to Fig. 3, in an embodiment of the present invention, the process of performing image retrieval in the target-language knowledge base and finding target pictures similar to the query picture can specifically include:
Step S1021: extract keypoints of the query picture with the scale-invariant feature transform (SIFT) algorithm, and represent the query picture as feature vectors based on the keypoints;
Step S1022: extract the feature vectors of all candidate pictures in the target-language knowledge base, and match the keypoints of the query picture and the candidate pictures;
Step S1023: compute the average Euclidean distance between all matched keypoints as the picture similarity between pictures;
Step S1024: rank the candidate pictures by picture similarity, and select the target pictures similar to the query picture.
Step S103: build a bilingual comparable corpus from the text corresponding to the target pictures and the text corresponding to the query picture.
In the bilingual comparable corpus mining method provided by the present invention, multiple pictures and corresponding text are crawled in advance from databases in different languages to build a multimodal knowledge base containing pictures and text; each picture in the source-language knowledge base is used as a query picture for image retrieval in the target-language knowledge base to find target pictures similar to the query picture; and a bilingual comparable corpus is built from the text corresponding to the target pictures and the text corresponding to the query picture. The application uses cross-media information retrieval, with pictures as the medium bridging the source and target languages, to obtain equivalent or comparable target-language text for source-language text. This provides a new method for mining bilingual comparable resources from the Internet and addresses the scarcity of specific bilingual resources.
On the basis of any of the above embodiments, to improve the accuracy and efficiency of the similar-picture retrieval system, embodiments of the present invention further propose optimizing the retrieval model by combining news topic information and time-tag information. Specifically, in the bilingual comparable corpus mining method provided by the present invention, the process of performing image retrieval in the target-language knowledge base and finding target pictures similar to the query picture can specifically be:
determining the topic category and publication-time information of the query picture;
filtering out pictures in the target-language knowledge base that do not match the topic category and publication-time information;
performing image retrieval in the filtered target-language knowledge base to find target pictures similar to the query picture.
With reference to Fig. 4, a flowchart of another embodiment of the bilingual comparable corpus mining method provided by the present invention, the specific implementation of this method is further elaborated below.
The process includes:
Step S201: multimodal knowledge-base construction. This module uses a web crawler to crawl pictures and corresponding picture captions from news websites in multiple languages, and builds the multimodal knowledge base with each picture and its corresponding caption as a two-tuple.
This can be divided into the following steps: crawl the news pages under the different columns or topics into which the news website is divided; parse the news pages, extract the pictures and picture captions through structural analysis of the pages, and extract the news publication time; and store each picture and its corresponding caption as a two-tuple in the multimodal knowledge base.
News pages under each topic or column are crawled via the news website's topic columns and stored under the corresponding topic directory. Pictures and picture captions in a page are matched via picture tags (e.g., img, pic), while page tags unrelated to the content (e.g., <a>, </br>) are removed from the caption text. Each picture and its caption are represented as a two-tuple, which serves as one record of the multimodal knowledge base.
Taking Chinese-English comparable resource mining as an example, the present invention crawls news from mainstream Chinese news websites (e.g., Xinhuanet, Phoenix, Sina) and English news websites (FOX, CNN, BBC), classifies the news and divides it by topic according to the columns the websites provide (e.g., military, finance, technology, sports, entertainment), and records the publication time of each item. For the crawled news, embodiments of the present invention perform structural analysis of each news page, extract the pictures and picture captions in the page, and store them as two-tuples (image, caption), each serving as one record in the multimodal database. In this way, a multimodal database is built for each language.
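The page-parsing step above can be sketched with Python's standard html.parser (a simplified illustration: real news sites carry captions in site-specific markup, so the assumption here that the caption sits in the img alt attribute is for demonstration only):

```python
from html.parser import HTMLParser
import re

class NewsImageParser(HTMLParser):
    """Collects (image URL, caption) two-tuples from a news page.

    Assumes captions are carried in the <img> alt attribute; real sites
    vary, so site-specific extraction rules would be needed in practice."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt"):
                self.records.append((a["src"], a["alt"]))

def clean_caption(text):
    """Strip content-irrelevant markup such as <a> or </br> from caption text."""
    return re.sub(r"<[^>]+>", "", text).strip()

page = ('<html><body><img src="p1.jpg" alt="Flood hits the coast">'
        '<img src="logo.png"></body></html>')
parser = NewsImageParser()
parser.feed(page)
print(parser.records)                            # [('p1.jpg', 'Flood hits the coast')]
print(clean_caption("See <a>photo</a></br>"))    # See photo
```

Pictures without a usable caption (such as the logo above) are dropped rather than stored as incomplete records.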
Step S202: determine the topic category and publication-time information of the query picture, and filter out pictures in the target-language knowledge base that do not match the topic category and publication-time information.
The existing manually curated topic-category label systems of news websites are combined; for news websites in different languages, topic labels are mapped across languages with a translation dictionary or machine translation technology. Different languages and different news websites each have their own news category systems, so a unified classification standard must be established; for example, the Chinese category "体育" is mapped to "Sports".
Picture-retrieval optimization with news topics: each picture is tagged with the topic category of the news item it belongs to, and similar-picture retrieval is restricted to the same topic. Comparable news items tend to share the same or similar topics, so when retrieving similar pictures, restricting the topic of the picture source and retrieving and ranking candidate pictures within the same topic filters out invalid candidates.
Picture-retrieval optimization with time tags: news items are automatically partitioned by publication time, and similar-picture retrieval is restricted to a specified time window; varying the window threshold yields different retrieval results. News events are time-sensitive, so similar news pictures tend to occur within close time periods. Accordingly, the publication time of each news picture is used as a tag, and similar pictures are retrieved within the interval {t-d, t+d} for a configurable time window d.
This step uses tags such as the topic category and publication time of the news item containing each picture to restrict the search space of the similar-picture retrieval module and filter out dissimilar picture candidates, so that on large-scale datasets the efficiency and performance of the similar-picture retrieval system improve stably.
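The topic and time-window restriction above can be sketched as follows (the record layout and field names are illustrative assumptions, not the patent's data model):

```python
from datetime import datetime, timedelta

def filter_candidates(candidates, query_topic, query_time, d_days):
    """Keep only candidates that share the query's topic and were published
    within the window {t-d, t+d} around the query's publication time t."""
    window = timedelta(days=d_days)
    return [c for c in candidates
            if c["topic"] == query_topic
            and abs(c["time"] - query_time) <= window]

t = datetime(2017, 3, 1)
candidates = [
    {"id": "en_01", "topic": "Sports",  "time": datetime(2017, 3, 2)},   # kept
    {"id": "en_02", "topic": "Finance", "time": datetime(2017, 3, 2)},   # wrong topic
    {"id": "en_03", "topic": "Sports",  "time": datetime(2017, 3, 20)},  # outside window
]
kept = filter_candidates(candidates, "Sports", t, d_days=3)
print([c["id"] for c in kept])  # ['en_01']
```

Only the surviving candidates are then passed to the keypoint-based similarity computation, which shrinks the search space before the expensive matching step.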
Step S203: perform image retrieval in the filtered target-language knowledge base, and find the target pictures similar to the query picture.
Keypoint features are extracted from the pictures, and the similarity between pictures is computed from the extracted features, so that similar-picture retrieval is realized across the multimodal knowledge bases of different languages.
This can be divided into the following steps: use a picture in the source-language knowledge base as the query picture, extract its keypoints with the scale-invariant feature transform (SIFT) algorithm, and represent the picture as feature vectors based on the keypoints; extract the feature vectors of all candidate pictures in the knowledge base containing the target pictures, and match the keypoints of the query picture and the target candidate pictures; compute the average Euclidean distance between all matched keypoints as the similarity measure between pictures; and, for the query picture, rank all pictures in the target picture base by picture similarity and select the top-N pictures as the similar-picture set.
Specifically, for a keypoint x in the query picture, the two most similar keypoints y and z are found in the target candidate picture. The similarity between two keypoints is obtained from the Euclidean distance between their feature vectors.
Whether keypoint x matches target-picture keypoint y is decided by whether the ratio of the Euclidean distances to the nearest and second-nearest keypoints is below a specified threshold: if d(x, y) < d(x, z) * θ, then keypoint x and keypoint y are considered a match, where the threshold θ ∈ [0, 1] is typically set to 0.8. Fig. 5 shows the keypoint matching result for two pictures.
The similarity between the whole pictures is then computed from all matched keypoints; specifically, the average Euclidean distance between all matched points is computed as:
sim = (1/m) * Σ_{i=1..m} sqrt( Σ_{j=1..n} (x_ij - y_ij)² )
where m is the number of matched keypoints in the two pictures, n is the dimension of the keypoint feature vectors, and x_ij, y_ij are the j-th components of the feature vectors of the i-th matched keypoint pair.
According to the picture similarity, the target candidate pictures are ranked and the top-N similar pictures are selected.
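The ratio test and average-distance similarity above can be sketched in pure Python (the toy 2-D descriptors stand in for real 128-dimensional SIFT descriptors, and the brute-force nearest-neighbor search is for illustration only):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_keypoints(query_feats, cand_feats, theta=0.8):
    """Ratio test: x matches its nearest candidate keypoint y only if
    d(x, y) < d(x, z) * theta, where z is the second-nearest keypoint."""
    matches = []
    for x in query_feats:
        ranked = sorted(cand_feats, key=lambda f: euclidean(x, f))
        y, z = ranked[0], ranked[1]
        if euclidean(x, y) < euclidean(x, z) * theta:
            matches.append((x, y))
    return matches

def picture_distance(matches):
    """Average Euclidean distance over all matched keypoint pairs
    (smaller means the two pictures are more similar)."""
    return sum(euclidean(x, y) for x, y in matches) / len(matches)

# Toy 2-D "descriptors"; real SIFT descriptors are 128-dimensional.
query = [(0.0, 0.0), (1.0, 1.0)]
candidate = [(0.1, 0.0), (1.0, 1.1), (5.0, 5.0)]
m = match_keypoints(query, candidate)
print(len(m))                          # 2
print(round(picture_distance(m), 3))   # 0.1
```

Candidates would then be sorted by this distance (ascending) and the top-N retained as the similar-picture set.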
Step S204: build a bilingual comparable corpus from the text corresponding to the target pictures and the text corresponding to the query picture.
Through the similar-picture retrieval system, similar-picture retrieval is performed under the news topics related to the target domain, and bilingual comparable text is then extracted. The present invention proposes a caption-comparability measure modeled on picture similarity, as well as a caption-comparability measure that fuses picture similarity with text similarity.
The caption-comparability measure modeled on picture similarity can be divided into the following steps:
Using the similar-picture retrieval system, each source-language picture is used as input to retrieve a similar-picture set on the target side, and the similarity between the query picture and each retrieved picture is computed with the SIFT algorithm.
The comparability of picture captions is modeled with picture similarity: the picture ranking is taken as the comparability ranking of the picture captions, and the comparability S(s_cap, t_cap) between caption s_cap and caption t_cap is measured as a function of the similarity v(s_img, t_img) between picture s_img and picture t_img, computed with the SIFT algorithm, where b is a constant.
The caption-comparability measure fusing picture similarity and text similarity can be divided into the following steps:
Using the similar-picture retrieval system, each source-language picture is used as input to retrieve a similar-picture set on the target side, and the similarity between the query picture and each retrieved picture is computed with the SIFT algorithm.
The text similarity between the query picture caption and each target candidate picture caption is computed. The present invention evaluates the overall text similarity of the query caption and a candidate caption from three angles: content similarity (fc), entity similarity (fe), and structural similarity (fl). The specific formula is:
S = α·fc + β·fe + γ·fl
The text similarity score between picture captions is then fused with the picture similarity, and the similar-picture retrieval results (initially ranked by picture similarity) are re-ranked; comparable text pairs are built from the re-ranked results. The present invention computes the combined similarity of a query and a candidate from the picture similarity and the caption text similarity S_txt(s_cap, t_cap), where b is a constant controlling the contribution of the picture similarity to the final similarity score.
The computational methods for inquiring about the text similarity of picture header and candidate's picture header are as follows:Turned over online using Google
Title translation will be inquired about to object language by translating system, with important letters such as the word order of stet sheet, sentence structure and name entities
Breath.
The comparability of title translation and Target Photo title is inquired about based on following three characteristic evaluating:
Content similarity:The text description that will be inquired about after title translation carries out participle, removes stop words, and root etc. is operated,
Bag-of-words is obtained to represent.Based on vector space model, the cosine similarity for calculating inquiry title and desired title (is denoted as
fc)。
Entity similarity: the named entities in the picture titles are identified with the Stanford Named Entity Recognizer to obtain bag-of-entities sets, and the named-entity similarity between the bilingual texts is calculated based on the vector space model (denoted fe).
Structural similarity: the ratio of the numbers of content words (including nouns, verbs, adverbs, adjectives, proper nouns, etc.) in the query title and the target title is used as a comparability evaluation criterion (denoted fl).
The three features above are fused by a weighted average to obtain the comparability score between the query title and the target title:
S = α·fc + β·fe + γ·fl
where, according to each feature's contribution to comparability, the weights are empirically set to α = 0.8, β = 0.15, and γ = 0.05.
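The weighted fusion above can be sketched directly. The sketch below computes fc and fe as cosine similarities over bag-of-words and bag-of-entities counts, and interprets fl as the smaller-over-larger content-word count ratio (this normalization to [0, 1] is an assumption; the text says only "ratio"). Tokenization, named entity recognition, and translation are presumed done upstream.

```python
import math
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between two Counters (vector space model)."""
    dot = sum(bag_a[w] * bag_b[w] for w in set(bag_a) & set(bag_b))
    norm = (math.sqrt(sum(v * v for v in bag_a.values()))
            * math.sqrt(sum(v * v for v in bag_b.values())))
    return dot / norm if norm else 0.0

def comparability(q_tokens, t_tokens, q_ents, t_ents,
                  q_content_words, t_content_words,
                  alpha=0.8, beta=0.15, gamma=0.05):
    """S = alpha*fc + beta*fe + gamma*fl with the empirical weights
    from the text. Computing fl as min/max of the content-word counts
    is an interpretation of the stated 'ratio' that keeps it in [0, 1]."""
    fc = cosine(Counter(q_tokens), Counter(t_tokens))   # content similarity
    fe = cosine(Counter(q_ents), Counter(t_ents))       # entity similarity
    hi = max(q_content_words, t_content_words)
    fl = min(q_content_words, t_content_words) / hi if hi else 0.0  # structure
    return alpha * fc + beta * fe + gamma * fl
```

Identical titles score 1.0; partially overlapping titles score in between, dominated by the heavily weighted content term.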
The bilingual comparable corpus mining method based on cross-media information retrieval proposed by the present invention provides a new approach to mining comparable bilingual resources on the Internet. It proposes not only a method for building a large-scale multi-modal database, but also picture-similarity-assisted mining of comparable resources, and, on this basis, the two comparability measures above, so as to obtain large-scale comparable text.
The bilingual comparable corpus mining device provided by an embodiment of the present invention is introduced below; the device described below and the bilingual comparable corpus mining method described above may be referred to correspondingly.
Fig. 6 is a structural block diagram of the bilingual comparable corpus mining device provided by an embodiment of the present invention. Referring to Fig. 6, the device may include:
a knowledge base building module 100, configured to capture multiple pictures and corresponding text information in advance from databases of different languages, and to build a multi-modal knowledge base containing the pictures and the text information;
a retrieval module 200, configured to take a picture in the source-language knowledge base as a query picture, perform picture retrieval in the target-language knowledge base, and find out target pictures similar to the query picture;
a building module 300, configured to build bilingual comparable corpora according to the text information corresponding to the target pictures and the text information corresponding to the query picture.
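A minimal in-memory sketch of the multi-modal knowledge base these modules operate on: each language maps to a list of (picture, text) two-tuples, matching the storage described for the knowledge base building module. The class and method names are illustrative assumptions.

```python
from collections import defaultdict

class MultiModalKB:
    """Toy multi-modal knowledge base: per-language lists of
    (picture_id, text_info) two-tuples, as stored by the knowledge
    base building module (all names here are assumptions)."""
    def __init__(self):
        self.entries = defaultdict(list)

    def add(self, lang, picture_id, text_info):
        self.entries[lang].append((picture_id, text_info))

    def pictures(self, lang):
        return [pid for pid, _ in self.entries[lang]]

kb = MultiModalKB()
kb.add("zh", "img_001.jpg", "奥巴马访问北京")
kb.add("en", "img_101.jpg", "Obama visits Beijing")
```

The retrieval module would iterate `kb.pictures("zh")` as queries against the English side, and the building module would pair the corresponding text fields.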
As an embodiment, in the bilingual comparable corpus mining device provided by the present invention, the retrieval module may specifically include:
an extraction unit, configured to extract key points of the query picture with the scale-invariant feature transform algorithm, and to characterize the query picture as feature vectors based on the key points;
a matching unit, configured to extract the feature vectors of all candidate pictures in the target-language knowledge base, and to match the key points of the query picture and the candidate pictures;
a similarity calculation unit, configured to calculate the average Euclidean distance between all matched key points as the picture similarity between pictures;
a choosing unit, configured to rank the candidate pictures by the picture similarity, and to choose the target pictures similar to the query picture.
As an embodiment, the retrieval module above may further include:
a determining unit, configured to determine the subject classification and publication time information of the query picture;
a filtering unit, configured to filter out pictures in the target-language knowledge base that do not match the subject classification and the publication time information;
a retrieval unit, configured to perform picture retrieval in the filtered target-language knowledge base, and to find out the target pictures similar to the query picture.
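The topic/publish-time pre-filter performed by the determining and filtering units might look as follows. The dictionary field names and the 3-day window are illustrative assumptions; the text specifies only that pictures mismatching the subject classification and publication time are filtered out before retrieval.

```python
from datetime import date, timedelta

def filter_candidates(query_topic, query_date, candidates, window_days=3):
    """Keep only target-side pictures whose topic matches the query's
    and whose publication date lies within window_days of the query's.
    Field names and the 3-day window are illustrative assumptions."""
    window = timedelta(days=window_days)
    return [c for c in candidates
            if c["topic"] == query_topic
            and abs(c["date"] - query_date) <= window]

kept = filter_candidates(
    "sports", date(2017, 3, 21),
    [{"topic": "sports", "date": date(2017, 3, 20)},    # kept
     {"topic": "politics", "date": date(2017, 3, 21)},  # wrong topic
     {"topic": "sports", "date": date(2017, 2, 1)}])    # too old
```

Shrinking the candidate pool this way makes the subsequent SIFT matching, which is the expensive step, tractable at news-archive scale.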
Specifically, the building module above may include:
a text similarity calculation unit, configured to calculate the text similarity between the text information corresponding to the query picture and the text information corresponding to the target pictures;
a ranking unit, configured to re-rank the target pictures according to the text similarity;
a construction unit, configured to build bilingual comparable corpora according to the re-ranked results.
The bilingual comparable corpus mining device of this embodiment is used to implement the foregoing bilingual comparable corpus mining method; for the embodiments of the device, therefore, see the embodiment parts of the method above. For example, the knowledge base building module 100, the retrieval module 200, and the building module 300 are respectively used to implement steps S101, S102, and S103 of the bilingual comparable corpus mining method above, so their specific embodiments may refer to the descriptions of the corresponding parts and are not repeated here.
The bilingual comparable corpus mining device provided by the present invention captures multiple pictures and corresponding text information in advance from databases of different languages and builds a multi-modal knowledge base containing the pictures and text information; takes a picture in the source-language knowledge base as a query picture, performs picture retrieval in the target-language knowledge base, and finds out target pictures similar to the query picture; and builds bilingual comparable corpora according to the text information corresponding to the target pictures and the text information corresponding to the query picture. The application uses cross-media information retrieval technology, takes pictures as the medium linking the source language and the target language, and thereby obtains target-side text equivalent or comparable to the source language, providing a new method for mining comparable bilingual resources on the Internet and alleviating the scarcity of specific bilingual resources.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to each other. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and the relevant parts may refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The bilingual comparable corpus mining method and device provided by the present invention have been described in detail above. Specific examples have been applied herein to set forth the principles and embodiments of the present invention; the above description of the embodiments is only intended to help understand the method and its core concept. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made to the present invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of the present invention.
Claims (10)
1. A bilingual comparable corpus mining method, characterized by comprising:
capturing multiple pictures and corresponding text information in advance from databases of different languages, and building a multi-modal knowledge base containing the pictures and the text information;
taking a picture in the source-language knowledge base as a query picture, performing picture retrieval in the target-language knowledge base, and finding out target pictures similar to the query picture;
building bilingual comparable corpora according to the text information corresponding to the target pictures and the text information corresponding to the query picture.
2. The bilingual comparable corpus mining method according to claim 1, characterized in that capturing multiple pictures and corresponding text information in advance from databases of different languages comprises:
capturing pictures from news websites with a web crawler, the text information being the topic and/or title information corresponding to the pictures, and storing each picture and its corresponding text information as a two-tuple in the multi-modal knowledge base.
3. The bilingual comparable corpus mining method according to claim 2, characterized in that performing picture retrieval in the target-language knowledge base and finding out target pictures similar to the query picture comprises:
extracting key points of the query picture with the scale-invariant feature transform algorithm, and characterizing the query picture as feature vectors based on the key points;
extracting the feature vectors of all candidate pictures in the target-language knowledge base, and matching the key points of the query picture and the candidate pictures;
calculating the average Euclidean distance between all matched key points as the picture similarity between pictures;
ranking the candidate pictures by the picture similarity, and choosing the target pictures similar to the query picture.
4. The bilingual comparable corpus mining method according to any one of claims 1 to 3, characterized in that performing picture retrieval in the target-language knowledge base and finding out target pictures similar to the query picture comprises:
determining the subject classification and publication time information of the query picture;
filtering out pictures in the target-language knowledge base that do not match the subject classification and the publication time information;
performing picture retrieval in the filtered target-language knowledge base, and finding out the target pictures similar to the query picture.
5. The bilingual comparable corpus mining method according to any one of claims 1 to 3, characterized in that building bilingual comparable corpora according to the text information corresponding to the target pictures and the text information corresponding to the query picture comprises:
calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target pictures;
re-ranking the target pictures according to the text similarity;
building bilingual comparable corpora according to the re-ranked results.
6. The bilingual comparable corpus mining method according to claim 5, characterized in that calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target pictures comprises:
calculating the content similarity, entity similarity, and structural similarity between the text information corresponding to the query picture and the text information corresponding to the target pictures;
taking a weighted average of the content similarity, the entity similarity, and the structural similarity to obtain the text similarity of the corresponding text information.
7. A bilingual comparable corpus mining device, characterized by comprising:
a knowledge base building module, configured to capture multiple pictures and corresponding text information in advance from databases of different languages, and to build a multi-modal knowledge base containing the pictures and the text information;
a retrieval module, configured to take a picture in the source-language knowledge base as a query picture, perform picture retrieval in the target-language knowledge base, and find out target pictures similar to the query picture;
a building module, configured to build bilingual comparable corpora according to the text information corresponding to the target pictures and the text information corresponding to the query picture.
8. The bilingual comparable corpus mining device according to claim 7, characterized in that the retrieval module comprises:
an extraction unit, configured to extract key points of the query picture with the scale-invariant feature transform algorithm, and to characterize the query picture as feature vectors based on the key points;
a matching unit, configured to extract the feature vectors of all candidate pictures in the target-language knowledge base, and to match the key points of the query picture and the candidate pictures;
a similarity calculation unit, configured to calculate the average Euclidean distance between all matched key points as the picture similarity between pictures;
a choosing unit, configured to rank the candidate pictures by the picture similarity, and to choose the target pictures similar to the query picture.
9. The bilingual comparable corpus mining device according to claim 7 or 8, characterized in that the retrieval module further comprises:
a determining unit, configured to determine the subject classification and publication time information of the query picture;
a filtering unit, configured to filter out pictures in the target-language knowledge base that do not match the subject classification and the publication time information;
a retrieval unit, configured to perform picture retrieval in the filtered target-language knowledge base, and to find out the target pictures similar to the query picture.
10. The bilingual comparable corpus mining device according to claim 7 or 8, characterized in that the building module comprises:
a text similarity calculation unit, configured to calculate the text similarity between the text information corresponding to the query picture and the text information corresponding to the target pictures;
a ranking unit, configured to re-rank the target pictures according to the text similarity;
a construction unit, configured to build bilingual comparable corpora according to the re-ranked results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710169141.XA CN106980664B (en) | 2017-03-21 | 2017-03-21 | Bilingual comparable corpus mining method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106980664A true CN106980664A (en) | 2017-07-25 |
CN106980664B CN106980664B (en) | 2020-11-10 |
Family
ID=59338807
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710169141.XA Active CN106980664B (en) | 2017-03-21 | 2017-03-21 | Bilingual comparable corpus mining method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980664B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522554A (en) * | 2018-11-06 | 2019-03-26 | 中国人民解放军战略支援部队信息工程大学 | A kind of low-resource Document Classification Method and categorizing system |
CN109710923A (en) * | 2018-12-06 | 2019-05-03 | 浙江大学 | Based on across the entity language matching process across media information |
CN110110078A (en) * | 2018-01-11 | 2019-08-09 | 北京搜狗科技发展有限公司 | Data processing method and device, the device for data processing |
CN111881900A (en) * | 2020-07-01 | 2020-11-03 | 腾讯科技(深圳)有限公司 | Corpus generation, translation model training and translation method, apparatus, device and medium |
CN112818212A (en) * | 2020-04-23 | 2021-05-18 | 腾讯科技(深圳)有限公司 | Corpus data acquisition method and device, computer equipment and storage medium |
WO2021233112A1 (en) * | 2020-05-20 | 2021-11-25 | 腾讯科技(深圳)有限公司 | Multimodal machine learning-based translation method, device, equipment, and storage medium |
CN114004236A (en) * | 2021-09-18 | 2022-02-01 | 昆明理工大学 | Chinese cross-language news event retrieval method integrated with event entity knowledge |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102053991A (en) * | 2009-10-30 | 2011-05-11 | 国际商业机器公司 | Method and system for multi-language document retrieval |
CN103473327A (en) * | 2013-09-13 | 2013-12-25 | 广东图图搜网络科技有限公司 | Image retrieval method and image retrieval system |
CN103473280A (en) * | 2013-08-28 | 2013-12-25 | 中国科学院合肥物质科学研究院 | Method and device for mining comparable network language materials |
US20150278197A1 (en) * | 2014-03-31 | 2015-10-01 | Abbyy Infopoisk Llc | Constructing Comparable Corpora with Universal Similarity Measure |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102053991A (en) * | 2009-10-30 | 2011-05-11 | 国际商业机器公司 | Method and system for multi-language document retrieval |
CN103473280A (en) * | 2013-08-28 | 2013-12-25 | 中国科学院合肥物质科学研究院 | Method and device for mining comparable network language materials |
CN103473327A (en) * | 2013-09-13 | 2013-12-25 | 广东图图搜网络科技有限公司 | Image retrieval method and image retrieval system |
US20150278197A1 (en) * | 2014-03-31 | 2015-10-01 | Abbyy Infopoisk Llc | Constructing Comparable Corpora with Universal Similarity Measure |
Non-Patent Citations (5)
Title |
---|
DARJA FIŠER, ŠPELA VINTAR, NIKOLA LJUBEŠIĆ, ET AL.: "Building and using comparable corpora for domain-specific bilingual lexicon extraction", 《BUCC》 *
ZHU Z, LI M, CHEN L, ET AL.: "Building Comparable Corpora Based on Bilingual LDA Model", 《ACL》 *
WU Quan'e; XIONG Hailing: "A sentence similarity calculation method integrating multiple features", 《Computer Systems & Applications》 *
PANG Wei: "A survey of research on bilingual corpus construction", 《Information Technology and Informatization》 *
FANG Lu; GE Yundong; HONG Yu; YAO Jianmin: "Comparable corpus construction and its application in cross-language information retrieval", 《Journal of Guangxi Normal University (Natural Science Edition)》 *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110078A (en) * | 2018-01-11 | 2019-08-09 | 北京搜狗科技发展有限公司 | Data processing method and device, the device for data processing |
CN110110078B (en) * | 2018-01-11 | 2024-04-30 | 北京搜狗科技发展有限公司 | Data processing method and device for data processing |
CN109522554A (en) * | 2018-11-06 | 2019-03-26 | 中国人民解放军战略支援部队信息工程大学 | A kind of low-resource Document Classification Method and categorizing system |
CN109710923A (en) * | 2018-12-06 | 2019-05-03 | 浙江大学 | Based on across the entity language matching process across media information |
CN109710923B (en) * | 2018-12-06 | 2020-09-01 | 浙江大学 | Cross-language entity matching method based on cross-media information |
CN112818212A (en) * | 2020-04-23 | 2021-05-18 | 腾讯科技(深圳)有限公司 | Corpus data acquisition method and device, computer equipment and storage medium |
CN112818212B (en) * | 2020-04-23 | 2023-10-13 | 腾讯科技(深圳)有限公司 | Corpus data acquisition method, corpus data acquisition device, computer equipment and storage medium |
WO2021233112A1 (en) * | 2020-05-20 | 2021-11-25 | 腾讯科技(深圳)有限公司 | Multimodal machine learning-based translation method, device, equipment, and storage medium |
CN111881900A (en) * | 2020-07-01 | 2020-11-03 | 腾讯科技(深圳)有限公司 | Corpus generation, translation model training and translation method, apparatus, device and medium |
CN111881900B (en) * | 2020-07-01 | 2022-08-23 | 腾讯科技(深圳)有限公司 | Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium |
CN114004236A (en) * | 2021-09-18 | 2022-02-01 | 昆明理工大学 | Chinese cross-language news event retrieval method integrated with event entity knowledge |
CN114004236B (en) * | 2021-09-18 | 2024-04-30 | 昆明理工大学 | Cross-language news event retrieval method integrating knowledge of event entity |
Also Published As
Publication number | Publication date |
---|---|
CN106980664B (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110502621B (en) | Question answering method, question answering device, computer equipment and storage medium | |
CN106980664A (en) | A kind of bilingual comparable corpora mining method and device | |
US9489401B1 (en) | Methods and systems for object recognition | |
CN108280114B (en) | Deep learning-based user literature reading interest analysis method | |
CN102053991B (en) | Method and system for multi-language document retrieval | |
CN103984738B (en) | Role labelling method based on search matching | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
Sarawagi et al. | Open-domain quantity queries on web tables: annotation, response, and consensus models | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
CN103544266B (en) | A kind of method and device for searching for suggestion word generation | |
US8606780B2 (en) | Image re-rank based on image annotations | |
TWI656450B (en) | Method and system for extracting knowledge from Chinese corpus | |
CN105045852A (en) | Full-text search engine system for teaching resources | |
CN108509521B (en) | Image retrieval method for automatically generating text index | |
CN107193892B (en) | A kind of document subject matter determines method and device | |
CN112015907A (en) | Method and device for quickly constructing discipline knowledge graph and storage medium | |
Huang et al. | AKMiner: Domain-specific knowledge graph mining from academic literatures | |
CN113569050A (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
Zhou et al. | Automatic image–text alignment for large-scale web image indexing and retrieval | |
JP2021501387A (en) | Methods, computer programs and computer systems for extracting expressions for natural language processing | |
Moncla et al. | Mapping urban fingerprints of odonyms automatically extracted from French novels | |
Song et al. | Cross-language record linkage based on semantic matching of metadata | |
George et al. | A novel sequence graph representation for searching and retrieving sequences of long text in the domain of information retrieval | |
Aghaebrahimian et al. | Named entity disambiguation at scale | |
Hazman et al. | An ontology based approach for automatically annotating document segments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||