CN106980664B - Bilingual comparable corpus mining method and device - Google Patents


Info

Publication number
CN106980664B
CN106980664B (application CN201710169141.XA)
Authority
CN
China
Prior art keywords
picture
similarity
pictures
query
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710169141.XA
Other languages
Chinese (zh)
Other versions
CN106980664A (en)
Inventor
洪宇 (Hong Yu)
姚亮 (Yao Liang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201710169141.XA priority Critical patent/CN106980664B/en
Publication of CN106980664A publication Critical patent/CN106980664A/en
Application granted granted Critical
Publication of CN106980664B publication Critical patent/CN106980664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 — Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 — Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a bilingual comparable corpus mining method and a bilingual comparable corpus mining device, wherein a multi-mode knowledge base containing pictures and text information is established by capturing a plurality of pictures and corresponding text information from databases of different languages in advance; taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in the target language knowledge base, and finding out a target picture similar to the query picture; and constructing the bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture. According to the method, a cross-media information retrieval technology is adopted, the picture is used as a medium for communicating the source language and the target language, so that an equivalent or comparable text of the source language at the target end is obtained, a new method is provided for mining bilingual comparable resources in the Internet, and the problem of scarcity of specific bilingual resources is solved.

Description

Bilingual comparable corpus mining method and device
Technical Field
The invention relates to the technical field of computers, in particular to a bilingual comparable corpus mining method and device.
Background
A bilingual comparable corpus is a collection of texts in different languages that express similar semantics. Large-scale bilingual comparable corpora typically contain a wide variety of bilingual inter-translation units, such as phrase-level and sentence-level translation pairs and bilingual dictionary entries. For low-resource languages and certain restricted domains, parallel resources are usually scarce, but bilingual comparable corpora are relatively easy to obtain. Bilingual comparable corpora are therefore an important resource in machine translation and cross-language information retrieval, and automatically acquiring them at large scale has become a fundamental task in machine translation.
Current research on bilingual comparable corpus acquisition can be roughly divided into three categories. The first is construction based on cross-language information retrieval: keywords are extracted from source-language documents, translated into the target language with a bilingual dictionary, and used as a retrieval query to obtain a candidate set of target-language documents, from which comparable bilingual documents are selected. The second is construction based on content and structural similarity: a source-language document is translated into the target language with a translation engine (e.g., Google Translate or Bing Translate) to obtain a pseudo-translation of the source document; the similarity between the pseudo-translation and target-language documents is then evaluated in terms of vocabulary, topic, and structure, and similar documents are selected in order. The third is extraction of bilingual comparable corpora from a structured knowledge base (e.g., Wikipedia), using the knowledge base's existing category entries and link information to obtain document-level comparable resources. Existing methods thus start only from the angle of text similarity and construct bilingual comparable corpus resources by means of a bilingual dictionary or an existing knowledge base. They depend on large manually annotated dictionaries and knowledge bases, or on the performance of existing translation systems, and when applied to low-resource languages or specific domains their generality is limited by the scarcity of language resources.
Disclosure of Invention
The invention aims to provide a bilingual comparable corpus mining method and device to solve the problem that existing methods start only from the angle of text similarity and generalize poorly when specific language resources are scarce.
To solve the above technical problem, the present invention provides a bilingual comparable corpus mining method, comprising:
capturing a plurality of pictures and corresponding text information from databases of different languages in advance, and establishing a multi-mode knowledge base containing the pictures and the text information;
taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in a target language knowledge base, and finding out a target picture similar to the query picture;
and constructing a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
Optionally, the capturing a plurality of pictures and corresponding text information from databases of different languages in advance includes:
and capturing pictures from a news website by using a web crawler, wherein the text information is subject and/or title information corresponding to the pictures, and storing the pictures and the corresponding text information in the multi-mode knowledge base as a binary group.
Optionally, the retrieving the picture in the target language knowledge base, and finding the target picture similar to the query picture includes:
extracting key points of the query picture by adopting a scale invariant feature transformation algorithm, and characterizing the query picture as feature vectors based on the key points;
extracting the characteristic vectors of all candidate pictures in the target language knowledge base, and matching the query picture with the key points of the candidate pictures;
calculating the average Euclidean distance between all the matched key points to serve as the picture similarity between the pictures;
and sequencing the candidate pictures according to the picture similarity, and selecting a target picture similar to the query picture.
Optionally, the retrieving the picture in the target language knowledge base, and finding the target picture similar to the query picture includes:
determining topic classification and release time information of the query picture;
filtering pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
and searching pictures in the filtered target language knowledge base to find out a target picture similar to the query picture.
Optionally, the constructing a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture includes:
calculating text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;
reordering the target pictures according to the text similarity;
and constructing bilingual comparable corpora according to the reordered result.
Optionally, the calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture includes:
calculating content similarity, entity similarity and structure similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;
and carrying out weighted average on the content similarity, the entity similarity and the structure similarity, and calculating to obtain the text similarity of the corresponding text information.
The invention also provides a bilingual comparable corpus mining device, which comprises:
the knowledge base establishing module is used for capturing a plurality of pictures and corresponding text information from databases of different languages in advance and establishing a multi-mode knowledge base containing the pictures and the text information;
the searching module is used for taking the picture in the source language knowledge base as a query picture, searching the picture in the target language knowledge base and searching out a target picture similar to the query picture;
and the construction module is used for constructing bilingual comparable corpora according to the text information corresponding to the target picture and the text information corresponding to the query picture.
Optionally, the search module includes:
the extraction unit is used for extracting key points of the query picture by adopting a scale invariant feature transformation algorithm and characterizing the query picture as feature vectors based on the key points;
the matching unit is used for extracting the characteristic vectors of all candidate pictures in the target language knowledge base and matching the query picture with the key points of the candidate pictures;
the similarity calculation unit is used for calculating the average Euclidean distance between all the matched key points as the picture similarity between the pictures;
and the selecting unit is used for sequencing the candidate pictures according to the picture similarity and selecting the target picture similar to the query picture.
Optionally, the search module further includes:
the determining unit is used for determining the topic classification and the release time information of the query picture;
the filtering unit is used for filtering the pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
and the searching unit is used for searching pictures in the filtered target language knowledge base and finding out a target picture similar to the query picture.
Optionally, the building module comprises:
the text similarity calculation unit is used for calculating text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;
the sequencing unit is used for reordering the target pictures according to the text similarity;
and the construction unit is used for constructing the bilingual comparable corpus according to the result after reordering.
The bilingual comparable corpus mining method and the bilingual comparable corpus mining device provided by the invention build a multi-modal knowledge base comprising pictures and text information by capturing a plurality of pictures and corresponding text information from databases of different languages in advance; taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in the target language knowledge base, and finding out a target picture similar to the query picture; and constructing the bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture. According to the method, a cross-media information retrieval technology is adopted, the picture is used as a medium for communicating the source language and the target language, so that an equivalent or comparable text of the source language at the target end is obtained, a new method is provided for mining bilingual comparable resources in the Internet, and the problem of scarcity of specific bilingual resources is solved.
Drawings
In order to more clearly illustrate the embodiments or technical solutions of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a bilingual comparable corpus mining method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of similar pictures and comparable titles mined in a news text;
FIG. 3 is a diagram illustrating a process of finding a target picture similar to the query picture according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another exemplary embodiment of a bilingual comparable corpus mining method according to the present invention;
FIG. 5 is a diagram illustrating matching results of key points in a picture;
fig. 6 is a block diagram illustrating a structure of a bilingual comparable corpus mining device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of a specific embodiment of a bilingual comparable corpus mining method according to the present invention, which includes:
step S101: capturing a plurality of pictures and corresponding text information from databases of different languages in advance, and establishing a multi-mode knowledge base containing the pictures and the text information;
specifically, the embodiment of the invention can utilize a web crawler to capture large-scale pictures and corresponding text information from news websites of different languages, the text information can be subject and/or title information corresponding to the pictures, and the pictures and the corresponding text information are stored in the multi-mode knowledge base as a binary group.
It should be noted that the present invention is based on Cross-media information retrieval (CMIR) technology, which is very similar to Cross-language information retrieval (CLIR) technology, and the only difference is the media communicating the source language and the target language. Cross-media information retrieval uses pictures, rather than text, as a bridge to communicate the source and target languages, and thus obtains equivalent or comparable text in the source language at the target.
As a specific implementation, the knowledge base of CMIR may contain two parts, namely pictures and picture titles, where each picture title describes the content of its picture and corresponds to it one-to-one; refer to the example of fig. 2. CMIR uses a similar-picture search engine to link texts in different languages: the titles corresponding to similar pictures are regarded as comparable bilingual texts. A CLIR system, by contrast, translates the source-language text to the target-language end using a weak translation system or a bilingual dictionary, and retrieves similar text at the target end using the translation result as a retrieval query. CMIR is therefore somewhat simpler to use than CLIR: the only consideration for CMIR is how to improve retrieval quality, whereas CLIR must additionally consider the quality of the bilingual dictionary and the poor translation performance of the translator. The bilingual comparable corpus mining method based on cross-media information retrieval provided by the embodiment of the invention is thus a first attempt to mine bilingual comparable corpora by exploiting picture similarity.
Step S102: taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in a target language knowledge base, and finding out a target picture similar to the query picture;
and establishing a similar picture retrieval system among the multi-modal knowledge bases of different languages so as to retrieve similar pictures in the target language knowledge base.
Referring to fig. 3, the process of retrieving a picture in a target language knowledge base and finding out a target picture similar to the query picture in the embodiment of the present invention may specifically include:
step S1021: extracting key points of the query picture by adopting a scale invariant feature transformation algorithm, and characterizing the query picture as feature vectors based on the key points;
step S1022: extracting the characteristic vectors of all candidate pictures in the target language knowledge base, and matching the query picture with the key points of the candidate pictures;
step S1023: calculating the average Euclidean distance between all the matched key points to serve as the picture similarity between the pictures;
step S1024: and sequencing the candidate pictures according to the picture similarity, and selecting a target picture similar to the query picture.
Step S103: and constructing a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
The bilingual comparable corpus mining method provided by the invention is characterized in that a multi-mode knowledge base containing pictures and character information is established by capturing a plurality of pictures and corresponding character information from databases of different languages in advance; taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in the target language knowledge base, and finding out a target picture similar to the query picture; and constructing the bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture. According to the method, a cross-media information retrieval technology is adopted, the picture is used as a medium for communicating the source language and the target language, so that an equivalent or comparable text of the source language at the target end is obtained, a new method is provided for mining bilingual comparable resources in the Internet, and the problem of scarcity of specific bilingual resources is solved.
On the basis of any of the above embodiments, in order to improve the accuracy and the operating efficiency of the similar picture retrieval system, the embodiment of the present invention further provides a model for performing optimized retrieval on similar pictures by combining news topic information and time tag information. Specifically, in the bilingual comparable corpus mining method provided by the present invention, the picture retrieval is performed in the target language knowledge base, and the process of finding out the target picture similar to the query picture may specifically be:
determining topic classification and release time information of the query picture;
filtering pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
and searching pictures in the filtered target language knowledge base to find out a target picture similar to the query picture.
Referring to fig. 4, a flowchart of another embodiment of the bilingual comparable corpus mining method according to the present invention is shown, and the detailed implementation process of the method is further described in detail below.
The process comprises the following steps:
step S201: and (4) constructing a multi-mode knowledge base, wherein the module crawls pictures and corresponding picture titles from news websites of multiple languages by using a web crawler, and the picture and the corresponding title are paired to form a binary group to construct the multi-mode knowledge base.
The method comprises the following steps: according to news website column division, crawling corresponding news pages under different columns or topics respectively; analyzing a news page, extracting pictures and picture titles through webpage structural analysis, and extracting news release time; and forming a binary group by the picture and the corresponding picture title, and storing the binary group into a multi-mode knowledge base.
News webpages under each topic or column are captured according to the topic columns of the news website and stored in the corresponding topic records. Pictures and picture titles in a webpage are matched according to the picture tags (e.g., img, pic), while webpage tags irrelevant to the content (e.g., <a>, <br>) are removed from the title text. Each picture and its picture title are expressed as a binary group, forming one record of the multi-modal knowledge base.
Taking the mining of Chinese-English comparable resources as an example, the invention crawls news from mainstream Chinese news websites (e.g., Xinhua, Phoenix/ifeng, and Sina) and English news websites (FOX, CNN, and BBC), classifies the news according to the columns given by the websites (e.g., military, finance, science and technology, sports, entertainment), and records the release time of each news item. For the captured news, the embodiment of the invention performs structural analysis on the news webpage, extracts the pictures and picture titles in the webpage, and stores each pair as a binary group (image, title) as one record in the multi-modal database. In this way, multi-modal databases in different languages are built.
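As a minimal sketch of this record construction (the class and helper names are hypothetical, not from the patent), each crawled picture can be paired with its cleaned title plus the topic and release time used later for filtering:

```python
import re
from dataclasses import dataclass

# Web-page tags judged irrelevant to the caption content (e.g. <a>, <br>).
TAG_RE = re.compile(r"</?(?:a|br|span)[^>]*>")

@dataclass
class KBRecord:
    image_url: str   # picture extracted from the news page
    title: str       # picture title describing the picture's content
    topic: str       # news column, e.g. "sports"
    pub_time: str    # news release time, e.g. "2017-03-21"

def clean_title(raw_title: str) -> str:
    """Strip content-irrelevant web-page tags from a picture title."""
    return TAG_RE.sub("", raw_title).strip()

def make_record(image_url: str, raw_title: str, topic: str, pub_time: str) -> KBRecord:
    """Build one (image, title) record of the multi-modal knowledge base."""
    return KBRecord(image_url, clean_title(raw_title), topic, pub_time)
```

A real crawler would populate these records per topic column; here the cleaning step is the part the description makes explicit.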
Step S202: determining topic classification and release time information of the query picture; filtering pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
the existing topic classification label system of the news website is manually integrated, and for the news websites with different languages, the cross-language topic label mapping is realized by utilizing a translation dictionary or by means of a machine translation technology. Different languages and different news websites have respective news classification systems, so that a uniform classification standard needs to be established. Such as mapping "Sports" to "Sports".
Picture retrieval is optimized by fusing news topics: each picture is labeled according to the topic category of its news item, and when similar pictures are retrieved, the search range is limited to the same topic. Comparable news tends to share the same or similar subject matter, so restricting the topic of the picture source, retrieving candidate pictures under the same topic, and ranking them there filters out invalid candidates.
Picture retrieval is likewise optimized by fusing time tags: news items are automatically partitioned by release time, and similar-picture retrieval is limited to a specified time window; changing the window threshold yields different retrieval results. News events are time-sensitive, so similar news pictures tend to appear in similar periods. The release time t of a news picture is therefore used as a label, different time windows d are set, and similar pictures are retrieved within the corresponding interval [t − d, t + d].
According to the method, the search space of the similar picture retrieval module is limited by using the subject classification and release time labels of the news where the pictures are located, so that dissimilar picture candidates are filtered, and the efficiency and performance of the similar picture retrieval system are stably improved when the system is oriented to a large-scale data set.
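The topic and time-window restriction above can be sketched as a simple pre-filter (function and field names are illustrative assumptions): candidates outside the query's topic or outside the interval [t − d, t + d] are discarded before any picture matching is attempted.

```python
from datetime import date, timedelta

def filter_candidates(candidates, query_topic, query_time, window_days):
    """Keep target-language records that share the query's topic and whose
    release time falls within [t - d, t + d] around the query's time t."""
    d = timedelta(days=window_days)
    return [c for c in candidates
            if c["topic"] == query_topic
            and query_time - d <= c["time"] <= query_time + d]
```

Widening `window_days` trades recall against the number of candidates the picture matcher must score, matching the document's note that different thresholds yield different retrieval results.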
Step S203: searching pictures in the filtered target language knowledge base, and finding out a target picture similar to the query picture;
by extracting the key point characteristics of the pictures and calculating the similarity between the pictures based on the extracted characteristics, the retrieval of similar pictures is realized in the multi-modal knowledge base of different languages.
The method comprises the following steps: taking a picture in the source language knowledge base as a query picture, extracting key points of the picture with a scale-invariant feature transform algorithm, and characterizing the picture as feature vectors based on the key points; extracting the feature vectors of all candidate pictures in the target picture library, and matching key points between the query picture and each target candidate picture; calculating the average Euclidean distance between all matched key points as the similarity measure between the pictures; and, for the query picture, ranking all pictures in the target picture library by picture similarity and selecting the Top-N pictures as the similar picture set.
Specifically, for a keypoint x in the query picture, the two most similar keypoints y and z are found in the target candidate picture, where the similarity between two keypoints is given by the Euclidean distance between their feature vectors.
Whether keypoint x matches keypoint y of the target picture is decided by whether the ratio of the Euclidean distance to the nearest keypoint over that to the second-nearest keypoint is smaller than a specified threshold: if d(x, y) < d(x, z) × θ, then keypoint x is considered to match keypoint y, where the threshold θ ∈ [0, 1] is typically set to 0.8. Fig. 5 shows the matching result of the key points in two pictures.
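The ratio test just described can be sketched as follows (pure-Python, with hypothetical helper names; a real system would use SIFT descriptors rather than toy vectors):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two key-point feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_keypoint(x, candidate_keypoints, theta=0.8):
    """Ratio test: x matches its nearest candidate y only if
    d(x, y) < d(x, z) * theta, where z is the second-nearest candidate."""
    ranked = sorted(candidate_keypoints, key=lambda y: euclidean(x, y))
    y, z = ranked[0], ranked[1]
    return y if euclidean(x, y) < euclidean(x, z) * theta else None
```

When the nearest and second-nearest distances are close, the match is ambiguous and rejected, which is the purpose of the θ threshold.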
And calculating the similarity between the pictures. And calculating the similarity between the whole pictures according to all matched key points in the pictures. Specifically, the average euclidean distance between all the matching points is calculated. The formula is as follows:
$$\mathrm{sim} = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n}\left(x_{ij}-y_{ij}\right)^{2}}$$
m represents the number of matched key points in the two pictures, and n represents the dimensionality of the feature vector of the key points.
The target candidate picture set is then ranked by picture similarity, and the Top-N similar pictures are selected.
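A sketch of this scoring and ranking step (names are illustrative): the average Euclidean distance over matched key-point pairs scores each candidate, and candidates are ranked in ascending order (smaller distance = more similar) to take the Top-N.

```python
import math

def avg_matched_distance(matched_pairs):
    """Average Euclidean distance over all m matched key-point pairs;
    smaller values indicate more similar pictures."""
    return sum(math.dist(x, y) for x, y in matched_pairs) / len(matched_pairs)

def top_n_similar(matches_by_candidate, n):
    """Rank candidate picture ids by ascending average distance; keep Top-N."""
    ranked = sorted(matches_by_candidate,
                    key=lambda c: avg_matched_distance(matches_by_candidate[c]))
    return ranked[:n]
```
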
Step S204: and constructing a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
By means of the similar picture retrieval system, similar picture retrieval is performed under news topics related to the target domain, and further bilingual comparable text is extracted. The invention provides a title comparability measurement based on image similarity modeling and a title comparability measurement based on fusion of image similarity and text similarity modeling.
The title comparability measuring method based on image similarity modeling specifically comprises the following steps:
and using a similar picture retrieval system, taking the picture of the source language as input, retrieving a similar picture set from a target terminal, and calculating the similarity between the query picture and the retrieval result picture based on an SIFT algorithm.
The comparability of the picture titles is modeled by using the picture similarity, and the comparability between the titles is measured by using the following formula as a ranking result of the comparability of the picture titles by using the ranking result of the pictures:
[Formula not recovered from the original image: it defines the title comparability S(s_cap, t_cap) in terms of the picture similarity v(s_img, t_img) and a constant b, as explained below.]
where S(s_cap, t_cap) denotes the comparability between title s_cap and title t_cap, v(s_img, t_img) denotes the similarity between picture s_img and picture t_img as computed by the SIFT algorithm, and b is a constant.
The title comparability measuring method for fusing image similarity and text similarity specifically comprises the following steps:
and using a similar picture retrieval system, taking the picture of the source language as input, retrieving a similar picture set from a target terminal, and calculating the similarity between the query picture and the retrieval result picture based on an SIFT algorithm.
And calculating the text similarity of the query picture title and the target candidate picture title. The method comprehensively evaluates the text similarity of the query title and the candidate title from three angles of content similarity (fc), entity similarity (fe) and structure similarity (fl). The specific formula is as follows:
S = α·fc + β·fe + γ·fl
The image similarity and the text similarity score between picture titles are then fused, and the similar picture retrieval results (originally ranked by image similarity) are reordered. Comparable text pairs are constructed from the reordered results. The invention adopts the following formula to calculate the comprehensive similarity between the query and each candidate:
(The comprehensive-similarity formula is rendered only as an image in the original publication; it fuses the picture similarity v(s_img, t_img) with the text similarity S_txt(s_cap, t_cap), with the constant b controlling the picture-similarity contribution.)
where S_txt(s_cap, t_cap) represents the text similarity between the query title and the retrieved title, and b is a constant that controls the contribution of the picture similarity to the final similarity score.
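The fusion and reranking step can be sketched as below. Because the patent's fusion formula is given only as an image, the convex combination used here is a hedged stand-in, not the actual formula; it preserves the stated role of b as the knob controlling the picture-similarity contribution.

```python
def fused_score(v_img, s_txt, b=0.5):
    """Hedged stand-in for the fusion formula (shown only as an image in
    the original): a convex combination in which b weights the picture
    similarity against the text similarity."""
    return b * v_img + (1.0 - b) * s_txt

def rerank(candidates, b=0.5):
    """candidates: list of (title, v_img, s_txt) triples.
    Returns the candidate titles reordered best-first by fused score."""
    ranked = sorted(candidates,
                    key=lambda c: fused_score(c[1], c[2], b),
                    reverse=True)
    return [title for title, v_img, s_txt in ranked]
```

With b close to 1 the ranking follows the picture similarity; with b close to 0 it follows the text similarity.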
The text similarity between the query picture title and a candidate picture title is calculated as follows: the query title is translated into the target language with the Google online translation system, preserving important information of the original text such as word order, sentence structure, and named entities.
The comparability of the query title translation and the target picture title is evaluated based on the following three characteristics:
Content similarity: the translated query title is segmented into words, stop words are removed, and words are reduced to their roots to obtain a bag-of-words representation. Based on the vector space model, the cosine similarity between the query title and the target title is calculated (denoted fc).
Entity similarity: named entities in the picture titles are identified with the Stanford named entity recognition tool to obtain a named-entity bag-of-words set, and the named-entity similarity between the bilingual texts is calculated based on the vector space model (denoted fe).
Structural similarity: the ratio of the numbers of content words (including nouns, verbs, adverbs, adjectives, proper nouns, etc.) in the query title and the target title is used as the structural comparability criterion (denoted fl).
The three features are fused by weighted averaging to obtain the comparability score between the query title and the target title:

S = α·fc + β·fe + γ·fl

Here α = 0.8, β = 0.15, and γ = 0.05, set empirically according to each feature's contribution to comparability.
The bilingual comparable corpus mining method based on cross-media information retrieval provided by the invention offers a new way to mine bilingual comparable resources from the Internet: it constructs a large-scale multi-modal knowledge base, uses picture similarity to assist the mining of comparable resources, and on this basis provides two comparability measures for obtaining large-scale comparable texts.
In the following, the bilingual comparable corpus mining device provided in the embodiment of the present invention is introduced, and the bilingual comparable corpus mining device described below and the bilingual comparable corpus mining method described above may be referred to correspondingly.
Fig. 6 is a block diagram illustrating the structure of a bilingual comparable corpus mining apparatus according to an embodiment of the present invention; the apparatus shown in Fig. 6 may include:
the knowledge base establishing module 100 is configured to capture a plurality of pictures and corresponding text information from databases of different languages in advance, and establish a multi-modal knowledge base including the pictures and the text information;
the searching module 200 is configured to use a picture in a source language knowledge base as a query picture, perform picture retrieval in a target language knowledge base, and search for a target picture similar to the query picture;
the constructing module 300 is configured to construct a bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture.
As a specific implementation manner, in the bilingual comparable corpus mining apparatus provided in the present invention, the searching module may specifically include:
the extraction unit is used for extracting key points of the query picture by adopting a scale invariant feature transformation algorithm and characterizing the query picture as feature vectors based on the key points;
the matching unit is used for extracting the characteristic vectors of all candidate pictures in the target language knowledge base and matching the query picture with the key points of the candidate pictures;
the similarity calculation unit is used for calculating the average Euclidean distance between all the matched key points as the picture similarity between the pictures;
and the selecting unit is used for sequencing the candidate pictures according to the picture similarity and selecting the target picture similar to the query picture.
As a specific implementation manner, the search module may further include:
the determining unit is used for determining the topic classification and the release time information of the query picture;
the filtering unit is used for filtering the pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
and the searching unit is used for searching pictures in the filtered target language knowledge base and finding out target pictures similar to the query picture.
Specifically, the building module may specifically include:
the text similarity calculation unit is used for calculating text similarity of the character information corresponding to the query picture and the character information corresponding to the target picture;
the sequencing unit is used for reordering the target pictures according to the text similarity;
and the construction unit is used for constructing the bilingual comparable corpus according to the result after reordering.
The bilingual comparable corpus mining device of the present embodiment is used to implement the above-mentioned bilingual comparable corpus mining method, and therefore the specific implementation manner of the bilingual comparable corpus mining device can be found in the embodiment parts of the bilingual comparable corpus mining method in the foregoing, for example, the knowledge base establishing module 100, the searching module 200, and the constructing module 300 are respectively used to implement the steps S101, S102, and S103 of the above-mentioned bilingual comparable corpus mining method, so that the specific implementation manner thereof may refer to the description of the corresponding respective part embodiments, and will not be described herein again.
The bilingual comparable corpus mining device provided by the invention establishes a multi-modal knowledge base containing pictures and character information by capturing a plurality of pictures and corresponding character information from databases of different languages in advance; taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in the target language knowledge base, and finding out a target picture similar to the query picture; and constructing the bilingual comparable corpus according to the text information corresponding to the target picture and the text information corresponding to the query picture. According to the method, a cross-media information retrieval technology is adopted, the picture is used as a medium for communicating the source language and the target language, so that an equivalent or comparable text of the source language at the target end is obtained, a new method is provided for mining bilingual comparable resources in the Internet, and the problem of scarcity of specific bilingual resources is solved.
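The three modules can be tied together in a toy end-to-end loop as follows. Both similarity functions are pluggable stand-ins for the SIFT-based and translation-based measures described above, and the fusion form and acceptance threshold are assumptions, not values from the patent.

```python
def mine_pairs(source_items, target_items, img_sim, txt_sim,
               b=0.5, threshold=0.4):
    """For each source-language picture, score every target-side candidate
    with a fused image/text similarity and keep the best pair above a
    threshold. Items are dicts with 'title' and 'img' fields; img_sim and
    txt_sim are caller-supplied similarity functions."""
    pairs = []
    for s in source_items:
        best_score, best_title = 0.0, None
        for t in target_items:
            score = (b * img_sim(s["img"], t["img"])
                     + (1 - b) * txt_sim(s["title"], t["title"]))
            if score > best_score:
                best_score, best_title = score, t["title"]
        if best_title is not None and best_score >= threshold:
            pairs.append((s["title"], best_title))
    return pairs
```

In the real system img_sim would be the SIFT-based measure and txt_sim the translation-plus-three-feature measure; here any pair of functions returning scores in [0, 1] works.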
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The bilingual comparable corpus mining method and device provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (8)

1. A bilingual comparable corpus mining method, comprising:
capturing a plurality of pictures and corresponding text information from databases of different languages in advance, and establishing a multi-mode knowledge base containing the pictures and the text information;
taking the picture in the source language knowledge base as a query picture, carrying out picture retrieval in a target language knowledge base, and finding out a target picture similar to the query picture;
calculating text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;
calculating comprehensive similarity according to the image similarity of the query image and the target image and the text similarity, and reordering the target images according to the comprehensive similarity; the method for calculating the comprehensive similarity comprises the following steps:
(formula rendered only as an image in the original publication)
S(s_cap, t_cap) represents the comprehensive similarity between the query title s_cap and the retrieval title t_cap; v(s_img, t_img) represents the picture similarity between the query picture s_img and the target picture t_img; S_txt(s_cap, t_cap) represents the text similarity between the query title s_cap and the retrieval title t_cap; b is a constant used to control the contribution of the picture similarity to the final similarity score; wherein the retrieval title is specifically the title of the target picture;
and constructing bilingual comparable corpora according to the reordered result.
2. The bilingual comparable corpus mining method according to claim 1, wherein said pre-fetching of multiple pictures and corresponding textual information from databases in different languages comprises:
and capturing pictures from a news website by using a web crawler, wherein the text information is the title information corresponding to the pictures, and storing the pictures and the corresponding text information in the multi-mode knowledge base as a binary group.
3. The bilingual comparable corpus mining method of claim 2, wherein said retrieving pictures in the target language knowledge base, finding out a target picture similar to the query picture comprises:
extracting key points of the query picture by adopting a scale invariant feature transformation algorithm, and characterizing the query picture as feature vectors based on the key points;
extracting the characteristic vectors of all candidate pictures in the target language knowledge base, and matching the query picture with the key points of the candidate pictures;
calculating the average Euclidean distance between all the matched key points to serve as the picture similarity between the pictures;
and sequencing the candidate pictures according to the picture similarity, and selecting a target picture similar to the query picture.
4. The bilingual comparable corpus mining method of any one of claims 1 to 3, wherein said retrieving pictures in the target language knowledge base, finding out a target picture similar to the query picture comprises:
determining topic classification and release time information of the query picture;
filtering pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
and searching pictures in the filtered target language knowledge base to find out a target picture similar to the query picture.
5. The bilingual comparable corpus mining method according to claim 1, wherein the calculating the text similarity between the text information corresponding to the query picture and the text information corresponding to the target picture comprises:
calculating content similarity, entity similarity and structure similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;
and carrying out weighted average on the content similarity, the entity similarity and the structure similarity, and calculating to obtain the text similarity of the corresponding text information.
6. A bilingual comparable corpus mining device, comprising:
the knowledge base establishing module is used for capturing a plurality of pictures and corresponding text information from databases of different languages in advance and establishing a multi-mode knowledge base containing the pictures and the text information;
the searching module is used for taking the picture in the source language knowledge base as a query picture, searching the picture in the target language knowledge base and searching out a target picture similar to the query picture;
the construction module is used for calculating text similarity of the text information corresponding to the query picture and the text information corresponding to the target picture;
calculating comprehensive similarity according to the image similarity of the query image and the target image and the text similarity, and reordering the target images according to the comprehensive similarity; the method for calculating the comprehensive similarity comprises the following steps:
(formula rendered only as an image in the original publication)
S(s_cap, t_cap) represents the comprehensive similarity between the query title s_cap and the retrieval title t_cap; v(s_img, t_img) represents the picture similarity between the query picture s_img and the target picture t_img; S_txt(s_cap, t_cap) represents the text similarity between the query title s_cap and the retrieval title t_cap; b is a constant used to control the contribution of the picture similarity to the final similarity score; wherein the retrieval title is specifically the title of the target picture;
and constructing bilingual comparable corpora according to the reordered result.
7. The bilingual comparably corpus mining device of claim 6, wherein said lookup module comprises:
the extraction unit is used for extracting key points of the query picture by adopting a scale invariant feature transformation algorithm and characterizing the query picture as feature vectors based on the key points;
the matching unit is used for extracting the characteristic vectors of all candidate pictures in the target language knowledge base and matching the query picture with the key points of the candidate pictures;
the similarity calculation unit is used for calculating the average Euclidean distance between all the matched key points as the picture similarity between the pictures;
and the selecting unit is used for sequencing the candidate pictures according to the picture similarity and selecting the target picture similar to the query picture.
8. The bilingual comparably corpus mining device of claim 6 or 7, wherein said lookup module further comprises:
the determining unit is used for determining the topic classification and the release time information of the query picture;
the filtering unit is used for filtering the pictures which are not matched with the topic classification and the release time information in the target language knowledge base;
and the searching unit is used for searching pictures in the filtered target language knowledge base and finding out target pictures similar to the query picture.
CN201710169141.XA 2017-03-21 2017-03-21 Bilingual comparable corpus mining method and device Active CN106980664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710169141.XA CN106980664B (en) 2017-03-21 2017-03-21 Bilingual comparable corpus mining method and device


Publications (2)

Publication Number Publication Date
CN106980664A CN106980664A (en) 2017-07-25
CN106980664B true CN106980664B (en) 2020-11-10

Family

ID=59338807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710169141.XA Active CN106980664B (en) 2017-03-21 2017-03-21 Bilingual comparable corpus mining method and device

Country Status (1)

Country Link
CN (1) CN106980664B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110078B (en) * 2018-01-11 2024-04-30 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN109522554B (en) * 2018-11-06 2022-12-02 中国人民解放军战略支援部队信息工程大学 Low-resource document classification method and classification system
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN112818212B (en) * 2020-04-23 2023-10-13 腾讯科技(深圳)有限公司 Corpus data acquisition method, corpus data acquisition device, computer equipment and storage medium
CN111597830A (en) * 2020-05-20 2020-08-28 腾讯科技(深圳)有限公司 Multi-modal machine learning-based translation method, device, equipment and storage medium
CN111881900B (en) * 2020-07-01 2022-08-23 腾讯科技(深圳)有限公司 Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN114004236B (en) * 2021-09-18 2024-04-30 昆明理工大学 Cross-language news event retrieval method integrating knowledge of event entity

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473327A (en) * 2013-09-13 2013-12-25 广东图图搜网络科技有限公司 Image retrieval method and image retrieval system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053991B (en) * 2009-10-30 2014-07-02 国际商业机器公司 Method and system for multi-language document retrieval
CN103473280B (en) * 2013-08-28 2017-02-08 中国科学院合肥物质科学研究院 Method for mining comparable network language materials
RU2607975C2 (en) * 2014-03-31 2017-01-11 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Constructing corpus of comparable documents based on universal measure of similarity

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473327A (en) * 2013-09-13 2013-12-25 广东图图搜网络科技有限公司 Image retrieval method and image retrieval system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Building and using comparable corpora for domain-specific bilingual lexicon extraction;Darja Fišer,Špela Vintar,Nikola Ljubešić,et al.;《BUCC》;20110624;第19-26页 *
Building Comparable Corpora Based on Bilingual LDA Model;Zhu Z,Li M,Chen L,et al.;《ACL》;20130809;第282-287页 *
A sentence similarity calculation method integrating multiple features; Wu Quan'e; Xiong Hailing; Computer Systems & Applications; 20100522; Vol. 19 (No. 11); full text *

Also Published As

Publication number Publication date
CN106980664A (en) 2017-07-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant