CN101303694A

CN101303694A - Method for cross-media retrieval by fusing information of different modalities

Info

Publication number: CN101303694A
Application number: CNA2008100614455A
Authority: CN
Inventors: 吴飞; 庄越挺; 王文华; 杨易
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2008-04-30
Filing date: 2008-04-30
Publication date: 2008-11-12

Abstract

The invention discloses a method for cross-media retrieval by fusing information of different modalities, comprising the following steps: 1) building relation graphs over the hypermedia objects and obtaining the corresponding correlation matrix; 2) taking a media object or hypermedia object submitted by the user, from inside or outside the database, as the query example and marking the initial matching degrees; 3) iterating with the correlation matrix between hypermedia objects until a stable state is reached, thereby propagating the matching degree to the unmarked hypermedia objects, and returning the hypermedia objects whose matching degree with the query example is greater than 0.6, or the media objects of a specified modality within them; 4) periodically adjusting the hypermedia relation graph according to the query example set and the positive example set. The invention fuses the low-level features of the various media objects and propagates semantics through the co-occurrence relations between media objects, and therefore achieves better retrieval results. Because the modality of the query example and that of the returned results may differ, and matching is propagated through semantics, retrieval is more accurate and more widely applicable.

Description

Method for cross-media retrieval by fusing information of different modalities

Technical field

The present invention relates to cross-retrieval between media of different modalities, and in particular to a method for cross-media retrieval by fusing information of different modalities.

Background technology

The growth of the Web has been accompanied by a sharp increase in the amount of information. Faced with such an enormous volume of data, retrieval has become an important means of obtaining information. Simple text retrieval can no longer satisfy users' increasingly complex needs: users want to retrieve not only text but also data of other modalities, such as images, video, audio, and slides in Microsoft PowerPoint format. Existing multimedia retrieval is generally based either on manual annotation or on matching low-level features. Annotation-based retrieval requires a large amount of manual labeling; because the data volume is huge and grows rapidly, it is suitable only for small collections of limited size. Retrieval based on matching the low-level features of multimedia objects does not require much manual input, but a wide gap exists between low-level features and semantics: visually similar images may represent entirely different semantics, while semantically identical images may look completely different. A retrieval method that fuses low-level features and semantics is therefore of great significance.

Because there is currently no way to obtain the semantics of a media object directly, retrieval based on semantics and features can be achieved only by making full use of the semantic relations between media objects. In practice, media objects generally do not exist in isolation but are attached to hypermedia; here, hypermedia refers to an object that contains media objects of multiple modalities, such as a web page or a slide deck. For an image in a web page, although its semantics cannot be obtained directly, it generally has a similar or complementary semantic relation with the text and the other media objects in the same page. By exploiting the semantic relations between media objects within the same hypermedia object, the gap between the low-level features of media objects of different modalities can be bridged, and a relation network of hypermedia objects can be built on both low-level features and semantic correlation. Once this relation network is built, the user can query for the desired media objects and hypermedia objects by submitting a media object or a hypermedia object; for example, semantically similar videos can be retrieved by submitting a web page or an image. Cross-retrieval between media objects is therefore highly valuable.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and to provide a method for cross-media retrieval by fusing information of different modalities.

The method for cross-media retrieval by fusing information of different modalities comprises the following steps:

1) building relation graphs over the hypermedia objects and obtaining the corresponding correlation matrix;

2) taking a media object or hypermedia object submitted by the user, from inside or outside the database, as the query example and marking the initial matching degrees;

3) iterating with the correlation matrix between hypermedia objects until a stable state is reached, thereby propagating the matching degree to the unmarked hypermedia objects, and returning the hypermedia objects whose matching degree is greater than 0.6, or the media objects of a specified modality within them;

4) periodically adjusting the hypermedia object relation graph according to the users' query example set and positive example set.

The step of building relation graphs over the hypermedia objects and obtaining the corresponding correlation matrix is as follows:

1) Build the audio distance graph A between hypermedia objects. For any two hypermedia objects that both contain audio objects, compute their audio distance as follows: form audio pairs by taking one audio object from each of the two hypermedia objects, compute the Mel-frequency cepstral coefficients (MFCC) of the two audio objects in each pair, and compute the low-level feature distance of every pair; the normalized distance of the pair with the smallest feature distance is taken as the audio distance between the two hypermedia objects. If either hypermedia object contains no audio, the audio distance between the two is set to infinity;

2) Build the image distance graph I between hypermedia objects. For any two hypermedia objects that both contain images, compute their image distance as follows: form image pairs by taking one image from each of the two hypermedia objects, extract the color and texture features of the two images in each pair, and compute the Euclidean distance between them; the normalized distance of the pair with the smallest feature distance is taken as the image distance between the two hypermedia objects. If either hypermedia object contains no image, the image distance between the two is set to infinity. Color and texture features are extracted from all images in the data set: the color features comprise the color histogram, color moments, and color coherence vector, and the texture features comprise coarseness, directionality, and contrast;

3) Build the text distance graph T between hypermedia objects. For any two hypermedia objects that both contain text, compute their text distance as follows: vectorize the text objects in the hypermedia objects using the term frequency/inverse document frequency (TF-IDF) method, compute the Euclidean distance between every two text objects, and normalize all distances; the smallest text feature distance between the two hypermedia objects is taken as their text distance. If either hypermedia object contains no text, the text distance between the two is set to infinity;

4) Adjust the audio distance graph A, the image distance graph I, and the text distance graph T of the hypermedia objects: on each of the three graphs, compute the shortest path between every two vertices and replace the original edge weight between the two vertices with the shortest-path length;

5) Construct the hypermedia object distance graph. Measure the precision of querying with audio, image, and text alone, denoted $P_a$, $P_i$, and $P_t$ respectively. Each vertex in the hypermedia object distance graph represents a hypermedia object, and each edge represents the distance between two hypermedia objects. Let the normalization coefficient be $\gamma = 1/(P_a + P_i + P_t)$; the distance between vertices $i$ and $j$ in the hypermedia object distance graph is $HMG_{ij} = \gamma \times (A_{ij} \times P_a + I_{ij} \times P_i + T_{ij} \times P_t)$;

6) Let the data set contain $n$ hypermedia objects, and build a matrix $C_{n \times n}$ to represent the semantic relation between any two hypermedia objects. $C_{ij}$ denotes the element in row $i$ and column $j$ of $C$: if $i = j$, $C_{ij}$ is zero; otherwise $C_{ij} = \exp(-HMG_{ij}^2 / 2\sigma^2)$, where $HMG_{ij}$ is the weight of the edge connecting the hypermedia objects with indices $i$ and $j$ in the hypermedia object distance graph, and $\sigma$ is an adjustable parameter.
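Steps 5) and 6) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names and the toy matrices are hypothetical, the exponent is taken to be negative (the usual Gaussian kernel, so that correlation falls as distance grows), and the three input graphs are assumed to have already been through the shortest-path adjustment of step 4).

```python
import numpy as np

def fuse_distance_graphs(A, I, T, p_a, p_i, p_t):
    """HMG_ij = gamma * (A_ij*P_a + I_ij*P_i + T_ij*P_t), gamma = 1/(P_a+P_i+P_t).

    Note: an infinite distance times a zero precision would give NaN, so this
    sketch assumes all three precisions are strictly positive.
    """
    gamma = 1.0 / (p_a + p_i + p_t)
    return gamma * (A * p_a + I * p_i + T * p_t)

def correlation_matrix(hmg, sigma):
    """C_ij = exp(-HMG_ij^2 / (2*sigma^2)) for i != j, and C_ii = 0."""
    C = np.exp(-(hmg ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(C, 0.0)
    return C

# Toy data for three hypermedia objects (illustrative values only).
# Object 2 contains no audio, so its audio distances are infinite.
A = np.array([[0.0, 0.2, np.inf],
              [0.2, 0.0, np.inf],
              [np.inf, np.inf, 0.0]])
I = np.array([[0.0, 0.4, 0.6],
              [0.4, 0.0, 0.5],
              [0.6, 0.5, 0.0]])
T = np.array([[0.0, 0.1, 0.3],
              [0.1, 0.0, 0.2],
              [0.3, 0.2, 0.0]])

hmg = fuse_distance_graphs(A, I, T, p_a=0.5, p_i=0.3, p_t=0.2)
C = correlation_matrix(hmg, sigma=0.5)
```

With these toy values, a pair that has an infinite distance in any one modality ends up with an infinite fused distance and therefore a correlation of zero.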

The step of taking a media object or hypermedia object submitted by the user, from inside or outside the database, as the query example and marking the initial matching degrees is as follows:

1) If the user submits a media object or hypermedia object that is in the database, find this object in the database and mark its matching degree with the query input as 1;

2) If the user submits a media object or hypermedia object from outside the database, compute the low-level feature distance between every media object in the database and the media objects contained in the query example; according to these distances, find the $k$ media objects in the database closest to the query example, and mark the matching degree of every hypermedia object to which these media objects belong as 1 with respect to the query example.
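The two marking cases above might look as follows in NumPy. The sketch assumes a single query media object whose feature vector has the same dimension as the stored features of that modality; the names `media_feats` and `owner` (the index of the hypermedia object that contains each media object) are hypothetical and not taken from the patent.

```python
import numpy as np

def initial_labels_in_db(n, query_index):
    """Case 1: the query is hypermedia object `query_index` of the database."""
    y0 = np.zeros(n)
    y0[query_index] = 1.0
    return y0

def initial_labels_out_of_db(query_feat, media_feats, owner, n, k):
    """Case 2: the query is outside the database.

    media_feats: (m, d) low-level features of the m database media objects
    owner:       (m,) index of the hypermedia object containing each of them
    """
    dists = np.linalg.norm(media_feats - query_feat, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                           # k closest media objects
    y0 = np.zeros(n)
    y0[owner[nearest]] = 1.0                                  # mark their hypermedia objects
    return y0

# Toy data: four media objects spread over three hypermedia objects.
media_feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
owner = np.array([0, 1, 1, 2])
y0_in = initial_labels_in_db(n=3, query_index=2)
y0_out = initial_labels_out_of_db(np.array([0.1, 0.0]), media_feats, owner, n=3, k=2)
```

In the toy data the two nearest media objects belong to hypermedia objects 0 and 1, so those two entries of the label vector are set to 1.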

The step of iterating with the correlation matrix between hypermedia objects until a stable state is reached, thereby propagating the matching degree to the unmarked hypermedia objects, is as follows: given the marking matrix $Y_{n \times 1} = [y_1, y_2, \ldots, y_n]^T$, where $y_i$ is the matching degree between the $i$-th hypermedia object and the query example, use the formula $Y^* = (1-\alpha)(I - \alpha C)^{-1} Y(0)$ to obtain the matching degree of every hypermedia object with the input example after the iteration has stabilized, and return the hypermedia objects whose matching degree is greater than 0.6, or the media objects they contain.
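The closed-form expression can be checked against an explicit iteration. The text gives only the closed form; the update rule $Y(t+1) = \alpha C\,Y(t) + (1-\alpha)\,Y(0)$ used below is an assumption, chosen because its fixed point is exactly $(1-\alpha)(I - \alpha C)^{-1} Y(0)$. That iteration converges only when the spectral radius of $\alpha C$ is below 1, which holds for the toy matrix here but is not guaranteed for an arbitrary unnormalized $C$. Function names and data are hypothetical.

```python
import numpy as np

def propagate_closed_form(C, y0, alpha):
    """Y* = (1 - alpha) * (I - alpha*C)^(-1) * Y(0), via a linear solve."""
    n = C.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * C, y0)

def propagate_iterative(C, y0, alpha, tol=1e-10, max_iter=10000):
    """Assumed update rule: Y(t+1) = alpha*C*Y(t) + (1 - alpha)*Y(0)."""
    y = y0.copy()
    for _ in range(max_iter):
        y_next = alpha * (C @ y) + (1.0 - alpha) * y0
        if np.max(np.abs(y_next - y)) < tol:
            return y_next
        y = y_next
    raise RuntimeError("iteration did not converge; alpha*C may be too large")

# Toy correlation matrix (zero diagonal) and a query that matches object 0.
C = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
y0 = np.array([1.0, 0.0, 0.0])

y_closed = propagate_closed_form(C, y0, alpha=0.5)
y_iter = propagate_iterative(C, y0, alpha=0.5)
```

Both routes give the same scores on this example, with the query object ranked first, its strongly correlated neighbor second, and the weakly correlated object last.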

The invention fuses media information of different modalities, exploits the complete semantics within hypermedia, and dynamically adjusts the semantic relations according to user feedback, and therefore achieves higher precision. The method also provides cross-retrieval between media of different modalities: the user can submit a hypermedia object, text, audio, or an image to retrieve media objects and hypermedia objects of the same or a different modality, which makes it more flexible and more powerful.

Description of drawings

Fig. 1 is a flow chart of the method for cross-media retrieval by fusing information of different modalities;

Fig. 2 shows a retrieval result of the invention: the first 9 results returned when the user queries for images by submitting a web page about cats.

Embodiment

The step of periodically adjusting the hypermedia object distance graph HMG according to the users' query example set and positive example set is as follows:

1) Construct the graph G(0), in which each vertex represents a hypermedia object and there is no edge between any two hypermedia objects;

2) Each round of user relevance feedback is used to improve the graph G; for example, the user feedback of round t modifies G(t-1);

3) Adjust the weight of each edge in the graph G with a shortest-path algorithm;

4) Adjust the hypermedia object distance graph with the graph G, so that the hypermedia distance graph better matches the relations between hypermedia objects as the user sees them.

Using the low-level feature distances between media objects of the same modality and the high semantic correlation between media objects of different modalities within the same hypermedia object, the invention builds distance graphs and a correlation matrix over all hypermedia objects in the data set, and propagates the matching degree of the query example through the relation graph according to the weights between vertices. It thereby achieves cross-retrieval between media of different types as well as hypermedia retrieval based on both content and semantics.

As shown in Fig. 1, the method for cross-media retrieval by fusing information of different modalities is described in detail as follows:

1) Offline processing: this module performs semantic understanding of the media objects in the database and builds the hypermedia distance graph. It comprises four main algorithms: feature extraction, building the single-modality hypermedia distance graphs, building the hypermedia object distance graph, and building the correlation matrix. They are described as follows:

a. Feature extraction and distance calculation for media objects: this algorithm first extracts low-level features from media objects of each type using a different method, and then computes the distances between media objects of the same modality. For all text objects in the data set, the text is vectorized using term frequency/inverse document frequency (TF-IDF), and the Euclidean distance between every two texts is computed. For all audio objects in the data set, Mel-frequency cepstral coefficients (MFCC) are used as the audio feature, and the distances between audio objects are computed. For all image objects, color and texture features are extracted, and the Euclidean distance between every two images is computed. Finally, the text, image, and audio distances are normalized.

b. Building the single-modality hypermedia distance graphs: this algorithm builds a hypermedia distance graph for each of the three modalities (audio, image, and text). In the hypermedia audio distance graph, each vertex represents a hypermedia object. Among the audio objects contained in two hypermedia objects, the distance between the two audio objects (one from each) with the smallest low-level feature distance is taken as the distance between the two vertices; if either hypermedia object contains no audio, the distance between the two is set to infinity. On this original audio distance graph, the shortest path between every two vertices is computed, and the distance between the two vertices is replaced with the shortest-path length. The hypermedia image distance graph and the hypermedia text distance graph are built in the same way as the audio distance graph.

c. Building the hypermedia object distance graph: this algorithm constructs the hypermedia object distance graph. Measure the precision of querying with audio, image, and text alone, denoted $P_a$, $P_i$, and $P_t$ respectively. Each vertex in the hypermedia object distance graph represents a hypermedia object, and each edge represents the distance between two hypermedia objects. Let the normalization coefficient be $\gamma = 1/(P_a + P_i + P_t)$; the distance between vertices $i$ and $j$ in the hypermedia object distance graph is $HMG_{ij} = \gamma \times (A_{ij} \times P_a + I_{ij} \times P_i + T_{ij} \times P_t)$;

d. Building the correlation matrix: let the data set contain $n$ hypermedia objects, and build a matrix $C_{n \times n}$ to represent the semantic relation between any two hypermedia objects. $C_{ij}$ denotes the element in row $i$ and column $j$ of $C$: if $i = j$, $C_{ij}$ is zero; otherwise $C_{ij} = \exp(-HMG_{ij}^2 / 2\sigma^2)$, where $HMG_{ij}$ is the weight of the edge connecting the hypermedia objects with indices $i$ and $j$ in the hypermedia object distance graph, and $\sigma$ is an adjustable parameter.
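The single-modality graph construction of the offline module (closest cross-object pair, then shortest-path replacement) can be sketched as below. This is an illustration only: it uses a vectorized Floyd-Warshall pass for the all-pairs shortest paths, which is a substitution for whatever shortest-path routine an implementation would actually choose, and the names `pair_dist` and `owner` as well as the toy distances are hypothetical.

```python
import numpy as np

def single_modality_graph(pair_dist, owner, n):
    """Distance between two hypermedia objects = smallest distance over all
    pairs of media objects (one from each); infinity if either has none.

    pair_dist: (m, m) normalized distances between media objects of one modality
    owner:     length-m sequence, index of the hypermedia object of each one
    """
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    m = len(owner)
    for a in range(m):
        for b in range(m):
            i, j = owner[a], owner[b]
            if i != j:
                D[i, j] = min(D[i, j], pair_dist[a, b])
    return D

def all_pairs_shortest_paths(D):
    """Floyd-Warshall: replace every edge weight by the shortest-path length."""
    S = D.copy()
    for k in range(S.shape[0]):
        S = np.minimum(S, S[:, k:k + 1] + S[k:k + 1, :])
    return S

# Toy audio data: four clips; hypermedia object 3 contains no audio at all.
pair_dist = np.array([[0.0, 0.1, 0.9, 0.8],
                      [0.1, 0.0, 0.3, 0.9],
                      [0.9, 0.3, 0.0, 0.2],
                      [0.8, 0.9, 0.2, 0.0]])
owner = [0, 0, 1, 2]
D = single_modality_graph(pair_dist, owner, n=4)
S = all_pairs_shortest_paths(D)
```

Here the direct audio distance between objects 0 and 2 is 0.8, but the path through object 1 costs 0.3 + 0.2, so the adjusted weight becomes 0.5; object 3 stays at infinite distance from everything.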

2) Retrieval: this module implements cross-retrieval of media objects and semantic retrieval of hypermedia. The user can submit an image, a sound, a text, or a hypermedia object as the query input to retrieve the media objects or hypermedia objects most semantically related to it. The details are as follows:

a. When the query example submitted by the user is a hypermedia object that exists in the data set, first mark the matching degree between this hypermedia object and the query input as 1 and construct the matrix $Y_{n \times 1} = [y_1, y_2, \ldots, y_n]^T$, where $y_i$ is the matching degree between the $i$-th hypermedia object and the query example: $y_i$ is assigned 1 if the hypermedia object is the query input itself, and 0 otherwise. Then use the formula $Y^* = (1-\alpha)(I - \alpha C)^{-1} Y(0)$ to obtain the matching degree of every hypermedia object with the input example after the iteration has stabilized, and return the hypermedia objects whose matching degree is greater than 0.6, or the media objects they contain.

b. When the query example submitted by the user is a media object that exists in the data set, find the hypermedia object to which this media object belongs, mark the matching degree between that hypermedia object and the query input as 1, and assign 0 to the matching degree of every other hypermedia object. Then compute the steady-state matching degrees between all hypermedia objects and the query input, using the same method as in step a.

c. When the query example submitted by the user is a media object outside the data set, compute the low-level feature distance between the query example and every object of the data set using the media object distance calculation of the preprocessing module, find the $k$ nearest neighbors, and mark the matching degree of the hypermedia objects to which these $k$ media objects belong as 1, so that they stand in for the query example. The subsequent procedure is the same as in step a.

d. When the query example submitted by the user is a hypermedia object outside the data set, first find the $k$ nearest neighbors of each of the media objects in this hypermedia object, assign 1 to the elements of the matrix $Y$ that correspond to the hypermedia objects to which these nearest neighbors belong, and assign 0 to all other objects. Then compute the steady-state matching degrees between all objects and the query example, using the same method as in step a.

3) A user feedback graph G is constructed from the query example sets and positive example sets of user feedback to represent the users' view of the relations between hypermedia objects, and G is periodically used to improve the hypermedia object distance graph. The details are as follows:

a. Construct the graph G(0): for any hypermedia object $i$ and hypermedia object $j$, let $G_{ij}(0) = 0$.

b. Let the query set and the positive example set of the user feedback of round $t$ be $Q_t$ and $P_t$ respectively. The user feedback graph after the round-$t$ modification is then $G_{ij}(t) = \lambda + \log_2(G_{ij}(t-1) + 2)$, where object $i$ and object $j$ belong to $Q_t$ or $P_t$, and $\lambda$ is an adjustable parameter greater than or equal to 1.

c. Optimize the edge weights in the user feedback graph G according to formula 2, where $G_p$ denotes the weight of path $p$ in the user feedback graph G, $\min$ denotes the minimum of its arguments, $\mathrm{minv}$ denotes the weight of the smallest edge along the path connecting the two vertices, and $l$ denotes the number of edges along the path.

$$G_p = \min\left(1 + \frac{\mathrm{minv}}{l},\ \mathrm{minv}\right) \qquad (2)$$

d. According to formula 3, combine the user feedback graph G with the hypermedia distance graph to adjust the distances between hypermedia objects in the hypermedia distance graph, and regenerate the hypermedia object correlation matrix $C$ according to formula 1. For any hypermedia objects $i$ and $j$: if $i$ and $j$ belong to the query set and the positive example set of round-$r$ relevance feedback, then $HMG_{ij} = \omega \times HMG_{ij}$, where $\omega$ is a positive number less than 1 and $HMG_{ij}$ is the weight of the edge between objects $i$ and $j$; if $i$ and $j$ belong to the query set and the positive example set of round-$r$ relevance feedback and there is an edge of nonzero weight in the graph G between hypermedia object $k$ and hypermedia object $j$, then $HMG_{ij} = HMG_{ij} / G_{kj}$.
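The feedback steps a to d can be illustrated with a small sketch. Several points are assumptions rather than statements from the text: the round-$t$ update is applied to every pair of distinct objects drawn from the union of $Q_t$ and $P_t$, self-pairs are skipped, formula 2 is evaluated for a single given path, and only the first branch of the distance adjustment (scaling by $\omega$) is shown because the second branch is not fully specified. All function names are hypothetical.

```python
import math
import numpy as np

def feedback_members(query_set, positive_set):
    """Objects touched by one feedback round (assumed: union of both sets)."""
    return sorted(set(query_set) | set(positive_set))

def update_feedback_graph(G_prev, query_set, positive_set, lam=1.0):
    """One round: G_ij(t) = lam + log2(G_ij(t-1) + 2) for the fed-back pairs."""
    G = G_prev.copy()
    members = feedback_members(query_set, positive_set)
    for i in members:
        for j in members:
            if i != j:  # assumption: self-pairs are left untouched
                G[i, j] = lam + math.log2(G_prev[i, j] + 2.0)
    return G

def path_weight(edge_weights):
    """Formula 2 for one path: G_p = min(1 + minv / l, minv)."""
    minv = min(edge_weights)   # weight of the smallest edge on the path
    l = len(edge_weights)      # number of edges on the path
    return min(1.0 + minv / l, minv)

def shrink_feedback_distances(hmg, query_set, positive_set, omega=0.5):
    """First branch of the adjustment: HMG_ij = omega * HMG_ij, 0 < omega < 1."""
    out = hmg.copy()
    members = feedback_members(query_set, positive_set)
    for i in members:
        for j in members:
            if i != j:
                out[i, j] = omega * hmg[i, j]
    return out

G0 = np.zeros((4, 4))                                   # step a: G_ij(0) = 0
G1 = update_feedback_graph(G0, query_set={0}, positive_set={1, 2})
G2 = update_feedback_graph(G1, query_set={0}, positive_set={1})
hmg = np.full((4, 4), 0.4)
np.fill_diagonal(hmg, 0.0)
hmg_adjusted = shrink_feedback_distances(hmg, query_set={0}, positive_set={1})
```

With $\lambda = 1$, a pair that is fed back once gets weight 2 and a pair fed back twice gets weight 3, so repeated positive feedback strengthens an edge only logarithmically.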

Embodiment:

Suppose there are 1000 hypermedia objects, made up of 950 images, 100 sound clips, and 800 passages of text. First, the color and texture features of all images are extracted, where the color features comprise the color histogram, color moments, and color coherence vector, and the texture features comprise coarseness, directionality, and contrast; the pairwise distances between all images are then computed. For the sound clips, Mel-frequency cepstral coefficients (MFCC) are extracted, and the pairwise distances between all sound objects are computed. For the texts, the pairwise distances between text objects are computed after TF-IDF vectorization. Once the media object distances have been computed, the image, text, and sound distances are each normalized. The audio distance graph A, the image distance graph I, and the text distance graph T between hypermedia objects are then built. To build the audio distance graph A, for any first and second hypermedia objects, find all distances between the audio objects belonging to the first and those belonging to the second, and take the smallest as the audio distance between the two objects; if either or both of them contain no audio object, the audio distance between them is set to infinity. Dijkstra's algorithm is then used to compute the shortest distance between every two vertices, which becomes the new weight of the edge between them. The image distance graph I and the text distance graph T are built in the same way as the audio distance graph. The precision of querying with audio, image, and text alone is measured and denoted $P_a$, $P_i$, and $P_t$ respectively. The audio, image, and text distance graphs are fused to build the hypermedia distance graph, in which each vertex represents a hypermedia object and each edge represents the distance between two hypermedia objects. Let the normalization coefficient be $\gamma = 1/(P_a + P_i + P_t)$; the distance between vertices $i$ and $j$ in the hypermedia distance graph is $HMG_{ij} = \gamma \times (A_{ij} \times P_a + I_{ij} \times P_i + T_{ij} \times P_t)$. On the basis of the hypermedia distance graph, a $1000 \times 1000$ matrix $C$ is built to represent the semantic relation between any two hypermedia objects. $C_{ij}$ denotes the element in row $i$ and column $j$ of $C$: if $i = j$, $C_{ij}$ is zero; otherwise $C_{ij} = \exp(-HMG_{ij}^2 / 0.5)$. A $1000 \times 1$ matrix $Y_{1000 \times 1}$ is built, where $Y_i$ represents the degree of correlation between the $i$-th hypermedia object and the query; all $Y_i$ are initialized to zero.

Fig. 2 shows the first 9 results returned when the user queries for images by submitting a web page about cats. The retrieval process is as follows. When the user submits a web page about cats, suppose the page contains one audio object and one passage of text. The system first computes the MFCC of the audio object and finds the 3 audio objects in the data set nearest to it; the elements of the matrix $Y$ that correspond to the hypermedia objects containing these 3 audio objects are set to 1. Similarly, the system computes TF-IDF for the text in the query example, finds the 3 texts in the data set whose low-level features are closest to it, and sets to 1 the elements of $Y$ that correspond to the hypermedia objects containing these 3 texts. The elements corresponding to all remaining hypermedia objects are set to 0, which gives the initialized matching matrix $Y(0)$. The final matching degree matrix $Y^*$ is computed with the formula $Y^* = (1-0.5)(I - 0.5 \times C)^{-1} Y(0)$, and the images contained in the first 9 hypermedia objects whose matching degree in $Y^*$ is greater than 0.6 are returned as the result of the user's search. As Fig. 2 shows, the precision is quite high, which indicates that the method effectively bridges the semantic gap and solves the problem of cross-retrieval between media of different modalities.
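The final selection step of this example (keep the objects scoring above 0.6, rank them, and return the images of the first 9) might be sketched as follows. The reading that the threshold is applied before the top-9 cut, the function names, and the `images_by_hypermedia` mapping with its file names are all assumptions made for illustration.

```python
import numpy as np

def top_matches(y_star, threshold=0.6, limit=9):
    """Indices of hypermedia objects with score > threshold, best first."""
    candidates = np.flatnonzero(y_star > threshold)
    order = np.argsort(-y_star[candidates], kind="stable")
    return candidates[order][:limit].tolist()

def images_of_matches(y_star, images_by_hypermedia, threshold=0.6, limit=9):
    """Collect the images contained in the selected hypermedia objects."""
    result = []
    for h in top_matches(y_star, threshold, limit):
        result.extend(images_by_hypermedia.get(h, []))
    return result

# Toy scores for five hypermedia objects and a hypothetical image index.
y_star = np.array([0.9, 0.2, 0.7, 0.61, 0.95])
images_by_hypermedia = {0: ["cat0.jpg"], 2: [], 4: ["cat4a.jpg", "cat4b.jpg"]}
ranked = top_matches(y_star, limit=3)
images = images_of_matches(y_star, images_by_hypermedia, limit=3)
```

A hypermedia object that passes the threshold but contains no image, such as object 2 in the toy data, simply contributes nothing to the returned list.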

As the example above shows, unlike traditional retrieval methods, the invention makes full use of the semantic correlation and complementarity between the multimedia objects contained in a hypermedia object, and weights the influence of the media objects of each modality according to measured precision; its precision is therefore higher than that of traditional retrieval methods. In addition, the invention can retrieve by hypermedia, which is a complete fusion of media objects of different modalities, and can also retrieve media objects of any modality from an ordinary submitted media object. From a functional standpoint, the invention is therefore more flexible and more powerful, and better meets users' needs.

Claims

1. A method for cross-media retrieval by fusing information of different modalities, characterized by comprising the following steps:

2. The method for cross-media retrieval by fusing information of different modalities according to claim 1, characterized in that the step of building relation graphs over the hypermedia objects and obtaining the corresponding correlation matrix is as follows:

3. The method for cross-media retrieval by fusing information of different modalities according to claim 1, characterized in that the step of taking a media object or hypermedia object submitted by the user, from inside or outside the database, as the query example and marking the initial matching degrees is as follows:

4. The method for cross-media retrieval by fusing information of different modalities according to claim 1, characterized in that the step of iterating with the correlation matrix between hypermedia objects until a stable state is reached, thereby propagating the matching degree to the unmarked hypermedia objects, is as follows: given the marking matrix $Y_{n \times 1} = [y_1, y_2, \ldots, y_n]^T$, where $y_i$ is the matching degree between the $i$-th hypermedia object and the query example, use the formula $Y^* = (1-\alpha)(I - \alpha C)^{-1} Y(0)$ to obtain the matching degree of every hypermedia object with the input example after the iteration has stabilized, and return the hypermedia objects whose matching degree is greater than 0.6, or the media objects they contain.