CN1920818A

CN1920818A - Cross-media retrieval method based on multimodal information fusion analysis

Info

Publication number: CN1920818A
Application number: CN 200610053392
Authority: CN
Inventors: 潘云鹤; 庄越挺; 吴飞; 杨易
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2006-09-14
Filing date: 2006-09-14
Publication date: 2007-02-28
Anticipated expiration: 2026-09-14
Also published as: CN100388282C

Abstract

The invention discloses a cross-media retrieval method based on multimodal information fusion analysis. The method performs multimedia semantic understanding through fusion analysis of multimodal information, and thereby supports content-based multimedia document retrieval, image retrieval, audio retrieval and text retrieval. A user can retrieve media objects or multimedia documents of any modality by submitting a query example of any modality. For example, to retrieve images the user may submit an image as the query example, or may submit audio, text, or a combination of them. Because the method does not rely on keywords alone for semantic understanding, but fuses and analyzes all media objects inside a multimedia document and combines the information carried by media objects of every modality, retrieval results are better; and because the query example and the returned results may be of different modalities, the method is more powerful and more widely applicable.

Description

Cross-media retrieval method based on multimodal information fusion analysis

Technical field

The present invention relates to multimedia retrieval, and in particular to a cross-media retrieval method based on multimodal information fusion analysis.

Background art

A multimedia document is a very common type of file today. It is composed of several media objects of different modalities (audio, images, text and so on) and carries a certain meaning; multimedia encyclopedias, web pages and slides in Microsoft PowerPoint format are all multimedia documents. In general, a multimedia document has two characteristics. First, its structure is complex: media objects of several modalities coexist inside it. Second, the media objects of different modalities inside one multimedia document are semantically complementary, and the semantics of the document are expressed jointly by all of its media objects. Consequently, even when a single media object is ambiguous, the semantics of the multimedia document taken as a whole are usually clear. Traditional retrieval methods are mostly designed for media objects of a single modality and do not take into account the complementary information carried by the different modalities inside a multimedia document. They therefore cannot analyze the media objects of different modalities jointly to understand multimedia semantics, and cannot adapt well to user needs.

With the development of storage and Internet technology, more and more multimedia files are accessible to users, including text, pictures, sound clips and multimedia documents. Retrieval technology helps users quickly find the content they need in massive amounts of data, and has become an increasingly important field of applied computer technology. Traditional retrieval techniques can be divided into keyword-based retrieval and content-based retrieval. A keyword-based retrieval system requires multimedia objects to be annotated in advance. Because the number of existing media objects is enormous, the annotation workload is very heavy. Moreover, annotation is inevitably influenced by the annotator's subjectivity: different annotators may assign different keywords to the same multimedia object, so keywords often cannot reflect fully and objectively all the semantics a multimedia object carries. A content-based retrieval system needs no annotation; the user submits a query example to retrieve multimedia objects. Traditional content-based retrieval, however, has two weaknesses. First, the user can only retrieve media objects of the same modality as the query example: images can be retrieved with an image example and audio with an audio example, but images cannot be retrieved with an audio example, nor audio with an image example. Second, there is a semantic gap between the low-level features of media objects and high-level semantics, so precision is not very satisfactory. Media objects often appear as part of multimedia documents, and the media objects in one multimedia document often share the same semantics. To bridge the semantic gap, the semantic complementarity of media objects of different modalities can therefore be used to resolve ambiguity and to understand multimedia semantics better. At the same time, to meet users' need for cross-media queries, such as querying images with a sound example, a content-based cross-media retrieval method is of considerable value.

Summary of the invention

The object of the present invention is to provide a method for content-based multimedia document retrieval and cross-media retrieval, characterized in that it comprises the following steps:

1) multimedia semantics are understood through multimodal information fusion analysis;

2) the user submits a media object, either already in the database or from outside it, as the query example for retrieval;

3) a secondary retrieval is carried out according to the user's relevance feedback;

4) the multimedia semantic space is maintained according to the user's relevance feedback.

The understanding of multimedia semantics through multimodal information fusion analysis comprises the following steps:

1) for all audio clips in the database, extract four features: root mean square (RMS), rolloff frequency (Rolloff), zero-crossing rate (ZCR) and spectral centroid (Centroid); compute the pairwise distances between all audio clips with the dynamic time warping (DTW) algorithm, and normalize all distances;

2) for all image objects in the database, extract color and texture features, compute the pairwise Euclidean distances between all image objects, and normalize all distances;

3) vectorize all text media objects in the database with the term frequency/inverse document frequency (TF/IDF) method, compute the pairwise distances between all text media objects, and normalize all distances;

4) fuse, by a nonlinear method, the information carried by the audio, text and image objects in each multimedia document, thereby obtaining the pairwise distances between multimedia documents;

5) build a multimedia document association graph. Each multimedia document is a vertex of the graph, and every pair of vertices is joined by a weighted edge whose weight is the distance, obtained in step 4, between the two corresponding multimedia documents;

6) reconstruct the multimedia document association graph: first set a threshold and set to infinity the weight of every edge whose weight exceeds the threshold; then, for every edge, take the length of the shortest path between its two endpoints as its new weight;

7) project the multimedia document association graph into the multimedia semantic space by multidimensional scaling (MDS). This space preserves the topological relations of the association graph; every multimedia document has a unique coordinate in the space and is indexed by that coordinate, and every media object is indexed by the coordinate of the multimedia document to which it belongs.
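Step 3 of the list above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the tokenizer, the exact TF/IDF weighting, or the text distance, so the plain log(N/df) weighting, the Euclidean distance, and the function names `tfidf_vectors`, `euclidean` and `pairwise` are all hypothetical choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one TF-IDF vector per tokenized document (assumed weighting)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vectors


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(items, dist):
    """Full symmetric matrix of distances between all items."""
    n = len(items)
    return [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
```

The same `pairwise` helper could be reused for image feature vectors in step 2; the normalization mentioned in each step would be applied to the resulting matrix afterwards.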

When the user submits a media object already in the database as the query example, retrieval proceeds as follows: first find the coordinate of this media object in the multimedia semantic space; then, using the coordinates of all media objects in the space, compute the Euclidean distance between the query example and every other media object; finally, sort all media objects by this distance in ascending order and return the nearest media objects of the target modality.

When the user submits a media object from outside the database as the query example, retrieval proceeds as follows:

1) find all media objects in the database that have the same modality as the query example, and compute the low-level feature distance between each of them and the query example;

2) according to the low-level feature distance, find the k media objects in the database closest to the query example, take the centroid of these media objects in the multimedia semantic space as the query example, and carry out cross-media retrieval by the method described above for a query example already in the database.

Secondary retrieval according to the user's relevance feedback proceeds as follows: after the query results are returned, the user evaluates them and marks the results he or she approves of. The system takes the centroid, in the multimedia semantic space, of the media objects the user marked as positive examples as the new query example, computes the Euclidean distance between it and every other media object in the space, sorts all media objects by this distance, and returns the nearest media objects of the target modality.

Maintenance of the multimedia semantic space according to the user's relevance feedback comprises the following steps:

1) according to the history of user relevance feedback, periodically modify the multimedia document association graph and rebuild the multimedia semantic space so that it reflects multimedia semantic relations more accurately;

2) according to the user's relevance feedback, map query examples from outside the database into the multimedia semantic space, thereby updating the database.

Compared with the background art, the present invention has the following beneficial effects:

The invention proposes a new content-based retrieval method. Because the method adopts a multimodal information fusion mechanism and makes full use of the information carried by media objects of different modalities, it is better able to bridge the semantic gap and therefore achieves higher precision. The invention also discloses a method of cross-media retrieval: by submitting an example of any form (image, text, sound or multimedia document), the user can query media objects or multimedia documents of any modality. The query example and the returned results may be of different modalities, so the method is more powerful than traditional content-based retrieval systems.

Description of drawings

Fig. 1 is a framework diagram of the system of the present invention;

Fig. 2 shows a retrieval result of the present invention: the first 9 results returned when the user queries images by submitting the sound of a car engine.

Detailed description

The present invention understands the semantics of multimedia documents through multimodal information fusion analysis and builds a unified index for all multimedia documents. A multimedia object of any modality is indexed by the coordinate of the multimedia document to which it belongs, so multimedia objects of different modalities also share a unified index. This enables both multimedia document retrieval and cross-media retrieval.

An example of the content-based retrieval method proposed by the invention is shown in Fig. 1 and is described in detail below:

1) Preprocessing module: this module performs semantic understanding of the media objects in the database and builds the unified index. It comprises three main algorithms: feature extraction, multimodal information fusion, and construction of the multimedia semantic space. The details are as follows:

a. Feature extraction and similarity computation for multimedia objects: this algorithm extracts features from media objects of each modality and computes low-level feature distances. For all image objects in the database, texture and color features are extracted and the pairwise Euclidean distances between all image objects are computed. For all audio objects, four features are extracted (root mean square, zero-crossing rate, rolloff frequency and spectral centroid) and the pairwise distances between all audio objects are computed with the dynamic time warping (DTW) algorithm. For all text objects, the text is vectorized with the TF/IDF method and the pairwise Euclidean distances between all text objects are computed. The image distances, audio distances and text distances are then each Gaussian-normalized.
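The audio distance and the normalization in step a can be sketched as below. Two points are assumptions rather than statements from the patent: each audio clip is taken to be a sequence of per-frame feature vectors (for instance RMS, ZCR, rolloff and centroid per frame), and "Gaussian normalization" is taken to be the 3-sigma rule often used in content-based retrieval, shifted into [0, 1]. The names `dtw_distance` and `gaussian_normalize` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def gaussian_normalize(values):
    """Map distances into [0, 1] with the 3-sigma rule (assumed variant)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.5 for _ in values]
    return [
        min(1.0, max(0.0, ((v - mean) / (3 * std) + 1) / 2)) for v in values
    ]
```

Normalizing each modality separately, as the text requires, puts image, audio and text distances on a comparable scale before they are fused.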

b. Multimodal information fusion: this algorithm computes the distance between multimedia documents by fusing the relations between the different media objects inside them. For any two multimedia documents, step a gives the distances between the images, sounds and text objects they contain; let mindis and maxdis be the minimum and maximum of these distances. The multimedia document distance MMDdis is then defined as: MMDdis = λ * mindis + (α + ln(β * (maxdis - mindis) + 1)). If the two multimedia documents share only one modality of media object, then MMDdis = λ * mindis + A. Here α, β, λ and A are constants that can be tuned to the size of the database and the distribution of the data. If the two multimedia documents share no modality, their distance is first set to infinity; the shortest path found in a later step then serves as the distance between them.
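The three cases of the fusion rule in step b can be written as one small function. This is a sketch, not the patent's implementation: it assumes one normalized distance per shared modality, the function name `mmd_distance` and parameter name `single` (for the constant A) are hypothetical, and the default constants are the ones used in the worked example later in the description (λ = 1, α = 0.1, β = 0.3, A = 0.1).

```python
import math

INF = float("inf")


def mmd_distance(shared, lam=1.0, alpha=0.1, beta=0.3, single=0.1):
    """Distance between two multimedia documents.

    shared maps each modality present in BOTH documents to the
    normalized distance between the corresponding media objects.
    """
    if not shared:
        # No common modality: resolved later through shortest paths.
        return INF
    dists = list(shared.values())
    if len(dists) == 1:
        # Only one common modality: lambda * mindis + A.
        return lam * dists[0] + single
    lo, hi = min(dists), max(dists)
    return lam * lo + (alpha + math.log(beta * (hi - lo) + 1))
```

The logarithmic term grows slowly with the spread between the closest and the farthest modality, so the closest modality dominates the fused distance while disagreement between modalities adds a bounded penalty.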

c. Construction of the multimedia document association graph: for each multimedia document in the database, a corresponding vertex is placed in the graph. An edge is placed between every pair of vertices, with weight equal to the distance between the two corresponding multimedia documents. The graph is then reconstructed as follows: the weight of every edge whose weight exceeds a threshold is set to infinity (the threshold can be set to the mean of all edge lengths minus their standard deviation); the length of a path is defined as the sum of the weights of the edges along it; and the weight of the edge between every pair of vertices is reset to the length of the shortest path between them.
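The reconstruction in step c amounts to pruning long edges and then running an all-pairs shortest-path computation. The sketch below uses the Floyd-Warshall algorithm for brevity; this is a substitution chosen for the illustration (the worked example in this description names Dijkstra's algorithm), and the function name `rebuild_graph` is hypothetical.

```python
import math

INF = float("inf")


def rebuild_graph(dist, threshold=None):
    """Prune edges above the threshold, then return shortest-path lengths."""
    n = len(dist)
    if threshold is None:
        # Default suggested in the text: mean minus standard deviation
        # of all (finite) edge lengths.
        finite = [
            dist[i][j]
            for i in range(n)
            for j in range(i + 1, n)
            if dist[i][j] < INF
        ]
        mean = sum(finite) / len(finite)
        std = math.sqrt(sum((d - mean) ** 2 for d in finite) / len(finite))
        threshold = mean - std
    g = [
        [
            0.0 if i == j else (dist[i][j] if dist[i][j] <= threshold else INF)
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Floyd-Warshall: relax every pair through every intermediate vertex.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][k] + g[k][j] < g[i][j]:
                    g[i][j] = g[i][k] + g[k][j]
    return g
```

Pairs of documents that remain unreachable after pruning keep an infinite distance here; the next step replaces those entries before projection.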

d. Construction of the multimedia semantic space: build a matrix D whose entry d_ij is the distance in the multimedia document association graph from the i-th multimedia document to the j-th. If the distance between two multimedia documents is infinite, d_ij is set to 1. With D as input, the association graph is projected by multidimensional scaling to obtain the multimedia semantic space. Every multimedia document has a corresponding coordinate in this space, and also holds pointers to the media objects attached to it.
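A minimal sketch of step d follows. The patent does not say which variant of multidimensional scaling is used, so classical (Torgerson) MDS is assumed here, and the leading eigenvectors are found by plain power iteration only to keep the example dependency-free; in practice a linear-algebra library's symmetric eigen-solver would normally be used. The names `cap_infinite` and `classical_mds` are hypothetical.

```python
import math


def cap_infinite(D, cap=1.0):
    """Replace infinite distances by a finite value (1 in the text)."""
    return [[cap if math.isinf(d) else d for d in row] for row in D]


def classical_mds(D, dims=2, iters=200):
    """Embed n items in `dims` dimensions from an n x n distance matrix."""
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double centering: B = -1/2 * J * D^2 * J.
    B = [
        [-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
        for i in range(n)
    ]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        v = [1.0 / (i + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords
```

A typical call would be `classical_mds(cap_infinite(G), dims=20)`, where `G` stands for the reconstructed graph's distance matrix; the row index of each coordinate is the document index.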

2) Retrieval module: this module implements cross-media retrieval, including multimedia document retrieval, image retrieval, audio retrieval and text retrieval. The user can submit a multimedia document, image, sound or text as the query example to query media objects or multimedia documents of any modality. The details are as follows:

a. When the query example submitted by the user is a multimedia document already in the database, the system first finds the coordinate of that document in the multimedia semantic space and then finds the k nearest neighbors of the query example in that space. If the user is retrieving multimedia documents, the k nearest neighbors are returned directly; if the user is retrieving images, the images belonging to the k nearest multimedia documents are returned; retrieval of sound or text works in the same way as image retrieval.

b. When the query example is a multimedia object (image, sound or text) already in the database, the system first finds the multimedia document to which the query example belongs and then retrieves with that document as the query example, as in step a.

c. When the query example is a multimedia document from outside the database, the system computes the distance from the query example to every multimedia document in the database, using the multimedia document distance defined in the preprocessing module, and finds the k nearest neighbors of the query example. If the user is retrieving multimedia documents, the k nearest neighbors are returned directly; if images, the images contained in the k nearest neighbors are returned; retrieval of sound or text works in the same way.

d. When the query example is a multimedia object from outside the database, the system first computes the feature-space distance between the query example and every multimedia object of the same modality in the database, and finds the k nearest neighbors of the query example in feature space. It then obtains the multimedia documents to which these k neighbors belong and computes their centroid in the multimedia semantic space. This centroid is used as the query example, and retrieval proceeds as in step a.
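Step d, which places an unseen media object in the semantic space, can be sketched as follows. The data layout (three dictionaries) and the name `map_external_query` are hypothetical, and Euclidean distance stands in for the modality-specific low-level distance; for audio the DTW distance of the preprocessing module would be used instead.

```python
import math


def map_external_query(query_feat, features, owner_doc, doc_coords, k=3):
    """Estimate a semantic-space coordinate for an out-of-database object.

    features:   {object_id: low-level feature vector}, same modality only
    owner_doc:  {object_id: id of the document that contains the object}
    doc_coords: {doc_id: coordinate in the multimedia semantic space}
    """
    # k nearest neighbors of the query in low-level feature space.
    nearest = sorted(
        features, key=lambda obj: math.dist(query_feat, features[obj])
    )[:k]
    # Centroid of the host documents of those neighbors.
    points = [doc_coords[owner_doc[obj]] for obj in nearest]
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]
```

The returned centroid then plays the role of the query coordinate in step a, so the same nearest-neighbor search serves both in-database and out-of-database queries.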

e. After the retrieval results are returned, the user can evaluate them. The system then performs a secondary retrieval with the positive examples marked by the user: the centroid of the positive examples in the multimedia semantic space is taken as the query example, and retrieval proceeds as in step a.

3) Maintenance module: this module refines and rebuilds the multimedia semantic space, and maps multimedia objects and multimedia documents from outside the database into it. The details are as follows:

a. The system keeps a log file that records the user's feedback on each retrieval, including the user's evaluation of each returned result. The system periodically modifies the multimedia document association graph according to the log. Specifically, for each retrieval, the weight of the edge between any two multimedia documents that the user marked as positive examples is multiplied by a number less than 1, and the weight of the edge between a multimedia document marked as a positive example and one marked as a negative example is multiplied by a number greater than 1. If the retrieved items are multimedia objects, that is, the positive and negative examples marked by the user are multimedia objects, the edge weights between their host multimedia documents are modified in the same way. The multimedia semantic space is then recomputed by projection.
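The weight update in step a can be sketched for one logged retrieval as follows. The patent only requires one factor below 1 and one above 1; the values 0.9 and 1.1 and the name `apply_feedback` are hypothetical.

```python
from itertools import combinations


def apply_feedback(weights, positives, negatives, shrink=0.9, grow=1.1):
    """Update a symmetric edge-weight matrix in place from one feedback record.

    positives / negatives are the indices of the documents (or of the
    host documents of the media objects) that the user marked.
    """
    # Pull positive examples closer to each other.
    for i, j in combinations(positives, 2):
        weights[i][j] *= shrink
        weights[j][i] = weights[i][j]
    # Push positive and negative examples apart.
    for i in positives:
        for j in negatives:
            weights[i][j] *= grow
            weights[j][i] = weights[i][j]
    return weights
```

After all logged records have been applied, the shortest-path reconstruction and the projection would be rerun to obtain the refreshed semantic space.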

b. When the query example submitted by the user is a media object or multimedia document from outside the database, the system can use the user's relevance feedback to map the query example into the multimedia semantic space automatically, thereby extending the data set. Specifically, if the returned results are multimedia documents, the system first computes the centroid, in the multimedia semantic space, of the multimedia documents the user marked as positive examples; it then takes the three positive examples nearest to this centroid, computes their centroid, and uses it as the coordinate of the new query example in the multimedia semantic space. If the returned results are media objects, the system first computes the centroid of the multimedia documents to which the positive media objects belong, and then proceeds in the same way: it takes the three positive examples nearest to the centroid and uses their centroid as the coordinate of the new query example.
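The two-stage centroid in step b can be sketched as below. The function names `centroid` and `place_new_query` are hypothetical; the input is the list of semantic-space coordinates of the positive examples (or of their host documents when the results are media objects).

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of coordinates."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def place_new_query(positive_coords, keep=3):
    """Coordinate assigned to an out-of-database query after feedback."""
    first = centroid(positive_coords)
    # Keep only the positives closest to the first centroid, so that an
    # outlying positive example does not drag the new coordinate away.
    closest = sorted(positive_coords, key=lambda p: math.dist(p, first))[:keep]
    return centroid(closest)
```

With fewer than three positive examples the function simply returns their centroid, which is one reasonable reading of a case the text does not cover.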

Example:

Suppose there are 900 multimedia documents, made up of 900 images, 300 sound clips and 700 passages of text. First, the low-level features of all images are extracted, namely the RGB color histogram, the color coherence vector and Tamura texture features, and the pairwise distances between all images are computed. For the sound clips, four features are extracted (root mean square, zero-crossing rate, rolloff frequency and spectral centroid), and the pairwise distances between all audio objects are computed with the dynamic time warping (DTW) algorithm. For the text, TF/IDF vectorization is applied and the pairwise distances between text objects are computed.

Once the media object distances are available, the image, text and audio distances are each normalized. Then, for any two multimedia documents X and Y, the distances between the text, sound and image objects belonging to the two documents are found, and their maximum maxdis and minimum mindis are computed. If the two multimedia documents share only two modalities, maxdis and mindis are the maximum and minimum over those two modalities. For example, if document X contains an image, text and sound while document Y contains only an image and sound, then maxdis and mindis are the maximum and minimum of the audio distance and the image distance. The multimedia document distance is then computed as MMDdis = mindis + (0.1 + ln(0.3 * (maxdis - mindis) + 1)). If the two multimedia documents share only one modality, their distance is set to the distance between the media objects of that modality plus 0.1. For example, if document X contains only an image and sound while document Y contains only sound and text, their distance is set to the audio distance plus 0.1. If the two multimedia documents share no modality, their distance is first set to infinity; the shortest path found in a later step then serves as the distance between them.

Once the multimedia document distances are computed, a weighted graph is built from them. Each multimedia document corresponds to a vertex, an edge joins every pair of vertices, and the weight of an edge is the distance between the two corresponding multimedia documents. After the graph is built, every weight greater than 0.35 is set to infinity. Then, for all vertices, the pairwise shortest distances are found with Dijkstra's algorithm, and the shortest distance is taken as the new weight of the edge between two vertices. A matrix D is constructed in which D_ij is the distance from multimedia document i to multimedia document j; if this distance is infinite, D_ij is set to 1. D is then projected by multidimensional scaling (MDS) to obtain a 20-dimensional multimedia semantic space, in which each multimedia document has a 20-dimensional coordinate. Note that the construction of the multimedia semantic space described above is carried out offline.
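As a worked illustration of the distance formula with the constants of this example, consider two hypothetical documents that share an image and a sound, with normalized image distance 0.2 and audio distance 0.5. These two values are chosen only for illustration and do not come from the patent.

```latex
\begin{aligned}
\mathit{mindis} &= 0.2, \qquad \mathit{maxdis} = 0.5,\\
\mathit{MMDdis} &= 0.2 + \bigl(0.1 + \ln(0.3 \times (0.5 - 0.2) + 1)\bigr)\\
                &= 0.3 + \ln(1.09) \approx 0.386 .
\end{aligned}
```

Since 0.386 exceeds the threshold of 0.35, the direct edge between these two documents would be set to infinity, and their final distance would be the length of the shortest path through other documents, if such a path exists.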

Fig. 2 shows the first 9 results returned when the user queries images by submitting the sound of a car engine. The retrieval process is as follows. When the user submits the car engine sound as the query example, the system first finds the coordinate in the multimedia semantic space of the multimedia document to which this audio file belongs. It then sorts all multimedia documents in the database in ascending order of their distance to that document. Going from nearest to farthest, it checks whether each multimedia document contains an image; if so, the image is returned to the user as a result, and if not, the system moves on to the next multimedia document, until the number of returned images reaches the number specified by the user. As Fig. 2 shows, the query results are quite accurate, which indicates that the proposed method can effectively bridge the semantic gap, understands multimedia semantics well, and achieves high accuracy. Moreover, although the submitted query example is an audio clip and the returned results are images, the query example and the results are semantically consistent, which indicates that the invention has good cross-media retrieval capability.
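The retrieval walk described above (rank documents by distance, then collect objects of the target modality until enough have been found) can be sketched as follows. The dictionary layout and the name `retrieve` are hypothetical.

```python
import math


def retrieve(query_coord, doc_coords, doc_media, target, limit):
    """Return up to `limit` objects of modality `target`, nearest first.

    doc_coords: {doc_id: coordinate in the multimedia semantic space}
    doc_media:  {doc_id: {modality: [object ids]}}
    """
    if limit <= 0:
        return []
    # Rank all documents by Euclidean distance to the query coordinate.
    ranked = sorted(
        doc_coords, key=lambda doc: math.dist(query_coord, doc_coords[doc])
    )
    results = []
    for doc in ranked:
        # Documents without the target modality are skipped.
        for obj in doc_media[doc].get(target, []):
            results.append(obj)
            if len(results) == limit:
                return results
    return results
```

For the query of Fig. 2, `query_coord` would be the coordinate of the document containing the engine sound, `target` would be the image modality, and `limit` would be 9.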

As the above example shows, because the invention uses a multimodal information fusion mechanism to understand multimedia semantics, it understands those semantics more accurately than traditional content-based multimedia retrieval and achieves higher retrieval accuracy. The invention also supports cross-media retrieval: a query example of any modality can be used to retrieve results of any modality (for example, retrieving images with a sound). It is therefore more powerful than traditional content-based multimedia retrieval.

Claims

1. A cross-media retrieval method based on multimodal information fusion analysis, characterized in that it comprises the following steps:

1) multimedia semantics are understood through multimodal information fusion analysis;

2) the user submits a media object, either already in the database or from outside it, as the query example for retrieval;

3) a secondary retrieval is carried out according to the user's relevance feedback;

4) the multimedia semantic space is maintained according to the user's relevance feedback.
2. The cross-media retrieval method based on multimodal information fusion analysis according to claim 1, characterized in that the understanding of multimedia semantics through multimodal information fusion analysis comprises the following steps:

1) for all audio clips in the database, extract four features: root mean square, rolloff frequency, zero-crossing rate and spectral centroid; compute the pairwise distances between all audio clips with the dynamic time warping algorithm, and normalize all distances;

2) for all image objects in the database, extract color and texture features, compute the pairwise Euclidean distances between all image objects, and normalize all distances;

3) vectorize all text media objects in the database with the term frequency/inverse document frequency method, compute the pairwise distances between all text media objects, and normalize all distances;

4) fuse, by a nonlinear method, the information carried by the audio, text and image objects in each multimedia document, thereby obtaining the pairwise distances between multimedia documents;

5) build a multimedia document association graph, in which each multimedia document is a vertex and every pair of vertices is joined by a weighted edge whose weight is the distance, obtained in step 4, between the two corresponding multimedia documents;

6) reconstruct the multimedia document association graph: first set a threshold and set to infinity the weight of every edge whose weight exceeds the threshold; then, for every edge, take the length of the shortest path between its two endpoints as its new weight;

7) project the multimedia document association graph into the multimedia semantic space by multidimensional scaling; this space preserves the topological relations of the association graph, every multimedia document has a unique coordinate in the space and is indexed by that coordinate, and every media object is indexed by the coordinate of the multimedia document to which it belongs.
3. The cross-media retrieval method based on multimodal information fusion analysis according to claim 1, characterized in that, when the user submits a media object already in the database as the query example, retrieval proceeds as follows: first find the coordinate of this media object in the multimedia semantic space; then, using the coordinates of all media objects in the multimedia semantic space, compute the Euclidean distance between the query example and every other media object; and, according to this distance, sort all media objects and return the nearest media objects of the target modality.
4. The cross-media retrieval method based on multimodal information fusion analysis according to claim 1, characterized in that, when the user submits a media object from outside the database as the query example, retrieval proceeds as follows:

1) find all media objects in the database that have the same modality as the query example, and compute the low-level feature distance between each of them and the query example;

2) according to the low-level feature distance, find the k media objects in the database closest to the query example, take the centroid of these media objects in the multimedia semantic space as the query example, and carry out cross-media retrieval according to the method of claim 3.
5. The cross-media retrieval method based on multimodal information fusion analysis according to claim 1, characterized in that the secondary retrieval according to the user's relevance feedback proceeds as follows: after the query results are returned, the user evaluates them and marks the results he or she approves of; the system takes the centroid, in the multimedia semantic space, of the media objects the user marked as positive examples as the query example, computes the Euclidean distance between it and every other media object in the multimedia semantic space, sorts all media objects by this distance, and returns the nearest media objects of the target modality.
6. The cross-media retrieval method based on multimodal information fusion analysis according to claim 1, characterized in that the maintenance of the multimedia semantic space according to the user's relevance feedback comprises the following steps:

1) according to the history of user relevance feedback, periodically modify the multimedia document association graph and rebuild the multimedia semantic space so that it reflects multimedia semantic relations more accurately;

2) according to the user's relevance feedback, map query examples from outside the database into the multimedia semantic space, thereby updating the database.