CN1477566A - Method for making video search of scenes based on contents - Google Patents

Method for making video search of scenes based on contents

Info

Publication number
CN1477566A
CN1477566A CNA031501265A CN03150126A
Authority
CN
China
Prior art date
Legal status: Granted
Application number
CNA031501265A
Other languages
Chinese (zh)
Other versions
CN1240014C (en)
Inventor
董庆杰
彭宇新
郭宗明
Current Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Inst Of Computer Science & Technology Peking University
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Inst Of Computer Science & Technology Peking University
Priority date
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING and Inst Of Computer Science & Technology Peking University
Priority to CN 03150126 (granted as CN1240014C)
Publication of CN1477566A
Application granted
Publication of CN1240014C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for content-based video retrieval of shots. It applies fuzzy clustering analysis to shot retrieval and, compared with existing methods, achieves higher accuracy while maintaining fast retrieval speed.

Description

Method for performing content-based video retrieval on shots
Technical Field
The invention belongs to the technical field of video retrieval, and particularly relates to a method for performing content-based video retrieval on shots.
Background
With significant technological advances in the production, storage, and dissemination of multimedia data, digital video has become an integral part of people's daily lives. The problem people now face is no longer a lack of multimedia content, but how to find the information they need in a multimedia world as vast as the sea. At present, traditional keyword-based video retrieval cannot meet the demands of large-scale video retrieval because of its limited descriptive power, strong subjectivity, and reliance on manual annotation. To make multimedia data easier to search, content-based video analysis and retrieval has been a hot research topic since the 1990s, and the gradual formulation and refinement of the multimedia content description interface MPEG-7 has further promoted the development of content-based video retrieval technology.
In the prior art, as described in the document "A New Approach to Retrieve Video by Example Video Clip" [X.M. Liu, Y.T. Zhuang, and Y.H. Pan, ACM Multimedia, pp. 41-44, 1999], the general approach to video retrieval is to first perform shot boundary detection, with the shot as the basic structural unit and retrieval unit of the video sequence; key frames are then extracted within each shot to represent its content, and low-level features such as color and texture are extracted from the key frames for indexing and retrieval of the shots. Content-based shot retrieval is thus reduced to content-based image retrieval. The problem with this class of methods is that a shot is a temporally continuous sequence of images, and they do not fully exploit the temporal and motion information present in the video. In addition, the document "An efficient algorithm for video sequence matching using the modified Hausdorff distance and the directed divergence" [S.H. Kim and R.-H. Park, IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 7, pp. 592-596, 2002] extracts key frames using the directed divergence between frames and matches shots with the modified Hausdorff distance. Because two thresholds are set when extracting key frames (a threshold on the similarity to the previous frame, and a threshold on the similarity between the current frame and the previous key frame), and a key frame is generated only when both conditions are met simultaneously, the accuracy of key-frame extraction suffers, and with it the accuracy of querying. Moreover, that method uses the YUV color space commonly used in video as its visual feature, which agrees less well with human visual perception than the HSV color space.
Disclosure of Invention
In view of the shortcomings of existing shot retrieval methods, the invention aims to provide a method for performing content-based video retrieval on shots that, building on the prior art, greatly improves the accuracy of content-based shot retrieval while maintaining fast retrieval speed, so that shot retrieval technology can play its full role in today's networked information society.
The purpose of the invention is realized as follows: a method for content-based video retrieval of shots, comprising the steps of:
(1) first, segmenting the video database into shots, taking the shot as the basic structural unit and retrieval unit of the video;
(2) calculating the similarity between pairs of frame images and establishing a fuzzy similarity matrix R as follows: when i = j, let $r_{ij} = 1$; when i ≠ j, let $r_{ij}$ be the similarity between $x_i$ and $y_j$;
(3) calculating the equivalence matrix $\hat{R}$ of the fuzzy similarity matrix R by the transitive closure method;
(4) setting a threshold λ to determine the cut set and performing fuzzy clustering on the transitive closure matrix $\hat{R}$ of R, computing $[x] = \{\, y \mid \hat{R}(x, y) \ge \lambda \,\}$; each set $[x]$ is an equivalence class of the fuzzy clustering, and since all frames within an equivalence class are similar, any frame of each set can be taken as a key frame;
(5) representing shot $s_i$ by its key frames $\{r_{i1}, r_{i2}, \ldots, r_{ik}\}$ and measuring the similarity between two shots by the similarity between their key-frame sets.
Further, the method for segmenting the video database into shots in step (1) is preferably a spatio-temporal slice algorithm. The similarity between $x_i$ and $y_j$ in step (2) can be computed from the intersection of the two image histograms:
$$\mathrm{Intersect}(x_i, y_j) = \frac{1}{A(x_i, y_j)} \sum_h \sum_s \sum_v \min\{H_i(h,s,v),\, H_j(h,s,v)\}$$
$$A(x_i, y_j) = \min\Big\{\sum_h \sum_s \sum_v H_i(h,s,v),\ \sum_h \sum_s \sum_v H_j(h,s,v)\Big\}$$
where $H_i(h,s,v)$ is a histogram in HSV color space: the H, S, V components are quantized into an 18 × 3 × 3 three-dimensional space, and the 162 normalized bin values are used as the color feature. $\mathrm{Intersect}(x_i, y_j)$, the intersection of the two histograms, determines the similarity of two key frames and is normalized to between 0 and 1 by $A(x_i, y_j)$.
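As an illustration of this feature extraction, the following is a minimal Python sketch, assuming each frame is supplied as a NumPy array already converted to HSV with H in [0, 360) and S, V in [0, 1]; the array layout and value ranges are assumptions of the sketch, not part of the invention:

```python
import numpy as np

def hsv_histogram(hsv_frame: np.ndarray) -> np.ndarray:
    """Quantize H, S, V into 18 x 3 x 3 bins; return the 162 normalized values."""
    h = hsv_frame[..., 0].ravel()
    s = hsv_frame[..., 1].ravel()
    v = hsv_frame[..., 2].ravel()
    hist, _ = np.histogramdd(
        np.stack([h, s, v], axis=1),
        bins=(18, 3, 3),
        range=((0.0, 360.0), (0.0, 1.0), (0.0, 1.0)),
    )
    return (hist / hist.sum()).ravel()  # 162-dimensional color feature
```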
Still further, the transitive closure method for computing the equivalence matrix $\hat{R}$ of the fuzzy similarity matrix R in step (3) may employ the squaring method:
$$R \rightarrow R^2 \rightarrow (R^2)^2 \rightarrow \cdots \rightarrow R^{2^k} = \hat{R},$$
whose time complexity is $O(n^3 \log_2 n)$; when n is very large this dominates the total computation time, so the composition of matrices is computed with the optimal fuzzy clustering algorithm based on connected components of a graph, with the recursion:
$$r_{ij}^{(0)} = r_{ij}, \qquad 0 \le i, j \le n$$
$$r_{ij}^{(k)} = \max\big\{\, r_{ij}^{(k-1)},\ \min\big[\, r_{ik}^{(k-1)},\, r_{kj}^{(k-1)} \,\big] \big\}, \qquad 0 \le i, j \le n;\ 0 \le k \le n$$
The time complexity T(n) of this algorithm satisfies $O(n) \le T(n) \le O(n^2)$.
To better achieve the object of the invention, the fuzzy clustering performed on $\hat{R}$ during shot retrieval comprises the following steps:
(1) determining the n samples $X = (x_1, \ldots, x_n)$, the fuzzy similarity relation R, and a cut threshold α;
(2) transforming R into an equivalence matrix by the following computation:
$$R \circ R = R^2, \quad R^2 \circ R^2 = R^4, \quad \ldots, \quad R^{2^k} \circ R^{2^k} = R^{2^{k+1}}$$
until some k satisfies $R^{2^k} = R^{2^{k+1}}$;
in the above formulas, $R \circ R$ denotes the composition of fuzzy relations; given that R is a similarity matrix, it has been proved that such a k must exist, with $k \le \log n$;
(3) computing the sets $[x] = \{\, y \mid \hat{R}(x, y) \ge \alpha \,\}$; each $[x]$ is a fuzzy cluster, and the algorithm ends;
After fuzzy clustering analysis of the n samples, several equivalence classes are obtained, and one sample is selected from each equivalence class as a key frame. The similarity measure between two shots then becomes a similarity measure between their key-frame sets.
In step (5) of the method, the similarity of shots $s_i$ and $s_j$ may be defined as
$$\mathrm{Sim}(s_i, s_j) = \tfrac{1}{2}\big\{ M(s_i, s_j) + \hat{M}(s_i, s_j) \big\},$$
where M is the maximum similarity between key frames and $\hat{M}$ the second-largest similarity between key frames:
$$M(s_i, s_j) = \max_{p}\,\max_{q}\,\{\mathrm{Intersect}(r_{ip}, r_{jq})\}, \qquad \hat{M}(s_i, s_j) = \widehat{\max_{p}}\,\max_{q}\,\{\mathrm{Intersect}(r_{ip}, r_{jq})\},$$
with $\widehat{\max}$ denoting the second-largest value, and $\mathrm{Intersect}(r_i, r_j)$ and $A(r_i, r_j)$ defined over the HSV histograms of the key frames exactly as in step (2).
The invention has the following effect: the method for content-based video retrieval of shots achieves higher accuracy while maintaining fast retrieval speed.
The invention achieves such remarkable technical effects because it divides the shot content into several equivalence classes by fuzzy clustering analysis; the equivalence classes describe the changes of the shot content well, and the similarity between shots is expressed as the similarity between key-frame combinations. The inter-shot similarity measure also accounts for a weakness of representing key frames with HSV color histograms: two key frames with similar color distributions are considered similar even when their contents differ. The average of the maximum and second-largest similarity values is used to enhance the robustness of the algorithm, and the comparative experimental results confirm the effectiveness of the proposed method.
Drawings
FIG. 1 is a schematic flow diagram of the method for content-based video retrieval of shots;
FIG. 2 is a diagram of an example of 7 semantic classes for shot retrieval in experimental comparison;
FIG. 3 is a schematic diagram of the search result of the method of the present invention for swimming shots.
Detailed Description
FIG. 1 shows the general framework of the invention as a flow chart of its steps. As shown in FIG. 1, the shot retrieval method based on fuzzy clustering analysis comprises the following steps:
1. Shot segmentation
First, the video database is segmented into shots using a spatio-temporal slice algorithm, and the shot is taken as the basic structural unit and retrieval unit of the video. A detailed description of the spatio-temporal slice algorithm can be found in the document "Video Partitioning by Temporal Slice Coherency" [C.W. Ngo, T.C. Pong, and R.T. Chin, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 8, pp. 941-953, August 2001].
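The full algorithm of Ngo et al. analyzes the coherency of slice patterns; the following Python sketch shows only the underlying slice construction and a crude discontinuity test, with the frame source and the threshold as illustrative assumptions:

```python
import numpy as np

def build_slices(frames):
    """Stack the middle row and middle column of every frame over time.

    frames: iterable of (H, W, 3) arrays; returns the horizontal slice
    (T, W, 3) and the vertical slice (T, H, 3). Each shot shows up as a
    band of coherent texture; cuts show up as discontinuities between rows.
    """
    rows, cols = [], []
    for f in frames:
        rows.append(f[f.shape[0] // 2, :, :])
        cols.append(f[:, f.shape[1] // 2, :])
    return np.stack(rows), np.stack(cols)

def rough_cut_candidates(slice_img, thresh=30.0):
    """Flag frame indices where the slice changes abruptly (candidate cuts)."""
    diff = np.abs(np.diff(slice_img.astype(float), axis=0)).mean(axis=(1, 2))
    return np.nonzero(diff > thresh)[0] + 1
```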
2. Establishing fuzzy similarity matrix R
The fuzzy similarity matrix R between the frames within a shot is established as follows: when i = j, let $r_{ij} = 1$; when i ≠ j, let $r_{ij}$ be the similarity between $x_i$ and $y_j$, computed as:
$$\mathrm{Intersect}(x_i, y_j) = \frac{1}{A(x_i, y_j)} \sum_h \sum_s \sum_v \min\{H_i(h,s,v),\, H_j(h,s,v)\}$$
$$A(x_i, y_j) = \min\Big\{\sum_h \sum_s \sum_v H_i(h,s,v),\ \sum_h \sum_s \sum_v H_j(h,s,v)\Big\}$$
$H_i(h,s,v)$ is a histogram in HSV color space: the H, S, V components are quantized into an 18 × 3 × 3 three-dimensional space, and the 162 normalized bin values are used as the color feature values. $\mathrm{Intersect}(x_i, y_j)$, the intersection of the two histograms, determines the similarity of two key frames and is normalized to between 0 and 1 by $A(x_i, y_j)$.
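A sketch of this similarity computation and of assembling the fuzzy similarity matrix R for the frames of one shot, reusing the hsv_histogram helper sketched earlier (the frame list is an assumed input):

```python
import numpy as np

def intersect(hi: np.ndarray, hj: np.ndarray) -> float:
    """Normalized histogram intersection Intersect(x_i, y_j) in [0, 1]."""
    a = min(hi.sum(), hj.sum())              # A(x_i, y_j)
    return float(np.minimum(hi, hj).sum() / a)

def fuzzy_similarity_matrix(frames) -> np.ndarray:
    """R with r_ii = 1 and r_ij the histogram intersection of frames i and j."""
    hists = [hsv_histogram(f) for f in frames]
    n = len(hists)
    r = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            r[i, j] = r[j, i] = intersect(hists[i], hists[j])
    return r
```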
3. Solving the transitive closure of the similarity matrix R to obtain the equivalence matrix $\hat{R}$
In this embodiment, the transitive closure of the similarity matrix is solved by the squaring method:
$$R \rightarrow R^2 \rightarrow (R^2)^2 \rightarrow \cdots \rightarrow R^{2^k} = \hat{R}$$
Its time complexity is $O(n^3 \log_2 n)$; when n is very large this inevitably affects the total computation time. Therefore the composition of matrices is computed with the optimal fuzzy clustering algorithm based on connected components of a graph, with the recursion:
$$r_{ij}^{(0)} = r_{ij}, \qquad 0 \le i, j \le n$$
$$r_{ij}^{(k)} = \max\big\{\, r_{ij}^{(k-1)},\ \min\big[\, r_{ik}^{(k-1)},\, r_{kj}^{(k-1)} \,\big] \big\}, \qquad 0 \le i, j \le n;\ 0 \le k \le n$$
The time complexity T(n) of this algorithm satisfies $O(n) \le T(n) \le O(n^2)$.
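A sketch of the squaring method with the max-min composition; it iterates at most log2(n) times because each squaring doubles the exponent. The graph connected-component algorithm preferred in the text would replace this loop, and this direct version is shown only to make the composition concrete:

```python
import numpy as np

def max_min_compose(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """(a o b)[i, j] = max_k min(a[i, k], b[k, j])."""
    return np.max(np.minimum(a[:, :, None], b[None, :, :]), axis=1)

def transitive_closure(r: np.ndarray) -> np.ndarray:
    """Square R until R^(2^k) = R^(2^(k+1)), giving the equivalence matrix."""
    while True:
        r2 = max_min_compose(r, r)
        if np.array_equal(r2, r):
            return r
        r = r2
```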
4. Setting a threshold λ to determine the cut set and performing fuzzy clustering on the transitive closure matrix $\hat{R}$ of R.
In this embodiment, the specific method is as follows:
(1) determining the n samples $X = (x_1, \ldots, x_n)$, the fuzzy similarity relation R, and a cut threshold α;
(2) transforming R into an equivalence matrix by the following computation:
$$R \circ R = R^2, \quad R^2 \circ R^2 = R^4, \quad \ldots, \quad R^{2^k} \circ R^{2^k} = R^{2^{k+1}}$$
until some k satisfies $R^{2^k} = R^{2^{k+1}}$;
in the above formulas, $R \circ R$ denotes the composition of fuzzy relations; given that R is a similarity matrix, it has been proved that such a k must exist, with $k \le \log n$;
(3) computing the sets $[x] = \{\, y \mid \hat{R}(x, y) \ge \alpha \,\}$; each $[x]$ is a fuzzy cluster, and the algorithm ends.
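A sketch of the alpha-cut step: because $\hat{R}$ is a fuzzy equivalence relation, the cut sets [x] partition the frames, and one representative per class serves as a key frame (picking the first member is an arbitrary choice of the sketch):

```python
import numpy as np

def equivalence_classes(r_hat: np.ndarray, alpha: float):
    """Group frame indices into the equivalence classes [x] of the alpha-cut."""
    n = r_hat.shape[0]
    classes, seen = [], set()
    for x in range(n):
        if x in seen:
            continue
        cls = [y for y in range(n) if r_hat[x, y] >= alpha]
        seen.update(cls)
        classes.append(cls)
    return classes

def key_frames(r_hat: np.ndarray, alpha: float):
    """Pick one frame index per equivalence class as a key frame."""
    return [cls[0] for cls in equivalence_classes(r_hat, alpha)]
```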
5. After the shot key frames have been obtained by the fuzzy clustering analysis, shot retrieval is performed on the basis of the key frames. Shot $s_i$ is represented by its key frames $\{r_{i1}, r_{i2}, \ldots, r_{ik}\}$, and the similarity of shots $s_i$ and $s_j$ is defined as
$$\mathrm{Sim}(s_i, s_j) = \tfrac{1}{2}\big\{ M(s_i, s_j) + \hat{M}(s_i, s_j) \big\},$$
where
$$M(s_i, s_j) = \max_{p}\,\max_{q}\,\{\mathrm{Intersect}(r_{ip}, r_{jq})\}, \qquad \hat{M}(s_i, s_j) = \widehat{\max_{p}}\,\max_{q}\,\{\mathrm{Intersect}(r_{ip}, r_{jq})\}$$
and $\widehat{\max}$ denotes the second-largest value. Because HSV color histograms are used here to represent key frames, two key frames with similar color distributions are considered similar even when their contents differ; to overcome this drawback, the average of M and $\hat{M}$ is taken to enhance the robustness of the algorithm. $H_i(h,s,v)$ is the histogram of the HSV color space, computed over the 18 × 3 × 3 quantization of the H, S, V components with the 162 normalized values as color feature values, and $\mathrm{Intersect}(r_i, r_j)$, the intersection of two histograms normalized by $A(r_i, r_j)$ as above, is used to determine the similarity of two key frames.
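A sketch of this similarity measure over two shots' key-frame histograms, reusing the intersect helper sketched above; falling back to the maximum when a shot has a single key frame (so no second-largest value exists) is an assumption of the sketch:

```python
def shot_similarity(hists_i, hists_j) -> float:
    """Sim(s_i, s_j) = (M + M-hat) / 2, where M is the largest of the
    per-key-frame maxima max_q Intersect(r_ip, r_jq) and M-hat is the
    second-largest of them."""
    per_p = sorted(
        (max(intersect(hi, hj) for hj in hists_j) for hi in hists_i),
        reverse=True,
    )
    m_hat = per_p[1] if len(per_p) > 1 else per_p[0]
    return 0.5 * (per_p[0] + m_hat)
```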
The following experimental results show that the method performs better than the existing methods while retrieving quickly, verifying the effectiveness of the fuzzy clustering analysis algorithm in shot retrieval.
The experimental data for shot retrieval was a sports program from 2002 recorded from television, totaling 41 minutes, 777 shots, and 62,132 frames. It includes a variety of sports, such as various ball games, weightlifting, and swimming, as well as interspersed advertising segments. We selected 7 semantic classes as query shots: weightlifting, volleyball, swimming, judo, rowing, gymnastics, and soccer, as shown in FIG. 2.
To verify the effectiveness of the present invention, we tested the following 3 methods for experimental comparison:
(1) a commonly used shot retrieval algorithm using the first frame of each shot as a key frame;
(2) the algorithm described in the document "An efficient algorithm for video sequence matching using the modified Hausdorff distance and the directed divergence" (S.H. Kim and R.-H. Park, IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 7, pp. 592-596, 2002);
(3) shot retrieval using the key frames obtained by the fuzzy clustering analysis algorithm of the invention (using only color features).
the first 3 methods all use only color features, so the final experimental result can prove the superiority of the method disclosed by the invention from the measurement method of the lens similarity. Fig. 3 shows a user interface of the experimental program, the upper line on the right side is a browsing area of the query video, the 1 st key frame of each shot in the video is displayed to represent each shot, the user can select the shot to be queried to search, and the lower line on the right side is a query result area. Fig. 3 is a view of the 1 st shot selected in the upper row, which is a swimming shot, represented by the first frame image 022430.bmp of the shot, the query results are arranged from large to small (left to right, top to bottom) with the greatest weight of similarity calculated according to the method of the present invention. The lower left side is a simple playing period, and the double-click retrieval result image can play the video corresponding to the corresponding lens.
The experiment used two evaluation criteria from the MPEG-7 standardization activities: the Average Normalized Modified Retrieval Rank (ANMRR) and the Average Recall (AR). AR is similar to the conventional recall; compared with the conventional precision, ANMRR reflects not only the proportion of correct results but also the rank order of the correct results. The smaller the ANMRR, the higher the correct shots are ranked; the larger the AR, the greater the proportion of similar shots retrieved within the top K query results (K being the truncation value of the result list) relative to all similar shots.
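For reference, a sketch of the two measures for a single query follows; it uses one common formulation of NMRR, with ground-truth items ranked below the cutoff K penalized at rank 1.25K, and the exact constants should be checked against the MPEG-7 documents:

```python
def recall_at_k(result_ids, relevant_ids, k):
    """AR component: fraction of all relevant shots found in the top K."""
    return len(set(result_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def nmrr(result_ids, relevant_ids, k):
    """Normalized modified retrieval rank for one query; ANMRR averages this
    over all queries. Smaller is better."""
    ng = len(relevant_ids)
    ranks = [
        result_ids.index(rid) + 1 if rid in result_ids[:k] else 1.25 * k
        for rid in relevant_ids
    ]
    avr = sum(ranks) / ng
    return (avr - 0.5 - ng / 2) / (1.25 * k - 0.5 - ng / 2)
```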
TABLE 1. Comparative experimental results of the invention and the two existing methods

Classification   Method 1          Method 2          Method 3
                 AR      ANMRR     AR      ANMRR     AR      ANMRR
Weightlifting    0.8824  0.3098    0.8824  0.1539    0.9412  0.2186
Volleyball       0.6333  0.4974    0.7895  0.3264    0.8556  0.3279
Swimming         0.8400  0.2676    0.8250  0.3164    0.9200  0.2175
Judo             0.7000  0.4310    0.8214  0.2393    0.8000  0.3093
Rowing           0.8750  0.3407    0.6875  0.3570    0.8125  0.2223
Gymnastics       0.7857  0.3445    0.9600  0.1759    0.7857  0.2056
Soccer           0.5789  0.4883    0.6889  0.2815    0.8421  0.2614
Mean             0.7565  0.3827    0.8078  0.2642    0.8510  0.2518
As can be seen from Table 1, the method of the invention achieves better results than the two existing algorithms on both AR and ANMRR, confirming the effectiveness of applying fuzzy clustering analysis to shot retrieval. The method divides the shot content into several equivalence classes by fuzzy clustering analysis; the equivalence classes describe the changes of the shot content well, and the similarity between shots is expressed as the similarity between key-frame combinations. The inter-shot similarity measure also accounts for the weakness of representing key frames with HSV color histograms, namely that two key frames with similar color distributions are considered similar even when their contents differ, and the average of the maximum and second-largest similarity values enhances the robustness of the algorithm. The comparative experimental results confirm the effectiveness of the proposed method. In addition, on a PC with a 500 MHz Pentium III CPU and 256 MB of memory, the average retrieval time of the algorithm is 22.557 seconds, which is a fast retrieval speed for a video library of 777 shots.

Claims (6)

1. A method for performing content-based video retrieval on shots, the method comprising the steps of:
(1) first, segmenting the video database into shots, taking the shot as the basic structural unit and retrieval unit of the video;
(2) calculating the similarity between pairs of frame images and establishing a fuzzy similarity matrix R as follows: when i = j, let $r_{ij} = 1$; when i ≠ j, let $r_{ij}$ be the similarity between $x_i$ and $y_j$;
(3) calculating the equivalence matrix $\hat{R}$ of the fuzzy similarity matrix R by the transitive closure method;
(4) setting a threshold λ to determine the cut set and performing fuzzy clustering on the transitive closure matrix $\hat{R}$ of R, computing $[x] = \{\, y \mid \hat{R}(x, y) \ge \lambda \,\}$; each set $[x]$ is an equivalence class, and any frame of each set can be taken as a key frame;
(5) representing shot $s_i$ by its key frames $\{r_{i1}, r_{i2}, \ldots, r_{ik}\}$, the similarity between two shots being measured by the similarity between their key-frame sets.
2. The method for content-based video retrieval of shots as claimed in claim 1, wherein in step (1) the method for segmenting the video database into shots is a spatio-temporal slice algorithm.
3. The method for content-based video retrieval of shots as claimed in claim 1, wherein in step (2) the similarity between $x_i$ and $y_j$ is computed from the intersection of the two image histograms:
$$\mathrm{Intersect}(x_i, y_j) = \frac{1}{A(x_i, y_j)} \sum_h \sum_s \sum_v \min\{H_i(h,s,v),\, H_j(h,s,v)\}$$
$$A(x_i, y_j) = \min\Big\{\sum_h \sum_s \sum_v H_i(h,s,v),\ \sum_h \sum_s \sum_v H_j(h,s,v)\Big\}$$
where $H_i(h,s,v)$ is a histogram in HSV color space: the H, S, V components are quantized into an 18 × 3 × 3 three-dimensional space, and the 162 normalized values are used as the color feature values; $\mathrm{Intersect}(x_i, y_j)$, the intersection of the two histograms, determines the similarity of two key frames and is normalized to between 0 and 1 by $A(x_i, y_j)$.
4. The method for content-based video retrieval of shots as claimed in claim 1, wherein the transitive closure method for computing the equivalence matrix $\hat{R}$ of the fuzzy similarity matrix R in step (3) employs the squaring method:
$$R \rightarrow R^2 \rightarrow (R^2)^2 \rightarrow \cdots \rightarrow R^{2^k} = \hat{R},$$
whose time complexity is $O(n^3 \log_2 n)$; when n is very large this affects the total computation time, so the composition of matrices is computed with the optimal fuzzy clustering algorithm based on connected components of a graph, with the recursion:
$$r_{ij}^{(0)} = r_{ij}, \qquad 0 \le i, j \le n$$
$$r_{ij}^{(k)} = \max\big\{\, r_{ij}^{(k-1)},\ \min\big[\, r_{ik}^{(k-1)},\, r_{kj}^{(k-1)} \,\big] \big\}, \qquad 0 \le i, j \le n;\ 0 \le k \le n$$
the time complexity T(n) of this algorithm satisfying $O(n) \le T(n) \le O(n^2)$.
5. The method for content-based video retrieval of shots as claimed in claim 1, wherein the fuzzy clustering performed on $\hat{R}$ comprises the following steps:
(1) determining the n samples $x = (x_1, \ldots, x_n)$, the fuzzy similarity relation R, and a cut threshold α;
(2) transforming R into an equivalence matrix by the following computation:
$$R \circ R = R^2, \quad R^2 \circ R^2 = R^4, \quad \ldots, \quad R^{2^k} \circ R^{2^k} = R^{2^{k+1}}$$
until some k satisfies $R^{2^k} = R^{2^{k+1}}$;
in the above formulas, $R \circ R$ denotes the composition of fuzzy relations; given that R is a similarity matrix, it has been proved that such a k must exist, with $k \le \log n$;
(3) computing the sets $[x] = \{\, y \mid \hat{R}(x, y) \ge \alpha \,\}$; each $[x]$ is a fuzzy cluster, and the algorithm ends;
after fuzzy clustering analysis of the n samples, several equivalence classes are obtained, and one sample is selected from each equivalence class as a key frame, so that the similarity measure between two shots becomes a similarity measure between their key-frame sets.
6. The method for content-based video retrieval of shots as claimed in claim 1 or 5, wherein the similarity of shots $s_i$ and $s_j$ is defined as
$$\mathrm{Sim}(s_i, s_j) = \tfrac{1}{2}\big\{ M(s_i, s_j) + \hat{M}(s_i, s_j) \big\},$$
where M is the maximum similarity between key frames and $\hat{M}$ the second-largest similarity between key frames, with
$$M(s_i, s_j) = \max_{p}\,\max_{q}\,\{\mathrm{Intersect}(r_{ip}, r_{jq})\}, \qquad \hat{M}(s_i, s_j) = \widehat{\max_{p}}\,\max_{q}\,\{\mathrm{Intersect}(r_{ip}, r_{jq})\}$$
$$\mathrm{Intersect}(r_i, r_j) = \frac{1}{A(r_i, r_j)} \sum_h \sum_s \sum_v \min\{H_i(h,s,v),\, H_j(h,s,v)\}$$
$$A(r_i, r_j) = \min\Big\{\sum_h \sum_s \sum_v H_i(h,s,v),\ \sum_h \sum_s \sum_v H_j(h,s,v)\Big\}.$$
CN 03150126 2003-07-18 2003-07-18 Method for making video search of scenes based on contents Expired - Fee Related CN1240014C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 03150126 CN1240014C (en) 2003-07-18 2003-07-18 Method for making video search of scenes based on contents


Publications (2)

Publication Number Publication Date
CN1477566A true CN1477566A (en) 2004-02-25
CN1240014C CN1240014C (en) 2006-02-01

Family

Family ID: 34156438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 03150126 Expired - Fee Related CN1240014C (en) 2003-07-18 2003-07-18 Method for making video search of scenes based on contents

Country Status (1)

Country Link
CN (1) CN1240014C (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201822B (en) * 2006-12-11 2010-06-23 南京理工大学 Method for searching visual lens based on contents

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100399804C (en) * 2005-02-15 2008-07-02 乐金电子(中国)研究开发中心有限公司 Mobile communication terminal capable of briefly offering activity video and its abstract offering method
CN100573523C (en) * 2006-12-30 2009-12-23 中国科学院计算技术研究所 A kind of image inquiry method based on marking area
CN101211355B (en) * 2006-12-30 2010-05-19 中国科学院计算技术研究所 Image inquiry method based on clustering
CN101339615B (en) * 2008-08-11 2011-05-04 北京交通大学 Method of image segmentation based on similar matrix approximation
WO2011140783A1 (en) * 2010-05-14 2011-11-17 中兴通讯股份有限公司 Method and mobile terminal for realizing video preview and retrieval
US8737808B2 (en) 2010-05-14 2014-05-27 Zte Corporation Method and mobile terminal for previewing and retrieving video
CN101968797A (en) * 2010-09-10 2011-02-09 北京大学 Inter-lens context-based video concept labeling method
CN105100894A (en) * 2014-08-26 2015-11-25 Tcl集团股份有限公司 Automatic face annotation method and system
CN105100894B (en) * 2014-08-26 2020-05-05 Tcl科技集团股份有限公司 Face automatic labeling method and system
CN104217000A (en) * 2014-09-12 2014-12-17 黑龙江斯迪克信息科技有限公司 Content-based video retrieval system
US11265317B2 (en) 2015-08-05 2022-03-01 Kyndryl, Inc. Security control for an enterprise network
US11757879B2 (en) 2015-08-05 2023-09-12 Kyndryl, Inc. Security control for an enterprise network
CN106960211A (en) * 2016-01-11 2017-07-18 北京陌上花科技有限公司 Key frame acquisition methods and device
CN106960211B (en) * 2016-01-11 2020-04-14 北京陌上花科技有限公司 Key frame acquisition method and device
CN110175267A (en) * 2019-06-04 2019-08-27 黑龙江省七星农场 A kind of agriculture Internet of Things control processing method based on unmanned aerial vehicle remote sensing technology
CN110852289A (en) * 2019-11-16 2020-02-28 公安部交通管理科学研究所 Method for extracting information of vehicle and driver based on mobile video

Also Published As

Publication number Publication date
CN1240014C (en) 2006-02-01

Similar Documents

Publication Publication Date Title
Lu et al. Color image retrieval technique based on color features and image bitmap
CN102890700B (en) Method for retrieving similar video clips based on sports competition videos
CN112418012B (en) Video abstract generation method based on space-time attention model
CN105049875B (en) A kind of accurate extraction method of key frame based on composite character and abrupt climatic change
CN1477566A (en) Method for making video search of scenes based on contents
JP2010225172A (en) Method of representing image group, descriptor of image group, searching method of image group, computer-readable storage medium, and computer system
Zhi et al. Two-stage pooling of deep convolutional features for image retrieval
CN1851710A (en) Embedded multimedia key frame based video search realizing method
CN107451200B (en) Retrieval method using random quantization vocabulary tree and image retrieval method based on same
Zheng et al. A feature-adaptive semi-supervised framework for co-saliency detection
CN106777159B (en) Video clip retrieval and positioning method based on content
Rathod et al. An algorithm for shot boundary detection and key frame extraction using histogram difference
Duan et al. Mean shift based video segment representation and applications to replay detection
CN1514644A (en) Method of proceeding video frequency searching through video frequency segment
CN100507910C (en) Method of searching lens integrating color and sport characteristics
CN102306275A (en) Method for extracting video texture characteristics based on fuzzy concept lattice
Tong et al. A unified framework for semantic shot representation of sports video
Zhou et al. An SVM-based soccer video shot classification
CN106844573B (en) Video abstract acquisition method based on manifold sorting
Priya et al. Optimized content based image retrieval system based on multiple feature fusion algorithm
CN1252647C (en) Scene-searching method based on contents
Vimina et al. Image retrieval using colour and texture features of regions of interest
Zhang et al. Unsupervised sports video scene clustering and its applications to story units detection
Mohanty et al. A frame-based decision pooling method for video classification
Jiang et al. A new video similarity measure model based on video time density function and dynamic programming

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20060201

Termination date: 20170718