CN106649440A - Approximate repeated video retrieval method incorporating global R features - Google Patents

Approximate repeated video retrieval method incorporating global R features

Info

Publication number
CN106649440A
CN106649440A (application CN201610820574.2A)
Authority
CN
China
Prior art keywords
features
video
feature
bof
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610820574.2A
Other languages
Chinese (zh)
Other versions
CN106649440B (en)
Inventor
廖开阳
王玮
郑元林
曹从军
赵凡
蔺广逢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201610820574.2A
Publication of CN106649440A
Application granted
Publication of CN106649440B
Legal status: Active
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an approximate repeated (near-duplicate) video retrieval method incorporating global R features. The method first extracts local SIFT features from the videos in a database; builds a global R feature from the coordinate information of those local SIFT features; uses the descriptor information of the local SIFT features to build a BOF retrieval model; builds a voting retrieval model on top of the BOF model; and finally applies an information fusion strategy to fuse the global geometric distribution information into the BOF model, so that near-duplicate videos can be retrieved accurately from large-scale data. By fusing the global R feature, the method incorporates global geometric distribution information into the BOF model according to the information fusion strategy and achieves accurate retrieval of near-duplicate videos in large-scale data.

Description

Near-duplicate video retrieval method fusing global R features
Technical field
The invention belongs to the technical field of video analysis and retrieval, and in particular relates to a near-duplicate video retrieval method that fuses global R features.
Background technology
With the rapid development of communication technology, video capture devices and video editing software, the number of Internet videos is growing exponentially. At the same time, video-related services such as advertising, video sharing, recommendation and monitoring stimulate the interest of online users, who take part in video-related activities such as searching, uploading, downloading and commenting.
Today, a huge number of videos are uploaded and shared on the Internet every day, and a large number of near-duplicate videos exist online. The presence of so many near-duplicate videos has given rise to many new applications, such as re-ranking of video search results, copyright protection, monitoring of online video usage, video annotation and video database cleaning. For example, in one typical scenario a website user is looking for new videos, but many of the videos in the final ranking returned by the search engine are duplicates; in another scenario, a video producer wishes to protect the copyright of their videos and prevent them from being shared on the Internet. Both of these situations require near-duplicate video retrieval technology to achieve their respective goals.
In recent years, near-duplicate video retrieval has become a research focus, and many researchers are studying this technology. At present, most existing methods follow the near-duplicate video retrieval framework below (R. Fernandez-Beltran and F. Pla, "Latent topics-based relevance feedback for video retrieval," Pattern Recognition, vol. 51, pp. 72-84, Mar. 2016): first, a video is decomposed into a series of key frames by shot-boundary detection and sampling algorithms; second, visual features such as the scale-invariant feature transform (SIFT) and local binary patterns (LBP) are extracted from these key frames, and the whole video is represented by the sequence of key-frame visual features; finally, the system computes the similarity between each video in the data set and the query video according to the visual feature sequences, and returns the titles of the videos in the data set that are most similar to the query video. In general, either temporal or spatial information can be used to assess the similarity between two videos (M. Douze, H. Jegou, and C. Schmid, "An Image-Based Approach to Video Copy Detection With Spatio-Temporal Post-Filtering," IEEE Transactions on Multimedia, vol. 12, no. 4, pp. 257-266, Jun. 2010; C.-L. Chou, H.-T. Chen, and S.-Y. Lee, "Pattern-Based Near-Duplicate Video Retrieval and Localization on Web-Scale Videos," IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382-395, Mar. 2015). In addition, some existing methods extract a single global feature for the whole video to achieve real-time retrieval, but such methods generally cannot retrieve long videos effectively (X. Zhou and L. Chen, "Structure Tensor Series-Based Large Scale Near-Duplicate Video Retrieval," IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1220-1233, 2012).
In some recent publications, the correlation between pairs of frames in two videos is also used to measure video similarity (J. Liu, Z. Huang, H. T. Shen, and B. Cui, "Correlation-Based Retrieval for Heavily Changed Near-Duplicate Videos," ACM Transactions on Information Systems, vol. 29, no. 4, Dec. 2011). A recent survey of near-duplicate video retrieval techniques can be found in (J. Liu, Z. Huang, H. Cai, H. T. Shen, N. Chong Wah, and W. Wang, "Near-Duplicate Video Retrieval: Current Research and Future Trends," ACM Computing Surveys, vol. 45, no. 4, Aug. 2013).
At present, most near-duplicate video retrieval methods are based on local features and BOF retrieval models, but these methods use only local texture information and ignore the global information of the feature points, so the precision of video retrieval is not high.
Summary of the invention
The object of the present invention is to provide a near-duplicate video retrieval method that fuses global R features, which can fuse global geometric distribution information into a BOF model according to an information fusion strategy and accurately retrieve near-duplicate videos from large-scale data.
The technical solution adopted by the present invention is a near-duplicate video retrieval method fusing global R features, implemented according to the following steps:
Step 1: extract local SIFT features from the videos in the database;
Step 2: after step 1, build global R features from the coordinate information of the obtained local SIFT features;
Step 3: after step 2 is completed, build a BOF feature model using the descriptor information of the local SIFT features;
Step 4: build a voting retrieval model based on BOF from the BOF feature model obtained in step 3;
Step 5: apply an information fusion strategy to fuse the global geometric distribution information into the BOF-based voting retrieval model built in step 4, and accurately retrieve near-duplicate videos from large-scale data.
The present invention is further characterized in that:
Step 1 is implemented as follows:
Key frames are first extracted from all the videos in the reference video library, and SIFT features are then extracted from each key frame.
Key frames are extracted by uniform sampling, taking one frame every 6 seconds.
SIFT features are extracted with the method of D.G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004; the extracted information includes the position, scale and orientation of each feature point together with its local descriptor.
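For illustration only (this sketch is not part of the original patent text), step 1 can be realised in Python roughly as follows. OpenCV's SIFT implementation stands in for the cited method of Lowe; the 6-second uniform sampling follows the text, while the function name and the returned data layout are assumptions of this sketch.

import cv2

def extract_keyframe_sift(video_path, interval_s=6.0):
    """Per key frame, return SIFT keypoint positions, scales, angles and descriptors."""
    cap = cv2.VideoCapture(video_path)
    sift = cv2.SIFT_create()
    frames = []
    t_ms = 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t_ms)       # jump to the next sampling instant
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        kps, descs = sift.detectAndCompute(gray, None)
        frames.append({
            "time_s": t_ms / 1000.0,
            "coords": [kp.pt for kp in kps],       # (x, y) positions used for the global R feature
            "scales": [kp.size for kp in kps],
            "angles": [kp.angle for kp in kps],
            "descriptors": descs,                  # 128-D descriptors used for the BOF model
        })
        t_ms += interval_s * 1000.0
    cap.release()
    return frames

The keypoint coordinates collected here feed step 2 (global R features), while the descriptors feed the BOF model of step 3.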
Step 2 is implemented as follows:
Building global R features from the coordinate information of the local SIFT features obtained in step 1 means taking the position information of the extracted SIFT features and applying an improved Radon transform to extract the global R features.
The Radon transform of a function f is obtained by integrating f along straight lines of different directions in the plane; the resulting projections are the Radon transform of f. In this way every non-zero pixel of a discrete binary image can be projected into a Radon matrix.
For an image f(x, y), where x and y are pixel coordinates, the Radon transform of f(x, y) is expressed as
T_Rf(ρ, θ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − ρ) dx dy   (1);
in formula (1), δ(·) is the Dirac delta function, also called the unit impulse function, which is zero at every point except zero and whose integral over the whole domain equals 1; θ is the angle, θ ∈ [0, π); ρ is the polar radius, ρ ∈ (−∞, ∞).
The improved Radon transform, also called the R transform in the near-duplicate video retrieval method fusing global R features of the present invention, is formulated as
R_f(θ) = ∫_{−∞}^{∞} T_Rf²(ρ, θ) dρ   (2);
in formula (2), T_Rf(ρ, θ) is the Radon transform of f(x, y).
The improved Radon transform solves the problem that the original transform is not invariant to scale, rotation and translation.
The matrix obtained from the R transform is reduced by a principal component analysis transform with the (2D)²PCA algorithm, and the resulting low-dimensional matrix is taken as the final feature, called the R feature. (2D)²PCA is the two-directional two-dimensional principal component analysis method of the document "Two-directional two-dimensional PCA for efficient face representation and recognition"; it performs principal component analysis in both the row and the column directions, which yields features with higher recognition accuracy.
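As an illustration of this step (again not part of the patent text), the sketch below computes the R transform of a key frame's keypoint occupancy map in Python. The Radon transform comes from scikit-image, and the squaring and integration over ρ follow formula (2); the max-normalisation and FFT-magnitude step used here to damp scale and rotation effects are one common choice, not necessarily the patent's exact construction, and the subsequent (2D)²PCA reduction over the resulting matrices is omitted.

import numpy as np
from skimage.transform import radon

def r_feature(coords, shape, n_angles=180):
    """coords: (x, y) SIFT keypoint positions of one key frame; shape: (height, width)."""
    occ = np.zeros(shape, dtype=float)
    for x, y in coords:                                   # binary map of keypoint locations
        i = min(max(int(round(y)), 0), shape[0] - 1)
        j = min(max(int(round(x)), 0), shape[1] - 1)
        occ[i, j] = 1.0
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sino = radon(occ, theta=theta, circle=False)          # T_Rf(rho, theta): rows indexed by rho
    r = (sino ** 2).sum(axis=0)                           # R_f(theta), formula (2)
    r /= (r.max() + 1e-12)                                # amplitude normalisation (scale robustness)
    return np.abs(np.fft.rfft(r))                         # rotation appears as a circular shift of theta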
Step 3 is implemented according to the following sub-steps:
Step 3.1: the descriptors of the SIFT features in the image library are trained with a large-scale hierarchical clustering algorithm to generate clusters.
The large-scale hierarchical clustering algorithm is a clustering algorithm.
Step 3.2: after step 3.1, the features are quantized and the BOF feature of each image is generated, as follows:
Quantizing to generate the BOF feature of each image means deciding, for every feature point of an image, which cluster center is nearest and assigning the point to that nearest cluster center; this finally produces a frequency table, i.e. the preliminary unweighted BOF. The frequency table is then weighted with tf-idf to produce the final weighted BOF feature.
Wherein, quantization method is carried out to the feature of inquiry video as follows:
The q in formula (3):Represent and quantify, RdThe d dimension datas in real number space are represented, k represents the quantity at class center, xi,j,i =1 ..., m2For ith feature in jth frame in reference video storehouse;
The tf-idf weights methods calculated per frame are specific as follows:
Wi=tfi·idfi(6);
In formula (4)~formula (6):K represents the quantity at class center;fijIt is the visual vocabulary belonging to ith feature at j-th The frequency occurred in frame of video;niIt is the sum of the reference video frame comprising the visual vocabulary belonging to ith feature;N is total Reference video number;tfiRepresent word frequency factor;idfiRepresent inverse word frequency factor;
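A minimal sketch of this quantisation and weighting is given below (illustrative, not from the patent): plain mini-batch k-means from scikit-learn stands in for the patent's large-scale hierarchical clustering, and the vocabulary size k = 10000 is an assumption of the sketch; the quantisation and tf-idf weighting follow formulas (3)-(6).

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_vocabulary(all_descriptors, k=10000):
    """Cluster the pooled SIFT descriptors of the reference library into k visual words."""
    return MiniBatchKMeans(n_clusters=k, batch_size=10000, n_init=3).fit(all_descriptors)

def compute_idf(per_frame_words, k):
    """idf_i = log(N / n_i) over the N reference frames, formula (5)."""
    n_i = np.zeros(k)
    for words in per_frame_words:
        n_i[np.unique(words)] += 1
    return np.log(len(per_frame_words) / np.maximum(n_i, 1))

def frame_bof(descriptors, vocab, idf):
    """Quantise one frame's descriptors (formula 3) and return its tf-idf weighted BOF."""
    k = vocab.n_clusters
    words = vocab.predict(descriptors)            # q(x_ij): index of the nearest cluster centre
    counts = np.bincount(words, minlength=k)      # unweighted BOF (frequency table f_ij)
    tf = counts / max(counts.sum(), 1)            # formula (4)
    return tf * idf                               # W_i = tf_i * idf_i, formula (6)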
Step 3.3: an inverted index is built for the generated BOF features, as follows:
An inverted index generally consists of two parts, a vocabulary file and an inverted-list file.
The vocabulary file records all the visual words that occur in the document set (images or video frames).
The inverted-list file records, for each visual word, information such as the positions and frequencies of that word in the document files (images or video frames); this information for all words constitutes the inverted list. For one word w_i among the n words (features) w_1 ... w_n in the vocabulary file, the inverted list over the m document files (images or video frames) d_1 ... d_m can be expressed as
w_i → d_1[f_1]⟨p_{i1}, ..., p_{if_1}⟩ ... d_m[f_m]⟨p_{i1}, ..., p_{if_m}⟩   (7);
n such records constitute a complete inverted list.
In formula (7), f_i denotes the frequency, direction and scale information.
Formula (7) gives a complete inverted index structure for word queries.
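The sketch below (illustrative only) expresses the idea of formula (7) with a plain Python dictionary; storing only per-frame frequencies, rather than the positions, directions and scales mentioned in the text, is a simplifying assumption.

from collections import defaultdict

def build_inverted_index(frames_words):
    """frames_words: {frame_id: sequence of quantised word indices for that frame}."""
    index = defaultdict(list)                 # w_i -> [(d_j, occurrences of w_i in d_j), ...]
    for frame_id, words in frames_words.items():
        seen = {}
        for w in words:
            seen[w] = seen.get(w, 0) + 1
        for w, freq in seen.items():
            index[w].append((frame_id, freq))
    return index

def candidate_frames(index, query_words):
    """Collect the database frames touched by the query's visual words."""
    hits = defaultdict(int)
    for w in set(query_words):
        for frame_id, freq in index.get(w, []):
            hits[frame_id] += freq
    return hits

At query time, only the postings of the visual words present in the query frame are visited, which is what makes the BOF voting of step 4 efficient on large databases.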
Step 4 is implemented according to the following sub-steps:
A query frame is represented by its local features y, and all key frames in the video database are represented by their local features x_j, j = 1, ..., n. The BOF-based voting retrieval proceeds as follows:
Step 4.1: for the local features y_l, l = 1, ..., m1 of the query frame and the local features x_{i,j}, i = 1, ..., m2 of all key frames in the video database, the similarity score s_j between two video frames, j = 1, ..., n, is computed as
s_j = Σ_{l=1..m1} Σ_{i=1..m2} f(x_{i,j}, y_l)   (8);
in formula (8), f is a matching function that reflects the degree of similarity between two features x_{i,j} and y_l.
Step 4.2: after step 4.1, the features are quantized according to the visual vocabulary, and the quantized features of the database videos are stored in an inverted file; this quantization q uses formula (3).
After quantization, q(x_{i,j}) is the index of the cluster center (visual word) nearest to feature x_{i,j}. Therefore, if two features x_{i,j} and y_l satisfy q(x_{i,j}) = q(y_l) after quantization, the probability that the two features are very close in the high-dimensional feature space is very high. Following this principle and taking the tf-idf weighting described above into account, the matching function f is defined as
f_tf-idf(x_{i,j}, y_l) = (w_{q(y_l)} · w_{q(x_{i,j})}) δ_{q(x_{i,j}), q(y_l)}   (9);
two different features can then be compared efficiently from their quantization results.
Step 4.3: after step 4.2, the image similarity score s_f finally used for ranking is obtained by post-processing s_j, as
s_f = s_j / (Σ_{l=1..m1} w_{q(y_l)}² · Σ_{i=1..m2} w_{q(x_{i,j})}²)   (10).
Formulas (9) and (10) of steps 4.2 and 4.3 show that the tf-idf weights of the visual words of the query video frame and of the database key frames are considered simultaneously and are both incorporated into the BOF-based voting retrieval method; this weighting scheme normalizes the visual-word histograms.
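The following sketch (illustrative, not from the patent) evaluates formulas (8)-(10) directly from the quantised words of a query frame and one database key frame; the dense pairwise evaluation shown here is for clarity, whereas in practice the score is accumulated through the inverted file of step 3.3.

import numpy as np

def voting_score(query_words, db_words, weights):
    """weights[w] is the tf-idf weight W_w of visual word w (formula 6)."""
    wq = weights[query_words]                  # w_{q(y_l)} for every query feature
    wd = weights[db_words]                     # w_{q(x_{i,j})} for every database feature
    s_j = 0.0
    for l, yw in enumerate(query_words):       # formula (8) with the matching f of formula (9)
        same = (db_words == yw)                # delta_{q(x_{i,j}), q(y_l)}
        s_j += (wq[l] * wd[same]).sum()
    norm = (wq ** 2).sum() * (wd ** 2).sum()   # normalisation of formula (10)
    return s_j / norm if norm > 0 else 0.0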
Step 5 is implemented as follows:
If two features x and y are quantized to the same cluster center and the Euclidean distance d(x, y) that reflects the two feature descriptors is very small, then the distance between them in the Euclidean space described by the R features is also very small. Based on this observation, a descriptor is represented by q(x) and b(x), where q is a quantizer and b is the R feature. The R feature is then embedded into the BOF retrieval model and the matching function f is redefined as
f_RE(x, y) = (tf-idf(q(x)))²  if q(x) = q(y) and d(b(x), b(y)) ≤ h_t;  0 otherwise   (11);
in formula (11), d denotes the Euclidean distance and h_t denotes a threshold.
During quantization, within-cluster distances take small values so that close videos match as far as possible, and h_t is correspondingly set to a small value, here h_t = 0.005, so that mismatched videos can be removed according to the distance between R features.
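The fused matching function of formula (11) can be written as the small helper below (illustrative only); the names and the way the R feature b(·) is attached to each compared feature are assumptions of this sketch, while the threshold value h_t = 0.005 follows the text.

import numpy as np

def fused_match(word_x, word_y, w_word, b_x, b_y, h_t=0.005):
    """Return (tf-idf(q(x)))^2 when the visual words agree and the R features are close, else 0."""
    if word_x != word_y:
        return 0.0
    d = np.linalg.norm(np.asarray(b_x) - np.asarray(b_y))   # Euclidean distance d(b(x), b(y))
    return float(w_word ** 2) if d <= h_t else 0.0

Replacing f_tf-idf of formula (9) by this function inside the voting of step 4 is what fuses the global geometric distribution information into the BOF voting retrieval model.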
The beneficial effects of the present invention are:
(1) The near-duplicate video retrieval method fusing global R features of the present invention proposes an improved Radon transform, which solves the problem that the original transform is not invariant to scale, rotation and translation and improves the robustness of the global feature.
(2) In the near-duplicate video retrieval method fusing global R features of the present invention, the global geometric distribution information can be fused into the BOF model according to an information fusion strategy, which adds a global property to the BOF model and thus improves the stability of the system.
(3) When the near-duplicate video retrieval method fusing global R features of the present invention is used, the precision of near-duplicate video retrieval can be greatly improved, and the method can be widely used in the field of video retrieval.
(4) The retrieval method fusing global R features of the present invention is also suitable for image retrieval and can greatly improve the precision of image retrieval.
Description of the drawings
Fig. 1 is a framework diagram of the near-duplicate video retrieval method fusing global R features of the present invention.
Specific embodiment
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The framework of the near-duplicate video retrieval method fusing global R features, shown in Fig. 1, can be divided into two main parts: an offline part and an online part. The offline part processes the target video library and produces the inverted index table required by the online part during querying; the online part performs the query of a query video against the target video library.
The offline part processes the reference video library: it performs key-frame extraction, SIFT feature extraction, R feature extraction and feature clustering analysis on the videos in the reference video library, quantizes the feature vectors to visual words, and generates the visual vocabulary and the feature inverted index table used by the online part for querying.
The online part performs the query of the query video against the reference video library: it performs key-frame extraction, SIFT feature extraction and R feature extraction on the query video, quantizes the features of all key frames of the online video to visual words according to the visual vocabulary generated from the reference video library, then applies the information fusion strategy to fuse the global geometric distribution information into the BOF model, searches the candidate videos, and produces the final retrieval result.
The near-duplicate video retrieval method fusing global R features of the present invention is implemented according to the following steps:
Step 1: extract local SIFT features from the videos in the database, as follows:
Key frames are first extracted from all the videos in the reference video library, and SIFT features are then extracted from each extracted key frame.
Key frames are extracted by uniform sampling, taking one frame every 6 seconds; SIFT features are extracted with the method of D.G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004.
The information extracted from each key frame includes the position, scale and orientation of each feature point together with its local descriptor.
Step 2: after step 1, build global R features from the coordinate information of the obtained local SIFT features, as follows:
Building global R features from the coordinate information of the local SIFT features obtained in step 1 means taking the position information of the extracted SIFT features and applying an improved Radon transform to extract the global R features.
The Radon transform of a function f is obtained by integrating f along straight lines of different directions in the plane; the resulting projections are the Radon transform of f. In this way every non-zero pixel of a discrete binary image can be projected into a Radon matrix.
For an image f(x, y), where x and y are pixel coordinates, the Radon transform of f(x, y) is expressed as
T_Rf(ρ, θ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − ρ) dx dy   (1);
in formula (1), δ(·) is the Dirac delta function, also called the unit impulse function, which is zero at every point except zero and whose integral over the whole domain equals 1; θ is the angle, θ ∈ [0, π); ρ is the polar radius, ρ ∈ (−∞, ∞).
The improved Radon transform, also called the R transform in the near-duplicate video retrieval method fusing global R features of the present invention, is formulated as
R_f(θ) = ∫_{−∞}^{∞} T_Rf²(ρ, θ) dρ   (2);
in formula (2), T_Rf(ρ, θ) is the Radon transform of f(x, y).
The improved Radon transform solves the problem that the original transform is not invariant to scale, rotation and translation.
To improve the robustness of the feature and reduce its dimensionality, the matrix obtained from the R transform is subjected to a principal component analysis transform with the (2D)²PCA algorithm, and the resulting low-dimensional matrix is taken as the final feature, called the R feature. (2D)²PCA uses the method of the document (Z. D. and Z. Z., "Letters: (2D)2PCA: Two-directional two-dimensional PCA for efficient face representation and recognition," Neurocomputing, vol. 69, no. 1, pp. 224-231, 2005), i.e. two-directional two-dimensional principal component analysis, performing principal component analysis in both the row and the column directions, which yields features with higher recognition accuracy.
Step 3: after step 2 is completed, build a BOF feature model using the descriptor information of the local SIFT features, according to the following sub-steps:
Step 3.1: the descriptors of the SIFT features in the image library are trained with a large-scale hierarchical clustering algorithm to generate clusters.
The large-scale hierarchical clustering algorithm is a clustering algorithm; see K. Liao, G. Liu, L. Xiao, and C. Liu, "A sample-based hierarchical adaptive K-means clustering method for large-scale video retrieval," Knowledge-Based Systems, 2013.
Step 3.2: after step 3.1, the features are quantized and the BOF feature of each image is generated, as follows:
Quantizing to generate the BOF feature of each image means deciding, for every feature point of an image, which cluster center is nearest and assigning the point to that nearest cluster center; this finally produces a frequency table, i.e. the preliminary unweighted BOF. The frequency table is then weighted with tf-idf to produce the final weighted BOF feature.
The features of the query video are quantized with the quantizer of formula (3):
q : R^d → [1, k]   (3);
in formula (3), q denotes quantization, R^d denotes d-dimensional data in real space, k is the number of cluster centers, and x_{i,j}, i = 1, ..., m2 is the i-th feature of the j-th frame in the reference video library.
The tf-idf weight of each frame is computed according to formulas (4)-(6):
tf_i = f_{ij} / Σ_{t=1..k} f_{tj}   (4);
idf_i = log(N / n_i)   (5);
W_i = tf_i · idf_i   (6);
in formulas (4)-(6), k is the number of cluster centers, f_{ij} is the frequency with which the visual word of the i-th feature occurs in the j-th video frame, n_i is the number of reference video frames that contain the visual word of the i-th feature, N is the total number of reference videos, tf_i is the term-frequency factor, and idf_i is the inverse document-frequency factor.
Step 3.3: an inverted index is built for the generated BOF features, as follows:
An inverted index generally consists of two parts, a vocabulary file and an inverted-list file.
The vocabulary file records all the visual words that occur in the document set (images or video frames).
The inverted-list file records, for each visual word, information such as the positions and frequencies of that word in the document files (images or video frames); this information for all words constitutes the inverted list. For one word w_i among the n words (features) w_1 ... w_n in the vocabulary file, the inverted list over the m document files (images or video frames) d_1 ... d_m can be expressed as
w_i → d_1[f_1]⟨p_{i1}, ..., p_{if_1}⟩ ... d_m[f_m]⟨p_{i1}, ..., p_{if_m}⟩   (7);
n such records constitute a complete inverted list.
In formula (7), f_i denotes the frequency, direction and scale information.
Formula (7) gives a complete inverted index structure for word queries.
Step 4: build a voting retrieval model based on BOF from the BOF feature model obtained in step 3, according to the following sub-steps:
A query frame is represented by its local features y, and all key frames in the video database are represented by their local features x_j, j = 1, ..., n. The BOF-based voting retrieval proceeds as follows:
Step 4.1: for the local features y_l, l = 1, ..., m1 of the query frame and the local features x_{i,j}, i = 1, ..., m2 of all key frames in the video database, the similarity score s_j between two video frames, j = 1, ..., n, is computed as
s_j = Σ_{l=1..m1} Σ_{i=1..m2} f(x_{i,j}, y_l)   (8);
in formula (8), f is a matching function that reflects the degree of similarity between two features x_{i,j} and y_l.
Step 4.2: after step 4.1, in order to improve efficiency the features are generally quantized according to the visual vocabulary, and the quantized features of the database videos are stored in an inverted file; this quantization q uses formula (3).
After quantization, q(x_{i,j}) is the index of the cluster center (visual word) nearest to feature x_{i,j}. Therefore, if two features x_{i,j} and y_l satisfy q(x_{i,j}) = q(y_l) after quantization, the probability that the two features are very close in the high-dimensional feature space is very high. Following this principle and taking the tf-idf weighting described above into account, the matching function f is defined as
f_tf-idf(x_{i,j}, y_l) = (w_{q(y_l)} · w_{q(x_{i,j})}) δ_{q(x_{i,j}), q(y_l)}   (9);
two different features can then be compared efficiently from their quantization results.
Step 4.3: after step 4.2, the image similarity score s_f finally used for ranking is obtained by post-processing s_j, as
s_f = s_j / (Σ_{l=1..m1} w_{q(y_l)}² · Σ_{i=1..m2} w_{q(x_{i,j})}²)   (10).
Formulas (9) and (10) of steps 4.2 and 4.3 show that the tf-idf weights of the visual words of the query video frame and of the database key frames are considered simultaneously and are both incorporated into the BOF-based voting retrieval method; this weighting scheme normalizes the visual-word histograms.
Step 5: apply an information fusion strategy to fuse the global geometric distribution information into the BOF-based voting retrieval model built in step 4, and accurately retrieve near-duplicate videos from large-scale data, as follows:
Applying the information fusion strategy to fuse the global geometric distribution information into the BOF-based voting retrieval model built in step 4 and accurately retrieving near-duplicate videos from large-scale data specifically means embedding the global R features built in step 2 into the BOF retrieval model built in step 3 and performing near-duplicate video retrieval on large-scale data, as follows:
If two features x and y are quantized to the same cluster center and the Euclidean distance d(x, y) that reflects the two feature descriptors is very small, then the distance between them in the Euclidean space described by the R features should also be very small. Based on this observation, a descriptor is represented by q(x) and b(x), where q is a quantizer and b is the R feature. The R feature is then embedded into the BOF retrieval model and the matching function f is redefined as
f_RE(x, y) = (tf-idf(q(x)))²  if q(x) = q(y) and d(b(x), b(y)) ≤ h_t;  0 otherwise   (11);
in formula (11), d denotes the Euclidean distance and h_t denotes a threshold.
In general, during quantization the within-cluster distances take small values so that close videos match as far as possible, and h_t is correspondingly set to a small value, here h_t = 0.005, so that mismatched videos can be removed according to the distance between R features.
In terms of execution, the near-duplicate video retrieval method fusing global R features of the present invention first extracts local SIFT features from the videos in the database; it then builds global R features from the coordinate information of the local SIFT features; it then builds a BOF retrieval model using the descriptor information of the local SIFT features; next, based on the BOF model, it builds a voting retrieval model; finally, it applies an information fusion strategy to fuse the global geometric distribution information into the BOF model and accurately retrieves near-duplicate videos from large-scale data. The near-duplicate video retrieval method fusing global R features of the present invention makes full use of both local texture information and global geometric distribution information, and proposes a feature fusion method that fuses the global geometric distribution information into the BOF model according to the information fusion strategy, achieving accurate retrieval of near-duplicate videos in large-scale data.

Claims (7)

1. A near-duplicate video retrieval method fusing global R features, characterized in that it is implemented according to the following steps:
Step 1: extract local SIFT features from the videos in the database;
Step 2: after step 1, build global R features from the coordinate information of the obtained local SIFT features;
Step 3: after step 2 is completed, build a BOF feature model using the descriptor information of the local SIFT features;
Step 4: build a voting retrieval model based on BOF from the BOF feature model obtained in step 3;
Step 5: apply an information fusion strategy to fuse the global geometric distribution information into the BOF-based voting retrieval model built in step 4, and accurately retrieve near-duplicate videos from large-scale data.
2. The near-duplicate video retrieval method fusing global R features according to claim 1, characterized in that step 1 is implemented as follows:
Key frames are first extracted from all the videos in the reference video library, and SIFT features are then extracted from each key frame.
3. The near-duplicate video retrieval method fusing global R features according to claim 2, characterized in that the key frames are extracted by uniform sampling, taking one frame every 6 seconds;
SIFT features are extracted from the key frames with the method of the document "Distinctive image features from scale-invariant keypoints", and the extracted information includes the position, scale and orientation of each feature point together with its local descriptor.
4. The near-duplicate video retrieval method fusing global R features according to claim 1, characterized in that step 2 is implemented as follows:
building global R features from the coordinate information of the local SIFT features obtained in step 1 means taking the position information of the extracted SIFT features and applying an improved Radon transform to extract the global R features;
the Radon transform of a function f is obtained by integrating f along straight lines of different directions in the plane; the resulting projections are the Radon transform of f; in this way every non-zero pixel of a discrete binary image can be projected into a Radon matrix;
for an image f(x, y), where x and y are pixel coordinates, the Radon transform of f(x, y) is expressed as
T_Rf(ρ, θ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − ρ) dx dy   (1);
in formula (1), δ(·) is the Dirac delta function, also called the unit impulse function, which is zero at every point except zero and whose integral over the whole domain equals 1; θ is the angle, θ ∈ [0, π); ρ is the polar radius, ρ ∈ (−∞, ∞);
the improved Radon transform, also called the R transform in the near-duplicate video retrieval method fusing global R features of the present invention, is formulated as
R_f(θ) = ∫_{−∞}^{∞} T_Rf²(ρ, θ) dρ   (2);
in formula (2), T_Rf(ρ, θ) is the Radon transform of f(x, y);
the improved Radon transform solves the problem that the original transform is not invariant to scale, rotation and translation;
the matrix obtained from the R transform is reduced by a principal component analysis transform with the (2D)²PCA algorithm, and the resulting low-dimensional matrix is taken as the final feature, called the R feature; (2D)²PCA uses the two-directional two-dimensional principal component analysis method applied to efficient face representation and recognition, performing principal component analysis in both the row and the column directions so that features with higher recognition accuracy can be obtained.
5. The near-duplicate video retrieval method fusing global R features according to claim 1, characterized in that step 3 is implemented according to the following sub-steps:
Step 3.1: the descriptors of the SIFT features in the image library are trained with a large-scale hierarchical clustering algorithm to generate clusters;
the large-scale hierarchical clustering algorithm is a clustering algorithm;
Step 3.2: after step 3.1, the features are quantized and the BOF feature of each image is generated, as follows:
quantizing to generate the BOF feature of each image means deciding, for every feature point of an image, which cluster center is nearest and assigning the point to that nearest cluster center; this finally produces a frequency table, i.e. the preliminary unweighted BOF; the frequency table is then weighted with tf-idf to produce the final weighted BOF feature;
the features of the query video are quantized with the quantizer
q : R^d → [1, k]   (3);
in formula (3), q denotes quantization, R^d denotes d-dimensional data in real space, k is the number of cluster centers, and x_{i,j}, i = 1, ..., m2 is the i-th feature of the j-th frame in the reference video library;
the tf-idf weight of each frame is computed as
tf_i = f_{ij} / Σ_{t=1..k} f_{tj}   (4);
idf_i = log(N / n_i)   (5);
W_i = tf_i · idf_i   (6);
in formulas (4)-(6), k is the number of cluster centers, f_{ij} is the frequency with which the visual word of the i-th feature occurs in the j-th video frame, n_i is the number of reference video frames that contain the visual word of the i-th feature, N is the total number of reference videos, tf_i is the term-frequency factor, and idf_i is the inverse document-frequency factor;
Step 3.3: an inverted index is built for the generated BOF features, as follows:
an inverted index generally consists of two parts, a vocabulary file and an inverted-list file;
the vocabulary file records all the visual words that occur in the document set;
the inverted-list file records, for each visual word, information such as the positions and frequencies of that word in the document files; this information for all words constitutes the inverted list; for one word w_i among the n words w_1 ... w_n in the vocabulary file, the inverted list over the m document files d_1 ... d_m can be expressed as
w_i → d_1[f_1]⟨p_{i1}, ..., p_{if_1}⟩ ... d_m[f_m]⟨p_{i1}, ..., p_{if_m}⟩   (7);
n such records constitute a complete inverted list;
in formula (7), f_i denotes the frequency, direction and scale information;
formula (7) gives a complete inverted index structure for word queries.
6. The near-duplicate video retrieval method fusing global R features according to claim 1, characterized in that step 4 is implemented according to the following sub-steps:
a query frame is represented by its local features y, and all key frames in the video database are represented by their local features x_j, j = 1, ..., n; the BOF-based voting retrieval proceeds as follows:
Step 4.1: for the local features y_l, l = 1, ..., m1 of the query frame and the local features x_{i,j}, i = 1, ..., m2 of all key frames in the video database, the similarity score s_j between two video frames, j = 1, ..., n, is computed as
s_j = Σ_{l=1..m1} Σ_{i=1..m2} f(x_{i,j}, y_l)   (8);
in formula (8), f is a matching function that reflects the degree of similarity between two features x_{i,j} and y_l;
Step 4.2: after step 4.1, the features are quantized according to the visual vocabulary, and the quantized features of the database videos are stored in an inverted file; this quantization q uses formula (3);
after quantization, q(x_{i,j}) is the index of the cluster center (visual word) nearest to feature x_{i,j}; therefore, if two features x_{i,j} and y_l satisfy q(x_{i,j}) = q(y_l) after quantization, the probability that the two features are very close in the high-dimensional feature space is very high; following this principle and taking the aforementioned tf-idf weighting into account, the matching function f is defined as
f_tf-idf(x_{i,j}, y_l) = (w_{q(y_l)} · w_{q(x_{i,j})}) δ_{q(x_{i,j}), q(y_l)}   (9);
two different features can then be compared efficiently from their quantization results;
Step 4.3: after step 4.2, the image similarity score s_f finally used for ranking is obtained by post-processing s_j, as
s_f = s_j / (Σ_{l=1..m1} w_{q(y_l)}² · Σ_{i=1..m2} w_{q(x_{i,j})}²)   (10);
formulas (9) and (10) of steps 4.2 and 4.3 show that the tf-idf weights of the visual words of the query video frame and of the database key frames are considered simultaneously and are both incorporated into the BOF-based voting retrieval method; this weighting scheme normalizes the visual-word histograms.
7. The near-duplicate video retrieval method fusing global R features according to claim 1, characterized in that step 5 is implemented as follows:
if two features x and y are quantized to the same cluster center and the Euclidean distance d(x, y) that reflects the two feature descriptors is very small, then the distance between them in the Euclidean space described by the R features is also very small; based on this observation, a descriptor is represented by q(x) and b(x), where q is a quantizer and b is the R feature; the R feature is then embedded into the BOF retrieval model and the matching function f is redefined as
f_RE(x, y) = (tf-idf(q(x)))²  if q(x) = q(y) and d(b(x), b(y)) ≤ h_t;  0 otherwise   (11);
in formula (11), d denotes the Euclidean distance and h_t denotes a threshold;
during quantization the within-cluster distances take small values so that close videos match as far as possible, and h_t is correspondingly set to a small value, here h_t = 0.005, so that mismatched videos can be removed according to the distance between R features.
CN201610820574.2A 2016-09-13 2016-09-13 Approximate repeated video retrieval method incorporating global R features Active CN106649440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610820574.2A CN106649440B (en) 2016-09-13 2016-09-13 Approximate repeated video retrieval method incorporating global R features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610820574.2A CN106649440B (en) 2016-09-13 2016-09-13 Approximate repeated video retrieval method incorporating global R features

Publications (2)

Publication Number Publication Date
CN106649440A true CN106649440A (en) 2017-05-10
CN106649440B CN106649440B (en) 2019-10-25

Family

ID=58852760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610820574.2A Active CN106649440B (en) 2016-09-13 2016-09-13 Approximate repeated video retrieval method incorporating global R features

Country Status (1)

Country Link
CN (1) CN106649440B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391594A (en) * 2017-06-29 2017-11-24 安徽睿极智能科技有限公司 A kind of image search method based on the sequence of iteration vision
CN108566562A (en) * 2018-05-02 2018-09-21 中广热点云科技有限公司 Copyright video information structuring arranges the method for completing sample approved sample
CN108647307A (en) * 2018-05-09 2018-10-12 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and storage medium
CN110895689A (en) * 2018-09-12 2020-03-20 苹果公司 Mixed mode lighting for face recognition authentication
CN113239159A (en) * 2021-04-26 2021-08-10 成都考拉悠然科技有限公司 Cross-modal retrieval method of videos and texts based on relational inference network
CN114298992A (en) * 2021-12-21 2022-04-08 北京百度网讯科技有限公司 Video frame duplication removing method and device, electronic equipment and storage medium
US11874869B2 (en) 2018-03-29 2024-01-16 Beijing Bytedance Network Technology Co., Ltd. Media retrieval method and apparatus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833650A (en) * 2009-03-13 2010-09-15 清华大学 Video copy detection method based on contents
CN102693299A (en) * 2012-05-17 2012-09-26 西安交通大学 System and method for parallel video copy detection
CN103631932A (en) * 2013-12-06 2014-03-12 中国科学院自动化研究所 Method for detecting repeated video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833650A (en) * 2009-03-13 2010-09-15 清华大学 Video copy detection method based on contents
CN102693299A (en) * 2012-05-17 2012-09-26 西安交通大学 System and method for parallel video copy detection
CN103631932A (en) * 2013-12-06 2014-03-12 中国科学院自动化研究所 Method for detecting repeated video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIYANG LIAO et al.: "IR Feature Embedded BOF Indexing Method for Near-Duplicate Video Retrieval", IEEE Transactions on Circuits and Systems for Video Technology *
XIANGMIN ZHOU et al.: "Structure Tensor Series-Based Large Scale Near-Duplicate Video Retrieval", IEEE Transactions on Multimedia *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391594A (en) * 2017-06-29 2017-11-24 安徽睿极智能科技有限公司 A kind of image search method based on the sequence of iteration vision
CN107391594B (en) * 2017-06-29 2020-07-10 安徽睿极智能科技有限公司 Image retrieval method based on iterative visual sorting
US11874869B2 (en) 2018-03-29 2024-01-16 Beijing Bytedance Network Technology Co., Ltd. Media retrieval method and apparatus
CN108566562A (en) * 2018-05-02 2018-09-21 中广热点云科技有限公司 Copyright video information structuring arranges the method for completing sample approved sample
CN108566562B (en) * 2018-05-02 2020-09-08 中广热点云科技有限公司 Method for finishing sample sealing by copyright video information structured arrangement
CN108647307A (en) * 2018-05-09 2018-10-12 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and storage medium
CN110895689A (en) * 2018-09-12 2020-03-20 苹果公司 Mixed mode lighting for face recognition authentication
CN110895689B (en) * 2018-09-12 2023-07-11 苹果公司 Mixed mode illumination for facial recognition authentication
CN113239159A (en) * 2021-04-26 2021-08-10 成都考拉悠然科技有限公司 Cross-modal retrieval method of videos and texts based on relational inference network
CN113239159B (en) * 2021-04-26 2023-06-20 成都考拉悠然科技有限公司 Cross-modal retrieval method for video and text based on relational inference network
CN114298992A (en) * 2021-12-21 2022-04-08 北京百度网讯科技有限公司 Video frame duplication removing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106649440B (en) 2019-10-25

Similar Documents

Publication Publication Date Title
Zhang et al. Self-training with progressive augmentation for unsupervised cross-domain person re-identification
Yu et al. Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition
Cheng et al. A deep semantic alignment network for the cross-modal image-text retrieval in remote sensing
CN106649440B (en) Approximate repeated video retrieval method incorporating global R features
CN110059198B (en) Discrete hash retrieval method of cross-modal data based on similarity maintenance
WO2022068196A1 (en) Cross-modal data processing method and device, storage medium, and electronic device
Lai et al. Instance-aware hashing for multi-label image retrieval
Deng et al. Discriminative dictionary learning with common label alignment for cross-modal retrieval
Zheng et al. A deep and autoregressive approach for topic modeling of multimodal data
CN106951551B (en) Multi-index image retrieval method combining GIST characteristics
Zheng et al. $\mathcal {L} _p $-Norm IDF for Scalable Image Retrieval
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
Leng et al. 3D object retrieval with multitopic model combining relevance feedback and LDA model
Wu et al. Learning of multimodal representations with random walks on the click graph
Song et al. Deep memory network for cross-modal retrieval
Zhang et al. Social image tagging using graph-based reinforcement on multi-type interrelated objects
Li et al. Exploiting hierarchical activations of neural network for image retrieval
Abdul-Rashid et al. Shrec’18 track: 2d image-based 3d scene retrieval
Nie et al. Convolutional deep learning for 3D object retrieval
Xu et al. Iterative manifold embedding layer learned by incomplete data for large-scale image retrieval
Han et al. VRFP: On-the-fly video retrieval using web images and fast fisher vector products
CN105843925A (en) Similar image searching method based on improvement of BOW algorithm
Wang et al. Beauty product image retrieval based on multi-feature fusion and feature aggregation
Zhang et al. Video copy detection based on deep CNN features and graph-based sequence matching
Zhang et al. Dataset-driven unsupervised object discovery for region-based instance image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant