Disclosure of Invention
The invention aims to provide a video retrieval method and a video retrieval system based on the fusion of multiple images, so as to improve the recall ratio of video retrieval.
To achieve this purpose, the invention adopts the following technical scheme. In a first aspect, the present invention provides a video retrieval method based on the fusion of multiple images, the method including:
decoding the database video and segmenting it into video shots to obtain a plurality of video shots;
extracting key frames of a single video shot, and extracting local features of the key frames;
clustering a subset of the local features, and taking the resulting set of cluster centers as the codebook of the local features of the database video;
quantizing and encoding all local features of the database video according to the codebook of the local features of the database video;
after quantization encoding, pooling the local feature sets of all key frames of a single video shot to obtain the quantized local feature pooling set of the single video shot;
establishing an inverted file index according to the codebook of the local features of the database video and the quantized local feature pooling set of the single video shot;
and performing online retrieval of the target video according to a plurality of query images of the target video to be retrieved and the inverted file index.
In a second aspect, the present invention provides a video retrieval system based on the fusion of multiple images, the system comprising a video processing module, a distributed storage module, and a retrieval module;
the video processing module comprises a processing unit, a first extraction unit, a first clustering unit, a first quantization coding unit and a first pooling unit;
the processing unit is connected with the database and is used for decoding the video in the database and segmenting it into video shots to obtain a plurality of video shots;
the first extraction unit is connected with the processing unit to extract key frames of a single video shot and extract local features of the key frames;
the first clustering unit is connected with the first extraction unit to cluster a subset of the local features, taking the resulting set of cluster centers as the codebook of the local features of the database video;
the first quantization coding unit is connected with the first clustering unit to quantize and encode all local features of the database video according to the codebook of the local features of the database video;
the first pooling unit is connected with the first quantization coding unit to pool, after quantization encoding, the local feature sets of all key frames of a single video shot, obtaining the quantized local feature pooling set of the single video shot;
the distributed storage module is connected with the video processing module to establish an inverted file index according to the codebook of the local features of the database video and the quantized local feature pooling set of the single video shot;
the retrieval module is connected with the distributed storage module to perform online retrieval of the target video according to a plurality of query images of the target video to be retrieved and the inverted file index.
Compared with the prior art, the invention has the following technical effects. First, the target video is retrieved using a plurality of query images of the same target video, so that different viewing angles can be taken into account; the target video is thereby described more accurately, and its recall ratio is improved. Second, the inverted file index is established offline, and the local features of all key frames of a single video shot are pooled, with the video shot of the database video as the unit, to obtain the quantized local feature pooling set of the single video shot; this greatly reduces the memory consumption and the number of records in the database and accelerates retrieval, reducing memory consumption to roughly a tenth or even a thousandth of that of the prior art.
Detailed Description
The present invention will be described in further detail with reference to fig. 2 to 6.
As shown in fig. 2, the present embodiment provides a video retrieval method based on fusion of a plurality of images, the method including the following steps S1 to S7:
S1, decoding the database video and segmenting it into video shots to obtain a plurality of video shots;
Specifically, "a plurality of video shots" here means that the video is divided into at least one video shot.
S2, extracting key frames of the single video shot, and extracting local features of the key frames;
specifically, at least one key frame is extracted from a single video shot, and feature extraction is performed on the key frame, where the feature extraction includes, but is not limited to, local feature extraction and global feature extraction, and in this embodiment, local feature extraction is performed on the key frame as a preferable scheme.
S3, clustering partial local features, and taking the obtained clustering center set as a codebook of the database video local features;
s4, carrying out quantization coding on all local characteristics of the database video according to the codebook of the local characteristics of the database video;
s5, after quantization coding, pooling local feature sets of all key frames of a single video shot to obtain a quantized local feature pooled set of the single video shot;
the pooling (pond) in this embodiment includes, but is not limited to: average pooling (average pooling), maximum pooling (max pooling), and the like.
It should be noted that the quantized local feature pooling set here is a result of pooling local features of all key frames of a single video shot, and is different from the concept of the local features of the key frames.
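As a purely illustrative sketch of this shot-level pooling (the histogram values, codebook size, and function name below are invented for the example and are not the patented implementation):

```python
def pool_shot_histograms(frame_histograms, mode="max"):
    """Pool the quantized histograms of all key frames of one shot
    into a single shot-level pooled set."""
    k = len(frame_histograms[0])          # codebook size
    pooled = [0] * k
    for hist in frame_histograms:
        for i, v in enumerate(hist):
            if mode == "max":
                pooled[i] = max(pooled[i], v)
            else:                         # average pooling
                pooled[i] += v / len(frame_histograms)
    return pooled

# three key frames of one shot, quantized against a 5-word codebook
frames = [[2, 0, 1, 0, 0],
          [1, 0, 3, 0, 0],
          [0, 0, 2, 0, 1]]
print(pool_shot_histograms(frames, "max"))    # [2, 0, 3, 0, 1]
```

Either pooling mode collapses the per-frame histograms into one vector per shot, which is what reduces the number of database records.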
S6, establishing an inverted file index according to the codebook of the local features of the database video and the quantized local feature pooling set of the single video shot;
it should be noted that, since the number of codebooks corresponds to the dimension of the statistical histogram in the search, the number of codebooks is large, for example, several tens of thousands to millions. In this way, in the quantized local feature pooling set, most of the code words are assigned with zero values, which makes the quantized local feature pooling set very sparse in distribution, and with this sparsity, the reverse file index can be established by using the inverse sorting in the text retrieval.
And S7, performing online retrieval of the target video according to a plurality of query images of the target video to be retrieved and the inverted file index.
In this embodiment, the plurality of query images refers to at least two query images.
Specifically, as shown in fig. 3, the step S7 includes the following steps S71 to S75:
S71, extracting local features of all query images of the target video to be retrieved;
S72, quantizing and encoding all local features of all query images according to the codebook of the local features of the database video;
S73, pooling all quantization-encoded local features of all query images to obtain the quantized local feature pooling set of all query images;
S74, according to the inverted file index, comparing the similarity between the quantized local feature pooling set of the target video to be retrieved and the quantized local feature pooling set of each single video shot in the database video;
And S75, ranking the retrieved video files according to the similarities obtained by the comparison, completing the online retrieval of the target video.
In this embodiment, when multiple images are used for a query, the local features of all query images are pooled into a single, precisely quantized local feature pooling set that describes the target video. This set serves as the new feature of all the query images, so the retrieval efficiency of the target video remains essentially the same as that of an existing single-image retrieval process.
Specifically, S3, "clustering a subset of the local features, and taking the resulting set of cluster centers as the codebook of the local features of the database video", comprises the following sub-steps:
sampling a subset of the local features, at fixed intervals or at random, from all local features extracted from the key frames of all video shots;
clustering the sampled local features with a preset unsupervised distance-based clustering method, and taking the resulting k representative features as the codebook;
It should be noted that the unsupervised distance-based method preset in this embodiment includes, but is not limited to, the k-means algorithm.
Accordingly, S4, "quantizing and encoding all local features of the database video according to the codebook of the local features of the database video", specifically includes:
according to the codebook of k features, vector-quantizing all local features of each video shot, one key frame at a time, to obtain the local feature statistical histogram of each key frame.
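An illustrative sketch of this vector quantization step, assuming a toy 2-D codebook of k = 2 words (all values and names below are invented for the example):

```python
def quantize_frame(features, codebook):
    """Assign each local feature of one key frame to its nearest
    codeword (squared L2), yielding the frame's statistical histogram."""
    hist = [0] * len(codebook)
    for f in features:
        j = min(range(len(codebook)),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(f, codebook[c])))
        hist[j] += 1
    return hist

codebook = [[0.0, 0.0], [10.0, 10.0]]          # k = 2 toy codewords
frame = [[0.2, 0.1], [9.8, 10.1], [0.0, 0.4]]  # 3 local features
print(quantize_frame(frame, codebook))         # [2, 1]
```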
Specifically, S6, "establishing an inverted file index according to the codebook of the local features of the database video and the quantized local feature pooling set of the single video shot", specifically includes the following sub-steps:
taking each codeword ID in the codebook of the local features of the database video, in turn, as the head of a linked list;
and scanning the videos in the database, appending the IDs of all video shots containing each codeword, together with related information, to the corresponding linked list, thereby obtaining the inverted file index.
It should be noted that the related information in this embodiment includes, but is not limited to, information such as word frequency, Hamming code, and feature distance.
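A minimal sketch of this index-building step, using a Python dict of posting lists in place of linked lists (the shot IDs, weights, and the use of word frequency as the only "related information" are assumptions made for the example):

```python
def build_inverted_index(shot_pools):
    """For each codeword ID, keep a posting list of (shot_id, weight)
    for every shot whose pooled set gives that codeword a non-zero
    value; zero entries are skipped, exploiting the sparsity of the
    pooled sets."""
    index = {}
    for shot_id, pooled in shot_pools.items():
        for word_id, weight in enumerate(pooled):
            if weight:
                index.setdefault(word_id, []).append((shot_id, weight))
    return index

# pooled sets of two shots over a 3-word codebook
pools = {"shot_1": [2, 0, 3], "shot_2": [0, 1, 1]}
index = build_inverted_index(pools)
print(index[2])   # [('shot_1', 3), ('shot_2', 1)]
```

Only non-zero codewords ever enter a posting list, which is why the sparse pooled sets keep the index compact.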
Specifically, the process of step S74, "comparing, according to the inverted file index, the similarity between the quantized local feature pooling set of the target video to be retrieved and the quantized local feature pooling set of a single video shot in the database video", is as follows: for each codeword in the quantized local feature pooling set of all query images, the linked list corresponding to that codeword in the inverted file index is scanned to obtain the similarity, on that codeword, between the query images and the database videos containing the codeword.
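A minimal sketch of this posting-list scan, assuming a dot-product similarity accumulated per codeword (the similarity measure, values, and names are illustrative assumptions, not the patented scoring):

```python
def score_shots(query_pooled, index):
    """Scan only the posting lists of codewords present in the query
    pooled set, accumulating a per-shot similarity, then rank shots."""
    scores = {}
    for word_id, q in enumerate(query_pooled):
        if not q:
            continue                       # sparse query: skip zeros
        for shot_id, w in index.get(word_id, []):
            scores[shot_id] = scores.get(shot_id, 0) + q * w
    return sorted(scores.items(), key=lambda kv: -kv[1])

# a toy inverted index over a 3-word codebook
index = {0: [("shot_1", 2)],
         1: [("shot_2", 1)],
         2: [("shot_1", 3), ("shot_2", 1)]}
print(score_shots([1, 0, 2], index))   # [('shot_1', 8), ('shot_2', 2)]
```

Because only the codewords present in the query are scanned, the cost grows with the sparsity of the query pooled set rather than with the codebook size.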
Specifically, in the method disclosed in this embodiment, after step S72, "quantizing and encoding all local features of all query images according to the codebook of the local features of the database video", the method further comprises:
cross-comparing the quantization-encoded local features of all query images, and determining the feature-matching overlap region of the query images as the target region to be searched;
Accordingly, step S73, "pooling all quantization-encoded local features of all query images to obtain the quantized local feature pooling set of all query images", specifically comprises:
pooling the local features of all query images within the target region to be searched to obtain the quantized local feature pooling set of the target video to be retrieved.
It should be noted that a common feature subset is discovered automatically from the correlation of features between images, and this subset is used to determine the spatial position of the target to be retrieved within the images. The whole process does not depend on any manual labeling, yet the region of the target to be retrieved is obtained, and querying with this target region yields a more accurate result than querying with the whole image.
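As a purely illustrative approximation of discovering the common feature subset, one could intersect the sets of codewords present in every query image (the histograms below are invented for the example; the disclosed method matches features by cross-comparison, which this set intersection only approximates):

```python
def common_codewords(image_histograms):
    """Codewords that occur in every query image; used here as a
    stand-in for the feature-matching overlap (target) region."""
    present = [{i for i, v in enumerate(h) if v} for h in image_histograms]
    return sorted(set.intersection(*present))

imgs = [[1, 0, 2, 1],    # quantized histograms of three query images
        [0, 1, 3, 2],
        [2, 0, 1, 1]]
print(common_codewords(imgs))   # [2, 3]
```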
Specifically, a process schematic diagram of a video retrieval method based on multiple image fusion in the present embodiment is shown in fig. 4.
As shown in fig. 5 and fig. 6, the present embodiment discloses a video retrieval system based on fusion of multiple images, which includes:
a video processing module 10, a distributed storage module 20 and a retrieval module 30;
the video processing module 10 comprises a processing unit 11, a first extracting unit 12, a first clustering unit 13, a first quantization coding unit 14 and a first pooling unit 15;
the processing unit 11 is connected with the database, and decodes the video in the database and segments it into video shots to obtain a plurality of video shots;
the first extraction unit 12 is connected with the processing unit 11 to extract key frames of a single video shot and extract local features of the key frames;
the first clustering unit 13 is connected with the first extraction unit 12 to cluster a subset of the local features, taking the resulting set of cluster centers as the codebook of the local features of the database video;
the first quantization coding unit 14 is connected with the first clustering unit 13 to quantize and encode all local features of the database video according to the codebook of the local features of the database video;
the first pooling unit 15 is connected with the first quantization coding unit 14 to pool, after quantization encoding, the local feature sets of all key frames of a single video shot, obtaining the quantized local feature pooling set of the single video shot;
the distributed storage module 20 is connected with the video processing module 10 to establish an inverted file index according to the codebook of the local features of the database video and the quantized local feature pooling set of the single video shot;
the retrieval module 30 is connected with the distributed storage module 20 to perform online retrieval of the target video according to a plurality of query images of the target video to be retrieved and the inverted file index.
It should be noted that the video processing module 10 in this embodiment is specifically a video processing server group, the distributed storage module 20 is specifically a disk array, and the retrieval module 30 is specifically a retrieval server group. See Table 1 for the specific hardware configuration parameters:
TABLE 1
It should be noted that the distributed storage module 20 herein supports dynamic insertion/deletion of video feature vectors, and supports fast random search.
Specifically, the retrieving module 30 specifically includes: a second extraction unit 31, a second quantization encoding unit 32, a second pooling unit 33, a comparison unit 34, and a retrieval unit 35;
the second extraction unit 31 performs local feature extraction on all query images of the target video to be retrieved;
the second quantization coding unit 32 is connected to the second extracting unit 31 to perform quantization coding on all local features of all query images according to the codebook of local features of the database video;
the second pooling unit 33 is connected with the second quantization encoding unit 32 to pool all the local features after quantization encoding of all the query images, so as to obtain a quantized local feature pooling set of the target video to be retrieved;
the comparison unit 34 is connected with the second pooling unit 33 and the distributed storage module 20 to compare, according to the inverted file index, the similarity between the quantized local feature pooling set of the target video to be retrieved and the quantized local feature pooling set of a single video shot in the database video;
the retrieval unit 35 is connected to the comparison unit 34 to sort the queried video files according to the similarity obtained by comparison, so as to complete online retrieval of the target video.
Specifically, the first clustering unit 13 is specifically configured to:
sampling a subset of the local features, at fixed intervals or at random, from all local features extracted from the key frames of all video shots;
clustering the sampled local features with a preset unsupervised distance-based clustering method, and taking the resulting k representative features as the codebook;
accordingly, the first quantization encoding unit 14 is specifically configured to:
according to the codebook of k features, vector-quantizing all local features of each video shot, one key frame at a time, to obtain the local feature statistical histogram of each key frame.
Specifically, the distributed storage module 20 specifically includes: a linked list establishing unit 21 and a reverse index establishing unit 22;
the linked list establishing unit 21 establishes a linked list by taking each codeword ID in the codebook of the local features of the database video, in turn, as the list head;
the inverted index establishing unit 22 is connected with the linked list establishing unit 21 to scan the videos in the database and append the IDs of all video shots containing each codeword, together with related information, to the corresponding linked list, thereby obtaining the inverted file index, where the related information includes the word frequency and the Hamming code.
In particular, the retrieval module 30 further comprises a matching unit 36;
the matching unit 36 is connected with the second quantization coding unit 32 to cross-compare all local features of all query images subjected to quantization coding, and determine feature matching overlapping regions of all query images as target regions to be searched;
correspondingly, the second pooling unit 33 is connected to the matching unit 36, and is specifically configured to:
and pooling local features of all query images in the target area to be searched to obtain a quantized local feature pooling set of the target video to be retrieved.
It should be noted that the specific working process and the key point of the video retrieval system based on the fusion of multiple images are the same as those of the video retrieval method based on the fusion of multiple images, and are not described herein again.
It should be noted that the video retrieval method and system based on the fusion of multiple images disclosed by the invention have the following technical effects:
(1) When a plurality of query images of the target are used, different viewing angles can be taken into account in describing the target object, making the description more accurate and greatly improving the recall ratio of the retrieval system. Meanwhile, by pooling features during the multi-image query, the target to be searched is described by a single feature vector, just as in a single-image query, so the search efficiency remains essentially unchanged.
(2) In the offline processing of the database videos, feature pooling is performed with the video shot, rather than the key frame, as the unit, and only the pooled quantized feature vector is kept. This greatly reduces the memory consumption and the number of records in the database and greatly improves retrieval efficiency, reducing memory consumption to roughly a tenth or even a thousandth of that of the prior art while maintaining comparable or even higher search precision.
(3) For the multiple input query images, a common feature subset is discovered automatically through the correlation of features among the query images, and this subset determines the spatial region of the target to be searched within the images. The target region is thus obtained without any manual labeling, and querying with this region yields a more accurate result than querying with the whole image.