CN110674334B - Near-repetitive image retrieval method based on consistency region deep learning features - Google Patents

Near-repetitive image retrieval method based on consistency region deep learning features

Info

Publication number
CN110674334B
CN110674334B
Authority
CN
China
Prior art keywords
image
sift
feature
region
pairs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910869635.8A
Other languages
Chinese (zh)
Other versions
CN110674334A (en)
Inventor
周志立
孙文迪
周煜
孙星明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Maidian Media Technology Co.,Ltd.
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN201910869635.8A priority Critical patent/CN110674334B/en
Publication of CN110674334A publication Critical patent/CN110674334A/en
Application granted granted Critical
Publication of CN110674334B publication Critical patent/CN110674334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a near-duplicate image retrieval method based on deep learning features of consistent regions, which specifically comprises the following steps: extracting the SIFT features of all images in an image library, quantizing the SIFT features into visual words, and building an inverted index file for all SIFT features; retaining k target regions for each image and computing the CNN feature C(R_C) of each target region; extracting the SIFT features of the query image and quantizing them into visual words; finding candidate images with the inverted index file; locating, in the query image, a near-duplicate region that approximately repeats each target region of each candidate image; extracting the CNN feature C(R_Q) of each near-duplicate region; computing the cosine similarity between each C(R_C) and its corresponding C(R_Q) as the similarity score of that region pair; and, for each candidate image, selecting the highest cosine-similarity score as the similarity score between the candidate image and the query image. The invention greatly improves the accuracy of image retrieval while also improving retrieval efficiency.

Description

Near-repetitive image retrieval method based on consistency region deep learning features
Technical Field
The invention belongs to the field of information security, and particularly relates to a near-duplicate image retrieval method based on deep learning features of consistent regions.
Background
Due to the widespread use of powerful image processing tools and the rapid development of Internet technology, digital image data is increasingly being illegally copied, tampered with and transmitted over networks. In fact, these illegal images are near-duplicate images that share a small copied region and have undergone various image modifications such as rescaling, occlusion, noise addition, and brightness and color changes. To prevent unauthorized use of image content and violations of privacy, detecting illegal partially copied versions of copyrighted images has become an urgent problem. Therefore, as a branch of content-based image retrieval, near-duplicate image retrieval plays a very important role in the field of copyright and privacy protection. It is also applied in other emerging fields, such as information hiding, image annotation and near-duplicate image redundancy removal.
In recent years, deep learning features have been successfully used in content-based image retrieval tasks, and they provide superior performance compared with traditional hand-crafted features. According to the way features are extracted, existing CNN-feature-based image retrieval methods fall mainly into two categories: image-based CNN features and region-based CNN features. Generally, image-based CNN features directly take the activation values of a convolutional layer or a fully-connected layer as the CNN feature. The most representative approach is to feed the image into a pre-trained or fine-tuned convolutional neural network and extract CNN features from its fully-connected layers (Krizhevsky A, Sutskever I, and Hinton G E, ImageNet classification with deep convolutional neural networks [C], Advances in Neural Information Processing Systems, 2012: 1097-1105.). However, CNN features extracted from fully-connected layers tend to lack spatial location information, so their discriminative power is limited. To improve the discriminative power of CNN features, researchers began extracting CNN features from convolutional layers instead of fully-connected layers, mainly because convolutional-layer features consist of the activation values of the convolutional filters and contain rich local spatial information (Babenko A and Lempitsky V, Aggregating deep convolutional features for image retrieval [J], Computer Science, 2015; Kalantidis Y, Mellina C, and Osindero S, Cross-dimensional weighting for aggregated deep convolutional features [C], European Conference on Computer Vision Workshops, 2016: 685-.). Since image-based CNN features mainly describe the visual pattern or semantic meaning of the entire image, such methods are intuitively not suitable for retrieving near-duplicate images that share only small partial regions. Unlike image-based CNN features, region-based methods generally extract CNN features from image regions, using the region as the basic unit. Notably, most such methods obtain image regions either by simply dividing the image into a series of image blocks or by directly using existing region detection methods, such as Selective Search (Uijlings J R R, van de Sande K E A, et al., Selective search for object recognition [J], International Journal of Computer Vision, 2013, 104(2): 154-171.), EdgeBox (Zitnick C L and Dollár P, Edge boxes: locating object proposals from edges [C], European Conference on Computer Vision, 2014: 391-405.) and the region proposal network (RPN) (Salvador A, Giró-i-Nieto X, Marqués F, and Satoh S, Faster R-CNN features for instance search [C], IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016.). Although these algorithms can meet the requirement of generating image regions to some extent, if the same region detection method is applied to both the candidate image and the query image, inconsistent region pairs are detected between near-duplicate images once the images undergo a series of image attacks, which seriously affects the accuracy of image retrieval.
Although research on near-duplicate image retrieval has advanced greatly, existing near-duplicate image retrieval methods mainly have the following technical problems:
1) Most existing near-duplicate image retrieval methods are based on feature extraction and matching of the whole image, and are not suitable for retrieving near-duplicate images that share only a small copied region.
2) Existing near-duplicate image retrieval methods use the same region detection method for the candidate image and the query image, so when the images undergo a series of image attacks, the region pairs detected between near-duplicate images are inconsistent.
3) Existing near-duplicate image retrieval methods generally take the activation values extracted from a convolutional layer or a fully-connected layer directly as the CNN feature; the excessively high dimensionality reduces the efficiency of feature extraction and matching.
4) Existing near-duplicate image retrieval methods generally perform region detection and feature extraction directly on all images in the image library; the irrelevant images among them consume considerable processing time, which reduces image retrieval efficiency.
Disclosure of Invention
The purpose of the invention is as follows: to solve the problems that existing retrieval techniques are not suitable for near-duplicate images that share only a small copied region and that retrieval efficiency is low, the invention provides a near-duplicate image retrieval method based on deep learning features of consistent regions.
The technical scheme is as follows: the invention provides a near-duplicate image retrieval method based on deep learning features of consistent regions; the method specifically comprises the following steps:
step 1: extracting SIFT characteristics of all images in an image library;
step 2: quantizing each SIFT feature into a visual word by using a K-means clustering method, and considering any two SIFT features which come from different images and have the same visual word as mutually matched; establishing an inverted index file for all SIFT features based on the visual words;
step 3: compute the target regions of each image using the EdgeBox algorithm and delete the target regions whose area is smaller than M/5 × N/5, where M and N are the width and height of the image, respectively; among the remaining target regions, keep k target regions and delete the others; compute the CNN feature C(R_C) of each kept target region using the improved CNN feature extraction method;
And 4, step 4: extracting SIFT characteristics of the query image; the SIFT features of the query image are quantized into visual words by using a K-means clustering method; finding out candidate images by utilizing the inverted index file; the candidate image is an SIFT feature pair with more than 5 pairs between the candidate image and the query image in the image library; the pair of SIFT feature pairs consists of two mutually matched SIFT features;
step 5: according to the SIFT feature pairs existing between the query image and each target region of each candidate image, find in the query image the near-duplicate region that approximately repeats each target region; the near-duplicate region and the target region form a near-duplicate region pair;
step 6: for any near-duplicate region pair, extract the CNN feature C(R_Q) of its near-duplicate region using the improved CNN feature extraction method; take the cosine similarity between C(R_Q) and C(R_C) of that pair as the similarity score of the pair; in each candidate image, select the highest cosine-similarity score as the similarity score between the candidate image and the query image.
Further, in step 2 or step 4, quantizing each SIFT feature into a visual word specifically comprises: performing K-means clustering on all extracted SIFT features, thereby dividing all SIFT features into E categories, each category being represented by one visual word.
Furthermore, in step 3, the target regions with an area greater than or equal to M/5 × N/5 are ranked in descending order of the number of SIFT features they contain, and the first k target regions are selected.
Further, the specific method of step 5 is as follows:
step 5.1: finding out n pairs of SIFT feature pairs between the query image and a certain target region in a certain candidate image by using the inverted index file;
step 5.2: randomly select n_s SIFT feature pairs from the n pairs, where
p_T = Y / n
P(n_s) ≈ 1 - (1 - Y/n)^(n_s)
where Y is the number of truly matched SIFT feature pairs among the n SIFT feature pairs (Y ≤ n); a truly matched SIFT feature pair consists of two SIFT features that come from different images and describe consistent image content; P(n_s) is the probability that at least one truly matched SIFT feature pair is contained among the n_s selected feature pairs;
step 5.3: for any one of the n_s feature pairs, f_Q = [σ_Q, θ_Q, (x_Q, y_Q)^T] and f_C = [σ_C, θ_C, (x_C, y_C)^T], where f_Q is the SIFT feature in the query image and σ_Q, θ_Q, (x_Q, y_Q) are its scale, dominant orientation and coordinates, and f_C is the SIFT feature in the target region and σ_C, θ_C, (x_C, y_C) are its scale, dominant orientation and coordinates, determine a near-duplicate region using the following formula, so that n_s pairs of near-duplicate regions exist between the query image and the target region:
(u_Q, v_Q)^T = (x_Q, y_Q)^T + s · R(Δθ) · [(u_C, v_C)^T - (x_C, y_C)^T],   w_Q = s · w_C,   h_Q = s · h_C
where (u_Q, v_Q)^T, w_Q and h_Q are the center coordinates, width and height of the near-duplicate region R_Q in the query image, and (u_C, v_C)^T, w_C and h_C are the center coordinates, width and height of the target region R_C;
s = σ_Q / σ_C,   Δθ = θ_Q - θ_C,   R(Δθ) = [cos Δθ, -sin Δθ; sin Δθ, cos Δθ]
further, the method for extracting the CNN feature in step 3 or step 6 specifically includes: taking any one target region/near-repetitive region as an input image of the AlexNet model, and outputting 256 feature maps with the size of W multiplied by H by the model to obtain a feature vector with the dimension of W multiplied by H multiplied by 256; w and H are the width and height of the feature map respectively and are in direct proportion to the width and height of the input image; compressing the size W × H of each feature map to m × m using a summing pooling aggregation operation; merging and summing pooling aggregation operations for every 256/d feature maps with the size of m × m, thereby obtaining feature vectors with dimensions of m × m × d, wherein d is more than 0 and less than 256, and d is a multiple of 256; finally, the generated m × m × d-dimensional feature vector is normalized by L2, and the normalized m × m × d-dimensional feature vector is used as the CNN feature of the input image.
Further, in step 6, the method for calculating the cosine similarity includes:
sim(R_Q, R_C) = (C(R_Q) · C(R_C)) / (||C(R_Q)|| × ||C(R_C)||)
has the advantages that:
(1) The method adopts SIFT feature matching based on the BOW model and filters out irrelevant images according to the SIFT matching results, greatly reducing the number of candidate images, so that near-duplicate image retrieval can be performed more quickly.
(2) Because SIFT features are robust to common attacks, the invention uses the properties of SIFT features to detect visually consistent region pairs, so that visually consistent region pairs are still detected between near-duplicate images under common image attacks.
(3) The invention adopts a two-stage sum-pooling strategy, generating compact CNN features while fully encoding the spatial layout of the region.
(4) The computed CNN features have strong discriminative power and capture the semantic characteristics of the image, improving the accuracy of image retrieval.
Drawings
FIG. 1 is a general framework schematic of the present invention;
FIG. 2 is a schematic diagram of the structure of the inverted index in the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.
As shown in fig. 1, the present embodiment provides a near-duplicate image retrieval method based on deep learning features of consistent regions. In the offline stage, SIFT features are extracted from all images in the image library, each SIFT feature is quantized into a visual word using the K-means clustering method, and the visual words are stored in a constructed inverted index file. In the online stage, the same feature extraction and quantization methods are applied to the input query image, the similarity between the quantized SIFT features and the features in the index file is calculated, the similarity results are ranked, and the images related to the query image, i.e., the candidate images, are output in order. The above process performs image retrieval with the Bag-of-Visual-Words (BOW) model. In addition, to reduce the computational complexity of detecting target regions in images and extracting features, the method uses the existing EdgeBox region detection algorithm to extract target regions from all images in the image library in the offline stage, and extracts CNN features from all target regions of the candidate images. To ensure that visually consistent region pairs are detected between near-duplicate images, in the online stage the properties of SIFT features are fully exploited to locate, in the query image, the near-duplicate regions consistent with the target regions, forming near-duplicate region pairs, and compact CNN features are extracted from these near-duplicate region pairs, thereby improving the accuracy and efficiency of near-duplicate image retrieval. The specific steps are as follows:
step 1: and extracting 128-dimensional SIFT features from all the images in the image library.
Step 2: BOW quantization is carried out on the extracted SIFT features: K-means clustering is performed on all extracted SIFT features, dividing them into E categories, each category represented by one visual word; SIFT features quantized to the same visual word are classified into the same category. The set of all visual word labels constitutes the visual dictionary. Thus, each image can be described by several visual words.
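For illustration only, the quantization in step 2 might be sketched as follows. This is a minimal example, not the patented implementation; it assumes OpenCV's SIFT detector and scikit-learn's K-means, and the vocabulary size E is an arbitrary illustrative value.

```python
# Minimal sketch of SIFT extraction and BOW quantization (illustrative only).
import cv2
import numpy as np
from sklearn.cluster import KMeans

E = 10000  # vocabulary size (number of visual words); illustrative value

def extract_sift(image_path):
    """Return the keypoints and 128-D SIFT descriptors of one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return keypoints, descriptors

def train_vocabulary(all_descriptors, n_words=E):
    """Cluster all SIFT descriptors of the image library into n_words visual words."""
    kmeans = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    kmeans.fit(np.vstack(all_descriptors))
    return kmeans

def quantize(descriptors, kmeans):
    """Map each SIFT descriptor to the ID of its nearest cluster centre (visual word)."""
    return kmeans.predict(descriptors)
```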
Step 3: To improve the efficiency of image retrieval, an inverted index is built for all SIFT features. Each indexed feature records not only the ID of the image it belongs to, but also its orientation, scale, coordinates and other related information. This information is further used to generate potential near-duplicate region pairs. The inverted index is shown in fig. 2.
Step 4: Using the inverted index structure, SIFT features from any two different images that are quantized to the same visual word are considered matched, and the similarity between images is measured by counting the number of SIFT feature pairs shared between two images. When an image in the image library shares 5 or more SIFT feature pairs with the input query image, it is taken as a candidate image. In this way a large number of irrelevant images are filtered out, reducing the time complexity of region detection and feature extraction.
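As a rough sketch of how the inverted index of step 3 and the candidate filter of step 4 could be organized (the data layout and function names are illustrative; only the threshold of 5 shared pairs comes from the description above):

```python
# Sketch of an inverted index over visual words and of candidate filtering (illustrative).
from collections import defaultdict, Counter

def build_inverted_index(library_features):
    """library_features: dict image_id -> list of (word_id, scale, orientation, x, y).
    Each posting stores the image ID plus the scale, orientation and coordinates
    of the indexed SIFT feature, as needed for later region localization."""
    index = defaultdict(list)
    for image_id, feats in library_features.items():
        for word_id, scale, orientation, x, y in feats:
            index[word_id].append((image_id, scale, orientation, x, y))
    return index

def find_candidates(query_words, index, min_pairs=5):
    """Count SIFT feature pairs (same visual word, different images) shared with the
    query; images sharing at least min_pairs pairs are kept as candidate images."""
    votes = Counter()
    for word_id in query_words:
        for image_id, *_ in index.get(word_id, []):
            votes[image_id] += 1
    return [image_id for image_id, n in votes.items() if n >= min_pairs]
```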
Step 5: The EdgeBox algorithm achieves high recall by computing an informative edge map, so it detects from the image the meaningful target regions that are most likely to be copied and propagated between near-duplicate images. Furthermore, its edge computation is efficient and the computed edge map is sparse, giving low computational complexity. Most importantly, the algorithm detects target regions directly from the edge information of the image, without a learning process based on a deep network, so it is highly flexible. The specific steps are as follows:
step 5-1: a set of target regions is detected for each candidate image using the EdgeBox algorithm.
Step 5-2: in order to avoid that small regions negatively affect the image retrieval, the present embodiment will delete regions with an area smaller than M/5 × N/5, where M and N are the width and height of the image, respectively.
Step 5-3: theoretically, for a detected target region, the number of SIFT features may reflect the texture complexity thereof to some extent, because the number of SIFT features extracted from a region with good texture is much larger than the SIFT features extracted from a flat region. Therefore, in order to save computing resources, all target regions detected in the candidate image are sorted in a descending order according to the SIFT feature quantity contained in each region, and the first k target regions (detected regions) are reserved; other target areas are deleted.
Step 6: according to the SIFT feature pair between the query image and any one target region in any one candidate image, finding out a near-repetition region approximately repeated with the target region in the query image; and forming a group of near-repeated region pairs by the near-repeated region and the target region. The details are as follows:
step 6-1: the method comprises the steps of utilizing an inverted index file to find n pairs of SIFT feature pairs existing between a query image and a target region in a candidate image, wherein the number n of the SIFT feature pairs can be as high as hundreds, and if all SIFT feature pairs are directly matched to position corresponding potential near-repetition region pairs in the query image, although many correct near-repetition region pairs can be positioned, the calculation consumption is very large. In practice, the accuracy of near-duplicate image detection can be ensured only by ensuring that the positioned near-duplicate region pair at least comprises a pair of truly matched SIFT feature pairs; the true matched SIFT feature pairThe image processing method comprises the following steps of (1) forming two SIFT features which come from different graphs and are consistent in description of image content; therefore, to reduce the amount of computation, assume that the probability of a true match is pT
p_T = Y / n
where Y is the number of truly matched feature pairs among the n feature pairs. When n_s SIFT feature pairs are randomly selected, the probability that they contain at least one truly matched SIFT feature pair is approximated as:
P(n_s) ≈ 1 - (1 - p_T)^(n_s)
therefore, pick nSThe near-repetitive region pairs are positioned for the SIFT feature matching pairs, so that at least one pair of SIFT feature matching pairs can be guaranteed to be real matching, and at least one pair of correct near-repetitive region pairs can be positioned.
Step 6-2: the detection of the SIFT features is based on the content of the image, so the scale, principal direction and coordinates of the feature points of the local features are changed together with scaling, rotation and translation transformations, respectively. Therefore, the parameters of the transformation can be estimated from the property variation between two matching local features.
Assume the two matched SIFT features f_Q and f_C are [σ_Q, θ_Q, (x_Q, y_Q)^T] and [σ_C, θ_C, (x_C, y_C)^T], respectively, where f_Q is the SIFT feature in the query image and σ_Q, θ_Q, (x_Q, y_Q) are its scale, dominant orientation and coordinates, and f_C is the SIFT feature in the target region and σ_C, θ_C, (x_C, y_C) are its scale, dominant orientation and coordinates. A near-duplicate region (located region) is determined using the following formula, so that n_s pairs of near-duplicate regions exist between the query image and the target region:
(u_Q, v_Q)^T = (x_Q, y_Q)^T + s · R(Δθ) · [(u_C, v_C)^T - (x_C, y_C)^T],   w_Q = s · w_C,   h_Q = s · h_C
where (u_Q, v_Q)^T, w_Q and h_Q are the center coordinates, width and height of the near-duplicate region R_Q in the query image, and (u_C, v_C)^T, w_C and h_C are the center coordinates, width and height of the target region R_C;
s = σ_Q / σ_C,   Δθ = θ_Q - θ_C,   R(Δθ) = [cos Δθ, -sin Δθ; sin Δθ, cos Δθ]
intuitively, if the two features are a true match, then RCAnd RQIt is likely that the correct near-duplicate region pair.
Step 7: After the potential near-duplicate region pairs are detected, compact CNN features are extracted for these near-duplicate region pairs through the following steps:
step 7-1: when any target region/near-repetitive region is used as an input image of the AlexNet model, the model outputs 256 feature maps with the size of W × H, and a feature vector with the dimension of W × H × 256 can be obtained.
Step 7-2: entering the first sum-posing stage, for an input area of any size, applying spatial sum-posing of size m × m to activation of the area to obtain a feature map of dimension m × m × 256.
And 7-3: entering a second sum-firing stage, compressing the features by summarizing the activation values of the m × m × 256 dimensional feature map and concatenating the results to generate a feature vector of m × m × d dimensions. Where 256 is a multiple of d. Finally, the generated m × m × d-dimensional feature vector is normalized by L2, and the normalized m × m × d-dimensional feature is regarded as a CNN feature.
And 8: in the online retrieval stage, the CNN characteristics between the near-repeat region and the target region of the candidate image are compared to measure the similarity between the two images so as to achieve the purpose of retrieving the near-repeat image version. For a given repeating region pair RQ(near repeat region) and RC(target region) and their corresponding CNN features are C (R) respectivelyQ) And C (R)C) It calculates the cosine similarity:
sim(R_Q, R_C) = (C(R_Q) · C(R_C)) / (||C(R_Q)|| × ||C(R_C)||)
and step 9: and selecting the scores of a group of near-repeated region pairs with the highest cosine similarity score between the query image and a candidate image as the similarity score between the query image and the candidate image.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. The invention is not described in detail in order to avoid unnecessary repetition.

Claims (5)

1. The near-repetitive image retrieval method based on the consistency region deep learning features is characterized by comprising the following steps:
step 1: extracting SIFT characteristics of all images in an image library;
step 2: quantizing each SIFT feature into a visual word by using a K-means clustering method, and considering any two SIFT features which come from different images and have the same visual word as mutually matched; establishing an inverted index file for all SIFT features based on the visual words;
step 3: compute the target regions of each image using the EdgeBox algorithm and delete the target regions whose area is smaller than M/5 × N/5, where M and N are the width and height of the image, respectively; among the remaining target regions, keep k target regions and delete the others; compute the CNN feature C(R_C) of each kept target region using the improved CNN feature extraction method;
And 4, step 4: extracting SIFT characteristics of the query image; the SIFT features of the query image are quantized into visual words by using a K-means clustering method; finding out candidate images by utilizing the inverted index file; the candidate image is an SIFT feature pair with more than 5 pairs between the candidate image and the query image in the image library; the pair of SIFT feature pairs consists of two mutually matched SIFT features;
step 5: according to the SIFT feature pairs existing between the query image and each target region of each candidate image, find in the query image the near-duplicate region that approximately repeats each target region; the near-duplicate region and the target region form a near-duplicate region pair;
step 6: for any near-duplicate region pair, extract the CNN feature C(R_Q) of its near-duplicate region using the improved CNN feature extraction method; take the cosine similarity between C(R_Q) and C(R_C) of that pair as the similarity score of the pair; in each candidate image, select the highest cosine-similarity score as the similarity score between the candidate image and the query image;
wherein the CNN feature extraction method in step 3 or step 6 specifically comprises: taking any target region or near-duplicate region as the input image of the AlexNet model, the model outputs 256 feature maps of size W × H, giving a feature vector of dimension W × H × 256, where W and H are the width and height of the feature maps and are proportional to the width and height of the input image; the size W × H of each feature map is compressed to m × m using a sum-pooling aggregation operation; every 256/d feature maps of size m × m are then merged by a further sum-pooling aggregation operation, yielding a feature vector of dimension m × m × d, where 0 < d < 256 and 256 is a multiple of d; finally, the generated m × m × d-dimensional feature vector is L2-normalized, and the normalized m × m × d-dimensional feature vector is used as the CNN feature of the input image.
2. The method according to claim 1, wherein quantizing each SIFT feature into a visual word in step 2 or step 4 specifically comprises: performing K-means clustering on all extracted SIFT features, thereby dividing all SIFT features into E categories, each category being represented by one visual word.
3. The method according to claim 1, wherein in step 3, the target regions having an area greater than or equal to M/5 × N/5 are ranked in descending order of the number of SIFT features they contain, and the first k target regions are selected.
4. The method according to claim 1, wherein the specific method of the step 5 is as follows:
step 5.1: finding out n pairs of SIFT feature pairs between the query image and a certain target region in a certain candidate image by using the inverted index file;
step 5.2: randomly select n_s SIFT feature pairs from the n pairs, where
p_T = Y / n
P(n_s) ≈ 1 - (1 - Y/n)^(n_s)
where Y is the number of truly matched SIFT feature pairs among the n SIFT feature pairs (Y ≤ n); a truly matched SIFT feature pair consists of two SIFT features that come from different images and describe consistent image content; P(n_s) is the probability that at least one truly matched SIFT feature pair is contained among the n_s selected feature pairs;
step 5.3: for any one of the n_s feature pairs, f_Q = [σ_Q, θ_Q, (x_Q, y_Q)^T] and f_C = [σ_C, θ_C, (x_C, y_C)^T], where f_Q is the SIFT feature in the query image and σ_Q, θ_Q, (x_Q, y_Q) are its scale, dominant orientation and coordinates, and f_C is the SIFT feature in the target region and σ_C, θ_C, (x_C, y_C) are its scale, dominant orientation and coordinates, determine a near-duplicate region using the following formula, so that n_s pairs of near-duplicate regions exist between the query image and the target region:
(u_Q, v_Q)^T = (x_Q, y_Q)^T + s · R(Δθ) · [(u_C, v_C)^T - (x_C, y_C)^T],   w_Q = s · w_C,   h_Q = s · h_C
where (u_Q, v_Q)^T, w_Q and h_Q are the center coordinates, width and height of the near-duplicate region R_Q in the query image, and (u_C, v_C)^T, w_C and h_C are the center coordinates, width and height of the target region R_C;
s = σ_Q / σ_C,   Δθ = θ_Q - θ_C,   R(Δθ) = [cos Δθ, -sin Δθ; sin Δθ, cos Δθ]
5. the method of claim 1, wherein in step 6, the cosine similarity is calculated by:
sim(R_Q, R_C) = (C(R_Q) · C(R_C)) / (||C(R_Q)|| × ||C(R_C)||)
CN201910869635.8A 2019-09-16 2019-09-16 Near-repetitive image retrieval method based on consistency region deep learning features Active CN110674334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910869635.8A CN110674334B (en) 2019-09-16 2019-09-16 Near-repetitive image retrieval method based on consistency region deep learning features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910869635.8A CN110674334B (en) 2019-09-16 2019-09-16 Near-repetitive image retrieval method based on consistency region deep learning features

Publications (2)

Publication Number Publication Date
CN110674334A CN110674334A (en) 2020-01-10
CN110674334B true CN110674334B (en) 2020-08-11

Family

ID=69078297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910869635.8A Active CN110674334B (en) 2019-09-16 2019-09-16 Near-repetitive image retrieval method based on consistency region deep learning features

Country Status (1)

Country Link
CN (1) CN110674334B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859004A (en) * 2020-07-29 2020-10-30 书行科技(北京)有限公司 Retrieval image acquisition method, device, equipment and readable storage medium
CN113688261B (en) * 2021-08-25 2023-10-13 山东极视角科技股份有限公司 Image data cleaning method and device, electronic equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108226889A (en) * 2018-01-19 2018-06-29 中国人民解放军陆军装甲兵学院 A kind of sorter model training method of radar target recognition
CN108765338A (en) * 2018-05-28 2018-11-06 西华大学 Spatial target images restored method based on convolution own coding convolutional neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912611B (en) * 2016-04-05 2019-04-26 中国科学技术大学 A kind of fast image retrieval method based on CNN
US10303979B2 (en) * 2016-11-16 2019-05-28 Phenomic Ai Inc. System and method for classifying and segmenting microscopy images with deep multiple instance learning
GB201713977D0 (en) * 2017-08-31 2017-10-18 Calipsa Ltd Anomaly detection
CN107908646B (en) * 2017-10-10 2019-12-17 西安电子科技大学 Image retrieval method based on hierarchical convolutional neural network
CN109977286B (en) * 2019-03-21 2022-10-28 中国科学技术大学 Information retrieval method based on content

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108226889A (en) * 2018-01-19 2018-06-29 中国人民解放军陆军装甲兵学院 A kind of sorter model training method of radar target recognition
CN108765338A (en) * 2018-05-28 2018-11-06 西华大学 Spatial target images restored method based on convolution own coding convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Encoding multiple contextual clues for partial-duplicate image retrieval; Zhou Zhili; Pattern Recognition Letters; 2018-07-15; Vol. 109; pp. 18-26 *

Also Published As

Publication number Publication date
CN110674334A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN107679250B (en) Multi-task layered image retrieval method based on deep self-coding convolutional neural network
CN106126581B (en) Cartographical sketching image search method based on deep learning
CN106682233B (en) Hash image retrieval method based on deep learning and local feature fusion
Pun et al. A two-stage localization for copy-move forgery detection
Tarawneh et al. Detailed investigation of deep features with sparse representation and dimensionality reduction in cbir: A comparative study
Sotoodeh et al. A novel adaptive LBP-based descriptor for color image retrieval
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
Xia et al. Exploiting deep features for remote sensing image retrieval: A systematic investigation
Alsmadi et al. Fish recognition based on robust features extraction from color texture measurements using back-propagation classifier
Kadam et al. [Retracted] Efficient Approach towards Detection and Identification of Copy Move and Image Splicing Forgeries Using Mask R‐CNN with MobileNet V1
CN104036012A (en) Dictionary learning method, visual word bag characteristic extracting method and retrieval system
Zeng et al. Curvature bag of words model for shape recognition
Sugamya et al. A CBIR classification using support vector machines
Wang et al. S 3 D: Scalable pedestrian detection via score scale surface discrimination
CN110674334B (en) Near-repetitive image retrieval method based on consistency region deep learning features
Chen et al. Instance retrieval using region of interest based CNN features
Lin et al. Scene recognition using multiple representation network
Tang et al. Geometrically robust video hashing based on ST-PCT for video copy detection
Dong et al. Multilayer convolutional feature aggregation algorithm for image retrieval
Unar et al. New strategy for CBIR by combining low‐level visual features with a colour descriptor
Chen et al. Image retrieval based on quadtree classified vector quantization
CN110110120B (en) Image retrieval method and device based on deep learning
CN107526772A (en) Image indexing system based on SURF BIT algorithms under Spark platforms
Song et al. Hierarchical deep hashing for image retrieval
Yousaf et al. Patch-CNN: deep learning for logo detection and brand recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210621

Address after: 518052 1111, building 2, aerospace building, 53 Gaoxin South 9th Road, high tech Zone community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Maidian Media Technology Co.,Ltd.

Address before: No.219, ningliu Road, Jiangbei new district, Nanjing, Jiangsu Province, 210032

Patentee before: NANJING University OF INFORMATION SCIENCE & TECHNOLOGY

TR01 Transfer of patent right