CN106599917A - Similar image duplicate detection method based on sparse representation - Google Patents
- Publication number
- CN106599917A CN106599917A CN201611130891.8A CN201611130891A CN106599917A CN 106599917 A CN106599917 A CN 106599917A CN 201611130891 A CN201611130891 A CN 201611130891A CN 106599917 A CN106599917 A CN 106599917A
- Authority
- CN
- China
- Prior art keywords
- image
- natural number
- sparse
- images
- detection method
- Prior art date
- Legal status (as listed; not a legal conclusion)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/513—Sparse representations
Abstract
The invention discloses a similar image duplicate detection method based on sparse representation. The method is built on the Hadoop distributed computing framework and comprises the following steps: first obtain the IDF-weighted sparse codes $g'$ of all images in an image set $I$; extract the non-zero elements of each code and hash the sparse code $g_i'$ of image $I_i$ into the buckets corresponding to the subscripts of its non-zero elements; in each Reduce function, compute the similarity $Y$ of the sparse codes of every image pair $\langle I_w, I_z \rangle$ and, if $Y > 0.7$, output $\langle I_w, I_z \rangle$ as a similar image pair; finally, merge the similar image pairs that share an image $I_w$ to generate similar image subsets. Through this parallel computation, the method greatly improves the efficiency of KMeans clustering on large-scale data sets, and the sparse representation theory it introduces admits a fast implementation that avoids costly optimization solving.
Description
Technical Field
The invention belongs to the field of near-duplicate image detection, and relates to a parallelized near-duplicate image detection method based on sparse representation, which can efficiently and accurately extract near-duplicate image sets from a massive image collection.
Background
With the development of the mobile internet and digital cameras, people increasingly share the multimedia data they shoot on the internet, and because photographers often stand at similar positions and shoot the same objects from similar angles, large numbers of near-duplicate pictures appear online. Extracting these similar image sets enables de-duplication and filtering of image retrieval results, and is also an important step in many image processing tasks such as image clustering, image identification and image classification.
Generally, a near-duplicate image is obtained from some original image through some near-duplicate transformation; the transformations that can produce near-duplicates typically include translation, scaling, rotation, change of image tone, addition of text, format change, resolution change, and the like. Near-duplicate image detection means finding the given query images in a data set, or extracting all near-duplicate image subsets in the data set.

At present, most near-duplicate image detection systems are built with Bag-of-words and LSH methods. The Bag-of-words model maps the local features of each image into a visual word-frequency histogram vector using a trained dictionary. A Bag-of-words image representation method generally comprises 3 parts: 1) extract the local features of the images; 2) construct a visual dictionary by clustering the local features of the image set; 3) map the local feature vectors of each image into a word-frequency histogram. LSH (Locality-Sensitive Hashing) is a randomized method for indexing high-dimensional data: it performs approximately linear-time search in a high-dimensional data space at the cost of some search accuracy, returning approximate nearest neighbors of the query data. Its basic idea is to map input data points into buckets through a group of hash functions, ensuring that nearby data points fall into the same bucket with high probability while distant data points fall into the same bucket with low probability, so that the other data points in the query point's bucket can be regarded as its neighbors.

However, the detection results are often unsatisfactory, because the BOW model's definition of local features is too strict and LSH trades accuracy for efficiency. In addition, existing near-duplicate image detection systems generally run on a single node, and with the explosive growth of data volume, single-node systems are far from meeting current application requirements. Multi-node parallel computing is therefore the inevitable choice, and HADOOP is the most stable and efficient among the many distributed frameworks. A sketch of the Bag-of-words mapping step follows.
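As an illustration of part 3) above, the following minimal sketch (an assumption-laden illustration, not code from the patent) maps an image's local descriptors to a normalized visual word-frequency histogram by nearest-centroid assignment:

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """descriptors: (n, 128) SIFT vectors; dictionary: (k, 128) visual words."""
    # squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                       # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / max(hist.sum(), 1.0)              # normalized word-frequency histogram
```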
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide a system and method for extracting distributed, sparse-representation-based near-duplicate image sets from large-scale image collections. The method improves the efficiency of processing large-scale image sets and, compared with traditional methods, effectively improves the accuracy of the detection results.
In order to realize the task, the invention adopts the following technical scheme:
the detection method is proposed on the Hadoop distributed computing framework and comprises obtaining the IDF-weighted sparse codes $g'$ of all images in an image set $I$, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_w, \ldots, I_z, \ldots, I_R)$, the code of image $I_i$ is $g_i' \in g'$, $i$ is a natural number $\geq 1$, $w$ is a natural number $> i$, $z$ is a natural number $> w$, and $R$ is a natural number $> z$; the method further comprises:

(1) extracting the non-zero elements of the IDF-weighted sparse code $g_i'$ of image $I_i$: with $g_{ik}' \in g_i'$ and $k$ a natural number $\geq 1$, the non-zero elements of $g_i'$ are $(g_{iu}', \ldots, g_{iv}')$, $m$ in total, where $m$ is a natural number with $1 \leq m \leq k$, $g_{iu}' \neq 0$, $g_{iv}' \neq 0$, $u \geq 1$, $v \geq 1$, and $k > v > u$;

(2) establishing $k$ groups, one per code dimension, each initialized as an empty matrix;

(3) using the transformation matrix of Equation (1), hashing the IDF-weighted sparse code $g_i'$ of image $I_i$ into the $m$ groups corresponding to the subscripts $(u, \ldots, v)$ of its non-zero elements;

(4) in each of the $m$ groups obtained in step (3), computing the similarity $Y$ of the IDF-weighted sparse codes of every image pair $\langle I_i, I_j \rangle$; if $Y > 0.7$, the images $\langle I_i, I_j \rangle$ form a similar image pair; where $j$ is a natural number $\geq 1$, $i \neq j$, and $g_i'$ and $g_j'$ denote the IDF-weighted sparse codes of images $I_i$ and $I_j$, respectively;

(5) merging the similar image pairs in the result of step (4) that share an image, generating the similar image subsets (a sketch of steps (1)-(5) follows).
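The formula for the similarity $Y$ is lost from the source; since the description later speaks of computing pairwise Jaccard indexes within each hash bucket, the sketch below assumes the Jaccard index over the sets of non-zero code dimensions. The bucket construction, pair scoring and pair merging are illustrative, not the patent's exact implementation:

```python
from collections import defaultdict
from itertools import combinations

def detect_similar(codes, threshold=0.7):
    """codes: {image_id: {dim: weight}} sparse codes keyed by non-zero dimensions."""
    buckets = defaultdict(set)
    for img, code in codes.items():
        for dim in code:                          # one bucket per non-zero dimension
            buckets[dim].add(img)
    pairs = set()
    for members in buckets.values():              # plays the role of one Reduce call
        for a, b in combinations(sorted(members), 2):
            sa, sb = set(codes[a]), set(codes[b])
            if len(sa & sb) / len(sa | sb) > threshold:   # assumed Jaccard index
                pairs.add((a, b))
    parent = {}                                   # union-find merges pairs sharing an image
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]         # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for img in parent:
        groups[find(img)].add(img)
    return [g for g in groups.values() if len(g) > 1]     # similar image subsets
```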
Further, obtaining the IDF-weighted sparse codes $g'$ of all images in the image set $I$ comprises the following steps:
extracting the local features of each image in parallel to obtain the local features $S$ of all images in the image set $I$;
extracting the image cluster centers to obtain a feature dictionary $E$;
calculating the weight of each cluster center in $E$;
extracting the IDF-weighted sparse code $g'$ of each image according to the weights of the cluster centers in $E$.
Further, the parallelized extraction of local image features extracts the SIFT features of all images of the image set $I$.
Further, the extracting of the SIFT features of all the images specifically comprises the following steps:
normalize the size of each image $I_i$ in the image set $I$ and convert it to grayscale, obtaining a standard-size grayscale image set, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_w, \ldots, I_z, \ldots, I_R)$;

distribute the standard-size grayscale image set across the cluster nodes and extract the SIFT features of each image in parallel; the SIFT features of all images are denoted $S$, where $S_a \in S$ is a vector and $a$ is a natural number $\geq 1$, and the SIFT features of image $I_i$ are denoted $F_i$, where $F_{ib} \in F_i$ is a vector and $b$ is a natural number $\geq 1$. A per-image sketch follows.
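As a hedged illustration of this per-image step (assuming OpenCV 4.4 or later, where `cv2.SIFT_create` is available; the patent itself only names the public SIFT algorithm):

```python
import cv2

def extract_sift(path: str, size: int = 256):
    """Normalize size, convert to grayscale, and extract (n, 128) SIFT descriptors."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)    # grayscale, as in the patent
    if img is None:
        return None
    img = cv2.resize(img, (size, size))             # size normalization (256x256 in Example 2)
    _keypoints, descriptors = cv2.SIFT_create().detectAndCompute(img, None)
    return descriptors                              # F_i, or None if no keypoints were found
```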
Further, the extraction of the image cluster centers specifically comprises the following steps:
(11) randomly select $k$ SIFT features from $S$ as the initial $k$ cluster centers, forming the initial cluster-center set $A_0$ with $A_{0k} \in A_0$; compute the Euclidean distance from each $S_a$ to the cluster centers $A_d$, where $A_{dk} \in A_d$, $d$ is a non-negative integer $\geq 0$, $k$ is a natural number $\geq 1$, and $A_{dk}$ is a vector; take $S_a$ as the value and the cluster center nearest to $S_a$ in Euclidean distance as the key;

(21) gather the $S_a$ that share the same key, set $d = d + 1$, and take each group's mean as a new cluster center in $A_d$;

(31) compute the mean Euclidean distance between the new cluster centers $A_d$ and the centers $A_{d-1}$ of the previous iteration; if it is greater than 0.05, jump back to step (11); if it is less than 0.05, output the new cluster centers $A_d$ as the feature dictionary $E$ of the image set $I$, where $E_k \in E$, $k$ is a natural number $\geq 1$, and each $E_k$ is a vector, i.e., an image cluster center. A single-machine sketch of this loop follows.
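The following is a minimal single-machine sketch of the (11)-(31) loop; in the patent the assignment phase runs in Map functions and the averaging phase in Reduce functions, which this sequential version only imitates:

```python
import numpy as np

def kmeans(S: np.ndarray, k: int, tol: float = 0.05, seed: int = 0) -> np.ndarray:
    """S: (N, 128) SIFT features. Returns the (k, 128) feature dictionary E."""
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), size=k, replace=False)]   # initial set A_0
    while True:
        # "map" phase: key = index of the nearest cluster center
        d2 = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        keys = d2.argmin(axis=1)
        # "reduce" phase: the mean of each key group becomes the new center
        new = np.stack([S[keys == j].mean(axis=0) if (keys == j).any() else centers[j]
                        for j in range(k)])
        shift = np.linalg.norm(new - centers, axis=1).mean()
        centers = new
        if shift < tol:              # 0.05 in the claims, 0.01 in Example 2
            return centers
```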
Further, calculating the weight of each cluster center in $E$ specifically comprises: compute the weight of each cluster center in $E$ from $D$ and $D_{E_k}$, where $D$ is the total number of SIFT features in $S$, $D$ a natural number $\geq 1$, and $D_{E_k}$ is the total number of SIFT features assigned to center $E_k$ (see the sketch below).
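The weighting formula itself does not survive the source's formatting; the standard inverse-visual-word-frequency form $w_k = \log(D / D_{E_k})$ is assumed in this sketch:

```python
import numpy as np

def idf_weights(assignments: np.ndarray, k: int) -> np.ndarray:
    """assignments: index of the nearest cluster center for each SIFT feature in S."""
    D = len(assignments)                           # total number of SIFT features
    Dk = np.bincount(assignments, minlength=k)     # features assigned to each center E_k
    return np.log(D / np.maximum(Dk, 1))           # assumed IDF form; guard empty centers
```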
Further, extracting the IDF-weighted image sparse codes specifically comprises the following steps:

compute, for each feature vector $F_{ib}$ of image $I_i$, the Euclidean distances $h = (h_1, h_2, \ldots)$ to the feature dictionary $E$, where $E_k \in E$, $h_k \in h$, and $k$ is a natural number $\geq 1$; select the $m$ words $E_k$ with the smallest $h_k$ to form the local feature dictionary $E' = (E_f, \ldots, E_g)$, which contains $m$ vectors, with $f \geq 1$, $g \geq 1$, and $g > f$;

compute the squared-difference matrix $C_{ib} = (E' - \mathbf{1}F_{ib}^T)(E' - \mathbf{1}F_{ib}^T)^T$ of $F_{ib}$ with respect to the local dictionary $E'$;

compute the sparse code $c_{ib}$ of $F_{ib}$ over the feature dictionary, where $h_m' \in h' = (h_1', h_2', \ldots)$ are the Euclidean distances from the feature vector $F_{ib}$ to the visual words of $E'$, and $\mathrm{diag}(h')$ denotes the matrix whose main diagonal holds the elements of the vector $h'$;

extract the $L$ largest values of $c_{ib}$ over the image to obtain the image sparse code $g_i$ of image $I_i$, where $g_{ik} \in g_i$;

apply inverse-visual-word-frequency weighting and normalization to the image sparse code $g_i$ to obtain the IDF-weighted sparse code $g'$ (see the sketch below).
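The closed form for $c_{ib}$ is lost from the source, but the structure above (a covariance $C_{ib}$ plus a $\mathrm{diag}(h')$ locality term) matches the locality-constrained linear coding (LLC) of the cited Wang et al. reference, so this sketch follows LLC: solve $(C_{ib} + \lambda\,\mathrm{diag}(h'))\,c = \mathbf{1}$, normalize, pool over the image's features, keep the $L$ largest entries, and apply the IDF weights. The max pooling and the regularizer $\lambda$ are assumptions:

```python
import numpy as np

def llc_feature_code(f: np.ndarray, E: np.ndarray, m: int = 5, lam: float = 1e-4) -> np.ndarray:
    """Code one 128-d SIFT vector f over the (k, 128) dictionary E."""
    h = np.linalg.norm(E - f, axis=1)            # distances to every visual word
    nn = np.argsort(h)[:m]                       # the m nearest words form E'
    Z = E[nn] - f                                # rows of (E' - 1 f^T)
    C = Z @ Z.T + lam * np.diag(h[nn])           # squared-difference matrix + locality term
    c = np.linalg.solve(C, np.ones(m))
    c /= c.sum()                                 # reconstruction weights sum to 1
    out = np.zeros(len(E))
    out[nn] = c
    return out

def image_code(F: np.ndarray, E: np.ndarray, idf: np.ndarray, L: int = 10) -> np.ndarray:
    """Pool the per-feature codes of image features F (n, 128) into the code g_i'."""
    pooled = np.abs(np.stack([llc_feature_code(f, E) for f in F])).max(axis=0)
    g = np.where(pooled >= np.sort(pooled)[-L], pooled, 0.0)   # keep the L largest values
    g = g * idf                                  # IDF weighting
    n = np.linalg.norm(g)
    return g / n if n else g                     # normalization
```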
The beneficial effects of the invention are as follows:

(1) the method runs the KMeans algorithm in MapReduce parallel mode to learn overcomplete dictionary features online, extracting as much of the global feature structure as possible, so that the atom combinations with the strongest representational capability can be found in the overcomplete dictionary to represent the image features. Online dictionary learning iterates one MapReduce job at a time and terminates against a set threshold; this parallel computation greatly improves the efficiency of KMeans clustering on large-scale data sets;

(2) the invention introduces sparse representation theory and reconstructs each feature by looking up its nearest neighbors among the K code words of the dictionary. Compared with the vector quantization of the original BOW, reconstructing with several code words yields a smaller reconstruction error, and reconstructing from selected local neighbors achieves locally smooth sparsity; compared with traditional sparse representation methods, this scheme admits a faster implementation without costly optimization solving;

(3) the method computes the global inverse visual word frequency (IDF) and weights each image's sparse representation vector, making the representation more discriminative: the weights of less representative code words are reduced and those of more representative code words are increased, strengthening sparsity, so that the highly representative features of similar images are the same or similar with higher probability;

(4) the method hashes the highly representative features of the sparse representation into candidate similar sets and matches by computing the pairwise Jaccard similarity within each hash bucket, so similar image pairs are obtained simply and efficiently; meanwhile, the method fully exploits the characteristics of sparse representation and the MapReduce model for parallel computation, greatly improving matching efficiency and accuracy on large-scale image data sets;

(5) a distributed near-duplicate image detection system is provided based on HADOOP's distributed storage system HDFS and its MapReduce parallel computing model. Existing near-duplicate image detection systems generally support only single-node frameworks, but with the development of the mobile internet, image data grows exponentially, and conventional systems cannot handle the storage and computation of such data volumes at all. The invention is not only highly scalable for massive data but also greatly improves the computational efficiency of near-duplicate detection.
Drawings
FIG. 1 is a schematic structural diagram of an approximate repetitive image detection method based on sparse representation according to the present invention;
FIG. 2 is a flow chart of the parallel KMeans algorithm based on MapReduce implemented in the invention;
FIG. 3 is a flowchart of the locality-first-search image sparse coding algorithm implemented in the invention;
FIG. 4 is a flow chart of a parallel similarity set detection algorithm based on MapReduce and significant dimension features according to the present invention;
FIG. 5 shows the clustering results obtained from different methods for 5 keywords in example 2;
fig. 6 shows the clustering result of example 2.
Detailed Description
The following provides embodiments of the present invention. It should be noted that the invention is not limited to these embodiments; all equivalent variations based on the technical solutions of the invention fall within its protection scope.
Example 1
The method is proposed on the Hadoop distributed computing framework. As shown in FIG. 1, the sparse-representation-based near-duplicate image detection method comprises a feature extraction module, a feature dictionary construction module, an image sparse coding representation module, a similar image matching and filtering module, and a near-duplicate image pair merging module. The feature extraction module extracts all the original local image features of the image set in parallel; the feature dictionary construction module builds the original feature dictionary from the extracted features using a parallel K-Means online dictionary learning algorithm; the image sparse coding representation module uses the constructed dictionary and the sparse coding method to map the original local features of each image into one sparse vector representing that image; the similarity matching and filtering module computes the similarity of the candidate image pairs in parallel and outputs the pairs whose similarity exceeds a threshold; and the near-duplicate image pair merging module merges all the similar image pairs output by the matching and filtering module into near-duplicate image sets. The sketch below strings the earlier helpers into this five-module pipeline.
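As an assumption-laden, single-machine illustration of the module chain (the function names reuse the sketches above; the k, L and threshold values mirror Example 2):

```python
import numpy as np

def near_duplicate_pipeline(image_paths, k=512, L=10, threshold=0.7):
    feats = {p: extract_sift(p) for p in image_paths}          # feature extraction module
    feats = {p: F for p, F in feats.items() if F is not None}
    allf = np.vstack(list(feats.values()))
    E = kmeans(allf, k)                                        # dictionary construction module
    assign = np.array([np.linalg.norm(E - f, axis=1).argmin() for f in allf])
    idf = idf_weights(assign, k)                               # inverse visual word frequency
    codes = {}
    for p, F in feats.items():                                 # sparse coding module
        g = image_code(F, E, idf, L)
        codes[p] = {int(d): float(g[d]) for d in np.flatnonzero(g)}
    return detect_similar(codes, threshold)                    # matching + merging modules
```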
The detection method comprises the following steps:
Step 1: normalize the size of each image $I_i$ in the image set $I = (I_1, I_2, \ldots, I_i, \ldots, I_w, \ldots, I_z, \ldots, I_R)$, e.g. to a standard 128×128, 256×256 or 512×512 size; the images selected in the invention are uniformly normalized to 256×256. Then perform grayscale processing to obtain a standard-size grayscale image set, serialize the images with the custom ImageWritable to output a binary data stream per image, compress and store all images in key-value form <key: image ID, value: ImageWritable> in a custom ImageBundle, and finally upload it to the distributed storage framework HDFS;
Step 2: download the ImageBundle file from HDFS; Hadoop splits it among the Map functions of different nodes according to the number of nodes in the cluster. The Map function obtains the images in key-value form and extracts several SIFT feature vectors for each image in the image set using the public SIFT algorithm, forming each image's SIFT feature-vector sequence. The SIFT features of all images are denoted $S$, where $S_a \in S$ is a vector and $a$ is a natural number $\geq 1$; the SIFT features of image $I_i$ are $F_i$, where $F_{ib} \in F_i$ is a vector and $b$ is a natural number $\geq 1$. Each image is stored on HDFS in key-value form <key: image ID, value: $F_i$>;
Step 3: randomly select $k$ SIFT features from $S$ as the initial $k$ cluster centers, forming the initial cluster-center set $A_0$ with $A_{0k} \in A_0$; compute the Euclidean distance from each $S_a$ to the cluster centers $A_d$, where $A_{dk} \in A_d$, $d$ is a non-negative integer $\geq 0$, $k$ is a natural number $\geq 1$, and $A_{dk}$ is a vector; take $S_a$ as the value and the cluster center $A_q$ nearest to $S_a$ as the key, and emit the key-value pair <key: $A_q$, value: $S_a$>.

Step 4: gather the $S_a$ that share the same key, set $d = d + 1$, and take each group's mean as a new cluster center in $A_d$.

Step 5: compute the mean Euclidean distance between the new cluster centers $A_d$ and the previous iteration's centers $A_{d-1}$; if it is greater than 0.05, jump back to step 3; if it is less than 0.05, output the new cluster centers $A_d$ as the feature dictionary $E$ of the image set $I$, where $E_k \in E$, $k$ is a natural number $\geq 1$, and each $E_k$ is a vector, i.e., an image cluster center.
Step 6: compute the weight of each cluster center in $E$, where $D$ is the total number of SIFT features in $S$, $D$ a natural number $\geq 1$; if $S_a$ has the minimum Euclidean distance to $E_k$, then $S_a$ is assigned to center $E_k$, and $D_{E_k}$ denotes the total number of SIFT features assigned to $E_k$.
Step 8: compute, for each feature vector $F_{ib}$ of image $I_i$, the Euclidean distances $h = (h_1, h_2, \ldots)$ to the feature dictionary $E$, where $E_k \in E$, $h_k \in h$, and $k$ is a natural number $\geq 1$; select the $m$ words $E_k$ with the smallest $h_k$ to form the local feature dictionary $E' = (E_f, \ldots, E_g)$, which contains $m$ vectors, with $f \geq 1$, $g \geq 1$, and $g > f$; then compute the squared-difference matrix $C_{ib} = (E' - \mathbf{1}F_{ib}^T)(E' - \mathbf{1}F_{ib}^T)^T$ of $F_{ib}$ with respect to $E'$.

Step 9: compute the sparse code $c_{ib}$ of $F_{ib}$, where $h_m' \in h' = (h_1', h_2', \ldots)$ are the Euclidean distances from $F_{ib}$ to the visual words of $E'$, and $\mathrm{diag}(h')$ denotes the matrix whose main diagonal holds the elements of $h'$; extract the $L$ largest values of $c_{ib}$ over the image to obtain the image sparse code $g_i$ of image $I_i$, with $g_{ik} \in g_i$; then apply inverse-visual-word-frequency weighting and normalization to $g_i$ to obtain the IDF-weighted sparse code $g_i' \in g'$.
Step 10: extract the non-zero elements of the IDF-weighted sparse code $g_i'$ of image $I_i$: with $g_{ik}' \in g_i'$ and $k$ a natural number $\geq 1$, the non-zero elements of $g_i'$ are $(g_{iu}', \ldots, g_{iv}')$, $m$ in total, where $1 \leq m \leq k$, $g_{iu}' \neq 0$, $g_{iv}' \neq 0$, $u \geq 1$, $v \geq 1$, and $k > v > u$.

Step 11: establish $k$ groups, one per code dimension, each initialized as an empty matrix.

Step 12: using the transformation matrix of Equation (1), hash the IDF-weighted sparse code $g_i'$ of image $I_i$ into the $m$ groups corresponding to the subscripts $(u, \ldots, v)$ of its non-zero elements.

Step 13: in each of the $m$ groups obtained in step 12, compute the similarity $Y$ of the IDF-weighted sparse codes of every image pair $\langle I_i, I_j \rangle$; if $Y > 0.7$, the images $\langle I_i, I_j \rangle$ form a similar image pair; where $j$ is a natural number $\geq 1$, $i \neq j$, and $g_i'$ and $g_j'$ denote the IDF-weighted sparse codes of images $I_i$ and $I_j$, respectively.

Step 14: merge the similar image pairs in the result of step 13 that share an image, generating the similar image subsets.
Example 2
This example uses the Baidu image search engine to retrieve pictures of 30 globally known sights or buildings (e.g., the Eiffel Tower, the White House, the Wild Goose Pagoda, etc.), and manually picks 100 clear and accurate pictures of each class from the retrieved results, forming 3,000 near-duplicate pictures. At the same time, 17,000 pictures randomly selected from the public data set Flickr-100M (source: http://webscope.sandbox.yahoo.com) serve as distractors, forming with the near-duplicate pictures an experimental image data set of 20,000 pictures in total.
The Hadoop 2.4.0 version is selected as the experimental platform, and the Hadoop cluster in this embodiment consists of 10 node computers. Since Hadoop does not natively support reading and processing image data, we define two classes based on Hadoop's open-source Java framework: ImageBundle and ImageWritable. ImageBundle is similar to Hadoop's built-in SequenceFile: it combines a large number of image-format files into one large file stored on HDFS in fixed key-value form <key: imageID, value: ImageWritable>. The custom ImageWritable inherits Hadoop's Writable and is used to encode and decode an ImageBundle; its two key functions encode() and decode(), inherited from Writable, respectively decode an image's binary file into key-value form and encode key-value pairs back into binary format.
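The two classes above are Java; as a language-neutral illustration of their roles, this hypothetical Python analogue packs many images into one bundle of length-prefixed <key, value> records and reads them back:

```python
import struct

def encode_bundle(items, path):
    """items: iterable of (image_id: str, data: bytes) pairs."""
    with open(path, "wb") as f:
        for key, data in items:
            kb = key.encode("utf-8")
            f.write(struct.pack(">II", len(kb), len(data)))  # key and value lengths
            f.write(kb)
            f.write(data)

def decode_bundle(path):
    """Yield (image_id, data) pairs back out of a bundle file."""
    with open(path, "rb") as f:
        while header := f.read(8):
            klen, vlen = struct.unpack(">II", header)
            yield f.read(klen).decode("utf-8"), f.read(vlen)
```

The experiment then proceeds as follows: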
1. The 20,000 images were scaled for size normalization and converted to grayscale; the normalized image size in this example is 256×256.
2. The 20,000 images were encoded into key-value form with ImageWritable, stored together in one ImageBundle file, and uploaded to HDFS.
3. Parallelized feature extraction was performed with MapReduce on the ImageBundle file uploaded in step 2, and the results were stored in a SequenceFile in <key: image ID, value: image features> form and uploaded to HDFS. The features extracted in this embodiment are SIFT features, which are local image features; the number of features per image varies, and each SIFT feature has 128 dimensions.
4. Parallelized KMeans clustering was performed on the image-feature SequenceFile generated in step 3, using the MapReduce-KMeans algorithm described in Example 1. In this embodiment the number of cluster centers K is 512 and the loop-termination threshold is 0.01: the loop terminates when the mean Euclidean distance between the cluster centers generated by two successive iterations is less than 0.01. This step generates 512 cluster centers as the visual dictionary; each cluster center is a 128-dimensional vector serving as one visual word.
5. Based on step 7 described in example 1, IDF weight evaluation is performed on each visual word in the visual dictionary generated in step 4.
6. The image-feature SequenceFile generated in step 3 was downloaded from HDFS for MapReduce-parallelized sparse coding and similarity computation: the Map function sparse-codes the image set using steps 8 and 9 of Example 1 and hashes the codes to the Reduce functions according to step 10 of Example 1; after receiving the sparse codes sent by the Map function, the Reduce function computes similarities by the method of step 11 of Example 1 and outputs the similar pairs above the threshold. In this embodiment the sparsity L is 10, each image is given a 512-dimensional sparse code, and the similarity threshold is 0.7.
7. The similar image pairs generated in step 6 were merged, and the near-duplicate similar image set was finally output.
The experimental results on the test data show that the algorithm achieves a precision of 0.9 at a recall of 0.86, with a total elapsed time of 3.24 kiloseconds. FIG. 5 shows 9 images randomly sampled for each of 5 keywords (Flower, Iphone, Colosseum, Elephant, Cosmopolitan) from the clustering results obtained by different methods; the F value is the F1-measure. The comparison methods are the Partition min-Hash algorithm (PmH), the Geometric min-Hash algorithm (GmH), the min-Hash method (mH), the standard LSH algorithm (st.LSH), and a Bag-of-Visual-Words-based tree search algorithm (baseline). FIG. 6 shows the results of clustering the 17,000 Flickr photos with the present method, where Cluster size is the number of photos in the corresponding cluster set.
Claims (7)
1. A similar image duplicate detection method based on sparse representation, proposed on the Hadoop distributed computing framework and comprising obtaining the IDF-weighted sparse codes $g'$ of all images in an image set $I$, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_w, \ldots, I_z, \ldots, I_R)$, the code of image $I_i$ is $g_i' \in g'$, $i$ is a natural number $\geq 1$, $w$ is a natural number $> i$, $z$ is a natural number $> w$, and $R$ is a natural number $> z$, characterized in that the method further comprises:
(1) extracting the non-zero elements of the IDF-weighted sparse code $g_i'$ of image $I_i$: with $g_{ik}' \in g_i'$ and $k$ a natural number $\geq 1$, the non-zero elements of $g_i'$ are $(g_{iu}', \ldots, g_{iv}')$, $m$ in total, where $m$ is a natural number with $1 \leq m \leq k$, $g_{iu}' \neq 0$, $g_{iv}' \neq 0$, $u \geq 1$, $v \geq 1$, and $k > v > u$;

(2) establishing $k$ groups, one per code dimension, each initialized as an empty matrix;

(3) using the transformation matrix of Equation (1), hashing the IDF-weighted sparse code $g_i'$ of image $I_i$ into the $m$ groups corresponding to the subscripts $(u, \ldots, v)$ of its non-zero elements;

(4) in each of the $m$ groups obtained in step (3), computing the similarity $Y$ of the IDF-weighted sparse codes of every image pair $\langle I_i, I_j \rangle$; if $Y > 0.7$, the images $\langle I_i, I_j \rangle$ form a similar image pair; where $j$ is a natural number $\geq 1$, $i \neq j$, and $g_i'$ and $g_j'$ denote the IDF-weighted sparse codes of images $I_i$ and $I_j$, respectively;

(5) merging the similar image pairs in the result of step (4) that share an image, generating the similar image subsets.
2. The sparse-representation-based near-duplicate image detection method of claim 1, wherein obtaining the IDF-weighted sparse codes $g'$ of all images in the image set $I$ comprises the steps of:
extracting the local features of each image in parallel to obtain the local features $S$ of all images in the image set $I$;
extracting the image cluster centers to obtain a feature dictionary $E$;
calculating the weight of each cluster center in $E$;
extracting the IDF-weighted sparse code $g'$ of each image according to the weights of the cluster centers in $E$.
3. The sparse-representation-based near-duplicate image detection method of claim 2, wherein the parallelized extraction of local image features extracts the SIFT features of all images of the image set $I$.
4. The sparse-representation-based near-duplicate image detection method of claim 3, wherein extracting the SIFT features of all images comprises the following specific steps:
normalize the size of each image $I_i$ in the image set $I$ and convert it to grayscale, obtaining a standard-size grayscale image set, where $I = (I_1, I_2, \ldots, I_i, \ldots, I_w, \ldots, I_z, \ldots, I_R)$;

distribute the standard-size grayscale image set across the cluster nodes and extract the SIFT features of each image in parallel; the SIFT features of all images are denoted $S$, where $S_a \in S$ is a vector and $a$ is a natural number $\geq 1$, and the SIFT features of image $I_i$ are denoted $F_i$, where $F_{ib} \in F_i$ is a vector and $b$ is a natural number $\geq 1$.
5. The sparse-representation-based near-duplicate image detection method of claim 2, wherein extracting the image cluster centers comprises the specific steps of:
(11) randomly select $k$ SIFT features from $S$ as the initial $k$ cluster centers, forming the initial cluster-center set $A_0$ with $A_{0k} \in A_0$; compute the Euclidean distance from each $S_a$ to the cluster centers $A_d$, where $A_{dk} \in A_d$, $d$ is a non-negative integer $\geq 0$, $k$ is a natural number $\geq 1$, and $A_{dk}$ is a vector; take $S_a$ as the value and the cluster center nearest to $S_a$ in Euclidean distance as the key;

(21) gather the $S_a$ that share the same key, set $d = d + 1$, and take each group's mean as a new cluster center in $A_d$;

(31) compute the mean Euclidean distance between the new cluster centers $A_d$ and the centers $A_{d-1}$ of the previous iteration; if it is greater than 0.05, jump back to step (11); if it is less than 0.05, output the new cluster centers $A_d$ as the feature dictionary $E$ of the image set $I$, where $E_k \in E$, $k$ is a natural number $\geq 1$, and each $E_k$ is a vector, i.e., an image cluster center.
6. The sparse-representation-based near-duplicate image detection method of claim 2, wherein calculating the weight of each cluster center in $E$ comprises: computing the weight of each cluster center in $E$ from $D$ and $D_{E_k}$, where $D$ is the total number of SIFT features in $S$, $D$ a natural number $\geq 1$, and $D_{E_k}$ is the total number of SIFT features assigned to center $E_k$.
7. The sparse-representation-based near-duplicate image detection method of claim 2, wherein extracting the IDF-weighted image sparse codes comprises the following specific steps:

computing, for each feature vector $F_{ib}$ of image $I_i$, the Euclidean distances $h = (h_1, h_2, \ldots)$ to the feature dictionary $E$, where $E_k \in E$, $h_k \in h$, and $k$ is a natural number $\geq 1$; selecting the $m$ words $E_k$ with the smallest $h_k$ to form the local feature dictionary $E' = (E_f, \ldots, E_g)$, which contains $m$ vectors, with $f \geq 1$, $g \geq 1$, and $g > f$;

computing the squared-difference matrix $C_{ib} = (E' - \mathbf{1}F_{ib}^T)(E' - \mathbf{1}F_{ib}^T)^T$ of $F_{ib}$ with respect to $E'$;

computing the sparse code $c_{ib}$ of $F_{ib}$, where $h_m' \in h' = (h_1', h_2', \ldots)$ are the Euclidean distances from $F_{ib}$ to the visual words of $E'$, and $\mathrm{diag}(h')$ denotes the matrix whose main diagonal holds the elements of $h'$;

extracting the $L$ largest values of $c_{ib}$ over the image to obtain the image sparse code $g_i$ of image $I_i$, with $g_{ik} \in g_i$;

applying inverse-visual-word-frequency weighting and normalization to $g_i$ to obtain the IDF-weighted sparse code $g'$.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611130891.8A CN106599917A (en) | 2016-12-09 | 2016-12-09 | Similar image duplicate detection method based on sparse representation |
PCT/CN2017/070197 WO2018103179A1 (en) | 2016-12-09 | 2017-01-05 | Near-duplicate image detection method based on sparse representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611130891.8A CN106599917A (en) | 2016-12-09 | 2016-12-09 | Similar image duplicate detection method based on sparse representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106599917A (en) | 2017-04-26 |
Family
ID=58598522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611130891.8A Pending CN106599917A (en) | 2016-12-09 | 2016-12-09 | Similar image duplicate detection method based on sparse representation |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106599917A (en) |
WO (1) | WO2018103179A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726724B (en) * | 2018-12-21 | 2023-04-18 | 浙江农林大学暨阳学院 | Water gauge image feature weighted learning identification method under shielding condition |
CN111080525B (en) * | 2019-12-19 | 2023-04-28 | 成都海擎科技有限公司 | Distributed image and graphic primitive splicing method based on SIFT features |
CN112488221B (en) * | 2020-12-07 | 2022-06-14 | 电子科技大学 | Road pavement abnormity detection method based on dynamic refreshing positive sample image library |
CN113554082B (en) * | 2021-07-15 | 2023-11-21 | 广东工业大学 | Multi-view subspace clustering method for self-weighted fusion of local and global information |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462199B (en) * | 2014-10-31 | 2017-09-12 | 中国科学院自动化研究所 | A kind of approximate multiimage searching method under network environment |
CN104392250A (en) * | 2014-11-21 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Image classification method based on MapReduce |
- 2016-12-09: application CN201611130891.8A filed in China; publication CN106599917A (en), status Pending
- 2017-01-05: application PCT/CN2017/070197 filed; publication WO2018103179A1 (en), Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103745465A (en) * | 2014-01-02 | 2014-04-23 | 大连理工大学 | Sparse coding background modeling method |
CN104504406A (en) * | 2014-12-04 | 2015-04-08 | 长安通信科技有限责任公司 | Rapid and high-efficiency near-duplicate image matching method |
CN104778476A (en) * | 2015-04-10 | 2015-07-15 | 电子科技大学 | Image classification method |
CN106023098A (en) * | 2016-05-12 | 2016-10-12 | 西安电子科技大学 | Image repairing method based on tensor structure multi-dictionary learning and sparse coding |
Non-Patent Citations (4)
Title |
---|
JINJUN WANG et al.: "Locality-constrained Linear Coding for Image Classification", 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition *
WANQING ZHAO et al.: "MapReduce-based clustering for near-duplicate image identification", Multimedia Tools and Applications *
ZHANG Xing: "Research on Image Classification Technology for Traffic Scenes", China Master's Theses Full-text Database, Information Science and Technology Series *
QIAN Xiaoliang et al.: "A Frequency-Domain Visual Saliency Detection Algorithm Based on Weighted Sparse Coding", Acta Electronica Sinica *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI714321B (en) * | 2018-11-01 | 2020-12-21 | 大陸商北京市商湯科技開發有限公司 | Method, apparatus and electronic device for database updating and computer storage medium thereof |
CN110738260A (en) * | 2019-10-16 | 2020-01-31 | 名创优品(横琴)企业管理有限公司 | Method, device and equipment for detecting placement of space boxes of retail stores of types |
CN111325245A (en) * | 2020-02-05 | 2020-06-23 | 腾讯科技(深圳)有限公司 | Duplicate image recognition method and device, electronic equipment and computer-readable storage medium |
CN111325245B (en) * | 2020-02-05 | 2023-10-17 | 腾讯科技(深圳)有限公司 | Repeated image recognition method, device, electronic equipment and computer readable storage medium |
CN112699294A (en) * | 2020-12-30 | 2021-04-23 | 深圳前海微众银行股份有限公司 | Software head portrait management method, system, equipment and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2018103179A1 (en) | 2018-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599917A (en) | Similar image duplicate detection method based on sparse representation | |
Li et al. | Recent developments of content-based image retrieval (CBIR) | |
JP5926291B2 (en) | Method and apparatus for identifying similar images | |
Yan et al. | Supervised hash coding with deep neural network for environment perception of intelligent vehicles | |
CN105912611B (en) | A kind of fast image retrieval method based on CNN | |
Van Der Maaten | Barnes-hut-sne | |
Paulevé et al. | Locality sensitive hashing: A comparison of hash function types and querying mechanisms | |
Gong et al. | Learning binary codes for high-dimensional data using bilinear projections | |
Liu et al. | Large-scale unsupervised hashing with shared structure learning | |
CN104112018B (en) | A kind of large-scale image search method | |
CN109063112B (en) | Rapid image retrieval method, model and model construction method based on multitask learning deep semantic hash | |
Huang et al. | Object-location-aware hashing for multi-label image retrieval via automatic mask learning | |
Pan et al. | Product quantization with dual codebooks for approximate nearest neighbor search | |
EP2742486A2 (en) | Coding of feature location information | |
CN106033426A (en) | Image retrieval method based on latent semantic minimum hash | |
CN114329109B (en) | Multimodal retrieval method and system based on weakly supervised Hash learning | |
CN104951791A (en) | Data classification method and apparatus | |
Masci et al. | Sparse similarity-preserving hashing | |
CN116777727B (en) | Integrated memory chip, image processing method, electronic device and storage medium | |
Ma et al. | Error correcting input and output hashing | |
CN112488231A (en) | Cosine measurement supervision deep hash algorithm with balanced similarity | |
CN109934270B (en) | Classification method based on local manifold discriminant analysis projection network | |
Duan et al. | Minimizing reconstruction bias hashing via joint projection learning and quantization | |
Wu et al. | Codebook-free compact descriptor for scalable visual search | |
CN104346456A (en) | Digital image multi-semantic annotation method based on spatial dependency measurement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170426 |