CN106599917A - Similar image duplicate detection method based on sparse representation - Google Patents

Similar image duplicate detection method based on sparse representation

Info

Publication number
CN106599917A
CN106599917A (application CN201611130891.8A)
Authority
CN
China
Prior art keywords
image
equal
natural number
cluster centre
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611130891.8A
Other languages
Chinese (zh)
Inventor
赵万青
罗迒哉
范建平
彭进业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN201611130891.8A priority Critical patent/CN106599917A/en
Priority to PCT/CN2017/070197 priority patent/WO2018103179A1/en
Publication of CN106599917A publication Critical patent/CN106599917A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/513 Sparse representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a near-duplicate image detection method based on sparse representation, built on the Hadoop distributed computing framework. The detection method includes the following steps: first obtain an image set I in which the sparse codes of all images are g'; extract the non-zero elements of g' and hash the sparse code g<i>' of each image I<i> into the sets corresponding to the subscripts of its non-zero elements; in each Reduce function, compute the similarity Y of the sparse codes for every pair of images <Iw, Iz>, and if Y is larger than 0.7, output <Iw, Iz> as a similar image pair; finally, merge the similar image pairs that share an image Iw to generate near-duplicate image subsets. By computing in parallel, the method greatly improves the efficiency of the K-Means clustering algorithm on large-scale data sets; it also introduces sparse representation theory and admits a fast implementation that does not require extensive optimization solving.

Description

A near-duplicate image detection method based on sparse representation
Technical field
The invention belongs to the field of near-duplicate image detection and relates to a parallelized near-duplicate image detection method based on sparse representation, which can efficiently and accurately extract sets of near-duplicate images from massive image collections.
Background technology
With the development of the mobile Internet and digital cameras, more and more multimedia data is shared on the Internet. Because photographers often shoot the same objects from the same positions and angles, the Internet now contains a large number of similar pictures. Extracting these sets of similar images not only allows duplicate filtering of image search results; it is also an essential step in many image processing tasks such as image clustering, image recognition, and image classification.
A near-duplicate image is usually derived from an original image by one of several transformations, typically including translation, scaling, cropping, tone change, text overlay, format change, and resolution change. Near-duplicate image detection means either finding, for a given query image, its near duplicates in a data set, or extracting all near-duplicate subsets of a data set. At present, most near-duplicate detection systems are built with Bag-of-Words and LSH. The Bag-of-Words model maps the local features of each image onto a trained dictionary to produce a visual-word frequency histogram vector. An image representation method based on Bag-of-Words typically has 3 parts: 1) extract the local features of each image; 2) cluster the local features of the image set to build a visual dictionary; 3) map the local feature vectors of each image to a word-frequency histogram. LSH (Locality-Sensitive Hashing) is a randomized indexing method for high-dimensional data that trades some lookup accuracy for approximately linear search in a high-dimensional space, returning approximate nearest neighbours of the query data. Its basic idea is to map input data points into buckets with a family of hash functions such that nearby points fall into the same bucket with high probability and distant points fall into the same bucket with low probability; the other data points in a query point's bucket can then be regarded as that query's neighbours. However, because the BoW model defines local features too strictly and LSH trades accuracy for efficiency, detection results are often unsatisfactory. In addition, existing near-duplicate image detection systems usually compute on a single node; with the explosive growth of data volume, single-node systems cannot meet current application needs. Multi-node parallel computing therefore becomes the inevitable choice, and among the many distributed frameworks, Hadoop is the most stable and efficient.
The content of the invention
In view of the above problems in the prior art, the object of the present invention is to propose a distributed system and method, based on sparse representation, for extracting near-duplicate image sets from large-scale image collections. The method not only improves the efficiency of processing large-scale image sets, but also markedly improves the accuracy of detection results compared with traditional methods.
To achieve the above task, the present invention adopts the following technical solution:
A near-duplicate image detection method based on sparse representation, proposed on the Hadoop distributed computing framework. The detection method comprises the following steps: obtain the IDF-weighted sparse codes g' of all images in an image set I, where I = (I1, I2, ..., Ii, ..., Iw, ..., Iz, ..., IR), the IDF-weighted sparse code of Ii is gi', gi' ∈ g', i is a natural number greater than or equal to 1, w is a natural number greater than i, z is a natural number greater than w, and R is a natural number greater than z. The method is characterized in that it further comprises:
(1) extract the non-zero elements of the IDF-weighted sparse code gi' of image Ii, where gik' ∈ gi' and k is a natural number greater than or equal to 1. Let the non-zero elements of gi' be (giu', ..., giv'); if their number is m, then m is a natural number greater than or equal to 1, m ≤ k, giu' ≠ 0, giv' ≠ 0, u and v are natural numbers greater than or equal to 1, and k > v > u;
(2) set up k groups, each initialized as an empty matrix;
(3) using the matrix transformation of (formula 1), hash the IDF-weighted sparse code gi' of image Ii into the m groups corresponding to its non-zero-element subscripts (u, ..., v);
(4) compute, for each pair of images <Ii, Ij> in each of the m groups obtained in step (3), the similarity Y of their IDF-weighted sparse codes; if Y is greater than 0.7, <Ii, Ij> is a similar image pair;
where j is a natural number greater than or equal to 1 and i ≠ j; g'i and g'j denote the IDF-weighted sparse codes of images Ii and Ij, respectively;
(5) merge the similar image pairs from step (4) that share an image to generate near-duplicate image subsets.
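Steps (1)-(5) can be sketched as a minimal single-process analogue (not the patent's MapReduce implementation; the function names are mine, and the use of the Jaccard index over the supports of the codes is an assumption taken from the Jaccard-based matching the description mentions later):

```python
from collections import defaultdict

def hash_to_buckets(codes):
    """Map each image id into every bucket indexed by a nonzero
    position of its IDF-weighted sparse code (step (3))."""
    buckets = defaultdict(set)
    for img_id, code in codes.items():
        for idx, val in enumerate(code):
            if val != 0:
                buckets[idx].add(img_id)
    return buckets

def jaccard(code_a, code_b):
    """Jaccard index over the supports (nonzero index sets) of two codes."""
    a = {i for i, v in enumerate(code_a) if v != 0}
    b = {i for i, v in enumerate(code_b) if v != 0}
    return len(a & b) / len(a | b)

def similar_pairs(codes, threshold=0.7):
    """Emit pairs whose similarity Y exceeds the 0.7 threshold (step (4))."""
    seen, pairs = set(), []
    for bucket in hash_to_buckets(codes).values():
        members = sorted(bucket)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pair = (members[i], members[j])
                if pair in seen:
                    continue  # the same pair can share several buckets
                seen.add(pair)
                if jaccard(codes[pair[0]], codes[pair[1]]) > threshold:
                    pairs.append(pair)
    return pairs
```

Because two images are only ever compared when they share at least one nonzero position, most dissimilar pairs are never examined, which is the point of the hashing step.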
Further, obtaining the IDF-weighted sparse codes g' of all images in the image set I comprises the following steps:
extract the local features of each image in parallel, obtaining the local features S of all images in the image set I;
extract the image cluster centres, obtaining the feature dictionary E;
compute the weight of each cluster centre in E;
according to the cluster-centre weights in E, extract the IDF-weighted sparse codes g' of the images.
Further, the parallel extraction of image local features extracts the SIFT features of all images in the image set I.
Further, extracting the SIFT features of all images comprises the following concrete steps:
standardize the size of each image Ii in the image set I and convert it to grayscale, obtaining a grayscale image set of standard size, where I = (I1, I2, ..., Ii, ..., Iw, ..., Iz, ..., IR);
split the standard-size grayscale image set across the cluster nodes and extract the SIFT features of each image in parallel. The SIFT features of all images are denoted S, where Sa ∈ S, Sa is a vector, and a is a natural number greater than or equal to 1; the SIFT features of image Ii are Fi, where Fib ∈ Fi, Fib is a vector, and b is a natural number greater than or equal to 1.
Further, extracting the image cluster centres comprises the following concrete steps:
(11) randomly select k SIFT features from S as the initial k cluster centres, forming the initial cluster-centre set A0, A0k ∈ A0; compute the Euclidean distance from each Sa to the cluster centres Ad, take Sa as the value and the cluster centre nearest to Sa as the key; Adk ∈ Ad, d is a non-negative integer greater than or equal to 0, k is a natural number greater than or equal to 1, and Adk is a vector;
(21) average the Sa sharing the same key, set d = d + 1, and take each average as a new cluster centre Ad;
(31) compute the mean Euclidean distance between the new cluster centres Ad and the cluster centres Ad-1 of the previous iteration; if this mean is greater than 0.05, return to step (11); if it is less than 0.05, output the new cluster centres Ad as the feature dictionary E of the image set I, where Ek ∈ E, k is a natural number greater than or equal to 1, and Ek is a vector. The feature dictionary E is the set of image cluster centres.
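The loop of steps (11)-(31) can be sketched in a single process; in the real system the assignment phase runs as Map and the averaging as Reduce, while here `kmeans`, `dist`, and `mean` are illustrative names and the tolerance mirrors the 0.05 threshold above:

```python
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(features, k, tol=0.05, seed=0, max_iter=100):
    """K-Means with the patent's stopping rule: iterate until the mean
    Euclidean shift of the centres between two rounds falls below tol."""
    rng = random.Random(seed)
    centres = rng.sample(features, k)          # initial centre set A0
    for _ in range(max_iter):
        groups = {i: [] for i in range(k)}
        for f in features:                     # 'map': key = nearest centre
            i = min(range(k), key=lambda c: dist(f, centres[c]))
            groups[i].append(f)
        # 'reduce': average the features sharing a key
        new = [mean(groups[i]) if groups[i] else centres[i] for i in range(k)]
        shift = sum(dist(a, b) for a, b in zip(centres, new)) / k
        centres = new
        if shift < tol:                        # step (31) convergence test
            break
    return centres
```

The returned centres play the role of the feature dictionary E; each centre is one visual word.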
Further, computing the weight of each cluster centre in E concretely is: compute each centre's IDF weight from D, the total number of SIFT features in S (a natural number greater than or equal to 1), and the number of SIFT features belonging to centre Ek.
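The weight formula itself is omitted in the source; the definitions around it are consistent with a standard inverse-document-frequency weighting, and the sketch below assumes that form (the name `idf_weights` and the formula w_k = log(D / D_k) are my assumptions, not confirmed by the source):

```python
import math

def idf_weights(assignments, num_centres):
    """IDF-style weight per visual word: w_k = log(D / D_k), where D is
    the total number of SIFT features and D_k the number assigned to
    centre E_k. Centres with no features get weight 0 by convention."""
    total = len(assignments)
    counts = [0] * num_centres
    for k in assignments:
        counts[k] += 1
    return [math.log(total / c) if c else 0.0 for c in counts]
```

Rarely used visual words get large weights and common ones small weights, matching the stated goal of raising the weight of the more representative code words.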
Further, extracting the IDF-weighted image sparse codes comprises the following concrete steps:
compute the Euclidean distances h between each feature vector Fib of image Ii and the feature dictionary E, where Ek ∈ E, hk ∈ h = (h1, h2, ...), and k is a natural number greater than or equal to 1; choose from E the m centres Ek with the smallest hk to form the feature dictionary E' = (Ef, ..., Eg), which contains m vectors, where f and g are natural numbers greater than or equal to 1 and g > f;
compute the squared-difference matrix Cib of Fib and the feature dictionary E' as Cib = (E' - 1·Fib^T)(E' - 1·Fib^T)^T, where 1 is a column vector of ones;
compute the sparse code cib of Fib over the feature dictionary E', where hm' ∈ h' = (h1', h2', ...) are the Euclidean distances from the feature vector Fib to the visual words of E', and diag(h') places the elements of h' on the leading diagonal of a matrix;
extract the k largest elements of cib to obtain the image sparse code gi of image Ii, where gik ∈ gi;
weight the image sparse code gi by the inverse visual-word frequencies and normalize it to obtain the IDF-weighted sparse code g'.
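The squared-difference matrix Cib and the solve against diag(h') match the closed form of locality-constrained linear coding (LLC); the sketch below follows that reading (the function name, the regularization weight `lam`, and the exact solve are my assumptions where the source omits the formula):

```python
import numpy as np

def llc_code(f, E_full, m=3, lam=1e-4):
    """LLC-style local coding for one descriptor f: select the m nearest
    dictionary atoms E', form C = (E' - 1 f^T)(E' - 1 f^T)^T, solve
    (C + lam * diag(h')) c = 1, and normalise so the code sums to one."""
    d = np.linalg.norm(E_full - f, axis=1)     # distances h to all of E
    idx = np.argsort(d)[:m]                    # m nearest atoms -> E'
    Ep = E_full[idx]
    shifted = Ep - f                           # rows are E'_j - f
    C = shifted @ shifted.T                    # squared-difference matrix C_ib
    c = np.linalg.solve(C + lam * np.diag(d[idx]), np.ones(m))
    c = c / c.sum()                            # sum-to-one constraint
    code = np.zeros(len(E_full))
    code[idx] = c                              # sparse: zeros off the m atoms
    return code
```

A descriptor close to one atom receives most of its coefficient mass there, which is why keeping only the largest entries of cib (the step above) loses little information.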
The beneficial effects of the invention are as follows:
(1) The present invention learns the over-complete feature dictionary online by running the K-Means algorithm in MapReduce parallel mode, so that it captures as much global feature structure as possible and can find, within the over-complete dictionary, the combination of atoms with the strongest representational power to represent image features. Online dictionary learning runs one MapReduce job per iteration and terminates according to a given threshold; this parallelization greatly improves the computational efficiency of the K-Means clustering algorithm on large-scale data sets;
(2) The present invention introduces sparse representation theory: each feature is reconstructed from its K nearest code words in the dictionary. Compared with the vector quantization of the original BoW, reconstruction from multiple code words yields a smaller reconstruction error, and choosing local neighbours better achieves local smoothness. At the same time, compared with traditional sparse representation methods, this approach retains sparsity while admitting a faster implementation that does not require extensive optimization solving;
(3) The present invention weights the image sparse-representation vectors by counting the global inverse visual-word frequency (IDF), making them more representative: the weights of less representative code words in the sparse code are reduced and those of more representative code words are raised, increasing sparsity, so that the highly representative features of similar images have a higher probability of being identical or similar;
(4) The present invention hashes candidate similar sets by extracting the highly representative features of the sparse representation, and matches images within each hash bucket by computing pairwise Jaccard index similarities. This method obtains similar image pairs simply and efficiently, and because it fully exploits the characteristics of sparse representation and performs parallel computation with the MapReduce model, it greatly improves the matching efficiency and accuracy on large-scale image data sets;
(5) The present invention builds a distributed near-duplicate image detection system on Hadoop's distributed storage system HDFS and parallel computing model MapReduce. Existing near-duplicate image detection systems usually support only a single-node framework, but with the development of the mobile Internet, image data grows exponentially, and conventional systems cannot handle storage and computation at such volumes. The present invention is therefore highly scalable for massive data while greatly improving the computational efficiency of near-duplicate detection.
Description of the drawings
Fig. 1 is a structural diagram of the near-duplicate image detection method based on sparse representation of the present invention;
Fig. 2 is a flow chart of the parallel K-Means algorithm based on MapReduce implemented by the present invention;
Fig. 3 is a flow chart of the image sparse coding algorithm based on local preference search implemented by the present invention;
Fig. 4 is a flow chart of the parallel similar-set detection algorithm based on MapReduce and salient dimension features implemented by the present invention;
Fig. 5 shows the clustering results obtained by different methods for 5 keywords in embodiment 2;
Fig. 6 shows the clustering results of embodiment 2.
Specific embodiment
Specific embodiments of the present invention are given below. It should be noted that the invention is not limited to the following specific embodiments; all equivalents made on the basis of the technical solution of the present application fall within the protection scope of the present invention.
Embodiment 1
This method is proposed on the Hadoop distributed computing framework. In Fig. 1, a near-duplicate image detection method based on sparse representation comprises a feature extraction module, a feature dictionary construction module, an image sparse-coding representation module, a similar-image matching and filtering module, and a near-duplicate image pair merging module. The feature extraction module extracts, in parallel, the original local image features of all images in the collection; the feature dictionary construction module builds the original feature dictionary from the extracted features using a parallel K-Means online dictionary-learning algorithm; the image sparse-coding representation module maps the original local features of each image, via the constructed dictionary and a sparse-coding method, to one sparse vector representing that image; the similar-image matching and filtering module computes, in parallel, the similarity of the filtered image pairs and outputs the pairs whose similarity exceeds a given threshold; the near-duplicate image pair merging module merges all similar image pairs output by the matching and filtering module into near-duplicate image sets.
The detection method comprises the steps:
Step 1: standardize the size of each image Ii in the image set I, where I = (I1, I2, ..., Ii, ..., Iw, ..., Iz, ..., IR), for example to 128*128, 256*256 or 512*512; in the present invention the standard size is 256*256. Then apply grayscale processing to obtain a grayscale image set of standard size. Serialize each image with the custom ImageWritable, outputting the image's binary data stream; store all images as key-value pairs <key: image ID, value: imageWritable> compressed in a custom ImageBundle, and finally upload it to the distributed storage framework HDFS;
Step 2: download the ImageBundle file from HDFS. Hadoop automatically splits the ImageBundle file across the Map functions of different nodes according to the number of cluster nodes. Each Map function receives images in key-value form and, using the published SIFT algorithm, extracts the SIFT feature vectors of each image in the image set, forming each image's SIFT feature-vector sequence. The SIFT features of all images are denoted S, where Sa ∈ S, Sa is a vector, and a is a natural number greater than or equal to 1; the SIFT features of image Ii are Fi, where Fib ∈ Fi, Fib is a vector, and b is a natural number greater than or equal to 1. Each image in the image set is stored on HDFS as a key-value pair <key: image ID, value: Fi>;
Step 3: randomly select k SIFT features from S as the initial k cluster centres, forming the initial cluster-centre set A0, A0k ∈ A0. Compute the Euclidean distance from each Sa to the cluster centres Ad, take Sa as the value and the nearest cluster centre Aq as the key, and output key-value pairs <key: Aq, value: Sa>; Adk ∈ Ad, d is a non-negative integer greater than or equal to 0, k is a natural number greater than or equal to 1, and Adk is a vector.
Step 4: average the Sa sharing the same key, set d = d + 1, and take each average as a new cluster centre Ad.
Step 5: compute the mean Euclidean distance between the new cluster centres Ad and the cluster centres Ad-1 of the previous iteration; if this mean is greater than 0.05, return to step 3; if it is less than 0.05, output the new cluster centres Ad as the feature dictionary E of the image set I, where Ek ∈ E, k is a natural number greater than or equal to 1, and Ek is a vector. The feature dictionary E is the set of image cluster centres.
Step 6: compute the weight of each cluster centre in E, where D is the total number of SIFT features in S (a natural number greater than or equal to 1); if the Euclidean distance from Sa to Ek is the smallest, Sa belongs to centre Ek, and each centre's weight is computed from the number of SIFT features belonging to that centre Ek.
Step 8: compute the Euclidean distances h between each feature vector Fib of image Ii and the feature dictionary E, where Ek ∈ E, hk ∈ h = (h1, h2, ...), and k is a natural number greater than or equal to 1; choose from E the m centres Ek with the smallest hk to form the feature dictionary E' = (Ef, ..., Eg), which contains m vectors, where f and g are natural numbers greater than or equal to 1 and g > f;
compute the squared-difference matrix Cib of Fib and the feature dictionary E' as Cib = (E' - 1·Fib^T)(E' - 1·Fib^T)^T, where 1 is a column vector of ones.
Step 9: compute the sparse code cib of Fib over the feature dictionary E', where hm' ∈ h' = (h1', h2', ...) are the Euclidean distances from the feature vector Fib to the visual words of E', and diag(h') places the elements of h' on the leading diagonal of a matrix;
extract the k largest elements of cib to obtain the image sparse code gi of image Ii, where gik ∈ gi;
weight the image sparse code gi by the inverse visual-word frequencies and normalize it to obtain the IDF-weighted sparse code g', where gi' ∈ g'.
Step 10: extract the non-zero elements of the IDF-weighted sparse code gi' of image Ii, where gik' ∈ gi' and k is a natural number greater than or equal to 1. Let the non-zero elements of gi' be (giu', ..., giv'); if their number is m, then m is a natural number greater than or equal to 1, m ≤ k, giu' ≠ 0, giv' ≠ 0, u and v are natural numbers greater than or equal to 1, and k > v > u;
Step 11: set up k groups, each initialized as an empty matrix;
Step 12: using the matrix transformation of (formula 1), hash the IDF-weighted sparse code gi' of image Ii into the m groups corresponding to its non-zero-element subscripts (u, ..., v);
Step 13: compute, for each pair of images <Ii, Ij> in each of the m groups obtained in step 12, the similarity Y of their IDF-weighted sparse codes; if Y is greater than 0.7, <Ii, Ij> is a similar image pair;
where j is a natural number greater than or equal to 1 and i ≠ j; g'i and g'j denote the IDF-weighted sparse codes of images Ii and Ij, respectively.
Step 14: merge the similar image pairs from step 13 that share an image to generate near-duplicate image subsets.
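Step 14's merging of pairs that share an image is a connected-components computation; a minimal single-process union-find sketch (the function name and the chaining assumption, that sharing any image links two pairs into one subset, are mine):

```python
def merge_pairs(pairs):
    """Merge similar image pairs into near-duplicate subsets:
    any two pairs sharing an image end up in the same subset."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:                      # union the two sides of each pair
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):                  # collect members by root
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())
```

On the output of step 13 this yields one sorted list of image ids per near-duplicate subset.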
Embodiment 2
In this embodiment, pictures of 30 world-famous scenic spots and buildings (for example, the pyramids of Egypt, the Eiffel Tower, the White House, and the Giant Wild Goose Pagoda) were retrieved with the Baidu image search engine, and 100 clear, accurate images of each type were manually selected from the retrieved images, forming 3,000 near-duplicate images. At the same time, 17,000 photos were randomly selected from the public data set Flickr-100M (source: http://webscope.sandbox.yahoo.com) as distractors, forming, together with the near-duplicate images, an experimental data set of 20,000 images in total.
Hadoop version 2.4.0 was selected as the experimental platform; 10 node computers form the Hadoop cluster of this embodiment. Since Hadoop itself does not support reading and processing image data, we defined two custom types on top of Hadoop's open-source Java framework: ImageBundle and ImageWritable. ImageBundle, similar to Hadoop's SequenceFile, merges a large number of image files into one large file and serializes it to HDFS in the fixed key-value form <key: imageID, value: imageWritable>. The custom ImageWritable inherits from Hadoop's Writable and encodes and decodes ImageBundle entries; its two key functions, encoder() and decoder(), respectively encode key-value pairs into binary format and decode an image's binary file into key-value form. The experimental steps are as follows:
1. The 20,000 images were scaled to a standard size and converted to grayscale; in this embodiment the standardized size is 256*256.
2. Using ImageBundle, the 20,000 images were encoded as key-value pairs, stored in one ImageBundle file, and uploaded to HDFS.
3. Parallel feature extraction was performed with MapReduce on the ImageBundle file uploaded in step 2, and the features were stored in a SequenceFile in key-value form <key: image ID, value: image features> and uploaded to HDFS. The features extracted in this embodiment are SIFT features, which are local image features; the number of features per image varies, and each SIFT feature has 128 dimensions.
4. Based on the MapReduce K-Means algorithm described in embodiment 1, parallel K-Means clustering was performed on the image-feature SequenceFile generated in step 3. In this embodiment the number of cluster centres is K = 512 and the loop termination condition is 0.01: the loop ends when the mean Euclidean distance between the cluster centres of two successive iterations is less than 0.01. This step produces 512 cluster centres as the visual dictionary; each cluster centre is a 128-dimensional vector serving as a visual word.
5. Based on step 7 described in embodiment 1, IDF weights were computed for each visual word in the visual dictionary generated in step 4.
6. The image-feature SequenceFile generated in step 3 was downloaded from HDFS for MapReduce parallel sparse coding and similarity computation. The Map functions sparse-code the image set using steps 8 and 9 of embodiment 1 and hash the codes to the Reduce functions per step 10 of embodiment 1; the Reduce functions receive the sparse codes sent by the Map functions, compute similarities per the method of step 11 of embodiment 1, and output the similar pairs exceeding the threshold. This embodiment uses sparsity L = 10, a 512-dimensional sparse code per image, and a similarity threshold of 0.7.
7. The similar image pairs produced in step 6 of this embodiment were merged, and the final near-duplicate image sets were output.
The test results on this data set show that our algorithm achieves a precision of 0.9 at a recall of 0.86, with a total runtime of 3.24 kiloseconds. Fig. 5 shows, for 5 keywords (Flower, Iphone, Colosseum, Elephant, Cosmopolitan), 9 images randomly sampled from the clustering results obtained by the different methods; the F values are the F1-measure. The compared methods are the Partition min-Hash algorithm (PmH), the Geometric min-Hash algorithm (GmH), the min-hash method (mH), the standard LSH algorithm (st.LSH), and the tree search algorithm based on Bag-of-Visual-Words (baseline). Fig. 6 shows the result of clustering the 17,000 Flickr photos with this method; Cluster size is the number of pictures in the corresponding cluster set.
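The F1-measure used in Fig. 5 is the harmonic mean of precision and recall; at the precision 0.9 and recall 0.86 reported above it evaluates to about 0.88 (a worked check, not a figure from the source):

```python
def f1(precision, recall):
    """F1-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```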

Claims (7)

1. A near-duplicate image detection method based on sparse representation, the method being proposed on the Hadoop distributed computing framework, the detection method comprising the following steps: obtaining the IDF-weighted sparse codes g' of all images in an image set I, where I = (I1, I2, ..., Ii, ..., Iw, ..., Iz, ..., IR), the IDF-weighted sparse code of Ii is gi', gi' ∈ g', i is a natural number greater than or equal to 1, w is a natural number greater than i, z is a natural number greater than w, and R is a natural number greater than z, characterized in that the method further comprises:
(1) extracting the non-zero elements of the IDF-weighted sparse code gi' of image Ii, where gik' ∈ gi' and k is a natural number greater than or equal to 1, the non-zero elements of gi' being (giu', ..., giv'); if their number is m, then m is a natural number greater than or equal to 1, m ≤ k, giu' ≠ 0, giv' ≠ 0, u and v are natural numbers greater than or equal to 1, and k > v > u;
(2) setting up k groups, each initialized as an empty matrix;
(3) using the matrix transformation of (formula 1), hashing the IDF-weighted sparse code gi' of image Ii into the m groups corresponding to its non-zero-element subscripts (u, ..., v);
(4) computing, for each pair of images <Ii, Ij> in each of the m groups obtained in step (3), the similarity Y of their IDF-weighted sparse codes; if Y is greater than 0.7, <Ii, Ij> is a similar image pair;
where j is a natural number greater than or equal to 1 and i ≠ j; g'i and g'j denote the IDF-weighted sparse codes of images Ii and Ij, respectively;
(5) merging the similar image pairs from step (4) that share an image to generate near-duplicate image subsets.
2. The near-duplicate image detection method based on sparse representation of claim 1, characterized in that obtaining the IDF-weighted sparse codes g' of all images in the image set I comprises the following steps:
extracting the local features of each image in parallel to obtain the local features S of all images in the image set I;
extracting the image cluster centres to obtain the feature dictionary E;
computing the weight of each cluster centre in E;
extracting the IDF-weighted sparse codes g' of the images according to the cluster-centre weights in E.
3. The near-duplicate image detection method based on sparse representation of claim 2, characterized in that the parallel extraction of image local features is the extraction of the SIFT features of all images in image set I.
4. The near-duplicate image detection method based on sparse representation of claim 3, characterized in that extracting the SIFT features of all images comprises the following concrete steps:
standardize the size of each image Ii in image set I and convert it to grayscale, obtaining a grayscale image set of uniform size, where I = (I1, I2, ..., Ii, ..., Iw, ..., Iz, ..., IR);
split the grayscale image set across the cluster nodes and extract the SIFT features of each image in parallel; the SIFT features of all images are denoted S, where Sa ∈ S, Sa is a vector, and a is a natural number greater than or equal to 1; the SIFT features of image Ii are Fi, where Fib ∈ Fi, Fib is a vector, and b is a natural number greater than or equal to 1.
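The preprocessing of claim 4 (size standardization and grayscale conversion) can be sketched in NumPy as below; the SIFT extraction itself would be delegated to a library detector (e.g. OpenCV's SIFT) and is not reproduced here. The 256-pixel target size, nearest-neighbour resampling, and BT.601 luma weights are assumptions, as the patent does not fix them:

```python
import numpy as np

def standardize(image, size=256):
    """Resize an H x W x 3 RGB image to size x size with nearest-neighbour
    sampling, then convert to grayscale using ITU-R BT.601 luma weights.
    Both the target size and the weights are illustrative choices."""
    h, w = image.shape[:2]
    rows = (np.arange(size) * h // size).clip(0, h - 1)   # source row per output row
    cols = (np.arange(size) * w // size).clip(0, w - 1)   # source col per output col
    resized = image[rows][:, cols].astype(np.float64)
    return resized @ np.array([0.299, 0.587, 0.114])      # weighted RGB -> gray
```

Each standardized grayscale image would then be shipped to a cluster node and run through the SIFT detector to produce the feature matrix Fi.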
5. The near-duplicate image detection method based on sparse representation of claim 2, characterized in that extracting the image cluster centres comprises the following concrete steps:
(11) randomly select k SIFT features from S as the initial k cluster centres, forming the initial cluster-centre set A0, A0k ∈ A0; compute the Euclidean distance from each Sa to the cluster centres Ad, take Sa as the value and the cluster centre nearest to Sa in Euclidean distance as the key, where Adk ∈ Ad, d is a nonnegative integer greater than or equal to 0, k is a natural number greater than or equal to 1, and Adk is a vector;
(21) compute the mean of all Sa sharing the same key, set d = d + 1, and take each mean as a new cluster centre Ad;
(31) compute the mean Euclidean distance between the new cluster centres Ad and the cluster centres Ad−1 of the previous iteration; if this mean is greater than 0.05, return to step (11); if it is less than 0.05, output the new cluster centres Ad as the feature dictionary E of image set I, where Ek ∈ E, k is a natural number greater than or equal to 1, and Ek is a vector; the feature dictionary E is the set of image cluster centres.
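Claim 5 is standard k-means phrased as MapReduce key/value pairs. A single-machine NumPy sketch, using the 0.05 mean-displacement threshold from step (31); the seed and iteration cap are illustrative:

```python
import numpy as np

def kmeans_dictionary(S, k, tol=0.05, seed=0, max_iter=100):
    """k-means over SIFT feature rows S, mirroring claim 5:
    map:    key = index of nearest centre, value = feature (step 11);
    reduce: new centre = mean of features sharing a key (step 21);
    stop when the mean centre displacement drops below tol (step 31)."""
    rng = np.random.default_rng(seed)
    A = S[rng.choice(len(S), size=k, replace=False)]   # initial centres A0
    for _ in range(max_iter):
        # step (11): assign each feature to its nearest centre
        dists = np.linalg.norm(S[:, None, :] - A[None, :, :], axis=2)
        keys = dists.argmin(axis=1)
        # step (21): average the features per key to get the new centres
        A_new = np.array([S[keys == c].mean(axis=0) if np.any(keys == c)
                          else A[c] for c in range(k)])
        # step (31): mean Euclidean displacement of the centres
        if np.linalg.norm(A_new - A, axis=1).mean() < tol:
            return A_new
        A = A_new
    return A
```

On Hadoop the assignment step is the map phase and the per-key averaging is the reduce phase; the driver re-broadcasts the centres each iteration until the displacement test passes.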
6. The near-duplicate image detection method based on sparse representation of claim 2, characterized in that computing the weight of each cluster centre in E comprises: computing each cluster-centre weight from D and DEk, where D is the total number of SIFT features in S, D is a natural number greater than or equal to 1, and DEk denotes the total number of SIFT features belonging to centre Ek.
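The weight formula of claim 6 appears in the source only as a formula image; a standard inverse-document-frequency form consistent with the surrounding definitions (D total features, DEk features assigned to centre Ek) would be wk = log(D / DEk). A sketch under that assumption:

```python
import numpy as np

def idf_weights(assignments, k):
    """IDF-style weight per cluster centre, assuming the standard form
    w_k = log(D / D_k): D = total SIFT features, D_k = features whose
    nearest centre is E_k (the patent's exact formula is not reproduced).
    Centres that attract no feature get weight 0 to avoid log(D/0)."""
    D = len(assignments)
    counts = np.bincount(assignments, minlength=k)   # D_k per centre
    return np.where(counts > 0, np.log(D / np.maximum(counts, 1)), 0.0)
```

As with textual IDF, a visual word that appears in nearly every image contributes little to distinguishing near-duplicates, so it receives a small weight.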
7. The near-duplicate image detection method based on sparse representation of claim 2, characterized in that extracting the IDF-weighted image sparse codes comprises the following concrete steps:
compute the Euclidean distances h between each feature vector Fib of image Ii and the feature dictionary E, where Ek ∈ E, hk ∈ h = (h1, h2, ...), and k is a natural number greater than or equal to 1; select from E the m words Ek with the smallest distances hk to form the sub-dictionary E′ = (Ef, ..., Eg), which contains m vectors, where f and g are natural numbers greater than or equal to 1 and g > f;
compute the squared-difference matrix Cib of Fib with respect to E′ using Cib = (E′ − 1Fibᵀ)(E′ − 1Fibᵀ)ᵀ, where 1 denotes a column vector of ones;
compute the sparse code cib of Fib over E′ from Cib and diag(h′), where hm′ ∈ h′ = (h1′, h2′, ...) are the Euclidean distances from the feature vector Fib to the visual words of E′, and diag(h′) denotes the matrix whose leading diagonal holds the elements of the vector h′;
extract the k largest elements of cib to obtain the image sparse code gi of image Ii, where gik ∈ gi;
apply inverse-document-frequency weighting and normalization to the image sparse code gi to obtain the IDF-weighted sparse code g′.
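The squared-difference matrix of claim 7 matches the locality-constrained linear coding (LLC) of Wang et al., cited in this patent's non-patent references. The code-solving formula is given in the source only as an image, so the sketch below uses LLC's analytic solution with a locality regularizer λ·diag(h′); the λ and m values are illustrative parameters:

```python
import numpy as np

def llc_code(x, E_full, m=5, lam=1e-4):
    """Locality-constrained coding of one descriptor x (claim 7 sketch).

    1. h: Euclidean distances from x to every dictionary word E_k;
    2. E': the m nearest words (the sub-dictionary);
    3. C = (E' - 1 x^T)(E' - 1 x^T)^T, the squared-difference matrix;
    4. solve (C + lam * diag(h')) c = 1 and normalise so sum(c) = 1
       (LLC's analytic solution; the patent's own formula is an image).
    Returns a full-length code with nonzeros only at the m chosen words."""
    h = np.linalg.norm(E_full - x, axis=1)   # distances to all words
    nearest = np.argsort(h)[:m]
    E_sub, h_sub = E_full[nearest], h[nearest]
    diff = E_sub - x                          # rows: E'_k - x, i.e. E' - 1 x^T
    C = diff @ diff.T
    c = np.linalg.solve(C + lam * np.diag(h_sub), np.ones(m))
    c /= c.sum()
    code = np.zeros(len(E_full))
    code[nearest] = c
    return code
```

Per claim 7, the per-descriptor codes of an image would then be max-pooled (keeping the k largest responses) into gi before the IDF weighting and normalization of claim 6's weights are applied.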
CN201611130891.8A 2016-12-09 2016-12-09 Similar image duplicate detection method based on sparse representation Pending CN106599917A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201611130891.8A CN106599917A (en) 2016-12-09 2016-12-09 Similar image duplicate detection method based on sparse representation
PCT/CN2017/070197 WO2018103179A1 (en) 2016-12-09 2017-01-05 Near-duplicate image detection method based on sparse representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611130891.8A CN106599917A (en) 2016-12-09 2016-12-09 Similar image duplicate detection method based on sparse representation

Publications (1)

Publication Number Publication Date
CN106599917A true CN106599917A (en) 2017-04-26

Family

ID=58598522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611130891.8A Pending CN106599917A (en) 2016-12-09 2016-12-09 Similar image duplicate detection method based on sparse representation

Country Status (2)

Country Link
CN (1) CN106599917A (en)
WO (1) WO2018103179A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738260A (en) * 2019-10-16 2020-01-31 名创优品(横琴)企业管理有限公司 Method, device and equipment for detecting placement of space boxes of retail stores of types
CN111325245A (en) * 2020-02-05 2020-06-23 腾讯科技(深圳)有限公司 Duplicate image recognition method and device, electronic equipment and computer-readable storage medium
TWI714321B (en) * 2018-11-01 2020-12-21 大陸商北京市商湯科技開發有限公司 Method, apparatus and electronic device for database updating and computer storage medium thereof
CN112699294A (en) * 2020-12-30 2021-04-23 深圳前海微众银行股份有限公司 Software head portrait management method, system, equipment and computer storage medium

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN109726724B (en) * 2018-12-21 2023-04-18 浙江农林大学暨阳学院 Water gauge image feature weighted learning identification method under shielding condition
CN111080525B (en) * 2019-12-19 2023-04-28 成都海擎科技有限公司 Distributed image and graphic primitive splicing method based on SIFT features
CN112488221B (en) * 2020-12-07 2022-06-14 电子科技大学 Road pavement abnormity detection method based on dynamic refreshing positive sample image library
CN113554082B (en) * 2021-07-15 2023-11-21 广东工业大学 Multi-view subspace clustering method for self-weighted fusion of local and global information

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103745465A (en) * 2014-01-02 2014-04-23 大连理工大学 Sparse coding background modeling method
CN104504406A (en) * 2014-12-04 2015-04-08 长安通信科技有限责任公司 Rapid and high-efficiency near-duplicate image matching method
CN104778476A (en) * 2015-04-10 2015-07-15 电子科技大学 Image classification method
CN106023098A (en) * 2016-05-12 2016-10-12 西安电子科技大学 Image repairing method based on tensor structure multi-dictionary learning and sparse coding

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN104462199B (en) * 2014-10-31 2017-09-12 中国科学院自动化研究所 A kind of approximate multiimage searching method under network environment
CN104392250A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Image classification method based on MapReduce


Non-Patent Citations (4)

Title
JINJUN WANG et al.: "Locality-constrained Linear Coding for Image Classification", 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition *
WANQING ZHAO et al.: "MapReduce-based clustering for near-duplicate image identification", Multimedia Tools and Applications *
ZHANG Hang: "Research on Image Classification Technology for Traffic Scenes", China Master's Theses Full-text Database, Information Science and Technology *
QIAN Xiaoliang et al.: "A Frequency-Domain Visual Saliency Detection Algorithm Based on Weighted Sparse Coding", Acta Electronica Sinica *


Also Published As

Publication number Publication date
WO2018103179A1 (en) 2018-06-14

Similar Documents

Publication Publication Date Title
CN106599917A (en) Similar image duplicate detection method based on sparse representation
Li et al. Recent developments of content-based image retrieval (CBIR)
JP5926291B2 (en) Method and apparatus for identifying similar images
Van Der Maaten Barnes-hut-sne
Mikulík et al. Learning a fine vocabulary
CN110609916A (en) Video image data retrieval method, device, equipment and storage medium
Aly et al. Indexing in large scale image collections: Scaling properties and benchmark
CN104036012B (en) Dictionary learning, vision bag of words feature extracting method and searching system
Ji et al. Towards low bit rate mobile visual search with multiple-channel coding
CN111198959A (en) Two-stage image retrieval method based on convolutional neural network
Huang et al. Unconstrained multimodal multi-label learning
Huang et al. Object-location-aware hashing for multi-label image retrieval via automatic mask learning
CN108304573A (en) Target retrieval method based on convolutional neural networks and supervision core Hash
Bai et al. Deep progressive hashing for image retrieval
CN111027559A (en) Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling
Hu et al. Delving into deep representations for remote sensing image retrieval
Wang et al. Learning A deep l1 encoder for hashing
Wu et al. A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval
CN110110120B (en) Image retrieval method and device based on deep learning
Xue et al. Mobile image retrieval using multi-photos as query
Farhangi et al. Informative visual words construction to improve bag of words image representation
Shi et al. Sift-based elastic sparse coding for image retrieval
Tian et al. A Novel Deep Embedding Network for Building Shape Recognition
Liu et al. Selection of canonical images of travel attractions using image clustering and aesthetics analysis
Ji et al. A lowbit rate vocabulary coding scheme for mobile landmark search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170426