CN103049514B - A balanced image clustering method based on hierarchical clustering

A balanced image clustering method based on hierarchical clustering

Info

Publication number
CN103049514B
CN103049514B CN201210545637.XA CN201210545637A
Authority
CN
China
Prior art keywords
cluster
clustering
data
image
clustering cluster
Prior art date
Legal status
Expired - Fee Related
Application number
CN201210545637.XA
Other languages
Chinese (zh)
Other versions
CN103049514A (en)
Inventor
薛亮
孙凯
Current Assignee
HANGZHOU TAOTAOSOU TECHNOLOGY Co Ltd
Original Assignee
HANGZHOU TAOTAOSOU TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by HANGZHOU TAOTAOSOU TECHNOLOGY Co Ltd filed Critical HANGZHOU TAOTAOSOU TECHNOLOGY Co Ltd
Priority to CN201210545637.XA priority Critical patent/CN103049514B/en
Publication of CN103049514A publication Critical patent/CN103049514A/en
Application granted granted Critical
Publication of CN103049514B publication Critical patent/CN103049514B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a balanced image clustering method based on hierarchical clustering. Aimed at the high-dimensional feature data of apparel product images, the method uses hierarchical clustering to obtain clusters of balanced size, in which the amount of data contained in any single cluster stays below a set threshold. At retrieval time, the distances between the query data and all cluster centers are computed, the nearest several clusters are selected, and the data inside these clusters are traversed to obtain the final query result. Compared with general cluster-based indexing methods, this avoids traversing an excessive amount of data when the query falls in a large cluster, guaranteeing query performance. At the same time, by traversing multiple clusters, the query result has higher overlap with the result of the sequential scan algorithm (SSA), improving query quality.

Description

A balanced image clustering method based on hierarchical clustering
Technical field
The present invention relates to the technical field of image search, and in particular to a fast approximate k-nearest-neighbor search method for high-dimensional image vectors based on hierarchical clustering.
Background art
In content-based image retrieval (Content-Based Image Retrieval, CBIR), when a user uploads a product image hoping to find products identical or similar to it, the search engine extracts features from the uploaded image and returns, from a database of product image feature vectors, the k images closest to it in the high-dimensional space. The most basic way to find the k nearest image features in a large index database is the sequential scan algorithm (SSA): compute the distance between the query image and every stored image, then sort these distances to obtain the nearest k images. This is an exact k-nearest-neighbor (k-Nearest Neighbor, kNN) retrieval. However, when the feature dimensionality and the number of stored images are large, the query cost of this method is too high to meet engineering requirements.
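For illustration only (this sketch is ours, not part of the patent; the Euclidean distance and the function names are assumptions), the SSA baseline described above amounts to a linear scan followed by a sort:

```python
import numpy as np

def knn_full_scan(query, database, k):
    """Exact k-NN by scanning every stored vector (the SSA baseline).

    query:    (d,) feature vector
    database: (n, d) matrix of stored feature vectors
    Returns the indices of the k nearest vectors by Euclidean (L2) distance.
    """
    dists = np.linalg.norm(database - query, axis=1)  # n distance computations
    return np.argsort(dists)[:k]

# Tiny demonstration: the 2 nearest neighbours of [0, 0] among four points.
db = np.array([[0.0, 1.0], [5.0, 5.0], [0.5, 0.0], [9.0, 9.0]])
nearest = knn_full_scan(np.array([0.0, 0.0]), db, 2)  # indices 2 and 0
```

Each query costs one distance computation per stored image, which is exactly what becomes prohibitive when n and d are large.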
Clustering has therefore been introduced into CBIR. Using clustering, the data are grouped into clusters according to their distribution in the high-dimensional space. At retrieval time, the distances from the query image to all cluster centers are computed first to determine which cluster the query image belongs to; the data within that cluster are then traversed to obtain the nearest k images. Because the amount of data to traverse is reduced, this method retrieves more efficiently than a full sequential scan, but the following problems remain:
1. Query time depends on the size of the cluster to which the query image belongs. If the clusters produced are unbalanced in size, query times become uneven: when the query image falls in a cluster containing many images, the amount of data to traverse and the query time both increase. Because a large cluster represents more "common" image features, the probability that a query falls into it is greater than for a cluster containing little data. Therefore, if some cluster contains far more data than the average, it severely degrades the average response time of the product image search engine.
2. Data traversal is restricted to one cluster; any k-nearest-neighbor data lying in other clusters are lost from the retrieval result, reducing query quality.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing an optimized image clustering method.
This object is achieved by the following technical solution: a balanced image clustering method based on hierarchical clustering, comprising the following steps:
(1) When building the index, first perform an initial clustering of the image feature data.
(2) For each cluster obtained in step (1), perform a cluster-splitting operation. Specifically: check the number of images the cluster contains; if it exceeds the configured upper limit N_top, perform a bisecting clustering inside the cluster; if either resulting sub-cluster still contains more than N_top items, iterate the bisection on it. Record the centers of all clusters whose size does not exceed N_top in a cluster-center file, then organize all image feature data of the category according to the obtained cluster centers.
(3) At retrieval time, compute the distances from the query image's feature data to all cluster centers of its category, sort these distances in ascending order, and take the identifiers of the c clusters with the smallest distances, where c is specified by a system parameter. Then traverse the data inside these c clusters to obtain the final query result.
The beneficial effects of the invention are as follows. Aimed at the high-dimensional feature data of apparel product images, the method uses hierarchical clustering to obtain clusters of balanced size, in which the amount of data contained in any single cluster does not exceed a set threshold. At retrieval time, after computing the distances between the query data and all cluster centers, the nearest several clusters are selected and the data inside them are traversed to obtain the final query result. Compared with general cluster-based indexing methods, this avoids traversing an excessive amount of data when the query falls in a large cluster, guaranteeing query performance. At the same time, by traversing multiple clusters, the query result has higher overlap with the result of SSA, improving query quality.
Brief description of the drawings
Fig. 1 is the index-building flow chart for product image features;
Fig. 2 is the cluster-splitting flow chart for product image features;
Fig. 3 is the storage flow chart for product image features;
Fig. 4 is the retrieval flow chart;
Fig. 5 is a schematic diagram of the "edge effect" in the two-dimensional case.
Detailed description of the invention
The present invention is described in detail below with reference to the accompanying drawings, taking the clustering, index building, retrieval and maintenance of apparel product images as an example; its objects and effects will thereby become apparent.
As shown in Fig. 1, index building in the balanced image clustering method based on hierarchical clustering of the present invention comprises the following steps:
Step 1: extract image features from the product images, converting the image data into feature vectors.
The purpose of feature extraction is to obtain a low-level description of each image. Each feature is represented by a d-dimensional vector.
The present invention uses global image features, i.e. each image corresponds to one high-dimensional feature vector. Every dimension of the feature vector characterizes the image in some respect, such as shape, color, texture or structure. Many image feature extraction methods exist; the MPEG-7 visual description tools are a popular choice. They include the Color Layout Descriptor (CLD) and the Edge Histogram Descriptor (EHD), among others. CLD uses 12 coefficients of an 8×8 DCT and is suitable for a compact, resolution-invariant color representation; EHD uses an 80-bin histogram to describe the content of 16 sub-images.
For ease of storage and computation, we quantize each feature dimension to an integer in the range [0, 255], so that each dimension of the quantized feature vector can be stored in one byte.
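A minimal sketch of this quantization step (ours, not the patent's implementation; it assumes the raw feature values have already been normalized to [0.0, 1.0], which the text does not specify):

```python
import numpy as np

def quantize_feature(vec):
    """Quantize each dimension to an integer in [0, 255] so it fits in one byte.

    Assumes the raw feature values lie in [0.0, 1.0]; the scaling is illustrative.
    """
    return np.clip(np.round(vec * 255.0), 0, 255).astype(np.uint8)

v = np.array([0.0, 0.5, 1.0])
q = quantize_feature(v)   # -> values 0, 128, 255; one byte per dimension
```

After quantization a d-dimensional feature occupies exactly d bytes, which is what makes the per-cluster feature files compact.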
Step 2: perform an initial clustering of the feature data obtained in step 1, with the number of cluster centers set to a small integer. The purpose of the initial clustering is to roughly capture the distribution of the data. The clustering algorithm used is k-means (K-Means).
The k-means algorithm divides the n input data objects into k clusters such that objects within the same cluster have high similarity, while objects in different clusters have low similarity.
The main parameters of the k-means algorithm are the number of clusters k and the distance formula d(x, y).
Since this is only an initial clustering, a small k is used here. We want the average amount of data in each cluster after the initial clustering to be a fixed value N_s. The k of the initial clustering can then be computed from the total amount of data at index-building time, N_total, and N_s:
k = N_total / N_s;
As for the distance formula, common choices include the Manhattan distance (L1), the Euclidean distance (L2), and the Mahalanobis distance. The distance formula need not use all feature dimensions; the more discriminative dimensions of the feature vector can be selected for the distance computation. For a given category, subsequent steps must keep the distance formula consistent; different categories may use different distance formulas.
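The L1 and L2 distances mentioned above can be written down directly (an illustrative sketch of ours, using all dimensions rather than a selected discriminative subset):

```python
import numpy as np

def l1(x, y):
    """Manhattan (L1) distance."""
    return np.abs(x - y).sum()

def l2(x, y):
    """Euclidean (L2) distance."""
    return np.sqrt(((x - y) ** 2).sum())

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
d1 = l1(x, y)  # 7.0
d2 = l2(x, y)  # 5.0
```

Restricting either function to a subset of discriminative dimensions, as the text suggests, only changes which columns of x and y are passed in.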
Step 3: perform cluster splitting on the data obtained in step 2, as shown in Fig. 2. That is, for each cluster obtained in step 2, proceed as follows:
Step 3.1: check the number of images the cluster contains. If it is below the configured threshold N_top, go to step 3.3; otherwise go to step 3.2.
The setting of N_top depends on the computation and IO performance of the server. Because of the cost of high-dimensional distance computation, we ignore the cost of distance sorting and merging and focus on minimizing the number of distance computations per query. Let c be the number of clusters traversed by a single query. We introduce the cluster saturation α, defined as the ratio of the average amount of data per cluster, N_mean, to N_top:
α = N_mean / N_top;
The total number of clusters N_c is therefore:
N_c = N_total / (α × N_top);
At retrieval time, the required distance computations consist of two parts: choosing the c nearest cluster centers, and traversing the interiors of those c clusters. Finding the c nearest cluster centers takes a number of distance computations equal to the number of clusters N_c; traversing the interiors of the c clusters takes c × N_mean distance computations. The total number of distance computations is:
C_dis = N_c + c × N_mean = N_total / (α × N_top) + c × α × N_top.
This expression attains its minimum when N_top = √(N_total / c) / α, at which point:
min(C_dis) = 2 × √(N_total × c).
As an illustration: for a category containing 5,000,000 images, with a single query traversing the 8 nearest cluster centers and α = 0.6, N_top = √(5,000,000 / 8) / 0.6 ≈ 1300.
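The optimization above can be checked numerically (a sketch using the worked example's values; the saturation α = 0.6 is the value used in the memory example later in the description):

```python
import math

# Cost model from the description:
#   C_dis(N_top) = N_total / (alpha * N_top) + c * alpha * N_top
# It is minimised when the two terms are equal, i.e. at
#   N_top = sqrt(N_total / c) / alpha,
# giving min C_dis = 2 * sqrt(N_total * c) (alpha cancels out).
N_total, c, alpha = 5_000_000, 8, 0.6   # values from the worked example
n_top = math.sqrt(N_total / c) / alpha  # ~1318, i.e. roughly the 1300 in the text
min_c_dis = 2 * math.sqrt(N_total * c)  # minimum distance computations per query
```

Note that α affects the optimal N_top but not the minimum achievable cost, since it cancels when the two terms are balanced.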
Step 3.2: perform a 2-center clustering inside this cluster. Since our goal is to ensure that no cluster contains more than N_top items, for every cluster containing more than N_top items we run a k-means clustering with 2 centers inside it. The process is similar to step 2, except that k is forced to 2. The clustering yields 2 new clusters, either of which may still contain more than N_top items; for every new cluster whose size exceeds N_top, repeat step 3.2 until no cluster contains more than N_top items.
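Step 3.2 can be sketched as a recursive bisection (illustrative only: the k-means below is a minimal implementation of ours, and the fallback for a degenerate split is our addition to guarantee termination; the patent only requires k-means with k = 2):

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal k-means for illustration; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def split_cluster(X, n_top):
    """Recursively bisect a cluster until every leaf holds at most n_top points."""
    if len(X) <= n_top:
        return [X]
    labels = kmeans_labels(X, 2)
    if (labels == 0).all() or (labels == 1).all():
        # Degenerate bisection (all points on one side): fall back to an
        # even halving so the recursion always terminates.
        mid = len(X) // 2
        return split_cluster(X[:mid], n_top) + split_cluster(X[mid:], n_top)
    return (split_cluster(X[labels == 0], n_top)
            + split_cluster(X[labels == 1], n_top))

# Two well-separated blobs of 50 points each, with a limit of 60 per cluster.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(10, 0.1, (50, 2))])
leaves = split_cluster(X, 60)
```

Every leaf returned holds at most n_top points, which is exactly the invariant the cluster-center file relies on.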
Step 3.3: for every cluster containing at most N_top items, write the cluster's center coordinates to the cluster-center file. The dimensionality of a cluster center equals that of the feature file. To preserve precision, each dimension of a cluster center is saved as a floating-point number. One cluster-center file may contain the cluster-center data of several categories, and each category contains the coordinates of several cluster-center points.
Step 4: as shown in Fig. 3, after obtaining the cluster centers of a category, we reorganize the original feature data according to those centers so that they can be read at retrieval time. Specifically, for each image feature record in the original feature file, compute its distance to all cluster centers of the category, choose the smallest of these distances, and append the record to the feature file of the cluster corresponding to that smallest distance.
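Step 4's reorganization can be sketched as follows (ours; an in-memory grouping stands in for the per-cluster feature files):

```python
import numpy as np

def assign_to_clusters(features, centers):
    """Group each feature vector with its nearest cluster centre."""
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    return {j: features[labels == j] for j in range(len(centers))}

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
feats = np.array([[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]])
groups = assign_to_clusters(feats, centers)  # cluster 0 gets 2 vectors, cluster 1 gets 1
```

In the real system each `groups[j]` would be appended to cluster j's feature file rather than kept in memory.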
Step 5: save the data of each cluster to a disk file. The image feature files must be stored on disk so that the data of one cluster occupy contiguous disk blocks.
As shown in Fig. 4, retrieval in the improved image clustering method of the present invention comprises the following steps:
Step 6: load the cluster-center data into memory.
Every query uses the cluster-center data of the product's category for distance computation, so the cluster-center data are accessed extremely frequently and must therefore reside in memory.
From the analysis of step 3.1, the average number of images per cluster is N_mean = α × N_top, each cluster corresponding to one cluster center. Each cluster-center coordinate consists of d floating-point numbers and occupies d × 4 bytes. The total memory footprint is therefore:
V_center = N_total / (α × N_top) × d × 4;
For example, for the feature data of a category with N_total = 5,000,000, N_top = 2000, d = 600 and α = 0.6, the memory needed to store the category's cluster centers is about 10 MB.
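The memory estimate can be reproduced directly (using the values from the example above):

```python
# Memory footprint of the cluster centres for one category.
N_total, N_top, d, alpha = 5_000_000, 2_000, 600, 0.6
n_centers = N_total / (alpha * N_top)  # average images per cluster is alpha*N_top
v_center = n_centers * d * 4           # each dimension stored as a 4-byte float
v_center_mb = v_center / 1e6           # ~10 MB, matching the text
```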
Step 7: extract image features from the query image. This operation is identical to step 1 and is not repeated here.
Step 8: for the image feature data obtained in step 7, compute its distances to all cluster centers of the category to which it belongs, sort them, and take the identifiers (CIDs) of the c clusters with the smallest distances.
In the high-dimensional space, the result of a k-nearest-neighbor retrieval forms a hypersphere centered at the query vector.
In the traditional cluster-based index and query scheme, each query traverses only the data in the cluster nearest to the query image. If the query image's feature data are close to the center of their cluster, the hypersphere is completely contained in the cluster's envelope; then even traversing only the single nearest cluster yields the same query result as a full sequential scan. q0 in Fig. 5 illustrates this situation in two dimensions: C0, C1, C2, C3 and C4 are 5 cluster centers, and q0 is a query feature vector. In the two-dimensional case the hypersphere degenerates to a circle centered at q0, and in Fig. 5 the circle centered at q0 lies entirely within the envelope of cluster C0.
If the query image's feature data are far from the center of their cluster, the traditional cluster query scheme loses data. As shown by q1 in Fig. 5, the query feature vector lies at the edge of cluster C1, and the query-result hypersphere intersects the four clusters C1, C2, C3 and C4 simultaneously. If only C1 is traversed, the data in the regions where the hypersphere intersects C2, C3 and C4 are lost. This loss may be called the "edge effect".
In high dimensions the "edge effect" is amplified. Imagine a cluster whose envelope is a hypersphere of radius r; the volume of this hypersphere is:
V(r) = a × r^d,
where a is a constant. Define a point as "near" the center when its distance to the cluster center is within r/2; the volume of this "near" region is then:
V(r/2) = a × (r/2)^d.
The fraction of the whole hypersphere's volume occupied by the "near" region is:
V(r/2) / V(r) = 2^(-d);
Since this ratio decreases exponentially with the dimensionality d, in high-dimensional space it is the norm for query data to fall near a cluster edge, and the "edge effect" cannot be ignored. To recover the data lost to the "edge effect", traversing multiple cluster centers significantly reduces the amount of data lost. In the example of Fig. 5, traversing clusters C1 and C2 together gives a markedly higher recall than traversing C1 alone, while traversing C1, C2, C3 and C4 yields the same query result as a full scan.
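The ratio derived above can be tabulated for a few dimensionalities (a one-line check of the 2^(-d) formula):

```python
# Fraction of a d-dimensional hypersphere's volume lying within r/2 of the
# centre: V(r/2) / V(r) = (1/2) ** d. It vanishes rapidly as d grows, so in
# high dimensions nearly every point is "near the edge".
near_fraction = {d: 0.5 ** d for d in (2, 10, 100, 600)}
# d = 2   -> 0.25
# d = 10  -> ~0.001
# d = 600 -> effectively zero
```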
Traversing multiple neighboring clusters, however, brings larger distance-computation and data-IO overheads; this problem can be overcome by controlling N_top, i.e. by reducing the amount of data each cluster contains.
The number of neighboring clusters traversed, c, can serve as a tunable parameter of the search-engine system. When hardware performance improves or system load is low, a better query result can be obtained by increasing c; when hardware performance drops or system load is high, c can be decreased, sacrificing some query quality for better query efficiency.
Step 9: for each CID obtained in step 8, read the corresponding feature data from disk, compute the distance from the image feature data obtained in step 7 to every record inside the cluster, sort these distances in ascending order, and obtain the image identifiers (Image IDs, IIDs) and distance values of the k nearest images.
This step involves a large amount of distance computation and data reading. The computation cost was discussed in detail in step 3.1; here we focus on the hard-disk IO overhead caused by data reading.
Let D (bytes) be the average amount of data per cluster, t_f the average seek time of the hard disk, and s_read its sequential read speed. Since the data of each cluster are laid out contiguously on disk, the average data-IO time of a single query is:
t = c × (t_f + D / s_read) = c × t_f + c × D / s_read.
As an application example, take c = 8 and D = 1 MB, and compute the data-reading time under different hardware conditions. If the data reside on an ordinary SAS (Serial Attached SCSI) hard disk, typical values are t_f = 3 ms and s_read = 150 MB/s, so the data-reading time of a single query is 8×3 + 8/150×1000 ≈ 77 ms. For a solid-state disk (Solid State Disk, SSD), typical values are t_f = 0.1 ms and s_read = 500 MB/s, and the data-reading time of a single query is 8×0.1 + 8/500×1000 ≈ 17 ms.
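The IO model above can be reproduced as follows (a sketch; the hardware figures are the typical values quoted in the text):

```python
def io_time_ms(c, seek_ms, d_mb, read_mb_s):
    """Average data-IO time per query: c seeks plus c sequential cluster reads."""
    return c * seek_ms + c * d_mb / read_mb_s * 1000

sas_ms = io_time_ms(8, 3.0, 1.0, 150)  # ~77 ms on a SAS disk
ssd_ms = io_time_ms(8, 0.1, 1.0, 500)  # ~17 ms on an SSD
```

The gap is dominated by the seek term c × t_f, which is why the multi-cluster traversal benefits disproportionately from SSDs.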
Because the structure used in the present invention traverses multiple clusters per query, it increases the number of disk seeks; the low seek time of an SSD then substantially reduces the data-IO time. It can be seen that replacing SAS disks with SSDs effectively saves data-IO time and lowers the engine's average response time.
Step 10: merge the c IID sequences obtained in step 9, take the k IIDs with the smallest distances together with their distance values, and return them as the result.
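Step 10's merge can be sketched with a standard k-way merge over the per-cluster result lists, which are already sorted by distance (illustrative; the tuple layout (distance, IID) is our own choice):

```python
import heapq

def merge_topk(result_lists, k):
    """Merge per-cluster (distance, image_id) lists, each sorted ascending,
    and keep the k globally nearest images."""
    return heapq.nsmallest(k, heapq.merge(*result_lists))

a = [(0.1, "img3"), (0.9, "img7")]   # results from cluster A
b = [(0.2, "img5"), (0.4, "img1")]   # results from cluster B
top3 = merge_topk([a, b], 3)
```

Because each per-cluster list is already sorted, `heapq.merge` streams the merge in O(total × log c) without re-sorting everything.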
The query results of the optimized image clustering method of the present invention have high overlap with those of a full scan; query times are shorter and more uniform; and query quality can be traded against performance in a simple way.

Claims (2)

1. A balanced image clustering method based on hierarchical clustering, characterized by comprising the following steps:
(1) when building the index, first perform an initial clustering of the image feature data;
(2) for each cluster obtained in step (1), perform a cluster-splitting operation; specifically: check the number of images the cluster contains; if it exceeds the configured upper limit N_top, perform a bisecting clustering inside the cluster; if either resulting sub-cluster still contains more than N_top items, iterate the bisection on it; record the centers of all clusters whose size does not exceed N_top in a cluster-center file; then organize all image feature data of the category according to the obtained cluster centers;
(3) at retrieval time, compute the distances from the query image's feature data to all cluster centers of its category, sort these distances in ascending order, and take the identifiers of the c clusters with the smallest distances, c being specified by a system parameter; then traverse the data inside these c clusters to obtain the final query result.
2. The clustering method according to claim 1, characterized in that said clustering is performed in 2 steps, and both steps use the same image features and distance formula.
CN201210545637.XA 2012-12-14 2012-12-14 A kind of equilibrium image clustering method based on hierarchical cluster Expired - Fee Related CN103049514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210545637.XA CN103049514B (en) 2012-12-14 2012-12-14 A kind of equilibrium image clustering method based on hierarchical cluster


Publications (2)

Publication Number Publication Date
CN103049514A CN103049514A (en) 2013-04-17
CN103049514B true CN103049514B (en) 2016-08-10

Family

ID=48062155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210545637.XA Expired - Fee Related CN103049514B (en) 2012-12-14 2012-12-14 A kind of equilibrium image clustering method based on hierarchical cluster

Country Status (1)

Country Link
CN (1) CN103049514B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744934A (en) * 2013-12-30 2014-04-23 南京大学 Distributed index method based on LSH (Locality Sensitive Hashing)
CN107636678B (en) * 2015-06-29 2021-12-14 北京市商汤科技开发有限公司 Method and apparatus for predicting attributes of image samples
CN106874268A (en) * 2015-12-10 2017-06-20 富士通株式会社 Image search method and image retrieval apparatus
CN106548196A (en) * 2016-10-20 2017-03-29 中国科学院深圳先进技术研究院 A kind of random forest sampling approach and device for non-equilibrium data
CN106851437A (en) * 2017-01-17 2017-06-13 南通同洲电子有限责任公司 A kind of method for extracting video frequency abstract
CN108171252A (en) * 2017-11-16 2018-06-15 柳州健鱼科技有限公司 A kind of balanced image clustering method based on hierarchical cluster
CN108304849A (en) * 2018-01-15 2018-07-20 浙江理工大学 A kind of bird plumage color character extracting method
CN110874417B (en) 2018-09-04 2024-04-16 华为技术有限公司 Data retrieval method and device
CN110909197A (en) * 2019-11-04 2020-03-24 深圳力维智联技术有限公司 High-dimensional feature processing method and device
CN110861089B (en) * 2019-11-29 2020-11-06 北京理工大学 Task balanced distribution cooperative work control method for multi-robot system
CN113297331B (en) * 2020-09-27 2022-09-09 阿里云计算有限公司 Data storage method and device and data query method and device
CN113095397A (en) * 2021-04-03 2021-07-09 国家计算机网络与信息安全管理中心 Image data compression method based on hierarchical clustering method
CN113743533B (en) * 2021-09-17 2023-08-01 重庆紫光华山智安科技有限公司 Picture clustering method and device and storage medium
CN113792172B (en) * 2021-11-15 2022-02-11 西安热工研究院有限公司 Image retrieval method, system, device and storage medium based on triangle inequality

Citations (2)

Publication number Priority date Publication date Assignee Title
CN101571855A (en) * 2008-04-30 2009-11-04 福特安(苏州)图像管理有限公司 Image searching and classifying method
CN102800120A (en) * 2012-06-15 2012-11-28 北京理工大学 Emergency disaster situation display system and method based on multiple intelligent bodies

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US5983224A (en) * 1997-10-31 1999-11-09 Hitachi America, Ltd. Method and apparatus for reducing the computational requirements of K-means data clustering
WO2009060722A1 (en) * 2007-11-06 2009-05-14 National University Corporation Hokkaido University Similar image retrieving device
US8488873B2 (en) * 2009-10-07 2013-07-16 Apple Inc. Method of computing global-to-local metrics for recognition


Also Published As

Publication number Publication date
CN103049514A (en) 2013-04-17

Similar Documents

Publication Publication Date Title
CN103049514B (en) A kind of equilibrium image clustering method based on hierarchical cluster
Arora et al. Hd-index: Pushing the scalability-accuracy boundary for approximate knn search in high-dimensional spaces
CN104199827B (en) The high dimensional indexing method of large scale multimedia data based on local sensitivity Hash
Scott et al. Entropy-balanced bitmap tree for shape-based object retrieval from large-scale satellite imagery databases
CN102521366B (en) Image retrieval method integrating classification with hash partitioning and image retrieval system utilizing same
Schindler et al. City-scale location recognition
Bailo et al. Efficient adaptive non-maximal suppression algorithms for homogeneous spatial keypoint distribution
Shashank et al. Private content based image retrieval
CN107220285B (en) Space-time index construction method for massive trajectory point data
CN110070121B (en) Rapid approximate K nearest neighbor method based on tree strategy and balanced K mean clustering
CN107256262A (en) A kind of image search method based on object detection
CN104834693A (en) Depth-search-based visual image searching method and system thereof
WO2013129580A1 (en) Approximate nearest neighbor search device, approximate nearest neighbor search method, and program
CN103116610A (en) Vector space big data storage method based on HBase
Cha et al. The GC-tree: a high-dimensional index structure for similarity search in image databases
CN108388902B (en) Composite 3D descriptor construction method combining global framework point and local SHOT characteristics
CN104036012A (en) Dictionary learning method, visual word bag characteristic extracting method and retrieval system
CN102436491A (en) System and method used for searching huge amount of pictures and based on BigBase
CN106933511B (en) Space data storage organization method and system considering load balance and disk efficiency
CN112395288B (en) R-tree index merging and updating method, device and medium based on Hilbert curve
CN103207889A (en) Method for retrieving massive face images based on Hadoop
Liu et al. TOP-SIFT: the selected SIFT descriptor based on dictionary learning
CN101266607A (en) High dimension data index method based on maximum clearance space mappings
CN110059148A (en) The accurate searching method that spatial key applied to electronic map is inquired
Yuan et al. A novel index structure for large scale image descriptor search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160810

Termination date: 20201214

CF01 Termination of patent right due to non-payment of annual fee