CN107633065B - Identification method based on hand-drawn sketch - Google Patents

Identification method based on hand-drawn sketch

Info

Publication number
CN107633065B
CN107633065B (application CN201710860271.8A)
Authority
CN
China
Prior art keywords
sketch
extracting
square block
local
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710860271.8A
Other languages
Chinese (zh)
Other versions
CN107633065A (en)
Inventor
聂为之
邓宗慧
苏育挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201710860271.8A priority Critical patent/CN107633065B/en
Publication of CN107633065A publication Critical patent/CN107633065A/en
Application granted granted Critical
Publication of CN107633065B publication Critical patent/CN107633065B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an identification method based on a hand-drawn sketch, which comprises the following steps: each sketch in the initial category set is adjusted to a preset size, a plurality of interest points are uniformly extracted, and a square block is extracted around each interest point; pixel gradients are extracted in each square block and quantized by direction into 4 direction bins to serve as the local features of each square block; a visual dictionary is constructed with a k-means clustering method, and each sketch is represented by a 500-dimensional vector; clustering analysis and dimension reduction are performed on the feature vectors of each cluster to obtain a classified database; and the query sketch is matched against the classified database to obtain the final retrieval result. The invention uses the sketch as a visual input modality, designs robust local visual features for sketches, and improves identification accuracy.

Description

Identification method based on hand-drawn sketch
Technical Field
The invention relates to the field of image retrieval, in particular to an identification method based on a hand-drawn sketch.
Background
In contrast to content-based retrieval [1], the user input in sketch-based retrieval is a simple binary sketch. Content-based retrieval methods do not learn from sketches and therefore generally cannot achieve an understanding of sketch semantics; the retrieval result is based purely on the geometric similarity between the sketch and the image content [2][3].
An image composition system based on sketch retrieval allows a user to create novel realistic images. A sketch-only composition system must rely on large amounts of data to counteract the geometric dissimilarity between sketch and image content [4], or require the user to add text labels to the sketch [5]. A system that uses template matching of face parts can help the user obtain the correct scale when drawing a portrait [6].
Based on this idea, real-time feedback-assisted sketching has been generalized to tens of object classes: geometrically similar objects are found with fast nearest-neighbor matching, and the edges of those objects are blended into the user's rough strokes as a shading guideline. As with other sketch-based retrieval systems, the user must draw edges faithfully to enable retrieval over a large number of object categories.
Sketch feature extraction based on sketch shape representation can extract either global or local features. Global features represent the distribution and structure of the image as a whole; local features are extracted at points that appear stably and are well distinguishable. Dalal [7], for example, extracts feature points distributed over the image, computes the gradient directions around each feature point to obtain a gradient direction histogram, and assembles the resulting directions into the HOG (histogram of oriented gradients) feature.
Disclosure of Invention
The invention provides an identification method based on a hand-drawn sketch, which uses the sketch as a visual input modality, designs robust local visual features for sketches, and improves identification accuracy, as described in detail below:
a recognition method based on a hand-drawn sketch comprises the following steps:
adjusting each sketch in the initial category set to a preset size, uniformly extracting a plurality of interest points, and extracting a square block around each interest point;
extracting pixel gradients in each square block and quantizing them by direction into 4 direction bins to serve as the local features of each square block; constructing a visual dictionary with a k-means clustering method, each sketch being represented by a 500-dimensional vector;
performing clustering analysis and dimension reduction on the feature vector of each cluster to obtain a classified database;
and matching the query sketch with the classified database to obtain a final retrieval result.
Wherein, the initial category set specifically comprises:
a data set of 20,000 sketches is obtained, its keywords are made to cover the common categories through preprocessing, and the preprocessed sketch data set serves as the initial category set.
Wherein the pretreatment specifically comprises the following steps:
a data set of 20,000 sketches is constructed as the basis for learning, evaluation and application; the 1,000 most common labels are extracted from a label library, and, taking these 1,000 labels as the reference, duplicate labels and labels that do not conform to the rules are removed manually;
and the keywords of the pruned sketch data set are supplemented with keywords from a preset benchmark and a preset data set, the preprocessed sketch data set being taken as the initial category set.
Wherein the preset benchmark and the preset data set are specifically: the Princeton Shape Benchmark and the Caltech-256 dataset.
The visual dictionary is constructed by using a k-means clustering method, and each sketch is expressed by a vector with 500 dimensions, specifically:
calculating an average feature vector:
a_i = (1/|C_i|) Σ_{h_j∈C_i} h_j
finding the local feature closest to the average feature vector a_i as the final representative r_i of the cluster; the sketch corresponding to r_i is then the symbolic sketch of the class:
r_i = argmin_{h_j∈C_i} ‖h_j − a_i‖₂
where C_i denotes the set of all local features belonging to cluster i, and h_j denotes the jth local feature.
The technical scheme provided by the invention has the beneficial effects that:
1. the invention can perform real-time sketch recognition and output results synchronously while partial sketch strokes are still being input, so the technique has wide application in real life;
2. the database covers many categories and therefore a larger range, making the retrieval result more accurate;
3. the method adds the local spatial distribution information of the sketch lines into the feature vector, so the feature vector contains more information and the retrieval result is more accurate and faster.
Drawings
FIG. 1 is a flow chart of a hand-drawn sketch-based recognition method;
FIG. 2 is a schematic diagram of automatically computed representative sketch samples for each category;
FIG. 3 is a schematic of accuracy curves for four methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A method for identification based on hand-drawn sketches, see fig. 1, the method comprising the steps of:
101: adjusting each sketch in the initial category set to a preset size, uniformly extracting a plurality of interest points, and extracting a square block around each interest point;
102: extracting pixel gradients in each square block and quantizing them by direction into 4 direction bins to serve as the local features of each square block; constructing a visual dictionary with a k-means clustering method, each sketch being represented by a 500-dimensional vector;
103: performing clustering analysis and dimension reduction on the feature vector of each cluster to obtain a classified database;
104: and matching the query sketch with the classified database to obtain a final retrieval result.
The initial category set in step 101 is specifically:
a data set of 20,000 sketches is obtained, its keywords are made to cover the common categories through preprocessing, and the preprocessed sketch data set serves as the initial category set.
Wherein, the pretreatment specifically comprises the following steps:
a data set of 20,000 sketches is constructed as the basis for learning, evaluation and application; the 1,000 most common labels are extracted from a label library, and, taking these 1,000 labels as the reference, duplicate labels and labels that do not conform to the rules are removed manually;
and the keywords of the pruned sketch data set are supplemented with keywords from a preset benchmark and a preset data set, the preprocessed sketch data set being taken as the initial category set.
The preset benchmark and the preset data set are specifically: the Princeton Shape Benchmark and the Caltech-256 dataset.
In summary, through steps 101 to 104 the embodiment of the present invention uses the sketch as the visual input modality and designs robust local visual features for it, improving recognition accuracy.
Example 2
The scheme of example 1 is further described below with reference to specific calculation formulas and examples, which are described in detail below:
201: a data set of 20,000 sketches is acquired, its keywords are made to cover the most common categories through preprocessing, and the preprocessed data set is taken as the initial category set;
wherein, the step 201 specifically includes:
1) first, a data set of 20,000 sketches is constructed as the basis for learning, evaluation and application;
2) the 1,000 most common tags are extracted from LabelMe (a tag library), and duplicate tags and tags that do not conform to the rules are removed manually with these 1,000 tags as the reference (the rules are set according to the needs of the practical application and are not limited by this embodiment; example tags: chair, cup);
3) the keywords of the pruned sketch data set are supplemented with keywords from a preset benchmark and a preset data set, and the preprocessed sketch data set is taken as the initial category set.
Wherein the preset benchmark and the preset data set may be the Princeton Shape Benchmark (PSB) and the Caltech-256 [8] dataset.
In a specific implementation, the number of the data sets, the number of the tags, and the number of the keywords are not limited, but are set according to the needs of the actual application.
202: adjusting each sketch in the initial category set to a preset size, uniformly extracting a plurality of interest points (28 × 28 = 784 points in this embodiment), and extracting a square block around each interest point;
in particular, a sketch S in the initial category set is simply regarded as a bitmap, S ∈ R^{m×n}, where m denotes the number of rows and n the number of columns of the pixel matrix. Ideally the features are invariant (e.g., to scale and translation) while still discriminating between classes. The sketch bitmap is therefore mapped to a lower-dimensional feature space by a mapping function f: R^{m×n} → R^d, d ≪ m×n; f(S) is also called a feature vector or descriptor. Ideally, the mapping f retains the information necessary to distinguish sketches of different classes.
The most straightforward way to define a feature space for a sketch is to use a (possibly reduced) bitmap representation directly, which however does not work well. In contrast, embodiments of the present invention employ computer vision methods and use local feature vectors that encode the distribution of image properties to represent a sketch.
Specifically, the distribution of line directions is encoded in local areas of the sketch. Compared with directly encoding pixel values, the pixel-merging process during histogram construction makes the representation better invariant to slight shifts in position and direction.
Each sketch is first rescaled isotropically to achieve global scale and translation invariance: the longest edge of its bounding box is fixed in length and scaled so that every sketch has size 256 × 256. In a specific implementation the size of the sketch is not limited by this embodiment and is set according to the needs of the practical application.
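The isotropic rescaling step can be sketched as follows (a minimal numpy sketch; the function and parameter names are illustrative, and nearest-neighbour resampling stands in for whatever interpolation an implementation would use):

```python
import numpy as np

def rescale_sketch(bitmap, target=256):
    """Isotropically rescale a binary sketch bitmap: the longest edge of its
    bounding box is scaled to `target`, then the result is pasted centered on
    a target x target canvas (illustrative helper, not the patent's code)."""
    ys, xs = np.nonzero(bitmap)                     # ink pixels (strokes are non-zero)
    if len(ys) == 0:
        return np.zeros((target, target), dtype=bitmap.dtype)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    crop = bitmap[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    scale = target / max(h, w)                      # longest edge -> target: scale invariance
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = crop[np.ix_(rows, cols)]              # nearest-neighbour resampling
    canvas = np.zeros((target, target), dtype=bitmap.dtype)
    y0, x0 = (target - nh) // 2, (target - nw) // 2 # centering: translation invariance
    canvas[y0:y0 + nh, x0:x0 + nw] = resized
    return canvas
```

After this step every sketch occupies a 256 × 256 canvas with its longest bounding-box edge spanning the full width or height.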
BOF (bag-of-features) [9] serves as an intermediate step of the mapping function f. Encoding local direction estimates yields a large number of local features that contain no spatial position information of the sketch, and these local features are used to represent the sketch. Let
g_uv = ∇S(u, v)
denote the gradient of S at coordinates (u, v), with direction O_uv ∈ [0, π).
The gradient g_uv is computed with Gaussian derivatives to obtain a reliable direction estimate. The magnitude ‖g‖ is divided coarsely among r directions according to the angle O_uv, with linear interpolation into neighboring bins to avoid sharp energy changes at bin boundaries. This yields r direction response images o that encode the direction energy at the given discrete direction values; comparative experiments show that r = 4 works best.
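The quantization of gradient magnitude into r = 4 direction bins with linear interpolation can be sketched as follows (plain central differences replace the Gaussian derivatives of the text for brevity; all names are illustrative):

```python
import numpy as np

def orientation_response(S, r=4):
    """Split the gradient magnitude |g| of a sketch bitmap S into r direction
    response images. Each pixel's direction O_uv in [0, pi) is soft-assigned
    to its two nearest direction bins by linear interpolation, so energy
    changes smoothly across bin boundaries."""
    gy, gx = np.gradient(S.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)     # direction in [0, pi)
    pos = ang / (np.pi / r)                     # continuous bin position in [0, r)
    lo = np.floor(pos).astype(int) % r
    hi = (lo + 1) % r                           # bins are cyclic over pi
    w_hi = pos - np.floor(pos)
    o = np.zeros((r,) + S.shape)
    for b in range(r):
        o[b] += mag * (1.0 - w_hi) * (lo == b)  # share of energy to the lower bin
        o[b] += mag * w_hi * (hi == b)          # share of energy to the upper bin
    return o
```

By construction the r response images conserve the gradient energy: summing them over the bin axis recovers the magnitude image exactly.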
For each response image, local features l_j are extracted by dividing the underlying direction response values into several small local histograms using 4 × 4 spatial cells. The local features l_j are assembled into a column vector d = [l_1, …, l_r]^T representing the local feature of a block region, normalized so that ‖d‖₂ = 1. This representation resembles SIFT [10], but only direction is stored here.
At the 784 uniformly extracted interest points, a square block whose side is only 12.5% of the size of the sketch S is extracted around each interest point, and a local feature is extracted within each block; the set of these local features is called the bag of features (BOF).
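Steps 202 and 203 together can be sketched as the following patch-descriptor extraction, operating on direction response images o as described above (grid size, 4 × 4 cells and the 12.5% patch side follow the text; everything else is an illustrative assumption):

```python
import numpy as np

def local_descriptors(o, grid=28, patch_frac=0.125):
    """Extract one local feature per interest point on a grid x grid lattice.
    Each feature tiles a square patch (patch_frac of the image side) into
    4 x 4 spatial cells, sums the direction response energy per cell, and is
    L2-normalised, giving a (4*4*r)-dimensional descriptor.
    Note: the patch side (2*half) must be divisible by 4."""
    r, H, W = o.shape
    half = int(H * patch_frac) // 2
    centers = np.linspace(half, H - half - 1, grid).astype(int)
    feats = []
    for cy in centers:
        for cx in centers:
            patch = o[:, cy - half:cy + half, cx - half:cx + half]
            # tile the (2*half)^2 patch into 4 x 4 cells and sum energy per cell
            cells = patch.reshape(r, 4, 2 * half // 4, 4, 2 * half // 4).sum(axis=(2, 4))
            d = cells.reshape(-1)
            n = np.linalg.norm(d)
            feats.append(d / n if n > 0 else d)
    return np.array(feats)          # shape (grid*grid, 16*r)
```

With the text's parameters (256 × 256 sketch, r = 4, 28 × 28 grid) this yields 784 unit-norm 64-dimensional descriptors per sketch.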
203: extracting pixel gradients in each square block and quantizing them by direction into 4 direction bins to serve as the local features of each square block;
Since each square block is relatively large, the regions of the local features overlap significantly, and each pixel contributes to about 100 different histograms (each histogram contributing to a local feature). Extraction of the local features is accelerated by observing that the total energy accumulated in each histogram (using linear interpolation) is proportional to the convolution of the local image area with a planar tent function of twice the bandwidth (a term well known to those skilled in the art and not described in detail in this embodiment).
Therefore, before creating the histograms, the response images o_1, …, o_r are first convolved with the corresponding function (accelerated using the FFT); the task of filling each histogram then becomes reading off the central response of each histogram, which enables efficient extraction of a large number of local histograms.
204: constructing a visual dictionary with a k-means clustering method, each sketch finally being represented by a 500-dimensional vector;
The sketch data set is randomly sampled to obtain n local features d as a training set. A visual dictionary is constructed with the k-means clustering method, dividing the local features into k mutually disjoint clusters; the visual dictionary v is defined as
v = argmin_{C_1,…,C_k} Σ_{i=1}^{k} Σ_{d_j∈C_i} ‖d_j − c_i‖²
with cluster centers
c_i = (1/|C_i|) Σ_{d_j∈C_i} d_j
where d_j is the jth local feature and C_i is the ith of the mutually disjoint clusters.
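A minimal Lloyd-iteration sketch of the dictionary construction (k = 500 in the text; a tiny k is used below, and the initialization and stopping rule are illustrative assumptions, not the patent's exact procedure):

```python
import numpy as np

def kmeans_dictionary(D, k=500, iters=20, seed=0):
    """Build the visual dictionary v by k-means: partition the n local
    features in D (an n x d array) into k disjoint clusters C_i and return
    the centroids c_i as visual words, plus the final assignment labels."""
    rng = np.random.default_rng(seed)
    centroids = D[rng.choice(len(D), size=k, replace=False)]  # random init
    for _ in range(iters):
        # assign every feature d_j to its nearest centroid (cluster C_i)
        dists = ((D[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # recompute each centroid c_i as the mean of its members
        for i in range(k):
            members = D[labels == i]
            if len(members):
                centroids[i] = members.mean(0)
    return centroids, labels
```

In the text the training set is the sampled local features and k = 500, so every sketch can later be encoded against the 500 visual words.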
Finally, the sketch is represented by a frequency histogram h of the visual vocabulary. Instead of "hard" assignment of local features, the histogram is constructed with "soft" kernel-codebook coding: a feature vector may be almost equally close to multiple visual words, and hard assignment cannot capture this information. The kernelized distances between a local feature and all visual words are therefore encoded, with weights that decay rapidly for distant visual words. The histogram h is defined as:
h = Σ_{d_i∈D} q(d_i)
where D is the set of local features d_i.
q(d_i) is the vector-valued quantization function that quantizes the local feature d_i with respect to the visual vocabulary v:
q(d_i) = [K(d_i, v_1), …, K(d_i, v_k)]^T
The distance between samples is measured with a Gaussian kernel:
K(x, y) = exp(−‖x − y‖² / (2σ²))
where σ is the variance.
The histogram h is divided by the number of samples to obtain the final representation. This representation is thus insensitive to the total amount of local features in the sketch, but sensitive to local structure and line direction. σ = 0.1 was used in the experiments of this embodiment.
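The soft kernel-codebook coding just described can be sketched as (σ = 0.1 as in the experiments; names are illustrative):

```python
import numpy as np

def soft_histogram(D, v, sigma=0.1):
    """Kernel-codebook ("soft") coding: each local feature d_i in D votes for
    every visual word in v with Gaussian-kernel weight
    K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the votes q(d_i) are summed
    and divided by the number of samples, so the result is insensitive to
    the total amount of local features in the sketch."""
    d2 = ((D[:, None, :] - v[None, :, :]) ** 2).sum(-1)   # squared distances to all words
    q = np.exp(-d2 / (2.0 * sigma ** 2))                  # rows are q(d_i)
    return q.sum(0) / len(D)
```

A feature lying exactly on a visual word contributes weight 1 to that word, while its contributions to distant words decay essentially to zero, as the text describes.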
The sparsity of the feature space (only 20,000 points in a high-dimensional space) makes clustering difficult. k-means clustering is efficient but does not provide meaningful clusters here, because it uses a rigid, simple distance metric and requires the number of clusters to be fixed beforehand. Instead, variable-bandwidth mean-shift clustering is used, and the underlying nearest-neighbor search is accelerated with a locality-sensitive hashing algorithm. Adaptive mean shift estimates the density function of the histograms h in feature space as
f(h) = (1/n) Σ_{i=1}^{n} (1/b_i^d) K((h − h_i)/b_i)
where b_i is the bandwidth associated with each point, K is the Gaussian kernel, and h_i is a histogram. The maxima of f are computed with an iterative gradient-ascent method, and features whose mean shifts converge to the same maximum are grouped into one cluster. A representative sketch is extracted from each category; the average number of clusters obtained over the 250 classes is 1.39, i.e., each class can be represented by 1 or 2 sketches.
Let C_i denote the set of all local features belonging to cluster i. To identify the symbolic sketch representing each cluster, the following strategy is adopted:
a) an average feature vector is computed from all features in the cluster:
a_i = (1/|C_i|) Σ_{h_j∈C_i} h_j
b) the local feature closest to the average feature vector a_i is found as the final representative r_i of the cluster; the sketch corresponding to r_i is the symbolic sketch of the class:
r_i = argmin_{h_j∈C_i} ‖h_j − a_i‖₂
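Strategies a) and b) can be sketched together as:

```python
import numpy as np

def symbolic_representative(C):
    """Average the feature vectors h_j of a cluster C_i (rows of C), then
    return the member closest to that average as the representative r_i;
    the sketch corresponding to r_i is the symbolic sketch of the class."""
    a = C.mean(axis=0)                                 # a_i = (1/|C_i|) sum h_j
    idx = np.argmin(np.linalg.norm(C - a, axis=1))     # nearest cluster member
    return C[idx], idx
```

Returning an actual cluster member (rather than the mean itself) guarantees that the representative corresponds to a real sketch in the data set.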
205: performing unsupervised analysis, intra-class classification analysis and dimension reduction on the feature vectors of each class with two methods to obtain a classified database;
To visualize the sketch distribution in feature space, the feature vectors of each category are reduced to two dimensions so that their distribution can be shown in a two-dimensional space. Dimensionality reduction can be achieved with methods such as principal component analysis (PCA) or multidimensional scaling (MDS), but both methods tend to crowd many data points together when mapping into two-dimensional space, resulting in a useless layout.
Therefore the t-distributed stochastic neighbor embedding algorithm (t-SNE) is adopted. This dimensionality-reduction technique addresses the crowding problem by computing a mapping from distances in the high-dimensional space to distances in the low-dimensional space such that the smaller pairwise distances in the high-dimensional space (which would cause crowding) are increased when mapped to two-dimensional space, while the overall global distances are still preserved.
Nearest-neighbor classification: for a given histogram h, its k nearest neighbors in feature space are found, and h is assigned to the category to which most of these k neighbors belong.
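The nearest-neighbor rule can be sketched as (k = 3 here for illustration; the experiments search k ∈ {1, …, 5}):

```python
import numpy as np

def knn_classify(h, train_H, train_y, k=3):
    """Nearest-neighbour classification: find the k training histograms
    closest to h (Euclidean distance) and return the majority category
    among them."""
    d = np.linalg.norm(train_H - h, axis=1)
    nearest = train_y[np.argsort(d)[:k]]               # labels of k nearest
    cats, counts = np.unique(nearest, return_counts=True)
    return cats[counts.argmax()]                       # majority vote
```

The distance metric is a model parameter in the experiments (l1, l2, cosine, correlation); the l2 norm is used here for brevity.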
SVM [11] classification: one classifier function is learned for each class:
f(h) = Σ_j α_j K(s_j, h) + b
where the support vectors s_j, the weights α_j and the bias b are determined during the SVM training phase, and K(·,·) is a Gaussian kernel measuring the similarity between a support vector and the histogram h. Given the training data set, the sketches in category i are used as positive examples and the remaining sketches as negative examples.
Because SVMs are binary classifiers, one classifier is trained per class, 250 in total. To determine cat(h), the sketch is assigned to the class yielding the largest classification response:
cat(h) = argmax_i f_i(h)
206: and matching the query sketch with the classified database to obtain a final retrieval result.
In summary, through steps 201 to 206 the embodiment of the present invention uses the sketch as the visual input modality and designs robust local visual features for it, improving recognition accuracy.
Example 3
The feasibility of the schemes of embodiments 1 and 2 is verified below with specific experimental data and mathematical formulas:
experimental reports
1. Database
A database of 20,000 hand-drawn sketches serves as the basis for learning, evaluation and application. The 1,000 most common labels were extracted from LabelMe, duplicate labels and labels not conforming to the rules were removed manually, and the result was taken as the initial category set. The number of categories was then increased with the keywords of the PSB and Caltech-256 datasets. Finally the keywords were supplemented manually, yielding 250 keywords that cover most common categories.
Using the 250 complete categories, each sketch is converted to a grayscale bitmap and resized uniformly to 256 × 256. 784 local features are extracted per sketch on a 28 × 28 grid, one feature per grid cell. Creating a visual dictionary requires a very large number of examples: the visual dictionary V is trained from 20,000 × 784 local features, and the number of visual words is set to 500, which gives a good classification rate at a moderate quantization cost.
A cross-validation method is adopted to determine the optimal parameters for sketch identification. 3-fold cross-validation is used: the data set is divided into 3 parts, 2 of which are used for training and the remaining one for testing. Stratified sampling ensures that each subset contains (roughly) the same number of instances per class.
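The stratified 3-fold split can be sketched as:

```python
import numpy as np

def stratified_3fold(labels, seed=0):
    """3-fold cross-validation split with stratified sampling: within every
    class the sample indices are shuffled and dealt round-robin into 3 folds,
    so each fold holds roughly the same number of instances per class."""
    rng = np.random.default_rng(seed)
    folds = [[], [], []]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % 3].append(int(i))
    return folds
```

Each of the three folds is then used once as the test set while the other two serve as the training set, and the three accuracies are averaged.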
2. Evaluation criteria
Accuracy is used as the criterion, i.e., the ratio of correctly classified samples to the total number of samples.
3. Experimental results
The classification performance of the k-nearest-neighbor (kNN) and SVM classifiers is quite sensitive to the model parameters chosen by the user. Therefore the best-performing model parameters for each case are determined first, via a standard grid search over the 2-D space of model parameters. For kNN, k ∈ {1, …, 5} (the number of nearest neighbors used for classification) and the distance metric d ∈ {l1, l2, cosine, correlation} are used. For the SVM, γ ∈ {10, …, 100} (the Gaussian kernel parameter) and C ∈ {1, …, 100} (the regularization constant) are searched with logarithmic spacing. For the SVM the search is accelerated using a 1/4 subsample of the entire data set; for kNN the complete data set is used. Table 1 lists the resulting optimal model parameters:
Table 1: optimal kNN/SVM model parameters under soft/hard quantization (the table is rendered as an image in the original document)
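The standard grid search over the 2-D model-parameter space can be sketched as follows (the `evaluate` callback stands in for a cross-validation accuracy measurement; parameter names are illustrative):

```python
import itertools

import numpy as np

def grid_search(evaluate, param_grid):
    """Exhaustive grid search: call `evaluate` (e.g. a cross-validation
    accuracy function) for every combination of parameter values in
    `param_grid` (an ordered dict-like mapping name -> list of values) and
    return the best-scoring combination with its score."""
    best, best_score = None, -np.inf
    for combo in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        score = evaluate(**params)
        if score > best_score:
            best, best_score = params, score
    return best, best_score
```

For kNN the grid is {k} × {distance metric}; for the SVM it is {γ} × {C}, exactly the 2-D searches described above.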
To determine how the amount of training data affects classification accuracy, the entire data set is divided into progressively larger subsets (8, 16, …, 80 sketches per class). For each subset its average 3-fold cross-validation accuracy is measured. The classification accuracy indeed depends on the number of training instances, and the performance gain shrinks as the subset approaches the full data set size: this indicates that the data set is large enough to capture most of the variance in each class. The experimental results show that the SVM method outperforms the other methods.
Reference to the literature
[1] DATTA, R., JOSHI, D., LI, J., AND WANG, J. 2008. Image retrieval: ideas, influences, and trends of the new age. ACM Computing Surveys 40, 2, 1–60.
[2] CHALECHALE, A., NAGHDY, G., AND MERTINS, A. 2005. Sketch-based image matching using angular partitioning. IEEE Trans. Systems, Man and Cybernetics, Part A 35, 1, 28–41.
[3] EITZ, M., HILDEBRAND, K., BOUBEKEUR, T., AND ALEXA, M. 2011. Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Trans. Visualization and Computer Graphics 17, 11, 1624–1636.
[4] EITZ, M., RICHTER, R., HILDEBRAND, K., BOUBEKEUR, T., AND ALEXA, M. 2011. Photosketcher: interactive sketch-based image synthesis. IEEE Computer Graphics and Applications 31, 6, 56–66.
[5] CHEN, T., CHENG, M., TAN, P., SHAMIR, A., AND HU, S. 2009. Sketch2Photo: internet image montage. ACM Trans. Graph. (Proc. SIGGRAPH ASIA) 28, 5, 124:1–124:10.
[6] DIXON, D., PRASAD, M., AND HAMMOND, T. 2010. iCanDraw?: using sketch recognition and corrective feedback to assist a user in drawing human faces. In Proc. Int'l. Conf. on Human Factors in Computing Systems, 897–906.
[7] DALAL, N., AND TRIGGS, B. 2005. Histograms of oriented gradients for human detection. In IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 886–893.
[8] GRIFFIN, G., HOLUB, A., AND PERONA, P. 2007. Caltech-256 object category dataset. Tech. rep., California Institute of Technology.
[9] SIVIC, J., AND ZISSERMAN, A. 2003. Video Google: a text retrieval approach to object matching in videos. In IEEE Int'l. Conf. Computer Vision, 1470–1477.
[10] LOWE, D. G. 2004. Distinctive image features from scale-invariant keypoints. Int'l. Journal of Computer Vision 60, 2, 91–110.
[11] SCHOLKOPF, B., AND SMOLA, A. 2002. Learning with kernels. MIT Press.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A recognition method based on a hand-drawn sketch is characterized by comprising the following steps:
adjusting each sketch in the initial category set to a preset size, uniformly extracting a plurality of interest points, and extracting a square block around each interest point;
the initial category set specifically includes:
a data set of 20,000 sketches is obtained, its keywords are made to cover the common categories through preprocessing, and the preprocessed sketch data set serves as the initial category set;
the pretreatment specifically comprises the following steps:
a data set of 20,000 sketches is constructed as the basis for learning, evaluation and application; the 1,000 most common labels are extracted from a label library, and, taking these 1,000 labels as the reference, duplicate labels and labels that do not conform to the rules are removed manually;
the keywords of the pruned sketch data set are supplemented with keywords from a preset benchmark and a preset data set, and the preprocessed sketch data set is taken as the initial category set;
extracting pixel gradients in each square block and quantizing them by direction into 4 direction bins to serve as the local features of each square block; constructing a visual dictionary with a k-means clustering method, each sketch being represented by a 500-dimensional vector;
representing the sketch by a frequency histogram of the visual vocabulary, the representation being insensitive to the total amount of local features in the sketch but sensitive to local structure and line direction;
performing clustering analysis and dimension reduction on the feature vectors of each category to obtain a classified database;
and matching the query sketch with the classified database to obtain a final retrieval result.
2. The identification method based on a hand-drawn sketch according to claim 1, wherein the preset benchmark and the preset data set are specifically: the Princeton Shape Benchmark and the Caltech-256 dataset.
3. The identification method based on a hand-drawn sketch according to claim 1, wherein the visual dictionary is constructed with a k-means clustering method and each sketch is represented by a 500-dimensional vector, specifically:
calculating an average feature vector:
a_i = (1/|C_i|) Σ_{h_j∈C_i} h_j
the local feature closest to the average feature vector a_i is found as the final representative r_i of the cluster; the sketch corresponding to r_i is the symbolic sketch of the class:
r_i = argmin_{h_j∈C_i} ‖h_j − a_i‖₂
where C_i denotes the set of all local features belonging to cluster i, and h_j denotes the jth local feature.
CN201710860271.8A 2017-09-21 2017-09-21 Identification method based on hand-drawn sketch Active CN107633065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710860271.8A CN107633065B (en) 2017-09-21 2017-09-21 Identification method based on hand-drawn sketch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710860271.8A CN107633065B (en) 2017-09-21 2017-09-21 Identification method based on hand-drawn sketch

Publications (2)

Publication Number Publication Date
CN107633065A CN107633065A (en) 2018-01-26
CN107633065B true CN107633065B (en) 2020-06-02

Family

ID=61103167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710860271.8A Active CN107633065B (en) 2017-09-21 2017-09-21 Identification method based on hand-drawn sketch

Country Status (1)

Country Link
CN (1) CN107633065B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022114403A1 (en) * 2020-11-24 2022-06-02 서강대학교 산학협력단 Method for generating segment fingerprint and device for video part copy detection using same

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537887A (en) * 2018-04-18 2018-09-14 北京航空航天大学 Sketch based on 3D printing and model library 3-D view matching process
CN108664651B (en) * 2018-05-17 2020-08-04 腾讯科技(深圳)有限公司 Pattern recommendation method, device and storage medium
CN109102565B (en) * 2018-07-05 2022-10-14 内江市下一代互联网数据处理技术研究所 Method for automatically generating virtual terrain
CN108846386B (en) * 2018-07-10 2022-06-24 深圳市前海手绘科技文化有限公司 Intelligent identification and correction method for hand-drawn pattern
CN112057079B (en) * 2020-08-07 2022-07-29 中国科学院深圳先进技术研究院 Behavior quantification method and terminal based on state and map
CN113642481A (en) * 2021-08-17 2021-11-12 百度在线网络技术(北京)有限公司 Recognition method, training method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745200A (en) * 2014-01-02 2014-04-23 哈尔滨工程大学 Facial image identification method based on word bag model
CN105844299A (en) * 2016-03-23 2016-08-10 浙江理工大学 Image classification method based on bag of words

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745200A (en) * 2014-01-02 2014-04-23 哈尔滨工程大学 Facial image identification method based on word bag model
CN105844299A (en) * 2016-03-23 2016-08-10 浙江理工大学 Image classification method based on bag of words

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on Image Classification Methods Based on the BoW Model; Wang Ying; China Masters' Theses Full-text Database, Information Science & Technology; 2013-03-15 (No. 03); p. I138-1563 *
Research on Content-Based Image Retrieval Technology; Li Yong; China Doctoral Dissertations Full-text Database, Information Science & Technology; 2009-08-15 (No. 08); p. I138-45 *
Research on Image Retrieval Based on Hand-Drawn Sketches; Tan Qinghua; China Masters' Theses Full-text Database, Information Science & Technology; 2013-09-15 (No. 09); p. I138-426 *


Also Published As

Publication number Publication date
CN107633065A (en) 2018-01-26

Similar Documents

Publication Publication Date Title
CN107633065B (en) Identification method based on hand-drawn sketch
Bansal et al. 2D object recognition: a comparative analysis of SIFT, SURF and ORB feature descriptors
Mehmood et al. Content-based image retrieval and semantic automatic image annotation based on the weighted average of triangular histograms using support vector machine
Jiao et al. SAR images retrieval based on semantic classification and region-based similarity measure for earth observation
Ali et al. A hybrid geometric spatial image representation for scene classification
Shirazi et al. Content-based image retrieval using texture color shape and region
Rout et al. A review on content-based image retrieval system: Present trends and future challenges
Nie et al. Convolutional deep learning for 3D object retrieval
Li et al. Fuzzy bag of words for social image description
Jin et al. Content-based image retrieval based on shape similarity calculation
Zarbakhsh et al. Low-rank sparse coding and region of interest pooling for dynamic 3D facial expression recognition
Shrinivasa et al. Scene image classification based on visual words concatenation of local and global features
Ji et al. Balance between object and background: Object-enhanced features for scene image classification
Singh et al. Ensemble visual content based search and retrieval for natural scene images
Ramesh et al. Multiple object cues for high performance vector quantization
Parseh et al. Semantic-aware visual scene representation
Wan et al. Local feature representation based on linear filtering with feature pooling and divisive normalization for remote sensing image classification
Xiao et al. A multi-scale cascaded hierarchical model for image labeling
Fan et al. Robust visual tracking via bag of superpixels
Montazer et al. Scene classification based on local binary pattern and improved bag of visual words
CN112818779A (en) Human behavior recognition method based on feature optimization and multiple feature fusion
Shanmugam et al. An efficient perceptual of CBIR system using MIL-SVM classification and SURF feature extraction.
Li et al. View-wised discriminative ranking for 3D object retrieval
Wang et al. Robust and real-time object recognition based on multiple fractal dimension
Giveki et al. Shape classification using a new shape descriptor and multi-view learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant