CN102629328B - Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color - Google Patents

Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color

Info

Publication number
CN102629328B
CN102629328B · CN 201210062379 · CN201210062379A
Authority
CN
China
Prior art keywords
training image
image
train
sift
hsv
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201210062379
Other languages
Chinese (zh)
Other versions
CN102629328A (en)
Inventor
杨金福
王锴
李明爱
王阳丽
杨宛露
傅金融
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Maowao Technology (Tianjin) Co., Ltd.
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN 201210062379 priority Critical patent/CN102629328B/en
Publication of CN102629328A publication Critical patent/CN102629328A/en
Application granted granted Critical
Publication of CN102629328B publication Critical patent/CN102629328B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention provides a probabilistic latent semantic model object image recognition method that fuses color salient features, belonging to the field of image recognition technology. The method is characterized by: using the SIFT algorithm to extract the local salient features of an image while adding color features to generate HSV_SIFT features; introducing TF-IDF weight information for feature reconstruction so that the local salient features become more discriminative; using a latent semantic model to obtain the latent semantic features of the image; and finally classifying with a nearest-neighbor KNN classifier. The method considers not only the color information of the image but also the distribution of each visual word across the whole image set, so that the local salient features of an object become more discriminative and recognition ability is improved.

Description

A probabilistic latent semantic model object image recognition method fusing color salient features
Technical field
The invention belongs to the field of image recognition technology and introduces a probabilistic latent semantic model object image recognition method that fuses color information with salient features. When the salient features of an image are extracted, color information is added, and the TF-IDF (term frequency-inverse document frequency) word-frequency weighting method is introduced so that the local salient features become more discriminative. On this basis, the latent semantic features of the image are obtained from a latent semantic model, narrowing the semantic gap that exists in object recognition and making the image recognition problem easier to solve.
Background art
At present, mobile robots are widely applied in fields such as industry, aerospace, the military, and services. As applications expand, the demands on a mobile robot's intelligence keep rising, and intelligent autonomous mobile robots have become a research hotspot in the field of intelligent systems. Because a robot vision system is close to the way humans perceive their environment and can supply a mobile robot with rich perceptual information, vision-based environment perception for mobile robots has attracted a large number of researchers. Object recognition is the foundation and core of mobile robot technology, and a key technique for improving robot intelligence: in an unknown environment, a mobile robot must acquire images of its surroundings through vision sensors, recognize and understand the objects in those images, and then carry out the corresponding tasks.
Feature extraction is a crucial link in the object image recognition process; its purpose is to transform image information from the data space into a feature space. In a sense, the quality of the feature extraction result plays a decisive role in the recognition result of an object image recognition task. Local image features, with their superior performance, have attracted more and more researchers' attention.
In general, local salient features capture the important targets humans are interested in and can express the content of an image. Assigning different processing priorities to different image features not only reduces the complexity of the analysis process but also improves its computational efficiency. In 1988, Harris.C.J and Stephens.M (A combined corner and edge detector. Proc. 4th Alvey Vision Conference, 1988:147-151), building on Moravec's interest points, used the autocorrelation matrix of the luminance function to detect feature points (corners) and extracted local image features centered on the interest points. In 1998, Lindeberg.T (Feature detection with automatic scale selection. International Journal of Computer Vision, 1998, 30(2):79-116) used automatic scale selection to extract feature points, adding scale information so that the characteristic scale of a point is determined together with its position. In 2001, Mikolajczyk.C.S.K (Indexing based on scale invariant interest points. Proc. 8th International Conference on Computer Vision, Institute of Electrical and Electronics Engineers Inc, 2001:525-531) used the Laplace operator to detect the scale of Harris corners, constructing a scale-invariant Harris-Laplace operator and extending it to the affine-invariant Harris-affine operator. In 2004, David G. Lowe (Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2):91-110) replaced the Laplace operator with the DOG (Difference of Gaussian) operator to speed up interest-point detection, proposing and refining the SIFT (Scale Invariant Feature Transform) algorithm. In 2005, K. Mikolajczyk and C. Schmid (A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(10):1615-1630) evaluated a range of feature extraction methods and found that the SIFT algorithm performed best under illumination changes, geometric distortion, resolution differences, rotation, blur, and image compression. Nevertheless, the method still has limitations: the traditional SIFT algorithm uses only the gray-scale and gradient information of an image and ignores its color information, so it inevitably produces false matches between images that are close in gray scale yet distinguishable by color.
The local salient features of each object image can be obtained preliminarily by the methods above. After vector quantization, these features can be regarded as visual words. Because every image is composed of a large number of visual words, object image recognition roughly amounts to finding the corresponding image according to the frequency with which each class of visual word appears, thus constructing a bag-of-words (BOW) model of the image. The traditional BOW model, however, only uses the information of the visual words within a single image and does not fully consider the distribution of visual words across the whole image set.
Traditional probabilistic latent semantic analysis (PLSA) was first applied in the field of document information retrieval, computing the latent topic distribution of each document. Because images and text have great similarity, the same principle applies: the PLSA method can also be used for image recognition problems, computing the latent topics of every image.
The present invention performs object image recognition based on the latent semantic features of color-fused local salient features. The SIFT algorithm is used to extract the local salient features of an image, and color features are added; these features are highly salient and relatively easy to obtain, making it easy to pick out the object to be detected in a huge feature database. After the local salient features are extracted, TF-IDF weight information is added for feature reconstruction, making the local salient features more discriminative. Finally, a latent semantic model is used to obtain the latent semantic features of the image and complete the object image recognition task.
Summary of the invention
The present invention adds color features to the SIFT salient features and introduces the weight of each visual word in the visual dictionary, designing and implementing a complete object image recognition method. The traditional SIFT algorithm uses only the gray-scale and gradient information of an image and ignores its color information, so it inevitably produces false matches between images that are close in gray scale yet distinguishable by color. The commonly used color space is RGB, but distances computed in RGB space do not characterize well the real difference between two colors as actually perceived by people; this patent therefore adopts the HSV color model, which matches the visual characteristics of the human eye, and the improved method overcomes the shortcomings of the traditional one. Meanwhile, the traditional BOW model only uses the information of the visual words within a single image and does not consider their distribution across the whole image set. The present invention introduces the TF-IDF statistic, common in information retrieval and text mining, which assesses how important a word is to a document within a document collection or corpus. After vector quantization, each salient feature extracted from a sampled image can be regarded as a visual word and each image as a document; introducing the TF-IDF weighting thus considers the distribution of visual words both in a single image and in the whole image set. If a visual word occurs frequently in one image but rarely in the others, it has good class discrimination ability and is suitable for classification. A latent semantic model is then used to compute the latent semantic features of all images, narrowing the semantic gap that exists in object image recognition and making the recognition of complex images easier.
The invention is characterized in that it is implemented in a computer according to the following steps in sequence:
In the robot training stage, training proceeds as follows:
Step (1): build the training database. The computer collects and inputs object images divided into N classes by object purpose, with class numbers 1~N and T training images per class, forming the training image set, denoted P_train, with a total of N × T = Q images;
Step (2): apply the scale-invariant feature transform (SIFT) algorithm as follows to compute the salient feature points of every training image in P_train, adding color information to generate salient features denoted HSV_SIFT, thereby forming the HSV_SIFT salient feature library O_HSV_SIFT of the training image set:
Step (2.1): each image in the training image set, denoted d_i(x, y) with i ∈ P_train and (x, y) the pixel coordinates, is convolved with the Gaussian kernels

G(x, y, σ_m) = 1/(2πσ_m²) · exp(−(x² + y²)/(2σ_m²)), m = 1...10

where σ_m is the scale factor, with initial value σ_0 = 1.6 and σ_m = ασ_{m−1}, α being the constant ratio between adjacent scales.
This yields a group of ten Gaussian pyramid spaces L_i(x, y, σ_m), each expressed as:

L_i(x, y, σ_m) = G(x, y, σ_m) * d_i(x, y), i ∈ P_train
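For illustration, a minimal Python sketch of step (2.1) follows, assuming a grayscale image already loaded as an array; the value of α is not legible in the original, so α = 2^(1/3), a common SIFT choice, is only a stand-in:

```python
import cv2
import numpy as np

def gaussian_pyramid(img, sigma0=1.6, alpha=2 ** (1 / 3), levels=10):
    """Build L_i(x, y, sigma_m) = G(x, y, sigma_m) * d_i(x, y) for m = 1..levels."""
    img = img.astype(np.float32)
    sigmas = [sigma0 * alpha ** m for m in range(1, levels + 1)]
    # ksize (0, 0) lets OpenCV derive the kernel size from each sigma
    return [cv2.GaussianBlur(img, (0, 0), s) for s in sigmas], sigmas
```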
Step (2.2): adjacent Gaussian pyramid spaces are subtracted pairwise by the formula below, giving a group of nine Gaussian residual (difference-of-Gaussian) pyramid spaces, each denoted Dog_i(x, y, σ_{m−1}):

Dog_i(x, y, σ_{m−1}) = (G(x, y, ασ_{m−1}) − G(x, y, σ_{m−1})) * d_i(x, y) = L_i(x, y, ασ_{m−1}) − L_i(x, y, σ_{m−1})

Step (2.3): in the Gaussian residual pyramid space of each image i, every pixel is compared with its 8 adjacent pixels in the same layer and the 9 pixels at the corresponding positions in each of the adjacent layers above and below, 26 pixels in total; if the pixel is larger than, or smaller than, the values of all 26 of these pixels, it is taken as a feature point;
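Steps (2.2) and (2.3) can be sketched as follows (an illustrative implementation, assuming float pyramid layers); a candidate feature point must exceed, or fall below, all 26 of its neighbors:

```python
import numpy as np

def dog_spaces(pyramid):
    """Dog_i = L_i(sigma_m) - L_i(sigma_{m-1}) for adjacent pyramid levels."""
    return [pyramid[m + 1] - pyramid[m] for m in range(len(pyramid) - 1)]

def local_extrema(dogs):
    """Keep pixels larger or smaller than all 26 neighbours (8 + 9 + 9)."""
    keypoints = []
    for m in range(1, len(dogs) - 1):            # need a layer above and below
        below, cur, above = dogs[m - 1], dogs[m], dogs[m + 1]
        for y in range(1, cur.shape[0] - 1):
            for x in range(1, cur.shape[1] - 1):
                cube = np.stack([below[y - 1:y + 2, x - 1:x + 2],
                                 cur[y - 1:y + 2, x - 1:x + 2],
                                 above[y - 1:y + 2, x - 1:x + 2]])
                others = np.delete(cube.ravel(), 13)   # index 13 = centre pixel
                v = cur[y, x]
                if v > others.max() or v < others.min():
                    keypoints.append((x, y, m))
    return keypoints
```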
Step (2.4): select and retain salient feature points from the feature points obtained in step (2.3) as follows;
Step (2.4.1): the Gaussian residual pyramid Dog_i(x, y, σ_{m−1}) of each layer of each training image is represented by its Taylor expansion at each feature point obtained in step (2.3); keeping the first two terms gives Dog_i(X_max), where X = (x, y, σ_{m−1}), Dog_{i,0} is the first term of the Taylor expansion, and T denotes transposition:

Dog_i(X_max) = Dog_{i,0} + (1/2)(∂Dog_i(X)/∂X)^T X, where (∂Dog_i(X)/∂X)^T = (Dog_{i,x}, Dog_{i,y}, Dog_{i,σ_{m−1}})

If |Dog_i(X_max)| ≥ 0.03, the feature point is kept; otherwise it is filtered out;
Step (2.4.2): feature points lying at the edges in each layer of the residual pyramid are filtered by the following test: if

Tr(H_hess)² / Det(H_hess) ≥ (r + 1)² / r

(the standard SIFT edge-response test, with r a constant controlling the allowed ratio of principal curvatures, typically r = 10), the feature point is considered to lie on an image edge and is filtered out; otherwise it is kept. Tr(H_hess) is the trace of the Hessian matrix denoted H_hess, and Det(H_hess) is its determinant:

H_hess = | D_xx  D_xy |
         | D_xy  D_yy |

Tr(H_hess) = D_xx + D_yy
Det(H_hess) = D_xx·D_yy − (D_xy)²

where D_xx and D_yy are the second-order partial derivatives of the Taylor expansion in the x and y directions respectively, and D_xy is the mixed partial derivative in the x and y directions. The feature points retained by steps (2.3) and (2.4) are called salient feature points;
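A sketch of the two filters of step (2.4), taking the derivatives by central differences directly on the DoG layer; the contrast test here uses the layer value at the point rather than the interpolated Dog_i(X_max), and r = 10 is the conventional SIFT value, both assumptions:

```python
import numpy as np

def keep_keypoint(dog, x, y, contrast_thresh=0.03, r=10.0):
    """Return True if the point passes both the contrast and the edge test."""
    if abs(dog[y, x]) < contrast_thresh:                  # low-contrast filter
        return False
    # second-order partial derivatives by central differences
    d_xx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    d_yy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    d_xy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
            - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    tr = d_xx + d_yy                                      # Tr(H_hess)
    det = d_xx * d_yy - d_xy ** 2                         # Det(H_hess)
    if det <= 0 or tr ** 2 / det >= (r + 1) ** 2 / r:     # edge-response filter
        return False
    return True
```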
Step (2.5): determine the principal direction of each salient feature point from step (2.4) by the formulas below. The principal direction is the gradient direction corresponding to the highest gradient magnitude among the 8 pixels surrounding each salient feature point. The squared gradient magnitude e(x, y)² at each salient feature point is:

e(x, y)² = (L_i(x+1, y, σ_{m−1}) − L_i(x−1, y, σ_{m−1}))² + (L_i(x, y+1, σ_{m−1}) − L_i(x, y−1, σ_{m−1}))²

and the gradient direction θ(x, y) at each salient feature point is:

θ(x, y) = tan⁻¹((L_i(x, y+1, σ_{m−1}) − L_i(x, y−1, σ_{m−1})) / (L_i(x+1, y, σ_{m−1}) − L_i(x−1, y, σ_{m−1})))

With the gradient magnitude as ordinate and the gradient direction as abscissa of a gradient orientation histogram, the direction corresponding to the highest gradient magnitude represents the principal direction of each salient feature point;
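Step (2.5) can be sketched as below; the 36-bin orientation histogram and the 8-pixel radius are conventional SIFT choices assumed here, with e(x, y) and θ(x, y) computed as in the formulas above:

```python
import numpy as np

def principal_direction(L, x, y, radius=8):
    """Dominant gradient orientation around (x, y) in pyramid layer L."""
    mags, thetas = [], []
    for yy in range(y - radius, y + radius + 1):
        for xx in range(x - radius, x + radius + 1):
            dx = L[yy, xx + 1] - L[yy, xx - 1]
            dy = L[yy + 1, xx] - L[yy - 1, xx]
            mags.append(np.hypot(dx, dy))        # gradient magnitude e(x, y)
            thetas.append(np.arctan2(dy, dx))    # gradient direction theta(x, y)
    hist, edges = np.histogram(thetas, bins=36,
                               range=(-np.pi, np.pi), weights=mags)
    peak = int(hist.argmax())
    return (edges[peak] + edges[peak + 1]) / 2   # principal direction (radians)
```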
Step (2.6): generate the SIFT feature of each salient feature point. Each SIFT feature is formed from a 4 × 4 grid of 16 seed points, where each seed point in turn covers 4 × 4 image pixels and each pixel carries vector information for 8 directions, finally producing a 4 × 4 × 8 = 128-dimensional SIFT feature vector composed of gradient magnitudes and gradient directions;
Step (2.7): generate the color feature of each image d_i(x, y) as follows:
Step (2.7.1): convert every image d_i(x, y) from the RGB color space to the HSV color space by the formulas below, where:
H is the hue angle in degrees, H ∈ [0°, 360°),
S is the saturation, S ∈ [0, 1],
V is the brightness, V ∈ [0, 1],
R, G, B are in turn the red, green, and blue component values of a pixel.
Let max = max(R, G, B) and min = min(R, G, B); then:

H = 0°, if max = min
H = (60° × (G − B)/(max − min)) mod 360°, if max = R
H = 60° × (B − R)/(max − min) + 120°, if max = G
H = 60° × (R − G)/(max − min) + 240°, if max = B

S = 0 if max = 0, otherwise S = (max − min)/max,
V = max,
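A per-pixel sketch of the RGB → HSV conversion of step (2.7.1), with R, G, B normalized to [0, 1]:

```python
def rgb_to_hsv(r, g, b):
    """Convert one pixel; returns H in [0, 360), S and V in [0, 1]."""
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx
    s = 0.0 if mx == 0 else (mx - mn) / mx
    if mx == mn:                                 # gray pixel, hue undefined
        h = 0.0
    elif mx == r:
        h = (60 * (g - b) / (mx - mn)) % 360
    elif mx == g:
        h = 60 * (b - r) / (mx - mn) + 120
    else:                                        # mx == b
        h = 60 * (r - g) / (mx - mn) + 240
    return h, s, v
```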
Step (2.7.2): quantize the whole HSV color space into 72 colors as follows, generating a 72-dimensional color feature. The three components H, S, V of the HSV color space are quantized with different uniform intervals: the hue angle H is divided into 8 parts with value range 0-7, each part corresponding to one h value; the saturation S is divided into 3 parts with value range 0-2, each part corresponding to one s value; and the brightness V is divided into 3 parts with value range 0-2, each part corresponding to one v value. The quantized color index is then

ζ = 9h + 3s + v

so that ζ ∈ [0, 71]. The HSV color space of the image d_i(x, y) is thereby quantized into 72 main colors, generating a 72-dimensional color feature;
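Step (2.7.2) sketched in Python; the bin combination ζ = 9h + 3s + v used above maps the 8 × 3 × 3 grid onto the 72 color indices:

```python
import numpy as np

def hsv_histogram(hsv_pixels):
    """hsv_pixels: iterable of (H, S, V) with H in [0, 360), S, V in [0, 1]."""
    hist = np.zeros(72)
    for H, S, V in hsv_pixels:
        h = min(int(H / 45), 7)           # 8 hue bins
        s = min(int(S * 3), 2)            # 3 saturation bins
        v = min(int(V * 3), 2)            # 3 brightness bins
        hist[9 * h + 3 * s + v] += 1      # zeta = 9h + 3s + v
    return hist / max(hist.sum(), 1)      # normalised 72-d colour feature
```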
Step (2.8): fuse the SIFT features of the salient feature points with the color feature of the training image d_i(x, y). The color feature of d_i(x, y) is appended to the SIFT feature of each salient feature point in that training image, forming a 200-dimensional feature; each such feature is called an HSV_SIFT salient feature, also called a visual word. All the HSV_SIFT salient features in each image d_i(x, y), i ∈ P_train, of the training image set form the salient feature library U_i of that training image, and the salient feature libraries ΣU_i, i = 1, 2...Q, of all Q images in the training image set P_train together form an HSV_SIFT salient feature library O_HSV_SIFT;
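Step (2.8) amounts to concatenating each 128-dimensional SIFT descriptor of an image with that image's 72-dimensional color histogram; a sketch (function names are illustrative):

```python
import numpy as np

def hsv_sift_features(sift_descriptors, color_hist):
    """sift_descriptors: (n, 128) array; color_hist: (72,) array -> (n, 200)."""
    color = np.tile(color_hist, (len(sift_descriptors), 1))
    return np.hstack([sift_descriptors, color])   # 200-d HSV_SIFT visual words
```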
Step (3): construct a bag-of-words model BOW representing the HSV_SIFT salient feature library O_HSV_SIFT as follows. Every training image d_i(x, y) is expressed as a vector over all the HSV_SIFT salient features it contains, W_i = (w_{1,i}, w_{2,i}, w_{3,i}, ..., w_{j,i}, ..., w_{J,i}), where j ∈ [1, J] and J (commonly taken as 200) is the number of visual words, i.e. the number of HSV_SIFT salient features of this training image d_i(x, y), with

w_{j,i} = tf_{j,i} × log(Q / df_j)

where: Q is the number of training images in the training image set P_train,
tf_{j,i} is the number of times visual word j appears in the salient feature library U_i of training image d_i(x, y),
df_j is the number of occurrences of visual word j in the HSV_SIFT salient feature library O_HSV_SIFT;
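An illustrative sketch of the TF-IDF weighted BOW of step (3), assuming each image has already been quantized into a list of visual-word indices against a J-word dictionary; here df_j is taken as the number of images containing word j, the usual document-frequency reading of the formula:

```python
import numpy as np

def tfidf_bow(word_ids_per_image, vocab_size=200):
    """word_ids_per_image: list (one per image) of visual-word index lists."""
    Q = len(word_ids_per_image)
    tf = np.zeros((Q, vocab_size))
    for i, ids in enumerate(word_ids_per_image):
        for j in ids:
            tf[i, j] += 1                         # tf_{j,i}
    df = (tf > 0).sum(axis=0)                     # images containing word j
    idf = np.log(Q / np.maximum(df, 1))           # log(Q / df_j)
    return tf * idf                               # (Q, vocab_size) BOW matrix
```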
Step (4): compute successively, as follows, the latent semantic feature vector Z_train,i of every training image d_i(x, y) in the training image set P_train, where Z_train,i is the set of latent semantic topics of training image d_i(x, y), and K, the number of latent semantic topics, is related in value to the number of classes N of training images in P_train by K = N ± 5; a latent semantic topic is a conceptual statement of some concrete objects in each training image d_i(x, y):
Step (4.1): initialization. P(z_k | d_i) and P(w_{j,i} | z_k) are each assigned a random number between 0 and 1 as initial value, where P(z_k | d_i) is the distribution probability of latent semantic topic z_k, k ∈ [1, K], in training image d_i(x, y), and P(w_{j,i} | z_k) is the distribution probability of latent semantic topic z_k over the visual words;
Step (4.2): compute by the formula below the posterior probability P(z_k | d_i, w_{j,i}) that any visual word w_{j,i} in training image d_i generates latent semantic topic z_k:

P(z_k | d_i, w_{j,i}) = P(w_{j,i} | z_k) P(z_k | d_i) / Σ_{k=1..K} P(w_{j,i} | z_k) P(z_k | d_i)

Step (4.3): recompute the P(z_k | d_i) and P(w_{j,i} | z_k) of step (4.2) by the formulas below, where π(d_i, w_{j,i}) is the number of times visual word w_{j,i} occurs in training image d_i:

P(w_{j,i} | z_k) = Σ_{i=1..Q} π(d_i, w_{j,i}) P(z_k | d_i, w_{j,i}) / Σ_{j=1..J} Σ_{i=1..Q} π(d_i, w_{j,i}) P(z_k | d_i, w_{j,i})

P(z_k | d_i) = Σ_{j=1..J} π(d_i, w_{j,i}) P(z_k | d_i, w_{j,i}) / Σ_{j=1..J} π(d_i, w_{j,i})

Step (4.4): compute as follows the latent semantic feature vector Z_train,i of every training image d_i in P_train:

Z_train,i = {z_{1,i}, z_{2,i}, z_{3,i}, ..., z_{K,i}}, i ∈ P_train

Step (4.4.1): obtain from steps (4.2) and (4.3) the posterior probability P(z_k | d_i, w_{j,i}) of latent semantic topic z_k in each training image d_i;
Step (4.4.2): compute the likelihood function Likelihood_λ of the λ-th iteration:

Likelihood_λ = Σ_{i=1..Q} Σ_{j=1..J} π(d_i, w_{j,i}) log P(d_i, w_{j,i});

Step (4.4.3): iterate steps (4.2)-(4.4.2), judging the increase of the likelihood function between two adjacent iterations,

Likelihood_λ − Likelihood_{λ−1}

When this increase is less than the set threshold φ = 0.5, iteration stops, and the latent semantic feature vector Z_train,i of every image d_i(x, y) in the training image set P_train is obtained; otherwise iteration continues until the increase of the likelihood function is less than φ = 0.5;
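Steps (4.1)-(4.4.3) form a standard PLSA EM loop; the sketch below assumes a uniform P(d_i) = 1/Q in the likelihood and stops when the likelihood gain drops below φ:

```python
import numpy as np

def plsa(counts, K, phi=0.5, max_iter=500, seed=0):
    """counts[i, j] = pi(d_i, w_j); returns the (Q, K) latent feature matrix."""
    rng = np.random.default_rng(seed)
    Q, J = counts.shape
    p_z_d = rng.random((K, Q)); p_z_d /= p_z_d.sum(axis=0)   # P(z_k | d_i)
    p_w_z = rng.random((J, K)); p_w_z /= p_w_z.sum(axis=0)   # P(w_j | z_k)
    prev = -np.inf
    for _ in range(max_iter):
        # E-step: posterior P(z_k | d_i, w_j) proportional to P(w|z) P(z|d)
        post = p_w_z[None, :, :] * p_z_d.T[:, None, :]       # (Q, J, K)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        weighted = counts[:, :, None] * post                 # pi * posterior
        # M-step re-estimates
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1).T
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
        # Likelihood = sum_i sum_j pi(d_i, w_j) log P(d_i, w_j), P(d_i) = 1/Q
        p_dw = (p_w_z[None, :, :] * p_z_d.T[:, None, :]).sum(axis=2) / Q
        ll = (counts * np.log(p_dw + 1e-12)).sum()
        if ll - prev < phi:                                  # stop on small gain
            break
        prev = ll
    return p_z_d.T                                           # rows are Z_train,i
```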
In the robot recognition stage, recognition proceeds as follows:
Step (5): compute by steps (1)~(4) the latent semantic feature vector Z_test,i' of every test image collected in real time in the test image set P_test:

Z_test,i' = {z_{1,i'}, z_{2,i'}, z_{3,i'}, ..., z_{K,i'}}, i' ∈ P_test

Step (6): use the following nearest-neighbor KNN classifier model to compute the distance Dis between the training image set P_train and the real-time test image set P_test over the latent semantic feature vectors; the class at minimum distance is the corresponding object class:

Dis = ‖Z_train,i − Z_test,i'‖.
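Steps (5) and (6) sketched as a 1-nearest-neighbor rule over the latent semantic feature vectors (reading the text's "nearest neighbor KNN classifier" as K = 1 is an assumption):

```python
import numpy as np

def classify(z_train, labels, z_test):
    """z_train: (Q, K) training vectors; labels: (Q,); z_test: (K,) query."""
    dis = np.linalg.norm(z_train - z_test, axis=1)   # Dis per training image
    return labels[dis.argmin()]                      # class at minimum distance
```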
In 100 object image recognition experiments, the average recognition rate was rec = 78.2%, with a highest recognition rate of 81.6% and a lowest of 75.3%; the average recognition time per image was 0.644 seconds. The recognition time and recognition rate satisfy the requirements of a mobile robot in the laboratory.
Description of the drawings
Fig. 1 is the flow chart for generating HSV_SIFT features;
Fig. 2 shows the construction of the Gaussian pyramid and the Gaussian residual pyramid;
Fig. 3 shows the SIFT seed points;
Fig. 4 shows the detection of maximum and minimum points;
Fig. 5 is the PLSA model framework diagram;
Embodiment
1. In the robot training stage, because robot recognition first requires building a training database, the pre-collected training images are divided into N classes by the purpose of the objects in the images, with class numbers 1~N and T images per class, so that the whole training image set P_train contains N × T = Q images;
2. For every image in the training set, the SIFT algorithm is used to compute the salient feature points of each training image and generate the HSV_SIFT salient features. The main steps are: detect the image feature points; retain the salient feature points; determine the principal direction of each salient feature point; generate the SIFT feature of each salient feature point; generate the image color feature; fuse the SIFT features of the salient feature points with the image color feature to generate the HSV_SIFT salient features; and finally build the HSV_SIFT salient feature library O_HSV_SIFT of the training image set P_train; see Fig. 1, Fig. 2, Fig. 3, and Fig. 4;
3. To remedy the deficiency of the histogram statistics in the traditional BOW model, the TF-IDF weighting statistic is introduced to make the HSV_SIFT features more discriminative, using the TF-IDF weight formula:

w_{j,i} = tf_{j,i} × log(Q / df_j)

where tf_{j,i} is the number of times visual word j appears in the image salient feature library U_i, i ∈ P_train, Q is the total number of training images, and df_j is the number of occurrences of visual word j in the HSV_SIFT salient feature library O_HSV_SIFT. Each image is finally expressed as W_i = (w_{1,i}, w_{2,i}, w_{3,i}, ..., w_{j,i}, ..., w_{J,i}), where j ∈ [1, J] and J is commonly 200; this vector is called the BOW description of the image;
4. After each image has been re-described by the BOW model in the step above, the PLSA method is used to compute the latent semantic topic model of the images. Latent semantic topics are concepts an image may contain; for example, "computer" may comprise topics such as mouse, keyboard, display, and case. The formula below expresses the "image-visual word" conditional probability:

P(d_i, w_{j,i}) = Σ_{k=1..K} P(d_i) P(w_{j,i} | z_k) P(z_k | d_i)

where w_{j,i} is the image BOW description from the step above, d_i is the i-th image, P(d_i) is the probability that image i appears in the whole training image set P_train, P(w_{j,i} | z_k) is the distribution probability of the latent semantic topics over the visual words, and P(z_k | d_i) is the latent semantic topic distribution probability of the image, with k ∈ [1, K]; K, the number of latent semantic topics, is related in value to the number of classes N of training images in P_train by K = N ± 5; the concrete PLSA model is shown in Fig. 5. The likelihood function of the λ-th iteration is

Likelihood_λ = Σ_{i=1..Q} Σ_{j=1..J} π(d_i, w_{j,i}) log P(d_i, w_{j,i})

and iteration stops when the increase Likelihood_λ − Likelihood_{λ−1} of the likelihood expectation is less than the set threshold φ = 0.5; otherwise iteration continues until the threshold φ is satisfied. Through the PLSA computation, the latent semantic feature vector Z_train,i = {z_{1,i}, z_{2,i}, z_{3,i}, ..., z_{K,i}}, i ∈ P_train, of every image in the training image set P_train is obtained;
5. In the robot recognition stage, the same method as above is applied to the test image set P_test collected in real time, computing the latent semantic feature vector Z_test,i' = {z_{1,i'}, z_{2,i'}, z_{3,i'}, ..., z_{K,i'}}, i' ∈ P_test, of every image;
6. The nearest-neighbor KNN (K-Nearest Neighbor) classifier model

Dis = ‖Z_train,i − Z_test,i'‖, i ∈ P_train, i' ∈ P_test

classifies objects according to the latent semantic feature vectors of the training image set and of the test images collected in real time by the robot.

Claims (1)

1. A probabilistic latent semantic model object image recognition method fusing color salient features, characterized in that it is implemented in a computer according to the following steps in sequence:
In the robot training stage, training proceeds as follows:
Step (1): build the training database. The computer collects and inputs object images divided into N classes by object purpose, with class numbers 1~N and T training images per class, forming the training image set, denoted P_train, with a total of N × T = Q images;
Step (2): apply the scale-invariant feature transform (SIFT) algorithm as follows to compute the salient feature points of every training image in P_train, adding color information to generate salient features denoted HSV_SIFT, thereby forming the HSV_SIFT salient feature library O_HSV_SIFT of the training image set:
Step (2.1): each image in the training image set, denoted d_i(x, y) with i ∈ P_train and (x, y) the pixel coordinates, is convolved with the Gaussian kernels

G(x, y, σ_m) = 1/(2πσ_m²) · exp(−(x² + y²)/(2σ_m²)), m = 1...10

where σ_m is the scale factor, with initial value σ_0 = 1.6 and σ_m = ασ_{m−1}, α being the constant ratio between adjacent scales.
This yields a group of ten Gaussian pyramid spaces L_i(x, y, σ_m), each expressed as:

L_i(x, y, σ_m) = G(x, y, σ_m) * d_i(x, y), i ∈ P_train
Step (2.2): adjacent Gaussian pyramid spaces are subtracted pairwise by the formula below, giving a group of nine Gaussian residual pyramid spaces, each denoted Dog_i(x, y, σ_{m−1}):

Dog_i(x, y, σ_{m−1}) = (G(x, y, ασ_{m−1}) − G(x, y, σ_{m−1})) * d_i(x, y) = L_i(x, y, ασ_{m−1}) − L_i(x, y, σ_{m−1})

Step (2.3): in the Gaussian residual pyramid space of each image i, every pixel is compared with its 8 adjacent pixels in the same layer and the 9 pixels at the corresponding positions in each of the adjacent layers above and below, 26 pixels in total; if the pixel is larger than, or smaller than, the values of all 26 of these pixels, it is taken as a feature point;
Step (2.4): select and retain salient feature points from the feature points obtained in step (2.3) as follows;
Step (2.4.1): the Gaussian residual pyramid Dog_i(x, y, σ_{m−1}) of each layer of each training image is represented by its Taylor expansion at each feature point obtained in step (2.3); keeping the first two terms gives Dog_i(X_max), where X = (x, y, σ_{m−1}), Dog_{i,0} is the first term of the Taylor expansion, and T denotes transposition:

Dog_i(X_max) = Dog_{i,0} + (1/2)(∂Dog_i(X)/∂X)^T X, where (∂Dog_i(X)/∂X)^T = (Dog_{i,x}, Dog_{i,y}, Dog_{i,σ_{m−1}})

If |Dog_i(X_max)| ≥ 0.03, the feature point is kept; otherwise it is filtered out;
Step (2.4.2): feature points lying at the edges in each layer of the residual pyramid are filtered by the following test: if

Tr(H_hess)² / Det(H_hess) ≥ (r + 1)² / r

(the standard SIFT edge-response test, with r a constant controlling the allowed ratio of principal curvatures, typically r = 10), the feature point is considered to lie on an image edge and is filtered out; otherwise it is kept. Tr(H_hess) is the trace of the Hessian matrix denoted H_hess, and Det(H_hess) is its determinant:

H_hess = | D_xx  D_xy |
         | D_xy  D_yy |

Tr(H_hess) = D_xx + D_yy
Det(H_hess) = D_xx·D_yy − (D_xy)²

where D_xx and D_yy are the second-order partial derivatives of the Taylor expansion in the x and y directions respectively, and D_xy is the mixed partial derivative in the x and y directions. The feature points retained by steps (2.3) and (2.4) are called salient feature points;
Step (2.5): determine the principal direction of each salient feature point from step (2.4) by the formulas below. The principal direction is the gradient direction corresponding to the highest gradient magnitude among the 8 pixels surrounding each salient feature point. The squared gradient magnitude e(x, y)² at each salient feature point is:

e(x, y)² = (L_i(x+1, y, σ_{m−1}) − L_i(x−1, y, σ_{m−1}))² + (L_i(x, y+1, σ_{m−1}) − L_i(x, y−1, σ_{m−1}))²

and the gradient direction θ(x, y) at each salient feature point is:

θ(x, y) = tan⁻¹((L_i(x, y+1, σ_{m−1}) − L_i(x, y−1, σ_{m−1})) / (L_i(x+1, y, σ_{m−1}) − L_i(x−1, y, σ_{m−1})))

With the gradient magnitude as ordinate and the gradient direction as abscissa of a gradient orientation histogram, the direction corresponding to the highest gradient magnitude represents the principal direction of each salient feature point;
Step (2.6): generate the SIFT feature of each salient feature point. Each SIFT feature is formed from a 4 × 4 grid of 16 seed points, where each seed point in turn covers 4 × 4 image pixels and each pixel carries vector information for 8 directions, finally producing a 4 × 4 × 8 = 128-dimensional SIFT feature vector composed of gradient magnitudes and gradient directions;
Step (2.7): generate the color feature of each image d_i(x, y) as follows:
Step (2.7.1): convert every image d_i(x, y) from the RGB color space to the HSV color space by the formulas below, where:
H is the hue angle in degrees, H ∈ [0°, 360°),
S is the saturation, S ∈ [0, 1],
V is the brightness, V ∈ [0, 1],
R, G, B are in turn the red, green, and blue component values of a pixel.
Let max = max(R, G, B) and min = min(R, G, B); then:

H = 0°, if max = min
H = (60° × (G − B)/(max − min)) mod 360°, if max = R
H = 60° × (B − R)/(max − min) + 120°, if max = G
H = 60° × (R − G)/(max − min) + 240°, if max = B

S = 0 if max = 0, otherwise S = (max − min)/max,
V = max,
Step (2.7.2): quantize the whole HSV color space into 72 colors as follows, generating a 72-dimensional color feature. The three components H, S, V of the HSV color space are quantized with different uniform intervals: the hue angle H is divided into 8 parts with value range 0-7, each part corresponding to one h value; the saturation S is divided into 3 parts with value range 0-2, each part corresponding to one s value; and the brightness V is divided into 3 parts with value range 0-2, each part corresponding to one v value. The quantized color index is then

ζ = 9h + 3s + v

so that ζ ∈ [0, 71]. The HSV color space of the image d_i(x, y) is thereby quantized into 72 main colors, generating a 72-dimensional color feature;
Step (2.8): fuse the SIFT features of the salient feature points with the color feature of the training image d_i(x, y). The color feature of d_i(x, y) is appended to the SIFT feature of each salient feature point in that training image, forming a 200-dimensional feature; each such feature is called an HSV_SIFT salient feature, also called a visual word. All the HSV_SIFT salient features in each image d_i(x, y), i ∈ P_train, of the training image set form the salient feature library U_i of that training image, and the salient feature libraries ΣU_i, i = 1, 2...Q, of all Q images in the training image set P_train together form an HSV_SIFT salient feature library O_HSV_SIFT;
Step (3): construct a bag-of-words model BOW representing the HSV_SIFT salient feature library O_HSV_SIFT as follows. Every training image d_i(x, y) is expressed as a vector over all the HSV_SIFT salient features it contains, W_i = (w_{1,i}, w_{2,i}, w_{3,i}, ..., w_{j,i}, ..., w_{J,i}), where j ∈ [1, J] and J (commonly taken as 200) is the number of visual words, i.e. the number of HSV_SIFT salient features of this training image d_i(x, y), with

w_{j,i} = tf_{j,i} × log(Q / df_j)

where: Q is the number of training images in the training image set P_train,
tf_{j,i} is the number of times visual word j appears in the salient feature library U_i of training image d_i(x, y),
df_j is the number of occurrences of visual word j in the HSV_SIFT salient feature library O_HSV_SIFT;
Step (4): compute successively, as follows, the latent semantic feature vector Z_train,i of every training image d_i(x, y) in the training image set P_train, where Z_train,i is the set of latent semantic topics of training image d_i(x, y), and K, the number of latent semantic topics, is related in value to the number of classes N of training images in P_train by K = N ± 5; a latent semantic topic is a conceptual statement of some concrete objects in each training image d_i(x, y):
Step (4.1): initialization. P(z_k | d_i) and P(w_{j,i} | z_k) are each assigned a random number between 0 and 1 as initial value, where P(z_k | d_i) is the distribution probability of latent semantic topic z_k, k ∈ [1, K], in training image d_i(x, y), and P(w_{j,i} | z_k) is the distribution probability of latent semantic topic z_k over the visual words;
Step (4.2): compute by the formula below the posterior probability P(z_k | d_i, w_{j,i}) that any visual word w_{j,i} in training image d_i generates latent semantic topic z_k:

P(z_k | d_i, w_{j,i}) = P(w_{j,i} | z_k) P(z_k | d_i) / Σ_{k=1..K} P(w_{j,i} | z_k) P(z_k | d_i)

Step (4.3): recompute the P(z_k | d_i) and P(w_{j,i} | z_k) of step (4.2) by the formulas below, where π(d_i, w_{j,i}) is the number of times visual word w_{j,i} occurs in training image d_i:

P(w_{j,i} | z_k) = Σ_{i=1..Q} π(d_i, w_{j,i}) P(z_k | d_i, w_{j,i}) / Σ_{j=1..J} Σ_{i=1..Q} π(d_i, w_{j,i}) P(z_k | d_i, w_{j,i})

P(z_k | d_i) = Σ_{j=1..J} π(d_i, w_{j,i}) P(z_k | d_i, w_{j,i}) / Σ_{j=1..J} π(d_i, w_{j,i})

Step (4.4): compute as follows the latent semantic feature vector Z_train,i of every training image d_i in P_train:

Z_train,i = {z_{1,i}, z_{2,i}, z_{3,i}, ..., z_{K,i}}, i ∈ P_train

Step (4.4.1): obtain from steps (4.2) and (4.3) the posterior probability P(z_k | d_i, w_{j,i}) of latent semantic topic z_k in each training image d_i;
Step (4.4.2): compute the likelihood function Likelihood_λ of the λ-th iteration:

Likelihood_λ = Σ_{i=1..Q} Σ_{j=1..J} π(d_i, w_{j,i}) log P(d_i, w_{j,i});

Step (4.4.3): iterate steps (4.2)-(4.4.2), judging the increase of the likelihood function between two adjacent iterations,

Likelihood_λ − Likelihood_{λ−1}

When this increase is less than the set threshold φ = 0.5, iteration stops, and the latent semantic feature vector Z_train,i of every image d_i(x, y) in the training image set P_train is obtained; otherwise iteration continues until the increase of the likelihood function is less than φ = 0.5;
In the robot recognition stage, recognition proceeds as follows:
Step (5): compute by steps (1)~(4) the latent semantic feature vector Z_test,i' of every test image collected in real time in the test image set P_test:

Z_test,i' = {z_{1,i'}, z_{2,i'}, z_{3,i'}, ..., z_{K,i'}}, i' ∈ P_test

Step (6): use the following nearest-neighbor KNN classifier model to compute the distance Dis between the training image set P_train and the real-time test image set P_test over the latent semantic feature vectors; the class at minimum distance is the corresponding object class:

Dis = ‖Z_train,i − Z_test,i'‖.
CN 201210062379 2012-03-12 2012-03-12 Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color Active CN102629328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210062379 CN102629328B (en) 2012-03-12 2012-03-12 Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210062379 CN102629328B (en) 2012-03-12 2012-03-12 Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color

Publications (2)

Publication Number Publication Date
CN102629328A CN102629328A (en) 2012-08-08
CN102629328B (en) 2013-10-16

Family

ID=46587586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210062379 Active CN102629328B (en) 2012-03-12 2012-03-12 Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color

Country Status (1)

Country Link
CN (1) CN102629328B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819752B (en) * 2012-08-16 2015-04-22 北京理工大学 System and method for outdoor large-scale object recognition based on distributed inverted files
CN104008095A (en) * 2013-02-25 2014-08-27 武汉三际物联网络科技有限公司 Object recognition method based on semantic feature extraction and matching
CN103336835B (en) * 2013-07-12 2017-02-08 西安电子科技大学 Image retrieval method based on weight color-sift characteristic dictionary
CN103530633B (en) * 2013-10-09 2017-01-18 深圳大学 Semantic mapping method of local invariant feature of image and semantic mapping system
CN103712617B * 2013-12-18 2016-08-24 北京工业大学 A method for creating multi-layer semantic maps based on visual content
CN104008400A (en) * 2014-06-16 2014-08-27 河南科技大学 Object recognition method with combination of SIFT and BP network
CN104598885B * 2015-01-23 2017-09-22 西安理工大学 Text label detection and localization method in street view images
CN104680189B * 2015-03-15 2018-04-10 西安电子科技大学 Objectionable image detection method based on an improved bag-of-words model
US9880009B2 (en) * 2015-09-04 2018-01-30 Crown Equipment Corporation Industrial vehicle with feature-based localization and navigation
CN105550708B * 2015-12-14 2018-12-07 北京工业大学 Visual bag-of-words construction method based on improved SURF features
CN105427263A (en) * 2015-12-21 2016-03-23 努比亚技术有限公司 Method and terminal for realizing image registering
CN105718940B * 2016-01-15 2019-03-29 天津大学 Zero-shot image classification method based on inter-group factor analysis
CN105677898B (en) * 2016-02-02 2021-07-06 中国科学技术大学 Improved image searching method based on feature difference
CN107423739B (en) * 2016-05-23 2020-11-13 北京陌上花科技有限公司 Image feature extraction method and device
CN107301426B (en) * 2017-06-14 2020-06-30 大连海事大学 Multi-label clustering method for sole pattern images
CN108109162B (en) * 2018-01-08 2021-08-10 中国石油大学(华东) Multi-scale target tracking method using self-adaptive feature fusion
CN110245667A (en) * 2018-03-08 2019-09-17 中华映管股份有限公司 Object discrimination method and its device
CN108710608A * 2018-04-28 2018-10-26 四川大学 A method for generating a malicious domain name corpus based on contextual semantics
CN109978982B (en) * 2019-04-02 2023-04-07 广东电网有限责任公司 Point cloud rapid coloring method based on oblique image
CN111291839A (en) * 2020-05-09 2020-06-16 创新奇智(南京)科技有限公司 Sample data generation method, device and equipment
CN112686840A (en) * 2020-12-16 2021-04-20 广州大学 Method, system and device for detecting straw on surface of beverage packaging box and storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5828769A (en) * 1996-10-23 1998-10-27 Autodesk, Inc. Method and apparatus for recognition of objects via position and orientation consensus of local image encoding
CN101398846A (en) * 2008-10-23 2009-04-01 上海交通大学 Image, semantic and concept detection method based on partial color space characteristic
CN102184404A (en) * 2011-04-29 2011-09-14 汉王科技股份有限公司 Method and device for acquiring palm region in palm image

Also Published As

Publication number Publication date
CN102629328A (en) 2012-08-08

Similar Documents

Publication Publication Date Title
CN102629328B (en) Probabilistic latent semantic model object image recognition method with fusion of significant characteristic of color
Liang et al. Material based salient object detection from hyperspectral images
Narihira et al. Learning lightness from human judgement on relative reflectance
US9396412B2 (en) Machine-learnt person re-identification
CN103714181B (en) A hierarchical search method for specific persons
CN103679192B (en) Image scene type identification method based on covariance feature
Liu et al. Attribute-restricted latent topic model for person re-identification
CN102509104B (en) Confidence map-based method for distinguishing and detecting virtual object of augmented reality scene
CN103927511B (en) Image recognition method based on difference feature description
CN106682108A (en) Video retrieval method based on multi-modal convolutional neural network
CN102622604B (en) Multi-angle human face detecting method based on weighting of deformable components
Shivakumara et al. A new multi-modal approach to bib number/text detection and recognition in Marathon images
CN106126585B (en) UAV image retrieval method combining quality grading with perceptual hash features
CN103824059A (en) Facial expression recognition method based on video image sequence
CN104240256A (en) Image saliency detection method based on hierarchical sparse modeling
Kobayashi et al. Three-way auto-correlation approach to motion recognition
CN104008375A (en) Integrated human face recognition method based on feature fusion
CN104268590A (en) Blind image quality evaluation method based on complementarity combination characteristics and multiphase regression
CN106909883A (en) A modular hand region detection method and device based on ROS
CN106909884A (en) A hand region detection method and device based on hierarchy and a deformable part model
Seidl et al. Automated petroglyph image segmentation with interactive classifier fusion
CN104715266A (en) Image characteristics extracting method based on combination of SRC-DP and LDA
CN103605993B (en) Image-to-video face recognition method based on scene-oriented discriminant analysis
CN105550642B (en) Gender identification method and system based on low-rank representation of multi-scale linear differential features
Wang et al. Fusion of multiple channel features for person re-identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190201

Address after: Room 501-1, Building 1, Yuntian Square, 30 Binhu Road, Wuqing Business District, Tianjin 301700

Patentee after: Maowao Technology (Tianjin) Co., Ltd.

Address before: No. 100 Pingleyuan, Chaoyang District, Beijing

Patentee before: Beijing University of Technology

TR01 Transfer of patent right