CN105740891A - Target detection method based on multilevel characteristic extraction and context model - Google Patents
- Publication number
- CN105740891A CN105740891A CN201610056601.3A CN201610056601A CN105740891A CN 105740891 A CN105740891 A CN 105740891A CN 201610056601 A CN201610056601 A CN 201610056601A CN 105740891 A CN105740891 A CN 105740891A
- Authority
- CN
- China
- Prior art keywords
- picture
- spatial position
- window
- person
- target detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 24
- 238000000605 extraction Methods 0.000 title claims abstract description 8
- 238000000034 method Methods 0.000 claims abstract description 29
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 12
- 238000012706 support-vector machine Methods 0.000 claims abstract description 8
- 238000012549 training Methods 0.000 claims description 12
- 239000013598 vector Substances 0.000 claims description 8
- 238000005457 optimization Methods 0.000 claims description 6
- 238000004364 calculation method Methods 0.000 claims description 2
- 238000010276 construction Methods 0.000 claims description 2
- 238000010845 search algorithm Methods 0.000 claims description 2
- 238000012360 testing method Methods 0.000 claims description 2
- 230000007423 decrease Effects 0.000 claims 1
- 230000001629 suppression Effects 0.000 abstract description 7
- 238000002474 experimental method Methods 0.000 description 3
- 238000010801 machine learning Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000003247 decreasing effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000011179 visual inspection Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a target detection method based on multilevel feature extraction and a context model. The constructed model gathers statistics on the spatial position relationships between objects in real pictures in order to improve target detection accuracy; objects of both the same category and different categories exhibit certain spatial position relationships. First, selective search is applied to a picture to generate a large number of region proposals; then features are extracted from all region proposals of each picture with a seven-layer convolutional neural network; finally, a support vector machine performs classification. The invention provides a new method for finding the optimal object detection position. Its main technical contribution is a new context model that replaces the original non-maximum suppression method and yields better target detection accuracy.
Description
Technical Field
The invention belongs to the field of machine learning, and particularly relates to an algorithm that applies machine learning to target detection in image processing, i.e., locating the position of a target within a picture.
Background
Target detection is the task of locating targets in an image; it combines target segmentation and recognition into one step. Surveying the average precision of algorithms trained on the standard PASCAL VOC detection benchmark between 2010 and 2012, it is not hard to see that progress was slow: almost all entries were integrated systems or small improvements over existing algorithms. In 2012, Alex Krizhevsky applied convolutional neural networks to image classification with great success. Building on this, Ross Girshick applied convolutional neural networks to the PASCAL VOC data set to extract image features and used a linear support vector machine to classify them, determining the category of each region and thereby realizing target detection.
Concretely, Ross Girshick's method first applies selective search to a picture to generate a large number of region proposals, then extracts features from all region proposals of each picture with a 7-layer convolutional neural network, and finally classifies them with a support vector machine.
Traditional target detection algorithms generally solve the localization problem with a sliding window. Because Ross Girshick's method uses a 7-layer convolutional neural network, the feature map obtained after 5 convolutional layers is very small, which makes a sliding window unsuitable and, in any case, slow. Ross Girshick therefore generated a large number of region proposals with selective search. First the picture is segmented: the image is represented as a graph in which each pixel is a vertex, the relationship between neighbouring pixels forms an edge, and the edge weight is the grey-level difference between the pixels; a minimum-spanning-tree algorithm then merges pixels into regions. The second part merges these regions: the colour similarity, texture similarity, size similarity, and fill (matching) similarity between all regions are computed and combined into a final similarity. The 2 most similar regions are merged, and the similarity between the newly created region and all remaining regions is recomputed. This process repeats until the whole picture is aggregated into one large region; each region is then scored, the regions are sorted by score, and the first K subsets are selected.
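The merging phase described above can be sketched as follows. This is an illustrative simplification, not the patented implementation: only a colour-histogram term and a size term stand in for the four similarities, and the dictionary-based region representation is an assumption.

```python
# Sketch of selective-search-style region merging: repeatedly merge the
# most similar pair of regions and keep every intermediate region as a
# proposal. Region features (a colour histogram and a pixel count) and
# the two-term similarity are simplifying assumptions.
import numpy as np

def similarity(a, b, im_size):
    """Combined similarity of two regions (colour + size terms only)."""
    s_colour = np.minimum(a["hist"], b["hist"]).sum()    # histogram intersection
    s_size = 1.0 - (a["size"] + b["size"]) / im_size     # favour merging small regions
    return s_colour + s_size

def merge(a, b):
    """Merge two regions, size-weighting their colour histograms."""
    size = a["size"] + b["size"]
    hist = (a["hist"] * a["size"] + b["hist"] * b["size"]) / size
    return {"hist": hist, "size": size}

def region_proposals(regions, im_size):
    """Greedily merge the most similar pair until one region remains,
    recording every intermediate region as a proposal."""
    regions = list(regions)
    proposals = list(regions)
    while len(regions) > 1:
        pairs = [(similarity(regions[i], regions[j], im_size), i, j)
                 for i in range(len(regions)) for j in range(i + 1, len(regions))]
        _, i, j = max(pairs)
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals
```

Keeping each intermediate merged region mirrors how selective search harvests candidate boxes at every scale of the merge hierarchy.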
The 7-layer convolutional neural network proposed by Alex Krizhevsky serves as the framework for extracting features from all region proposals of each picture. The first 5 layers are convolutional layers and the last 2 are fully connected layers. Because the network requires a fixed 227 × 227 input, while the region proposals produced by selective search vary in size, the length and width of each region proposal are first adjusted before it is fed into the network. Finally, a linear support vector machine classifies the network output, thereby detecting targets. To eliminate redundant boxes and find the best object detection position, non-maximum suppression is generally used.
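The non-maximum suppression step mentioned above can be sketched in a few lines. Boxes are (x1, y1, x2, y2) tuples and scores come from the classifier; the 0.5 IoU threshold is a common default and an assumption here, not a value given in the text.

```python
# Minimal greedy non-maximum suppression: keep the highest-scoring box,
# drop every remaining box that overlaps it too much, repeat.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Return indices of the boxes kept after suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```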
Object of the Invention
The invention provides a new method for finding the optimal object detection position. The main technical problem it solves is to provide a new context model that replaces the original non-maximum suppression method, in order to obtain better target detection accuracy.
The model constructed by the invention gathers statistics on the spatial position relationships between objects in real pictures, thereby improving the accuracy of target detection. Whether objects belong to the same class or different classes, they exhibit certain spatial position relationships. For example, for the two objects person and bicycle, the relationship is usually that the person is on the bicycle (above) or beside the bicycle (next-to); the relationship of the bicycle being on the person (above) rarely occurs. The flow of the invention is shown in Figure 1, and the main steps are as follows:
S1 Constructing the context model
Firstly, a context model is constructed to capture the relationships between the target detectors. A picture is represented by a series of overlapping windows, where the position of the i-th window is given by its centre together with its length and width, written l_i. N denotes the number of windows in a picture, x_i denotes the picture features extracted from the i-th window, and X = {x_i : i = 1, …, N} for the entire picture. K denotes the number of object classes (the method uses the PASCAL VOC data set, so K = 20), y_i ∈ {0, …, K} denotes the label of the i-th window, with 0 denoting the background, and Y = {y_i : i = 1, …, N}. The score of X and Y is defined as:

S(X, Y) = Σ_{i,j} w_{y_i, y_j}^T d_{ij} + Σ_i w_{y_i}^T x_i        (1)
where w_{y_i, y_j} denotes the pairwise weight between class y_i and class y_j, w_{y_i} denotes the local template of class y_i, and d_{ij} denotes the spatial position relationship between window i and window j. The relationship is divided into above, below, overlapping, next-to, near, and far; thus d_{ij} is a sparse one-dimensional vector in which only the entries whose spatial relationship actually holds are set to 1. For example, the spatial relationship between two people is usually next-to rather than above, so the entry for next-to is 1 and the entries for above and the other relations are 0.
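How the sparse indicator d_{ij} might be computed can be sketched as follows. The centre-based geometric tests and thresholds are illustrative assumptions: the text only names the relations and states that the matching entries are set to 1.

```python
# Sketch of computing the sparse spatial-relation indicator d_ij between
# two windows given as (cx, cy, w, h). Relation names follow the text;
# the geometric tests deciding which relation holds are assumptions.
RELATIONS = ("above", "below", "overlapping", "next-to", "near", "far")

def spatial_relation(wi, wj):
    """Return a one-hot list over RELATIONS for the pair (wi, wj)."""
    d = [0] * len(RELATIONS)
    dx, dy = wj[0] - wi[0], wj[1] - wi[1]
    overlap_x = abs(dx) < (wi[2] + wj[2]) / 2     # horizontal extents overlap
    overlap_y = abs(dy) < (wi[3] + wj[3]) / 2     # vertical extents overlap
    if overlap_x and overlap_y:
        d[RELATIONS.index("overlapping")] = 1
    elif overlap_x and dy > 0:
        d[RELATIONS.index("above")] = 1           # wi sits above wj (image y grows down)
    elif overlap_x and dy < 0:
        d[RELATIONS.index("below")] = 1
    elif overlap_y:
        d[RELATIONS.index("next-to")] = 1
    elif (dx * dx + dy * dy) ** 0.5 < 2 * max(wi[2], wi[3]):
        d[RELATIONS.index("near")] = 1
    else:
        d[RELATIONS.index("far")] = 1
    return d
```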
Since computing max_Y S(X, Y) is NP-hard (non-deterministic polynomial-time hard), the method uses a greedy algorithm to approximate it. The algorithm proceeds as follows:
(1) initialize the label vector Y by assigning every window to the background class;
(2) greedily select the single non-background window assignment that increases the value of S(X, Y) the most;
(3) stop when selecting any further window no longer increases the value of S(X, Y) but decreases it.
Let I denote a set of instantiated window-class pairs (i, c), and let Y(I) denote the associated label vector: y_i = c when (i, c) ∈ I, and y_i = 0 otherwise. Adding a window-class pair (i, c) to the set I changes the value of S(X, Y) by:

Δ(i, c) = S(X, Y(I ∪ {(i, c)})) − S(X, Y(I))

Initialize I = {}, S = 0 and Δ(i, c) = w_c^T x_i, then iterate:

1) (i*, c*) = argmax_{(i, c), i not yet instantiated} Δ(i, c)
2) I = I ∪ {(i*, c*)}
3) S = S + Δ(i*, c*)
4) Δ(i, c) = Δ(i, c) + w_{c*, c}^T d_{i*, i} + w_{c, c*}^T d_{i, i*}

The iteration ends when Δ(i*, c*) ≤ 0 or when all windows have been instantiated.
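The greedy inference loop above can be sketched as runnable code, assuming the score of equation (1) with unary terms w_c^T x_i and pairwise terms w_{c,c'}^T d_{ij}. The array shapes and the toy data in the usage are assumptions for illustration.

```python
# Sketch of the greedy decoding of the context model: start with all
# windows as background, repeatedly instantiate the (window, class) pair
# with the largest score gain, and update the gains of the rest with the
# new pairwise terms.
import numpy as np

def greedy_decode(x, d, w_unary, w_pair):
    """x: (N, F) window features; d: (N, N, R) spatial relations;
    w_unary: (K, F); w_pair: (K, K, R). Returns labels (0 = background)."""
    N, K = x.shape[0], w_unary.shape[0]
    labels = [0] * N                       # (1) start with all background
    delta = x @ w_unary.T                  # initial gain of (i, c): w_c^T x_i
    instantiated = set()
    while len(instantiated) < N:
        i, c = np.unravel_index(np.argmax(delta), delta.shape)
        if delta[i, c] <= 0:               # (3) no assignment still increases S
            break
        labels[i] = c + 1                  # classes are 1..K, 0 is background
        instantiated.add(i)
        delta[i, :] = -np.inf              # window i is now fixed
        # (4) update the gain of every remaining (j, c2) by the new pairwise terms
        for j in range(N):
            if j not in instantiated:
                for c2 in range(K):
                    delta[j, c2] += w_pair[c, c2] @ d[i, j] + w_pair[c2, c] @ d[j, i]
    return labels
```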
S2 convex training with tangent plane optimization
To describe the learning algorithm of the invention, equation (1) is rewritten in linear form:

S(X, Y) = w^T Ψ(X, Y)        (2)

where w stacks all pairwise weights w_{c, c'} and local templates w_c, and Ψ(X, Y) stacks the corresponding sums of spatial-relationship vectors d_{ij} and window features x_i.

The purpose of convex training is the following: given a series of training pictures X_i with labels Y_i, obtain an optimal model w such that, for a new picture, the generated label vector Y is as close as possible to the true labels Y_i. Obtaining the optimal w by convex training amounts to solving the following optimization problem:

min_{w, ξ} (1/2)·‖w‖² + C·Σ_i ξ_i        (3)
s.t. for all i and all H_i:  w^T ΔΨ(X_i, Y_i, H_i) ≥ l(Y_i, H_i) − ξ_i

where ΔΨ(X_i, Y_i, H_i) = Ψ(X_i, Y_i) − Ψ(X_i, H_i), H_i is a candidate labelling computed by the model, and l(Y_i, H_i) measures the loss between the true labelling Y_i and the candidate H_i.

For ease of optimization, the constrained problem of equation (3) is equivalent to the unconstrained problem of equation (4):

min_w (1/2)·‖w‖² + C·Σ_i max_{H_i} [ l(Y_i, H_i) − w^T ΔΨ(X_i, Y_i, H_i) ]₊        (4)

where [z]₊ = max(0, z). Applying cutting-plane optimization to equation (4) yields the optimal model w.
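The unconstrained training objective of equation (4) can be sketched as follows. The joint feature map Ψ, the loss l, and the set of candidate labellings are toy stand-ins supplied by the caller, not the method's exact ones; a cutting-plane solver would repeatedly evaluate exactly this quantity while collecting the most violated constraints.

```python
# Sketch of the regularised structured hinge objective of equation (4):
# 0.5*||w||^2 + C * sum_i max_H [ l(Y_i, H) - w.(psi(X_i, Y_i) - psi(X_i, H)) ]_+
# Here the max over labellings H is taken by enumerating a small
# caller-supplied candidate set, an assumption for illustration.
import numpy as np

def structured_hinge_objective(w, data, psi, loss, candidates, C=1.0):
    """data: list of (X, Y_true); candidates(X): iterable of labellings H."""
    obj = 0.5 * float(np.dot(w, w))
    for X, Y in data:
        worst = max(loss(Y, H) - w @ (psi(X, Y) - psi(X, H))
                    for H in candidates(X))        # most violated constraint
        obj += C * max(0.0, worst)
    return obj
```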
S3 Overall implementation flow
Firstly, obtaining the regionproposals of the picture through a search algorithm, secondly, adopting a trained 7-layer convolutional neural network CNNS of RossGirshick to extract the characteristics of the picture, changing the final output of the 7-layer CNNS network structure of RossGirshick into 21(20 VOC classes and 1 background class) because the training set and the test set of the invention both adopt PASCALLVOC data sets (20 classes), and finally, carrying out picture classification by using a linear Support Vector Machine (SVM), and in the classification process, in order to find the position of the best object detection, using a trained context model to carry out more accurate positioning, thereby finally obtaining better target detection accuracy.
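The overall flow of S3 can be summarised as a thin driver over stubbed components. Every callable here is a placeholder for a stage named in the text (selective search, CNN features, 21-way SVM scoring, context-model decoding), not a real API.

```python
# End-to-end sketch of the S3 pipeline; all stage implementations are
# injected as callables so the flow itself stays explicit.
def detect(image, proposals, cnn_features, svm_scores, context_decode):
    boxes = proposals(image)                           # selective-search region proposals
    feats = [cnn_features(image, b) for b in boxes]    # CNN features per proposal
    scores = [svm_scores(f) for f in feats]            # 21-way scores (20 VOC + background)
    return context_decode(boxes, scores)               # context-model labelling, not NMS
```

Wiring in trivial lambdas is enough to exercise the flow; real components would slot into the same signature.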
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a spatial positional relationship explanatory diagram.
Detailed Description
As shown in Figs. 1-2, experiments were performed according to the above method. The experiments compare results with and without the context model, using the 20-class PASCAL VOC data set. Since the method only stores 7 spatial position relationships between categories (of the same or different class), when one of the 7 relationships is satisfied, the target detection accuracy of the corresponding class should be higher than that of the method trained without the context model. Conversely, when the spatial relationships are not well determined, the learned context model may act in the opposite direction and, since non-maximum suppression has been replaced, the detection result may be disturbed, lowering the corresponding target detection accuracy.
Table 1: comparison of Experimental results
Table 1: experimental results comparison Class | No context model (unit%) | Context model (unit%) |
aero | 66.9 | 70.7 |
bike | 23.7 | 21.2 |
bird | 58.3 | 53.7 |
boat | 37.4 | 39.8 |
bottle | 55.4 | 50.1 |
bus | 73.3 | 35.8 |
car | 58.7 | 34.8 |
cat | 56.5 | 59.5 |
chair | 9.7 | 9.6 |
cow | 45.5 | 53 |
table | 29.5 | 15.9 |
dog | 49.3 | 43.6 |
horse | 40.1 | 34 |
mbike | 57.8 | 52.8 |
person | 53.9 | 57.4 |
plant | 33.8 | 13.3 |
sheep | 60.7 | 36.9 |
soft | 22.7 | 23.2 |
train | 47.1 | 55.9 |
tv | 41.3 | 41.9 |
Claims (2)
1. The target detection method based on multi-level feature extraction and context model is characterized in that:
the model constructed by the method mainly counts the spatial position relation between images in a real picture, so that the accuracy of target detection can be improved; whether the images are of the same type or different types, the images have certain spatial position relations; the two images of the person and the bicycle have the spatial position relationship that the person is on the bicycle (above) or the person is beside the bicycle (next-to), and the spatial position relationship that the bicycle is on the person (above) rarely occurs; the spatial position relationship of a person and a person is generally the spatial position relationship of the person beside the person (next-to) and few persons on the person (above); the main steps of the method are as follows,
S1 Constructing the context model
Firstly, a context model is constructed to capture the relationships between the target detectors; a picture is represented by a series of overlapping windows, where the position of the i-th window is given by its centre together with its length and width, written l_i; N denotes the number of windows in a picture, x_i denotes the picture features extracted from the i-th window, and X = {x_i : i = 1, …, N} for the entire picture; K denotes the number of object classes (the method uses the PASCAL VOC data set, so K = 20), y_i ∈ {0, …, K} denotes the label of the i-th window, with 0 denoting the background, and Y = {y_i : i = 1, …, N}; the score of X and Y is defined as:

S(X, Y) = Σ_{i,j} w_{y_i, y_j}^T d_{ij} + Σ_i w_{y_i}^T x_i        (1)
where w_{y_i, y_j} denotes the pairwise weight between class y_i and class y_j, w_{y_i} denotes the local template of class y_i, and d_{ij} denotes the spatial position relationship between window i and window j; the relationship is divided into above, below, overlapping, next-to, near, and far, so d_{ij} is a sparse one-dimensional vector in which only the entries whose spatial relationship actually holds are set to 1; the spatial relationship between two people is usually next-to rather than above, so the entry for next-to is assigned 1 and the entries for above and the other relations are assigned 0;
since the calculation maxS (X, Y) is a non-deterministic polynomial NP (non-deterministic polynomial) hard, the method employs greedy algorithm-like to solve the NPhard problem;
let I denote a set of instantiated window-class pairs (i, c), and let Y(I) denote the associated label vector: y_i = c when (i, c) ∈ I, and y_i = 0 otherwise; adding a window-class pair (i, c) to the set I changes the value of S(X, Y) by:

Δ(i, c) = S(X, Y(I ∪ {(i, c)})) − S(X, Y(I))

initialize I = {}, S = 0 and Δ(i, c) = w_c^T x_i, then iterate:

1) (i*, c*) = argmax_{(i, c), i not yet instantiated} Δ(i, c)
2) I = I ∪ {(i*, c*)}
3) S = S + Δ(i*, c*)
4) Δ(i, c) = Δ(i, c) + w_{c*, c}^T d_{i*, i} + w_{c, c*}^T d_{i, i*}

the iteration ends when Δ(i*, c*) ≤ 0 or when all windows have been instantiated;
S2 Convex training with cutting-plane optimization
to describe the learning algorithm of the method, equation (1) is rewritten in linear form:

S(X, Y) = w^T Ψ(X, Y)        (2)

where w stacks all pairwise weights w_{c, c'} and local templates w_c, and Ψ(X, Y) stacks the corresponding sums of spatial-relationship vectors d_{ij} and window features x_i;

the purpose of convex training is the following: given a series of training pictures X_i with labels Y_i, obtain an optimal model w such that, for a new picture, the generated label vector Y is as close as possible to the true labels Y_i; obtaining the optimal w by convex training amounts to solving the following optimization problem:

min_{w, ξ} (1/2)·‖w‖² + C·Σ_i ξ_i        (3)
s.t. for all i and all H_i:  w^T ΔΨ(X_i, Y_i, H_i) ≥ l(Y_i, H_i) − ξ_i

where ΔΨ(X_i, Y_i, H_i) = Ψ(X_i, Y_i) − Ψ(X_i, H_i), H_i is a candidate labelling computed by the model, and l(Y_i, H_i) measures the loss between the true labelling Y_i and the candidate H_i;

for ease of optimization, the constrained problem of equation (3) is equivalent to the unconstrained problem of equation (4):

min_w (1/2)·‖w‖² + C·Σ_i max_{H_i} [ l(Y_i, H_i) − w^T ΔΨ(X_i, Y_i, H_i) ]₊        (4)

where [z]₊ = max(0, z); applying cutting-plane optimization to equation (4) yields the optimal model w;
S3 Overall implementation flow
first, the region proposals of a picture are obtained through selective search; second, features are extracted with Ross Girshick's trained 7-layer convolutional neural network (CNN); because both the training set and the test set of the method use the PASCAL VOC data set (20 classes), the final output of the 7-layer CNN structure is changed to 21 (20 VOC classes plus 1 background class); finally, a linear support vector machine (SVM) classifies the pictures, and during classification, in order to find the best object detection position, the trained context model is used for more accurate localization, finally obtaining better target detection accuracy.
2. The method for detecting a target based on multi-level feature extraction and context model as claimed in claim 1, wherein:
the steps of the algorithm are as follows,
(1) initializing a vector Y of each window into a background class;
(2) the greedy selection is not a single window of the background class, and the value of S (X, Y) is increased to the maximum extent;
(3) when any one of the windows is selected, the value of S (X, Y) does not increase but decreases, and stops.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610056601.3A CN105740891B (en) | 2016-01-27 | 2016-01-27 | Target detection based on multi level feature selection and context model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610056601.3A CN105740891B (en) | 2016-01-27 | 2016-01-27 | Target detection based on multi level feature selection and context model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105740891A true CN105740891A (en) | 2016-07-06 |
CN105740891B CN105740891B (en) | 2019-10-08 |
Family
ID=56247276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610056601.3A Active CN105740891B (en) | 2016-01-27 | 2016-01-27 | Target detection based on multi level feature selection and context model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740891B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372597A (en) * | 2016-08-31 | 2017-02-01 | 李涛 | CNN traffic detection method based on adaptive context information |
CN106408618A (en) * | 2016-08-31 | 2017-02-15 | 上海交通大学 | Image deconstruction method based on machine learning |
CN106446933A (en) * | 2016-08-31 | 2017-02-22 | 河南广播电视大学 | Multi-target detection method based on context information |
CN106991296A (en) * | 2017-04-01 | 2017-07-28 | 大连理工大学 | Ensemble classifier method based on the greedy feature selecting of randomization |
CN107239827A (en) * | 2017-06-18 | 2017-10-10 | 北京理工大学 | A kind of spatial information learning method based on artificial neural network |
CN108229519A (en) * | 2017-02-17 | 2018-06-29 | 北京市商汤科技开发有限公司 | The method, apparatus and system of image classification |
CN108830903A (en) * | 2018-04-28 | 2018-11-16 | 杨晓春 | A kind of steel billet method for detecting position based on CNN |
CN108846047A (en) * | 2018-05-30 | 2018-11-20 | 百卓网络科技有限公司 | A kind of picture retrieval method and system based on convolution feature |
CN110298402A (en) * | 2019-07-01 | 2019-10-01 | 国网内蒙古东部电力有限公司 | A kind of small target deteection performance optimization method |
CN111553228A (en) * | 2020-04-21 | 2020-08-18 | 佳都新太科技股份有限公司 | Method, device, equipment and storage medium for detecting personal bag relationship |
US10963676B2 (en) | 2016-12-23 | 2021-03-30 | Samsung Electronics Co., Ltd. | Image processing method and apparatus |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070297675A1 (en) * | 2006-06-26 | 2007-12-27 | Shih-Jong J. Lee | Method of directed feature development for image pattern recognition |
CN102110227A (en) * | 2010-11-24 | 2011-06-29 | 清华大学 | Compound method for classifying multiresolution remote sensing images based on context |
CN102495865A (en) * | 2011-11-28 | 2012-06-13 | 南京大学 | Image annotation method combined with image internal space relation and visual symbiosis relation |
CN103514456A (en) * | 2013-06-30 | 2014-01-15 | 安科智慧城市技术(中国)有限公司 | Image classification method and device based on compressed sensing multi-core learning |
CN104778466A (en) * | 2015-04-16 | 2015-07-15 | 北京航空航天大学 | Detection method combining various context clues for image focus region |
-
2016
- 2016-01-27 CN CN201610056601.3A patent/CN105740891B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070297675A1 (en) * | 2006-06-26 | 2007-12-27 | Shih-Jong J. Lee | Method of directed feature development for image pattern recognition |
CN102110227A (en) * | 2010-11-24 | 2011-06-29 | 清华大学 | Compound method for classifying multiresolution remote sensing images based on context |
CN102495865A (en) * | 2011-11-28 | 2012-06-13 | 南京大学 | Image annotation method combined with image internal space relation and visual symbiosis relation |
CN103514456A (en) * | 2013-06-30 | 2014-01-15 | 安科智慧城市技术(中国)有限公司 | Image classification method and device based on compressed sensing multi-core learning |
CN104778466A (en) * | 2015-04-16 | 2015-07-15 | 北京航空航天大学 | Detection method combining various context clues for image focus region |
Non-Patent Citations (1)
Title |
---|
刘扬 等: "高分辨率遥感影像目标分类与识别研究进展", 《地球信息科学学报》 * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446933B (en) * | 2016-08-31 | 2019-08-02 | 河南广播电视大学 | Multi-target detection method based on contextual information |
CN106408618A (en) * | 2016-08-31 | 2017-02-15 | 上海交通大学 | Image deconstruction method based on machine learning |
CN106446933A (en) * | 2016-08-31 | 2017-02-22 | 河南广播电视大学 | Multi-target detection method based on context information |
CN106372597A (en) * | 2016-08-31 | 2017-02-01 | 李涛 | CNN traffic detection method based on adaptive context information |
CN106372597B (en) * | 2016-08-31 | 2019-09-13 | 郑州禅图智能科技有限公司 | CNN Vehicle Detection method based on adaptive contextual information |
CN106408618B (en) * | 2016-08-31 | 2019-05-07 | 上海交通大学 | A kind of image destructing method based on machine learning |
US10963676B2 (en) | 2016-12-23 | 2021-03-30 | Samsung Electronics Co., Ltd. | Image processing method and apparatus |
CN108229519B (en) * | 2017-02-17 | 2020-09-04 | 北京市商汤科技开发有限公司 | Image classification method, device and system |
CN108229519A (en) * | 2017-02-17 | 2018-06-29 | 北京市商汤科技开发有限公司 | The method, apparatus and system of image classification |
CN106991296B (en) * | 2017-04-01 | 2019-12-27 | 大连理工大学 | Integrated classification method based on randomized greedy feature selection |
CN106991296A (en) * | 2017-04-01 | 2017-07-28 | 大连理工大学 | Ensemble classifier method based on the greedy feature selecting of randomization |
CN107239827B (en) * | 2017-06-18 | 2020-06-09 | 北京理工大学 | Spatial information learning method based on artificial neural network |
CN107239827A (en) * | 2017-06-18 | 2017-10-10 | 北京理工大学 | A kind of spatial information learning method based on artificial neural network |
CN108830903A (en) * | 2018-04-28 | 2018-11-16 | 杨晓春 | A kind of steel billet method for detecting position based on CNN |
CN108846047A (en) * | 2018-05-30 | 2018-11-20 | 百卓网络科技有限公司 | A kind of picture retrieval method and system based on convolution feature |
CN110298402A (en) * | 2019-07-01 | 2019-10-01 | 国网内蒙古东部电力有限公司 | A kind of small target deteection performance optimization method |
CN111553228A (en) * | 2020-04-21 | 2020-08-18 | 佳都新太科技股份有限公司 | Method, device, equipment and storage medium for detecting personal bag relationship |
Also Published As
Publication number | Publication date |
---|---|
CN105740891B (en) | 2019-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105740891B (en) | Target detection based on multi level feature selection and context model | |
US11429818B2 (en) | Method, system and device for multi-label object detection based on an object detection network | |
CN108427920B (en) | Edge-sea defense target detection method based on deep learning | |
US11574152B2 (en) | Recognition system for security check and control method thereof | |
CN105740909B (en) | Text recognition method under a kind of natural scene based on spatial alternation | |
CN110334765B (en) | Remote sensing image classification method based on attention mechanism multi-scale deep learning | |
US10867167B2 (en) | Collaborative deep network model method for pedestrian detection | |
CN111160440B (en) | Deep learning-based safety helmet wearing detection method and device | |
Thai et al. | Image classification using support vector machine and artificial neural network | |
CN108388896A (en) | A kind of licence plate recognition method based on dynamic time sequence convolutional neural networks | |
CN107423760A (en) | Based on pre-segmentation and the deep learning object detection method returned | |
CN104992167A (en) | Convolution neural network based face detection method and apparatus | |
CN105740892A (en) | High-accuracy human body multi-position identification method based on convolutional neural network | |
CN106022363B (en) | A kind of Chinese text recognition methods suitable under natural scene | |
CN107767416B (en) | Method for identifying pedestrian orientation in low-resolution image | |
CN104599275A (en) | Understanding method of non-parametric RGB-D scene based on probabilistic graphical model | |
CN105184298A (en) | Image classification method through fast and locality-constrained low-rank coding process | |
CN107784288A (en) | A kind of iteration positioning formula method for detecting human face based on deep neural network | |
CN110543906B (en) | Automatic skin recognition method based on Mask R-CNN model | |
CN103186776B (en) | Based on the human body detecting method of multiple features and depth information | |
CN112861970B (en) | Fine-grained image classification method based on feature fusion | |
CN109858327B (en) | Character segmentation method based on deep learning | |
CN113221956B (en) | Target identification method and device based on improved multi-scale depth model | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
CN116796248A (en) | Forest health environment assessment system and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20211122 Address after: 518052 Room 201, building A, 1 front Bay Road, Shenzhen Qianhai cooperation zone, Shenzhen, Guangdong Patentee after: Shenzhen Xiaofeng Technology Co.,Ltd. Address before: 100124 No. 100 Chaoyang District Ping Tian Park, Beijing Patentee before: Beijing University of Technology |