CN108491880A - Object classification and pose estimation method based on neural network - Google Patents

Object classification and pose estimation method based on neural network

Info

Publication number
CN108491880A
Authority
CN
China
Prior art keywords
layers
input
CAD model
layer
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810243399.4A
Other languages
Chinese (zh)
Other versions
CN108491880B (en)
Inventor
张向东
张泽宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810243399.4A priority Critical patent/CN108491880B/en
Publication of CN108491880A publication Critical patent/CN108491880A/en
Application granted granted Critical
Publication of CN108491880B publication Critical patent/CN108491880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object classification and pose estimation method based on a neural network, which mainly solves the problem of low accuracy when the prior art performs object detection and attitude estimation with convolutional neural networks. The implementation is: 1) obtain the multi-view images of each CAD model in the data set; 2) build the mathematical model of joint detection from the multi-view images of the CAD models; 3) build a convolutional neural network and train it with the multi-view images of the CAD models; 4) input the multi-view images of each CAD model in the test set into the network and output the class label and pose label predicted by the network. The invention combines shallow and deep feature maps of the neural network, so that the fused feature map retains both rich pose information and good classification information, improving the accuracy of classification and pose estimation. It can be used for intelligent robotic arms and robot grasping.

Description

Object classification and pose estimation method based on neural network
Technical field
The invention belongs to the field of artificial intelligence and relates to an object classification and pose estimation method, which can be used for intelligent robotic arm and robot grasping.
Background art
A convolutional neural network (CNN) is a feedforward neural network composed of convolutional layers, fully connected layers, pooling layers and activation layers. Compared with traditional fully connected networks, the local connectivity and weight sharing of a CNN make the neurons on the same feature map share identical weights, which greatly reduces the number of network parameters and the complexity of the network. Activation functions have also gradually evolved from the sigmoid to the single-sided-inhibiting ReLU, and this continuous improvement brings the neurons closer to the activation characteristics of biological neurons. In addition, a CNN avoids complex image preprocessing, including elaborate feature extraction and data reconstruction, and can take the original image directly as input. Gradient descent and the chain rule of differentiation allow the network to alternate well between forward propagation and back propagation, continuously improving detection accuracy. Among the many deep learning frameworks, Caffe is a relatively common one, widely used for video and image processing. Caffe is modular, separates model expression from implementation, makes switching between GPU and CPU convenient, and provides Python and Matlab interfaces, so the network structure can easily be adjusted and the network trained with it.
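As an illustration of the Caffe workflow described above, the following minimal Python sketch shows the CPU/GPU switch and a single forward pass through a trained network; the file names and the 'prob' output blob name are placeholders, not part of the invention:

```python
import numpy as np
import caffe

# One of the conveniences noted above: switching between GPU and CPU execution.
caffe.set_mode_gpu()
caffe.set_device(0)       # use caffe.set_mode_cpu() on machines without a GPU

# Load a trained network; both file names here are hypothetical placeholders.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# Run one forward pass on a random 227*227 RGB image (batch size 1).
net.blobs['data'].reshape(1, 3, 227, 227)
net.blobs['data'].data[...] = np.random.rand(1, 3, 227, 227)
out = net.forward()
print(out['prob'].shape)  # class probabilities from the Softmax layer
```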
In recent years, deep learning has made remarkable progress in image classification, object detection, semantic segmentation, instance segmentation and related tasks. A general vision system needs to solve two problems: object classification and object attitude (pose) estimation, where pose estimation refers to the posture of the object relative to the camera. Object pose estimation is crucial in many applications, such as robot grasping. But object classification and pose estimation are in conflict: a classification system must classify an object correctly regardless of its pose, so it learns viewpoint-independent features, whereas for pose estimation the system needs to learn features that preserve the geometry and appearance of the object in order to distinguish its poses. In a convolutional neural network, the shallow feature maps tend to be more general and class-ambiguous but contain more features distinguishing different poses, while the deep feature maps are more abstract, with more distinct class features, but their specific pose information is blurred by the high level of abstraction. Existing detection methods generally select an intermediate feature layer whose features perform reasonably well for both classification and pose estimation; this is a compromise and cannot make the accuracy of object detection and of attitude estimation reach their best simultaneously.
MVCNN, an object classification and pose estimation method proposed by Hang Su et al. in 2015, converts sampled 3D data into 2D multi-view pictures, reducing the dimensionality of the data while maintaining detection accuracy. Although this simplifies processing, it requires extracting features from pictures of all views of the object and then merging the information of all the view pictures. In actual scenes, occlusion and obstruction make it difficult to collect multi-view pictures of an object from all predefined viewpoints, so the approach does not meet the demands of actual scenes.
Summary of the invention
The object of the present invention is to overcome the above shortcomings of the prior art and propose an object classification and pose estimation method based on a neural network, so as to improve the accuracy of object detection and pose estimation, speed up detection, and meet the demands of actual scenes.
The technical idea of the present invention is: to improve object detection and pose estimation accuracy by combining shallow and deep features of the convolutional neural network, and to speed up detection by iterating over images of only part of the views of the detected object. The implementation scheme includes the following:
(1) Obtain the training set and test set, and set the corresponding images of the CAD models:

Take 3429 CAD models from the ModelNet10 data set as the training set and 1469 CAD models as the test set;

For the CAD model of each sample in the ModelNet10 data set, apply two preprocessing strategies in turn: the first evenly sets 12 predefined viewpoints on the viewing circle where the CAD model is located and collects the corresponding image of the CAD model at each of these 12 predefined viewpoints; the second places the CAD model at the center of a regular dodecahedron, sets the 20 vertices of the regular dodecahedron as predefined viewpoints, and collects the corresponding image of the CAD model at each of these 20 predefined viewpoints;
(2) Build the mathematical model of joint detection from the preprocessed multi-view images of each CAD model in the data set:

(2a) Take the pose label of each CAD model as a hidden variable, denoted {v_i};

(2b) Define the M different-view images {x_i | i = 1, …, M} of a CAD model together with its class label y ∈ {1, …, N} as a training sample, where N is the total number of CAD model classes; each view image x_i corresponds to a view label v_i ∈ {1, …, M};

(2c) With the training samples defined as above, abstract the object recognition and pose estimation task as the following optimization problem:

  max_R Σ_{i=1}^{M} log P(ŷ = y | x_i; R)

where R is the neural network weight parameter, ŷ is the class label predicted by the network, and P(ŷ = y | x_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN is y;
(3) Build and train the convolutional neural network CNN:

(3a) On the basis of the existing AlexNet network, add an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer, obtaining a convolutional neural network CNN with 16 layers, in which:

the Eltwise1 layer fuses the feature maps of the Conv3 and Conv4 layers of the AlexNet network at corresponding positions;

the fc_a1 layer maps the feature maps of the Eltwise1 layer to a feature vector;

the fc_a2 layer maps the feature maps of the Pool5 layer of the AlexNet network to a feature vector;

the Eltwise2 layer fuses the feature vectors of the fc_a1, fc_a2 and fc7 layers at corresponding positions;
(3b) Input the multi-view images {x_i} of each CAD model in the training set into the convolutional network, iterate the forward computation and back propagation of the convolutional neural network CNN to train the network and optimize the network parameter R until the loss function of the network satisfies J(θ) ≤ 0.0001, obtaining the trained neural network CNN;
(4) Test the network:

Input the multi-view images {x_i} of each CAD model in the ModelNet10 test set into the trained neural network and count the accuracy of object classification and attitude estimation.
Compared with the prior art, the present invention has the following advantages:

1. Because the present invention fuses, element by corresponding element, feature maps of different depths in the convolutional neural network, the new feature map obtained by fusion contains both the rich pose information of the shallow feature maps and the abstract class information of the deep feature maps, thus improving the detection accuracy.

2. Because the present invention generates corresponding multi-view images for each 3D CAD model in the data set, i.e., converts the 3D sample data into 2D multi-view images, the data are reduced in dimension; this lowers the complexity of the data, reduces the computation of feature extraction, and speeds up detection.
Brief description of the drawings
Fig. 1 is the implementation flow chart of the present invention;
Fig. 2 is a schematic diagram of the two predefined-viewpoint strategies in the present invention;
Fig. 3 is a structure diagram of the convolutional neural network CNN built in the present invention.
Detailed description
The examples and effects of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, the present invention is implemented as follows:
Step 1, obtain the multi-view images of the CAD models.

For the CAD model of each sample in the ModelNet10 data set, apply two preprocessing strategies in turn.

As shown in Fig. 2(a), the first preprocessing strategy evenly sets 12 predefined viewpoints on the viewing circle where the CAD model is located: first fix an axis as the rotation axis, then set one observation point every 30 degrees on the viewing circle around the object, so that over the 360-degree viewing circle each CAD model yields images from 12 different views;

As shown in Fig. 2(b), the second preprocessing strategy places the CAD model at the center of a regular dodecahedron and sets the 20 vertices of the regular dodecahedron as predefined viewpoints, collecting the corresponding image of the CAD model at each of these 20 predefined viewpoints.
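A minimal numpy sketch of the two viewpoint strategies follows; the camera radius and the choice of the z-axis as rotation axis are illustrative assumptions, since the text does not fix them:

```python
import numpy as np

def circle_viewpoints(n=12, radius=1.0, elevation=0.0):
    """Strategy 1: n viewpoints spaced 360/n degrees apart on a viewing circle
    around a fixed rotation axis (here the z-axis)."""
    angles = np.deg2rad(np.arange(n) * 360.0 / n)   # every 30 degrees for n=12
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.full(n, elevation)], axis=1)

def dodecahedron_viewpoints(radius=1.0):
    """Strategy 2: the 20 vertices of a regular dodecahedron centered on the model."""
    phi = (1 + np.sqrt(5)) / 2                      # golden ratio
    b = 1.0 / phi
    verts = []
    for x in (-1, 1):                               # 8 cube vertices
        for y in (-1, 1):
            for z in (-1, 1):
                verts.append((x, y, z))
    for u in (-b, b):                               # 12 remaining vertices
        for v in (-phi, phi):
            verts += [(0, u, v), (u, v, 0), (v, 0, u)]
    verts = np.array(verts, dtype=float)
    return radius * verts / np.linalg.norm(verts[0])  # all vertices have equal norm

print(circle_viewpoints().shape)        # (12, 3)
print(dodecahedron_viewpoints().shape)  # (20, 3)
```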
Step 2, build the mathematical model of joint detection from the preprocessed multi-view images of each CAD model in the data set.

(2a) Take the pose label of each CAD model as a hidden variable, denoted {v_i};

(2b) Define the M different-view images {x_i | i = 1, …, M} of a CAD model together with its class label y ∈ {1, …, N} as a training sample, where N is the total number of CAD model classes and x_i is a view image; each view image x_i corresponds to a view label v_i ∈ {1, …, M};

(2c) With the training samples defined as above, abstract the object recognition and pose estimation task as the following optimization problem:

  max_R Σ_{i=1}^{M} log P(ŷ = y | x_i; R)

where R is the neural network weight parameter, ŷ is the class label predicted by the network, and P(ŷ = y | x_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN is y;

Denoting P(ŷ = k, v_i = j | x_i; R) by p_{k,j}^{(i)}, the optimization problem is expressed in the following form:

  max_R Σ_{i=1}^{M} log max_{j ∈ {1, …, M}} p_{k,j}^{(i)}

where (i) indexes the input image x_i, k denotes the class label of image x_i, and j indicates that image x_i is observed from the j-th predefined viewpoint.
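Under this formulation, prediction amounts to maximizing the joint class/viewpoint probabilities p_{k,j}^{(i)} while treating the viewpoint as hidden. A small numpy sketch of that decision rule follows; summing the per-image class scores before the argmax is an illustrative pooling choice, not something the text specifies:

```python
import numpy as np

def predict_class_and_pose(p):
    """p: array of shape (M_images, N_classes, M_viewpoints) holding the joint
    Softmax probabilities p[i, k, j] for each input view image x_i.
    Returns one predicted class label and one pose (viewpoint) label per image."""
    # Marginalize the hidden viewpoint with a max over j, then pool the
    # per-image scores and pick the best class k.
    per_image_class_scores = p.max(axis=2)              # (M_images, N_classes)
    y_hat = int(np.argmax(per_image_class_scores.sum(axis=0)))
    # Pose label: the viewpoint that maximizes the predicted class probability.
    v_hat = p[:, y_hat, :].argmax(axis=1)               # (M_images,)
    return y_hat, v_hat

# toy usage: 5 view images, 10 classes, 12 viewpoints
probs = np.random.dirichlet(np.ones(10 * 12), size=5).reshape(5, 10, 12)
print(predict_class_and_pose(probs))
```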
Step 3, build the convolutional neural network CNN.

(3a) Build the convolutional neural network CNN with 16 layers shown in Fig. 3. The 16 layers are, in order: the first convolutional layer Conv1, the first pooling layer Pool1, the second convolutional layer Conv2, the second pooling layer Pool2, the third convolutional layer Conv3, the fourth convolutional layer Conv4, the first feature fusion layer Eltwise1, the fifth convolutional layer Conv5, the fifth pooling layer Pool5, the first fully connected layer fc_a1, the second fully connected layer fc_a2, the third fully connected layer fc6, the fourth fully connected layer fc7, the second feature fusion layer Eltwise2, the fifth fully connected layer fc8, and the classification layer Softmax. The feature extraction details of each layer are as follows:

(3a1) The input image of 227*227 pixels is fed to the first convolutional layer Conv1, which applies 96 convolution kernels of size 11*11 pixels with a stride of 4 pixels, yielding 96 feature maps of 55*55 pixels.

(3a2) The 96 feature maps output by the first convolutional layer Conv1 are fed to the first pooling layer Pool1, which applies max pooling with a pooling block of 3*3 pixels and a stride of 2 pixels, yielding 96 feature maps of 27*27 pixels.

(3a3) The 96 feature maps output by the first pooling layer Pool1 are fed to the second convolutional layer Conv2, which applies 256 convolution kernels of size 5*5 pixels with a stride of 1, yielding 256 feature maps of 27*27 pixels.

(3a4) The 256 feature maps output by the second convolutional layer Conv2 are fed to the second pooling layer Pool2, which applies max pooling with a pooling block of 3*3 pixels and a stride of 2 pixels, yielding 256 feature maps of 13*13 pixels.

(3a5) The 256 feature maps output by the second pooling layer Pool2 are fed to the third convolutional layer Conv3, which applies 384 convolution kernels of size 3*3 pixels with a stride of 1 pixel, yielding 384 feature maps of 13*13 pixels.

(3a6) The 384 feature maps output by the third convolutional layer Conv3 are fed to the fourth convolutional layer Conv4, which applies 384 convolution kernels of size 3*3 pixels with a stride of 1 pixel, yielding 384 feature maps of 13*13 pixels.

(3a7) The feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are fed to the first fusion layer Eltwise1 for element-wise feature-map fusion, yielding 384 feature maps of 13*13 pixels.

(3a8) The 384 feature maps output by the fourth convolutional layer Conv4 are fed to the fifth convolutional layer Conv5, which applies 256 convolution kernels of size 3*3 pixels with a stride of 1 pixel, yielding 256 feature maps of 13*13 pixels.

(3a9) The 256 feature maps output by the fifth convolutional layer Conv5 are fed to the fifth pooling layer Pool5, which applies max pooling with a pooling block of 3*3 pixels and a stride of 2 pixels, yielding 256 feature maps of 6*6 pixels.

(3a10) The 384 feature maps output by the first fusion layer Eltwise1 are fed to the first fully connected layer fc_a1, which maps them to a feature vector of size 1*1*4096.

(3a11) The 256 feature maps output by the fifth pooling layer Pool5 are fed to the second fully connected layer fc_a2, which maps them to a feature vector of size 1*1*4096.

(3a12) The 256 feature maps output by the fifth pooling layer Pool5 are fed to the third fully connected layer fc6, which maps them to a feature vector of size 1*1*4096.

(3a13) The 1*1*4096 feature vector output by the third fully connected layer fc6 is fed to the fourth fully connected layer fc7, which continues feature extraction and yields a feature vector of size 1*1*4096.

(3a14) The feature vectors of the first fully connected layer fc_a1, the second fully connected layer fc_a2 and the fourth fully connected layer fc7 are fed to the second fusion layer Eltwise2, which fuses them element-wise into a feature vector of size 1*1*4096.

(3a15) The 1*1*4096 feature vector output by the second fusion layer Eltwise2 is fed to the fifth fully connected layer fc8, which maps it to a feature vector of size 1*1*11*M, where M is the number of view images and the symbol "*" denotes multiplication.

(3a16) The 1*1*11*M feature vector is fed to the classification layer Softmax, which yields the class label of image x_i; the view label v_i that maximizes the class probability is selected as its pose label.
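A sketch of the four added layers in Caffe's Python net specification is given below. It assumes a NetSpec n that already holds the standard AlexNet layers conv3, conv4, pool5, fc6 and fc7, uses element-wise SUM as the fusion operation, and omits ReLU/LRN/Dropout layers for brevity; all of these are assumptions about details the text leaves open:

```python
from caffe import layers as L, params as P

def fusion_head(n, num_views):
    """Sketch of the layers this method adds to AlexNet (names follow the text);
    n is a caffe.NetSpec already holding conv3, conv4, pool5, fc6 and fc7."""
    # Eltwise1: element-wise fusion of the Conv3 and Conv4 feature maps
    # (both 384 x 13 x 13, so corresponding positions can be combined).
    n.eltwise1 = L.Eltwise(n.conv3, n.conv4, operation=P.Eltwise.SUM)
    # fc_a1 / fc_a2: map the fused maps and the Pool5 maps to 4096-d vectors.
    n.fc_a1 = L.InnerProduct(n.eltwise1, num_output=4096)
    n.fc_a2 = L.InnerProduct(n.pool5, num_output=4096)
    # Eltwise2: element-wise fusion of fc_a1, fc_a2 and fc7 (all 4096-d).
    n.eltwise2 = L.Eltwise(n.fc_a1, n.fc_a2, n.fc7, operation=P.Eltwise.SUM)
    # fc8 + Softmax over the joint class/viewpoint labels (1*1*11*M in the text).
    n.fc8 = L.InnerProduct(n.eltwise2, num_output=11 * num_views)
    n.prob = L.Softmax(n.fc8)
    return n
```

SUM is only one possible reading of "fusion at corresponding positions"; PROD and MAX are the other element-wise operations Caffe's Eltwise layer offers.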
Step 4, train the convolutional neural network CNN.

(3b1) In the forward propagation stage, take a training sample from the training set, input its multi-view images {x_i} to the input layer of the convolutional neural network CNN, perform feature extraction and feature mapping, and output the final result through the Softmax layer;

(3b2) In the back-propagation stage, compute the difference between the actual output of the convolutional neural network CNN and the ideal output of the training sample, and back-propagate to adjust the weight parameter R of the network by the method of error minimization;

(3b3) Repeat the operations of (3b1) and (3b2) until the loss function of the convolutional neural network CNN satisfies J(θ) ≤ 0.0001, obtaining the trained neural network.
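The training loop of step 4 maps naturally onto Caffe's solver interface. A minimal sketch, assuming a hypothetical solver.prototxt pointing at the 16-layer network and a loss blob named 'loss' (both assumptions):

```python
import caffe

caffe.set_mode_gpu()
solver = caffe.SGDSolver('solver.prototxt')   # hypothetical solver definition

loss = float('inf')
while loss > 0.0001:                          # stopping criterion from step (3b3)
    solver.step(1)                            # one forward/backward pass + update
    loss = float(solver.net.blobs['loss'].data)

solver.net.save('trained_cnn.caffemodel')     # the trained network CNN
```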
Step 5, test the network.

Input the multi-view images {x_i} of each CAD model in the ModelNet10 test set into the trained neural network and output the class label and pose label predicted by the network;

Count, respectively, the number of CAD models in the test set with a wrong class label and with a wrong pose label as a percentage of all CAD models in the test set, obtaining the object classification and attitude estimation accuracy.
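The statistics of this step reduce to simple error counting; a minimal numpy sketch with randomly generated labels as stand-in data:

```python
import numpy as np

def accuracy_stats(pred_y, true_y, pred_v, true_v):
    """Step 5 statistics: the percentage of test CAD models with a wrong class
    label / wrong pose label, converted to classification and pose accuracy."""
    n = len(true_y)
    class_acc = 100.0 * (1.0 - np.count_nonzero(pred_y != true_y) / n)
    pose_acc = 100.0 * (1.0 - np.count_nonzero(pred_v != true_v) / n)
    return class_acc, pose_acc

# toy usage with random labels over the 1469 test models
rng = np.random.default_rng(0)
ty, py = rng.integers(0, 10, 1469), rng.integers(0, 10, 1469)
tv, pv = rng.integers(0, 12, 1469), rng.integers(0, 12, 1469)
print(accuracy_stats(py, ty, pv, tv))
```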
The effects of the present invention are further described below with reference to simulations.

1. Simulation conditions

The simulation experiments of the present invention were run on a 64-bit Ubuntu system with an Intel Core i3 4.2 GHz CPU, 16.00 GB of memory and a GeForce GTX 1070 GPU; the deep learning framework used is Caffe2.

2. Experiment content and results

In the experiment, the ModelNet10 data set is used for training and testing the network. The ModelNet10 data set contains 4898 CAD models in 10 categories, of which 3429 CAD models are in the training set and 1469 in the test set; multi-view images are generated for each CAD model in the data set;

The multi-view images of the samples in the test set are input to the trained convolutional network; the number of CAD models whose class label is mispredicted by the network is 77, and the number of CAD models with a wrong pose label is 609. The classification and attitude estimation accuracy of the network is computed and compared with several existing detection methods, as shown in the following table:
Table 1

Method                  Classification accuracy (%)   Pose estimation accuracy (%)
The present invention   94.76                         58.52
RotationNet             94.38                         58.33
MVCNN                   92.10                         -
FusionNet               90.80                         -
Here RotationNet is a rotation-iteration algorithm, MVCNN is a multi-view fusion algorithm, and FusionNet is a feature fusion algorithm; these are several relatively advanced existing object recognition and pose estimation methods.
As can be seen from Table 1, the method proposed in the present invention of fusing feature maps from layers of different depths of the network can improve the accuracy of classification and attitude estimation.

Claims (5)

1. A method for object classification and pose estimation based on a neural network, comprising:
(1) obtaining the training set and test set, and setting the corresponding images of the CAD models:

taking 3429 CAD models from the ModelNet10 data set as the training set and 1469 CAD models as the test set;

for the CAD model of each sample in the ModelNet10 data set, applying two preprocessing strategies in turn: the first evenly sets 12 predefined viewpoints on the viewing circle where the CAD model is located and collects the corresponding image of the CAD model at each of these 12 predefined viewpoints; the second places the CAD model at the center of a regular dodecahedron, sets the 20 vertices of the regular dodecahedron as predefined viewpoints, and collects the corresponding image of the CAD model at each of these 20 predefined viewpoints;
(2) building the mathematical model of joint detection from the preprocessed multi-view images of each CAD model in the data set:

(2a) taking the pose label of each CAD model as a hidden variable, denoted {v_i};

(2b) defining the M different-view images {x_i | i = 1, …, M} of a CAD model together with its class label y ∈ {1, …, N} as a training sample, where N is the total number of CAD model classes; each view image x_i corresponds to a view label v_i ∈ {1, …, M};

(2c) with the training samples defined as above, abstracting the object recognition and pose estimation task as the following optimization problem:

  max_R Σ_{i=1}^{M} log P(ŷ = y | x_i; R)

where R is the neural network weight parameter, ŷ is the class label predicted by the network, and P(ŷ = y | x_i; R) is the probability that the class label output by the Softmax layer of the convolutional neural network CNN is y;
(3) building and training the convolutional neural network CNN:

(3a) on the basis of the existing AlexNet network, adding an Eltwise1 layer, an fc_a1 layer, an fc_a2 layer and an Eltwise2 layer, obtaining a convolutional neural network CNN with 16 layers, wherein:

the Eltwise1 layer fuses the feature maps of the Conv3 and Conv4 layers of the AlexNet network at corresponding positions;

the fc_a1 layer maps the feature maps of the Eltwise1 layer to a feature vector;

the fc_a2 layer maps the feature maps of the Pool5 layer of the AlexNet network to a feature vector;

the Eltwise2 layer fuses the feature vectors of the fc_a1, fc_a2 and fc7 layers at corresponding positions;

(3b) inputting the multi-view images {x_i} of each CAD model in the training set into the convolutional network, iterating the forward computation and back propagation of the convolutional neural network CNN to train the network and optimize the network parameter R until the loss function of the network satisfies J(θ) ≤ 0.0001, obtaining the trained convolutional neural network CNN;
(4) testing the network:

inputting the multi-view images {x_i} of each CAD model in the ModelNet10 test set into the trained neural network, and counting the accuracy of object classification and attitude estimation.
2. The method according to claim 1, wherein the first preprocessing strategy of step (1) evenly sets 12 predefined viewpoints on the viewing circle where the CAD model is located by first fixing an axis as the rotation axis and then setting one observation point every 30 degrees on the viewing circle around the object, i.e., over the 360-degree viewing circle, obtaining images of 12 different views for each CAD model.
3. The method according to claim 1, wherein the optimization problem in step (2c) is realized as follows:

denoting P(ŷ = k, v_i = j | x_i; R) by p_{k,j}^{(i)}, the optimization problem is expressed in the following form:

  max_R Σ_{i=1}^{M} log max_{j ∈ {1, …, M}} p_{k,j}^{(i)}

where (i) indexes the input image x_i, k denotes the class label of image x_i, j indicates that image x_i is observed from the j-th predefined viewpoint, and R is the neural network weight parameter.
4. The method according to claim 1, wherein the convolutional neural network CNN with 16 layers is built in step (3a) as follows:

(3a1) the input image of 227*227 pixels is fed to the first convolutional layer Conv1, which applies 96 convolution kernels of size 11*11 pixels with a stride of 4 pixels, yielding 96 feature maps of 55*55 pixels;

(3a2) the 96 feature maps output by the first convolutional layer Conv1 are fed to the first pooling layer Pool1, which applies max pooling with a pooling block of 3*3 pixels and a stride of 2 pixels, yielding 96 feature maps of 27*27 pixels;

(3a3) the 96 feature maps output by the first pooling layer Pool1 are fed to the second convolutional layer Conv2, which applies 256 convolution kernels of size 5*5 pixels with a stride of 1, yielding 256 feature maps of 27*27 pixels;

(3a4) the 256 feature maps output by the second convolutional layer Conv2 are fed to the second pooling layer Pool2, which applies max pooling with a pooling block of 3*3 pixels and a stride of 2 pixels, yielding 256 feature maps of 13*13 pixels;

(3a5) the 256 feature maps output by the second pooling layer Pool2 are fed to the third convolutional layer Conv3, which applies 384 convolution kernels of size 3*3 pixels with a stride of 1 pixel, yielding 384 feature maps of 13*13 pixels;

(3a6) the 384 feature maps output by the third convolutional layer Conv3 are fed to the fourth convolutional layer Conv4, which applies 384 convolution kernels of size 3*3 pixels with a stride of 1 pixel, yielding 384 feature maps of 13*13 pixels;

(3a7) the feature maps of the third convolutional layer Conv3 and the fourth convolutional layer Conv4 are fed to the first fusion layer Eltwise1 for element-wise feature-map fusion, yielding 384 feature maps of 13*13 pixels;

(3a8) the 384 feature maps output by the fourth convolutional layer Conv4 are fed to the fifth convolutional layer Conv5, which applies 256 convolution kernels of size 3*3 pixels with a stride of 1 pixel, yielding 256 feature maps of 13*13 pixels;

(3a9) the 256 feature maps output by the fifth convolutional layer Conv5 are fed to the fifth pooling layer Pool5, which applies max pooling with a pooling block of 3*3 pixels and a stride of 2 pixels, yielding 256 feature maps of 6*6 pixels;

(3a10) the 384 feature maps output by the first fusion layer Eltwise1 are fed to the first fully connected layer fc_a1, which maps them to a feature vector of size 1*1*4096;

(3a11) the 256 feature maps output by the fifth pooling layer Pool5 are fed to the second fully connected layer fc_a2, which maps them to a feature vector of size 1*1*4096;

(3a12) the 256 feature maps output by the fifth pooling layer Pool5 are fed to the third fully connected layer fc6, which maps them to a feature vector of size 1*1*4096;

(3a13) the 1*1*4096 feature vector output by the third fully connected layer fc6 is fed to the fourth fully connected layer fc7, which continues feature extraction and yields a feature vector of size 1*1*4096;

(3a14) the feature vectors of the first fully connected layer fc_a1, the second fully connected layer fc_a2 and the fourth fully connected layer fc7 are fed to the second fusion layer Eltwise2, which fuses them element-wise into a feature vector of size 1*1*4096;

(3a15) the 1*1*4096 feature vector output by the second fusion layer Eltwise2 is fed to the fifth fully connected layer fc8, which maps it to a feature vector of size 1*1*11*M, where M is the number of view images and the symbol "*" denotes multiplication;

(3a16) the 1*1*11*M feature vector is fed to the classification layer Softmax, which yields the class label of image x_i, and the view label v_i that maximizes the class probability is selected as its pose label.
5. The method according to claim 1, wherein the convolutional neural network CNN in step (3b) is trained as follows:

(3b1) in the forward propagation stage, take a training sample from the training set, input its multi-view images {x_i} to the input layer of the convolutional neural network CNN, perform feature extraction and feature mapping, and output the final result through the Softmax layer;

(3b2) in the back-propagation stage, compute the difference between the actual output of the convolutional neural network CNN and the ideal output of the training sample, and back-propagate to adjust the weight parameter R of the network by the method of error minimization;

(3b3) repeat the operations of (3b1) and (3b2) until the loss function of the convolutional neural network CNN satisfies J(θ) ≤ 0.0001.
CN201810243399.4A 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network Active CN108491880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810243399.4A CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810243399.4A CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Publications (2)

Publication Number Publication Date
CN108491880A true CN108491880A (en) 2018-09-04
CN108491880B CN108491880B (en) 2021-09-03

Family

ID=63319473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810243399.4A Active CN108491880B (en) 2018-03-23 2018-03-23 Object classification and pose estimation method based on neural network

Country Status (1)

Country Link
CN (1) CN108491880B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102375831A (en) * 2010-08-13 2012-03-14 富士通株式会社 Three-dimensional model search device and method thereof and model base generation device and method thereof
US20160327653A1 (en) * 2014-02-03 2016-11-10 Board Of Regents, The University Of Texas System System and method for fusion of camera and global navigation satellite system (gnss) carrier-phase measurements for globally-referenced mobile device pose determination
WO2017015390A1 (en) * 2015-07-20 2017-01-26 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN106372648A (en) * 2016-10-20 2017-02-01 中国海洋大学 Multi-feature-fusion-convolutional-neural-network-based plankton image classification method
CN106845510A (en) * 2016-11-07 2017-06-13 中国传媒大学 Chinese tradition visual culture Symbol Recognition based on depth level Fusion Features
CN106845515A (en) * 2016-12-06 2017-06-13 上海交通大学 Robot target identification and pose reconstructing method based on virtual sample deep learning
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
CN107330463A (en) * 2017-06-29 2017-11-07 南京信息工程大学 Model recognizing method based on CNN multiple features combinings and many nuclear sparse expressions
CN107527068A (en) * 2017-08-07 2017-12-29 南京信息工程大学 Model recognizing method based on CNN and domain adaptive learning
CN107657249A (en) * 2017-10-26 2018-02-02 珠海习悦信息技术有限公司 Method, apparatus, storage medium and the processor that Analysis On Multi-scale Features pedestrian identifies again
CN107808146A (en) * 2017-11-17 2018-03-16 北京师范大学 A kind of multi-modal emotion recognition sorting technique

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ELHOSEINY M et al.: "A Comparative Analysis and Study of Multiview CNN Models for Joint Object Categorization and Pose Estimation", International Conference on Machine Learning *
RAJEEV RANJAN et al.: "HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition", 10.1109/TPAMI.2017.2781233 *
TOMAS PFISTER et al.: "Flowing ConvNets for Human Pose Estimation in Videos", 2015 IEEE International Conference on Computer Vision (ICCV) *
GUO Shuxu et al.: "Research on Liver CT Image Segmentation Based on Fully Convolutional Neural Networks", Computer Engineering and Applications *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902675A (en) * 2018-09-17 2019-06-18 华为技术有限公司 The method and apparatus of the pose acquisition methods of object, scene reconstruction
CN109902675B (en) * 2018-09-17 2021-05-04 华为技术有限公司 Object pose acquisition method and scene reconstruction method and device
CN109493417B (en) * 2018-10-31 2023-04-07 深圳大学 Three-dimensional object reconstruction method, device, equipment and storage medium
CN109493417A (en) * 2018-10-31 2019-03-19 深圳大学 Three-dimension object method for reconstructing, device, equipment and storage medium
CN111191492A (en) * 2018-11-15 2020-05-22 北京三星通信技术研究有限公司 Information estimation, model retrieval and model alignment methods and apparatus
CN111191492B (en) * 2018-11-15 2024-07-02 北京三星通信技术研究有限公司 Information estimation, model retrieval and model alignment methods and devices
CN109598339A (en) * 2018-12-07 2019-04-09 电子科技大学 A kind of vehicle attitude detection method based on grid convolutional network
CN109903332A (en) * 2019-01-08 2019-06-18 杭州电子科技大学 A kind of object's pose estimation method based on deep learning
CN109934864B (en) * 2019-03-14 2023-01-20 东北大学 Residual error network deep learning method for mechanical arm grabbing pose estimation
CN109934864A (en) * 2019-03-14 2019-06-25 东北大学 Residual error network depth learning method towards mechanical arm crawl pose estimation
CN109978907A (en) * 2019-03-22 2019-07-05 南京邮电大学 A kind of sitting posture of student detection method towards household scene
CN111860039A (en) * 2019-04-26 2020-10-30 四川大学 Cross-connection CNN + SVR-based street space quality quantification method
CN110322510A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of 6D position and orientation estimation method using profile information
CN112396077A (en) * 2019-08-15 2021-02-23 瑞昱半导体股份有限公司 Fully-connected convolutional neural network image processing method and circuit system
CN110728187A (en) * 2019-09-09 2020-01-24 武汉大学 Remote sensing image scene classification method based on fault tolerance deep learning
CN110728187B (en) * 2019-09-09 2022-03-04 武汉大学 Remote sensing image scene classification method based on fault tolerance deep learning
CN110728192A (en) * 2019-09-16 2020-01-24 河海大学 High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN110728192B (en) * 2019-09-16 2022-08-19 河海大学 High-resolution remote sensing image classification method based on novel characteristic pyramid depth network
CN110728222A (en) * 2019-09-30 2020-01-24 清华大学深圳国际研究生院 Pose estimation method for target object in mechanical arm grabbing system
CN111126441A (en) * 2019-11-25 2020-05-08 西安工程大学 Construction method of classification detection network model
CN111259735A (en) * 2020-01-08 2020-06-09 西安电子科技大学 Single-person attitude estimation method based on multi-stage prediction feature enhanced convolutional neural network
CN111325166B (en) * 2020-02-26 2023-07-07 南京工业大学 Sitting posture identification method based on projection reconstruction and MIMO neural network
CN111325166A (en) * 2020-02-26 2020-06-23 南京工业大学 Sitting posture identification method based on projection reconstruction and multi-input multi-output neural network
CN113436266B (en) * 2020-03-23 2024-05-14 丰田自动车株式会社 Image processing system, image processing method, method of training neural network, and recording medium for performing the method
CN113436266A (en) * 2020-03-23 2021-09-24 丰田自动车株式会社 Image processing system, image processing method, method of training neural network, and recording medium for executing the method
WO2022022063A1 (en) * 2020-07-27 2022-02-03 腾讯科技(深圳)有限公司 Three-dimensional human pose estimation method and related device
CN112163477A (en) * 2020-09-16 2021-01-01 厦门市特种设备检验检测院 Escalator pedestrian pose target detection method and system based on FasterR-CNN
CN112163477B (en) * 2020-09-16 2023-09-22 厦门市特种设备检验检测院 Escalator pedestrian pose target detection method and system based on Faster R-CNN
CN112381879A (en) * 2020-11-16 2021-02-19 华南理工大学 Object posture estimation method, system and medium based on image and three-dimensional model
WO2022100379A1 (en) * 2020-11-16 2022-05-19 华南理工大学 Object attitude estimation method and system based on image and three-dimensional model, and medium
CN112528941A (en) * 2020-12-23 2021-03-19 泰州市朗嘉馨网络科技有限公司 Automatic parameter setting system based on neural network
CN112528941B (en) * 2020-12-23 2021-11-19 芜湖神图驭器智能科技有限公司 Automatic parameter setting system based on neural network
CN112634367A (en) * 2020-12-25 2021-04-09 天津大学 Anti-occlusion object pose estimation method based on deep neural network
CN112857215B (en) * 2021-01-08 2022-02-08 河北工业大学 Monocular 6D pose estimation method based on regular icosahedron
CN112857215A (en) * 2021-01-08 2021-05-28 河北工业大学 Monocular 6D pose estimation method based on regular icosahedron
CN113129370B (en) * 2021-03-04 2022-08-19 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113129370A (en) * 2021-03-04 2021-07-16 同济大学 Semi-supervised object pose estimation method combining generated data and label-free data
CN113705480A (en) * 2021-08-31 2021-11-26 新东方教育科技集团有限公司 Gesture recognition method, device and medium based on gesture recognition neural network
CN114742212A (en) * 2022-06-13 2022-07-12 南昌大学 Electronic digital information resampling rate estimation method

Also Published As

Publication number Publication date
CN108491880B (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN108491880A (en) Object classification based on neural network and position and orientation estimation method
Xu et al. Light-YOLOv3: fast method for detecting green mangoes in complex scenes using picking robots
Hu et al. SAC-Net: Spatial attenuation context for salient object detection
Uçar et al. A new facial expression recognition based on curvelet transform and online sequential extreme learning machine initialized with spherical clustering
CN108304826A (en) Facial expression recognizing method based on convolutional neural networks
CN108510012A (en) A kind of target rapid detection method based on Analysis On Multi-scale Features figure
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN104700076B (en) Facial image virtual sample generation method
CN107609460A (en) A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN107316307A (en) A kind of Chinese medicine tongue image automatic segmentation method based on depth convolutional neural networks
CN112862792B (en) Wheat powdery mildew spore segmentation method for small sample image dataset
CN110222718B (en) Image processing method and device
Xu et al. Spherical DNNs and Their Applications in 360° Images and Videos
CN113436227A (en) Twin network target tracking method based on inverted residual error
Yan et al. Monocular depth estimation with guidance of surface normal map
CN115393596B (en) Garment image segmentation method based on artificial intelligence
Xu et al. Face expression recognition based on convolutional neural network
Hu et al. A spatio-temporal integrated model based on local and global features for video expression recognition
Li et al. Fast recognition of pig faces based on improved Yolov3
Yue et al. DRGCNN: Dynamic region graph convolutional neural network for point clouds
Sun et al. Overview of capsule neural networks
Zhang Research on Image Recognition Based on Neural Network
Chauhan et al. Empirical Study on convergence of Capsule Networks with various hyperparameters
Ocegueda-Hernandez et al. A lightweight convolutional neural network for pose estimation of a planar model
Li Research on target feature extraction and location positioning with machine learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant