CN108629789A

CN108629789A - A salient object detection method based on VggNet

Info

Publication number: CN108629789A
Application number: CN201810457552.3A
Authority: CN
Inventors: 郭炜强; 徐绍栋; 张宇; 郑波
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2018-05-14
Filing date: 2018-05-14
Publication date: 2018-10-09

Abstract

The invention discloses a salient object detection method based on VggNet. The method takes preserving low-level image edge features as its primary goal and extracting high-level semantic features as its secondary goal. The low-level edge features are convolved with the high-level semantic features to obtain the saliency map of the target. While detection accuracy is effectively improved, network training speed is significantly increased by reducing the fully connected layers and adding normalization layers. RGB images of arbitrary size are resized to a fixed size by linear interpolation, so that images of different sizes can all be processed effectively. The method can quickly extract high-level semantic features while retaining low-level image edge information, which effectively solves the problem that network structures that merely extract features cannot capture image edges.

Description

A salient object detection method based on VggNet

Technical field

The present invention relates to the technical field of digital image processing, and in particular to a salient object detection method based on VggNet.

Background art

Convolutional neural networks were proposed on the basis of artificial neural networks. An artificial neural network simulates the human nervous system and is composed of a certain number of neurons. In a supervised learning problem there is a set of training data (x_i, y_i), where x is a sample and y is its class label. Feeding these data into an artificial neural network yields a non-linear separating hyperplane h_{w,b}(x), with which all input image data can be classified.

A neuron is a computational unit of a neural network; in essence it is a function. Fig. 1 is a schematic diagram of a neuron:

The neuron has three inputs x₁, x₂, x₃ and a bias input +1. Its output is

$$h_{w,b}(x) = f\left(\sum_{i=1}^{3} w_i x_i + b\right),$$

where f is the activation function, w_i is the weight of each input, and b is the bias. The activation function here is the sigmoid function:

$$f(z) = \frac{1}{1 + e^{-z}}$$
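The behaviour of such a neuron can be illustrated with a minimal sketch. It assumes the usual formulation, a weighted sum of the inputs plus a bias passed through the sigmoid; the input values, weights and bias below are hypothetical and chosen only so that the result is easy to check by hand.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Output of one neuron: sigmoid of the weighted input sum plus bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical inputs, weights and bias (not taken from the patent).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0

# Weighted sum is 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5.
y = neuron(x, w, b)
```

Because the sigmoid squashes its argument into (0, 1), the output can be read as a soft on/off response of the neuron.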

An artificial neural network is formed by combining many such neurons. Fig. 2 is a schematic diagram of a small artificial neural network:

In the convolutional neural network shown in the figure, the input is a series of images and the weights w are the convolution templates (kernels), which act as the weights of the different neurons. Convolutional layers and down-sampling layers usually alternate, and the final part is a fully connected neural network, that is, the classical artificial neural network described above. Fig. 3 is a schematic diagram of a simple convolutional neural network:

In the figure, C denotes a convolutional layer and S a down-sampling layer. The output of S4 is flattened into a vector by the Stretch layer and fed into a conventional neural network NN, which produces the output.

VggNet is a kind of convolutional neural network. It took second place in the classification task and first place in the localization task of the ILSVRC 2014 competition. Its main contribution is to show that network depth is a key factor in the algorithm's excellent performance. VggNet comes in five configurations, the best of which contains 16 convolutional/fully connected layers. The network structure is very uniform, using 3 × 3 convolution kernels and 2 × 2 down-sampling layers from beginning to end. To this day, VggNet is still widely used to extract image features.

The many convolutional layers in the VggNet structure blur the low-level edge information of the image, and the fully connected layers greatly increase the number of parameters; both hinder object detection. In view of these problems, the present invention proposes a solution suited to salient object detection.

Summary of the invention

The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a salient object detection method based on VggNet that can quickly extract high-level semantic features while retaining low-level image edge information, thereby effectively solving the problem that network structures that merely extract features cannot capture image edges.

To achieve the above object, the technical solution provided by the present invention is a salient object detection method based on VggNet. The method takes preserving low-level image edge features as its primary goal and extracting high-level semantic features as its secondary goal. The low-level edge features are convolved with the high-level semantic features to obtain the saliency map of the target. While detection accuracy is effectively improved, network training speed is significantly increased by reducing the fully connected layers and adding normalization layers. RGB images of arbitrary size are resized to a fixed size by linear interpolation, so that images of different sizes can all be processed effectively. The method comprises the following steps:

1) Preprocess the input image by linear interpolation so that every image meets the network input requirement;

2) Extract image features layer by layer through convolution operations, obtaining convolution results for multiple channels according to the convolution kernel size and stride;

3) Process the convolution results with down-sampling operations to reduce computation time and memory consumption;

4) Normalize the image data with a batch normalization layer to prevent vanishing or exploding gradients;

5) Apply deconvolution to the convolution results to restore the image matrix to the original image size, in preparation for the saliency map;

6) Convolve the deconvolution result with the low-level edge information to obtain the final saliency map of the target.

In step 1), an RGB image of arbitrary size is resized to 224 × 224 by linear interpolation. The core idea is to perform one linear interpolation in each of two directions. The procedure and formulas are as follows:

Four points (x₀,y₀), (x₀,y₁), (x₁,y₀), (x₁,y₁) of the image matrix are known, and f(x₀,y₀), f(x₀,y₁), f(x₁,y₀), f(x₁,y₁) are the values at these four points, respectively;

Linear interpolation along the y-axis at abscissa x₀ is computed as follows:

$$Z_1 = f(x_0, y_0) + v\,\bigl[f(x_0, y_1) - f(x_0, y_0)\bigr]$$

In the formula, Z₁ is the result of the calculation, and v is the distance along the y-axis from (x₀,y₀) to the point being interpolated, measured in units of the grid spacing;

Linear interpolation along the y-axis at abscissa x₁ is computed as follows:

$$Z_2 = f(x_1, y_0) + v\,\bigl[f(x_1, y_1) - f(x_1, y_0)\bigr]$$

In the formula, Z₂ is the result of the calculation, and v is the distance along the y-axis from (x₁,y₀) to the point being interpolated;

Linear interpolation along the x-axis is then computed as follows:

$$Z = Z_1 + u\,(Z_2 - Z_1)$$

In the formula, Z is the final interpolation result, Z₁ and Z₂ are the two results computed above, and u is the distance along the x-axis from x₀ to the point being interpolated.
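The two-pass interpolation described above can be sketched in plain Python. This is an illustrative sketch rather than the patented implementation: it handles a single channel stored as a list of rows (an RGB image would be processed channel by channel), and it assumes a corner-aligned mapping between output and input coordinates, which the text does not specify. The function names `bilinear_sample` and `resize` are hypothetical.

```python
def bilinear_sample(img, x, y):
    """Interpolate a 2-D image (img[row][col]) at real-valued (x, y)."""
    h, w = len(img), len(img[0])
    x0, y0 = int(x), int(y)                      # top-left neighbour
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    u, v = x - x0, y - y0                        # offsets in x and y
    z1 = img[y0][x0] + v * (img[y1][x0] - img[y0][x0])  # along y at x0
    z2 = img[y0][x1] + v * (img[y1][x1] - img[y0][x1])  # along y at x1
    return z1 + u * (z2 - z1)                    # along x

def resize(img, out_h, out_w):
    """Resize a single-channel image; corner alignment is an assumption."""
    h, w = len(img), len(img[0])
    sy = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    return [[bilinear_sample(img, c * sx, r * sy) for c in range(out_w)]
            for r in range(out_h)]

# A 2 x 2 image enlarged to 3 x 3: the new centre is the mean of the corners.
small = resize([[0, 10], [20, 30]], 3, 3)
# The same routine produces the fixed 224 x 224 network input.
big = resize([[0, 10], [20, 30]], 224, 224)
```

Calling `resize(channel, 224, 224)` on each of the three colour channels would give the fixed-size input required by the network.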

In step 2), convolution kernels of two sizes are used. The 3 × 3 kernel is a size that can capture the notions of up/down, left/right and center; it gathers neighborhood information and also adjusts the data structure non-linearly. The 1 × 1 kernel serves as an auxiliary kernel, and dimensionality reduction can be achieved by controlling the number of such kernels.
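How a 1 × 1 kernel reduces dimensionality can be shown with a small sketch: each output channel is a weighted sum of the input channels at the same pixel, so the number of kernels sets the number of output channels. The feature map, the weights and the function name `conv1x1` are hypothetical.

```python
def conv1x1(fmap, weights):
    """1x1 convolution.

    fmap:    input feature map indexed as [channel][row][col]
    weights: one weight vector per output channel, [c_out][c_in]
    returns: output feature map indexed as [c_out][row][col]
    """
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(wk[c] * fmap[c][i][j] for c in range(c_in))
              for j in range(w)]
             for i in range(h)]
            for wk in weights]

# Hypothetical 4-channel 2x2 feature map; channel c is filled with c + 1.
fmap = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]

# Two 1x1 kernels reduce 4 channels to 2: one copies channel 0,
# the other averages all four channels.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.25, 0.25, 0.25, 0.25]]
reduced = conv1x1(fmap, weights)
```

The spatial size is unchanged; only the channel count drops from four to two.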

In step 3), the down-sampling operation is either mean sampling or maximum sampling. Mean sampling averages the feature points within a neighborhood, which reduces the increase in the variance of the estimate caused by the limited neighborhood size. Maximum sampling takes the maximum of the feature points within a neighborhood, which mitigates the shift in the estimated mean caused by errors in the convolutional layer parameters.
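The two down-sampling variants can be compared on a toy feature map. The sketch assumes non-overlapping 2 × 2 windows, matching the 2 × 2 down-sampling layers of VggNet, and an input with even height and width; the function name `pool2x2` and the sample values are hypothetical.

```python
def pool2x2(fmap, mode="max"):
    """Non-overlapping 2x2 pooling of a 2-D map with even height and width."""
    agg = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    out = []
    for i in range(0, len(fmap) - 1, 2):
        row = []
        for j in range(0, len(fmap[0]) - 1, 2):
            window = [fmap[i][j], fmap[i][j + 1],
                      fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(agg(window))
        out.append(row)
    return out

# Hypothetical 4x4 feature map.
fmap = [[1, 2, 5, 6],
        [3, 4, 7, 8],
        [9, 10, 13, 14],
        [11, 12, 15, 16]]

max_pooled = pool2x2(fmap, "max")    # keeps the strongest response per window
mean_pooled = pool2x2(fmap, "mean")  # keeps the average response per window
```

Both variants halve each spatial dimension, which is where the savings in time and memory come from.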

In step 4), the specific role of batch normalization is to unify the statistical distribution of the samples, so that the mean of the input signals over all samples is close to 0, or small compared with their mean square deviation.
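A minimal sketch of the normalization itself, assuming the commonly used formulation (subtract the batch mean, divide by the batch standard deviation, then apply a scale `gamma` and shift `beta`). The patent text does not state `eps`, `gamma` or `beta`; they are the conventional parameters and appear here as assumptions.

```python
import math

def batch_norm(values, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize one feature across a batch to roughly zero mean, unit variance.

    eps, gamma and beta follow the conventional formulation and are
    assumptions, not values given in the patent.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical activations of one feature over a batch of four samples.
normed = batch_norm([2.0, 4.0, 6.0, 8.0])
normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```

After normalization the batch mean is approximately 0 and the variance approximately 1, which keeps activations in a range where gradients neither vanish nor explode as easily.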

In step 5), deconvolution is the inverse process of the convolution operation. It takes the feature maps produced by convolution as input and computes the deconvolution result, which can be used to verify and display the feature maps obtained at each layer.
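In deep-learning practice a deconvolution layer is usually realized as a transposed convolution, which restores spatial size but is not a strict mathematical inverse. The sketch below assumes that interpretation; the kernel, the stride and the function name `transposed_conv2d` are hypothetical, since the text does not give the layer parameters.

```python
def transposed_conv2d(x, k, stride=2):
    """Transposed convolution of a 2-D map x with kernel k (no padding).

    Each input value is multiplied by the whole kernel and added into the
    output at a position scaled by the stride.
    """
    h, w = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out_h = (h - 1) * stride + kh
    out_w = (w - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    out[i * stride + a][j * stride + b] += x[i][j] * k[a][b]
    return out

# A 2x2 map up-sampled to 4x4 with an all-ones 2x2 kernel and stride 2.
upsampled = transposed_conv2d([[1, 2], [3, 4]], [[1, 1], [1, 1]], stride=2)
```

With stride 2 and a 2 × 2 kernel, each input value is spread over a 2 × 2 block, so the operation undoes the size reduction of one 2 × 2 down-sampling layer.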

In step 6), the reason for convolving the deconvolution result with the low-level edge information is that repeated convolution blurs the edge features of the image; the copy of the low-level convolution output saved in advance compensates for this deficiency of the deconvolution result, so that the target is located more accurately.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. A copy of the image edge information is retained at an early stage of convolution, which makes it possible to convolve again when producing the output and so obtain the saliency map of the target.

2. The fully connected layers of the convolutional neural network are replaced with convolutional layers, which extracts the high-level semantic features of the image more effectively while reducing the number of training parameters and shortening training time.

3. The added batch normalization layer addresses vanishing and exploding gradients and further shortens convergence time.

Description of the drawings

Fig. 1 is a schematic diagram of a neuron, as referred to in the background art.

Fig. 2 is a schematic diagram of a small artificial neural network, as referred to in the background art.

Fig. 3 is a schematic diagram of a simple convolutional neural network, as referred to in the background art.

Fig. 4 is a schematic diagram of the network structure used by the method of the present invention.

Fig. 5 is a flow chart of the method of the present invention.

Fig. 6 is a schematic diagram of the convolution calculation process.

Fig. 7 shows an input and output example of the method of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to a specific embodiment.

Before the present invention is described, the VggNet network structure needs to be introduced. The standard VggNet structure comprises convolutional layers, down-sampling layers and fully connected layers. The convolution kernels are mainly 3 × 3, which provides local receptive fields while effectively reducing the number of network parameters.

Fig. 4 shows the network structure used by the method of the present invention. It differs from the standard VggNet in that the final fully connected layers are replaced with convolutional layers, a batch normalization layer and a deconvolution layer. The reasons for removing the fully connected layers are as follows: replacing them with convolutional layers further extracts the high-level semantic features of the image and improves the accuracy of the detection result; adding a batch normalization layer speeds up network convergence and reduces training time and hardware cost; adding a deconvolution layer restores the intermediate data to the original image size, which facilitates generation of the final detection result.
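The parameter saving from dropping fully connected layers can be illustrated with a rough count. The sketch assumes the standard VGG-16 shapes (a 7 × 7 × 512 feature map feeding a 4096-unit fully connected layer) and a hypothetical 3 × 3 convolution from 512 to 512 channels as the replacement; the text does not give the actual layer shapes, so the numbers are illustrative only.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of learnable parameters in a k x k convolutional layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def fc_params(n_in, n_out, bias=True):
    """Number of learnable parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# First fully connected layer of a standard VGG-16: 7*7*512 inputs, 4096 outputs.
fc6 = fc_params(7 * 7 * 512, 4096)

# Hypothetical replacement: one 3x3 convolution, 512 -> 512 channels.
conv = conv_params(3, 512, 512)

ratio = fc6 / conv  # how many times larger the fully connected layer is
```

Under these assumptions the single fully connected layer holds roughly forty times as many parameters as the convolutional layer, which is consistent with the stated reduction in training parameters.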

The salient object detection method based on VggNet provided by this embodiment takes preserving low-level image edge features as its primary goal and extracting high-level semantic features as its secondary goal. The low-level edge features are convolved with the high-level semantic features to obtain the saliency map of the target. While detection accuracy is effectively improved, network training speed is significantly increased by reducing the fully connected layers and adding normalization layers. RGB images of arbitrary size are resized to a fixed size by linear interpolation, so that images of different sizes can all be processed effectively. The method comprises the following steps:

1) Preprocess the input image by linear interpolation so that every image meets the network input requirement. An RGB image of arbitrary size is resized to 224 × 224 by linear interpolation; the core idea is to perform one linear interpolation in each of two directions, using the procedure and formulas set out above for step 1).

2) Extract image features layer by layer through convolution operations, obtaining convolution results for multiple channels according to the convolution kernel size and stride. Convolution kernels of two sizes are used: the 3 × 3 kernel is a size that can capture the notions of up/down, left/right and center, gathers neighborhood information and also adjusts the data structure non-linearly; the 1 × 1 kernel serves as an auxiliary kernel, and dimensionality reduction can be achieved by controlling the number of such kernels.

3) Process the convolution results with down-sampling operations to reduce computation time and memory consumption. The down-sampling operation is either mean sampling or maximum sampling. Mean sampling averages the feature points within a neighborhood, which reduces the increase in the variance of the estimate caused by the limited neighborhood size; maximum sampling takes the maximum of the feature points within a neighborhood, which mitigates the shift in the estimated mean caused by errors in the convolutional layer parameters.

4) Normalize the image data with a batch normalization layer to prevent vanishing or exploding gradients. The specific role of batch normalization is to unify the statistical distribution of the samples, so that the mean of the input signals over all samples is close to 0, or small compared with their mean square deviation.

5) Apply deconvolution to the convolution results to restore the image matrix to the original image size, in preparation for the saliency map. Deconvolution is the inverse process of the convolution operation; it takes the feature maps produced by convolution as input and computes the deconvolution result, which can be used to verify and display the feature maps obtained at each layer.

6) Convolve the deconvolution result with the low-level edge information to obtain the final saliency map of the target. The reason for doing so is that repeated convolution blurs the edge features of the image; the copy of the low-level convolution output saved in advance compensates for this deficiency of the deconvolution result, so that the target is located more accurately.

The specific operating procedure is shown in Fig. 5. Training combines forward propagation with backpropagation. Forward propagation is the process of convolving the image with convolution kernels. During convolution, the kernel KernalW is placed over the input map InputX; the values at corresponding positions are multiplied and summed, and the resulting value is assigned to the corresponding position of the output map OutputY. The kernel moves one position at a time over InputX, from left to right and from top to bottom, and once it has covered the whole input the output matrix OutputY is obtained (as shown in Fig. 6). If the input map InputX is of size m × n, the kernel is of size w × w, the stride is 1 and no padding is applied, then the output map OutputY is of size (m-w+1) × (n-w+1).
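The sliding-window calculation and the resulting output size can be checked with a short sketch. It covers the stride-1 case without padding, for which the output is (m-w+1) × (n-w+1), and multiplies the kernel with the covered patch without flipping it, as the description above implies. The function name `conv2d_valid` and the sample values are hypothetical.

```python
def conv2d_valid(x, k, stride=1):
    """Slide a square kernel k over a 2-D input x with no padding.

    At each position the covered values are multiplied element-wise with
    the kernel and summed (no kernel flip, as is usual in CNNs).
    """
    m, n = len(x), len(x[0])
    w = len(k)
    out_h = (m - w) // stride + 1
    out_w = (n - w) // stride + 1
    return [[sum(x[i * stride + a][j * stride + b] * k[a][b]
                 for a in range(w) for b in range(w))
             for j in range(out_w)]
            for i in range(out_h)]

# Hypothetical 4x4 input holding 1..16 and a 3x3 all-ones kernel.
input_x = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
kernel_w = [[1, 1, 1],
            [1, 1, 1],
            [1, 1, 1]]

output_y = conv2d_valid(input_x, kernel_w)  # size (4-3+1) x (4-3+1) = 2 x 2
```

Each output entry is the sum of one 3 × 3 patch; for example, the top-left entry is 1+2+3+5+6+7+9+10+11 = 54.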

During backpropagation of the error signal, the error signal of each neuron in the classifier at the tail of the network is first obtained in the manner of error backpropagation in neural networks; the error signal is then propagated from the classifier to the feature extractor in front of it. Propagation of the error signal from the feature maps of a down-sampling layer to the feature maps of the preceding convolutional layer is accomplished by one full convolution.

A convolutional neural network is effectively a black box; the parameters of its intermediate layers must be adjusted according to the network's output in order to obtain better feature extraction or classification. This is why a training set and a test set are needed. The training set is used to train the neural network to extract features; it can be understood simply as a large amount of data (usually a large number of pictures) used to discover and predict latent relationships. The test set is used to test and measure the accuracy of the trained network; it is a set of data that is independent of the training data but follows the same probability distribution. Common testing methods include the hold-out method and cross-validation.

As the name suggests, the hold-out method selects p elements of the full set X as the test set and uses the remaining n-p elements as the training set, where n is the number of elements of X. By a basic combinatorial result, the number of ways of selecting the p elements is

$$C_n^p = \frac{n!}{p!\,(n-p)!},$$

where n! denotes the factorial of n. In this sense the time complexity of leave-p-out validation is very high. When p = 1, the complexity of leave-one-out validation is exactly n.
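The growth in the number of leave-p-out splits can be verified by enumerating them for a small set. The sketch works on sample indices 0..n-1; the generator name `leave_p_out_splits` is hypothetical.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n, p):
    """Yield every (train_indices, test_indices) split that holds out p of n."""
    for test in combinations(range(n), p):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, list(test)

# With n = 5 samples and p = 2 there are C(5, 2) = 10 distinct splits.
splits_p2 = list(leave_p_out_splits(5, 2))
count_p2 = len(splits_p2)

# With p = 1 (leave-one-out) there are exactly n = 5 splits.
count_p1 = len(list(leave_p_out_splits(5, 1)))

# The count grows quickly: for 100 samples and p = 5 it is already large.
count_large = comb(100, 5)
```

Every split requires training a model from scratch, which is why exhaustive leave-p-out validation is rarely practical beyond p = 1.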

Next, cross-validation is briefly introduced, taking k-fold as an example. In k-fold cross-validation the full set X is randomly divided into k sets A₁, ..., A_k of equal size, so that |A₁| = ... = |A_k|, where |A_i| denotes the number of elements of A_i, that is, its cardinality. The index i is traversed from 1 to k; each time, A_i serves as the test set and the other sets together serve as the training set. From the test statistics of the model, the number n_i of misclassified results in A_i is obtained. If the cardinality of the full set X is n, the error rate of the model is

$$E = \frac{1}{n}\sum_{i=1}^{k} n_i.$$

To improve accuracy, the above k-fold procedure can be repeated t times, with X randomly re-divided each time. The t trials yield t error rates E₁, ..., E_t, and the error rate of the model is then:

$$\bar{E} = \frac{1}{t}\sum_{j=1}^{t} E_j$$
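The repeated k-fold procedure can be sketched as follows. The sketch is illustrative: the majority-class classifier and the toy data stand in for the trained network and image set, and all function names are hypothetical. The folds are of equal size only when the number of samples is divisible by k.

```python
import random

def k_fold_error(data, k, train_fn, seed=0):
    """One round of k-fold cross-validation: total test errors divided by n."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # random division of the full set
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        predict = train_fn(train)
        errors += sum(1 for x, y in test if predict(x) != y)
    return errors / len(data)

def repeated_k_fold_error(data, k, train_fn, t):
    """Average the k-fold error rate over t independent random divisions."""
    return sum(k_fold_error(data, k, train_fn, seed=s) for s in range(t)) / t

def majority_classifier(train):
    """Stand-in model (an assumption): always predicts the most common label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

# Toy data: eight samples of class 0 and two of class 1.
data = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 10)]
error_rate = repeated_k_fold_error(data, k=5, train_fn=majority_classifier, t=3)
```

Here the stand-in model always predicts class 0, so exactly the two class-1 samples are misclassified in every round and the averaged error rate is 2/10.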

The input and output images are shown in Fig. 7.

In summary, the present invention produces a new type of network for salient object detection essentially in the following three ways:

1) A copy of the image edge information is retained at an early stage of convolution, which makes it possible to convolve again when producing the output and so obtain the saliency map of the target.

2) The fully connected layers of the convolutional neural network are replaced with convolutional layers, which extracts the high-level semantic features of the image more effectively while reducing the number of training parameters and shortening training time.

3) The added batch normalization layer addresses vanishing and exploding gradients and further shortens convergence time.

The embodiment described above is only a preferred embodiment of the present invention and does not limit its scope of implementation; any change made according to the shape and principle of the present invention shall therefore fall within the scope of protection of the present invention.

Claims

1. A salient object detection method based on VggNet, characterized in that: the method takes preserving low-level image edge features as its primary goal and extracting high-level semantic features as its secondary goal; the low-level edge features are convolved with the high-level semantic features to obtain the saliency map of the target; while detection accuracy is effectively improved, network training speed is significantly increased by reducing the fully connected layers and adding normalization layers; RGB images of arbitrary size are resized to a fixed size by linear interpolation, so that images of different sizes can all be processed effectively; the method comprises the following steps:

2) extracting image features layer by layer through convolution operations, and obtaining convolution results for multiple channels according to the convolution kernel size and stride;

2. The salient object detection method based on VggNet according to claim 1, characterized in that: in step 1), an RGB image of arbitrary size is resized to 224 × 224 by linear interpolation, the core idea being to perform one linear interpolation in each of two directions; the procedure and formulas are as follows:

Four points (x₀,y₀), (x₀,y₁), (x₁,y₀), (x₁,y₁) of the image matrix are known, and f(x₀,y₀), f(x₀,y₁), f(x₁,y₀), f(x₁,y₁) are the values at these four points, respectively;

In the formula, Z is the final interpolation result, Z₁ and Z₂ are the results calculated previously, and u is the distance along the x-axis from x₀ to the point being interpolated.

3. The salient object detection method based on VggNet according to claim 1, characterized in that: in step 2), convolution kernels of two sizes are used: the 3 × 3 kernel is a size that can capture the notions of up/down, left/right and center, gathers neighborhood information and also adjusts the data structure non-linearly; the 1 × 1 kernel serves as an auxiliary kernel, and dimensionality reduction can be achieved by controlling the number of such kernels.

4. The salient object detection method based on VggNet according to claim 1, characterized in that: in step 3), the down-sampling operation is either mean sampling or maximum sampling; mean sampling averages the feature points within a neighborhood, which reduces the increase in the variance of the estimate caused by the limited neighborhood size; maximum sampling takes the maximum of the feature points within a neighborhood, which mitigates the shift in the estimated mean caused by errors in the convolutional layer parameters.

5. The salient object detection method based on VggNet according to claim 1, characterized in that: in step 4), the specific role of batch normalization is to unify the statistical distribution of the samples, so that the mean of the input signals over all samples is close to 0, or small compared with their mean square deviation.

6. The salient object detection method based on VggNet according to claim 1, characterized in that: in step 5), deconvolution is the inverse process of the convolution operation; it takes the feature maps produced by convolution as input and computes the deconvolution result, which can be used to verify and display the feature maps obtained at each layer.

7. The salient object detection method based on VggNet according to claim 1, characterized in that: in step 6), the reason for convolving the deconvolution result with the low-level edge information is that repeated convolution blurs the edge features of the image; the copy of the low-level convolution output saved in advance compensates for this deficiency of the deconvolution result, so that the target is located more accurately.